Archives For D3

Over on this link, you’ll find a data-driven document (D3 FTW!) showing collaboration between the most authorship-intensive institutions in astronomy. The document is a chord diagram showing the strength of collaboration between research centres, based on co-authorship of papers.

I’ve included some screenshots here to give you the idea – the one above is for worldwide institutions between 2010 and July 2012.

This diagram is for UK institutions in 2011. Links between locations on the plot are not symmetric and are coloured to show the dominant partner. Each authorship contributes to the strength of a link with a weight inversely proportional to the square root of the author's position, so that the 1st author counts for 1 point, the 2nd author gets 1/√2, the 3rd gets 1/√3, etc. This way I weight toward authors higher up the list.
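As a rough illustration of that weighting, here is a minimal Python sketch. The function names, the institution labels and the example author list are all made up for illustration, and the code behind the diagram may well accumulate the scores differently.

```python
import math
from collections import defaultdict


def authorship_weight(position):
    """Weight for an author at a 1-based position in the author list: 1/sqrt(position)."""
    if position < 1:
        raise ValueError("author position starts at 1")
    return 1.0 / math.sqrt(position)


def institution_scores(affiliations):
    """Sum the authorship weights per institution for a single paper.

    `affiliations` lists one institution per author, in author order
    (a hypothetical input format, not the real database layout).
    """
    scores = defaultdict(float)
    for position, institution in enumerate(affiliations, start=1):
        scores[institution] += authorship_weight(position)
    return dict(scores)


# Hypothetical four-author paper: the 1st and 4th authors share an institution.
paper = ["Institute A", "Institute B", "Institute C", "Institute A"]
scores = institution_scores(paper)
for institution, score in scores.items():
    print(f"{institution}: {score:.3f}")
```

In this made-up paper the first author contributes 1 point and the fourth contributes 0.5, so Institute A collects 1.5 points in total.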

The data shown is for 125,000 geocoded papers from a total of 236,000 published by MNRAS, ApJ, AJ, A&A and PASP through to July 2012 (read about the mining here). 485,000 authorships are included, out of 820,000 in total. Data was geocoded based on author affiliation and grouped using the resultant lat/longs to 3 decimal places.
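Grouping on rounded coordinates might look something like the sketch below. The records, the coordinates and the function name are invented for illustration and do not come from the real dataset. Three decimal places of a degree is on the order of 100 m in latitude, so nearby geocoding results for one site tend to share a key.

```python
from collections import defaultdict


def location_key(lat, lon, places=3):
    """Round a coordinate pair so that nearby geocoding results share one key."""
    return (round(lat, places), round(lon, places))


# Hypothetical geocoded authorships: (affiliation string, latitude, longitude).
authorships = [
    ("Dept. of Physics, Example University", 51.75942, -1.25641),
    ("Example University Astrophysics", 51.75938, -1.25644),
    ("Another Institute", 48.72430, 2.14890),
]

# Collect the affiliation strings that land on the same rounded location.
groups = defaultdict(list)
for affiliation, lat, lon in authorships:
    groups[location_key(lat, lon)].append(affiliation)

for key, names in groups.items():
    print(key, names)
```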

This plot, for worldwide institutions in 2011, highlights links between CEA Saclay and the other top publishing locations. You can play with many combinations of the data, displaying varying numbers of institutions from different parts of the world. Explore the data at

[This is the same data I used to create collaboration maps in this previous post, and can be found here on Google Fusion Tables.]

I have been exploring the terms used in the astronomical literature (see previous post), and have turned my attention to terms that seem to correlate with each other in astronomy publications. I thought it would at least be interesting to see how often one word is mentioned alongside another.

To do this I take two terms and generate three counts from the database.

  • A count of papers including term 1
  • A count of papers including term 2
  • A count of papers including both term 1 and term 2

I can then say that when term 1 is mentioned, term 2 is also mentioned X% of the time. This is different from saying that when term 2 is mentioned, term 1 is also mentioned Y% of the time.

For example, ‘M87’ is mentioned in the abstracts/titles of 516 papers in my database*, and ‘elliptical’ is mentioned in 5075 papers. They are both mentioned in 130 papers. So 25% of ‘M87’ papers mention ‘elliptical’, but only about 2.6% of ‘elliptical’ papers mention ‘M87’.
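The arithmetic behind those two percentages is just a pair of divisions. Here is a minimal Python sketch using the three counts quoted above; the function name `cooccurrence` is my own invention for illustration.

```python
def cooccurrence(count_a, count_b, count_both):
    """Return (share of A papers that mention B, share of B papers that mention A)."""
    return count_both / count_a, count_both / count_b


# Counts quoted in the text for 'M87', 'elliptical', and papers mentioning both.
m87_to_elliptical, elliptical_to_m87 = cooccurrence(516, 5075, 130)
print(f"'M87' papers that mention 'elliptical': {m87_to_elliptical:.1%}")
print(f"'elliptical' papers that mention 'M87': {elliptical_to_m87:.1%}")
```

The same numerator divided by two very different denominators is what makes the relationship asymmetric.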

To visualise this we can take several terms and create a grid. In the following example, I had some help from Brooke Simmons, who conjured up a list of AGN-related terms. The opacity of each square represents the strength of the correlation between the terms. You read the term on the left first, and the square tells you how often it occurs with the term along the top. All images link to the interactive version.

From this plot we can also see that ‘NGC 1068’ always appears with the term ‘NGC’ – which it really ought to – whereas the reverse is not the case, because there are many NGC objects in the sky.

The colours are used to categorise the objects. In this plot, orange shows specific object names (M87, NGC 1068 and IC 2497), and from this we can see that M87 correlates well with occurrences of ‘jet’, ‘elliptical’, ‘x-ray’, ‘radio’, ‘NGC’, and ‘black hole’. None of those matches should surprise astronomers, and nor will the fact that M87 doesn’t correlate well with ‘spiral’, ‘quasar’ or ‘IC 2497’.
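For anyone wanting to build a grid like this, a minimal Python sketch follows. It assumes you already have, for each term, the set of IDs of the papers that mention it. The paper IDs and the function name are hypothetical and are not the database or code behind these plots.

```python
def correlation_grid(papers_by_term):
    """Build grid[row][col]: the fraction of papers mentioning `row` that also mention `col`."""
    grid = {}
    for row, row_papers in papers_by_term.items():
        grid[row] = {}
        for col, col_papers in papers_by_term.items():
            if row_papers:
                # Papers mentioning both terms, divided by papers mentioning the row term.
                grid[row][col] = len(row_papers & col_papers) / len(row_papers)
            else:
                grid[row][col] = 0.0
    return grid


# Hypothetical paper IDs per term, purely for illustration.
papers_by_term = {
    "NGC 1068": {1, 2},
    "NGC": {1, 2, 3, 4, 5, 6, 7, 8},
    "jet": {2, 8, 9},
}

grid = correlation_grid(papers_by_term)
for row, cells in grid.items():
    print(row, {col: round(value, 2) for col, value in cells.items()})
```

Reading row first and column second, the grid is not symmetric: in this toy data every ‘NGC 1068’ paper also mentions ‘NGC’, but only a quarter of the ‘NGC’ papers mention ‘NGC 1068’.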

Here’s another matrix, this time showing star-formation terms:

I think that it’s hard to see some of the relationships here so I changed the opacity scaling. Here the opacity scales from 0.0 to 1.0 as the correlation scales from 0.0 to 0.5 – basically any pair that appears together more than half the time is now shown as a solid block. I prefer this scaling and so it is what I’ve used on the interactive plots too.
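That rescaling can be written as a one-line clamp. The sketch below assumes the ramp between 0.0 and 0.5 is linear, which the description suggests but does not state outright. The function and parameter names are made up for illustration.

```python
def opacity(correlation, saturation=0.5):
    """Map a correlation in [0, 1] to an opacity in [0, 1].

    Assumes a linear ramp that reaches full opacity at `saturation`;
    anything at or above that value is drawn as a solid block.
    """
    return min(max(correlation / saturation, 0.0), 1.0)


for value in (0.0, 0.1, 0.25, 0.5, 0.8):
    print(f"correlation {value:.2f} -> opacity {opacity(value):.2f}")
```

Setting `saturation=1.0` gives the unscaled version, where opacity simply equals the correlation.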

Here you can see that ‘dust’ and ‘cluster’ are very relevant topics. You can also see the facilities/instruments used in star formation (green on this chart). SCUBA has spent a lot of time looking at Orion and Perseus, and is well matched with Spitzer and the JCMT (which is good since SCUBA is a camera on the JCMT!).

The star-formation regions (blue) all show different characteristics through their different correlations. The Pipe Nebula is dusty, clustered and well-studied by Spitzer and IRAS. The Polaris Flare is dusty, possibly prestellar and well-studied by IRAS and Herschel.

The idea behind this exercise was not to create new science, but to see if this technique even works at all. It does seem to produce plots that show expected relationships. None of this is very exact, but it may provide a way to generate new hypotheses. If you can relate terms that are not generally thought to be related, then there is something interesting there.

These correlations are all direct, in that these words appear together in the titles and abstracts of papers. Perhaps more revealing will be terms that are correlated through a third term. To begin that analysis means mining every term and comparing it with every other term. This is computationally more intensive, but I’ll blog more about it next week once I’ve let some code run for a while 🙂

Meanwhile, I’ve created a set of interesting matrices of terms. The two examples above are included along with a cross-match of all the Messier objects (categorised by type), planets and moons (categorised by type), and all the constellations (categorised by hemisphere). Suggestions for new word sets are very welcome. Tweet me @orbitingfrog!

* Which contains all the papers from MNRAS, A&A and ApJ since 1827.