Archives For data mining

Over on this link, you’ll find a data-driven document (D3 FTW!) showing collaboration between the most authorship-intensive institutions in astronomy. The document is a chord diagram showing the strength of collaboration between research centres, based on co-authorship of papers.

I’ve included some screenshots here to give you the idea – the one above is for worldwide institutions between 2010 and July 2012.

This diagram is for UK institutions in 2011. Links between locations on the plot are not symmetric and are coloured to show the dominant partner. The strength of each link is inversely proportional to the square root of author position, so that the 1st author counts for 1 point, the 2nd author gets 1/√2, the 3rd gets 1/√3, and so on. This way I weight toward authors higher up on the list.
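As a rough illustration of that weighting, here is a minimal Python sketch. The 1/√position rule is the one described above; the function names and the idea of summing weights per institution for a single paper are assumptions made for the example, not necessarily how the diagram's link strengths were actually built.

```python
import math
from collections import defaultdict


def author_weight(position: int) -> float:
    """Weight for the author at a 1-based position: 1 / sqrt(position)."""
    if position < 1:
        raise ValueError("author positions start at 1")
    return 1.0 / math.sqrt(position)


def institution_scores(affiliations):
    """Sum author weights per institution for one paper.

    `affiliations` lists one institution per author, in author order.
    Summing per institution is an illustrative assumption.
    """
    scores = defaultdict(float)
    for position, institution in enumerate(affiliations, start=1):
        scores[institution] += author_weight(position)
    return dict(scores)


# Hypothetical four-author paper.
paper = ["Oxford", "Cardiff", "Oxford", "CEA Saclay"]
scores = institution_scores(paper)
for institution, score in sorted(scores.items()):
    print(f"{institution}: {score:.3f}")
```

In this made-up paper Oxford holds the 1st and 3rd author slots, so it scores 1 + 1/√3, while the 4th author's institution scores 1/√4 = 0.5.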

The data shown is for 125,000 geocoded papers from a total of 236,000 published by MNRAS, ApJ, AJ, A&A and PASP through to July 2012 (read about the mining here). 485,000 authorships are included, out of 820,000 in total. Data was geocoded based on author affiliation and grouped using the resultant lat/longs to 3 decimal places.
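The grouping step can be sketched in a few lines of Python. This is only an illustration of rounding coordinates to 3 decimal places and using the result as a key; the record layout, names and coordinates below are invented for the example.

```python
from collections import defaultdict


def location_key(lat: float, lng: float, places: int = 3):
    """Round a lat/long pair so that nearby points share one key."""
    return (round(lat, places), round(lng, places))


def group_by_location(records):
    """Group (name, lat, lng) records by their rounded coordinates."""
    groups = defaultdict(list)
    for name, lat, lng in records:
        groups[location_key(lat, lng)].append(name)
    return dict(groups)


# Hypothetical geocoded affiliations (coordinates are illustrative only).
records = [
    ("Dept of Physics, Univ. of Oxford", 51.75981, -1.25642),
    ("Oxford Astrophysics", 51.76012, -1.25611),
    ("Cardiff University", 51.48660, -3.17890),
]
groups = group_by_location(records)
for key, names in groups.items():
    print(key, names)
```

Three decimal places of latitude is roughly a hundred metres, so differently worded affiliations that geocode to the same campus tend to land in the same group.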

This plot, for worldwide institutions in 2011, highlights the links between CEA Saclay and the other top publishing locations. You can play with many combinations of the data, displaying varying numbers of institutions from different parts of the world. Explore the data at

[This is the same data I used to create collaboration maps in this previous post, and can be found here on Google Fusion tables.]

A couple of weeks ago I began to geocode the database of astronomical research I scraped from NASA ADS during .Astronomy 4. This database consists of all the published astronomical research in five major journals (almost 250,000 papers going back decades, from MNRAS, ApJ, AJ, A&A and PASP) up to July 2012. You can read more about that here and here.

Geocoding is the process by which latitude and longitude are derived from a string of text, e.g. a street address. You use it all the time if you use Google Maps, Bing Maps, or whatever Yahoo call their maps (Yahoo Maps?). The recent débâcle over the failure of the iPhone’s mapping service mostly comes down to the fact that Apple’s geocoding capabilities are not up to scratch.

I’ve been using a Ruby Gem called Geocoder to obtain a latitude and longitude for the affiliations of authors in the astro-literature. Why? Well, I thought it would be interesting to see how astronomers around the world collaborate. The idea is that we can take those lists of co-authors and visualise how each university or research centre works with the others.

To do this I take a map of the world and every time two institutions work together on a paper I draw a link between them. I do this in R, which was fun to try out, and there is a great guide at this site here. Each single line is drawn very faintly but you can see that they quickly build up. The results are maps like this one below:

Europe, 2005

This map shows only the connections between European nations and only in 2005. Those research centres that work most with others pop out fairly easily: Paris, Edinburgh, ESO/Max-Planck in Garching, Germany. Paris (Saclay) in particular is very strongly linked with many places in Europe. My home institution of Oxford can just be picked out in the very busy UK. You can easily make out many of the areas which were less involved in the astronomy-research community in 2005: Northwestern France, Norway and much of Eastern Europe.

On this map, big collaborations dominate. If there is a paper with ten different institutions represented then all ten of those institutions will be highlighted. One big collaboration would make a fairly complex web on its own. In 2009 there was a paper published by the LIGO consortium, involving more than 700 authors from a huge range of research centres around the world. This makes the 2009 plot for Europe, and virtually anywhere else, look quite busy.
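To see why big collaborations dominate, it helps to count the links a single paper produces. The maps themselves were drawn in R; the following is a separate Python sketch of the counting idea only, assuming one link per unordered pair of distinct institutions on a paper. The function names and the toy data are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_links(papers):
    """Count how often each pair of institutions shares a paper.

    `papers` is an iterable of institution lists, one list per paper.
    Each unordered pair of distinct institutions counts once per paper
    (an assumption made for this sketch).
    """
    links = Counter()
    for institutions in papers:
        unique = sorted(set(institutions))
        for a, b in combinations(unique, 2):
            links[(a, b)] += 1
    return links


def links_for_paper(n_institutions: int) -> int:
    """Number of pairwise links from one paper: n * (n - 1) / 2."""
    return n_institutions * (n_institutions - 1) // 2


papers = [["Oxford", "Cardiff", "Paris"], ["Cardiff", "Oxford"]]
links = count_links(papers)
print(links[("Cardiff", "Oxford")])
print(links_for_paper(10))
```

Because the number of pairs grows with the square of the number of institutions, a ten-institution paper already contributes 45 links, and a consortium paper with hundreds of institutions contributes tens of thousands.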

The journals I’ve picked are large but they are only English-language. That’s because of my own bias, since I wouldn’t be able to check my working if I dealt with all the major journals from all languages. Also, I have no idea what journals exist for astronomers outside of the English-speaking world and I’m aware that English is fairly dominant in astronomy worldwide. You have to take this into account when looking at the maps.

In all these plots the intensity of the colour is normalised, such that the peak strength of connection is always set to be 100% opaque and it works down from there (linearly, if you’re still following). This means that where you see relatively bright arcs all over the map, it shows you that each place is collaborating with each other fairly evenly. When you see just a few bright arcs, it shows that those places work together a lot more relative to the others.
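For anyone still following, the linear normalisation can be written down in a few lines. This is a Python sketch rather than the R code used for the maps, and the dictionary of link counts is a made-up input format.

```python
def link_opacities(link_counts):
    """Scale link strengths linearly so the strongest link has opacity 1.0."""
    if not link_counts:
        return {}
    peak = max(link_counts.values())
    if peak <= 0:
        # No real connections: draw everything fully transparent.
        return {pair: 0.0 for pair in link_counts}
    return {pair: count / peak for pair, count in link_counts.items()}


# Hypothetical link counts for one map.
counts = {("Paris", "Garching"): 40, ("Oxford", "Paris"): 10, ("Oxford", "Cardiff"): 4}
opacities = link_opacities(counts)
for pair, alpha in opacities.items():
    print(pair, alpha)
```

Because every map is scaled to its own peak, opacities are comparable within one map but not between maps unless the same peak is used for both.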

Here’s Europe changing slightly, over the last few years:

Europe, 2007

Europe, 2008

Europe, 2009

Europe, 2010

Europe, 2011

Let’s look now at North America. Here is the plot for 2009:

USA, 2009

You can see a clear band of strong connections between California and the North-East of the country (roughly). There are also myriad other links drawn more faintly. Honolulu and Mauna Kea are clearly highlighted, jumping out from the Pacific, and of course this is no surprise since many major telescopes are to be found there.

Now let’s see how Europe and the USA link up. These are the two hubs of English-speaking astronomy. Here’s a plot showing links between all these places in 2008.

North Atlantic, 2008

The links across the Atlantic are very intense, just as strong as those within each continent, showing that the ocean between them has little effect on how the USA and Europe work together. With astronomy this doesn’t surprise me. Many of the big telescopes are in the US for starters. But also, researchers go where the money is and will happily jump across the pond when needed. The global picture also reveals that collaborative efforts within astrophysics increase as the years go by.

World, 2011

I’ve made a set of these global and regional maps that can be found on Flickr.

The other approach is to highlight all the links to and from just one institution. Let’s take Oxford University, since it’s where I currently work:

Oxford, 2011

This is the map for 2010 and you can see that Oxford, as a major astronomy research centre, has links to a lot of places. More interestingly, you can also see how these connections weigh against each other. Oxford is no more tied to Europe than America and appears to collaborate across the UK fairly extensively.

If we use the same intensity scale and compare this to my former institution, Cardiff, in the same year:

Cardiff, 2011

then we can see that the pattern is slightly different. Cardiff is less linked in general but has stronger connections to several locations, many European. This must be in large part due to the fact that several important instruments were built here, for Herschel and Planck. The instruments on these spacecraft, and the consortia that operate them, have been the source of a great deal of collaboration in the last few years. 2010 was a notable year for publications from those instruments.

Finally there are regions I know little about, but which appear to tell their own stories when I look at the maps. Take Australia, for example:

Oceania, 2010

This map of 2010 activity down under shows Sydney as the leading collaborator in the region. It also shows that New Zealand and Australia cooperate broadly in astronomy research. In East Asia in 2010 the map shows again that there are a variety of institutions cooperating on papers, but that Tokyo appears to be a key hub in the community.

East Asia, 2010

You can find a whole array of maps on Flickr. In my next post, I’ll take a closer look at the way these institutions work with others, see who are the most collaborative in astronomy, and think about why that might be.

UPDATE: If you’re interested in exploring or downloading the data yourself, take a look at this Google Fusion table.

I have been exploring the terms used in the astronomical literature (see previous post), and have turned my attention to terms that seem to correlate with each other in astronomy publications. I thought it would at least be interesting to see how often one word is mentioned alongside another.

To do this I take terms and generate three lists from the database.

  • A count of papers including term 1
  • A count of papers including term 2
  • A count of papers including both term 1 and term 2

I can then say that when term 1 is mentioned, term 2 is also mentioned X% of the time. This is different from saying that when term 2 is mentioned, term 1 is also mentioned Y% of the time.

For example, ‘M87’ is mentioned in the abstracts/titles of 516 papers in my database*, and ‘elliptical’ is mentioned in 5075 papers. They are both mentioned in 130 papers. So 25% of ‘M87’ papers mention ‘elliptical’ (130/516), but only about 2.6% of ‘elliptical’ papers mention ‘M87’ (130/5075).
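The three counts and the two percentages can be sketched in Python as follows. Matching terms by case-insensitive substring search is an assumption made for this example (the real database query may well differ), and the function names and sample texts are hypothetical. The M87 and ‘elliptical’ figures are the ones quoted above.

```python
def term_counts(texts, term1, term2):
    """Return (papers with term1, papers with term2, papers with both).

    Uses simple case-insensitive substring matching, which is an
    assumption for this sketch.
    """
    t1, t2 = term1.lower(), term2.lower()
    n1 = n2 = both = 0
    for text in texts:
        lowered = text.lower()
        has1 = t1 in lowered
        has2 = t2 in lowered
        n1 += has1
        n2 += has2
        both += has1 and has2
    return n1, n2, both


def cooccurrence_fraction(count_term, count_both):
    """Fraction of papers mentioning a term that also mention the other."""
    if count_term == 0:
        return 0.0
    return count_both / count_term


# Tiny made-up corpus, just to show the counting.
texts = [
    "The jet of M87, a giant elliptical galaxy",
    "Elliptical galaxies in clusters",
    "Spiral structure in NGC 1068",
]
n1, n2, both = term_counts(texts, "M87", "elliptical")

# Counts quoted in the post.
m87, elliptical, shared = 516, 5075, 130
x = cooccurrence_fraction(m87, shared)         # 'M87' papers mentioning 'elliptical'
y = cooccurrence_fraction(elliptical, shared)  # 'elliptical' papers mentioning 'M87'
print(f"{x:.1%} vs {y:.1%}")
```

The two fractions share a numerator but have different denominators, which is why the relationship is not symmetric.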

To visualise this we can take several terms and create a grid. In the following example, I had some help from Brooke Simmons, who conjured up a list of AGN-related terms. The opacity of each square represents the strength of the correlation between the terms. You read the term on the left first and the square tells you how often it occurs with the term on the top. All images link to the interactive version.

From this plot we can also see that ‘NGC 1068’ always appears with the term ‘NGC’ – which it really ought to – whereas the reverse is not the case, because there are many NGC objects in the sky.

The colours are used to categorise the objects. In this plot, orange shows specific object names (M87, NGC 1068 and IC 2497) and from this we can see that M87 correlates well with occurrences of ‘jet’, ‘elliptical’, ‘x-ray’, ‘radio’, ‘NGC’, and ‘black hole’. None of those matches should surprise astronomers, and nor will the fact that M87 doesn’t correlate well with ‘spiral’, ‘quasar’ or ‘IC 2497’.

Here’s another matrix, this time showing star-formation terms:

I think that it’s hard to see some of the relationships here so I changed the opacity scaling. Here the opacity scales from 0.0 to 1.0 as the correlation scales from 0.0 to 0.5, so any pair of terms that appear together more than half the time is now shown as a solid block. I prefer this scaling and so it is what I’ve used on the interactive plots too.
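In code, this rescaling is a linear stretch followed by a clip. Here is a minimal Python sketch; the function name and the `saturation` parameter are made up for the example.

```python
def opacity(correlation: float, saturation: float = 0.5) -> float:
    """Map a correlation in [0, 1] to an opacity in [0, 1].

    The opacity rises linearly and reaches 1.0 at `saturation`; anything
    above that is drawn as a solid block.
    """
    if saturation <= 0:
        raise ValueError("saturation must be positive")
    return min(max(correlation / saturation, 0.0), 1.0)


for value in (0.0, 0.1, 0.25, 0.5, 0.8):
    print(value, opacity(value))
```

Setting `saturation=1.0` gives back the original scaling, where only terms that always appear together are fully opaque.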

Here you can see that ‘dust’ and ‘cluster’ are very relevant topics. You can also see the facilities/instruments used in star formation (green on this chart). SCUBA has spent a lot of time looking at Orion and Perseus, and is well matched with Spitzer and the JCMT (which is good since SCUBA is a camera on the JCMT!).

The star-formation regions (blue) all show different characteristics through their different correlations. The Pipe Nebula is dusty, clustered and well-studied by Spitzer and IRAS. The Polaris Flare is dusty, possibly prestellar and well-studied by IRAS and Herschel.

The idea behind this exercise was not to create new science, but to see if this technique even works at all. It does seem to produce plots that show expected relationships. None of this is very exact, but it may provide a way to generate new hypotheses. If you can relate terms to each other that are not generally thought to be related, then there is something interesting there.

These correlations are all direct, in that these words appear together in the titles and abstracts of papers. Perhaps more revealing will be terms that are correlated through a third term. To begin that analysis means mining every term and comparing it with every other term. This is computationally more intensive, but I’ll blog more about it next week once I’ve let some code run for a while 🙂

Meanwhile, I’ve created a set of interesting matrices of terms. The two examples above are included along with a cross-match of all the Messier objects (categorised by type), planets and moons (categorised by type), and all the constellations (categorised by hemisphere). Suggestions for new word sets are very welcome. Tweet me @orbitingfrog!

* Which contains all the papers from MNRAS, A&A and ApJ since 1827.