Archives For Literature Hacking

Over on this link, you’ll find a data-driven document (D3 FTW!) showing collaboration between the most authorship-intensive institutions in astronomy. The document is a chord diagram showing the strength of collaboration between research centres, based on co-authorship of papers.

I’ve included some screenshots here to give you the idea – the one above is for worldwide institutions between 2010 and July 2012.

This diagram is for UK institutions in 2011. Links between locations on the plot are not symmetric and are coloured to show the dominant partner. The strength of each link is inversely proportional to the square root of author position, so that the 1st author counts for 1 point, the 2nd author gets 1/√2, the 3rd gets 1/√3, and so on. This way I weight toward authors higher up the list.
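
As a rough illustration of that weighting, here is a minimal Python sketch. The function names and the rule for turning author weights into directed link strengths are my assumptions for the example, not the code behind the diagram: each author adds their weight to the link from their own institution to every other institution on the paper.

```python
import math
from collections import defaultdict


def author_weight(position: int) -> float:
    """Weight for the author at a 1-based position in the author list."""
    if position < 1:
        raise ValueError("author positions start at 1")
    return 1.0 / math.sqrt(position)


def link_strengths(papers):
    """Accumulate directed link strengths between institutions.

    `papers` is a list of papers; each paper is a list of institution
    names in author order. ASSUMPTION: an author at institution A adds
    their weight to the A -> B link for every other institution B that
    appears on the same paper.
    """
    strengths = defaultdict(float)
    for institutions in papers:
        distinct = set(institutions)
        for position, source in enumerate(institutions, start=1):
            for target in distinct - {source}:
                strengths[(source, target)] += author_weight(position)
    return dict(strengths)


# Toy paper: 1st and 3rd authors at Oxford, 2nd author at Cardiff.
papers = [["Oxford", "Cardiff", "Oxford"]]
strengths = link_strengths(papers)
for (source, target), value in sorted(strengths.items()):
    print(f"{source} -> {target}: {value:.3f}")
```

Because the first author sits at Oxford, the Oxford to Cardiff link ends up stronger than the reverse, which is the kind of asymmetry the colouring is meant to show.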

The data shown is for 125,000 geocoded papers from a total of 236,000 published by MNRAS, ApJ, AJ, A&A and PASP through to July 2012 (read about the mining here). 485,000 authorships are included, out of 820,000 in total. Data was geocoded based on author affiliation and grouped using the resultant lat/longs to 3 decimal places.
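
The grouping step can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout, the function names and the example coordinates are made up, and rounding to three decimal places puts points within roughly a hundred metres of each other in the same bin.

```python
from collections import defaultdict


def location_key(lat, lon, places=3):
    """Round a lat/long pair so that nearby points share one key."""
    return (round(lat, places), round(lon, places))


def group_by_location(geocoded):
    """Group affiliation strings by their rounded coordinates.

    `geocoded` is an iterable of (affiliation, lat, lon) tuples
    (a hypothetical layout chosen for this example).
    """
    groups = defaultdict(list)
    for affiliation, lat, lon in geocoded:
        groups[location_key(lat, lon)].append(affiliation)
    return dict(groups)


# Illustrative coordinates only, not values from the real database.
geocoded = [
    ("ESO, Garching bei München, Germany", 48.2601, 11.6708),
    ("European Southern Observatory, Garching, Germany", 48.2599, 11.6712),
    ("University of Oxford, UK", 51.7598, -1.2588),
]
groups = group_by_location(geocoded)
for key, names in groups.items():
    print(key, len(names))
```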

This plot, for worldwide institutions in 2011, is shown highlighting the links between CEA Saclay and the other top publishing locations. You can play with many combinations of the data, displaying varying numbers of institutions from different parts of the world. Explore the data at

[This is the same data I used to create the collaboration maps in this previous post, and it can be found here on Google Fusion tables.]

A couple of weeks ago I began to geocode the database of astronomical research I scraped from NASA ADS during .Astronomy 4. This database consists of all the published astronomical research in five major journals (almost 250,000 papers going back decades, from MNRAS, ApJ, AJ, A&A and PASP) up to July 2012. You can read more about that here and here.

Geocoding is the process by which latitude and longitude are derived from a string of text, e.g. a street address. You use it all the time if you use Google Maps, Bing Maps, or whatever Yahoo call their maps (Yahoo Maps?). The recent débâcle over the failure of the iPhone’s mapping service mostly comes down to the fact that Apple’s geocoding capabilities are not up to scratch.

I’ve been using a Ruby Gem called Geocoder to obtain a latitude and longitude for the affiliations of authors in the astro-literature. Why? Well, I thought it would be interesting to see how astronomers around the world collaborate. The idea is that we can take those lists of co-authors and visualise how each university or research centre works with the others.

To do this I take a map of the world and every time two institutions work together on a paper I draw a link between them. I do this in R, which was fun to try out, and there is a great guide at this site here. Each single line is drawn very faintly but you can see that they quickly build up. The results are maps like this one below:

Europe, 2005

This map shows only the connections between European nations and only in 2005. Those research centres that work most with others pop out fairly easily: Paris, Edinburgh, ESO/Max-Planck in Garching, Germany. Paris (Saclay) in particular is very strongly linked with many places in Europe. My home institution of Oxford can just be picked out in the very busy UK. You can easily make out many of the areas which were less involved in the astronomy-research community in 2005: Northwestern France, Norway and much of Eastern Europe.

On this map, big collaborations dominate. If there is a paper with ten different institutions represented then all ten of those institutions will be highlighted. One big collaboration would make a fairly complex web on its own. In 2009 there was a paper published by the LIGO consortium, involving more than 700 authors from a huge range of research centres around the world. This makes the 2009 plot for Europe, and virtually anywhere else, look quite busy.

The journals I’ve picked are large but they are only English-language. That’s because of my own bias, since I wouldn’t be able to check my working if I dealt with all the major journals from all languages. Also, I have no idea what journals exist for astronomers outside of the English-speaking world and I’m aware that English is fairly dominant in astronomy worldwide. You have to take this into account when looking at the maps.

In all these plots the intensity of the colour is normalised, such that the peak strength of connection is always set to be 100% opaque and it works down from there (linearly, if you’re still following). This means that where you see relatively bright arcs all over the map, it shows you that each place is collaborating with each other fairly evenly. When you see just a few bright arcs, it shows that those places work together a lot more relative to the others.
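
In code, that linear normalisation might look something like the sketch below. It is written in Python rather than the R I used for the maps, and the dictionary layout and function name are assumptions made for the example.

```python
def normalised_opacities(strengths):
    """Scale link strengths linearly so the strongest link has opacity 1.0.

    `strengths` maps a link (any hashable key) to a non-negative number.
    """
    peak = max(strengths.values(), default=0.0)
    if peak <= 0:
        # No links, or nothing but zero-strength links: draw nothing.
        return {link: 0.0 for link in strengths}
    return {link: value / peak for link, value in strengths.items()}


# Toy strengths, invented for illustration.
opacities = normalised_opacities({("A", "B"): 4.0, ("A", "C"): 1.0})
print(opacities)
```

A map where most links sit close to the peak will look evenly bright; a map with one or two dominant links pushes everything else toward transparent.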

Here’s Europe changing slightly, over the last few years:

Europe, 2007
Europe, 2008
Europe, 2009
Europe, 2010
Europe, 2011

Let’s look now at North America. Here is the plot for 2009:

USA, 2009

You can see a clear band of strong connections between California and the North-East of the country (roughly). There are also myriad other links drawn more faintly. Honolulu and Mauna Kea are clearly highlighted, jumping out from the Pacific – and of course this is no surprise since many major telescopes are to be found there.

Now let’s see how Europe and the USA link up. These are the two hubs of English-speaking astronomy. Here’s a plot showing links between all these places in 2008.

North Atlantic, 2008

The links across the Atlantic are very strong, just as strong as those within each region, showing that an ocean between them has little effect on how the USA and Europe work together. With astronomy this doesn’t surprise me. Many of the big telescopes are in the US, for starters. But also, researchers go where the money is and will happily jump across the pond when needed. The global picture also reveals that collaborative efforts within astrophysics increase as the years go by.

World, 2011

I’ve made a set of these global and regional maps that can be found on Flickr.

The other approach is to highlight all the links to and from just one institution. Let’s take Oxford University, since it’s where I currently work:

Oxford, 2011

This is the map for 2010 and you can see that Oxford, as a major astronomy research centre, has links to a lot of places. More interestingly, you can also see how these connections weigh against each other. Oxford is no more tied to Europe than America and appears to collaborate across the UK fairly extensively.

If we use the same intensity scale and compare this to my former institution, Cardiff, in the same year:

Cardiff, 2011

then we can see that the pattern is slightly different. Cardiff is less linked in general but has stronger connections to several locations, many European. This must be in large part due to the fact that several important instruments were built here, for Herschel and Planck. The instruments on these spacecraft, and the consortia that operate them, have been the source of a great deal of collaboration in the last few years. 2010 was a notable year for publications from those instruments.

Finally there are regions I know little about, but which appear to tell their own stories when I look at the maps. Take Australia, for example:

Oceania, 2010

This map of 2010 activity down under shows Sydney as the leading collaborator in the region. It also shows that New Zealand and Australia cooperate broadly in astronomy research. In East Asia in 2010 the map shows again that there are a variety of institutions cooperating on papers, but that Tokyo appears to be a key hub in the community.

East Asia, 2010

You can find a whole array of maps on Flickr. In my next post, I’ll take a closer look at the way these institutions work with others, see which are the most collaborative in astronomy, and think about why that might be.

UPDATE: If you’re interested in exploring or downloading the data yourself, take a look at this Google Fusion table.

I (or rather my computer) spent most of this morning geocoding the database of astronomical papers that I scraped from NASA ADS a while back. I’ve got about a quarter of a million papers, covering several of the major astronomical journals (MNRAS, ApJ, A&A, PASP and AJ) back to their first publications. There are 7 million citations and 900,000 authorships in the database.

I want to geocode the affiliations listed in those authorships, in order to explore the relationships between different institutes. Geocoding is the process of finding the latitude and longitude coordinates for a place given the address. Authors of papers give their institute’s address but they write them very inconsistently. By geocoding them down to a lat/long pair it’s easier to normalise the data and get a better feel for when two affiliations are the same place. The other day I found a Ruby gem called Geocoder that does exactly what I want, and so I set about trying to avoid hitting Google’s API rate limit.

Those 900,000 authorships (individual authors on each paper) come from about 70,000 unique affiliations, of which 50,000 appear to be parseable as a potential address. Each one takes a second to be geocoded, so it could take a while to do them all. I had the sense to start with the most-affiliated addresses and work down though, so in fact I already have 230,000 of those 900,000 authorships covered.
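
The "most-affiliated first" ordering is easy to sketch. The real work was done with the Ruby Geocoder gem; the Python below is a stand-in where the `geocode` callable, the one-second pause and the fake lookup table are all assumptions for illustration.

```python
import time
from collections import Counter


def geocode_by_frequency(affiliations, geocode, delay_seconds=1.0, limit=None):
    """Geocode unique affiliation strings, most common first.

    `affiliations` has one entry per authorship. `geocode` is any callable
    that returns a (lat, lon) tuple or None; it stands in for a real
    geocoding service. The pause between calls is a crude way to respect
    a rate limit.
    """
    counts = Counter(affiliations)
    results = {}
    for address, _count in counts.most_common(limit):
        results[address] = geocode(address)
        time.sleep(delay_seconds)
    # Number of authorships covered by the successfully geocoded addresses.
    covered = sum(counts[a] for a, coords in results.items() if coords is not None)
    return results, covered


def fake_geocode(address):
    """Hypothetical lookup used only to make the example runnable."""
    table = {"ESO, Garching": (48.26, 11.671), "CalTech": (34.138, -118.125)}
    return table.get(address)


authorships = ["ESO, Garching"] * 3 + ["CalTech"] * 2 + ["Somewhere else"]
results, covered = geocode_by_frequency(
    authorships, fake_geocode, delay_seconds=0.0, limit=2
)
print(f"{covered} of {len(authorships)} authorships covered")
```

Because a small number of addresses account for a large share of authorships, stopping early still covers a useful fraction of the data.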

So far the fifteen most authorship-rich institutes are:

  1. Harvard (9576, 4.16%)
  2. Johns Hopkins University (6076, 2.64%)
  3. Cavendish Laboratory, Cambridge (5191, 2.26%)
  4. Universität Bonn (4468, 1.94%)
  5. CalTech (4442, 1.93%)
  6. Max-Planck-Institut für Astrophysik, Garching (4311, 1.87%)
  7. Tucson, Arizona (3452, 1.50%)
  8. ESO, Garching (3409, 1.48%)
  9. NASA, Goddard Space Flight Center (3342, 1.45%)
  10. Durham University (2643, 1.15%)
  11. MIT (2136, 0.93%)
  12. Paris, Observatoire (2125, 0.92%)
  13. Big Bear Solar Observatory, Pasadena (2030, 0.88%)
  14. California Univ., Berkeley (1878, 0.82%)
  15. Max Planck Institute for Astronomy, Heidelberg (1878, 0.82%)

These aren’t the ones that publish the most papers, but rather the centres that have put out the most cumulative author-credits. I’ve not normalised for date either. For all I know Harvard just published one 9,576 author paper, for example (FACT: they didn’t).

The other thing I realised as soon as they started to come in was that I can now see which research centres have the most awkward names. ESO Garching, for example, has been written in at least 15 different ways in the data I’ve gone through so far (see list at the end). It does, however, have a lot of papers, so you’d expect variations to arise.

Another inconsistently named centre is the California Institute of Technology in Pasadena. With several sub-departments and multiple ways to write its name, it appears to have more than 44 variations in the way it is credited!

If we consider only the locations with more than 5 address variants, and normalise to the total number of author-credits, we get the following top-ten list of institutions with inconsistently written affiliations. These are the institutions where the number of different names is highest compared to the number of times the institution appears in total.

  1. University of California Berkeley, USA
  2. CalTech, USA
  3. INAF – IASF Bologna, Italy
  4. Instituto de Astrofísica de Andalucía, CSIC, Granada, Spain
  5. Department of Astronomy, Kyoto University, Japan
  6. Universität Bonn, Germany
  7. Instituut voor Sterrenkunde, Leuven, Belgium
  8. Universitäts-Sternwarte München, Germany
  9. Department of Applied Mathematics, The University of Leeds, UK
  10. Yale University, USA
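
A ranking like the one above can be produced with a few lines of Python. This is a sketch under assumptions: the input layout and function name are mine, and the numbers in the example are invented placeholders rather than counts from the database.

```python
def inconsistency_ranking(institutions, min_variants=6, top=10):
    """Rank institutions by name variants per author-credit.

    `institutions` maps a label to (variant_count, total_credits).
    Only institutions with more than 5 variants (min_variants=6) are kept.
    """
    ratios = {
        name: variants / credits
        for name, (variants, credits) in institutions.items()
        if variants >= min_variants and credits > 0
    }
    return sorted(ratios, key=ratios.get, reverse=True)[:top]


# Invented placeholder numbers, for illustration only.
toy = {
    "Institute A": (44, 4000),  # many variants, but many credits too
    "Institute B": (12, 300),   # fewer variants, far fewer credits
    "Institute C": (3, 50),     # dropped: 5 or fewer variants
}
ranking = inconsistency_ranking(toy)
print(ranking)
```

Normalising this way means a small department with a dozen spellings can outrank a large institute with many more.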

Should this be a worry? I suppose many research centres have ‘defined’ names that everyone should be using, but no one is doing so (or at least no one is checking). I know that here in Oxford there is everyday discrepancy between ‘Oxford University’ and the ‘University of Oxford’ and that is just the start of these things.

In the world of big data, geographical information is very important. Big data is also often reliant on the once-typed or written words of human beings. If academic researchers cannot credit their institutes consistently then presumably no one is typing the addresses of many places correctly. Perhaps research papers should be encoded with co-ordinates? Either way, geocoding is a very important tool in an era of big, personal data.

Once I have more of the ADS data geocoded, there is more that can be done here.


List of variations for affiliation credits to ESO, Garching:

  • ESO, Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, 85748 Garching bei München, Germany
  • European Southern Observatory, D-85748 Garching bei München, Germany
  • European Southern Observatory, Garching bei München, Germany
  • European Southern Observatory, Karl Schwarzschild Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching b. München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching, Germany
  • European Southern Observatory, Karl-Schwarzschild-str. 2, 85748, Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, D-85748 Garching bei Munchen, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschildstr. 2, 85748 Garching bei München, Germany

If you’ve been following the recent series of posts about my data mining, then a) I apologise and b) it just got better!

The short story is that research in astrophysics is generally made available online and is entirely available, in digital form, all the way back to the beginning of the refereed journals on the topic in 1827. I have downloaded a lot of the data and have been mining it for my own interest (mostly on the bus).

This week I expanded the database, so that it now includes the five main journals for astronomy: MNRAS, ApJ, A&A, AJ and PASP. If you think I’m missing something important, please tweet me. I also decided it was time to grab the data regarding authorship, meaning I now have the list of authors, and their affiliations, for each paper since 1827.

Incidentally, the 1827 papers from MNRAS include Charles Babbage amongst the authors, discussing log tables. My favourite paper title from that first year is Observations of eclipses of Jupiter’s satellites by Col. Beaufoy’s telescope. There are a lot of Colonels, Majors and other ranks listed in those days.

Another notable fact from that time is that just about every paper was written by one person. People worked alone, and the society was a chance to gather and share findings. This is in stark contrast to today when astronomers generally work in groups and publish as such, together.

Authorship in Astronomy

The above plot shows the average number of authors per paper since 1827. You can see the trend is not subtle. Around 1960 the value begins climbing very quickly, and accelerates. Here’s the same plot on a log scale and showing the maximum number of authors on any paper from that year – another indicator of group sizes in general.

The size of astronomical collaborations is growing fast. In 2011 a group of 770 people co-authored a paper, Search for Gravitational Wave Bursts from Six Magnetars, in ApJ. The same collaboration published the 668-author Searches for Gravitational Waves from Known Pulsars with Science Run 5 LIGO Data a year earlier. One has to question the concept of ‘authorship’ when considered in this way, and also the value of citations for these authorships.

In case you were wondering, the large group of co-authors in 1857 is due to an occultation of Jupiter by the Moon that year. The event was observed from all over the UK, and coordinated by the Astronomer Royal into one large paper.

A better way to understand the changing way we publish might be the plot above. Here we see the percentage of total papers written by 1, 2, 3, 4 or more authors. You can see that single-author papers dominated for most of the 20th Century. Around 1960 we see the decline begin, as 2- and 3-author papers begin to become a significant chunk of the whole. In 1978, 2-author papers become more prevalent than single-author papers.

In the 1990s single-authorships continue to decline, and multiple-authorships in general are in the ascendancy. The distribution flattens out, and by 2012 2-, 3- and 4-author papers each make up about 15% of the literature (single-authorships are down to 6%), and the largest contribution now comes from papers with 5–9 authors. Groups of 10 or more are clearly on the rise too.
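
For anyone wanting to reproduce this kind of breakdown, here is a minimal Python sketch of the bucketing for a single year. The bucket boundaries follow the groups named in the text; the function names and the toy input are assumptions for the example.

```python
from collections import Counter


def bucket(n_authors):
    """Label a paper by the size of its author list."""
    if n_authors < 1:
        raise ValueError("a paper needs at least one author")
    if n_authors <= 4:
        return str(n_authors)
    return "5-9" if n_authors <= 9 else "10+"


def bucket_shares(author_counts):
    """Percentage of papers in each author-count bucket.

    `author_counts` holds the number of authors on each paper in one year.
    """
    counts = Counter(bucket(n) for n in author_counts)
    total = sum(counts.values())
    return {label: 100.0 * count / total for label, count in counts.items()}


# Toy year with five papers, invented for illustration.
shares = bucket_shares([1, 2, 2, 7, 12])
print(shares)
```

Running this once per publication year, and stacking the results, gives the kind of chart described above.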

If we plot the same chart but in terms of citations, rather than just publications, we get the above. The trends are much the same, but the overall influence of single-author papers declines harder, and slightly faster, after the 1960s. Notably, papers with 5 or more authors appear to be cited more often, relative to their publication rate, perhaps reflecting the fact that big surveys and cutting-edge instrumentation require putting a lot of heads together, and that such efforts are beneficial to the community.

The Population

If we take all the names on the papers, de-duplicate, and count them up we get a crude measure of the population of working research astronomers. It’s crude because it doesn’t take into account the fact that multiple people can have the same name, and nor does it notice changes in spelling or initials. So at present the code doesn’t know that Simpson R.J. and Simpson R. might be the same person. I am also not using affiliation information at this time, because the purpose here is just to get a feel for the trends. It would take a lot longer to collect everyone up and cluster all their various names together.
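
To make the crudeness concrete, here is a small Python sketch of the counting. The (year, name) layout and the per-year grouping are assumptions made for the example; the point is that names are compared as plain strings, so two spellings of one person count twice.

```python
from collections import defaultdict


def population_by_year(authorships):
    """Count distinct author-name strings per year.

    `authorships` is an iterable of (year, name) pairs. Names are compared
    exactly, so 'Simpson R.J.' and 'Simpson R.' are treated as two people.
    """
    names = defaultdict(set)
    for year, name in authorships:
        names[year].add(name.strip())
    return {year: len(unique) for year, unique in sorted(names.items())}


authorships = [
    (2011, "Simpson R.J."),
    (2011, "Simpson R."),
    (2011, "Simpson R.J."),  # same string again: de-duplicated
    (2012, "Simpson R."),
]
population = population_by_year(authorships)
print(population)
```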

So the population of the research community also changes around 1960 – which is no surprise really as this is when publishing in general begins to boom (see my first post on all of this) and when MNRAS, ApJ and A&A all begin the trend of publishing more year-on-year. So let’s compare this to the number of papers to make it more meaningful.

Here we see that people begin to outpace papers in the 1960s, meaning what exactly? Well I suppose it must be related to the first plot, in that we’re publishing in larger groups. It may reflect the fact that as we get more technical as a field, and more specialised, it takes more people to write the same number of papers? This seems like a reasonable idea.

Here we see the ratio of people to papers in terms of papers, per member of the research population. This is similar to the first plot, but accounts for people publishing on more than one paper.

With more papers being published, and more people taking part, I had always assumed that people published more work collectively, and that the communications network allowed expertise to be deployed where it was useful. However, it seems that we need more people to achieve the same amount of work that we did in the 1950s. This doesn’t feel right, and I’ve replotted it a few times, but it seems to be the case.

Just for fun, I took a list of molecules you can find in space and made a word matrix from it. The result shows the relationship between molecular species and their occurrence in the astronomical literature.

There is a nice cluster of Hydrogen-bearing molecules that seem to correlate well, same for Carbon. I don’t even know about all the Potassium compounds on here, but they seem to correlate with lots of stuff. I may have to pester some real astrochemistry friends about this one…

[Direct Link]

As it turns out, I’m using my bus journeys to create more lists once in a while. If anyone has suggestions for lists, please tweet me @orbitingfrog.

I have been exploring the terms used in the astronomical literature (see previous post), and have turned my attention to terms that seem to correlate with each other in astronomy publications. I thought it would at least be interesting to see how often one word is mentioned alongside another.

To do this I take terms and generate three lists from the database.

  • A count of papers including term 1
  • A count of papers including term 2
  • A count of papers including both term 1 and term 2

I can then say that when term 1 is mentioned, term 2 is also mentioned X% of the time. This is different from saying that when term 2 is mentioned, term 1 is also mentioned Y% of the time.

For example, ‘M87’ is mentioned in the abstracts/titles of 516 papers in my database*, and ‘elliptical’ is mentioned in 5075 papers. They are both mentioned in 130 papers. So 25% of ‘M87’ papers mention ‘elliptical’, but only about 2.6% of ‘elliptical’ papers mention ‘M87’.
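
The three counts and the two percentages can be sketched in Python as below. The function names are assumptions, and the naive case-insensitive substring match is only a stand-in for however the terms are really matched in the database. The final lines reuse the M87 and ‘elliptical’ counts quoted above.

```python
def count_mentions(texts, term_a, term_b):
    """Count texts mentioning term_a, term_b, and both.

    Uses a simple case-insensitive substring test (an assumption made
    for this sketch).
    """
    a = b = both = 0
    for text in texts:
        lowered = text.lower()
        has_a = term_a.lower() in lowered
        has_b = term_b.lower() in lowered
        a += has_a
        b += has_b
        both += has_a and has_b
    return a, b, both


def percentages(count_a, count_b, count_both):
    """Return (% of A papers mentioning B, % of B papers mentioning A)."""
    a_to_b = 100.0 * count_both / count_a if count_a else 0.0
    b_to_a = 100.0 * count_both / count_b if count_b else 0.0
    return a_to_b, b_to_a


m87_to_elliptical, elliptical_to_m87 = percentages(516, 5075, 130)
print(f"{m87_to_elliptical:.1f}% of 'M87' papers mention 'elliptical'")
print(f"{elliptical_to_m87:.1f}% of 'elliptical' papers mention 'M87'")
```

The asymmetry falls straight out of dividing the same overlap by two very different totals.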

To visualise this we can take several terms and create a grid. In the following example, I had some help from Brooke Simmons, who conjured up a list of AGN-related terms. The opacity of each square represents the strength of the correlation between the terms. You read the term on the left first and the square tells you how often it occurs with the term on the top. All images link to the interactive version.

From this plot we can also see that ‘NGC 1068’ always appears with the term ‘NGC’ – which it really ought to – whereas the reverse is not the case, because there are many NGC objects in the sky.

The colours are used to categorise the objects. In this plot, orange shows specific object names (M87, NGC 1068 and IC 2497) and from this we can see that M87 correlates well with occurrences of ‘jet’, ‘elliptical’, ‘x-ray’, ‘radio’, ‘NGC’, and ‘black hole’. None of those matches should surprise astronomers, and nor will the fact that M87 doesn’t correlate well with ‘spiral’, ‘quasar’ or ‘IC 2497’.

Here’s another matrix, this time showing star-formation terms:

I think that it’s hard to see some of the relationships here so I changed the opacity scaling. Here the opacity scales from 0.0 to 1.0 as the correlation scales from 0.0 to 0.5 – basically any pair that appears together more than half the time is now shown as a solid block. I prefer this scaling and so it is what I’ve used on the interactive plots too.
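
That rescaling amounts to a clamp. A minimal Python sketch, with a function name and a `saturation` parameter that are my own labels for the example:

```python
def opacity(correlation, saturation=0.5):
    """Map a correlation in [0, 1] to an opacity in [0, 1].

    Opacity rises linearly and reaches 1.0 once the correlation hits
    `saturation`; anything above that is drawn as a solid block.
    """
    return min(max(correlation / saturation, 0.0), 1.0)


for value in (0.0, 0.1, 0.25, 0.5, 0.8):
    print(value, opacity(value))
```

Setting `saturation=1.0` recovers the original one-to-one scaling used in the first matrix.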

Here you can see that ‘dust’ and ‘cluster’ are very relevant topics. You can also see the facilities/instruments used in star formation (green on this chart). SCUBA has spent a lot of time looking at Orion and Perseus, and is well matched with Spitzer and the JCMT (which is good since SCUBA is a camera on the JCMT!).

The star-formation regions (blue) all show different characteristics through their different correlations. The Pipe Nebula is dusty, clustered and well-studied by Spitzer and IRAS. The Polaris Flare is dusty, possibly prestellar and well-studied by IRAS and Herschel.

The idea behind this exercise was not to create new science, but to see if this technique even works at all. It does seem to produce plots that show expected relationships. None of this is very exact, but it may provide a way to generate new hypotheses. If you can relate terms to each other which are not generally thought to be related, then there is something interesting there.

These correlations are all direct, in that these words appear together in the titles and abstracts of papers. Perhaps more revealing will be terms that are correlated through a third term. To begin that analysis means mining every term and comparing it with every other term. This is computationally more intensive, but I’ll blog more about it next week once I’ve let some code run for a while 🙂

Meanwhile, I’ve created a set of interesting matrices of terms. The two examples above are included along with a cross-match of all the Messier objects (categorised by type), planets and moons (categorised by type), and all the constellations (categorised by hemisphere). Suggestions for new word sets are very welcome. Tweet me @orbitingfrog!

* Which contains all the papers from MNRAS, A&A and ApJ since 1827.

At .Astronomy 4 in Heidelberg, I began hacking on some natural language processing of the astronomical literature as part of my Hack Day project with Sarah Kendrew and Karen Masters. It began as a version of BrainScanr for astronomy – which it can still become – however it also provides an interesting database to explore, filled with the terms used in astronomy and how they change over time. 

Basically, I’m grabbing and processing (thanks to Alasdair Allan) the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS). The records for each of these publications go back some way. MNRAS has been published since 1827, and ADS has digested the entire collection. Here’s how many papers these three journals have published since 1827, per year*.

Here’s the data in a cumulative form:

It may be more useful to see this plot with a log(10) y-axis:

Since 1960-ish, the rate of publication in astronomy appears to follow a power law, and publishing in its current form cannot usefully continue at this rate of acceleration. If it did, we’d be pushing out 1,000 articles a week by 2040 – my arXiv RSS is already unreadably busy.

With the above plots in mind, we can see how individual topics or terms are represented in the literature. Here is the plot for papers mentioning the phrase ‘dark matter’.

It rapidly becomes hard to understand these numbers in the context of the whole corpus of the three journals, so here is the same data, but represented as a fraction of the collective body of literature that year, i.e. it shows what fraction of the literature was concerned with dark matter in any given year.
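
The per-year fraction is simple to compute. Here is a minimal Python sketch, assuming each paper is available as a (year, text) pair where the text is the title plus abstract; the function name, the naive substring match and the toy records are assumptions for illustration.

```python
from collections import Counter


def yearly_fraction(papers, phrase):
    """Fraction of each year's papers whose text mentions `phrase`.

    `papers` is an iterable of (year, text) pairs. Matching is a
    case-insensitive substring test.
    """
    totals, hits = Counter(), Counter()
    needle = phrase.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy records, invented for illustration.
papers = [
    (1978, "Dark matter in spiral galaxies"),
    (1978, "Observations of a comet"),
    (2011, "Dark Matter haloes at high redshift"),
    (2011, "Constraints on dark matter annihilation"),
]
fractions = yearly_fraction(papers, "dark matter")
print(fractions)
```

Dividing by each year’s total is what makes the curves comparable across decades with very different publication rates.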

In 1978 there are two papers using the phrase ‘dark matter’ in their title or abstract. By 2011 a significant chunk of all papers in these journals are talking about it! It looks like ApJ was more concerned with dark matter than A&A or MNRAS, until recently. Looking at a non-cumulative version we see that this does appear to be the case. Since 2000, MNRAS has been steadily publishing more often about dark matter, and ApJ has been publishing less often.

But how does dark matter compare with other terms? Considering the entire corpus, let’s compare the term ‘dark matter’ with a few other, related terms: ‘cosmology’, ‘big bang’, ‘dark energy’ and ‘wmap’. You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition.

Here are some more plots, just for fun. Here is the relative popularity of the planets Earth, Jupiter and Saturn (the three most popular, in fact). You can see that we’re more obsessed with Earth than ever, and that Jupiter is getting more interesting:

Here are terms related to exoplanets. You can see that the use of the term ‘exoplanet’ overtook ‘extrasolar planet’ in about 2007:

Here are some famous space telescopes. I like that Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the CMB in successive decades, with Planck still on the rise.

It is worth mentioning again that I’m currently only using data from three journals. I intend to add more, such as PASP, and to add supplements and letters. The more data I add, the slower my code will become and I need to refactor it before I let the database get bigger.

My next step is looking at correlations between terms in the literature, to find known, and possibly unknown, relationships between words. I’ll also try to put this all online into some usable form soon.

* All the charts on this post are linked to the interactive version.