Mining the Astronomical Literature

At .Astronomy 4 in Heidelberg, I began hacking on some natural language processing of the astronomical literature as part of my Hack Day project with Sarah Kendrew and Karen Masters. It began as a version of BrainScanr for astronomy (which it can still become), but it also provides an interesting database to explore, filled with the terms used in astronomy and how they change over time.

Basically, I’m grabbing and processing (thanks to Alasdair Allan) the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS). The records for each of these publications go back some way. MNRAS has been published since 1827, and ADS has digested the entire collection. Here’s how many papers these three journals have published since 1827, per year*.

Here’s the data in a cumulative form:

It may be more useful to see this plot with a log(10) y-axis:

Since 1960-ish, the rate of publication in astronomy appears to follow a power law, and publishing in its current form cannot usefully continue at this rate of acceleration. If it did, we’d be pushing out 1,000 articles a week by 2040, and my arXiv RSS feed is already unreadably busy.

With the above plots in mind, we can see how individual topics or terms are represented in the literature. Here is the plot for papers mentioning the phrase ‘dark matter’.

It rapidly becomes hard to understand these numbers in the context of the whole corpus of the three journals, so here is the same data, but represented as a fraction of the collective body of literature that year, i.e. it shows what fraction of the literature was concerned with dark matter in any given year.
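As a rough illustration of that normalisation, here is a minimal Python sketch. It is not the code behind these plots: the `(year, title, abstract)` record layout, the `yearly_fraction` name and the sample records are all made up for the example, and it uses plain case-insensitive substring matching.

```python
from collections import Counter

def yearly_fraction(papers, phrase):
    """Return {year: fraction of that year's papers mentioning phrase}.

    `papers` is an iterable of (year, title, abstract) tuples; this layout
    is an assumption for the sketch, not the real ADS record format.
    """
    phrase = phrase.lower()
    totals = Counter()  # papers published per year
    hits = Counter()    # papers per year whose title or abstract contains the phrase
    for year, title, abstract in papers:
        totals[year] += 1
        if phrase in f"{title} {abstract}".lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Made-up records, purely for illustration (not real ADS data).
papers = [
    (1978, "Dark matter in spiral galaxies", "..."),
    (1978, "Stellar winds", "Mass loss in hot stars."),
    (2011, "Halo profiles", "We model dark matter haloes."),
    (2011, "Dark Matter annihilation", ""),
]

print(yearly_fraction(papers, "dark matter"))
```

Substring matching is crude (it would count ‘dark matter’ inside a longer unrelated phrase, and miss hyphenated variants), so a real version would probably want some tokenising first.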

In 1978 there are two papers using the phrase ‘dark matter’ in their title or abstract. By 2011 a significant chunk of all papers in these journals are talking about it! It looks like ApJ was more concerned with dark matter than A&A or MNRAS, until recently. Looking at a non-cumulative version we see that this does appear to be the case. Since 2000, MNRAS has been steadily publishing more often about dark matter, and ApJ has been publishing less often.

But how does dark matter compare with other terms? Considering the entire corpus, let’s compare the term ‘dark matter’ with a few other, related terms: ‘cosmology’, ‘big bang’, ‘dark energy’ and ‘wmap’. You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition.

Here are some more plots, just for fun. Here is the relative popularity of the planets Earth, Jupiter and Saturn (the three most popular, in fact). You can see that we’re more obsessed with Earth than ever, and that Jupiter is getting more interesting:

Here are terms related to exoplanets. You can see that the use of the term ‘exoplanet’ overtook ‘extrasolar planet’ in about 2007:

Here are some famous space telescopes. I like that Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the CMB in successive decades, with Planck still on the rise.

It is worth mentioning again that I’m currently only using data from three journals. I intend to add more, such as PASP, and to add supplements and letters. The more data I add, the slower my code will become and I need to refactor it before I let the database get bigger.

My next step is looking at correlations between terms in the literature, to find known, and possibly unknown, relationships between words. I’ll also try to put this all online into some usable form soon.
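One simple way such a term relationship could be measured is the Jaccard overlap between the sets of papers mentioning each term. The sketch below is only an assumption about how this might be approached, not the planned method; the `jaccard` helper and the sample texts are made up for the example.

```python
def jaccard(texts, term_a, term_b):
    """Jaccard overlap of the sets of texts mentioning each term.

    `texts` is a list of strings (e.g. title plus abstract). Returns a value
    between 0.0 (never co-occur) and 1.0 (always appear together).
    """
    term_a, term_b = term_a.lower(), term_b.lower()
    with_a = {i for i, text in enumerate(texts) if term_a in text.lower()}
    with_b = {i for i, text in enumerate(texts) if term_b in text.lower()}
    union = with_a | with_b
    # If neither term appears anywhere, define the overlap as 0.0.
    return len(with_a & with_b) / len(union) if union else 0.0

# Made-up texts, purely for illustration (not real ADS data).
texts = [
    "dark matter haloes and cosmology",
    "cosmology with wmap",
    "dark matter in dwarf galaxies",
    "stellar winds",
]

print(jaccard(texts, "dark matter", "cosmology"))
```

Running this over every pair of terms would give a matrix of overlaps, where high values for pairs not usually thought of together might be the interesting ones.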

* All the charts on this post are linked to the interactive version.

