
I have been exploring the terms used in the astronomical literature (see previous post), and have turned my attention to terms that seem to correlate with each other in astronomy publications. I thought it would at least be interesting to see how often one word is mentioned alongside another.

To do this I take terms and generate three lists from the database.

  • A count of papers including term 1
  • A count of papers including term 2
  • A count of papers including both term 1 and term 2

I can then say that when term 1 is mentioned, term 2 is also mentioned X% of the time. This is different from saying that when term 2 is mentioned, term 1 is also mentioned Y% of the time.

For example, ‘M87’ is mentioned in the abstracts/titles of 516 papers in my database*, and ‘elliptical’ is mentioned in 5075 papers. They are both mentioned in 130 papers. So 25% of ‘M87’ papers mention ‘elliptical’, but fewer than 3% of ‘elliptical’ papers mention ‘M87’.
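The arithmetic behind those two percentages can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not the code I actually ran; the function name `cooccurrence_percentages` is made up for this example, and the three counts are the M87/elliptical numbers quoted in the text.

```python
def cooccurrence_percentages(count_a, count_b, count_both):
    """Return (% of A papers that mention B, % of B papers that mention A).

    Hypothetical helper for illustration. The three arguments are the
    three counts listed above: papers with term A, papers with term B,
    and papers with both.
    """
    if count_both > min(count_a, count_b):
        raise ValueError("the overlap cannot exceed either individual count")
    # Guard against a term that never appears, to avoid dividing by zero.
    a_with_b = 100.0 * count_both / count_a if count_a else 0.0
    b_with_a = 100.0 * count_both / count_b if count_b else 0.0
    return a_with_b, b_with_a


# Counts quoted in the text: 'M87' = 516, 'elliptical' = 5075, both = 130.
m87_to_elliptical, elliptical_to_m87 = cooccurrence_percentages(516, 5075, 130)
print(f"{m87_to_elliptical:.1f}% of 'M87' papers mention 'elliptical'")
print(f"{elliptical_to_m87:.1f}% of 'elliptical' papers mention 'M87'")
```

The same overlap of 130 papers gives two very different percentages because it is divided by two very different totals, which is why each grid has to be read in one direction only.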

To visualise this we can take several terms and create a grid. In the following example, I had some help from Brooke Simmons, who conjured up a list of AGN-related terms. The opacity of each square represents the strength of the correlation between the terms. You read the term on the left first and the square tells you how often it occurs with the term on the top. All images link to the interactive version.

From this plot we can also see that ‘NGC 1068’ always appears with the term ‘NGC’ – which it really ought to – whereas the reverse is not the case, because there are many NGC objects in the sky.

The colours are used to categorise the objects. In this plot, orange shows specific object names (M87, NGC 1068 and IC 2497) and from this we can see that M87 correlates well with occurrences of ‘jet’, ‘elliptical’, ‘x-ray’, ‘radio’, ‘NGC’, and ‘black hole’. None of those matches should surprise astronomers, and nor will the fact that M87 doesn’t correlate well with ‘spiral’, ‘quasar’ or ‘IC 2497’.

Here’s another matrix, this time showing star-formation terms:

I think that it’s hard to see some of the relationships here so I changed the opacity scaling. Here the opacity scales from 0.0 to 1.0 as the correlation scales from 0.0 to 0.5 – basically any pair that appears together more than half the time is now shown as a solid block. I prefer this scaling and so it is what I’ve used on the interactive plots too.
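As a rough sketch of that rescaling, assuming a simple linear ramp that is clipped at the top (the function name `opacity_for` and the `saturation` parameter are invented here for illustration, and the interactive plots may do this differently):

```python
def opacity_for(correlation, saturation=0.5):
    """Map a co-occurrence fraction (0.0 to 1.0) to an opacity (0.0 to 1.0).

    Hypothetical helper: the opacity rises linearly until the fraction
    reaches `saturation`, and stays fully solid above that.
    """
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be a fraction between 0 and 1")
    return min(correlation / saturation, 1.0)


for fraction in (0.0, 0.1, 0.25, 0.5, 0.9):
    print(f"fraction {fraction:.2f} -> opacity {opacity_for(fraction):.2f}")
```

Setting `saturation=1.0` gives back the original scaling used in the first grid, where only terms that always appear together produce a solid square.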

Here you can see that ‘dust’ and ‘cluster’ are very relevant topics. You can also see the facilities/instruments used in star-formation studies (green on this chart). SCUBA has spent a lot of time looking at Orion and Perseus, and is well matched with Spitzer and the JCMT (which is good since SCUBA is a camera on the JCMT!).

The star-formation regions (blue) all show different characteristics through their different correlations. The Pipe Nebula is dusty, clustered and well-studied by Spitzer and IRAS. The Polaris Flare is dusty, possibly prestellar and well-studied by IRAS and Herschel.

The idea behind this exercise was not to create new science, but to see if this technique even works at all. It does seem to produce plots that show expected relationships. None of this is very exact, but it may provide a way to generate new hypotheses. If you can relate terms to each other that are not generally thought to be related, then there is something interesting there.

These correlations are all direct, in that these words appear together in the titles and abstracts of papers. Perhaps more revealing will be terms that are correlated through a third term. To begin that analysis means mining every term and comparing it with every other term. This is computationally more intensive, but I’ll blog more about it next week once I’ve let some code run for a while 🙂

Meanwhile, I’ve created a set of interesting matrices of terms. The two examples above are included along with a cross-match of all the Messier objects (categorised by type), planets and moons (categorised by type), and all the constellations (categorised by hemisphere). Suggestions for new word sets are very welcome. Tweet me @orbitingfrog!

* Which contains all the papers from MNRAS, A&A and ApJ since 1827.

At .Astronomy 4 in Heidelberg, I began hacking on some natural language processing of the astronomical literature as part of my Hack Day project with Sarah Kendrew and Karen Masters. It began as a version of BrainScanr for astronomy – which it can still become – but it also provides an interesting database to explore, filled with the terms used in astronomy and how they change over time.

Basically, I’m grabbing and processing (thanks to Alasdair Allan) the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS). The records for each of these publications go back some way. MNRAS has been published since 1827, and ADS has digested the entire collection. Here’s how many papers these three journals have published since 1827, per year*.

Here’s the data in a cumulative form:

It may be more useful to see this plot with a log(10) y-axis:

Since 1960-ish, the rate of publication in astronomy appears to follow a power law, and publishing in its current form cannot usefully continue at this rate of acceleration. If it did we’d be pushing out 1,000 articles a week by 2040 – my arXiv RSS is already unreadably busy.

With the above plots in mind, we can see how individual topics or terms are represented in the literature. Here is the plot for papers mentioning the phrase ‘dark matter’.

It rapidly becomes hard to understand these numbers in the context of the whole corpus of the three journals, so here is the same data, but represented as a fraction of the collective body of literature that year, i.e. it shows what fraction of the literature was concerned with dark matter in any given year.
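A minimal sketch of that normalisation, assuming the per-year counts are already available as two dictionaries keyed by year. The function name `fraction_per_year` and all of the numbers below are placeholders for illustration, not values from my database.

```python
def fraction_per_year(term_counts, total_counts):
    """Return {year: fraction of that year's papers that mention the term}.

    Hypothetical helper. `term_counts` maps year -> papers mentioning the
    term; `total_counts` maps year -> all papers published that year.
    Years with no papers at all are skipped to avoid dividing by zero.
    """
    fractions = {}
    for year, total in sorted(total_counts.items()):
        if total > 0:
            fractions[year] = term_counts.get(year, 0) / total
    return fractions


# Placeholder numbers, for illustration only.
term_counts = {2000: 30, 2001: 45}
total_counts = {1999: 900, 2000: 1000, 2001: 1200}

for year, fraction in fraction_per_year(term_counts, total_counts).items():
    print(f"{year}: {100 * fraction:.2f}% of papers")
```

Dividing by the yearly total is what lets a term from the 1970s be compared fairly with one from the 2000s, when far more papers were being published overall.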

In 1978 there are two papers using the phrase ‘dark matter’ in their title or abstract. By 2011 a significant chunk of all papers in these journals are talking about it! It looks like ApJ was more concerned with dark matter than A&A or MNRAS, until recently. Looking at a non-cumulative version we see that this does appear to be the case. Since 2000, MNRAS has been steadily publishing more often about dark matter, and ApJ has been publishing less often.

But how does dark matter compare with other terms? Considering the entire corpus, let’s compare the term ‘dark matter’ with a few other, related terms: ‘cosmology’, ‘big bang’, ‘dark energy’ and ‘wmap’. You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition.

Here are some more plots, just for fun. Here is the relative popularity of the planets Earth, Jupiter and Saturn (the three most popular, in fact). You can see that we’re more obsessed with Earth than ever, and that Jupiter is getting more interesting:

Here are terms related to exoplanets. You can see that the use of the term ‘exoplanet’ overtook ‘extrasolar planet’ in about 2007:

Here are some famous space telescopes. I like that Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the CMB in successive decades, with Planck still on the rise.

It is worth mentioning again that I’m currently only using data from three journals. I intend to add more, such as PASP, and to add supplements and letters. The more data I add, the slower my code will become and I need to refactor it before I let the database get bigger.

My next step is looking at correlations between terms in the literature, to find known, and possibly unknown, relationships between words. I’ll also try to put this all online into some usable form soon.

* All the charts on this post are linked to the interactive version.