More Astronomy Data Mining: It’s Word-Matrix Time!

July 27, 2012

I have been exploring the terms used in the astronomical literature (see previous post), and have turned my attention to terms that seem to correlate with each other in astronomy publications. I thought it would at least be interesting to see how often one word is mentioned alongside another.

To do this I take terms and generate three lists from the database.

  • A count of papers including term 1
  • A count of papers including term 2
  • A count of papers including both term 1 and term 2

I can then say that when term 1 is mentioned, term 2 is also mentioned X% of the time. This is different from saying that when term 2 is mentioned, term 1 is also mentioned Y% of the time.

For example, ‘M87’ is mentioned in the abstracts/titles of 516 papers in my database*, and ‘elliptical’ is mentioned in 5075 papers. They are both mentioned in 130 papers. So 25% of ‘M87’ papers mention ‘elliptical’, but only about 3% of ‘elliptical’ papers mention ‘M87’.
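The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration using the three counts quoted above; the function name `cooccurrence_fraction` is made up for this sketch and is not the code behind the plots.

```python
def cooccurrence_fraction(n_term, n_both):
    """Fraction of the papers mentioning one term that also mention the other."""
    if n_term == 0:
        return 0.0  # avoid dividing by zero for a term that never appears
    return n_both / n_term

# Counts quoted in the text above
n_m87 = 516          # papers mentioning 'M87'
n_elliptical = 5075  # papers mentioning 'elliptical'
n_both = 130         # papers mentioning both

# The same overlap gives two very different fractions,
# depending on which term you start from.
m87_to_elliptical = cooccurrence_fraction(n_m87, n_both)
elliptical_to_m87 = cooccurrence_fraction(n_elliptical, n_both)

print(f"'M87' papers mentioning 'elliptical': {m87_to_elliptical:.1%}")
print(f"'elliptical' papers mentioning 'M87': {elliptical_to_m87:.1%}")
```

The point of keeping the denominator explicit is that the measure is asymmetric: a rare term can be strongly tied to a common one without the reverse being true.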

To visualise this we can take several terms and create a grid. In the following example, I had some help from Brooke Simmons, who conjured up a list of AGN-related terms. The opacity of each square represents the strength of the correlation between the terms. You read the term on the left first and the square tells you how often it occurs with the term on the top. All images link to the interactive version.

From this plot we can also see that ‘NGC 1068’ always appears with the term ‘NGC’ – which it really ought to – whereas the reverse is not the case, because there are many NGC objects in the sky.
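To make the row-then-column reading concrete, here is a small sketch that builds such a grid from a handful of abstracts. The abstracts are invented for illustration and are not from the real database, and the simple case-insensitive substring match is an assumption about how terms are matched, not a description of the actual code.

```python
# Made-up abstracts for illustration only; not taken from the real database.
abstracts = [
    "The jet of M87, a giant elliptical galaxy",
    "X-ray observations of the Seyfert galaxy NGC 1068",
    "Radio emission from NGC 4261 and its jet",
    "NGC 1068 and the obscuring torus",
]
terms = ["M87", "NGC 1068", "NGC", "jet"]

def mentions(text, term):
    # Assumed matching rule: case-insensitive substring search.
    return term.lower() in text.lower()

def cooccurrence_matrix(docs, terms):
    """matrix[row][col] = fraction of docs mentioning `row` that also mention `col`."""
    matrix = {}
    for row in terms:
        row_docs = [d for d in docs if mentions(d, row)]
        matrix[row] = {}
        for col in terms:
            n_both = sum(1 for d in row_docs if mentions(d, col))
            matrix[row][col] = n_both / len(row_docs) if row_docs else 0.0
    return matrix

matrix = cooccurrence_matrix(abstracts, terms)

# Read the row term first, then the column term.
print(matrix["NGC 1068"]["NGC"])  # every 'NGC 1068' abstract contains 'NGC'
print(matrix["NGC"]["NGC 1068"])  # but not every 'NGC' abstract is about NGC 1068
```

With substring matching, ‘NGC 1068’ can never appear without ‘NGC’, which is why that square is always solid while its mirror image is not.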

The colours are used to categorise the objects. In this plot, orange shows specific object names (M87, NGC 1068 and IC 2497) and from this we can see that M87 correlates well with occurrences of ‘jet’, ‘elliptical’, ‘x-ray’, ‘radio’, ‘NGC’, and ‘black hole’. None of those matches should surprise astronomers, nor will the fact that M87 doesn’t correlate well with ‘spiral’, ‘quasar’ or ‘IC 2497’.

Here’s another matrix, this time showing star-formation terms:

I think that it’s hard to see some of the relationships here so I changed the opacity scaling. Here the opacity scales from 0.0 to 1.0 as the correlation scales from 0.0 to 0.5 – basically any pair that appears together more than half the time is now shown as a solid block. I prefer this scaling and so it is what I’ve used on the interactive plots too.
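That rescaling amounts to a linear ramp that saturates at a correlation of 0.5. A minimal sketch, where the function name `opacity` and the `saturate_at` parameter are hypothetical and chosen only for this example:

```python
def opacity(correlation, saturate_at=0.5):
    """Map a co-occurrence fraction to an opacity between 0.0 and 1.0.

    The opacity rises linearly from 0.0 at correlation 0.0 to 1.0 at
    `saturate_at`; anything above that is drawn as a solid block.
    """
    if saturate_at <= 0:
        raise ValueError("saturate_at must be positive")
    return max(0.0, min(1.0, correlation / saturate_at))

# With the rescaling, a pair seen together a quarter of the time is half opaque,
# and anything at or above one half is solid.
for c in (0.0, 0.25, 0.5, 0.8):
    print(c, opacity(c))

# Setting saturate_at=1.0 gives the unscaled version used in the first plot.
print(opacity(0.25, saturate_at=1.0))
```

The trade-off is that all pairs above the threshold become indistinguishable, in exchange for more contrast among the weaker relationships.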

Here you can see that ‘dust’ and ‘cluster’ are very relevant topics. You can also see the facilities/instruments used in star formation studies (green on this chart). SCUBA has spent a lot of time looking at Orion and Perseus, and is well matched with Spitzer and the JCMT (which is good since SCUBA is a camera on the JCMT!).

The star-formation regions (blue) all show different characteristics through their different correlations. The Pipe Nebula is dusty, clustered and well-studied by Spitzer and IRAS. The Polaris Flare is dusty, possibly prestellar and well-studied by IRAS and Herschel.

The idea behind this exercise was not to create new science, but to see if this technique even works at all. It does seem to produce plots that show expected relationships. None of this is very exact, but it may provide a way to generate new hypotheses. If you can relate terms to each other that are not generally thought to be related, then there is something interesting there.

These correlations are all direct, in that these words appear together in the titles and abstracts of papers. Perhaps more revealing will be terms that are correlated through a third term. To begin that analysis means mining every term and comparing it with every other term. This is computationally more intensive, but I’ll blog more about it next week once I’ve let some code run for a while :)
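As a rough idea of what comparing every term with every other term involves, the sketch below counts all co-occurring pairs in a single pass over the papers. It assumes each paper has already been reduced to a set of terms (that step is not shown), the three example papers are made up, and this is not necessarily how the full analysis will be run.

```python
from collections import Counter
from itertools import combinations

def pair_counts(term_sets):
    """Count papers per term and papers per (term_a, term_b) pair.

    Pairs are stored with the two terms in alphabetical order.
    """
    term_count = Counter()
    both_count = Counter()
    for terms in term_sets:
        unique = sorted(set(terms))
        term_count.update(unique)
        # Every pair of distinct terms in this paper co-occurs once.
        both_count.update(combinations(unique, 2))
    return term_count, both_count

# Made-up papers, each already reduced to a set of terms (an assumption).
papers = [
    {"dust", "cluster", "orion"},
    {"dust", "orion"},
    {"cluster", "scuba"},
]
term_count, both_count = pair_counts(papers)

# Fraction of 'dust' papers that also mention 'orion'
print(both_count[("dust", "orion")] / term_count["dust"])
```

The number of pairs per paper grows with the square of the number of terms in it, which is where the extra computational cost comes from.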

Meanwhile, I’ve created a set of interesting matrices of terms. The two examples above are included along with a cross-match of all the Messier objects (categorised by type), planets and moons (categorised by type), and all the constellations (categorised by hemisphere). Suggestions for new word sets are very welcome. Tweet me @orbitingfrog!

* Which contains all the papers from MNRAS, A&A and ApJ since 1827.
