

Just over three years ago the Zooniverse launched the Milky Way Project (MWP), my first citizen science project. I have been leading the development and science of the MWP ever since. 50,000 volunteers have taken part from all over the world, and they’ve helped us do real science, including creating astronomy’s largest catalogue of infrared bubbles – which is pretty cool.

Today the original Milky Way Project (MWP) is complete. It took about three years and users have drawn more than 1,000,000 bubbles and several million other objects, including star clusters, green knots, and galaxies. It’s been a huge success but: there’s even more data! So it is with glee that we have announced the brand new Milky Way Project! It’s got more data, more objects to find, and it’s even more gorgeous.


This second incarnation of my favourite Zooniverse project[1] has been an utterly different experience for me. Three years ago I had only recently learned how to build Ruby on Rails apps and had squirrelled myself away for hours carefully crafting the look and feel for my as-yet-unnamed citizen science project. I knew that it had to live up to the standards of Galaxy Zoo in both form and function – and that it had to produce science eventually.

Building and launching at that time was simpler in one sense (it was just me and Arfon that did most of the coding[2]) but so much harder as I was referring to the Rails manual constantly and learning Amazon Web Services on the fly. This week I have had the help of a team of experts at Zooniverse Chicago, who I normally collectively refer to as the development team. They have helped me by designing and building the website and also by integrating it seamlessly into the now buzzing Zooniverse infrastructure. The result has been an easier, smoother process with a far superior end result. I’ve essentially acted more like a consultant scientist, with a specification and requirements. I’ve still gotten my hands dirty (as you can see in the open source Milky Way Project GitHub repo) but I’ve managed to actually keep doing everything else I now do day-to-day at the Zooniverse. It’s been a fantastic experience to see personally how far we’ve come as an organisation.

The new MWP is being launched to include data from different regions of the galaxy in a new infrared wavelength combination. The new data consists of Spitzer/IRAC images from two surveys: Vela-Carina, which is essentially an extension of GLIMPSE covering Galactic longitudes 255°–295°, and GLIMPSE 3D, which extends GLIMPSE 1+2 to higher Galactic latitudes (at selected longitudes only). The images combine 3.6, 4.5, and 8.0 µm in the “classic” Spitzer/IRAC color scheme[3]. There are roughly 40,000 images to go through.


An EGO (or two) sitting in the dust near a young star cluster

The latest Zooniverse tech and design is being brought to bear on this big data problem. We are using our newest features to retire images with nothing in them (as determined by the volunteers of course) and to give more screen time to those parts of the galaxy where there are lots of pillars, bubbles and clusters – as well as other things. We’re marking more objects – bow shocks, pillars, EGOs – and getting rid of some older ones that either aren’t visible in the new data or weren’t as scientifically useful as we’d hoped (specifically: red fuzzies and green knots).

It’s very exciting! I’d highly recommend that you go now(!) and start classifying – we need your help to map and measure our galaxy.


[1] It’s like choosing between your children

[2] Arfon may recall my resistance to unit tests

[3] Classic to very geeky infrared astronomers

ESA’s Planck mission reported results today showing the Cosmic Microwave Background (CMB, see below) in greater detail than ever before.


Planck achieves this amazing view of the earliest light in the Universe by combining and cleverly cross-matching data across a combination of 9 different frequencies, ranging from 30-857 GHz. In this way they can remove foreground emissions and effectively strip away the content of the whole Universe, to reveal the faint CMB that lies behind it. It’s amazing work.

To accompany the announcement, Planck have released a Chromoscope-based version of their full data set here. This site shows all 9 bands (plus a composite image and the visible sky for reference) and lets you slide between them, exploring the different structures found at different wavelengths.



You can rearrange the different bands and turn on useful markers like constellations and known microwave sky features. It’s just great!

Also! There is an option to view the data in terms of the content – or components – of the Universe. You can see that version here. You can switch between these views using the options box on the left hand side.

In this version of the site you’re able to see the different structures that contribute to the overall Planck sky image. This is how you can really start to understand what Planck is seeing and how we need to ‘extract’ the foreground emission from the data. In this view you can look at the dust, the emission purely from Carbon Monoxide (a common molecule at these wavelengths), the CMB itself and the low-frequency emission from elsewhere (such as astronomical radio sources).



Cardiff’s Chris North has put this site together (you can find him on Twitter @chrisenorth) and it was Chris, along with Stuart Lowe and me, who first put Chromoscope together many moons ago now. I can’t take much credit for Chromoscope really but it’s fantastic to see it put to use here.

This is the wonderful blend of open science and public engagement that I love, and that astronomy is getting better at in general. What Planck are doing here is making the data freely available in a form that is digestible to the enthusiastic non-specialist.

This sort of ‘outreach’ is enabled by the modern web’s ability to make beautiful websites relatively painless to build and cheap to host. It’s also possible because we have people, like Chris North, who know about both the science and the web. Being comfortable on the Internet and ‘getting’ the web are so important today for anyone that wants to engage people with data and science.

So, go explore! You can zoom right in on the data and even do so in 9 frequencies. There is a lot to come from Planck – as scientists get to work pumping out papers using these data – so this site will be a handy reference in the future. It’s also awesome: did I mention that?


Over on this link, you’ll find a data-driven document (D3 FTW!) showing collaboration between the most authorship-intensive institutions in astronomy. The document is a chord diagram showing the strength of collaboration between research centres, based on co-authorship of papers.

I’ve included some screenshots here to give you the idea – the one above is for worldwide institutions between 2010 and July 2012.

This diagram is for UK institutions in 2011. Links between locations on the plot are not symmetric and are coloured to show the dominant partner. The strength of each link is inversely proportional to the square root of author position, so that the 1st author counts for 1 point, the 2nd author gets 1/√2, the 3rd gets 1/√3, etc. This way I weight toward authors higher up on the list.
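
As a rough illustration of that weighting, here is a minimal Python sketch. The function name `author_weight` is hypothetical, and the sketch covers only the per-author weight; how the weights are summed into link strengths for the diagram is not shown here.

```python
import math


def author_weight(position: int) -> float:
    """Weight for the author at a 1-based position in the author list."""
    if position < 1:
        raise ValueError("author positions start at 1")
    # 1st author -> 1, 2nd -> 1/sqrt(2), 3rd -> 1/sqrt(3), and so on
    return 1.0 / math.sqrt(position)


# Weights for the first four authors on a paper
weights = [author_weight(p) for p in (1, 2, 3, 4)]
```

With this weighting, the fourth author on a paper counts for exactly half as much as the first.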

The data shown is for 125,000 geocoded papers from a total of 236,000 published by MNRAS, ApJ, AJ, A&A and PASP through to July 2012 (read about the mining here). 485,000 authorships are included, out of 820,000 in total. Data was geocoded based on author affiliation and grouped using the resultant lat/longs to 3 decimal places.
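
The grouping step could look something like the Python sketch below. It assumes each record is a hypothetical `(affiliation, lat, lon)` tuple. The coordinates in the example are illustrative values, not taken from the real dataset.

```python
from collections import defaultdict


def group_by_location(records, places=3):
    """Group affiliation strings whose coordinates match after rounding."""
    groups = defaultdict(list)
    for name, lat, lon in records:
        # Rounding to 3 decimal places merges points that are roughly
        # within a hundred metres of each other
        key = (round(lat, places), round(lon, places))
        groups[key].append(name)
    return dict(groups)


# Illustrative coordinates only (assumed, not from the real data)
example = [
    ("ESO, Garching bei München, Germany", 48.25961, 11.67089),
    ("European Southern Observatory, Garching, Germany", 48.26012, 11.67113),
    ("University of Oxford, UK", 51.75983, -1.25645),
]
grouped = group_by_location(example)
```

Here the two differently written ESO affiliations end up under the same key, while Oxford gets a key of its own.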

This plot, for worldwide institutions in 2011, is shown highlighting links between CEA Saclay and the other top publishing locations. You can play with many combinations of the data, displaying varying numbers of institutions from different parts of the world. Explore the data at

[This is the same data I used to create collaboration maps in this previous post, and can be found here on Google Fusion tables.]

A couple of weeks ago I began to geocode the database of astronomical research I scraped from NASA ADS during .Astronomy 4. This database consists of all the published astronomical research in five major journals (almost 250,000 papers going back decades, from MNRAS, ApJ, AJ, A&A and PASP) up to July 2012. You can read more about that here and here.

Geocoding is the process by which latitude and longitude are derived from a string of text, e.g. a street address. You use it all the time if you use Google Maps, Bing Maps, or whatever Yahoo call their maps (Yahoo Maps?). The recent débâcle over the failure of the iPhone’s mapping service mostly comes down to the fact that Apple’s geocoding capabilities are not up to scratch.

I’ve been using a Ruby Gem called Geocoder to obtain a latitude and longitude for the affiliations of authors in the astro-literature. Why? Well, I thought it would be interesting to see how astronomers around the world collaborate. The idea is that we can take those lists of co-authors and visualise how each university or research centre works with the others.

To do this I take a map of the world and every time two institutions work together on a paper I draw a link between them. I do this in R, which was fun to try out, and there is a great guide at this site here. Each single line is drawn very faintly but you can see that they quickly build up. The results are maps like this one below:
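
The maps themselves were drawn in R, but the counting step behind them can be sketched in Python. This is only an illustration under stated assumptions: each paper is represented as a hypothetical list of institution names, links are treated as undirected, and each pair is counted once per paper.

```python
from collections import Counter
from itertools import combinations


def count_links(papers):
    """Count how many papers each pair of institutions shares."""
    links = Counter()
    for institutions in papers:
        # De-duplicate so a paper with two authors from the same place
        # does not link that place to itself
        unique = sorted(set(institutions))
        for pair in combinations(unique, 2):
            links[pair] += 1
    return links


papers = [
    ["Oxford", "Cardiff"],
    ["Cardiff", "Oxford", "Paris"],
]
links = count_links(papers)
```

A paper with ten institutions contributes 45 pairs under this scheme, which is why a single large collaboration can produce a dense web of lines on its own.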

Europe, 2005

This map shows only the connections between European nations and only in 2005. Those research centres that work most with others pop out fairly easily: Paris, Edinburgh, ESO/Max-Planck in Garching, Germany. Paris (Saclay) in particular is very strongly linked with many places in Europe. My home institution of Oxford can just be picked out in the very busy UK. You can easily make out many of the areas which were less involved in the astronomy-research community in 2005: Northwestern France, Norway and much of Eastern Europe.

On this map, big collaborations dominate. If there is a paper with ten different institutions represented then all ten of those institutions will be highlighted. One big collaboration would make a fairly complex web on its own. In 2009 there was a paper published by the LIGO consortium, involving more than 700 authors from a huge range of research centres around the world. This makes the 2009 plot for Europe, and virtually anywhere else, look quite busy.

The journals I’ve picked are large but they are only English-language. That’s because of my own bias, since I wouldn’t be able to check my working if I dealt with all the major journals from all languages. Also, I have no idea what journals exist for astronomers outside of the English-speaking world and I’m aware that English is fairly dominant in astronomy worldwide. You have to take this into account when looking at the maps.

In all these plots the intensity of the colour is normalised, such that the peak strength of connection is always set to be 100% opaque and it works down from there (linearly, if you’re still following). This means that where you see relatively bright arcs all over the map, it shows you that the places are collaborating with one another fairly evenly. When you see just a few bright arcs, it shows that those places work together a lot more relative to the others.
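
That linear normalisation can be sketched as follows. The sketch is in Python rather than the R used for the maps, and the function name `link_opacities` and its input format (a mapping from link to strength) are assumptions made for the example.

```python
def link_opacities(link_counts):
    """Scale link strengths linearly so the strongest link has opacity 1.0."""
    if not link_counts:
        return {}
    peak = max(link_counts.values())
    if peak <= 0:
        # Nothing to draw: every link is fully transparent
        return {pair: 0.0 for pair in link_counts}
    return {pair: count / peak for pair, count in link_counts.items()}


opacities = link_opacities({
    ("Cardiff", "Oxford"): 4,
    ("Cardiff", "Paris"): 2,
    ("Oxford", "Paris"): 1,
})
```

Because every map is scaled to its own peak, the opacities compare links within one map and are not directly comparable between maps unless the same scale is reused.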

Here’s Europe changing slightly, over the last few years:

Europe, 2007; Europe, 2008; Europe, 2009; Europe, 2010; Europe, 2011

Let’s look now at North America. Here is the plot for 2009:

USA, 2009

You can see a clear band of strong connections between California and the North-East of the country (roughly). There are also myriad other links drawn more faintly. Honolulu and Mauna Kea are clearly highlighted, jumping out from the Pacific – and of course this is no surprise since many major telescopes are to be found there.

Now let’s see how Europe and the USA link up. These are the two hubs of English-speaking astronomy. Here’s a plot showing links between all these places in 2008.

North Atlantic, 2008

The strength across the Atlantic is very intense – just as strong as within each region – showing that an ocean’s gap between them has little effect in working terms for the USA and Europe. With astronomy this doesn’t surprise me. Many of the big telescopes are in the US for starters. But also, researchers go where the money is and will happily jump across the pond when needed. The global picture also reveals that collaborative efforts within astrophysics increase as the years go by.

World, 2011

I’ve made a set of these global and regional maps that can be found on Flickr.

The other approach is to highlight all the links to and from just one institution. Let’s take Oxford University, since it’s where I currently work:

Oxford, 2011

This is the map for 2010 and you can see that Oxford, as a major astronomy research centre, has links to a lot of places. More interestingly, you can also see how these connections weigh against each other. Oxford is no more tied to Europe than America and appears to collaborate across the UK fairly extensively.

If we use the same intensity scale and compare this to my former institution, Cardiff, in the same year:

Cardiff, 2011

then we can see that the pattern is slightly different. Cardiff is less linked in general but has stronger connections to several locations, many European. This must be in large part due to the fact that several important instruments were built here, for Herschel and Planck. The instruments on these spacecraft, and the consortia that operate them, have been the source of a great deal of collaboration in the last few years. 2010 was a notable year for publications from those instruments.

Finally there are regions I know little about, but which appear to tell their own stories when I look at the maps. Take Australia, for example:

Oceania, 2010

This map of 2010 activity down under shows Sydney as the leading collaborator in the region. It also shows that New Zealand and Australia cooperate broadly in astronomy research. In East Asia in 2010 the map shows again that there are a variety of institutions cooperating on papers, but that Tokyo appears to be a key hub in the community.

East Asia, 2010

You can find a whole array of maps on Flickr. In my next post, I’ll take a closer look at the way these institutions work with others, see which are the most collaborative in astronomy, and think about why that might be.

UPDATE: If you’re interested in exploring or downloading the data yourself, take a look at this Google Fusion table.

I (or rather my computer) spent most of this morning geocoding the database of astronomical papers that I scraped from NASA ADS a while back. I’ve got about a quarter of a million papers, covering several of the major astronomical journals (MNRAS, ApJ, A&A, PASP and AJ) back to their first publications. There are 7 million citations and 900,000 authorships in the database.

I want to geocode the affiliations listed in those authorships, in order to explore the relationships between different institutes. Geocoding is the process of finding the latitude and longitude coordinates for a place given the address. Authors of papers give their institute’s address but they write them very inconsistently. By geocoding them down to a lat/long pair it’s easier to normalise the data and get a better feel for when two affiliations are the same place. The other day I found a Ruby gem called Geocoder that does exactly what I want and so I set about trying to avidly avoid Google’s API rate limit.

Those 900,000 authorships (individual authors on each paper) come from about 70,000 unique affiliations, of which 50,000 appear to be parseable as a potential address. Each one takes a second to be geocoded so it could take a while to do them all. I had the sense to start with the most-affiliated addresses and work down though, so in fact I already have 230,000 of those 900,000 authorships covered.
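
The "most-affiliated first" ordering can be sketched in Python like this. The real work was done with the Ruby Geocoder gem, so this is only an illustration of the ordering idea: it assumes a hypothetical flat list with one affiliation string per authorship, and it does no geocoding itself.

```python
from collections import Counter


def geocoding_order(affiliations):
    """Order unique affiliations by authorship count, tracking coverage.

    Returns (affiliation, authorships, cumulative_fraction_covered) tuples,
    most common affiliation first.
    """
    counts = Counter(affiliations)
    total = sum(counts.values())
    covered = 0
    plan = []
    for affiliation, n in counts.most_common():
        covered += n
        plan.append((affiliation, n, covered / total))
    return plan


# Toy data: ten authorships spread over three affiliations
plan = geocoding_order(["A"] * 5 + ["B"] * 3 + ["C"] * 2)
```

Geocoding in this order means that a small number of lookups covers a large share of the authorships, which matters when each lookup takes about a second.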

So far the fifteen most authorship-rich institutes are:

  1. Harvard (9576, 4.16%)
  2. Johns Hopkins University (6076, 2.64%)
  3. Cavendish Laboratory, Cambridge (5191, 2.26%)
  4. Universität Bonn (4468, 1.94%)
  5. CalTech (4442, 1.93%)
  6. Max-Planck-Institut für Astrophysik, Garching (4311, 1.87%)
  7. Tucson, Arizona (3452, 1.50%)
  8. ESO, Garching (3409, 1.48%)
  9. NASA, Goddard Space Flight Center (3342, 1.45%)
  10. Durham University (2643, 1.15%)
  11. MIT (2136, 0.93%)
  12. Paris, Observatoire (2125, 0.92%)
  13. Big Bear Solar Observatory, Pasadena (2030, 0.88%)
  14. California Univ., Berkeley (1878, 0.82%)
  15. Max Planck Institute for Astronomy, Heidelberg (1878, 0.82%)

These aren’t the ones that publish the most papers, but rather the centres that have put out the most cumulative author-credits. I’ve not normalised for date either. For all I know Harvard just published one 9,576 author paper, for example (FACT: they didn’t).

The other thing I realised as soon as they started to come in was that I can now see which research centres have the most awkward names. ESO Garching, for example, has been written in at least 15 different ways in the data I’ve gone through so far (see list at the end). It does, however, have a lot of papers, so you’d expect variations to arise.

Another inconsistently named centre is the California Institute of Technology in Pasadena. With several sub departments and multiple ways to write its name, it appears to have more than 44 variations in the way it is credited!

If we consider only the locations with more than 5 address variants, and normalise to the total number of author-credits, we get the following top-ten list of institutions with inconsistently written affiliations. These are the institutions where the number of different names is highest compared to the number of times the institution appears in total.

  1. University of California Berkeley, USA
  2. CalTech, USA
  3. INAF – IASF Bologna, Italy
  4. Instituto de Astrofísica de Andalucía, CSIC, Granada, Spain
  5. Department of Astronomy, Kyoto University, Japan
  6. Universität Bonn, Germany
  7. Instituut voor Sterrenkunde, Leuven, Belgium
  8. Universitäts-Sternwarte München, Germany
  9. Department of Applied Mathematics, The University of Leeds, UK
  10. Yale University, USA
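
One way to compute a ranking like this is sketched below in Python. The input format (a mapping from each geocoded location to its spellings and their author-credit counts) and the function name are assumptions, and the toy numbers are made up for the example rather than taken from the ADS data.

```python
def inconsistency_ranking(variant_counts, min_variants=5):
    """Rank locations by address variants per author-credit.

    variant_counts maps a location to {spelling: number_of_author_credits}.
    Only locations with more than `min_variants` spellings are kept.
    """
    scores = []
    for location, spellings in variant_counts.items():
        n_variants = len(spellings)
        credits = sum(spellings.values())
        if n_variants > min_variants and credits > 0:
            scores.append((location, n_variants / credits))
    # Highest ratio first: many spellings relative to few credits
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (made up): X has 6 spellings over 60 credits,
# Y has 8 spellings over 400 credits, Z has too few spellings to count
toy = {
    "X": {f"x{i}": 10 for i in range(6)},
    "Y": {f"y{i}": 50 for i in range(8)},
    "Z": {f"z{i}": 1 for i in range(3)},
}
ranking = inconsistency_ranking(toy)
```

In the toy data X ranks above Y even though Y has more spellings, because X has more spellings per author-credit.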

Should this be a worry? I suppose many research centres have ‘defined’ names that everyone should be using, but no one is doing so (or at least no one is checking). I know that here in Oxford there is everyday discrepancy between ‘Oxford University’ and the ‘University of Oxford’ and that is just the start of these things.

In the world of big data, geographical information is very important. Big data is also often reliant on the once-typed or written words of human beings. If academic researchers cannot credit their institutes consistently then presumably no one is typing the addresses of many places correctly. Perhaps research papers should be encoded with co-ordinates? Either way, geocoding is a very important tool in an era of big, personal data.

Once I have more of the ADS data geocoded, there is more that can be done here.


List of variations for affiliation credits to ESO, Garching:

  • ESO, Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, 85748 Garching bei München, Germany
  • European Southern Observatory, D-85748 Garching bei München, Germany
  • European Southern Observatory, Garching bei München, Germany
  • European Southern Observatory, Karl Schwarzschild Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching b. München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching, Germany
  • European Southern Observatory, Karl-Schwarzschild-str. 2, 85748, Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, D-85748 Garching bei Munchen, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschildstr. 2, 85748 Garching bei München, Germany