
I (or rather my computer) spent most of this morning geocoding the database of astronomical papers that I scraped from NASA ADS a while back. I’ve got about a quarter of a million papers, covering several of the major astronomical journals (MNRAS, ApJ, A&A, PASP and AJ) back to their first publications. There are 7 million citations and 900,000 authorships in the database.

I want to geocode the affiliations listed in those authorships, in order to explore the relationships between different institutes. Geocoding is the process of finding the latitude and longitude coordinates for a place given its address. Authors of papers give their institute’s address, but they write it very inconsistently. By geocoding each address down to a lat/long pair it’s easier to normalise the data and get a better feel for when two affiliations are the same place. The other day I found a Ruby gem called Geocoder that does exactly what I want, and so I set about trying hard to avoid hitting Google’s API rate limit.
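To make the normalisation idea concrete, here is a minimal Ruby sketch of grouping affiliation strings by rounded coordinates. It is not the code I actually ran: `group_by_location`, the `precision` setting and the stand-in lookup table (with illustrative coordinates) are all assumptions made for the example. In real use the callable could wrap the Geocoder gem's `Geocoder.search`, but that wiring is left out so the sketch runs without network access.

```ruby
# Hypothetical lookup standing in for a real geocoder. The coordinates are
# illustrative values, not output from any geocoding service.
fake_lookup = {
  "ESO, Garching bei München, Germany" => [48.2599, 11.6712],
  "European Southern Observatory, Garching, Germany" => [48.2601, 11.6709],
  "University of Oxford, UK" => [51.7548, -1.2544],
  "Oxford University, UK" => [51.7549, -1.2541]
}

# Group affiliation strings whose coordinates agree after rounding.
# `geocode` is any callable mapping an address to [lat, lng] or nil.
def group_by_location(affiliations, geocode, precision: 2)
  groups = Hash.new { |hash, key| hash[key] = [] }
  affiliations.each do |address|
    coords = geocode.call(address)
    next if coords.nil? # skip anything the geocoder could not resolve
    key = coords.map { |value| value.round(precision) }
    groups[key] << address
  end
  groups
end

groups = group_by_location(fake_lookup.keys, ->(address) { fake_lookup[address] })
groups.each do |coords, names|
  puts "#{coords.inspect}: #{names.length} variant(s)"
end
```

Rounding to two decimal places bins points on a grid of roughly a kilometre, which is crude: two spellings of the same place can still land either side of a bin edge, so a distance threshold would be a more careful choice.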

Those 900,000 authorships (individual authors on each paper) come from about 70,000 unique affiliations, of which 50,000 appear to be parseable as a potential address. Each one takes a second to geocode, so it could take a while to do them all. I had the sense to start with the most-affiliated addresses and work down, though, so in fact I already have 230,000 of those 900,000 authorships covered.
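At one lookup per second, 50,000 addresses would take 50,000 seconds, or a little under 14 hours, which is why the ordering matters. The sketch below shows one way to build a most-credited-first queue and to work out how much of the data a partial run covers. The method names and the counts are made up for illustration and are not taken from my database.

```ruby
# Order unique affiliations so the most-credited ones are geocoded first.
# `counts` maps an affiliation string to its number of authorships.
# Ties are broken alphabetically so the order is deterministic.
def geocoding_queue(counts)
  counts.sort_by { |name, count| [-count, name] }.map(&:first)
end

# Fraction of all authorships covered once the first `done` entries of the
# queue have been geocoded.
def coverage(counts, queue, done)
  total = counts.values.sum
  return 0.0 if total.zero?
  queue.first(done).sum { |name| counts[name] }.fdiv(total)
end

# Made-up counts, for illustration only.
counts = {
  "Institute A" => 50,
  "Institute B" => 1,
  "Institute C" => 30,
  "Institute D" => 19
}

queue = geocoding_queue(counts)
puts queue.inspect
# => ["Institute A", "Institute C", "Institute D", "Institute B"]
puts coverage(counts, queue, 2)
# => 0.8
```

In a real run the loop over `queue` would also need a pause between lookups (a plain `sleep 1` would do) to stay under whatever rate limit the geocoding service applies.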

So far the fifteen most authorship-rich institutes are:

  1. Harvard (9576, 4.16%)
  2. Johns Hopkins University (6076, 2.64%)
  3. Cavendish Laboratory, Cambridge (5191, 2.26%)
  4. Universität Bonn (4468, 1.94%)
  5. CalTech (4442, 1.93%)
  6. Max-Planck-Institut für Astrophysik, Garching (4311, 1.87%)
  7. Tucson, Arizona (3452, 1.50%)
  8. ESO, Garching (3409, 1.48%)
  9. NASA, Goddard Space Flight Center (3342, 1.45%)
  10. Durham University (2643, 1.15%)
  11. MIT (2136, 0.93%)
  12. Paris, Observatoire (2125, 0.92%)
  13. Big Bear Solar Observatory, Pasadena (2030, 0.88%)
  14. California Univ., Berkeley (1878, 0.82%)
  15. Max Planck Institute for Astronomy, Heidelberg (1878, 0.82%)

These aren’t the ones that publish the most papers, but rather the centres that have put out the most cumulative author-credits. I’ve not normalised for date either. For all I know Harvard just published one 9,576-author paper, for example (FACT: they didn’t).

The other thing I realised as soon as the results started to come in was that I can now see which research centres have the most awkward names. ESO Garching, for example, has been written in at least 15 different ways in the data I’ve gone through so far (see list at the end). It does have a lot of papers, though, so you’d expect variations to arise.

Another inconsistently named centre is the California Institute of Technology in Pasadena. With several sub-departments and multiple ways to write its name, it appears to have more than 44 variations in the way it is credited!

If we consider only the locations with more than 5 address variants, and normalise by the total number of author-credits, we get the following top-ten list of institutions with inconsistently written affiliations. These are the institutions where the number of different names is highest relative to the number of times the institution appears in total.

  1. University of California Berkeley, USA
  2. CalTech, USA
  3. INAF – IASF Bologna, Italy
  4. Instituto de Astrofísica de Andalucía, CSIC, Granada, Spain
  5. Department of Astronomy, Kyoto University, Japan
  6. Universität Bonn, Germany
  7. Instituut voor Sterrenkunde, Leuven, Belgium
  8. Universitäts-Sternwarte München, Germany
  9. Department of Applied Mathematics, The University of Leeds, UK
  10. Yale University, USA
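
For anyone wanting to reproduce a ranking like this, here is a rough Ruby sketch of the calculation as described: keep locations with more than 5 address variants, then sort by variants per author-credit. The method name, the hash layout and the numbers are all hypothetical, and dividing variants by credits is only one reading of "normalise".

```ruby
# Rank locations by address variants per author-credit, keeping only
# locations with more than `min_variants` distinct spellings.
# Each entry is a hash of the form { name:, variants:, credits: }.
def inconsistency_ranking(locations, min_variants: 5)
  locations
    .select { |loc| loc[:variants] > min_variants && loc[:credits].positive? }
    .map { |loc| loc.merge(ratio: loc[:variants].fdiv(loc[:credits])) }
    .sort_by { |loc| -loc[:ratio] }
end

# Made-up figures, for illustration only.
locations = [
  { name: "Institute A", variants: 44, credits: 4000 },
  { name: "Institute B", variants: 12, credits: 300 },
  { name: "Institute C", variants: 3,  credits: 10 }
]

inconsistency_ranking(locations).each_with_index do |loc, index|
  puts format("%d. %s (%.4f variants per credit)", index + 1, loc[:name], loc[:ratio])
end
```

The cut at more than 5 variants matters: without it, a place that appears three times under three spellings would top the list on a ratio of 1.0.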

Should this be a worry? I suppose many research centres have ‘defined’ names that everyone should be using, but no one is doing so (or at least no one is checking). I know that here in Oxford there is an everyday discrepancy between ‘Oxford University’ and the ‘University of Oxford’, and that is just the start of these things.

In the world of big data, geographical information is very important. Big data is also often reliant on the once-typed or written words of human beings (e.g. oldweather.org). If academic researchers cannot credit their institutes consistently, then presumably few people are typing the addresses of other places consistently either. Perhaps research papers should be encoded with co-ordinates? Either way, geocoding is a very important tool in an era of big, personal data.

Once I have more of the ADS data geocoded, there is more that can be done here.

---

List of variations for affiliation credits to ESO, Garching:

  • ESO, Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • ESO, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, 85748 Garching bei München, Germany
  • European Southern Observatory, D-85748 Garching bei München, Germany
  • European Southern Observatory, Garching bei München, Germany
  • European Southern Observatory, Karl Schwarzschild Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching b. München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching, Germany
  • European Southern Observatory, Karl-Schwarzschild-str. 2, 85748, Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Str. 2, D-85748 Garching bei Munchen, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschild-Strasse 2, D-85748 Garching bei München, Germany
  • European Southern Observatory, Karl-Schwarzschildstr. 2, 85748 Garching bei München, Germany