Executable Publications as Sensors for Science

July 10, 2014 — Leave a comment


Executable papers are a cool idea in research [1]. You take a study, write it up as a paper and bundle together all your code, scripts and analysis in such a way that other people can take the ‘paper’ and run in themselves. This has three main attractive features, as I see it:

  1. It provides transparency for other researchers and allows everyone to run through your working to follow along step-by-step.
  2. It allows your peers to give you detailed feedback and ideas for improvements – or do the improvements themselves
  3. It allows others to take your work and try it out on their own data

The main problem is that these don’t really exist ‘in the wild’, and where they do they’re in bespoke formats even if they’re open source. iPython Notebook is a great way of doing something very much like an executable paper, for example. Another way would be to bundle up a virtual machine and share a disk image. Executable papers would allow for rapid-turnaround science to happen. For example, let’s imagine that you create a study and use some current data to form a theory or model. You do an analysis and create an executable paper. You store that paper in a library and the library periodically reruns the study when new data become available [2]. The library might be a university library server, or maybe it’s something like the arXiv, ePrints, or GitHub.

This is roughly what happens in some very competitive fields of science already – only with humans. Researchers write papers using simulated data and the instant they can access the anticipated data the import, run and publish. With observations of the Cosmic Microwave Background (CMB) it is the case that several competing researchers are waiting to work on the data – and new data come sour very rarely. In fact that day after the Planck CMB data was released last year, there was a flurry of papers submitted to the arXiv. Those who got in early, likely had pre-written much of the work and simply ran their code as soon as they had downloaded and parsed new, published data.

If executable papers could be left alone to scan the literature for new, useful data then they could also look for new results from each other. A set of executable papers could work together, without planning, to create new hypotheses and new understanding of the world. Whilst one paper crunches new environmental data, processing it into a catalogue, another could use the new catalogue to update climate change models and even automatically publish significant changes or new potential impacts for the economy.

I should be possible to make predictions in executable papers and have them automatically check for certain observational data and automatically republish updated results. So one can imagine a topical astronomy example where the BICEP2 results would be automatically checked against any released Planck data and then create new publications when statistical tests are met. Someone should do this if they haven’t already. In this way, papers can continue to further, or verify, our understanding long after publication.

SKA Rendering (Wikimedia Commons)

SKA Rendering (Wikimedia Commons)

This is high-frequency science [3], akin to high-frequency trading, and it seems like an interesting approach to some upcoming data-flow issues in science. The Large Hadron Collider (LHC), Large Synoptic Survey Telescope) LSST, and Square Kilometre Array (SKA) are all huge scientific instruments set to explore new parts o the universe and gathering huge volumes of data to be analysed.

Even the deployment of Zooniverse-scale citizen science cannot get around the fact that instruments like the SKA will create volumes of data that we don’t know what to do with, at a pace we’ve never seen before. I wonder if executable papers, set to scour the SKA servers for new data, could alleviate part of the issue by automatically searching for theorised trends. The papers would be sourced by the whole community, and peer-reviewed as is done today, effectively crowdsourcing the hypotheses through publications. This cloud of interconnected, virtual researchers, would continuously generate analyses that could be verified by some second peer-review process; since one would expect a great deal of nonsense in such a setup.

When this came up at a meeting the other day, Kevin Page (OeRC) remarked that we might just be describing sensors. In a way he’s right – but these are software sensors, built on the platform and infrastructure of the scientific community. They’re more like advanced tools; a set of ghost researchers, left to think about an idea in perpetuity, in service of the community that created them.

I’ve no idea if I’m describing anything real here – of it’s just an expression of way of partially automating the process of science. The idea stuck with me and I found myself writing about it to flesh it out – thus here is a blog post – and wondering how to code something like it. Maybe you have a notion too. If so, get in touch!


[1] But not a new one really. It did come up again at a recent Social Machines meeting though, hence this post.
[2] David De Roure outlined this idea quite casually in a meeting the other day, I’ve no ice air it’s his or just something he’s heard a lot and thought was quite cool.
[3] This phrasing isn’t mine, but as soon as I heard it, I loved it. The whole room got chatting about this very quickly so provenance was lost I’m afraid.

No Comments

Be the first to start the conversation!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s