
DbDb and Reproducible Data

Published on Apr 27, 2016

A project for completion towards Travis Rich's written General Exam requirement. Administered by W. Daniel Hillis.


Introduction

Many domains of science increasingly draw their conclusions from large datasets. However, subjecting these datasets to the same peer-review process as the written portion of a scientific publication has proven challenging.

This project seeks to build a tool that allows a user to upload datasets, perform analyses, visualize how the datasets are transformed, and edit or fork this process at any point along the transformation path.

Technical challenges of publishing data

There exist both technical and cultural challenges to realizing the open-data goals of many scientists. One of the more obvious technical challenges is the transfer of large scientific datasets, which can range from gigabytes to petabytes depending on the domain [1]. This challenge is compounded by the facts that derivative data and analyses can often be larger than the raw data, that data formats and their applications are rapidly evolving (e.g. the rise of voxel or genetic datasets), and that many journals impose restrictively small size limits on supplemental data attachments [2].

Attempts have been made to use peer-to-peer networks to distribute some of this load, notably the BioTorrents project, which focused on sharing large scientific datasets [3]. Another key challenge, tracking changes to a dataset as it evolves, is approached by a number of projects that do everything from schema recommendation to git-like version control [4][5].

However, simply tracking the data is often insufficient. Analyses, methods, and provenance are critically important, especially in instances where the measurement environment is sufficiently complex or rare (e.g. data collected from the Deepwater Horizon oil spill) [3].

Cultural challenges of publishing data

Beyond the technical architecture challenges, the cultural landscape around scientific data also presents challenges. Most notably, there is a general apprehension about releasing raw data publicly, as authors fear that others will scoop their results. This fear persists despite evidence that open-access mechanisms increase citation impact [7]. Many have suggested that publishing datasets in their own right and leveraging existing citation workflows is one technique to provide existing incentives that can counteract this fear [2]. Support for this idea has manifested in several ways, including Nature launching a peer-reviewed journal for datasets (Scientific Data). Further support has come in the form of authors providing best practices and common legal frameworks to make data sharing an easy transition [6][9].

Another fear, noted by Alberts et al., surrounds the connotations of having work retracted. Some authors worry that publishing datasets will lead to the discovery of a mistake and, in turn, to their work being retracted. Retraction currently serves both to correct honest mistakes and to punish fraudulent work. Conflating the two pushes honest mistakes into the domain of the shameful, rather than heralding them as healthy components of the scientific process. Alberts et al. suggest changing the vocabulary for these types of events (e.g. from 'retraction' to 'voluntary withdrawal') [3]. This project is further motivated by the notion that documenting the full flow of data analysis may help differentiate honest mistakes from fraudulent attempts.

The result of these challenges is that the basic goal of data disclosure goes unfulfilled, if not literally then practically, as the data, the environments to analyze it, and the tools to understand and access it are severely limited.

Contemporary Needs

Beyond the idealistic goals of many open-data proponents, there exist very real contemporary needs for tools to access, manage, and understand open data. One such need is in the domain of climate science, where enormous datasets are being used as the crux of billion-dollar decisions and government policies [4]. These decision makers are often not the climate scientists collecting the data, or even scientists at all, so tools that make these complex and evolving datasets explorable and comprehensible are urgently needed.

Furthermore, growing interest in analyzing the scientific process itself is greatly aided by the cultural and technical progression towards open-data tools. Metaknowledge-style analyses are often performed on the citation counts, word contents, author affiliations, etc. of scientific publications [5]. However, robust and popular open-data tools would allow scientists to extend such analyses to the raw data fueling many different fields. Differing thresholds for certainty or varying practices for analysis could be applied across fields to create more globally driven consensus.

Project Objectives

The core project deliverable is a web system that:

  • Allows datasets to be uploaded and stored

  • Allows uploaded datasets to be analyzed (using Python as a first common language, but capable of expanding to other languages)

  • Provides a tool for “forking” and re-analyzing data at any stage in its analysis timeline

  • Provides a visual tool for exploring the map of forked and re-analyzed datasets

  • Provides access to data at any stage in a dataset’s analysis timeline (for visualization, external storage, etc)

Project Architecture

The frontend of the project is written in Angular, a JavaScript framework, and interacts with the backend through an API written in Node.js. The backend API stores and reads data from a MongoDB instance hosted on MongoLab. A Python worker runs jobs (explained later) as its own independent process that waits and checks for job notifications. To describe the architecture more deeply, we walk through a typical usage scenario of the platform (currently called DbDb, pronounced 'dubdub').
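
As a concrete illustration, the worker's wait-and-check cycle could be as simple as polling a jobs collection. The sketch below is hypothetical: the collection, field names, and statuses are assumptions rather than DbDb's actual schema, and run_code_node is sketched later under 'Adding code'.

```python
# Minimal sketch of a polling worker (hypothetical schema; DbDb's actual
# job-notification mechanism may differ).
import time
from pymongo import MongoClient

jobs = MongoClient("mongodb://localhost:27017")["dbdb"]["jobs"]

def poll_for_jobs(interval=2.0):
    while True:
        # Atomically claim one pending job so concurrent workers don't collide.
        job = jobs.find_one_and_update({"status": "pending"},
                                       {"$set": {"status": "running"}})
        if job is None:
            time.sleep(interval)
            continue
        result = run_code_node(job["code"], job["input_data"])
        jobs.update_one({"_id": job["_id"]},
                        {"$set": {"status": "done", "result": result}})
```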

Creating a project

The landing page of DbDb displays all currently existing data-analysis projects. A single project starts with a base dataset and can branch into many child datasets and analysis code nodes.

The first step in creating a new project is to upload a dataset. Uploading a dataset triggers our backend service to store the original file as well as cache the content for easy access. Currently DbDb supports CSV upload, but it is trivial to implement support for other formats such as JSON, XLS, or direct database URI connections. Once the dataset has been processed and is available in our backend, we trigger a notification on the frontend (a green flash of the 'create new' node) and populate a data node in its place.
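
The ingest step amounts to archiving the raw bytes and caching a parsed form. A minimal sketch, in Python for illustration (the production API is Node.js, and every name below is an assumption):

```python
# Store the original upload untouched and cache a parsed representation for
# quick display in data nodes. (Illustrative only; names are hypothetical.)
import csv
import hashlib
import os
import shutil

def ingest_csv(path, archive_dir, cache):
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    # Keep the raw file so the original upload is always recoverable.
    shutil.copy(path, os.path.join(archive_dir, digest + ".csv"))
    # Cache rows as dicts keyed by the CSV header.
    with open(path, newline="") as f:
        cache[digest] = list(csv.DictReader(f))
    return digest  # id used to populate the new data node on the frontend
```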

Clicking this data node allows you to explore the uploaded dataset as well as add a title describing it. Clicking the '+' icon below the data node creates a new code node. A code node is by default sourced with a variable 'inputData' that represents the parent data node (in this initial case, the uploaded CSV).
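
Assuming the worker wraps node code in a function (as sketched in the next section), a first code node might look like the following; the column name is a hypothetical stand-in, not a field from the actual test data:

```python
# `inputData` is injected by DbDb and holds the parent data node's contents,
# here the uploaded CSV parsed into rows. "femur_length" is illustrative.
cleaned = [row for row in inputData if row["femur_length"]]
for row in cleaned:
    row["femur_length"] = float(row["femur_length"])
return cleaned  # the returned value becomes a new child data node
```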

Adding code

Code can be written in a code node. Currently, DbDb supports Python, but it is straightforward to implement support for other languages. When code is entered and run, a notification is triggered on our backend. This notification is received by a Python worker module that takes the given code, executes it, and returns the output. All print output is returned as text to the code node's 'output' section. Error messages are returned in a similar way and marked with red text. If the code plots a figure, the figure is saved and made accessible from within the code node (as well as through a small thumbnail image on top of the node icon). If there is a returned variable, it is interpreted as the output dataset of the code node's 'transition function' and triggers the creation of a new data node below the current code node.
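
A minimal sketch of such an execution harness, assuming the worker wraps the node's code in a function so a top-level return is legal; the structure is the point, not DbDb's actual implementation:

```python
# Run a code node: capture prints, report errors, save any plotted figure,
# and hand back the returned dataset. (Sketch; details are assumptions.)
import io
import textwrap
import traceback
from contextlib import redirect_stdout

import matplotlib
matplotlib.use("Agg")  # headless rendering inside the worker process
import matplotlib.pyplot as plt

def run_code_node(code, input_data):
    wrapped = "def __node__(inputData):\n" + textwrap.indent(code, "    ")
    stdout, scope = io.StringIO(), {}
    try:
        exec(wrapped, scope)
        with redirect_stdout(stdout):
            returned = scope["__node__"](input_data)
    except Exception:
        # Shown in red in the node's output section.
        return {"output": stdout.getvalue(), "error": traceback.format_exc()}
    figure_path = None
    if plt.get_fignums():  # the code plotted something
        figure_path = "figure.png"
        plt.savefig(figure_path)
        plt.close("all")
    return {"output": stdout.getvalue(), "error": None,
            "figure": figure_path, "data": returned}
```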

This new data node is now identical in functionality to the original data node. A code node can be appended as its child and the chain of data processing can continue.

Forking Analysis Trees

At certain points in the analysis chain of a dataset, a user may wish to fork off a new line of processing and exploration. To support this, DbDb allows any node to be 'forked'. Forking a node creates a new code node which replaces any child nodes at the point of forking. This gives the user a new independent chain of analysis that does not interfere with other written analyses.
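
Because nodes form a tree via parent pointers, a fork is little more than inserting a fresh code node under the forked node. A sketch against the hypothetical Mongo schema used in the earlier worker sketch:

```python
# Fork a node: start a new, empty code node whose parent is the forked node.
# (Hypothetical schema, continuing the earlier sketches.)
def fork_node(nodes, node_id):
    new_code_node = {
        "type": "code",
        "parent": node_id,  # shares the forked node's data as `inputData`
        "code": "",
    }
    return nodes.insert_one(new_code_node).inserted_id
```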

Exploring the Analysis Tree

As nodes are forked and the analysis chain grows, the branching structure of this node tree is captured and made explorable. Clicking the Tree button reveals the full node tree. Hovering over a node highlights the chain it belongs to, and clicking a node isolates the view to show only that line of analysis. In this isolated view, edits can be made to the code nodes and the step-by-step datasets can be inspected. A button toggles whether node titles are displayed, to allow simple viewing of tree hierarchies.
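
Recovering the chain a node belongs to is a walk up the parent pointers; a sketch, again over the assumed schema:

```python
# Walk parent pointers back to the root to recover a node's line of analysis,
# e.g. for highlighting the chain of a hovered node. (Assumed schema.)
def chain_to_root(nodes, node_id):
    chain = []
    while node_id is not None:
        node = nodes.find_one({"_id": node_id})
        chain.append(node)
        node_id = node.get("parent")
    return list(reversed(chain))  # root first, hovered node last
```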

Importantly, the data at each stage of every analysis tree is archived and available for download. This makes it simple to download and use only the portion of data that is needed.

Test Case: Dinosaur Growth Rates

To explore how DbDb could be used in the scientific process, we use a real-world test case from Myhrvold's [6] challenge to Bybee's work on dinosaur growth rates (notably [7]).

In his paper, Myhrvold critiques many different components of research focusing on the question, 'why were dinosaurs so big?'. There are many different analyses and papers by different groups that explore dinosaur growth models. One analysis of interest, which we use DbDb to reproduce, focuses on measurements of femurs from Allosaurus specimens of varying age. This analysis was originally put forward by Bybee et al. and was critiqued by Myhrvold. Myhrvold recounts the collection of longitudinal data on femur fossils from six specimens and states that Bybee et al. fit curves by eye using integer age offsets (note: I cannot gain access to Bybee's original paper, as it is behind a paywall that MIT does not subscribe to). Myhrvold presents his own curve-fitting analysis, "using a least-squares minimization procedure that does not constrain the age offsets to integer values". He goes on to note that the outcome varies substantially based on which procedure is used. I was unable to find anything documenting Myhrvold's least-squares minimization procedure. While least-squares optimization is a very common technique, it is usually applied when a single cluster of data points is to be fitted with a best-fit line. Fitting two or more given curves (each with varying sections of overlap) is not as obvious a procedure.
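
One plausible reading of such a procedure, offered here only as a guess since Myhrvold's method is undocumented: treat each specimen's age offset as a free real-valued parameter, fit one shared growth curve to the pooled, shifted points, and minimize the squared residuals over the offsets. All details below (polynomial curve family, optimizer, data shapes) are assumptions:

```python
# Hypothetical reconstruction of a least-squares age-offset clustering.
import numpy as np
from scipy.optimize import minimize

def cluster_offsets(series, degree=3):
    """series: list of (ages, sizes) numpy-array pairs, one per specimen."""
    def cost(free):
        # Pin the first specimen's offset at 0; otherwise shifting every
        # series (and the fitted curve) by the same amount changes nothing.
        offsets = np.concatenate(([0.0], free))
        ages = np.concatenate([a + o for (a, _), o in zip(series, offsets)])
        sizes = np.concatenate([s for _, s in series])
        coeffs = np.polyfit(ages, sizes, degree)  # one shared growth curve
        return np.sum((np.polyval(coeffs, ages) - sizes) ** 2)

    result = minimize(cost, np.zeros(len(series) - 1), method="Nelder-Mead")
    return np.concatenate(([0.0], result.x))  # real-valued, non-integer offsets
```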

The figure and caption used in Myhrvold's paper are shown below [8].

Longitudinal time series for A. fragilis femora. A, raw data from Bybee et al.; the time series of LAGs from each specimen is plotted as a separate curve. B, the data set Allosaurus fc1, in which age offsets were applied to line up the time series by eye, as published in Bybee et al. C, the data set Allosaurus fc2, in which a least-squares minimization procedure was applied to optimize the age offsets to produce a single cluster. D, splitting the A. fragilis femur data into two groups and separately clustering each group by using the least-squares method yields the Allosaurus fc3 (left cluster) and Allosaurus fc4 (right cluster) data sets.

Reproducing the Raw Data

While Myhrvold provides a table recounting the data as plotted in the above figure for subplots B, C, and D, he does not provide the raw data. I believe this data is represented in Bybee's original paper [9], but again, restricted access means I cannot consult this work. To gather the raw data, I take the data given in subplot B and subtract the associated age offsets listed. Note, however, that there is no indication of which plotted series corresponds to which offset (i.e. which line is S1?). Decoding this was painstaking, done by consulting subplots B, C, and D and trying to back-produce the naming of each curve based on the associated offsets.
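
The recovery step itself is a simple subtraction once the specimen-to-offset mapping has been inferred. A sketch with placeholder identifiers and offset values (the real mapping had to be reverse-engineered from the subplots):

```python
# Undo subplot B's alignment: subtract each specimen's listed age offset.
# Offsets and specimen ids below are placeholders, not the paper's values.
offsets = {"S1": 3.0, "S2": 3.0, "S3": 0.0}

def recover_raw(plotted):
    """plotted: dict mapping specimen id -> list of (age, femur_size)."""
    return {spec: [(age - offsets[spec], size) for age, size in points]
            for spec, points in plotted.items()}
```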

Furthermore, and in possible error, note that in the raw data (subplot A), all specimen data start at age = 0. In subplot B, two specimens are given an age offset of 3.0, yet all plotted curves begin at or above 4. For the sake of proceeding, I assume this is an error in the plotting of the y-axis labels and use the listed offsets as truth. It does raise the concern, though, that either the listed offsets or the plotted lines do not represent the data in its final form, and that one of the two (the axis labels or the listed offsets) is out of sync with the conclusive result.

Once recovered, the raw dataset is added to DbDb. A series of parsing and cleaning steps are added as code nodes, and a final plotting step is appended.

The resulting plot matches the raw plot provided by Myhrvold in subplot A.

To reproduce the remaining subplots, the listed offsets are applied. Each of these trajectories is created as its own forked branch from the original raw dataset. Having the data and its analysis structured this way allows us to use DbDb to show an overview of how each subplot is generated.

The resulting plots are all shown below in the same orientation as Myhrvold's original figure. Note the difference in y-axis labeling, as my reproduced plots all start below 4.

A few frustrations in reproducing this data include the discrepancy between the offset values and the axis labels, the lack of clarity about the specific least-squares minimization procedure, and the choice of axes in these plots. While Myhrvold may have adopted this convention from Bybee's original choice, it seems misleading to plot age (the independent variable) on the y-axis. With age on the y-axis and size on the x-axis, the shape of the graph looks as though it is tapering off at some convergent maximum, when in reality the graph shows size accelerating as the dinosaur grows older. Myhrvold does acknowledge that this is a "pattern that seems biologically unlikely and is not consistent with trajectories seen in the humerus, tibia, or ulna data for Allosaurus".

Lessons learned with DbDb

One of the main takeaways is that each of the frustrations and irreproducible aspects of the work in Myhrvold's paper could be addressed much more specifically and definitively if documented in DbDb or a similar format. Seeing the full chain of analyses, where they fork, and the outputs of each fork, along with the associated code, removes any ambiguity about the procedure. Offset discrepancies could be fully traced, and labels could be updated to reflect the color of each plot.

Future Steps and Conclusion

The work above describes a proof-of-concept build of a data-archiving and analysis-tracking system. The tool could provide great value in the challenge of increasing the reproducibility of scientific work. Many components would need to be developed to transition it into a public-ready tool, specifically features like user accounts, permissions, and support for more languages and data formats. These are straightforward to implement and are merely a matter of development time.

It is clear that the need for such a tool exists. The real-world example of Dr. Myhrvold [10] challenging the reproducibility of research on dinosaur growth rates (notably [11]) while possibly falling victim to the same traps he critiques (errors in graphs, ambiguous procedures, etc.) suggests that these problems are not ones of mere laziness or fraud, but stem from genuine difficulty. Providing tools to aid in avoiding such errors, and to make them addressable when discovered, is therefore critically important.

Citations:

[1] Alberts, B. et al. Self-correction in science at work. Science. 348, 1420–1422.

[2] Bybee, P.J. et al. Sizing the Jurassic theropod dinosaur Allosaurus: assessing growth strategy and evolution of ontogenetic scaling of limbs. Journal of Morphology. 267, 3, 347–359.

[3] Chavan, V., Settele, J., and Penev, L. Ecology Metadata as Peer-Reviewed Data Papers.

[4] Evans, J. and Foster, J. Metaknowledge. Science. 331, 721–725.

[5] Myhrvold, N.P. Revisiting the estimation of dinosaur growth rates. PLoS ONE. 8, 12, e81917.

[6] Overpeck, J.T. et al. Climate data challenges in the 21st century. Science. 331, 6018, 700–702.

[7] Reichman, O.J. et al. Challenges and Opportunities of Open Data in Ecology. Science. 331, 6018, 703–705.

[8] Rowe, T. and Frank, L.R. The disappearing third dimension. Science. 331, 6018, 712–714.

[9] Stodden, V. and Miguez, S. Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research. Journal of Open Research Software. 2, 1, e21.

[10] Langille, M.G.I. and Eisen, J.A. BioTorrents: a file sharing service for scientific data. PLoS ONE. 5, 4, e10071.

[11] dat.

[12] Data publication in the open access initiative. Data Science Journal. 5, 79–83.

[13] Bhardwaj, A. et al. DataHub: Collaborative Data Science & Dataset Version Management at Scale.

[14] Publication-Archiving, Data-Archiving and Scientometrics. Access. August 2007, 1–10.
