[URFC -4] Data & Model Sharing: Questions to capture provenance & context. v.6 2020-09-30
Detailed provenance and context are essential for collaborating around data, especially large datasets. They help evaluate how to interpret derived models and composite datasets, and how to qualify observations, claims, & conclusions. That in turn informs reviews and replications, and future metastudies or refactoring in new research contexts.
Here is a taxonomy of questions that I’ve found helpful to answer early and often, as data is produced, shared, updated, and analysed. For composite datasets that draw on multiple sources, a single characteristic may vary across the set - in which case it is worth elaborating on how one makes sense of that variance.
Check that answers to the relevant questions are easy to find in your documentation and, where appropriate, in structured metadata.
See also: FAIRDOM’s data management checklist & OKFN’s frictionless data tools
What schemas, formats, and metadata standards do your data and models use?
What are their sources? What processes were used to produce them, and what tools are used to generate, edit, and work with them?
0.1 Schemas
0.11 What ontologies, schemas and shapes are used?
0.12 Are these defined in a spec? How do they relate to common standards?
0.13 Are the shapes + their specs versioned? (see the sketch below)
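As one way to make “versioned shapes” concrete: a minimal sketch of a record-level schema that carries its own identifier and version, written here as a JSON Schema fragment built in Python. The URI, field names, and version string are illustrative assumptions, not a recommended layout.

    # Illustrative only: a minimal JSON Schema "shape" for one record type,
    # carrying its own identifier and version so downstream users can cite it.
    import json

    SAMPLE_SCHEMA = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": "https://example.org/schemas/observation/1.2.0",  # hypothetical, versioned URI
        "title": "Observation",
        "type": "object",
        "required": ["id", "measured_at", "value"],
        "properties": {
            "id": {"type": "string"},
            "measured_at": {"type": "string", "format": "date-time"},
            "value": {"type": "number"},
            "unit": {"type": "string"},
        },
    }

    print(json.dumps(SAMPLE_SCHEMA, indent=2))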
0.2 Formats
0.21 What file and data formats are used?
0.22 What database structures and formats are used?
0.23 Can these be easily converted to common shared formats?
If not, what alternatives do you recommend for export and reuse?
0.3 Metadata
0.31 What metadata do you publish with your data?
0.32 Is it in a standard machine-readable format? (e.g., the sketch below)
0.33 Is it clear how, why and by whom the data were created and processed?
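Where a standard machine-readable format is wanted, one common option is a schema.org Dataset description serialized as JSON-LD. A minimal sketch in Python; every name, date, URL, and identifier below is a placeholder, not real metadata.

    # Illustrative only: publishing dataset metadata as schema.org JSON-LD.
    # All names, dates, URLs, and identifiers are placeholders.
    import json

    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Example measurements",                      # hypothetical
        "description": "Cleaned observations from a field campaign.",
        "creator": [{"@type": "Person", "name": "A. Researcher"}],
        "dateCreated": "2020-09-30",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "identifier": "https://doi.org/10.xxxx/example",     # placeholder DOI
    }

    with open("dataset.jsonld", "w") as f:
        json.dump(metadata, f, indent=2)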
0.4 Integrity
0.41 What methods are used to ensure the integrity of files? (checksums, content hashing?)
0.42 What methods are used to ensure the sequence integrity of aggregated files? (e.g., hash trees, ledgers; see the sketch below)
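A minimal sketch of both ideas, assuming a local directory of CSV files (the path and layout are hypothetical): per-file SHA-256 checksums, plus a single chained digest over the ordered set so that any modification or reordering changes the aggregate value. Real pipelines may instead use Merkle trees or an append-only ledger.

    # Illustrative only: per-file checksums plus a simple chained digest over an
    # ordered list of files, so any change or reordering alters the aggregate value.
    import hashlib
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def chained_digest(paths):
        running = hashlib.sha256()
        for p in paths:  # order matters: a reordering changes the result
            running.update(sha256_file(p).encode())
        return running.hexdigest()

    files = sorted(Path("data").glob("*.csv"))  # hypothetical layout
    for p in files:
        print(p.name, sha256_file(p))
    print("aggregate:", chained_digest(files))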
1.1 Sources
1.11 What sources are used, and from what larger set are they drawn?
1.12 Are the sources versioned?
1.13 How often is each source pulled / pushed / otherwise updated?
1.2 Credit
1.21 Who was involved in producing and sharing data?
1.22 Is there CRediT-style attribution for different roles?
1.3 Toolchains
1.31 What toolchains and pipelines are used to generate this work?
1.32 What upstream projects or sources contribute to it?
1.33 How are changes to these workflows recorded?
1.34 When do changes trigger a recompilation?
1.4 Dependencies
1.41 Is there an explicit process- or workflow-dependency tree?
1.42 When does a stale dependency trigger a recompilation? (see the sketch below)
1.43 Are there any push options for updates, or flagging of critical updates?
1.44 Are there social, institutional, or environmental dependencies?
cf: Repositories_of_CWL_Tools_and_Workflows
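One simple way to answer 1.42 in code is a make-style freshness check: a derived artifact is rebuilt whenever any of its declared inputs is newer than it. A minimal sketch with hypothetical file paths and a hand-written dependency map; this stands in for whatever workflow tooling is actually used.

    # Illustrative only: a make-style staleness check. Rebuild a derived artifact
    # whenever any declared input is newer than it. All paths are placeholders,
    # and inputs are assumed to exist on disk.
    from pathlib import Path

    DEPENDENCIES = {
        "derived/model.pkl": ["data/clean.csv", "workflows/train.py"],
        "data/clean.csv": ["data/raw.csv", "workflows/clean.py"],
    }

    def is_stale(target, inputs):
        t = Path(target)
        if not t.exists():
            return True
        return any(Path(i).stat().st_mtime > t.stat().st_mtime for i in inputs)

    for target, inputs in DEPENDENCIES.items():
        if is_stale(target, inputs):
            print(f"{target} is stale; trigger recomputation")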
2.1 Editors and viewers
2.11 What tools are commonly used to view and edit this, in whole or in part?
2.2 Utilities
2.21 What tools are used to generate code or datasets from other sources?
2.22 What tools are used to convert code or data from other formats?
2.3 Code libraries
2.31 What code libraries or components are commonly used to work with or extend this?
2.32 If any of the above require a software license, are there community resources for researchers who don’t otherwise have access?
How is data selected and cleaned? How are noise and error estimated and analysed?
What decisions about data cleaning and analysis were made and recorded in advance?
3.1 Filters
3.11 How was data chosen for measurement/inclusion?
3.12 How is it noted when this changes?
3.2 Data cleaning
3.21 What data cleaning or noise correction steps were used in compiling the data?
3.22 What other workflows were applied to the raw data?
3.23 How were these workflows registered before the raw data was gathered?
3.24 How are these workflows and pipelines named and versioned?
3.3 Parallels
3.31 What similar efforts or alternatives exist?
3.32 How does this data compare with previous studies of similar questions?
4.1 Estimation
4.11 How are sources of error, and sources of noise, estimated for each process or observation?
4.12 Are there benchmarks these are compared to?
4.13 What process is used for estimating and spot-checking error?
4.14 Are rates of false positives & false negatives captured separately?
4.15 For models: are ROC curves used, and parameters noted, where possible? (see the sketch below)
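A minimal sketch of recording the quantities in 4.14–4.15 for a scored binary classifier, assuming scikit-learn is available; the labels, scores, and the 0.5 operating threshold are placeholders.

    # Illustrative only: recording ROC data and FP/FN rates for a scored model,
    # using scikit-learn. The labels, scores, and threshold are placeholders.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

    y_pred = (y_score >= 0.5).astype(int)  # example operating point
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("false positive rate:", fp / (fp + tn))
    print("false negative rate:", fn / (fn + tp))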
4.2 Propagation
4.21 How are margins of error propagated through processes & combinations? (as sketched below)
4.22 How are resultant error bars conveyed in conclusions or charts?
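For independent, uncorrelated uncertainties, the standard approach is to add relative errors in quadrature. A toy sketch for a derived quantity q = x * y, with placeholder measurement values.

    # Illustrative only: propagating independent 1-sigma uncertainties in quadrature.
    # For q = x * y, relative uncertainties add in quadrature:
    #   (dq/q)^2 = (dx/x)^2 + (dy/y)^2
    import math

    x, dx = 10.0, 0.2  # placeholder measurement and uncertainty
    y, dy = 3.0, 0.1

    q = x * y
    dq = abs(q) * math.sqrt((dx / x) ** 2 + (dy / y) ** 2)
    print(f"q = {q:.2f} ± {dq:.2f}")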
5.1 Confirmation target
5.11 Was there a public target the research was intended to confirm or verify?
5.12 How was confirmation bias avoided in the design and analysis?
5.13 How were pitfalls such as the garden of forking paths avoided?
5.2 Change logs
5.21 Are logs kept of changes to protocols, processes, and data cleaning?
5.22 Are lab notebooks kept documenting the development of the research?
5.3 Registration
5.31 Which decisions and methods were pre-registered with an independent service before work began?
5.32 Which were registered as the work developed, before its final analysis?
How discoverable and accessible is this for different audiences? How can it be downloaded, reused, or remixed, and how can those uses be discovered?
6.1 Findability
6.11 What persistent identifiers are assigned to your data?
6.12 What identifiers do you use to ground references and links to other data?
6.13 What repositories or catalogs index the data and store metadata about it?
6.2 Accessibility
6.21 Is the metadata publicly accessible by looking up your data’s persistent ID? (e.g., via the sketch below)
6.22 What protocol should be used to access the full data?
6.23 If the data is not public, are there archival backups in case the initial method for accessing the data stops working?
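If the persistent ID is a DOI registered with an agency that supports content negotiation (as Crossref and DataCite do), machine-readable metadata can often be fetched directly from doi.org. A minimal sketch using the requests package; the DOI below is a placeholder, not a real identifier.

    # Illustrative only: fetching machine-readable metadata for a DOI via
    # content negotiation on doi.org. The DOI below is a placeholder.
    import requests

    doi = "10.xxxx/example"  # placeholder - substitute your dataset's DOI
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json().get("title"))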
7.1 Licensing and Embargoes
7.11 Is the data public?
7.12 If not yet public, is there an embargo period for public sharing?
7.13 Once public, under what license is it available? Is this a standard license, presented in a machine-readable way?
7.14 Is the license compatible with the licenses of major contributing datasets, or other popular knowledge bases (such as Wikidata)?
7.2 Dumps
7.21 How are dumps provided: name, format, versioning?
7.22 Is there a feed of updates to dumps?
7.23 For larger datasets:
7.231 Are there options for iterative downloads, or for shipping by mail?
7.232 Does retrieval require reading from tape drives or backup systems?
7.233 Are there standard subsets or shards used for tests or benchmarks? (see the sketch below)
(e.g., randomly selected, or chosen by date/topic/type)
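A toy sketch of one way to make such a subset reproducible: sample rows from a dump with a fixed random seed, so anyone can regenerate the same benchmark shard. File names and the subset size are placeholders, and the dump is read into memory for simplicity.

    # Illustrative only: carving a small, reproducible benchmark subset out of a
    # larger dump using a fixed random seed. File names are placeholders.
    import random

    random.seed(20200930)  # fixed seed => the same subset every time
    with open("dump_full.csv") as f:  # hypothetical dump file
        header, *rows = f.read().splitlines()

    subset = random.sample(rows, k=min(1000, len(rows)))
    with open("dump_benchmark_subset.csv", "w") as out:
        out.write("\n".join([header, *subset]) + "\n")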
7.3 Logging use
7.31 What downstreams are using this work?
7.32 Is a visible log of this use kept, and at what level of detail?
7.33 Is this usage visible to other reusers, via pingbacks or other means?
7.4 Remixing
7.41 Is this used, or planned for use, in any metastudies?
7.42 What processing (schema mappings, fuzzing or anonymization, or other steps) is used by metastudies that include it?
7.43 Are these processes encoded in a way that other metastudies could use?
How can this be replicated or forked? How can forks and replications be discovered? How is this maintained, updated, and archived for posterity?
8.1 Replicability
8.11 What is the whole tale of your work: what environment and setup are needed to replicate it?
8.12 Is this articulated in a [Whole Tale] file?
8.13 Does this file include workflow + usage notes?
8.2 Replicatedness
8.21 Has your process been replicated in practice?
8.22 By how many independent parties has it been replicated?
8.3 Forks
8.31 Are there recommended ways to fork code or data?
8.32 Is there a way to connect similar or derivative efforts back to this work?
9.1 Maintainers
9.11 Who manages requests for more information, or for downloads?
9.12 What is the plan for maintaining this data or service over time? Is it tied to a host institution or group?
9.13 How can others contribute to maintenance?
9.2 Updates
9.21 Are updates planned? On what schedule?
9.22 Are forks and reusers encouraged to notify the source of changes, and potentially pass changes back upstream?
9.23 How can people submit comments, requests, or bug reports, or suggest revisions?
9.3 Mirrors and archives
9.31 Describe your long-term plan for this: is it intended to be around for years, decades? To serve as a reference for others?
9.32 Is this stored in multiple catalogs and repositories?
9.33 Is there a copy explicitly serving as a long-term archive (like CLOCKSS)?
When writing up a data description for your work, consider the above, and make sure you address: where readers can find the data + models, how they can access them, how your work interacts with them, and how it can be reproduced.