
Data and model sharing checklist

Published on Sep 17, 2019

[URFC -4] Data & Model Sharing: Questions to capture provenance & context. v.6 2020-09-30

Detailed provenance and context are essential for collaborating around data, especially large datasets. They help collaborators decide how to interpret derived models and composite datasets, and how to qualify observations, claims, and conclusions. That in turn informs reviews and replications, as well as future metastudies or refactoring in new research contexts.

Here is a taxonomy of questions that I’ve found helpful to answer early and often, as data is produced, shared, updated and analysed. For composite datasets that draw on multiple sources, individual characteristics may vary across the set - in which case describing how one makes sense of the variance is worth extra elaboration.

Check that answers to the relevant questions are easy to find in your documentation and, where appropriate, in structured metadata.

See also: FAIRDOM’s data management checklist & OKFN’s frictionless data tools

Questions for data collaboratives

Structure · Methods · Access · Lifespan

Structure

What schemas, formats, and metadata standards do your data and models use?
What are their sources, what processes were used to produce them, what tools are used to generate, edit, and work with them?

0 Data structure

0.1 Schemas
0.11 What ontologies, schemas and shapes are used?
0.12 Are these defined in a spec? How do they relate to common standards?
0.13 Are the shapes + their specs versioned?

0.2 Formats
0.21 What file and data formats are used?
0.22 What database structures and formats are used?
0.23 Can these be easily converted to common shared formats?
If not, what alternatives do you recommend for export and reuse?

0.3 Metadata
0.31 What metadata do you publish with your data?
0.32 Is it in a standard machine-readable format?
0.33 Is it clear how, why and by whom the data were created and processed?
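
As one illustration of 0.32 and 0.33, here is a minimal sketch of a machine-readable metadata record with a completeness check. The key names, the `REQUIRED` list, and all values are hypothetical placeholders, not taken from any particular metadata standard; a real project would map them onto whichever standard it adopts.

```python
import json

# Hypothetical metadata record: every key and value is an illustrative placeholder.
metadata = {
    "name": "example-dataset",
    "version": "1.2.0",
    "created": "2020-09-30",
    "license": "CC0-1.0",
    "creators": [{"name": "A. Researcher", "role": "data curation"}],
    "sources": [{"title": "Upstream survey", "version": "2019-09"}],
    "processing": ["deduplicated records", "normalized dates to ISO 8601"],
}

# Assumed minimum needed to answer "how, why and by whom" (question 0.33).
REQUIRED = ("name", "version", "license", "creators", "sources", "processing")


def missing_fields(record):
    """Return the required keys that are absent or empty in the record."""
    return [key for key in REQUIRED if not record.get(key)]


# Serialize with sorted keys so the published file is stable across runs.
serialized = json.dumps(metadata, indent=2, sort_keys=True)

if __name__ == "__main__":
    print(serialized)
    print("missing:", missing_fields(metadata))
```

A check like `missing_fields` could run whenever the data is republished, so that gaps in the metadata are noticed before release.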

0.4 Integrity
0.41 What methods are used to ensure the integrity of files? (checksums, content hashing?)
0.42 What methods are used to ensure the sequence integrity of aggregated files? (e.g., hash trees, ledgers)
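
The two integrity questions above can be sketched as follows, assuming SHA-256 as the content hash and a simple binary hash tree over the per-file checksums. The file contents are placeholders, and duplicating the last node on odd-sized levels is one common convention rather than the only one.

```python
import hashlib


def sha256_hex(data):
    """Checksum for a single file's bytes (question 0.41)."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes):
    """Combine ordered per-file checksums into one root hash (question 0.42).

    Changing, dropping, or reordering any file changes the root.
    """
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Placeholder contents standing in for the files of an aggregated dump.
files = [b"part-0 contents", b"part-1 contents", b"part-2 contents"]
checksums = [sha256_hex(f) for f in files]
root = merkle_root(checksums)
reordered_root = merkle_root(list(reversed(checksums)))

if __name__ == "__main__":
    print("root:", root)
    print("order-sensitive:", root != reordered_root)
```

Publishing the root alongside the per-file checksums lets a recipient verify both each file and the sequence in which the files were aggregated.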

1 Provenance

1.1 Sources
1.11 What sources are used; drawn from what set?
1.12 Are the sources versioned?
1.13 How often is each source pulled / pushed / otherwise updated?

1.2 Credit
1.21 Who was involved in producing and sharing data?
1.22 Is there CRediT-style attribution for different roles?

1.3 Toolchains
1.31 What toolchains and pipelines are used to generate this?
1.32 What upstreams contribute to this work?
1.33 How are changes to these workflows recorded?
1.34 When do changes trigger a recompilation?

1.4 Dependencies
1.41 Is there an explicit process- or workflow-dependency tree?
1.42 When does a stale dependency trigger a recompilation?
1.43 Are there any push options for updates, or flagging of critical updates?
1.44 Are there social, institutional, or environmental dependencies?

2 Tools

cf. Common Workflow Language: Repositories of CWL Tools and Workflows

2.1 Editors and viewers
2.11 What tools are commonly used to view and edit this, in whole or part?

2.2 Utilities
2.21 What tools are used to generate code or datasets from other sources?
2.22 What tools are used to convert code or data from other formats?

2.3 Code libraries
2.31 What code libraries or components are commonly used to work with or extend this?
2.32 If any of the above require a software license, are there community resources for researchers who don’t otherwise have access?

Methods

How is data selected and cleaned? How is noise and error estimated and analysed?
What decisions in data cleaning and analysis were made and recorded in advance?

3 Data selection

3.1 Filters
3.11 How was data chosen for measurement/inclusion?
3.12 How is it noted when this changes?

3.2 Data cleaning
3.21 What data cleaning or noise correction methods were used in compiling the data?
3.22 What other workflows were applied to the raw data?
3.23 How were these workflows registered before the raw data was gathered?
3.24 How are these workflows and pipelines named and versioned?

3.3 Parallels
3.31 What similar efforts or alternatives exist?
3.32 How does this data compare with previous studies of similar questions?

4 Error

4.1 Estimation
4.11 How are sources of error, and sources of noise, estimated for each process or observation?
4.12 Are there benchmarks these are compared to?
4.13 What process is used for estimating and spot-checking error?
4.14 Are rates of false positives & false negatives captured separately?
4.15 For models: are ROC curves used and parameters noted where possible?
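
For 4.14 and 4.15, a minimal sketch of reporting false positive and false negative rates separately, and of tracing ROC points by sweeping a decision threshold. The labels and scores are made-up illustrative values, and treating `score >= threshold` as a positive prediction is an assumption.

```python
def confusion_counts(labels, scores, threshold):
    """Count (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def error_rates(labels, scores, threshold):
    """Return (false positive rate, false negative rate) as separate numbers."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


def roc_points(labels, scores):
    """(FPR, TPR) pairs, one per distinct score used as the threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        fpr, fnr = error_rates(labels, scores, t)
        points.append((fpr, 1.0 - fnr))
    return points


# Illustrative values only: 1 = true positive class, 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

if __name__ == "__main__":
    print(error_rates(labels, scores, threshold=0.65))
    print(roc_points(labels, scores))
```

Recording the threshold next to both rates makes it possible for a later reader to see which trade-off was chosen.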

4.2 Propagation
4.21 How are margins of error propagated through processes & combinations?
4.22 How are resultant error bars conveyed in conclusions or charts?
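
One common answer to 4.21 is first-order propagation in quadrature. The sketch below assumes the input uncertainties are independent and small relative to the values, and that values being multiplied are nonzero; correlated errors would need covariance terms that are not shown here.

```python
import math


def add_with_error(a, sigma_a, b, sigma_b):
    """Sum of two independent measurements: absolute errors add in quadrature."""
    return a + b, math.hypot(sigma_a, sigma_b)


def multiply_with_error(a, sigma_a, b, sigma_b):
    """Product of two independent, nonzero measurements:
    relative errors add in quadrature (first-order approximation)."""
    value = a * b
    relative = math.hypot(sigma_a / a, sigma_b / b)
    return value, abs(value) * relative


if __name__ == "__main__":
    # Illustrative numbers only.
    total, total_err = add_with_error(10.0, 0.3, 20.0, 0.4)
    product, product_err = multiply_with_error(10.0, 0.3, 20.0, 0.4)
    print(f"sum     = {total} +/- {total_err:.3f}")
    print(f"product = {product} +/- {product_err:.3f}")
```

Stating which propagation rule was used, and which independence assumptions it relies on, is itself part of the provenance.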

5 Registration

5.1 Confirmation target
5.11 Was there a public target the research was intended to confirm or verify?
5.12 How was confirmation bias avoided in the design and analysis?
5.13 How were fallacies like the garden of forking paths avoided?

5.2 Change logs
5.21 Are there logs kept of changes to protocols, processes, and data cleaning?
5.22 Are there lab notebooks kept of the development of the research?

5.3 Registration
5.31 Which decisions and methods were pre-registered with an independent service before work began?
5.32 Which were registered as the work developed, before its final analysis?

Access

How discoverable and accessible is this for different audiences? How can it be downloaded, reused, or remixed, and how can those uses be discovered?

6 Discovery and Access

6.1 Findability
6.11 What persistent identifiers are assigned to your data?
6.12 What identifiers do you use to ground references and links to other data?
6.13 What repositories or catalogs index the data and store metadata about it?

6.2 Accessibility
6.21 Is the metadata publicly accessible by looking up your data’s persistent ID?
6.22 What protocol should be used to access the full data?
6.23 If the data is not public, are there archival backups in case the initial method for accessing the data stops working?

7 Reuse

7.1 Licensing and Embargoes
7.11 Is the data public?
7.12 If not yet public, is there an embargo period for public sharing?
7.13 Once public, under what license is it available? Is this a standard license, presented in a machine readable way?
7.14 Is the license compatible with the licenses of major contributing datasets, or other popular knowledge bases (such as Wikidata)?

7.2 Dumps
7.21 How are dumps provided: name, format, versioning?
7.22 Is there a feed of updates to dumps?
7.23 For larger datasets:
7.231 Are there options for iterative downloads, or for shipping by mail?
7.232 Does retrieval require reading from tape drives or backup systems?
7.233 Are there standard subsets or shards used for tests or benchmarks?
(e.g., randomly selected, or chosen by date/topic/type)
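
For 7.233, a randomly selected benchmark subset is only reusable if others can regenerate it. A minimal sketch, assuming records have sortable identifiers and that the seed and sampling fraction are published with the dataset; the function name and parameters are hypothetical.

```python
import random


def sample_shard(record_ids, fraction, seed):
    """Pick a reproducible random subset of record identifiers.

    Sorting first makes the result independent of input order, and a
    dedicated random.Random(seed) avoids touching global random state.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(record_ids)
    k = max(1, round(len(ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


if __name__ == "__main__":
    # Illustrative identifiers only.
    ids = [f"rec-{i:04d}" for i in range(100)]
    shard = sample_shard(ids, fraction=0.1, seed=2020)
    print(len(shard), shard[:3])
```

The exact members of the subset can depend on the Python version's sampling algorithm, so publishing the resulting identifier list alongside the seed is a safer record.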

7.3 Logging use
7.31 What downstreams are using this work?
7.32 Is a log of this use kept, and at what level of detail?
7.33 Is this usage visible to other reusers, via pingbacks or other?

7.4 Remixing
7.41 Is this used, or planned for use, in any metastudies?
7.42 What processing (schema mappings, fuzzings or anonymization, other) is used by metastudies that include it?
7.43 Are these processes encoded in a way that other metastudies could use?

Lifespan

How can this be replicated or forked? How can forks and replications be discovered? How is this maintained, updated, and archived for posterity?

8 Replication

8.1 Replicability
8.11 What is the whole tale of your work -- what environment and setup are needed to replicate it?
8.12 Is this articulated in a [whole tale] file?
8.13 Does this file include workflow + usage notes?

8.2 Replicatedness
8.21 Has your process been replicated in practice?
8.22 By how many independent parties has it been replicated?

8.3 Forks
8.31 Are there recommended ways to fork code or data?
8.32 Is there a way to connect similar or derivative efforts back to this work?

9 Maintenance

9.1 Maintainers
9.11 Who manages requests for more information, or for downloads?
9.12 What is the plan for maintaining this data or service over time? Is it tied to a host institution or group?
9.13 How can others contribute to maintenance?

9.2 Updates
9.21 Are updates planned? On what schedule?
9.22 Are forks and reusers encouraged to notify the source of changes, and potentially pass changes back upstream?
9.23 How can people submit comments, requests, bugs, or suggest revisions?

9.3 Mirrors and archives
9.31 Describe your long term plan for this: is it intended to be around for years, decades? To serve as a reference for others?
9.32 Is this stored in multiple catalogs and repositories?
9.33 Is there a copy explicitly serving as a long-term archive (like CLOCKSS)?

Summary

When writing up a data description for your work, consider the above, and make sure you address: where you can find data + models, how you can access them, how your work interacts with them, how one can reproduce it.

Comments
Samuel Klein: ‘Do you have a formal preservation plan?‘
Peter Suber: Yes, that would do it.
Peter Suber: Another Q for this section: Is your dataset itself already the result of remixing? If so, do you cite the contributing sources?
Peter Suber: If so, is the embargo period (or embargo expiration date) publicly disclosed?
Peter Suber: For “differential privacy”, you could link to the WP article, https://en.wikipedia.org/wiki/Differential_privacy
Peter Suber: Here or hereabouts I’d ask this Q: If the data is not public, and you offer a method of access for researchers who meet certain conditions, do those conditions permit maximum access consistent with the background reasons for privacy (e.g. medical privacy)? Do you use differential privacy for users in different situations?
Peter Suber: Here or hereabouts I’d ask this Q: If you store your dataset in a repository, and periodically enlarge or improve your dataset, how regularly do you update the version in the repository?
Peter Suber: I interpret this to mean: Use these questions to improve your data and data-sharing. They do NOT lay down necessary conditions. Moreover, improving your data and data-sharing is a nearly endless approximation process, and these questions can help nudge you along the way. If I’m interpreting you correctly, I like this model and use it myself sometimes. (Glad to say more.) However, if you want to say that some of these questions DO lay down necessary conditions (for example, metadata MUST be machine-readable, Q 0.32), you should flag them to indicate their special status.
Samuel Klein: I do mean the former, but will think about which should be highlighted with strong recommendations, or which existing recommendations to reference. Ideally this would link to separate normative guidelines, describing their relevance. The brief summary-rec could look like “aim to meet the [AIR-6] standard, which allows interplay w/ the most prevalent research networks. If possible, meet [FLAIR-12], ensuring integration with global archives.”
Thad Guidry: I also think there is an important point to make in Reuse and Licensing, specifically about clearly defining the purpose of the data sharing in agreements. Specific purposes with specific people. Point made here: https://youtu.be/JbsAnKTYKis?t=2046 and the Cambridge Analytica fiasco comes to my mind. What a data sharing partner is allowed to do with the data…and what they are not.
Thad Guidry: In the Enterprise and Government domains, a good programme is Data Pitch https://datapitch.eu/ which originally dealt with Business to Business sharing, with practices and questions.
Thad Guidry: Data sharing agreements: who owns the IP rights, what are the contracts around sharing agreements, how long is data shared, and who has access for that period? GDPR mentions “a data controller” that holds some of this responsibility of oversight, so might garner a mention somewhere. Maybe it might make sense to cover Licensing under the umbrella of Access? A license is meant to formalize the legal assessment of what is accessible, to whom, for how long, etc. https://www.youtube.com/watch?v=JbsAnKTYKis
Thad Guidry: There should be a distinction with Licensing etc. between the actual data and the metadata. Sometimes these are considered different: one shared, one not; one licensed CC0, the other not having any license. The split of actual data and metadata also needs to be addressed with a point in Discovery and Access as well as Provenance. In fact, the user constantly needs to be made aware of the split. ODI are pretty good about always talking about this split.
Samuel Klein: Somewhere I’d like to reference a framework for layers of access [existence, type, structure, prov, other metadata, structured summary, data, history, generator] but don’t know of a good one to point to
Samuel Klein: Great point.
Thad Guidry: Hmm, Scaling should be a point made somewhere within Lifespan, which has impacts on dumps, backups, replication, and maintenance, updates. Examples: Dealing with more and more deployed sensor data in Agriculture, or increasing testkits with increasing results in Covid-19 times.
Joshua Gay: I wonder if there should be another top level section: 9 Tools. Taking as an example the way Common Workflow Language organizes info about tools related to it. You might have questions about: 9.1 Editors and viewers. 9.2 Utilities. 9.3 Code generators and converters. 9.4 Code libraries. See: https://www.commonwl.org/#Repositories_of_CWL_Tools_and_Workflows
Samuel Klein: That’s a fine idea. I’m not sure there’s enough to merit its own section… testing it out in the latest revision, along with a bit more semantic clustering of the top sections.
Samuel Klein: Thanks, added.
Joshua Gay: You might ask if the data set is very large and, if so, to describe concerns related to: very large file size; retrieval and access requiring reading from tape drives or other large data backup systems; or whether the bandwidth costs or total time to download it make it less expensive or faster to ship the data via postal mail on disk or tape. … there is probably a simpler way to word such a question.