[URFC -4] Data & Model Sharing: Questions to capture provenance & context. v.3 2020-09-10
When publishing research, the provenance and context of one’s data and models are essential information. They are needed to fully describe and qualify observations, claims, and conclusions, and to inform future replication and metastudies. They are also needed to allow reuse or refactoring in new contexts.
Here is a taxonomy of questions to answer early and often, to capture provenance and context as data is produced, shared, updated and analysed. For composite datasets drawn from multiple sources, these characteristics may vary across the set. Maintaining a current set of answers to such questions alongside a dataset can help
0.11 What ontologies, schemas and shapes are used?
0.12 Are these defined in an overall spec?
0.13 Are the shapes + their specs versioned?
0.21 What file and data formats are used?
0.22 What database structures and formats are used?
0.23 Can these be easily converted to common shared formats?
1.11 What sources are used; drawn from what set?
1.12 Are the sources versioned?
1.13 How often is each source pulled / pushed / otherwise updated?
1.21 Who was involved in producing and sharing data?
1.22 Is there CRediT-style attribution for different roles?
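One way to keep answers current alongside a dataset is a small machine-readable sidecar keyed by question number. The following is a minimal sketch, not a prescribed format: the function name `unanswered`, the sidecar layout, and the sample answers are all hypothetical placeholders chosen for illustration.

```python
import json

def unanswered(questions, answers):
    """Return the ids of questions that have no non-empty answer yet."""
    return [qid for qid in questions
            if not str(answers.get(qid, "")).strip()]

# Question text is taken from the list above.
questions = {
    "1.11": "What sources are used; drawn from what set?",
    "1.12": "Are the sources versioned?",
    "1.13": "How often is each source pulled / pushed / otherwise updated?",
}

# Placeholder answers, invented purely for illustration.
answers = {
    "1.11": "Two hypothetical upstream registries",
    "1.12": "Yes, by release tag",
}

# A sidecar that could be stored next to the dataset (layout is an assumption).
sidecar = json.dumps(
    {"answers": answers, "unanswered": unanswered(questions, answers)},
    indent=2,
)
```

Listing the unanswered ids makes gaps visible each time the dataset is updated.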
2.11 What toolchains and pipelines are involved?
2.12 What upstreams contribute to this work?
2.13 How are changes to these workflows recorded?
2.14 When do changes trigger a recompilation?
2.21 Is there an explicit process- or workflow-dependency tree?
2.22 When does a stale dependency trigger a recompilation?
2.23 Are there any push options for updates, or flagging of critical updates?
2.24 Are there social, institutional, or environmental dependencies?
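Questions 2.21 and 2.22 can be made concrete with an explicit dependency map. The sketch below is one possible approach, assuming each step lists the steps it depends on and that a change to any upstream step makes everything downstream stale. The step names and the function `stale_steps` are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

def stale_steps(deps, changed):
    """Return the steps that need recompiling, in a valid rebuild order.

    deps maps each step to the set of steps it depends on.
    changed is the set of steps whose inputs or code have changed.
    """
    stale = set(changed)
    order = []
    # static_order() yields every step after all of its dependencies.
    for step in TopologicalSorter(deps).static_order():
        if step in stale or stale & set(deps.get(step, ())):
            stale.add(step)
            order.append(step)
    return order

# Hypothetical workflow: raw -> clean -> model -> report, with model also
# depending on a config file.
deps = {
    "clean": {"raw"},
    "model": {"clean", "config"},
    "report": {"model"},
}

to_rebuild = stale_steps(deps, {"config"})
```

Here a change to `config` marks `model` and `report` for recompilation but leaves `clean` untouched.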
3.1 Licensing and Embargoes
3.11 Is the data public?
3.12 If not yet public, is there an embargo period for public sharing?
3.13 Once public, under what license is it available?
3.14 Is the license compatible with the licenses of major contributing datasets, or other popular knowledge bases (such as Wikidata)?
3.21 How are dumps provided: name, format, versioning?
3.22 Is there a feed of updates to dumps?
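For question 3.21, a dump can be described by a short manifest recording its name, format, version, and a checksum. This is a sketch under stated assumptions: the field names and the function `dump_manifest` are hypothetical, and the choice of SHA-256 is an illustration rather than a requirement.

```python
import hashlib

def dump_manifest(name, version, fmt, payload):
    """Describe one dump: name, version, format, size, and content hash."""
    return {
        "name": name,
        "version": version,
        "format": fmt,
        "bytes": len(payload),
        # The checksum lets reusers verify they hold the same dump.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

# Placeholder dump contents, invented for illustration.
manifest = dump_manifest("example-dump", "v1", "csv", b"a,b\n1,2\n")
```

Publishing such a manifest with each dump gives a feed of updates something stable to point at.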
3.3 Logging use
3.31 What downstreams are using this work?
3.32 Is a log of this use kept, and at what level of detail?
3.33 Is this usage visible to other reusers, via pingbacks or other means?
3.41 Is this used, or planned for use, in any metastudies?
3.42 What processing (schema mappings, fuzzing or anonymization, other) is applied for each metastudy that includes this work?
3.43 Is the mapping for use in any metastudy encoded in a named package or configuration file that others could use?
4.11 How was data chosen for measurement/inclusion?
4.12 How is it noted when this changes?
4.2 Data cleaning
4.21 What data cleaning or noise correction was used in compiling the data?
4.22 What other workflows were applied to the raw data?
4.23 How were these workflows registered before the raw data was gathered?
4.24 How are these workflows and pipelines named and versioned?
4.31 What similar efforts or alternatives exist?
5.11 What is the whole tale of your work -- what environment and setup are needed to replicate it?
5.12 Is this articulated in a [whole tale] file?
5.13 Does this file include workflow + usage notes?
5.21 Has your process been replicated in practice?
5.22 By how many independent parties has it been replicated?
6.11 Which of the above were pre-registered with a registration service?
6.12 Which were registered or announced during the research, before its final analysis and conclusions?
6.2 Change logs
6.21 Are there logs kept of changes to protocols, processes, and data cleaning?
6.22 Are there lab notebooks kept of the development of the research?
6.3 Confirmation target
6.31 Was there a public target the research was intended to confirm or verify?
6.32 How was confirmation bias avoided in the design and analysis?
6.33 How was the forking-paths fallacy avoided?
7.11 How are errors and noise estimated for each process or observation?
7.21 How are errors propagated through processes and combinations?
7.22 How are resultant errors in conclusions described or characterized?
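As one illustration of question 7.21, the sketch below combines standard errors in quadrature. It assumes the errors are independent and small enough for first-order propagation; correlated errors need covariance terms that are not shown here. The function names are hypothetical.

```python
import math

def propagate_sum(sigmas):
    """Standard error of a sum or difference of independent quantities."""
    return math.sqrt(sum(s * s for s in sigmas))

def propagate_product(values, sigmas):
    """First-order standard error of a product of independent quantities.

    Relative errors add in quadrature; all values must be non-zero.
    """
    product = math.prod(values)
    relative = math.sqrt(sum((s / v) ** 2 for v, s in zip(values, sigmas)))
    return abs(product) * relative

sum_error = propagate_sum([3.0, 4.0])
product_error = propagate_product([10.0, 20.0], [1.0, 2.0])
```

Recording which rule was applied at each step, and under which independence assumptions, is part of the answer to question 7.22.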