‘Do you have a formal preservation plan?’
Yes, that would do it.
Another Q for this section: Is your dataset itself already the result of remixing? If so, do you cite the contributing sources?
If so, is the embargo period (or embargo expiration date) publicly disclosed?
For “differential privacy”, you could link to the WP article, https://en.wikipedia.org/wiki/Differential_privacy
Here or hereabouts I’d ask this Q: If the data is not public, and you offer a method of access for researchers who meet certain conditions, do those conditions permit maximum access consistent with the background reasons for privacy (e.g. medical privacy)? Do you use differential privacy for users in different situations?
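For anyone unfamiliar with the term, here is a minimal sketch of the basic idea behind differential privacy, using the Laplace mechanism on a simple count query. The function names, the sample records, and the choice of epsilon are all illustrative assumptions, not taken from any particular library or from the checklist itself.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the items matching `predicate`.

    A count query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: ages of study participants.
ages = [34, 51, 29, 62, 45, 58]
noisy = private_count(ages, lambda a: a >= 50, epsilon=1.0)
print(f"Noisy count of participants aged 50+: {noisy:.2f}")
```

Offering different epsilon values to different classes of researchers would be one way to read "differential privacy for users in different situations", though that reading is my assumption.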
Here or hereabouts I’d ask this Q: If you store your dataset in a repository, and periodically enlarge or improve your dataset, how regularly do you update the version in the repository?
I interpret this to mean: Use these questions to improve your data and data-sharing. They do NOT lay down necessary conditions. Moreover, improving your data and data-sharing is a nearly endless approximation process, and these questions can help nudge you along the way.
If I’m interpreting you correctly, I like this model and use it myself sometimes. (Glad to say more.)
However, if you want to say that some of these questions DO lay down necessary conditions (for example, metadata MUST be machine-readable, Q 0.32), you should flag them to indicate their special status.
I do mean the former, but will think about which should be highlighted with strong recommendations, or which existing recommendations to reference.
Ideally this would link to separate normative guidelines, describing their relevance. The brief summary-rec could look like “aim to meet the [AIR-6] standard, which allows interplay w/ the most prevalent research networks. If possible, meet [FLAIR-12], ensuring integration with global archives.”
I also think there is an important point to make in Reuse and Licensing, specifically about clearly defining the purpose of the data sharing in agreements: specific purposes with specific people. The point is made here: https://youtu.be/JbsAnKTYKis?t=2046 and the Cambridge Analytica fiasco comes to mind. What a data sharing partner is allowed to do with the data…and what they are not.
In the Enterprise and Government domains, a good programme is Data Pitch https://datapitch.eu/ which originally dealt with business-to-business sharing, with practices and questions.
Data sharing agreements: who owns the IP rights, what are the contracts around sharing agreements, for how long is data shared, and who has access for that period? GDPR mentions “a data controller” that holds some of this responsibility of oversight, so that might merit a mention somewhere.
It might make sense to cover Licensing under the umbrella of Access? A license is meant to formalize the legal assessment of what is accessible, to whom, for how long, etc. https://www.youtube.com/watch?v=JbsAnKTYKis
Licensing etc. should distinguish between the actual data and the metadata. Sometimes these are treated differently: one shared, the other not; one licensed CC0, the other not having any license.
The split between actual data and metadata also needs to be addressed with a point in Discovery and Access as well as Provenance. In fact, users constantly need to be made aware of the split. ODI are pretty good about always talking about this split.
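To make the split concrete, here is a rough sketch of a dataset record that states the metadata license and the data license separately. The field names and values are hypothetical, invented for illustration, and do not follow any specific metadata standard.

```python
import json

# Hypothetical dataset record: metadata is openly licensed, the data is not.
record = {
    "title": "Example sensor readings",
    "metadata_license": "CC0-1.0",   # the description of the dataset is open
    "data_license": None,            # no license stated for the data itself
    "data_access": "restricted",
}

def describe_licensing(rec: dict) -> str:
    """Summarise the two licenses so the split is always visible to a user."""
    meta = rec.get("metadata_license") or "no license stated"
    data = rec.get("data_license") or "no license stated"
    return f"metadata: {meta}; data: {data}"

print(json.dumps(record, indent=2))
print(describe_licensing(record))
```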
Somewhere I’d like to reference a framework for layers of access [existence, type, structure, prov, other metadata, structured summary, data, history, generator] but don’t know of a good one to point to
Great point.
Hmm, Scaling should be a point made somewhere within Lifespan, since it affects dumps, backups, replication, maintenance, and updates. Examples: dealing with more and more deployed sensor data in agriculture, or a growing number of test kits and results in Covid-19 times.
I wonder if there should be another top-level section: 9 Tools. Take as an example the way Common Workflow Language organizes info about tools related to it. You might have questions about: 9.1 Editors and viewers. 9.2 Utilities. 9.3 Code generators and converters. 9.4 Code libraries.
See: https://www.commonwl.org/#Repositories_of_CWL_Tools_and_Workflows
That’s a fine idea. I’m not sure there’s enough to merit its own section… testing it out in the latest revision, along with a bit more semantic clustering of the top sections.
Thanks, added.
You might ask whether the data set is very large and, if so, ask for a description of concerns related to: very large file sizes; retrieval and access that require reading from tape drives or other large data backup systems; or whether the bandwidth costs or total time to download make it less expensive or faster to ship the data via postal mail on disk or tape. … there is probably a simpler way to word such a question.
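As a rough illustration of the download-versus-shipping trade-off, here is a back-of-the-envelope calculation. The data set size, the sustained bandwidth, and the shipping time are made-up assumptions, and costs are ignored entirely.

```python
def download_hours(size_tb: float, bandwidth_mbps: float) -> float:
    """Hours needed to transfer `size_tb` terabytes (decimal, 10**12 bytes)
    at a sustained rate of `bandwidth_mbps` megabits per second."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6)
    return seconds / 3600

def faster_to_ship(size_tb: float, bandwidth_mbps: float, shipping_hours: float) -> bool:
    """True if shipping physical media would arrive before the download finishes."""
    return download_hours(size_tb, bandwidth_mbps) > shipping_hours

# Assumed figures: a 100 TB data set, a 100 Mbit/s link, 72 hours for a courier.
hours = download_hours(100, 100)
print(f"Download: {hours:.0f} hours (about {hours / 24:.0f} days)")
print("Faster to ship:", faster_to_ship(100, 100, shipping_hours=72))
```

With those assumed figures the download takes roughly three months, so the question could perhaps be worded simply as "Would it be faster or cheaper to ship your data on physical media than to download it?"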