Thanks @trobertson for your detailed reply.
First, I’d like to emphasize that I much appreciate that GBIF is looking into ways to distribute their aggregate data products beyond the GBIF “cloud” (e.g., servers that run GBIF processes and store/serve data).
I notice that roughly four topics came up in this forum discussion:
- getting data closer to where folks do their analysis
- ensuring that data publishers are attributed
- ensuring that provenance of data is clear
- ensuring that the integrity of the data can be verified
With this in mind, I’d like to respond to some of your comments.
Yes, I am able to download a zip file by clicking on the “download” button of the occurrence download page (DOI Download). However, I have no way of telling that the zip file I receive today is the same zip file that my future self will download 5, 10, or 20 years from now. In addition, I’d say that the DOI and the zip file will be around for as long as GBIF has the will, capability and funds to keep them available.
Realizing that the internet is a wild and dynamic place and that funding comes and goes, I would want to figure out a way to keep many copies around in different places without losing the ability to cite, find, retrieve, and verify the integrity of the datasets.
A first step towards more reliably referencing data would be to provide checksums (or content hashes) for the provided data products and to include these in the data citations along with the DOIs. Providing this information is standard practice for the distribution of digital content (e.g., Zenodo provides MD5 hashes for each file they host).
I’d say that when distributing content across different platforms or “clouds”, including the checksum/content hash would help to ensure that no data gets corrupted in transmission.
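To make this concrete, here is a minimal sketch of how anyone holding a copy of a download could check it against a published checksum. The function names and the choice of SHA-256 as default are my own assumptions for illustration; this is not something GBIF currently provides.

```python
import hashlib
import hmac


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks so that
    large archives do not need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare a local copy against a checksum taken from a citation.
    `published_hex` is a hypothetical value that would accompany the DOI."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(file_digest(path, algorithm), expected)
```

With a checksum included in the citation, a call such as `matches_published("download.zip", checksum_from_citation)` would tell me whether a copy obtained from any mirror is byte-for-byte the file that was cited (the file name and variable here are placeholders).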
“Raw” records are neatly packed in zip files or tarballs and made available through institutional URLs. As you noted, you can version these files using checksums or content hashes. With this, you can version any “raw” record and further enhance the provenance of the GBIF annotated records by including a reliable reference (= content hash) to the version of the provided source archive from which the “raw” material came.
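As a rough sketch of what versioning by content hash could look like: each time a source archive is retrieved from its institutional URL, its hash is computed and recorded only if it differs from the most recent one. The `sha256:` prefix and the function names are assumptions of mine for illustration, not an existing GBIF or publisher convention.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the archive bytes alone, independent of
    where the archive was downloaded from."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def record_version(history: list, data: bytes) -> bool:
    """Append the content id to `history` if the archive changed since the
    last retrieval. Returns True when a new version was recorded."""
    cid = content_id(data)
    if history and history[-1] == cid:
        return False  # same bytes as last time, no new version
    history.append(cid)
    return True
```

An annotated record could then point at the entry in `history` that identifies the exact source archive it was derived from, regardless of which server holds a copy.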
As our study has also kept track of all the dataset versions registered in the GBIF network since 2018, I’d be very interested to compare our collection of immutable/versioned source archives with those that you have.
Thank you for taking the time to respond and for considering my feedback.