
FAQ

GigaDB is the home for all data, files, tools and software associated with GigaScience manuscripts. GigaDB curators ensure the information is complete and appropriately formatted before cataloguing and publishing. Submission of data to GigaDB complements, but does not replace, community-approved public repositories: supporting data and source code should still be made publicly available in a suitable public repository. GigaDB can link any and all publicly deposited data together with additional files and tools that do not have a natural home in any other public repository.

At present, only GigaScience.

GigaScience is committed to enabling reproducible science. To do this, readers need to be able to easily find and get hold of all the underlying data, methods, workflows, software and anything else that was used in the research. In the past, authors of research articles have made (justified) claims that there was no way of making all their data available; GigaDB now fills that gap. GigaDB complements but does not replace community-approved public repositories, and can link any and all publicly deposited data together with additional files and tools that do not have a natural home in any other public repository. Any and all data and files required for reproducibility of GigaScience manuscripts should be either hosted in, or linked to from, GigaDB.

Anything related to a GigaScience article that does NOT already have a relevant public repository (e.g. sequence data should still be deposited in the INSDC archives and/or the SRA).

As long as the data are fully consented and legally and ethically approved for public release, we encourage complete disclosure, including, where possible, a blank copy of the consent form that was signed.

There are many specific formats depending on the data types, but the rule of thumb is to use non-proprietary formats and, where possible, follow the standard for the relevant field. Our curators will be on hand to help with any specific questions on this matter.

At the same time as, or soon after, submission of the manuscript. GigaScience reviewers will be checking whether the underlying data are available and appropriate, so they will need access to them. This can be via your own private servers if you prefer, but we offer a secure staging area to host data under review/pre-release.

While the dataset has "pre-release" status, any file can be replaced by overwriting the original with the newly modified one. After the data are published, no overwriting is permitted, only the addition of new files; all published files will remain available (unless there is a very good reason to remove them). Versioning is still possible for updates and major changes if files need to be changed post-publication.

Organise your data into logical directories/folders, and name files consistently without using spaces or special characters. Re-read your methods section to check that everything mentioned there is available, either via links to public repositories or as files you have organised to submit to GigaDB.

 - By ensuring all files are in non-proprietary formats that anyone can use without the need for expensive software.
 - By making sure data tables are provided as tables, not PDFs.
 - By using a CC0 waiver or other suitable public domain licence for datasets.
 - By using OSI (Open Source Initiative) licences for software, and linking to versions in code repositories for updates and forking.
 - By including as much metadata about the samples/specimens/files/methods etc. as possible.

All data submissions should be approved before being started; please contact [email protected] to discuss your article and associated data with our editors. Once approved, there are two possible routes to provide the metadata about your data:

  1. use the online submission wizard - this is a good option for datasets with few authors and few files. The wizard currently has no functionality to upload tabular information, so everything must be typed in individually.
  2. use the template spreadsheet (Excel, but also compatible with OpenOffice) downloadable from here: Link to template excel file - this option is better where there are multiple authors and/or multiple files and/or samples. NB: the spreadsheet contains macros, but these only enable the forward and back buttons, so they can be disabled; you can simply click the relevant tabs at the bottom of the spreadsheet.

For more details on submitting using the Spreadsheet please see here.

The readme file is an important part of any dataset, and our curators are able and willing to assist with it if required. We intend to formalise the readme format in the near future, but for now here is an example of the format we try to work to:

filename = readme.txt
format = ASCII plain text (not RTF, not .doc !)


==========
:, GigaScience database, 
summary:
---------
[optionally you may include a summary text about the dataset or directory structure used here]
Associated data:
--------------
[list any URL links or DOIs to other public repository data]

Directories:
----------
[list any directories of related files with a description to help users understand why these files are grouped into a directory]
 - 

Files:
-----
[list the files available in this dataset with a brief description for each]
 - 

GigaScience supports and has signed the FORCE11 Joint Declaration of Data Citation Principles, feeling strongly that data should be accorded the same importance in the scholarly record as citations of publications. GigaDB datasets can and should be cited in the same manner as any other reference, although the exact format is journal specific, based on each journal's instructions. Following DCC and DataCite guidelines, in the GigaScience journal the citation within the references section takes the form: Author List; (publication year): "Dataset title"; GigaScience Database. DOI. For example: Peter E Larsen; Yang Dai; (2015): Supporting materials for "Metabolome of Human gut microbiome is predictive of host dysbiosis"; GigaScience Database. http://dx.doi.org/10.5524/100163

There shouldn't be, and a major rationale for data publishing is to incentivise earlier release of data in this manner. It is commonly understood throughout the publishing community that publishing data (as a Data Note or in a public archive) is a good thing to be encouraged, and as such there are no penalties to subsequently publishing research based on those data. GigaScience has published many data notes and released datasets prior to the analysis papers being published; some examples are:

  1. 3,000 Rice Genomes Project (13.4 Tb data).
  2. Polar Bear genome - dataset released in GigaDB nearly 3 years before the analysis paper was published in Cell.
  3. Genome of the E. coli strain behind the deadly 2011 outbreak that led to over 50 deaths in Germany (analysis eventually published in NEJM).

Our Polar bear genome data was released nearly three years before any official publication came out from the project, and despite being used by at least 5 other studies, the analysis paper made the cover of Cell (see the blog for more).

Journals do not consider the publication of a dataset with a DOI and associated protocol information as a 'prior publication' that would preclude subsequent publication of new results obtained from that dataset. F1000Research did a useful survey to confirm this with a number of publishers (see: F1000 policy), and this position is only going to become more widely observed and accepted as most publishers are now promoting their own data journals.

As early a release as possible is encouraged, although the standard protocol we follow is to keep data private, accessible to the peer reviewers only, until the associated manuscript has been formally accepted, at which point the dataset is released. This is usually several days prior to the manuscript publication due to production times of the BMC publishing system. While we cannot foresee any reason why datasets should be embargoed for extended periods, we can discuss this further on a case-by-case basis. If you have major concerns about someone else publishing on your data before you, we can add a Fair Use policy statement to the GigaDB dataset page, which looks like this:


These data are made available pre-publication under the recommendations of the Fort Lauderdale/Toronto meetings. Please respect the rights of the data producers to publish their whole dataset analysis first. The data is being made available so that the research community can make use of them for more focused studies without having to wait for publication of the whole dataset analysis paper. If you wish to perform analyses on this complete dataset, please contact the authors directly so that you can work in collaboration rather than in competition.

This dataset fair use agreement is in place until <author can specify a date up to 12 months away>

There are currently no separate Data Publishing Charges (DPCs) for GigaDB, as we do not accept data that is not accompanied by a GigaScience manuscript. All DPCs for GigaScience manuscripts are covered by the Article Publishing Charges (APCs) of that manuscript (up to a terabyte is automatically included, but contact us if you need more). For the APCs of GigaScience manuscripts, please see the GigaScience journal pricing.

No. All data provided by GigaDB are free to download and use. On occasion, when datasets are very large and internet connections are slow, users may request data to be sent on hard disk. GigaDB cannot bear the cost of this, but we will assist in copying the data onto the disks and help arrange shipment; the user will be required to cover the cost of the disks and shipment.

There are two ways to download data from GigaDB:

  1. FTP. This is the "normal" method: click the download button on any dataset page and this is how your data will be sent.
  2. Hard drive shipment. On occasion, when datasets are very large and internet connections are slow, users may request data to be sent on hard disk. GigaDB cannot bear the cost of this, but we will assist in copying the data onto the disks and help arrange shipment; the user will be required to cover the cost of the disks and shipment.

GigaDB datasets can and should be cited in the same manner as any other reference; see the data citation FAQ above for the full citation format and an example.

On each dataset page there are three buttons after the author names: "RIS", "BIBTEX" and "TEXT". You may use these to download the citation of the dataset in those formats.

The term dataset in GigaDB refers to a collection of related works, including but not limited to: files, software, workflows, experiments, data, metadata and results. Each dataset has its own webpage, which has a DOI (digital object identifier). These datasets are permanent and citable records of research output, designed to allow a modernisation of the classical publishing framework while maintaining the familiarity of citations and metrics. While uncommon, it is possible for a dataset to be made up of several other datasets in a nested fashion; for example, the Avian phylogenomics project dataset (http://dx.doi.org/10.5524/101000) is a compilation of 48 other datasets, some published before and some at the same time. This allows the original authors to cite just one dataset to cover them all, but also allows future users to cite individual datasets if they require. We will discuss the merits of such arrangements on a case-by-case basis with the submitter.

While we have no formal agreements with any particular journals, we are happy to work with other journals to ensure timely and coherent joint publications; please discuss with the editors ([email protected]).

A Digital Object Identifier (DOI) is a stable, citable link to an electronic resource. A GigaDB DOI is a stable, citable link to a dataset hosted by GigaDB.

It is widely recognized that publicly funded research data should be made publicly available, free for use by anyone. The Creative Commons Zero (CC0) waiver provides an explicit statement of that fact, making it transparent to all that the data hosted by GigaDB are freely available for any use case. CC0 is thought to be the most appropriate method for dedicating data to the public domain; for more on the rationale and practicalities, see this BMC Research Notes editorial. Citation of data usage is greatly encouraged in order to give recognition to the data producers, both for their efforts in production and for their foresight and generosity in making the data CC0.

At the present time GigaDB doesn't have the resources to assist with data management plans, but there are many useful resources available online, including the DCC (UK focus) or the CIC (US focus).

There are no separate Data Publishing Charges (DPCs) for GigaDB; see the data publishing charges FAQ above. All charges are covered by the Article Publishing Charges (APCs) of the associated GigaScience manuscript.

Due to various differences between BMC's editorial tools and the GigaDB system, it is unfortunately not possible at this time to integrate the submission processes, but our editors and curators will do everything they can to make the process as smooth as possible for authors.

We keep regular backups of the data, so if you find a corrupt file, please let us know and we will replace it with a copy from backup.

We will host your data on our private GigaDB server, giving access to the reviewers; if the manuscript is accepted, we will move the data to our public production server. If it is unsuccessful, the data will be deleted.

Sort of, yes. If a user clicks the download button on the website, this is recorded in the database, and you can see on the dataset page how many times it has happened. However, if a file is pulled directly from the FTP server, it is currently not recorded in the database. This functionality is on our to-do list and will be addressed as soon as we can.

It can be used for anything by anyone; most* data are given the CC0 licence specifically to remove any restrictions on reuse. (* On occasion we host, for the convenience of our users, files that are already covered by other licences, e.g. more appropriate OSI-compliant licences for software, or multiple (all open) licences in a workflow or virtual machine. Where this happens, we make every effort to make users aware of the different licences.)

Hypothes.is is an open source project helping to bring a discussion, annotation and curation layer to the web. We are collaborating with Hypothes.is to make all our datasets open to discussion by anyone (with a Hypothes.is user account). Simply highlight the text of interest on the page and click the "New Note" icon that appears. To see previous notes, click the number on the sidebar, or open the sidebar to see all previous public annotations.

There are various reasons why certain data values may need to be excluded from the sample metadata while you still want the submission to be compliant with particular minimum information standards, such as the GSC MIxS. To maintain compliance when there are missing values within the mandatory fields, please use the following terms only:

 - not applicable: information is inappropriate to report; can indicate that the standard itself fails to model or represent the information appropriately
 - restricted access: information exists but cannot be released openly because of privacy concerns
 - not provided: information is not available at the time of submission; a value may be provided at a later stage
 - not collected: information was not collected and will therefore never be available

The addition of comprehensive sample metadata will ensure the best possible reach of these data and help users find and filter relevant data.
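As an illustration, a submitter could sanity-check a sample sheet against these terms before submission with a short script like the one below. This is a hypothetical helper, not an official GigaDB validator, and the field names in the example are invented:

```python
# Sketch: check that mandatory sample-metadata fields are either filled in
# or use one of the four approved missing-value terms listed above.
# Illustrative only - not an official GigaDB tool.

APPROVED_MISSING = {"not applicable", "restricted access",
                    "not provided", "not collected"}

# Common ad-hoc placeholders that should be replaced by an approved term.
BAD_PLACEHOLDERS = {"", "-", "na", "n/a", "none", "unknown", "missing"}

def check_sample(sample, mandatory_fields):
    """Return a list of problems found in one sample's metadata dict."""
    problems = []
    for field in mandatory_fields:
        raw = sample.get(field, "")
        value = raw.strip().lower()
        if value in APPROVED_MISSING:
            continue  # explicitly approved missing-value term
        if value in BAD_PLACEHOLDERS:
            problems.append(
                f"{field}: {raw!r} is not an approved missing-value term")
    return problems

# Example: 'n/a' is flagged, 'not collected' passes.
sample = {"collection_date": "n/a", "geo_loc_name": "not collected"}
print(check_sample(sample, ["collection_date", "geo_loc_name"]))
```

Real values pass untouched; only empty fields and unapproved placeholders are flagged for replacement with one of the four terms.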

This is simply a display issue. In the longer term we wish to display the directory structure on the GigaDB dataset pages; for the moment, however, the files appear as a flat list. By hovering over a filename in the list you can see the complete file path, which shows that the directory structure has been maintained. Additionally, you can click the "FTP site" link at the top of any file table to be taken to the FTP server, which displays the complete directory structure.

If you inadvertently discover the identity of any patient/individual, then (a) you agree that you will make no use of this knowledge, (b) that you will notify us ([email protected]) of the incident, and (c) that you will inform no one else of the discovered identity. We will assess the specific case and remove/reduce the amount of metadata available for the subjects involved, and inform the data owners/submitters of the situation.

All datasets are curated to a high standard, including but not limited to: checking and, if required, converting file formats to ensure open (non-proprietary) and stable formats are used whenever possible; curating sample metadata to meet appropriate standards and to include ontology terms where possible; and creating specialist display formats, such as 3D images from STL image stacks and JBrowse genome browser tracks from genome assemblies and annotation files. All datasets are manually curated, with email correspondence with the submitting author to ensure completeness. Where possible, our curators follow guidelines provided by international bodies such as the Genomics Standards Consortium (gensc.org) for the minimal information about any genomic sequence. Dataset-level metadata is also checked and curated to go above and beyond DataCite standards.

We ensure data files are not corrupted in transfer by using MD5 checksums whenever files are received or moved. All changes to the data files and/or metadata (after publication) are tracked in a history log present on each dataset page. In the event that a major update is requested, we would create a full new dataset and maintain the previous one as the archival record, placing a notice on the archival record informing users that a newer version is available, with a link.
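Submitters and downloaders can verify a transfer the same way. This minimal sketch (standard-library Python; the file name and expected digest are hypothetical placeholders) computes a file's MD5 digest for comparison against a published checksum:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks
    so that very large data files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (hypothetical file name and digest): compare the local digest
# against the checksum published alongside the dataset.
# assert md5sum("genome_assembly.fa.gz") == "<published md5 digest>"
```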

When curation is complete, the dataset is registered with a DataCite DOI. Each dataset can have multiple links to external repositories/websites; these are manually curated at the time of submission, and automatic link-resolution checks are performed weekly to try to catch links to sites that move or disappear. We check the validity of the submitting author's email address and ask for identifiers such as ORCIDs, if available, for all named authors, but no additional checks are made on those authors.

GigaDB data is currently hosted on China National GeneBank (CNGB) servers in Shenzhen, which promise persistent and stable storage. CNGB is a central- and Shenzhen-government-funded research organisation tasked with supporting public welfare, innovation and scientific research. These servers are built on the Alibaba Cloud Elastic Compute Service (ECS), with Anti-DDoS protection to safeguard the data; this infrastructure is covered by the Alibaba service-level agreements. Regular backups use the Alibaba Cloud Object Storage Service (OSS) to back up and archive all data in the repository, automatically storing two copies of the data in different locations (please see the Alibaba Cloud help pages for more details: https://www.alibabacloud.com/help/), as well as providing rapid data recovery. We ensure data files are not corrupted in transfer by using MD5 checksums whenever files are received or moved (see the data integrity FAQ for more). As full members of DataCite, CC0 metadata is sent to them upon public release and is discoverable and searchable via search.datacite.org and other linked-data search indexes.

GigaDB is the data repository. GigaScience is the journal that created the GigaDB platform and has served as the initial test-case application of GigaDB. Other organisations involved with GigaScience and GigaDB include: BGI Group - formerly the Beijing Genomics Institute, now based in Shenzhen and known simply as BGI. BGI is the institution that has provided all the funding for the journal and GigaDB development to date.

Oxford University Press (OUP) - the publisher with which BGI currently has a partnership to run GigaScience journal.

BGI Hong Kong Tech Ltd. - A member of the BGI group of companies that is a legal entity in Hong Kong, this is where the metadata is hosted, and most of the GigaDB staff are employed.

CNGB (China National GeneBank) - a government-funded institute constructed by BGI and administered by the Development and Reform Commission of Shenzhen Municipality, which provides most of the informatics infrastructure (storage) for GigaDB.

Aliyun - a Chinese cloud services provider to which CNGB has contracted out its IT infrastructure, and which currently hosts GigaDB's data on its servers.

The agreements with CNGB and BGI ensure GigaDB will be actively maintained for the foreseeable future. Linking GigaDB to datasets accompanying open access journal articles, with article and data processing charges included to help cover storage and curation costs, also provides a model to enable the sustained growth of GigaDB.

We follow a variety of standards for metadata collection, primarily the DataCite specification (https://schema.datacite.org/) for the dataset level metadata and the Genomics Standards Consortium minimal information standards for the sample level metadata. We also follow community norms for file metadata. Additionally all dataset pages are marked up with Schema.org compliant metadata to facilitate discovery by generic web search engines such as Google.
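To illustrate the Schema.org markup mentioned above, a dataset landing page embeds a JSON-LD block along the following lines. This is a simplified, hypothetical example with placeholder values, not a real GigaDB record, which carries fuller metadata:

```python
import json

# Sketch of Schema.org "Dataset" metadata of the kind embedded in a
# landing page as JSON-LD. All values below are invented placeholders.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example supporting data for a GigaScience article",
    "identifier": "https://doi.org/10.5524/100000",  # placeholder DOI
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "creator": [{"@type": "Person", "name": "A. Author"}],
}

# Serialised as it would appear inside a <script type="application/ld+json"> tag.
print(json.dumps(dataset_metadata, indent=2))
```

Search engines that understand Schema.org can read such a block directly from the page, which is what makes the datasets discoverable via generic web search.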

GigaDB's code is open source (available on GitHub: https://github.com/gigascience/gigadb-website), based on the PostgreSQL database and the Yii PHP framework, and we utilise and integrate with a growing number of open source plugins and widgets, e.g. JBrowse, a community-supported genome browser. All of the servers run on CentOS 7, which is also open source.

If you spot errors in data or metadata, or anywhere on this website, please contact the GigaDB curators via [email protected]. We also provide a moderation space for more interactive feedback or discussion on our datasets using Hypothes.is integration. Hypothes.is is an open-source annotation tool that enables users to make comments, highlight important sections and engage with fellow readers online. Anyone who questions any of the information, or has additional material to link, can do so via this functionality, producing a conversation layer over our datasets. We have integrated a plugin to allow public annotations to be highlighted on the landing pages; using the Hypothes.is icon that hovers over the top right of GigaDB landing pages, you can log in and join the discussion, adding comments and your own annotations.

You must include the "full_table.tsv" and "missing_busco_list.tsv" files; any other output files are optional. It is acceptable to include the entire "output" directory of a BUSCO run in a tar.gz archive if you prefer. See this website for more details about the various outputs from BUSCO v4.
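For instance, the whole output directory can be bundled with the Python standard library. This sketch (with hypothetical paths; not an official GigaDB or BUSCO tool) also checks that the two required files are present before archiving:

```python
import os
import tarfile

# The two files this FAQ requires in every BUSCO submission.
REQUIRED = ["full_table.tsv", "missing_busco_list.tsv"]

def archive_busco_output(output_dir, archive_path):
    """Check the required BUSCO result files exist in output_dir,
    then pack the whole directory into a gzipped tarball."""
    names = set(os.listdir(output_dir))
    missing = [f for f in REQUIRED if f not in names]
    if missing:
        raise FileNotFoundError(f"missing required BUSCO files: {missing}")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(output_dir, arcname=os.path.basename(output_dir))
    return archive_path

# Usage (hypothetical paths):
# archive_busco_output("run_mydata/output", "busco_output.tar.gz")
```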

GigaDB displays file sizes using binary prefixes (e.g. 1 KB = 1024 bytes; 1 GB = 1073741824 bytes).
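That convention can be reproduced in a few lines. This is an illustrative helper using the 1024-based prefixes described above, not GigaDB's actual display code:

```python
def human_size(num_bytes):
    """Format a byte count using binary prefixes (1 KB = 1024 bytes),
    matching the convention described above. Illustrative only."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{int(size)} B" if unit == "B" else f"{size:.2f} {unit}"
        size /= 1024

print(human_size(1024))        # 1.00 KB
print(human_size(1073741824))  # 1.00 GB
```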

The submission of data to GigaDB is integrated with the submission of a manuscript to GigaScience, the general workflow followed is outlined in the GigaDB Submission Guidelines. During the process you may track the progress in "your datasets" on your personal profile page in GigaDB. Each of your datasets will have one of the statuses listed below.

Diagram of dataset status lifecycle, see description below

Green steps are those carried out by Giga staff, yellow steps are performed by authors, and red indicates the end of the line for that dataset.

Dataset status lifecycle

1. Import From EM
Carried out by Giga staff: Dataset metadata gets imported from manuscript submission system.
2. User Started Incomplete
Carried out by authors: Manual dataset submission started by user, but not yet submitted.
3. Assigning FTP Box
Carried out by Giga staff: Curator to provide private FTP login details for user to upload data files to.
4. User Uploading Data
Carried out by authors: Author is uploading data files to GigaDB private FTP area.
5. Data Available For Review
Carried out by Giga staff: Pending results of manuscript review.
6. Submitted (Dataset)
Carried out by Giga staff: Dataset has been submitted for curator review.
7. Data Pending
Carried out by authors: Authors are updating files and metadata at request of curators.
8. Curation
Carried out by Giga staff: Final curation checks are being made, and mock-up being made for author review.
9. Author Review
Carried out by authors: Dataset mock-up is complete, pending review by authors.
10. Private
Carried out by Giga staff: Dataset is complete and checked by authors, ready to be released.
11. Published
End of the line for that dataset: The dataset is now published. This status cannot be reversed.
Rejected
The related manuscript was rejected, so this dataset is not needed. Datasets with this status will be purged from the database on a regular basis.
Not Required
Associated manuscript has no data. Datasets of this status may be purged from the database.