We need open data, especially open SARS-CoV-2 sequence data, and open science to beat COVID-19 and to prepare for future outbreaks.
Why open data is so important
When responding to a health crisis, data play a critical role in understanding transmission, infection and symptoms, and in identifying drug targets, developing vaccines and designing public health responses. The COVID-19 pandemic has highlighted the critical value of open data, open science and international collaboration in advancing scientific discovery when time is of the utmost importance.
We, the scientific community, need to ensure that open science is standard practice, and to remove barriers that restrict effective data sharing. As much research and healthcare data as possible need to be taken out of silos and stored in an open, connected and FAIR (Findable, Accessible, Interoperable and Reusable) environment to prepare our healthcare systems for future pandemics, and to unleash the fast flow of research advances into clinical use for the benefit of society.
Commit to open data sharing
To aid these efforts, the signatories of this letter jointly call upon the data submitters, data users, policy makers and the wider research community to:
- Submit raw SARS-CoV-2 data to the databases of the International Nucleotide Sequence Database Collaboration (INSDC)
- Submit consensus/assembled SARS-CoV-2 data to the databases of the INSDC
- Provide information relating to the sequenced isolate or sample as part of the sequence submission; a minimum of time and place of isolation/sampling and an isolate/sample identifier should be provided to maximise the value of the sequences
- Continue any submissions already established with other databases, in parallel with the INSDC submissions
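The minimum metadata named in the list above (an isolate/sample identifier plus the time and place of isolation/sampling) can be checked before a submission is made. The following is a minimal sketch of such a check, not the INSDC submission format: the field names `isolate_id`, `collection_date` and `location`, the helper functions and the example records are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical field names for illustration; not the INSDC submission schema.
REQUIRED_FIELDS = ("isolate_id", "collection_date", "location")


def missing_minimum_metadata(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def has_valid_collection_date(record: dict) -> bool:
    """Check that collection_date parses as an ISO 8601 date (YYYY-MM-DD)."""
    try:
        date.fromisoformat(str(record.get("collection_date", "")))
    except ValueError:
        return False
    return True


# Made-up example records, one complete and one incomplete.
complete = {
    "isolate_id": "sample-001",
    "collection_date": "2021-01-15",
    "location": "United Kingdom",
}
incomplete = {"isolate_id": "sample-002", "collection_date": "", "location": None}

print(missing_minimum_metadata(complete))    # []
print(missing_minimum_metadata(incomplete))  # ['collection_date', 'location']
```

A check of this kind could run before sequences leave the sequencing laboratory, so that records lacking time, place or identifier are completed rather than submitted with reduced value.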
How we can open up SARS-CoV-2 data
To address these needs, the European COVID-19 Data Portal was launched in April 2020. It is the interface – the visible part – of the European COVID-19 Data Platform and it provides a single entry point for researchers to upload, access and analyse COVID-19-related data and specialist datasets such as:
- Viral genomes from patients (flowing through the SARS-CoV-2 Data Hubs or direct submissions to the INSDC databases)
- Biomolecular data relating to the virus and its human and animal hosts, spanning the scientific literature, gene expression, protein structures, biochemical pathways, compounds and drug targets
- National and local cohorts with genomes of COVID-19 patients to help identify risk factors for disease progression (in the Federated European Genome-phenome Archive or Federated EGA)
- National COVID-19 portals and other relevant resources like biobanks, clinical trials and imaging repositories
The European COVID-19 Data Platform enables national data producers to share biomolecular data with the international scientific community, making these data available for reuse. This approach supports the advancement of COVID-19 research and knowledge sharing on a global scale.
How we can build on existing infrastructure
The rapid deployment of the European COVID-19 Data Platform would have been impossible without the research infrastructure provided by existing international research database collaborations built up over many years.
The INSDC databases play a crucial role in mobilising the SARS-CoV-2 sequence data. They capture, organise, preserve and present nucleotide sequence data as part of the open scientific record. INSDC member institutions – EMBL’s European Bioinformatics Institute (EMBL-EBI), the NIG DNA Data Bank of Japan (NIG-DDBJ) and the National Library of Medicine’s National Center for Biotechnology Information at NIH (NCBI) – are committed to continuously delivering this critical element of scientific infrastructure for the rapid and open sharing of data relating to this outbreak and beyond, as custodians of sustained open research and scientific development.
All three INSDC members have prioritised the processing of SARS-CoV-2 sequence data and have streamlined the submission process to provide users with support and minimise the effort required for data submission.
Why we need to act now
Availability of data through the INSDC databases is vital in the fight against COVID-19. It provides:
- Rapid open access – INSDC quickly makes submitted data freely and permanently available to everyone, with no need to log in and no restrictions on reuse
- Systematic and standardised presentation of data, suitable for immediate reuse in all mainstream sequence analysis tools
- Linkage of raw sequence read data to genome assemblies, enabling researchers to validate the integrity of assemblies and investigate asserted mutations and changes in genome sequences
- Integration of SARS-CoV-2 sequences with the entirety of INSDC data, including related coronavirus genome sequences, enabling comparison across species
- Linkage of sequences to the published literature
- Availability of data for integration into the world’s ecosystem of bioinformatics databases that draw data from INSDC and add value and depth through curation and analysis workflows
- Integrated data analysis tools to further understanding of the virus
The recent availability of SARS-CoV-2 B.1.1.7 lineage genome sequence data, or lack thereof, from countries worldwide highlights very well how openly available data allow or constrain rapid analysis and dissemination so that evidence-based political decisions can be made. Only fast, comprehensive and global SARS-CoV-2 sequencing and a rapid flow of SARS-CoV-2 sequence data into the INSDC databases will ensure the rapid dissemination of data with maximal impact due to their connectivity to the global bioinformatics data infrastructure.
Being able to link viral data, human genetics data, clinical outcome data, serology and seroprevalence data, vaccination data, and many other crucial data types in an interoperable way across borders is and will remain a difficult task. However, we are in a position to achieve this goal in a fast, scalable and widely usable way that will bring vast benefits to our societies. To do so, we have to make sure that crucial research and healthcare data are treated in a FAIR way and not unnecessarily siloed, fragmented and closed.
About the authors
EMBL-EBI, as the home to the world’s most comprehensive range of freely available data resources, has initiated this call as open data is key to accelerating COVID-19 research.