A group of US and European epidemiologists on Tuesday launched an open international database that presents granular information on 5 million anonymized COVID-19 cases from more than 100 countries.
Each case record in the database, which is housed at Global.health, contains up to 40 variables, including information such as the patient’s demographics and location, the date on which the patient first had COVID-19 symptoms, the date on which the patient received a positive test result, and travel history.
The project was aided by Google, which provided software experts to create the infrastructure for the database. Google and the Rockefeller Foundation supplied $1.25 million in financial support for the epidemiology project.
An article published in Nature on Tuesday reports that “the researchers hope the database will help them to monitor coronavirus variants and vaccines in the months to come, and provide a template for tracking real-time data in future epidemics.”
In the short term, according to the Nature article, the researchers hope to use the data to answer questions such as how rapidly new variants spread among people, whether vaccines protect against new variants, and how long immunity to the virus lasts.
In the longer term, when cases of COVID-19 start to become rare, the researchers hope that the repository will help public health authorities track and respond to new virus variants, said Samuel Scarpino, PhD, director of the Emergent Epidemics Lab at Northeastern University, Boston, Massachusetts.
The database was created by 21 epidemiologists at the University of Oxford, Harvard University, Northeastern University, Boston Children’s Hospital, Georgetown University, the University of Washington, and the Johns Hopkins Center for Health Security.
“Much More Granular Picture”
Early last year, the researchers began to compile case records manually, using a Google spreadsheet. They bumped into the software’s limit when they reached about 80,000 cases, Scarpino said. At that point, they asked Google for help. The tech giant responded by lending the epidemiologists a dozen software engineers, designers, and data scientists for 6 months.
At the start, the researchers were entering data culled from newspaper articles, public health agencies, and other sources. The Google engineers wrote code that allowed the consortium to upload daily coronavirus data from about 60 governments in a standardized format, the Nature article says. In addition, the software experts created an algorithm to merge information being added from around the world into a single cloud-based repository.
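The article does not describe the consortium's actual code, so the following is only a minimal sketch of the general idea: mapping rows from differently formatted sources onto one shared record layout, then merging them into a single store. Every name here (`CaseRecord`, `normalize`, `merge`, the field names, and the `feed_a` source) is a hypothetical chosen for illustration, not part of Global.health.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical standardized case record; field names are illustrative."""
    source: str
    source_case_id: str
    country: str
    confirmed_on: Optional[date] = None
    symptom_onset: Optional[date] = None
    age: Optional[int] = None


def parse_date(value):
    """Parse an ISO date string (YYYY-MM-DD); missing values stay None."""
    return date.fromisoformat(value) if value else None


def normalize(source, raw, field_map):
    """Map one source-specific row onto the shared layout.

    field_map translates shared field names to the source's own column names;
    fields absent from the map are looked up under the shared name.
    """
    def get(key):
        return raw.get(field_map.get(key, key))

    age = get("age")
    return CaseRecord(
        source=source,
        source_case_id=str(get("source_case_id")),
        country=get("country"),
        confirmed_on=parse_date(get("confirmed_on")),
        symptom_onset=parse_date(get("symptom_onset")),
        age=int(age) if age not in (None, "") else None,
    )


def merge(repository, records):
    """Add records keyed by (source, case id); a later upload replaces an earlier one."""
    for record in records:
        repository[(record.source, record.source_case_id)] = record
    return repository


# Example: one invented feed whose column names differ from the shared layout.
repo = {}
feed_a = [{"id": 1, "iso": "DE", "test_date": "2021-02-01", "age": "34"}]
map_a = {"source_case_id": "id", "country": "iso", "confirmed_on": "test_date"}
merge(repo, [normalize("feed_a", row, map_a) for row in feed_a])
```

Keying the store on the source plus the source's own case identifier means that re-uploading the same daily file does not create duplicate cases. A real system would also need to reconcile the same patient reported by two different sources, which this sketch does not attempt.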
So far, the researchers have collected data on about 24 million cases. Data for a dozen variables have been collected for about half of these cases, and additional data are available for about 10% of cases. Altogether, more than 160 million data points on individual infections are available.
It was not immediately clear why the Global.health repository currently provides data on only 5 million cases. Scarpino said in an interview with STAT that the plan was to launch the database with information on twice that many cases.
Although aggregate data on COVID-19 cases, hospitalizations, and deaths are essential, Scarpino said, it’s also necessary to provide granular information about each case. “Things like travel history. Age distribution, race, ethnicity, if they’re reported. Symptoms, if they’re reported. If there’s an outcome reported, like a death or a hospitalization. And so it allows you to see a much more granular picture of what’s happening,” he said.
It’s unclear how the epidemiologic database will be funded going forward, noted Scarpino. The money supplied so far might last a few years, he said. But he and his colleagues have a 5-year plan that involves tracking new outbreaks of diseases such as tuberculosis and malaria, as well as the next virus that could spark a global pandemic.
Although epidemiologists have endorsed the value of the database, the World Health Organization (WHO) is unlikely to finance it. Scarpino noted that the WHO couldn’t support the current project because it lacked the resources for such an expensive, labor-intensive effort.