Here’s why that’s undermining personalized medicine

In a paper published last week, scientists led by Dr. Pui-Yan Kwok of the University of California, San Francisco, analyzed 154 genomes from 26 ethnic populations, from Han Chinese and Tuscans to Yoruba, Esan, Puerto Ricans, and Peruvians. They found 60 million bases in one or more of these populations that are missing from the reference genome.

“The reference genome was a huge triumph, but when it was done people weren’t thinking that much about population-geographic genetic variation,” said bioinformatics professor Mark Gerstein of Yale University. “One of its problems is that it’s very European-biased, which means that an African has many more differences from the reference than a European does.”

That can keep non-Europeans from benefiting from the genetic revolution. At Duke University, geneticists recently analyzed the DNA of a young African-American woman with intellectual disability and progressive cognitive decline, hoping to identify the genetic cause. They turned up 10 abnormal variants, said medical geneticist Dr. Queenie Tan. With white patients, whose genomes are a closer match to the reference, it’s rare to get more than a couple of hits; a list that short can guide parents if they wish to undergo prenatal genetic testing before having additional children. “But with 10 candidates, it’s hard to come to any conclusion about whether one gene is more important than the others,” said genetic counselor Heidi Cope of Duke. “With this patient, we were stuck.”

In other cases, the reference genome is missing vast quantities of the DNA found in non-Europeans. Computational biologist Steven Salzberg of Johns Hopkins University and colleagues sequenced the genomes of 910 African Americans and measured how many pieces are present in all of them but are missing from the reference genome. Their count: 296,485,284 base pairs — nearly 10 percent of the human genome — they reported last November. One missing fragment is 100,000 base pairs long, and millions are at least 1,000 long.

Some experts believe Salzberg’s count is too high, but none disagrees with his conclusion. “Eighteen years after finishing the human genome, why are we still relying on just one genome, a mosaic of a few dozen people, to guide thousands of experiments?” he asked. “We can do far better.”

The National Institutes of Health is placing a multimillion-dollar bet that he’s right. At a 2018 meeting convened by its National Human Genome Research Institute, experts concluded that the reference “does not adequately represent human [genetic] variation,” and that it needed to be improved by creating a “pan-genome” that has all of that variation stuffed into it. NHGRI is now evaluating proposals to do that, offering up to $6 million per year to produce high-quality sequences of about 350 genomes.

“The number is less important than what populations we should sample,” said NHGRI’s Adam Felsenfeld. The current reference genome “is good for many, many things, but it’s not as good or as complete as it could be.”

The problems start with the standard way of sequencing a genome, including for medical purposes such as finding the genetic cause of a mystery syndrome. Scientists chop it into millions of segments, about 100 base pairs long. They feed these short reads into next-generation sequencing machines, which determine the order of the A’s, T’s, C’s, and G’s. Algorithms then figure out where each short read falls on a chromosome by using the reference genome as a guide.
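To make that alignment step concrete, here is a minimal Python sketch. It is not how any production aligner works; it only illustrates the idea of placing short reads by looking them up against a reference. The function names (`build_index`, `align`), the seed length, and the toy sequences (including the inserted CGG repeat) are assumptions made up for illustration.

```python
def build_index(reference, k):
    """Map every k-letter seed in the reference to the positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def align(read, reference, index, k, max_mismatches=1):
    """Return the best reference position for the read, or None if it fits nowhere."""
    best = None  # (position, mismatches) of the best candidate so far
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


# Toy data (hypothetical): the "patient" carries a stretch the reference lacks.
left = "ATGCGTACCTGAAGTC"
right = "TTGACCGTTAGCAAGG"
novel = "CGG" * 4                  # sequence absent from the reference
reference = left + right
patient = left + novel + right

K = 5
index = build_index(reference, K)

reads = {
    "read_a": patient[2:12],       # lies in the shared left flank
    "read_b": patient[18:28],      # lies entirely inside the novel stretch
    "read_c": patient[30:40],      # lies in the shared right flank
}
results = {name: align(seq, reference, index, K) for name, seq in reads.items()}

for name, pos in results.items():
    print(name, reads[name], "->", "unaligned" if pos is None else pos)
```

In this toy run, the two reads from the shared flanks land at reference positions 2 and 18, while the read taken from the inserted repeat has nowhere to go and comes back unaligned. That is the failure mode at issue when a patient's DNA contains sequence the reference does not.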

When the reference is missing a page, like that vandalized dictionary, scientists are stuck. That’s what happened to the Baratela Scott scientists. Only after time-consuming detours to the mouse genome and to alternative DNA sequencing that bypassed the reference genome did they finally find the abnormality — in a region upstream of XYLT1 — and confirm the boys had Baratela Scott.

“The problem was, this 238-base-pair region isn’t in the reference,” said Dr. Heather Mefford, the UW pediatrician and geneticist who led the sequencing analysis. The abnormality was a nucleotide stutter, with CGG repeated hundreds of times in a segment of DNA that activates XYLT1.

Because the region with the stutter is missing from the reference genome, if labs less advanced than UW’s analyze DNA from patients thought to have Baratela Scott, they will almost certainly miss it. The syndrome, a recessive disorder, has no cure, so that wouldn’t affect patient care. “But if you want to test for it prenatally,” said Mefford, perhaps when prospective parents know or suspect it runs in their family, “it wouldn’t be found.”

Mefford’s lab is grappling with a similar medical mystery. In a large fraction of patients with a form of epilepsy that she strongly suspects is genetic, she’s been unable to find any glitches in their DNA when she compares their short reads to the reference genome. “One of our nagging questions,” she said, “is, are the relevant regions missing from the reference genome?”

Experts aren’t sure why DNA sequencing identifies the genetic cause of a child’s mystery disease only about 40 percent of the time, said Felsenfeld, “but failure to align a patient’s short reads on the reference genome might be one reason.”

That’s especially likely to happen if the patient belongs to an ethnic group that is poorly represented in the reference. The reference has none of the thousands of variants that are specific to people from the Philippine island of Panay, for instance. That caused problems when scientists analyzed the genomes of 403 people from Panay with a rare neurodegenerative disease called X-linked dystonia-parkinsonism, looking for its precise genetic cause.

It turned out that “the causal mutation is in a stretch of DNA that exists only in the Panay population and isn’t in the reference genome,” said neuro-genomics expert Michael Talkowski of Massachusetts General Hospital, who led a 2018 study that, like the Baratela Scott team, eventually used an alternative approach to identify the cause of this parkinsonism. It turned out to be DNA that jumped into a gene called TAF1. That made the gene as meaningless as inserting letters into an English woPHOrd.

Several labs have tried to remedy the ethnic bias of the reference genome by producing Chinese, Korean, and Ashkenazi reference genomes. The problem is, people are often mistaken about their ancestry, so geneticists would get nowhere by trying to compare someone’s genome sequence to the wrong reference. Having a single reference genome is the only way to avoid that.

How to make one that best represents human diversity is a hot topic among computational biologists, with ideas such as “graph genomes” and “pan-genomes” competing for backing like presidential candidates for 2020 and promising to improve the solve rate of mystery diseases.

Without a more representative reference genome, genetic medicine will never reach some ethnic groups, warns genome scientist Alicia Martin of Mass. General, and on that point there’s no disagreement. Medical genetics is moving away from assessing disease risk from one or two genes and toward calculating a “polygenic risk score” based on hundreds.

But with the European bias of the reference genome and other tools, polygenic risk scores for people who trace their ancestry to Africa, in particular, are often only “marginally better, if at all,” than flipping a coin, Martin and her colleagues argue in a paper posted on the preprint site bioRxiv. “They are therefore least likely to benefit” from DNA-based medicine — at least until genome scientists move beyond Buffalo.