Monday, 12 June 2023

Human pangenome reference will enable more complete and equitable understanding of genomic diversity

UC Santa Cruz scientists, along with a consortium of researchers, have released a draft of the first human pangenome -- a new, usable reference for genomics that combines the genetic material of 47 individuals from different ancestral backgrounds to allow for a deeper, more accurate understanding of worldwide genomic diversity.

By adding 119 million bases -- the "letters" in DNA sequences -- to the existing genomics reference, the pangenome provides a representation of human genetic diversity that was not possible with a single reference genome. It is highly accurate and more complete, and it dramatically increases the detection of variants in the human genome, as shown in a collection of groundbreaking papers published today in the journals Nature, Genome Research, Nature Biotechnology, and Nature Methods.

The pangenome was produced by the Human Pangenome Reference Consortium (HPRC), which is co-led by UCSC's Associate Professor of Biomolecular Engineering Benedict Paten and Assistant Professor of Biomolecular Engineering Karen Miga. It is now available for use in an assembly hub on the UCSC Genome Browser. More than a dozen UCSC researchers and students are contributors to this project, which will continue into 2024, when the researchers plan to release a final pangenome with genomic information from 350 individuals.

"We are introducing more diversity and equity into the reference by sampling diverse human beings and including them in this structure that everyone can use," said Paten, who is the senior author on the main marker paper. "One genome isn't enough to represent everybody -- the pangenome will ultimately be something that is inclusive and representative."

Understanding genomic variation

Each person's genome varies slightly -- by about 0.4 percent compared to the next person, on average -- and understanding these differences can provide insight into their health, help to diagnose disease, predict medical outcomes, and guide treatments. Using the pangenome reference will improve scientists' ability to detect and understand variation in future studies.

Typically, when scientists and clinicians study an individual's genome to look for variation, they compare that individual's DNA to that of a standard reference to determine where there are differences of one or more base pairs. Until now, the reference genome has primarily been represented by a single sequence for each human chromosome, mostly sourced from one individual. But this reference is nearly 20 years old and fundamentally limited in that it cannot represent the wealth of genetic variation present in the human population. This introduces an issue called reference bias into genome analysis.

In contrast, the new pangenome is a reference that combines the genomes of 47 individuals from various ancestral backgrounds. The pangenome looks like a linear reference in areas where the sequences have the same bases, and expands to show the areas where there are differences. It represents many different versions of the human genome sequence at the same time, and gives scientists a more accurate point of comparison for variation that is present in some populations but not others.
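The idea of a reference that is linear where genomes agree and branches where they differ can be sketched with a toy example. This is only an illustration under stated assumptions, not the consortium's actual data structure or software: the node names, the DNA segments and the three haplotype labels are all invented.

```python
# Toy sequence graph: each node holds a DNA segment, and each haplotype
# is a path through the nodes. All names and sequences are made up.
nodes = {
    "n1": "ACGT",  # segment shared by every haplotype
    "n2": "G",     # one version of a variable site
    "n3": "T",     # an alternative version of the same site
    "n4": "CCA",   # shared again after the variable site
}

paths = {
    "hap_A": ["n1", "n2", "n4"],
    "hap_B": ["n1", "n3", "n4"],
    "hap_C": ["n1", "n2", "n4"],
}


def spell(path):
    """Rebuild a haplotype's sequence by concatenating its node segments."""
    return "".join(nodes[node_id] for node_id in path)


def shared_nodes(all_paths):
    """Return the nodes that every haplotype passes through."""
    return set.intersection(*(set(p) for p in all_paths.values()))


sequences = {name: spell(path) for name, path in paths.items()}
common = shared_nodes(paths)   # the parts that look like a linear reference
variable = set(nodes) - common  # the parts where the graph expands
```

Here the shared nodes play the role of the stretches where all genomes agree, while the remaining nodes are the places where the graph opens up to hold more than one version of the sequence.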

"One genome can't possibly represent all of the rich variation we know can be observed and studied around the world," said Miga, Director of the HPRC Production Center at UCSC. "The No. 1 goal of the human pangenome reference is to try to broaden the representation of a reference resource to be more inclusive and more equitable for studying the human species, as a collection of references and not just one."

Genomic variation can be small, consisting of differences of just one or a few DNA bases, or it can be large structural variants, classified as variants that are 50 base pairs or larger. These larger, structural variants can have important health implications. Until now, researchers have been unable to identify more than 70 percent of the structural variants that exist in human genomes due to limited technologies and the bias of using a single reference sequence.

Of the 119 million new bases added to the reference with the pangenome, roughly 90 million of these derive from structural variation. Structural variants are complex and may be inversions of sequences, insertions, deletions, or tandem repeats -- a segment of two or more bases repeated numerous times. These new bases will help researchers to study regions in the genome for which there was previously no reference, and potentially be able to associate structural variants with disease in future studies.

"Now, we can map to more structural variants, so we're finding features and areas in the genome that just weren't there before," Miga said. "That's exciting because it's allowing us to look at gene regulation in a unique way that we couldn't study before, because those areas probably would have been inappropriately mapped or just ignored altogether."

Using the pangenome reference for genomic analysis increases the detection of structural variants by 104 percent as compared to detection using the standard reference. The pangenome reference also increases the accuracy of calling small variants, those just a few bases long, by about 34 percent because of the increased amount of data present in the pangenome.
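To make the figures quoted in this section concrete, the short calculation below works out what share of the added bases comes from structural variation and what a 104 percent increase means as a multiplier. It is a back-of-the-envelope sketch: the baseline count of 1,000 detected variants is a hypothetical number chosen only for illustration and does not come from the papers.

```python
# Figures quoted in the article (rounded, in millions of bases).
new_bases_millions = 119
structural_bases_millions = 90

# Share of the added sequence that derives from structural variation.
structural_share = structural_bases_millions / new_bases_millions
structural_share_percent = round(structural_share * 100)

# A 104 percent increase means the new value is (1 + 1.04) times the old one.
increase_percent = 104
detection_factor = 1 + increase_percent / 100

# Hypothetical baseline, for illustration only: if the standard reference
# found 1,000 structural variants in a sample, the same relative gain
# would correspond to roughly this many with the pangenome.
hypothetical_baseline = 1000
hypothetical_with_pangenome = round(hypothetical_baseline * detection_factor)
```

In other words, roughly three quarters of the added bases come from structural variation, and a 104 percent increase means slightly more than twice as many structural variants detected.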

Each human carries a paired set of chromosomes -- one set inherited from the mother and one from the father. The individual genomes present in the pangenome reference contain haplotype-resolved information, meaning they confidently distinguish the two parental sets of chromosomes -- a major scientific feat. Having this information will help scientists better understand how various genes and diseases are inherited.

This also means the current reference actually includes 94 distinct genome sequences, with the goal of getting to 700 by 2024.
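The arithmetic behind these numbers is simply two parental chromosome sets per person. A minimal sketch, with a function name chosen here for illustration:

```python
# Each haplotype-resolved individual contributes two genome sequences,
# one inherited from each parent.
HAPLOTYPES_PER_INDIVIDUAL = 2


def genome_sequence_count(individuals):
    """Number of distinct genome sequences for a given number of people."""
    return individuals * HAPLOTYPES_PER_INDIVIDUAL


draft_sequences = genome_sequence_count(47)     # current draft pangenome
planned_sequences = genome_sequence_count(350)  # planned final release
```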

Creating the pangenome

The pangenome was made possible through the development of advanced computational techniques to align the multiple genome sequences into one, usable reference in a structure called a pangenome graph. Paten and researchers in the UCSC Computational Genomics lab helped lead the HPRC efforts to develop the algorithmic methods needed to create this pangenome graph structure.

Source: ScienceDaily
