It’s only been three years since UC Santa Cruz researchers proved that long-read human genome assembly using the same nanopore technology developed on campus could be done at all. At the time, it was a monumental effort, requiring 150,000 hours of computing time and weeks of work.
About a year later, using the PromethION nanopore sequencer, a similar effort proved significantly faster, cheaper, and easier, clocking in at about a week. “We sequenced eleven human genomes in nine days, which was unprecedented at the time,” said UC Santa Cruz Research Scientist Miten Jain.
Now, UC Santa Cruz researchers have collaborated on an algorithm designed to accurately and precisely assemble individual, complete human genomes from long-read sequencing data in about six hours and for about $70.
The researchers said they hope their assembler will increase the pace of genomics research and open new opportunities. One of these is pangenome research, which aims to represent the true scale of human diversity and now becomes a decidedly more practical pursuit.
Until recently, genomic research has relied exclusively on the reference genome from a single individual selected to represent an entire species. To reflect true human diversity, UC Santa Cruz has embarked on a pangenomic initiative to sequence 350 new, individual human genomes.
As a part of this work, UC Santa Cruz Genomics Institute researchers developed a nanopore long-read sequencing protocol that consistently yields ~60X coverage (~200 gigabases) of a human genome at unprecedented lengths (median read N50 of 42 kb) using three PromethION flow cells. Additionally, ~7X coverage of the genome is in reads exceeding 100 kb in length. This method is highly scalable, both in terms of cost and the number of genomes that can be processed simultaneously. The researchers are now improving this method for higher read lengths and throughput, which will further facilitate their goal of achieving complete, phased, reference-quality genomes.
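For readers unfamiliar with the two figures quoted above, the following is a minimal illustrative sketch of how read N50 and sequencing coverage are commonly computed. It is not code from the Shasta toolkit or the researchers' protocol. The function names, the toy read lengths, and the approximate genome size of 3.1 gigabases are assumptions made for illustration.

```python
# Illustrative only: not part of Shasta or the UC Santa Cruz protocol.

# Assumption: approximate size of a haploid human genome, in bases.
HUMAN_GENOME_BASES = 3.1e9


def read_n50(lengths):
    """Return the N50 of a set of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("read_n50 requires at least one read")


def coverage(total_bases, genome_size=HUMAN_GENOME_BASES):
    """Return average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Toy read lengths (hypothetical, in bases).
toy_reads = [10_000, 20_000, 30_000, 40_000, 50_000]
toy_n50 = read_n50(toy_reads)

# Roughly 200 gigabases of reads over an assumed 3.1-gigabase genome.
approx_depth = coverage(200e9)
```

Under the assumed genome size, 200 gigabases of reads works out to a depth in the 60X range, which is consistent with the approximate figures quoted in the article.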
This large inflow of data necessitated the development of highly efficient software tools, starting with an assembler. “Our new assembler was designed to be cheap and quick, with the goal to be on the cloud,” said UC Santa Cruz’s Benedict Paten. “It gives us the power to scale nanopore sequencing. Now, I’m confident that we’ll be easily assembling hundreds of de novo genomes in the next couple of years.”
An extensive team of researchers and developers that was led by Paolo Carnevali from the Chan Zuckerberg Initiative (CZI) — and included many at the Computational Genomics Lab at the UC Santa Cruz Genomics Institute — contributed to this solution.
“When I saw the Jain 2018 paper, I was impressed and realized that I could contribute to the computational side of this line of investigation,” said Paolo Carnevali. “I had recently met Benedict Paten and decided I wanted to work with his team at UCSC.”
The team was soon collaborating. Within months, they had developed and tested the special algorithmic sauce, which they called Shasta.
Shasta is an in-memory computing-driven algorithm that can now help complete a de novo (new, never before processed) human genome assembly in under six hours, the authors say, for an average cost of $70 per sample.
In their paper, “Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes,” published today in Nature Biotechnology, they describe how Shasta not only yields accuracy comparable to or better than that of its contemporaries but also has the lowest number of misassemblies.
Not satisfied with this milestone, the team saw an opportunity to improve the draft assembly at an affordable cost and turn-around time. “To improve the base-level quality of the assemblies, we used a sequence polisher based on a deep neural network as the final assembly step,” explained lead author Kishwar Shafin. “This brought the total cost of the assembly process to less than $200 and 37 hours — which further reduced the computational overhead of generating long-read assemblies dramatically — by a factor of five.”
The researchers assessed the precision and then validated the accuracy, noting that they had achieved a 99.9% accurate assembly using only nanopore data, a first for the human genome. Further, they generated chromosome-level scaffolds for these polished assemblies using Hi-C sequencing data.
Research scientist and co-author Karen Miga, who is directing the Data Production Center at UCSC for the Human Pangenome Project, points out the significance of the team’s achievements in improved accuracy. “Our aim is not only to expand the diversity of the reference genome but also to resolve the hundreds of gaps that persist across the genome,” Miga explains. “Now that we can routinely include these uncharted regions, we have a truly complete assembly of a human genome, and we can begin to explore variations of unknown consequence.”