Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

Developers

Konstantin Berlin, Sergey Koren, Jane M Landolin, Adam M Phillippy, et al.

Description of the technology

The proposed technology improves genome assembly methods so that they scale to large eukaryotic genomes, including those of single-celled organisms, plants and parts of the human genome. Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes.

The new technology introduces the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using locality-sensitive hashing, a probabilistic dimensionality reduction method for high-dimensional data. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. The assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and low-complexity sequences were assembled from CHM1 that fill gaps in the human GRCh38 reference (Genome Reference Consortium Human genome build 38). Thus, using MHAP and the Celera Assembler, single-molecule sequencing can produce near-complete de novo eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
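The core idea behind MinHash-based overlap detection can be illustrated with a minimal Python sketch. It is not the MHAP implementation: the function names, the k-mer size of 12, the sketch size of 128 and the use of seeded BLAKE2b as a family of hash functions are all assumptions chosen for illustration, and the read filtering, error handling and overlap refinement a real overlapper needs are left out. Each read is reduced to a short sketch of minimum hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=12, num_hashes=128):
    """Sketch a sequence as the minimum hash value under each hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [
        min(seeded_hash(km, seed) for km in kmer_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer set Jaccard."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def random_dna(length, rng):
    """Hypothetical helper: uniform random DNA string for the demo below."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)
    genome = random_dna(1500, rng)
    read_a = genome[:1000]          # overlaps read_b on positions 500..999
    read_b = genome[500:]
    unrelated = random_dna(1000, rng)

    sk_a = minhash_sketch(read_a)
    sk_b = minhash_sketch(read_b)
    sk_u = minhash_sketch(unrelated)
    print("overlapping reads:", estimate_jaccard(sk_a, sk_b))
    print("unrelated reads:  ", estimate_jaccard(sk_a, sk_u))
```

Because only the fixed-size sketches are compared, the cost of testing a pair of reads does not depend on read length, which is what makes this kind of approach attractive for all-against-all overlapping of long reads.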

Practical application

The MHAP algorithm proposed here can serve as a replacement for current long-read overlapping methods. Its sensitivity makes it well suited to assembling haploid or inbred genomes, as well as outbred genomes.

In addition to SMRT sequencing, MHAP should be suitable for assembling nanopore reads, which are expected to have similar read lengths and error characteristics. It could also be applied to reference alignment, sequence clustering and alignment-free distance estimation. The technology is a step toward the fast and sensitive methods needed both for long-read overlapping and for coping with the ever-expanding scale of genomic data across applications.

Laboratories

  • Department of Chemistry and Biochemistry, University of Maryland, College Park (USA)
  • Institute for Advanced Computer Studies, University of Maryland, College Park (USA)
  • Invincea Labs, Arlington (USA)
  • National Biodefense Analysis and Countermeasures Center, Frederick (USA)
  • Pacific Biosciences of California, Inc., Menlo Park (USA)

Links

http://www.nature.com/nbt/journal/v33/n6/full/nbt.3238.html

Publications

  • Berlin, K. et al. "Assembling large genomes with single-molecule sequencing and locality-sensitive hashing." Nat. Biotechnol. 33.6 (2015): 623–630.
  • Koren, S. & Phillippy, A.M. "One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly." Curr. Opin. Microbiol. 23 (2015): 110–120.
  • Koren, S. et al. "Reducing assembly complexity of microbial genomes with single-molecule sequencing." Genome Biol. 14 (2013): R101.