A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M

Affiliation

Jain C; 1 School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia.
Dilthey A; 2 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland.
Koren S; 2 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland.
Aluru S; 1 School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia.
Phillippy AM; 2 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland.

J Comput Biol; 25(7): 766-779, 2018 Jul.

Article in En | MEDLINE | ID: mdl-29708767

ABSTRACT

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but face limited scalability, while faster alignment-free methods typically trade precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows-Wheeler Aligner-MEM, with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
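
The abstract names two building blocks, minimizer sampling (winnowing) and MinHash-based identity estimation, without giving details. The Python sketch below illustrates the general idea only and is not the authors' implementation. Several pieces are assumptions introduced for illustration: the function names, the use of BLAKE2b as the k-mer hash, the toy parameters `k`, `w`, and `s`, and the Mash-style conversion from a Jaccard estimate `j` to identity, `1 + ln(2j / (1 + j)) / k`, which the abstract does not state.

```python
import hashlib
import math


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Winnowing: keep the smallest k-mer hash in each window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    # A sequence with fewer than w k-mers is treated as a single short window.
    n_windows = max(len(hashes) - w + 1, 1) if hashes else 0
    return {min(hashes[i:i + w]) for i in range(n_windows)}


def minhash_jaccard(a, b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    smallest = sorted(a | b)[:s]  # the s smallest hashes of the union
    if not smallest:
        return 0.0
    shared = sum(1 for h in smallest if h in a and h in b)
    return shared / len(smallest)


def jaccard_to_identity(j, k):
    """Assumed Mash-style conversion: identity = 1 + ln(2j / (1 + j)) / k."""
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    # Toy sequences; real long reads are thousands of bases long.
    read = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACA"
    region = "ACGTTGCATGCCGATAGCTAGCCTTAACGGATCCGATTACA"  # one substitution
    k, w, s = 5, 4, 10
    j = minhash_jaccard(minimizers(read, k, w), minimizers(region, k, w), s)
    print("estimated Jaccard:", j)
    print("estimated identity:", jaccard_to_identity(j, k))
```

In this sketch a read would be compared against candidate reference windows of similar length, and a mapping reported when the estimated identity clears a user-chosen threshold. How candidates are located and how the thresholds relate to the p-value and sensitivity estimates mentioned in the abstract are outside what this example covers.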

Key words

Jaccard; MinHash; long-read mapping; minimizers; sketching; winnowing

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Software / Genome, Human / High-Throughput Nucleotide Sequencing
Type of study: Prognostic studies
Limits: Humans
Language: En
Journal: J Comput Biol
Journal subject: Molecular Biology / Medical Informatics
Year: 2018
Document type: Article
Affiliation country: Georgia
Country of publication: United States
