Search | BVS IEC

LevioSAM: fast lift-over of variant-aware reference alignments.

Mun, Taher; Chen, Nae-Chyun; Langmead, Ben.

Bioinformatics ; 37(22): 4243-4245, 2021 11 18.

Article in English | MEDLINE | ID: mdl-34037690

ABSTRACT

MOTIVATION: As more population genetics datasets and population-specific references become available, the task of translating ('lifting') read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck. RESULTS: LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM is more than 7 times faster than an aligner when both are run with 16 threads. AVAILABILITY AND IMPLEMENTATION: Software Package: https://github.com/alshai/levioSAM, Experiments: https://github.com/langmead-lab/levioSAM-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
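The core idea of lifting a coordinate across VCF-style variants can be sketched as follows. This is an illustrative sketch only, not levioSAM's implementation: it uses sorted Python lists and binary search where the abstract describes succinct data structures, it lifts a single position rather than a full SAM record, and the function names `apply_variants`, `build_lift_table`, and `lift` are hypothetical. Variants are assumed to be sorted, non-overlapping `(ref_pos, ref_allele, alt_allele)` tuples with 0-based positions and non-empty alleles.

```python
from bisect import bisect_right


def apply_variants(ref, variants):
    """Build the variant-aware sequence by substituting each ALT allele."""
    pieces, cursor = [], 0
    for ref_pos, ref_allele, alt_allele in variants:
        pieces.append(ref[cursor:ref_pos])
        pieces.append(alt_allele)
        cursor = ref_pos + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces)


def build_lift_table(variants):
    """Record where each variant starts and ends in both coordinate systems."""
    table = {"alt_starts": [], "alt_ends": [], "ref_starts": [], "ref_ends": []}
    delta = 0  # cumulative length change introduced by earlier variants
    for ref_pos, ref_allele, alt_allele in variants:
        alt_start = ref_pos + delta
        table["alt_starts"].append(alt_start)
        table["alt_ends"].append(alt_start + len(alt_allele))
        table["ref_starts"].append(ref_pos)
        table["ref_ends"].append(ref_pos + len(ref_allele))
        delta += len(alt_allele) - len(ref_allele)
    return table


def lift(table, alt_pos):
    """Map a position on the variant-aware sequence to the standard reference."""
    i = bisect_right(table["alt_starts"], alt_pos) - 1
    if i < 0:
        return alt_pos  # before the first variant: coordinates coincide
    if alt_pos < table["alt_ends"][i]:
        # Inside an ALT allele: clamp to the last base of the REF allele.
        offset = alt_pos - table["alt_starts"][i]
        ref_len = table["ref_ends"][i] - table["ref_starts"][i]
        return table["ref_starts"][i] + min(offset, ref_len - 1)
    # After variant i: continue from the end of its REF allele.
    return table["ref_ends"][i] + (alt_pos - table["alt_ends"][i])


ref = "ACGTACGTAC"
variants = [(2, "G", "GTT"), (6, "GTA", "G")]  # one insertion, one deletion
alt = apply_variants(ref, variants)
table = build_lift_table(variants)
lifted = [lift(table, p) for p in range(len(alt))]
```

Positions outside the variant alleles lift to reference bases that carry the same nucleotide, which is a convenient sanity check for a table like this. A real lift-over tool also has to rewrite the CIGAR string, mate fields, and tags of each alignment, which this sketch omits.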

Subjects

High-throughput nucleotide sequencing , Software , DNA sequence analysis

Minimizing Reference Bias with an Impute-First Approach.

Vaddadi, Naga Sai Kavya; Mun, Taher; Langmead, Ben.

bioRxiv ; 2023 Dec 02.

Article in English | MEDLINE | ID: mdl-38076784

ABSTRACT

Pangenome indexes reduce reference bias in sequencing data analysis. However, a greater reduction in bias can be achieved using a personalized reference, e.g. a diploid human reference constructed to match a donor individual's alleles. We present a novel impute-first alignment framework that combines elements of genotype imputation and pangenome alignment. It begins by genotyping the individual from a subsample of the input reads. It next uses a reference panel and efficient imputation algorithm to impute a personalized diploid reference. Finally, it indexes the personalized reference and applies a read aligner, which could be a linear or graph aligner, to align the full read set to the personalized reference. This framework has higher variant-calling recall (99.54% vs. 99.37%), precision (99.36% vs. 99.18%), and F1 (99.45% vs. 99.28%) compared to a graph-based pangenome. The personalized reference is also smaller and faster to query compared to a pangenome index, making it an overall advantageous choice for whole-genome DNA sequencing experiments.
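The reported F1 values can be checked against the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The short sketch below does that arithmetic; the function name `f1_score` is hypothetical, and the inputs are the rounded percentages quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Rounded percentages quoted in the abstract.
impute_first_f1 = f1_score(precision=99.36, recall=99.54)
pangenome_f1 = f1_score(precision=99.18, recall=99.37)

print(f"impute-first F1: {impute_first_f1:.3f}")
print(f"graph pangenome F1: {pangenome_f1:.3f}")
```

Because the inputs are already rounded to two decimals, the recomputed values should be expected to agree with the reported 99.45% and 99.28% only to within about 0.01 percentage points.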

Pangenomic genotyping with the marker array.

Mun, Taher; Vaddadi, Naga Sai Kavya; Langmead, Ben.

Algorithms Mol Biol ; 18(1): 2, 2023 May 05.

Article in English | MEDLINE | ID: mdl-37147657

ABSTRACT

We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt available at https://github.com/alshai/rowbowt .

Pangenomic Genotyping with the Marker Array.

Mun, Taher; Vaddadi, Naga Sai Kavya; Langmead, Ben.

Algorithms Bioinform ; 242, 2022 Sep.

Article in English | MEDLINE | ID: mdl-36409181

ABSTRACT

We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods.

Reference flow: reducing reference bias using multiple population genomes.

Chen, Nae-Chyun; Solomon, Brad; Mun, Taher; Iyer, Sheila; Langmead, Ben.

Genome Biol ; 22(1): 8, 2021 01 04.

Article in English | MEDLINE | ID: mdl-33397413

ABSTRACT

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.

Subjects

Human genome , Metagenomics , Human chromosome pair 21 , Humans , Sequence alignment , DNA sequence analysis , Whole genome sequencing

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 500-513, 2020 04.

Article in English | MEDLINE | ID: mdl-32181684

ABSTRACT

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
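The two index components named above, rank over the BWT and access to the SA, can be illustrated with a deliberately naive sketch. It is not the construction method of the paper: rank is computed by scanning the string, the full suffix array is kept instead of a sample, and all function names are hypothetical. The text is assumed to end with a `$` terminator that sorts before every other character.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[k] is the character preceding the k-th smallest suffix."""
    return "".join(text[i - 1] for i in sa)  # i == 0 wraps to the terminator


def build_c_table(bwt):
    """C[ch] = number of characters in the text strictly smaller than ch."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def rank(bwt, ch, i):
    """Occurrences of ch in bwt[:i]; a real index answers this in O(1)."""
    return bwt[:i].count(ch)


def backward_search(bwt, c_table, pattern):
    """Return the half-open SA interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + rank(bwt, ch, lo)
        hi = c_table[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count_runs(bwt):
    """Number of maximal runs of equal characters in the BWT."""
    return sum(1 for i, ch in enumerate(bwt) if i == 0 or ch != bwt[i - 1])


text = "GATTACA" * 3 + "$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
c_table = build_c_table(bwt)
lo, hi = backward_search(bwt, c_table, "TTA")
occurrences = sorted(sa[lo:hi])  # SA access turns the interval into positions
runs = count_runs(bwt)
```

The interval width `hi - lo` gives the number of matches using rank alone, while reporting their positions needs SA access, which is why the size of the SA sample matters. On repetitive input the run count is typically much smaller than the text length, which is what run-length compression of the BWT exploits.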

Subjects

Genomics/methods , High-throughput nucleotide sequencing/methods , Sequence alignment/methods , Software , Algorithms , Human genome/genetics , Humans , DNA sequence analysis/methods

Matching Reads to Many Genomes with the r-Index.

Mun, Taher; Kuhnle, Alan; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 514-518, 2020 04.

Article in English | MEDLINE | ID: mdl-32181686

ABSTRACT

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align.

Subjects

Genome/genetics , Genomics , DNA sequence analysis/methods , Genetic databases , Humans , Sequence alignment/methods , Software

Prefix-free parsing for building big BWTs.

Boucher, Christina; Gagie, Travis; Kuhnle, Alan; Langmead, Ben; Manzini, Giovanni; Mun, Taher.

Algorithms Mol Biol ; 14: 13, 2019.

Article in English | MEDLINE | ID: mdl-31149025

ABSTRACT

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice, and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
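The parsing step alone, producing a dictionary D and a parse P from T in one pass, can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the window length `w`, modulus `p`, the toy polynomial hash, the sentinel characters, and all function names are hypothetical choices, and the construction of the BWT from D and P is not shown. A phrase ends whenever the hash of the last `w` characters is 0 modulo `p`, and consecutive phrases overlap by `w` characters.

```python
def window_hash(window):
    """Toy deterministic polynomial hash of a short string (an assumption)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_003
    return h


def prefix_free_parse(text, w=2, p=3):
    """Return (dictionary, parse) for text; text must not contain the sentinels."""
    padded = "\x02" + text + "\x01" * w  # hypothetical start and end sentinels
    phrases, start = [], 0
    for end in range(w, len(padded) + 1):
        window = padded[end - w:end]
        is_last = end == len(padded)
        # A trigger window closes the phrase, unless it is the window that opened it.
        if is_last or (end - w > start and window_hash(window) % p == 0):
            phrases.append(padded[start:end])
            start = end - w  # next phrase begins with this trigger window
    dictionary = sorted(set(phrases))
    rank_of = {phrase: i for i, phrase in enumerate(dictionary)}
    parse = [rank_of[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Rebuild the text by concatenating phrases and dropping each w-overlap."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]  # strip the sentinels


text = "GATTACAGATTACAGATTACAGATTTACA" * 20
dictionary, parse = prefix_free_parse(text)
dictionary_chars = sum(len(phrase) for phrase in dictionary)
roundtrip = reconstruct(dictionary, parse)
```

On a repetitive input such as this one, the same phrases recur, so the dictionary stays small while the parse grows only with the number of phrases. Comparing `dictionary_chars` and `len(parse)` with `len(text)` gives a rough feel for the size reduction the abstract describes.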
