Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 11 de 11
Filtrar
1.
Bioinformatics ; 32(17): i538-i544, 2016 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-27587672

RESUMO

MOTIVATION: As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterization of known genetic traits and causative disease variants within an individual, as well as the initial stage of many ancestral and population genomic pipelines (e.g. GWAS). RESULTS: We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci, which takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically uniquely identify loci in the human genome without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as low as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays. AVAILABILITY AND IMPLEMENTATION: LAVA software is available at http://lava.csail.mit.edu CONTACT: bab@mit.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Genótipo , Polimorfismo de Nucleotídeo Único , Alelos , Análise por Conglomerados , Genoma Humano , Humanos , Software
2.
PLoS Comput Biol ; 10(3): e1003502, 2014 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-24675685

RESUMO

As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2-5.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Ploidias , Algoritmos , Alelos , Animais , Teorema de Bayes , Biologia Computacional/métodos , Epigênese Genética , Peixes , Genes Fúngicos , Genoma , Genoma Humano , Genótipo , Haplótipos , Humanos , Funções Verossimilhança , Mutação , Plantas/genética , Polimorfismo de Nucleotídeo Único , Probabilidade , Reprodutibilidade dos Testes
3.
Genome Res ; 21(6): 840-9, 2011 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-21131385

RESUMO

Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.


Assuntos
Elementos Alu/genética , Variação Genética , Genoma Humano/genética , Sequência de Bases , População Negra/genética , Biologia Computacional/métodos , Genômica/métodos , Humanos , Dados de Sequência Molecular , Análise de Sequência de DNA/métodos
4.
BMC Genomics ; 14: 550, 2013 Aug 14.
Artigo em Inglês | MEDLINE | ID: mdl-23941359

RESUMO

BACKGROUND: Chimeric transcripts, including partial and internal tandem duplications (PTDs, ITDs) and gene fusions, are important in the detection, prognosis, and treatment of human cancers. RESULTS: We describe Barnacle, a production-grade analysis tool that detects such chimeras in de novo assemblies of RNA-seq data, and supports prioritizing them for review and validation by reporting the relative coverage of co-occurring chimeric and wild-type transcripts. We demonstrate applications in large-scale disease studies, by identifying PTDs in MLL, ITDs in FLT3, and reciprocal fusions between PML and RARA, in two deeply sequenced acute myeloid leukemia (AML) RNA-seq datasets. CONCLUSIONS: Our analyses of real and simulated data sets show that, with appropriate filter settings, Barnacle makes highly specific predictions for three types of chimeric transcripts that are important in a range of cancers: PTDs, ITDs, and fusions. High specificity makes manual review and validation efficient, which is necessary in large-scale disease studies. Characterizing an extended range of chimera types will help generate insights into progression, treatment, and outcomes for complex diseases.


Assuntos
Duplicação Gênica/genética , Perfilação da Expressão Gênica/métodos , Fusão Gênica/genética , Genômica , Neoplasias da Mama/genética , Éxons/genética , Humanos , Leucemia Mieloide Aguda/genética , Anotação de Sequência Molecular , RNA Mensageiro/genética , Estatística como Assunto
5.
Bioinformatics ; 28(12): i179-87, 2012 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-22689759

RESUMO

MOTIVATION: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention. In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs. RESULTS: We tested Dissect on simulated transcripts altered via structural events, as well as assembled RNA-Seq contigs from human prostate cancer cell line C4-2. Our results indicate that Dissect has high sensitivity and specificity in identifying structural alteration events in simulated transcripts as well as uncovering novel structural alterations in cancer transcriptomes. AVAILABILITY: Dissect is available for public use at: http://dissect-trans.sourceforge.net.


Assuntos
Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de RNA/métodos , Transcriptoma , Linhagem Celular Tumoral , Perfilação da Expressão Gênica/métodos , Humanos , Masculino , Modelos Teóricos , Neoplasias da Próstata , RNA/genética , Alinhamento de Sequência
6.
Bioinformatics ; 26(12): i350-7, 2010 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-20529927

RESUMO

UNLABELLED: Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present 'conflict resolution' improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507). AVAILABILITY: The implementation of algorithm is available at http://compbio.cs.sfu.ca/strvar.htm. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Elementos de DNA Transponíveis , Genômica/métodos , Genoma
7.
Nat Commun ; 11(1): 4662, 2020 09 16.
Artigo em Inglês | MEDLINE | ID: mdl-32938926

RESUMO

Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling using reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10×  faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.


Assuntos
Desequilíbrio Alélico , Haplótipos , Análise de Sequência de RNA , Algoritmos , Bases de Dados Genéticas , Diploide , Humanos , Células K562 , Modelos Genéticos , Modelos Estatísticos , Polimorfismo de Nucleotídeo Único , Poliploidia , RNA-Seq , Análise de Sequência de RNA/métodos , Análise de Sequência de RNA/estatística & dados numéricos
8.
Res Comput Mol Biol ; 8394: 385-399, 2014 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-28825060

RESUMO

It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. AVAILABILITY: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.

SELEÇÃO DE REFERÊNCIAS
Detalhe da pesquisa