Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 10 de 10
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
Genome Res ; 29(11): 1860-1877, 2019 11.
Artigo em Inglês | MEDLINE | ID: mdl-31628256

RESUMO

Available computational methods for tumor phylogeny inference via single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies including frequent allele dropout and variable sequence coverage may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions, and convergent evolution. In order to address such limitations, we introduce the optimal subperfect phylogeny problem which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives (due to allele dropout or variance in sequence coverage), false positives (due to read errors) among mutation calls, and the number of mutations that violate ISA (real or because of incorrect copy number estimation). We then describe a combinatorial formulation to solve this problem which ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and-as a first in tumor phylogeny reconstruction-a Boolean constraint satisfaction problem (CSP) and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA violating mutations. In contrast to the alternative methods, typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.


Assuntos
Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Neoplasias/genética , Filogenia , Análise de Célula Única/métodos , Humanos , Neoplasias/patologia
2.
Bioinformatics ; 36(12): 3703-3711, 2020 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-32259207

RESUMO

MOTIVATION: The ubiquitous abundance of circular RNAs (circRNAs) has been revealed by performing high-throughput sequencing in a variety of eukaryotes. circRNAs are related to some diseases, such as cancer in which they act as oncogenes or tumor-suppressors and, therefore, have the potential to be used as biomarkers or therapeutic targets. Accurate and rapid detection of circRNAs from short reads remains computationally challenging. This is due to the fact that identifying chimeric reads, which is essential for finding back-splice junctions, is a complex process. The sensitivity of discovery methods, to a high degree, relies on the underlying mapper that is used for finding chimeric reads. Furthermore, all the available circRNA discovery pipelines are resource intensive. RESULTS: We introduce CircMiner, a novel stand-alone circRNA detection method that rapidly identifies and filters out linear RNA sequencing reads and detects back-splice junctions. CircMiner employs a rapid pseudo-alignment technique to identify linear reads that originate from transcripts, genes or the genome. CircMiner further processes the remaining reads to identify the back-splice junctions and detect circRNAs with single-nucleotide resolution. We evaluated the efficacy of CircMiner using simulated datasets generated from known back-splice junctions and showed that CircMiner has superior accuracy and speed compared to the existing circRNA detection tools. Additionally, on two RNase R treated cell line datasets, CircMiner was able to detect most of consistent, high confidence circRNAs compared to untreated samples of the same cell line. AVAILABILITY AND IMPLEMENTATION: CircMiner is implemented in C++ and is available online at https://github.com/vpc-ccg/circminer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
RNA Circular , RNA , Sequência de Bases , RNA/genética , Splicing de RNA , Análise de Sequência de RNA
3.
Bioinformatics ; 35(1): 20-27, 2019 01 01.
Artigo em Inglês | MEDLINE | ID: mdl-30561550

RESUMO

Motivation: Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity. Results: We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint. Availability and implementation: lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA , Software , Biologia Computacional , Genoma Humano , Humanos
4.
Bioinformatics ; 34(4): 678-680, 2018 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-29045591

RESUMO

Summary: De novo genome assembly of next-generation sequencing data is a fundamental problem in bioinformatics. There are many programs that assemble small genomes, but very few can assemble whole human genomes. We present a new algorithm for parallel overlap graph construction, which is capable of assembling human genomes and improves upon the current state-of-the-art in genome assembly. Availability and implementation: SAGE2 is written in C ++ and OpenMP and is freely available (under the GPL 3.0 license) at github.com/lucian-ilie/SAGE2. Contact: ilie@uwo.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Animais , Humanos
5.
Bioinformatics ; 33(22): 3652-3654, 2017 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-29036421

RESUMO

MOTIVATION: Third-generation sequencing (TGS) platforms that generate long reads, such as PacBio and Oxford Nanopore technologies, have had a dramatic impact on genomics research. However, despite recent improvements, TGS reads suffer from high-error rates and the development of read correction methods is an active field of research. This motivates the need to develop tools that can evaluate the accuracy of noisy long reads correction tools. RESULTS: We introduce LRCstats, a tool that measures the accuracy of long reads correction tools. LRCstats takes advantage of long reads simulators that provide each simulated read with an alignment to the reference genome segment they originate from, and does not rely on a step of mapping corrected reads onto the reference genome. This allows for the measurement of the accuracy of the correction while being consistent with the actual errors introduced in the simulation process used to generate noisy reads. We illustrate the usefulness of LRCstats by analyzing the accuracy of four hybrid correction methods for PacBio long reads over three datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/cchauve/lrcstats. CONTACT: laseanl@sfu.ca or cedric.chauve@sfu.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Sequenciamento de Nucleotídeos em Larga Escala/normas
6.
Bioinformatics ; 32(17): i545-i551, 2016 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-27587673

RESUMO

MOTIVATION: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. RESULTS: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. AVAILABILITY AND IMPLEMENTATION: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap CONTACT: ehaghshe@sfu.ca or cedric.chauve@sfu.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Análise de Sequência de DNA , Biologia Computacional , Genoma , Linguagens de Programação , Alinhamento de Sequência , Software
7.
iScience ; 23(9): 101508, 2020 Sep 25.
Artigo em Inglês | MEDLINE | ID: mdl-32896768

RESUMO

[This corrects the article DOI: 10.1016/j.isci.2020.100883.].

8.
iScience ; 23(8): 101389, 2020 Aug 21.
Artigo em Inglês | MEDLINE | ID: mdl-32781410

RESUMO

Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.

9.
iScience ; 23(3): 100883, 2020 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-32109676

RESUMO

One of the remaining challenges to describing an individual's genetic variation lies in the highly heterogeneous and complex genomic regions that impede the use of classical reference-guided mapping and assembly approaches. Once such region is the Immunoglobulin heavy chain locus (IGH), which is critical for the development of antibodies and the adaptive immune system. We describe ImmunoTyper, the first PacBio-based genotyping and copy number calling tool specifically designed for IGH V genes (IGHV). We demonstrate that ImmunoTyper's multi-stage clustering and combinatorial optimization approach represents the most comprehensive IGHV genotyping approach published to date, through validation using gold-standard IGH reference sequence. This preliminary work establishes the feasibility of fine-grained genotype and copy number analysis using error-prone long reads in complex multi-gene loci and opens the door for in-depth investigation into IGHV heterogeneity using accessible and increasingly common whole-genome sequence.

10.
IEEE Trans Cybern ; 46(12): 2899-2910, 2016 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-26540725

RESUMO

Classification of temporal data sequences is a fundamental branch of machine learning with a broad range of real world applications. Since the dimensionality of temporal data is significantly larger than static data, and its modeling and interpreting is more complicated, performing classification and clustering on temporal data is more complex as well. Hidden Markov models (HMMs) are well-known statistical models for modeling and analysis of sequence data. Besides, ensemble methods, which employ multiple models to obtain the target model, revealed good performances in the conducted experiments. All these facts are a high level of motivation to employ HMM ensembles in the task of classification and clustering of time series data. So far, no effective classification and clustering method based on HMM ensembles has been proposed. Moreover, employing the limited existing HMM ensemble methods has trouble separating models of distinct classes as a vital task. In this paper, according to previous points a new framework based on HMM ensembles for classification and clustering is proposed. In addition to its strong theoretical background by employing the Rényi entropy for ensemble learning procedure, the main contribution of the proposed method is addressing HMM-based methods problem in separating models of distinct classes by considering the inverse emission matrix of the opposite class to build an opposite model. The proposed algorithms perform more effectively compared to other methods especially other HMM ensemble-based methods. Moreover, the proposed clustering framework, which derives benefits from both similarity-based and model-based methods, together with the Rényi-based ensemble method revealed its superiority in several measurements.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA