Results 1 - 9 of 9
1.
Bioinformatics ; 31(12): i106-15, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072472

ABSTRACT

Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript levels. In proteogenomics, these multi-omics data are combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translations introduce an artificial sixfold increase of the target database, and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq-driven transcript databases. MSProGene is independent of existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference. We applied MSProGene to three datasets and show that it facilitates database-independent, reliable and accurate prediction at the gene and protein level and additionally identifies novel genes. AVAILABILITY AND IMPLEMENTATION: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/.
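
To illustrate why six-frame translation inflates a search database, here is a minimal Python sketch that translates one nucleotide sequence in all six reading frames using the standard genetic code. It is not taken from MSProGene; the function names are assumptions chosen for illustration, and unknown codons are rendered as "X".

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1..+3, -1..-3) to the translated sequence."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Every input sequence yields six candidate protein sequences, of which at most a few are biologically meaningful; this is the sixfold growth of the target database that the abstract refers to.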


Subjects
Gene Expression Profiling, Genomics/methods, Proteomics/methods, RNA Sequence Analysis, Algorithms, Animals, Bartonella/genetics, Genetic Databases, Filarioidea/genetics, Mass Spectrometry, Peptides/chemistry, Single Nucleotide Polymorphism, Proteins/chemistry, Proteins/genetics, Proteins/metabolism, Software
2.
BMC Genomics ; 16: 134, 2015 Feb 26.
Article in English | MEDLINE | ID: mdl-25766582

ABSTRACT

BACKGROUND: Gene prediction is a challenging but crucial part of most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or in an evidence-based manner with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free, and one method alone does not necessarily provide a complete set of accurate predictions. RESULTS: We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence-based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single-method predictions. CONCLUSION: We evaluate IPred in simulations and real data experiments on Escherichia coli and human data. We show that IPred improves the prediction accuracy in comparison to single-method predictions and to existing methods for prediction combination.
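
The general idea of combining prediction sets can be sketched as follows. This is only an assumed, simplified rule (keep an ab initio prediction when an evidence-based prediction covers most of it), not IPred's actual procedure; the 0.8 overlap threshold, the interval representation and the function names are hypothetical.

```python
def overlap_fraction(a, b):
    """Fraction of interval a covered by interval b; intervals are (start, end), end exclusive."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(0, shared) / (a[1] - a[0])


def supported_predictions(ab_initio, evidence, min_overlap=0.8):
    """Keep ab initio genes that at least one evidence-based gene covers sufficiently."""
    return [
        gene
        for gene in ab_initio
        if any(overlap_fraction(gene, hit) >= min_overlap for hit in evidence)
    ]


if __name__ == "__main__":
    ab_initio = [(100, 400), (1000, 1300), (2000, 2600)]
    evidence = [(120, 420), (2500, 3000)]
    print(supported_predictions(ab_initio, evidence))
```

A stricter threshold favors precision, a looser one favors recall; a real combiner would also need to handle strand, frame and exon structure.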


Subjects
Bacterial Genome, Human Genome, Software, Algorithms, Computational Biology, Escherichia coli, Expressed Sequence Tags, Humans, RNA Sequence Analysis
3.
Bioinformatics ; 30(1): 9-16, 2014 Jan 01.
Article in English | MEDLINE | ID: mdl-23685787

ABSTRACT

MOTIVATION: Accurate estimation, comparison and evaluation of read mapping error rates is a crucial step in the processing of next-generation sequencing data, as further analysis steps and interpretation assume the correctness of the mapping results. Current approaches are either focused on sensitivity estimation and thereby disregard specificity or are based on read simulations. Although continuously improving, read simulations are still prone to introduce a bias into the mapping error quantitation and cannot capture all characteristics of an individual dataset. RESULTS: We introduce ARDEN (artificial reference driven estimation of false positives in next-generation sequencing data), a novel benchmark method that estimates error rates of read mappers based on real experimental reads, using an additionally generated artificial reference genome. It allows a dataset-specific computation of error rates and the construction of a receiver operating characteristic curve. Thereby, it can be used for optimization of parameters for read mappers, selection of read mappers for a specific problem or for filtering alignments based on quality estimation. The use of ARDEN is demonstrated in a general read mapper comparison, a parameter optimization for one read mapper and an application example in single-nucleotide polymorphism discovery with a significant reduction in the number of false positive identifications. AVAILABILITY: The ARDEN source code is freely available at http://sourceforge.net/projects/arden/.
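
The following Python sketch shows how a receiver operating characteristic curve could be assembled once alignments against an artificial reference are treated as false positives. It is a simplified illustration under stated assumptions, not ARDEN's implementation: the score lists, the use of alignments to the real reference as a proxy for true positives, and the function name are all hypothetical.

```python
def roc_points(real_scores, artificial_scores):
    """Return (false positive rate, true positive rate) pairs, strictest threshold first.

    real_scores: alignment scores of reads mapped to the real reference
                 (assumed proxy for true positives).
    artificial_scores: alignment scores of reads mapped to the artificial
                 reference (assumed false positives).
    """
    thresholds = sorted(set(real_scores) | set(artificial_scores), reverse=True)
    points = []
    for threshold in thresholds:
        accepted_real = sum(score >= threshold for score in real_scores)
        accepted_artificial = sum(score >= threshold for score in artificial_scores)
        points.append(
            (
                accepted_artificial / len(artificial_scores),
                accepted_real / len(real_scores),
            )
        )
    return points


if __name__ == "__main__":
    for fpr, tpr in roc_points([60, 50, 40, 30], [35, 20]):
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Scanning the resulting points for the threshold with an acceptable false positive rate is one way such a curve could guide parameter choice or alignment filtering.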


Subjects
Genome, High-Throughput Nucleotide Sequencing/methods, Software, Algorithms, Amino Acid Sequence, Animals, Base Sequence, Caenorhabditis elegans, Single Nucleotide Polymorphism
4.
Bioinformatics ; 30(5): 606-13, 2014 Mar 01.
Article in English | MEDLINE | ID: mdl-24123675

ABSTRACT

MOTIVATION: The reliable identification of genes is a major challenge in genome research, as further analysis depends on the correctness of this initial step. With high-throughput RNA-Seq data reflecting currently expressed genes, a particularly meaningful source of information has become commonly available for gene finding. However, practical application in automated gene identification is still not the standard case. A particular challenge in including RNA-Seq data is the difficult handling of ambiguously mapped reads. RESULTS: We present GIIRA (Gene Identification Incorporating RNA-Seq data and Ambiguous reads), a novel prokaryotic and eukaryotic gene finder that is based exclusively on an RNA-Seq mapping and inherently includes ambiguously mapped reads. GIIRA extracts candidate regions supported by a sufficient number of mappings and reassigns ambiguous reads to their most likely origin using a maximum-flow approach. This avoids the exclusion of genes that are predominantly supported by ambiguous mappings. Evaluation on simulated and real data and comparison with existing methods incorporating RNA-Seq information highlight the accuracy of GIIRA in identifying the expressed genes. AVAILABILITY AND IMPLEMENTATION: GIIRA is implemented in Java and is available from https://sourceforge.net/projects/giira/.
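
As a rough illustration of how a maximum-flow formulation can place ambiguous reads, the Python sketch below builds a small network (source, reads, candidate regions, sink) and solves it with the Edmonds-Karp algorithm. The network layout, the per-region capacities and all names are assumptions made for this example; GIIRA's actual network and weighting may differ.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0  # ensure the reverse edge exists
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # breadth-first search
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


def assign_reads(read_hits, region_capacity):
    """read_hits: read -> candidate regions; region_capacity: region -> max reads (hypothetical)."""
    capacity = {"source": {("read", r): 1 for r in read_hits}}
    for read, regions in read_hits.items():
        capacity[("read", read)] = {("region", g): 1 for g in regions}
    for region, cap in region_capacity.items():
        capacity[("region", region)] = {"sink": cap}
    total, residual = max_flow(capacity, "source", "sink")
    assignment = {}
    for read, regions in read_hits.items():
        for region in regions:
            if residual[("read", read)][("region", region)] == 0:  # saturated edge
                assignment[read] = region
    return total, assignment


if __name__ == "__main__":
    hits = {"r1": ["A"], "r2": ["A", "B"], "r3": ["A", "B"]}
    print(assign_reads(hits, {"A": 2, "B": 1}))
```

In this toy network the uniquely mapped read r1 must go to region A, and the flow distributes the two ambiguous reads so that every read is placed without exceeding any region's capacity.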


Subjects
Gene Expression Profiling/methods, Genes, RNA Sequence Analysis/methods, Algorithms, Animals, Escherichia coli/genetics, Genomics, Humans, Saccharomyces cerevisiae/genetics, Sequence Alignment
5.
Bioinformatics ; 29(10): 1260-7, 2013 May 15.
Article in English | MEDLINE | ID: mdl-23589648

ABSTRACT

MOTIVATION: Genome coverage, the number of sequencing reads mapped to a position in a genome, is an insightful indicator of irregularities within sequencing experiments. While the average genome coverage is frequently used within algorithms in computational genomics, the complete information available in coverage profiles (i.e. histograms over all coverages) is currently not exploited to its full extent. Thus, biases such as fragmented or erroneous reference genomes often remain unaccounted for. Making this information accessible can improve the quality of sequencing experiments and quantitative analyses. RESULTS: We introduce a framework for fitting mixtures of probability distributions to genome coverage profiles. Besides commonly used distributions, we introduce distributions tailored to account for common artifacts. The mixture models are iteratively fitted based on the Expectation-Maximization algorithm. We introduce use cases with focus on metagenomics and develop new analysis strategies to assess the validity of a reference genome with respect to (meta-) genomic read data. The framework is evaluated on simulated data as well as applied to a large-scale metagenomic study, for which we compute the validity of 75 microbial genomes. The results indicate that the choice and quality of reference genomes are vital for metagenomic analyses and that validation of coverage profiles is crucial to avoid incorrect conclusions. AVAILABILITY: The code is freely available and can be downloaded from http://sourceforge.net/projects/fitgcp/. CONTACT: RenardB@rki.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
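
A minimal Python sketch of Expectation-Maximization for a coverage profile is given below. It fits only a two-component Poisson mixture, which is an assumption made for brevity; the framework described above also uses other distributions, and the initialization, iteration count and function names here are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing coverage k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_poisson_mixture(coverage, n_iter=50):
    """Fit a two-component Poisson mixture to per-position coverage values with EM."""
    mean = sum(coverage) / len(coverage)
    lams = [0.5 * mean, 1.5 * mean]  # crude initial guesses around the overall mean
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each coverage value.
        responsibilities = []
        for k in coverage:
            joint = [
                w * math.exp(poisson_logpmf(k, lam)) for w, lam in zip(weights, lams)
            ]
            total = sum(joint)
            responsibilities.append([j / total for j in joint])
        # M-step: re-estimate mixture weights and Poisson means.
        for c in range(2):
            mass = sum(r[c] for r in responsibilities)
            weights[c] = mass / len(coverage)
            weighted_sum = sum(r[c] * k for r, k in zip(responsibilities, coverage))
            lams[c] = max(weighted_sum / mass, 1e-9)  # keep the mean strictly positive
    return weights, lams


if __name__ == "__main__":
    low = [4, 5, 6, 5, 4, 6, 5, 5] * 5
    high = [28, 30, 32, 31, 29, 30] * 5
    print(fit_poisson_mixture(low + high))
```

On a profile with two clearly separated coverage levels, the two fitted means settle near those levels and the weights approximate the share of positions at each level, which is the kind of signal that could flag, for example, a region covered by reads from a second organism.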


Subjects
Algorithms, Bacteria/classification, Metagenomics, Bacteria/genetics, Bacteria/isolation & purification, Gastrointestinal Tract/microbiology, Genome, Bacterial Genome, Humans, Probability, DNA Sequence Analysis/methods
6.
Mol Cell Proteomics ; 11(7): M111.014167, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22493179

ABSTRACT

Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms remain without reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their restricted reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation based on this statistical criterion to automatically balance the information of each single spectrum and the database, while limiting the run time. We show that BICEPS performs as well as current database search algorithms when such algorithms are applied to sequenced organisms, whereas BICEPS only uses a remotely related organism database. For instance, we use a chicken instead of a human database corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695-716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a filarial nematode sample of Litomosoides sigmodontis.
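
For readers unfamiliar with the Bayesian information criterion, the Python sketch below computes the standard form, BIC = k ln(n) - 2 ln(L), and uses it to choose among candidate explanations of one spectrum. How BICEPS defines the likelihood, the number of parameters and the number of observations is not stated in the abstract, so the candidate values and names below are purely hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_candidate(candidates, n_obs):
    """candidates: list of (name, log_likelihood, n_params); return the name with lowest BIC."""
    return min(candidates, key=lambda c: bic(c[1], c[2], n_obs))[0]


if __name__ == "__main__":
    # Hypothetical candidates: allowing more substitutions improves the fit
    # but is penalized as additional model complexity.
    candidates = [
        ("exact match", -40.0, 0),
        ("one substitution", -30.0, 1),
        ("two substitutions", -29.0, 2),
    ]
    print(best_candidate(candidates, n_obs=50))
```

The penalty term is what lets a criterion of this kind balance spectral fit against the number of tolerated errors per spectrum, rather than relying on one global cutoff.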


Subjects
Chickens/genetics, Filarioidea/genetics, Peptides/chemistry, Proteomics/methods, Software, Algorithms, Amino Acid Sequence, Animals, Bayes Theorem, Biological Evolution, Protein Databases, Humans, Internet, Mass Spectrometry, Molecular Sequence Data, Reproducibility of Results, Protein Sequence Analysis
7.
Bioinformatics ; 27(13): i248-56, 2011 Jul 01.
Article in English | MEDLINE | ID: mdl-21685078

ABSTRACT

MOTIVATION: In systematic biology, one is often faced with the task of comparing different phylogenetic trees, in particular in multi-gene analysis or cospeciation studies. One approach is to use a tanglegram in which two rooted phylogenetic trees are drawn opposite each other, using auxiliary lines to connect matching taxa. There is an increasing interest in using rooted phylogenetic networks to represent evolutionary history, so as to explicitly represent reticulate events, such as horizontal gene transfer, hybridization or reassortment. Thus, the question arises how to define and compute a tanglegram for such networks. RESULTS: In this article, we present the first formal definition of a tanglegram for rooted phylogenetic networks and present a heuristic approach for computing one, called the NN-tanglegram method. We compare the performance of our method with existing tree tanglegram algorithms and also show a typical application to real biological datasets. For maximum usability, the algorithm does not require that the trees or networks are bifurcating or bicombining, or that they are on identical taxon sets. AVAILABILITY: The algorithm is implemented in our program Dendroscope 3, which is freely available from www.dendroscope.org. CONTACT: scornava@informatik.uni-tuebingen.de; huson@informatik.uni-tuebingen.de.
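
The quantity a tanglegram layout tries to keep small, the number of crossings among the connecting lines, can be computed for two fixed leaf orders as in the Python sketch below. This only evaluates a given layout and is not the NN-tanglegram heuristic itself; the function name is an assumption, and taxa missing from either side are simply ignored.

```python
def count_crossings(left_order, right_order):
    """Count crossing connector lines between two leaf orders.

    Two connectors cross exactly when their taxa appear in opposite relative
    order on the two sides, so the result is the number of inversions.
    """
    position = {taxon: i for i, taxon in enumerate(right_order)}
    ranks = [position[taxon] for taxon in left_order if taxon in position]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


if __name__ == "__main__":
    print(count_crossings(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
```

A layout algorithm would search over the leaf orders permitted by the two trees or networks (for instance, by rotating subtrees around internal nodes) and keep the pair of orders with the fewest crossings.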


Subjects
Biological Evolution, Computational Biology/methods, Phylogeny, Algorithms, Animals, Biology/methods, Computer Simulation, Ficus/genetics, Horizontal Gene Transfer, Genetic Hybridization, Wasps/genetics
8.
Methods Mol Biol ; 1362: 247-64, 2016.
Article in English | MEDLINE | ID: mdl-26519182

ABSTRACT

Recent studies have demonstrated that mass spectrometry-based variant detection is feasible. Typically, either genomic variant databases or transcript data are used to construct customized target databases for the identification of single amino acid variants (SAAVs) in mass spectrometry data. However, both approaches require additional data to perform the identification of SAAVs. Here, we discuss the application of an error-tolerant peptide search engine such as BICEPS for identifying variants exclusively based on standard UniProt databases. Thereby, unnecessary and redundant extensions of the search space are avoided. The workflow provides an unbiased view of the data; the search space is not limited to known variants and simultaneously does not require additional data. In a subsequent step, a second identification search is performed to verify the initially identified variant peptides and aggregate information on the protein level.
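
At the sequence level, tolerating a single amino acid variant amounts to accepting peptides that differ from a database protein at exactly one position. The Python sketch below shows that comparison only; a real error-tolerant search engine works from fragment spectra rather than from known peptide strings, and the sequences and function name here are hypothetical.

```python
def single_substitution_hits(peptide, protein):
    """Find placements where peptide differs from protein by exactly one residue.

    Returns a list of (protein_position, reference_residue, variant_residue)
    with zero-based positions.
    """
    hits = []
    length = len(peptide)
    for start in range(len(protein) - length + 1):
        window = protein[start:start + length]
        mismatches = [
            (i, window[i], peptide[i]) for i in range(length) if window[i] != peptide[i]
        ]
        if len(mismatches) == 1:
            offset, reference, variant = mismatches[0]
            hits.append((start + offset, reference, variant))
    return hits


if __name__ == "__main__":
    # Hypothetical protein and an observed peptide carrying an A-to-V substitution.
    print(single_substitution_hits("TVYIAK", "MKTAYIAKQR"))
```

Because the reference protein itself supplies the candidate positions, no separate variant database is needed, which mirrors the motivation given in the abstract.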


Subjects
Amino Acid Substitution, Computational Biology/methods, Protein Databases, Algorithms, HCT116 Cells, HeLa Cells, Humans, Reproducibility of Results, Software, Workflow
9.
PLoS Curr ; 6, 2014 Jun 16.
Article in English | MEDLINE | ID: mdl-24987574

ABSTRACT

While initial phylogenetic analyses concluded that Guinea 2014 EBOV falls outside the Zaïre lineage (ZEBOV), a recent re-analysis of the same dataset by Dudas and Rambaut (2014) suggested that Guinea 2014 EBOV actually is ZEBOV. Under the same hypothesis as used by these authors (the molecular clock hypothesis), we reinforce their conclusion by providing a statistical assessment of the location of the root of the Zaïre lineage. Our analysis unambiguously supports Guinea 2014 EBOV as a member of the Zaïre lineage. In addition, we show that some uncertainty exists as to the location of the root of the genus Ebolavirus. We release the software we used for these re-analyses. RootAnnotator allows for the easy determination of branch root posterior probability from any posterior sample of clocked trees and is freely available at http://sourceforge.net/projects/rootannotator/.
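
The idea of a root posterior probability can be sketched in Python as the fraction of sampled rooted trees that place the root on a given branch, where each root position is identified by the two sets of taxa it separates. This is an assumed simplification, not RootAnnotator's code; trees are represented as nested tuples of taxon names, and the function names are hypothetical.

```python
from collections import Counter


def leaves(node):
    """Return the set of taxon names below a node (a name or a tuple of child nodes)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def root_split(tree):
    """Identify the root position by the taxon sets on either side of the root."""
    return frozenset(leaves(child) for child in tree)


def root_posteriors(trees):
    """Fraction of sampled trees supporting each root position."""
    counts = Counter(root_split(tree) for tree in trees)
    return {split: count / len(trees) for split, count in counts.items()}


if __name__ == "__main__":
    sample = [
        (("A", "B"), ("C", "D")),
        (("B", "A"), ("D", "C")),
        ("A", ("B", ("C", "D"))),
        (("C", "D"), ("A", "B")),
    ]
    for split, probability in root_posteriors(sample).items():
        print(sorted(sorted(side) for side in split), probability)
```

In the toy sample, three of four trees root between {A, B} and {C, D}, so that branch receives a root posterior of 0.75.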
