Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 3 de 3
Filtrar
1.
Genome Res ; 19(11): 2133-43, 2009 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-19564452

RESUMEN

We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.


Asunto(s)
Algoritmos , Caenorhabditis elegans/genética , Biología Computacional/métodos , Genoma de los Helmintos/genética , Animales , Inteligencia Artificial , Caenorhabditis/clasificación , Caenorhabditis/genética , Genes de Helminto/genética , Genómica/métodos , Sitios de Empalme de ARN , Reproducibilidad de los Resultados , Reacción en Cadena de la Polimerasa de Transcriptasa Inversa , Análisis de Secuencia de ADN , Sitio de Iniciación de la Transcripción
2.
Bioinformatics ; 24(13): i6-14, 2008 Jul 01.
Artículo en Inglés | MEDLINE | ID: mdl-18586746

RESUMEN

MOTIVATION: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts. RESULTS: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take the underlying correlation structure of k-mer features induced by overlaps of related k-mers into account. POIMs can be seen as a powerful generalization of sequence logos: they allow to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. AVAILABILITY: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Inteligencia Artificial , ADN/genética , Reconocimiento de Normas Patrones Automatizadas/métodos , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Secuencia de Bases , Datos de Secuencia Molecular
3.
BMC Bioinformatics ; 8 Suppl 10: S7, 2007.
Artículo en Inglés | MEDLINE | ID: mdl-18269701

RESUMEN

BACKGROUND: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. RESULTS: In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder. AVAILABILITY: Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at http://www.fml.mpg.de/raetsch/projects/splice.


Asunto(s)
Sitios de Empalme de ARN/genética , Algoritmos , Animales , Arabidopsis/genética , Brassicaceae/genética , Caenorhabditis elegans/genética , Drosophila melanogaster/genética , Predicción/métodos , Genómica/métodos , Humanos , Cadenas de Markov , Pez Cebra/genética
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA