Results 1 - 12 of 12
1.
Genome Res ; 19(11): 2133-43, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19564452

ABSTRACT

We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was demonstrated in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance at the nucleotide, exon, and transcript levels for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
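
The ranking criterion named in this abstract, the average of sensitivity and specificity, can be illustrated with a minimal Python sketch. It assumes the usual gene-finding convention, in which "specificity" means the fraction of predictions that are correct, TP / (TP + FP); the abstract does not spell out the definition, and the function names and counts below are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of annotated features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity_gene_finding(tp, fp):
    """Fraction of predicted features that are correct: TP / (TP + FP).

    Assumption: this is the convention common in gene-finding evaluations,
    not the TN-based definition used elsewhere in statistics.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_sn_sp(tp, fp, fn):
    """Average of sensitivity and specificity, the criterion named above."""
    return (sensitivity(tp, fn) + specificity_gene_finding(tp, fp)) / 2.0


if __name__ == "__main__":
    # Hypothetical exon-level counts, for illustration only.
    print(mean_sn_sp(tp=90, fp=10, fn=30))
```

The same function can be applied separately at the nucleotide, exon, and transcript levels, since only the definition of a "correct" feature changes between levels.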


Subjects
Algorithms, Caenorhabditis elegans/genetics, Computational Biology/methods, Helminth Genome/genetics, Animals, Artificial Intelligence, Caenorhabditis/classification, Caenorhabditis/genetics, Helminth Genes/genetics, Genomics/methods, RNA Splice Sites, Reproducibility of Results, Reverse Transcriptase Polymerase Chain Reaction, DNA Sequence Analysis, Transcription Initiation Site
2.
Nucleic Acids Res ; 37(Web Server issue): W312-6, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19494180

ABSTRACT

We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users can additionally train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web; it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).


Subjects
Genes, Genomics, Proteins/genetics, Software, Internet, RNA Splice Sites, DNA Sequence Analysis, Transcription Initiation Site
3.
Bioinformatics ; 24(13): i6-14, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586746

ABSTRACT

MOTIVATION: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a significant shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts. RESULTS: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. AVAILABILITY: All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Artificial Intelligence, DNA/genetics, Automated Pattern Recognition/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Base Sequence, Molecular Sequence Data
4.
PLoS Comput Biol ; 3(2): e20, 2007 Feb 23.
Article in English | MEDLINE | ID: mdl-17319737

ABSTRACT

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%-13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.


Subjects
Algorithms, Artificial Intelligence, Caenorhabditis elegans/genetics, Chromosome Mapping/methods, Genetic Databases, Automated Pattern Recognition/methods, DNA Sequence Analysis/methods, Animals, Base Sequence, Documentation/methods, Exons, Introns, Molecular Sequence Data, Sequence Alignment/methods
5.
BMC Bioinformatics ; 8 Suppl 10: S7, 2007.
Article in English | MEDLINE | ID: mdl-18269701

ABSTRACT

BACKGROUND: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. RESULTS: In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel, which turns out to be well suited for this task, as we illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder. AVAILABILITY: Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at http://www.fml.mpg.de/raetsch/projects/splice.
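
The weighted degree kernel mentioned in this abstract compares two equal-length sequences by counting k-mers that match at the same position, for every k up to a maximum degree. The Python sketch below is a naive illustration, not the authors' implementation: it assumes the weighting beta_d = 2(D - d + 1) / (D(D + 1)) commonly used with this kernel (the abstract does not state the weights), and the function name is hypothetical.

```python
def weighted_degree_kernel(x, y, degree=3):
    """Naive weighted degree kernel between two equal-length sequences.

    For each order d = 1..degree, counts the positions at which the
    length-d substrings of x and y are identical, and sums the counts
    with weights beta_d (assumed weighting, see the text above).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(1 for pos in range(n - d + 1)
                      if x[pos:pos + d] == y[pos:pos + d])
        total += beta * matches
    return total


if __name__ == "__main__":
    # Hypothetical windows around two candidate splice sites.
    print(weighted_degree_kernel("AAGGTAAGT", "CAGGTGAGT", degree=3))
```

This direct version costs on the order of degree times sequence length per kernel evaluation; genome-wide use would need a faster formulation, which is outside the scope of this sketch.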


Subjects
RNA Splice Sites/genetics, Algorithms, Animals, Arabidopsis/genetics, Brassicaceae/genetics, Caenorhabditis elegans/genetics, Drosophila melanogaster/genetics, Forecasting/methods, Genomics/methods, Humans, Markov Chains, Zebrafish/genetics
6.
Bioinformatics ; 22(14): e472-80, 2006 Jul 15.
Article in English | MEDLINE | ID: mdl-16873509

ABSTRACT

UNLABELLED: We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods. MOTIVATION: Among the most important features of genomic DNA are the protein-coding genes. While it is of great value to identify those genes and the encoded proteins, it is also crucial to understand how their transcription is regulated. To this end one has to identify the corresponding promoters and the transcription factor binding sites they contain. TSS finders can be used to locate potential promoters. They may also be used in combination with other signal and content detectors to resolve entire gene structures. RESULTS: We have developed a novel kernel-based method, called ARTS, that accurately recognizes transcription start sites in human. The application of otherwise too computationally expensive Support Vector Machines was made possible by efficient training and evaluation techniques using suffix tries. In a carefully designed experimental study, we compare our TSS finder to state-of-the-art methods from the literature: McPromoter, Eponine and FirstEF. For given false positive rates within a reasonable range, we consistently achieve considerably higher true positive rates. For instance, ARTS finds about 35% true positives at a false positive rate of 1/1000, whereas the other methods find about half as many (18%). AVAILABILITY: Datasets, model selection results, whole genome predictions, and additional experimental results are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/arts.
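
The comparison in this abstract is stated as a true positive rate at a fixed false positive rate. A minimal Python sketch of that measurement follows; the function name and the toy scores are hypothetical, and the threshold rule (strictly above the first disallowed negative score) is one reasonable choice rather than the evaluation protocol of the paper.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True positive rate at the strictest threshold whose FPR <= max_fpr.

    scores: classifier outputs (higher means more likely a true site).
    labels: 1 for true sites, 0 for decoys.
    """
    negatives = sorted((s for s, l in zip(scores, labels) if l == 0),
                       reverse=True)
    positives = [s for s, l in zip(scores, labels) if l == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative")
    allowed_fp = int(max_fpr * len(negatives))  # false positives tolerated
    if allowed_fp >= len(negatives):
        return 1.0  # every example may be called positive
    # Predict positive only above the first negative we may not admit.
    threshold = negatives[allowed_fp]
    true_positives = sum(1 for s in positives if s > threshold)
    return true_positives / len(positives)


if __name__ == "__main__":
    # Hypothetical scores for four true sites and four decoys.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    print(tpr_at_fpr(scores, labels, max_fpr=0.25))
```

At a false positive rate as low as 1/1000, a meaningful estimate needs many thousands of negative examples, since `allowed_fp` is zero whenever there are fewer than 1000 negatives.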


Subjects
Algorithms, DNA/genetics, RNA Polymerase II/genetics, DNA Sequence Analysis/methods, Software, Transcription Factors/genetics, Transcription Initiation Site, Base Sequence, Humans, Molecular Sequence Data, Sequence Alignment/methods
7.
BMC Bioinformatics ; 7 Suppl 1: S9, 2006 Mar 20.
Article in English | MEDLINE | ID: mdl-16723012

ABSTRACT

BACKGROUND: Support Vector Machines (SVMs) using a variety of string kernels have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy, they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence; it should also provide means to interpret its solution in order to gain biological insight. RESULTS: We propose novel and efficient algorithms for solving the so-called Support Vector Multiple Kernel Learning problem. The developed techniques can be used to understand the obtained support vector decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination. CONCLUSION: The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time, and reliably identifies a few statistically significant positions.
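
For orientation, the multiple kernel learning setting referred to in this abstract is usually written as a convex combination of M base kernels inside the SVM decision function. The notation below is the standard textbook form and is an assumption here; the abstract itself gives no formulas.

```latex
% Combined kernel: non-negative weights \beta_j that sum to one
k(x, x') = \sum_{j=1}^{M} \beta_j \, k_j(x, x'),
\qquad \beta_j \ge 0, \qquad \sum_{j=1}^{M} \beta_j = 1

% SVM decision function using the combined kernel
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, k(x_i, x) + b \right)
```

If each base kernel k_j looks at one sequence position or substring location, a sparse weight vector beta directly indicates which locations matter for discrimination, which is the kind of interpretation the abstract describes.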


Subjects
Computational Biology/methods, Algorithms, Amino Acid Motifs, Artificial Intelligence, Binding Sites, Cluster Analysis, Statistical Models, Neural Networks (Computer), Automated Pattern Recognition, Sequence Alignment, Protein Sequence Analysis, Software
8.
Sci Rep ; 6: 36671, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27892471

ABSTRACT

The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation into account in a mathematically well-controlled manner. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
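
The two-step structure described in this abstract (screen SNPs by SVM weight, then test only the survivors against a corrected threshold) can be sketched in a few lines of Python. This is a simplified illustration, not the COMBI implementation: it assumes the SVM weights and per-SNP p-values have already been computed elsewhere, and the function name, the parameter `k`, and the way the threshold is supplied are all hypothetical.

```python
def screen_then_test(svm_weights, p_values, k, threshold):
    """Two-step selection in the spirit of the abstract above.

    Step 1: keep the k SNPs with the largest absolute SVM weight.
    Step 2: among those, report SNPs whose p-value is below `threshold`.
    Assumption: weights, p-values and the corrected threshold are given;
    how the threshold is calibrated is not covered by this sketch.
    """
    if len(svm_weights) != len(p_values):
        raise ValueError("one weight and one p-value per SNP required")
    ranked = sorted(range(len(svm_weights)),
                    key=lambda i: abs(svm_weights[i]), reverse=True)
    candidates = ranked[:k]
    return sorted(i for i in candidates if p_values[i] < threshold)


if __name__ == "__main__":
    # Hypothetical values for five SNPs.
    weights = [0.1, -2.0, 0.5, 1.5, -0.05]
    pvals = [1e-9, 1e-6, 0.2, 0.03, 1e-8]
    print(screen_then_test(weights, pvals, k=2, threshold=1e-5))
```

Because fewer hypotheses reach the second step, the significance threshold can be corrected for a smaller number of tests than in position-by-position testing.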

10.
PLoS One ; 7(10): e42947, 2012.
Article in English | MEDLINE | ID: mdl-23118845

ABSTRACT

We provide a novel interpretation of the dual of support vector machines (SVMs) in terms of scatter with respect to class prototypes and their mean. As a key contribution, we extend this framework to multiple classes, providing a new joint Scatter SVM algorithm, at the level of its binary counterpart in the number of optimization variables. This enables us to implement computationally efficient solvers based on sequential minimal and chunking optimization. As a further contribution, the primal problem formulation is developed in terms of regularized risk minimization and the hinge loss, revealing the score function to be used in the actual classification of test patterns. We investigate Scatter SVM properties related to generalization ability, computational efficiency, sparsity and sensitivity maps, and report promising results.


Subjects
Algorithms, Theoretical Models, Support Vector Machine, Artificial Intelligence, Statistical Data Interpretation, Humans, Automated Pattern Recognition
11.
J Chem Inf Model ; 45(2): 249-53, 2005.
Article in English | MEDLINE | ID: mdl-15807485

ABSTRACT

In this article we report on a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see very high potential in such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process.


Subjects
Artificial Intelligence, Drug Design, Computer Simulation, Chemical Models
12.
Neural Comput ; 14(10): 2397-414, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12396568

ABSTRACT

Recently, Jaakkola and Haussler (1999) proposed a method for constructing kernel functions from probabilistic models. Their so-called Fisher kernel has been combined with discriminative classifiers such as support vector machines and applied successfully in, for example, DNA and protein analysis. Whereas the Fisher kernel is calculated from the marginal log-likelihood, we propose the TOP kernel derived from tangent vectors of posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing the TOP kernel. In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel.
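
The contrast drawn in this abstract can be made concrete with the usual definitions of the two feature maps. The notation below is an assumption based on the standard presentation of these kernels, since the abstract gives no formulas: theta-hat denotes the fitted parameters of the probabilistic model, p the number of parameters, and I the Fisher information matrix.

```latex
% Fisher kernel: gradient of the marginal log-likelihood
s(x) = \nabla_{\theta} \log p(x \mid \hat{\theta}),
\qquad
K_{\mathrm{Fisher}}(x, x') = s(x)^{\top} I^{-1} s(x')

% TOP kernel: posterior log-odds and its tangent vectors
v(x, \theta) = \log P(y = +1 \mid x, \theta) - \log P(y = -1 \mid x, \theta)

f(x) = \bigl( v(x, \hat{\theta}),\;
              \partial_{\theta_1} v(x, \hat{\theta}),\; \dots,\;
              \partial_{\theta_p} v(x, \hat{\theta}) \bigr)^{\top},
\qquad
K_{\mathrm{TOP}}(x, x') = f(x)^{\top} f(x')
```

The Fisher features depend only on how well the model explains x, whereas the TOP features are built from a quantity that involves the class label, which is why the abstract calls the TOP kernel discriminative.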


Subjects
Neurological Models, Statistical Models, Neural Networks (Computer), Proteins/classification