Search | Secretaria de Estado da Saúde

A Protein Classification Benchmark collection for machine learning.

Sonego, Paolo; Pacurar, Mircea; Dhir, Somdutta; Kertész-Farkas, Attila; Kocsor, András; Gáspári, Zoltán; Leunissen, Jack A M; Pongor, Sándor.

Nucleic Acids Res ; 35(Database issue): D232-6, 2007 Jan.

Article in English | MEDLINE | ID: mdl-17142240

ABSTRACT

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative and training/test sets in several ways. There is a total of 6405 classification tasks: 3297 on protein sequences, 3095 on protein structures and 13 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparisons of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
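To illustrate how an all vs. all distance matrix and a positive/negative, training/test subdivision can be combined, here is a minimal sketch of nearest-neighbor scoring evaluated with ROC AUC. It is not the collection's own code: the function names, the toy one-dimensional "distances" and the choice of AUC as the performance measure are assumptions made for this example.

```python
# Hypothetical sketch: nearest-neighbor scoring from a precomputed distance matrix.
# Names and data are illustrative; the benchmark's real file formats are not modeled.

def nearest_neighbor_scores(dist, train_idx, train_labels, test_idx):
    """Score each test item as (distance to nearest negative) - (distance to nearest positive).

    Higher scores mean the item lies closer to the positive training set.
    """
    scores = []
    for t in test_idx:
        d_pos = min(dist[t][j] for j, y in zip(train_idx, train_labels) if y == 1)
        d_neg = min(dist[t][j] for j, y in zip(train_idx, train_labels) if y == 0)
        scores.append(d_neg - d_pos)
    return scores


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy all vs. all matrix built from 1-D coordinates (stand-in for e.g. alignment-based distances).
coords = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]

train_idx, train_labels = [0, 1, 3, 4], [1, 1, 0, 0]  # +train and -train
test_idx, test_labels = [2, 5], [1, 0]                # +test and -test

scores = nearest_neighbor_scores(dist, train_idx, train_labels, test_idx)
auc = roc_auc(scores, test_labels)
```

Because the classifier only reads rows of the matrix, the same sketch would apply to any comparison method that yields a distance, whether sequence-based or structure-based.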

Subjects

Artificial Intelligence , Databases, Protein , Proteins/classification , Algorithms , Internet , Protein Structure, Tertiary , Proteins/chemistry , Reproducibility of Results , Sequence Analysis, Protein , User-Computer Interface

Benchmarking protein classification algorithms via supervised cross-validation.

Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor.

J Biochem Biophys Methods ; 70(6): 1215-23, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-17604112

ABSTRACT

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates of how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database, has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced-size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with the comparison algorithms BLAST, Smith-Waterman and Needleman-Wunsch, as well as the 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic, estimates of the classifier performance than do random cross-validation schemes.
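The idea of supervised cross-validation can be sketched as follows: for a target group (for example a superfamily), each known subtype (for example a family) is held out in turn as the positive test set, while the remaining subtypes form the positive training set. This is a minimal illustration, not the authors' procedure. The record layout, the function name and the rule that alternates whole negative groups between train and test are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sketch of supervised cross-validation over a two-level hierarchy.
# Each record is (item_id, group, subtype), e.g. (domain id, superfamily, family).

def supervised_splits(records, target):
    """Yield one train/test split per subtype of the target group."""
    pos_by_subtype = defaultdict(list)
    neg_by_group = defaultdict(list)
    for item_id, group, subtype in records:
        if group == target:
            pos_by_subtype[subtype].append(item_id)
        else:
            neg_by_group[group].append(item_id)

    # Assumption: whole negative groups alternate between train and test,
    # so that test negatives also come from groups unseen in training.
    neg_train, neg_test = [], []
    for i, g in enumerate(sorted(neg_by_group)):
        (neg_train if i % 2 == 0 else neg_test).extend(neg_by_group[g])

    for held_out in sorted(pos_by_subtype):
        pos_train = [x for s in sorted(pos_by_subtype) if s != held_out
                     for x in pos_by_subtype[s]]
        if not pos_train:
            continue  # a group with a single subtype cannot be split this way
        yield {
            "held_out": held_out,
            "pos_train": pos_train,
            "pos_test": list(pos_by_subtype[held_out]),
            "neg_train": list(neg_train),
            "neg_test": list(neg_test),
        }


# Toy data: superfamily sfA with three families, plus two unrelated superfamilies.
records = [
    ("p1", "sfA", "fam1"), ("p2", "sfA", "fam1"),
    ("p3", "sfA", "fam2"), ("p4", "sfA", "fam3"),
    ("n1", "sfB", "fam4"), ("n2", "sfC", "fam5"),
]
splits = list(supervised_splits(records, "sfA"))
```

Unlike a random k-fold split, no member of the held-out subtype ever appears in training, which is why such splits tend to give lower performance estimates.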

Subjects

Algorithms , Proteins/analysis , Proteins/classification , Proteins/chemistry , Sequence Analysis, Protein

Detecting atypical examples of known domain types by sequence similarity searching: the SBASE domain library approach.

Dhir, Somdutta; Pacurar, Mircea; Franklin, Dino; Gáspári, Zoltán; Kertész-Farkas, Attila; Kocsor, András; Eisenhaber, Frank; Pongor, Sándor.

Curr Protein Pept Sci ; 11(7): 538-49, 2010 Nov.

Article in English | MEDLINE | ID: mdl-20887262

ABSTRACT

SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5: 39-42, 1992; Pongor et al., Nucl. Acids. Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences, the SBASE domain library, and standard similarity search algorithms, followed by postprocessing based on simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types, which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within the project.

Subjects

Data Mining , Databases, Protein , Proteins/chemistry , Algorithms , Humans , Neural Networks, Computer , Online Systems , Protein Structure, Tertiary , Proteins/classification , ROC Curve , Sequence Homology, Amino Acid
