Search | BVS Regional Portal

Detecting atypical examples of known domain types by sequence similarity searching: the SBASE domain library approach.

Dhir, Somdutta; Pacurar, Mircea; Franklin, Dino; Gáspári, Zoltán; Kertész-Farkas, Attila; Kocsor, András; Eisenhaber, Frank; Pongor, Sándor.

Curr Protein Pept Sci ; 11(7): 538-49, 2010 Nov.

Article in English | MEDLINE | ID: mdl-20887262

ABSTRACT

SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5: 39-42, 1992; Pongor et al, Nucl. Acids. Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences - the SBASE domain library - and standard similarity search algorithms, followed by postprocessing based on simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types, which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within the project.

Subjects

Data Mining , Protein Databases , Proteins/chemistry , Algorithms , Humans , Computer Neural Networks , Online Systems , Tertiary Protein Structure , Proteins/classification , ROC Curve , Amino Acid Sequence Homology

Protein classification based on propagation of unrooted binary trees.

Kocsor, András; Busa-Fekete, Róbert; Pongor, Sándor.

Protein Pept Lett ; 15(5): 428-34, 2008.

Article in English | MEDLINE | ID: mdl-18537730

ABSTRACT

We present two efficient network propagation algorithms that operate on a binary tree, i.e., a sparse-edged substitute of an entire similarity network. TreeProp-N is based on passing increments between nodes while TreeProp-E employs propagation to the edges of the tree. Both algorithms improve protein classification efficiency.

Subjects

Algorithms , Computational Biology/methods , Proteins/classification , Protein Databases , Proteins/chemistry

ROC analysis: applications to the classification of biological sequences and 3D structures.

Sonego, Paolo; Kocsor, András; Pongor, Sándor.

Brief Bioinform ; 9(3): 198-209, 2008 May.

Article in English | MEDLINE | ID: mdl-18192302

ABSTRACT

ROC ('receiver operator characteristics') analysis is a visual as well as numerical method used for assessing the performance of classification algorithms, such as those used for predicting structures and functions from sequence data. This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison. We overview the available programs and provide evaluation guidelines for genomic/proteomic data, with particular regard to applications to large and heterogeneous databases used in bioinformatics.
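As a small illustration of the numerical side of ROC analysis, the sketch below computes the area under the ROC curve (AUC) from a list of scores and binary labels using the standard pairwise (Mann-Whitney) identity. The function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the review.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    scores: higher means "more likely positive".
    labels: 1 for positives, 0 for negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): two positives, two negatives.
print(roc_auc([0.9, 0.4, 0.7, 0.6], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of items, which is fine for small examples; a sort-based rank computation would be the usual choice for large databases.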

Subjects

Algorithms , Chemical Models , Molecular Models , ROC Curve , Sequence Alignment/methods , Sequence Analysis/methods , Software , Molecular Conformation

Benchmarking protein classification algorithms via supervised cross-validation.

Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor.

J Biochem Biophys Methods ; 70(6): 1215-23, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-17604112

ABSTRACT

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates of how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database, has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced-size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with the comparison algorithms BLAST, Smith-Waterman and Needleman-Wunsch, as well as the 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic, estimates of classifier performance than do random cross-validation schemes.
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
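The idea of supervised cross-validation can be sketched as follows: instead of splitting a class at random, one known subtype at a time is held out as the test set and the remaining subtypes form the training set. This is a minimal sketch under stated assumptions; the function name `supervised_splits`, the dictionary layout and the family names are hypothetical, and only the positive class is shown (negatives would be split along their own subtypes in the same way).

```python
def supervised_splits(members):
    """Yield (held_out, train, test) splits, holding out one subtype at a time.

    members: dict mapping a subtype name (e.g. a family within a
             superfamily) to the list of item ids belonging to it.
    """
    for held_out in sorted(members):
        test = list(members[held_out])
        # Train on every other subtype, so the test items belong to a
        # subtype the classifier has never seen.
        train = [item
                 for name in sorted(members) if name != held_out
                 for item in members[name]]
        if train and test:
            yield held_out, train, test


# Hypothetical superfamily with three families.
superfamily = {"famA": ["a1", "a2"], "famB": ["b1"], "famC": ["c1", "c2"]}
for held_out, train, test in supervised_splits(superfamily):
    print(held_out, train, test)
```

Because the test subtype never appears in training, performance estimates obtained this way are expected to be lower than those from random splits, which matches the observation reported in the abstract.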

Subjects

Algorithms , Proteins/analysis , Proteins/classification , Proteins/chemistry , Protein Sequence Analysis

Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities.

Busa-Fekete, Róbert; Kertész-Farkas, Attila; Kocsor, András; Pongor, Sándor.

J Biochem Biophys Methods ; 70(6): 1210-4, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-17689617

ABSTRACT

Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results - the area under curve (AUC) values - are differentially biased for the various protein classes that are highly different in size. We show that the bias can be compensated for by adjusting the length of the top list in a class-dependent fashion, so that the number of negatives within the top list will be equal to (or proportional to) the size of the positive class. Using this balanced protocol, the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top list. With the use of likelihood-ratio scoring (Kaján et al, Bioinformatics,22, 2865-2869, 2007), the bias caused by class imbalance can be further decreased.
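One possible reading of the class-dependent top list is sketched below: the ranked list is cut once the number of negatives seen equals `ratio` times the size of the positive class, and a truncated AUC is computed on that top list. The function name `balanced_top_list_auc`, the `ratio` parameter and the normalization (positives missing from the top list count as misses) are assumptions made for illustration; the abstract does not specify these details.

```python
def balanced_top_list_auc(scores, labels, ratio=1.0):
    """Truncated AUC on a top list whose length depends on class size.

    The list is cut after max_neg negatives, where max_neg is
    ratio * (number of positives), capped at the number of negatives.
    Ties in score are broken by input order (an assumption).
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    max_neg = min(n_neg, max(1, round(ratio * n_pos)))

    pos_seen = neg_seen = area = 0
    for _, y in ranked:
        if y == 1:
            pos_seen += 1
        else:
            neg_seen += 1
            area += pos_seen          # positives ranked above this negative
            if neg_seen == max_neg:   # class-dependent cutoff reached
                break
    return area / (max_neg * n_pos)


# Hypothetical ranking: 2 positives, 4 negatives, so the top list
# ends after the second negative.
print(balanced_top_list_auc([0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
                            [1, 0, 1, 0, 0, 0]))
```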

Subjects

Proteins/analysis , ROC Curve , Algorithms

A Protein Classification Benchmark collection for machine learning.

Sonego, Paolo; Pacurar, Mircea; Dhir, Somdutta; Kertész-Farkas, Attila; Kocsor, András; Gáspári, Zoltán; Leunissen, Jack A M; Pongor, Sándor.

Nucleic Acids Res ; 35(Database issue): D232-6, 2007 Jan.

Article in English | MEDLINE | ID: mdl-17142240

ABSTRACT

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.

Subjects

Artificial Intelligence , Protein Databases , Proteins/classification , Algorithms , Internet , Tertiary Protein Structure , Proteins/chemistry , Reproducibility of Results , Protein Sequence Analysis , User-Computer Interface

Application of a simple likelihood ratio approximant to protein sequence classification.

Kaján, László; Kertész-Farkas, Attila; Franklin, Dino; Ivanova, Neli; Kocsor, András; Pongor, Sándor.

Bioinformatics ; 22(23): 2865-9, 2006 Dec 01.

Article in English | MEDLINE | ID: mdl-17090576

ABSTRACT

MOTIVATION: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences. RESULTS: We used a simple LRA based on the maximal similarity (or minimal distance) scores of the two top ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel and compression based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
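A ratio-type score built from the two top ranking classes might look like the sketch below: for each class the maximal similarity of the query to any member is taken, and the best class is scored by the ratio of its maximum to that of the runner-up. The exact functional form used in the paper is not given in the abstract, so the plain ratio, the function name `lra_score` and the example scores are assumptions.

```python
def lra_score(similarities):
    """Score the best-matching class relative to the runner-up class.

    similarities: dict mapping a class name to the list of (positive)
                  similarity scores between the query and that class's members.
    Returns (best_class, ratio), where ratio >= 1 and larger values mean
    the best class stands out more clearly from the second best.
    """
    best = {name: max(vals) for name, vals in similarities.items() if vals}
    if len(best) < 2:
        raise ValueError("need scores for at least two classes")
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    (top_class, top), (_, runner_up) = ranked[0], ranked[1]
    if runner_up <= 0:
        raise ValueError("similarity scores must be positive")
    return top_class, top / runner_up


# Hypothetical alignment scores of one query against three classes.
print(lra_score({"kinase": [120.0, 80.0],
                 "globin": [40.0, 30.0],
                 "protease": [60.0]}))
```

For distance measures the same idea applies with minima instead of maxima and the ratio inverted.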

Subjects

Algorithms , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Computer Simulation , Likelihood Functions , Chemical Models , Molecular Models , Statistical Models , Molecular Sequence Data , Amino Acid Sequence Homology

Kalman filtering for disease-state estimation from microarray data.

Kelemen, János Z; Kertész-Farkas, Attila; Kocsor, András; Puskás, László G.

Bioinformatics ; 22(24): 3047-53, 2006 Dec 15.

Article in English | MEDLINE | ID: mdl-17065158

ABSTRACT

MOTIVATION: In this paper, we propose using the Kalman filter (KF) as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since this represents the functional relationships that govern tissue state. Failing to fulfil such requirements may result in biologically implausible class prediction models. Here, we show that employing the KF to remove noise (while retaining meaningful covariance and thus being able to estimate the underlying biological state from microarray measurements) yields linearly separable data suitable for most classification algorithms. RESULTS: We demonstrate the utility and performance of the KF as a robust disease-state estimator on publicly available binary and multi-class microarray datasets in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes we show that our filtered datasets also have an improved visualization capability.

Subjects

Tumor Biomarkers/analysis , Computer-Assisted Diagnosis/methods , Gene Expression Profiling/methods , Neoplasm Proteins/analysis , Neoplasms/diagnosis , Neoplasms/metabolism , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Humans , Reproducibility of Results , Sensitivity and Specificity , Systems Theory

Application of compression-based distance measures to protein sequence classification: a methodological study.

Kocsor, András; Kertész-Farkas, Attila; Kaján, László; Pongor, Sándor.

Bioinformatics ; 22(4): 407-12, 2006 Feb 15.

Article in English | MEDLINE | ID: mdl-16317070

ABSTRACT

MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaeal, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
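To make the notion of a compression-based distance concrete, the sketch below computes the widely used normalized compression distance between two sequences. Two substitutions should be noted: it uses Python's `zlib` (DEFLATE) as a stand-in compressor rather than the Lempel-Ziv variant or PPMZ used in the study, and the normalized compression distance formula is assumed here since the abstract does not state the formula. The sequences are made-up toy strings.

```python
import zlib


def compressed_size(text):
    """Length in bytes of the zlib-compressed ASCII text."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Small values mean the two
    sequences share a lot of compressible structure.
    """
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy sequences (hypothetical, not real proteins).
a = "MKVLAAGIVGLLLAQPSFA" * 20
b = "GSTDERNQHYWCPTIKML" * 20
print(ncd(a, a), ncd(a, b))
```

On short sequences the fixed overhead of the compressor dominates, which is one way to see the length dependence of CBM values mentioned in the abstract.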

Subjects

Algorithms , Data Compression/methods , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis
