Results 1 - 8 of 8
1.
Nucleic Acids Res ; 38(10): 3149-58, 2010 Jun.
Article in English | MEDLINE | ID: mdl-20156993

ABSTRACT

DNA-binding proteins perform vital functions related to transcription, repair and replication. We have developed a new sequence-based machine learning protocol to identify DNA-binding proteins. We compare our method with an extensive benchmark of previously published structure-based machine learning methods as well as a standard sequence alignment technique, BLAST. Furthermore, we elucidate important feature interactions found in a learned model and analyze how specific rules capture general mechanisms that extend across DNA-binding motifs. This analysis is carried out using the malibu machine learning workbench available at http://proteomics.bioengr.uic.edu/malibu and the corresponding data sets and features are available at http://proteomics.bioengr.uic.edu/dna.


Subjects
Artificial Intelligence , DNA-Binding Proteins/chemistry , Protein Sequence Analysis , Protein Tertiary Structure , Sequence Alignment
2.
Article in English | MEDLINE | ID: mdl-19163538

ABSTRACT

malibu is an open-source machine learning workbench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments ideally run in a remote and/or command line environment. The software can be found at: http://proteomics.bioengr.uic.edu/malibu/index.html.


Subjects
Artificial Intelligence , Computational Biology/methods , Information Storage and Retrieval/methods , Algorithms , Database Management Systems , Factual Databases , Humans , Programming Languages , Software , User-Computer Interface
3.
Ann Biomed Eng ; 35(6): 1043-52, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17436108

ABSTRACT

A protein's function depends in large part on its interactions with other molecules. With an increasing number of protein structures becoming available every year, a structural annotation approach that identifies such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics, providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, we systematically compare several classification algorithms known to perform well. The boosted tree classifier is found to give the best performance, achieving 93% and 88% accuracy in discriminating non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.
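
Editorial sketch: the abstract reports that a boosted tree classifier performed best but gives no implementation details. As a rough illustration of boosting only, the following assumes AdaBoost over depth-1 trees (decision stumps) with labels in {+1, -1}. The function names and the toy data are hypothetical, and the paper's actual boosting variant and features may differ.

```python
import math

def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best threshold rule."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                # The stump predicts `polarity` when the feature is <= t, else the opposite.
                preds = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] <= t else -polarity

def adaboost(X, y, rounds=3):
    """Fit a weighted ensemble of stumps; y must contain +1 / -1 labels."""
    n = len(X)
    w = [1.0 / n] * n              # start with uniform sample weights
    model = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against log(…/0) for a perfect stump
        if err >= 0.5:              # no better than chance: stop adding stumps
            break
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Toy one-feature data (made up): the negative class sits in the middle,
# so no single threshold separates the classes.
X_toy = [[1], [2], [3], [4], [5], [6]]
y_toy = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X_toy, y_toy, rounds=3)
```

On this toy set a single stump misclassifies two of the six points, while the three-stump ensemble can combine thresholds to carve out the middle interval.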


Subjects
Algorithms , Artificial Intelligence , DNA-Binding Proteins/chemistry , Membrane Proteins/chemistry , Chemical Models , Automated Pattern Recognition/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Computer Simulation , DNA-Binding Proteins/classification , DNA-Binding Proteins/ultrastructure , Membrane Proteins/classification , Membrane Proteins/ultrastructure , Molecular Models , Molecular Sequence Data , Sequence Alignment/methods , Amino Acid Sequence Homology , Structure-Activity Relationship
4.
Bioinformatics ; 23(12): 1444-50, 2007 Jun 15.
Article in English | MEDLINE | ID: mdl-17384424

ABSTRACT

MOTIVATION: The rapid accumulation of single amino acid polymorphisms (SAPs), also known as non-synonymous single nucleotide polymorphisms (nsSNPs), creates both the opportunity and the need to understand and predict their disease association. Currently published attributes are limited and the detailed mechanisms governing the disease association of a SAP remain unclear; thus, further investigation of new attributes and improvement of the prediction are desired. RESULTS: A SAP dataset was compiled from the Swiss-Prot variant pages. We extracted and demonstrated the effectiveness of several new biologically informative attributes, including the structural neighbor profiles that describe the SAP's microenvironment, nearby functional sites that measure the structure-based and sequence-based distances between the SAP site and its nearby functional sites, aggregation properties that measure the likelihood of protein aggregation and disordered regions that consider whether the SAP is located in structurally disordered regions. The new attributes provided insights into the mechanisms of the disease association of SAPs. We built a support vector machine (SVM) classifier employing a carefully selected set of new and previously published attributes. Through a strict protein-level 5-fold cross-validation, we attained an overall accuracy of 82.61% and an MCC of 0.60. Moreover, a web server was developed to provide a user-friendly interface for biologists. AVAILABILITY: The web server is available at http://sapred.cbi.pku.edu.cn/
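
Editorial sketch: the abstract evaluates with a "strict protein-level 5-fold cross-validation" and reports accuracy and MCC (Matthews correlation coefficient). One plausible reading is that all SAPs from the same protein are kept in the same fold, so no protein contributes to both training and test data. The code below illustrates that reading together with the standard MCC formula. The function names, the greedy balancing strategy, and the protein identifiers are assumptions, not the authors' procedure.

```python
import math
from collections import defaultdict

def protein_level_folds(protein_ids, k=5):
    """Split sample indices into k folds so that each protein falls in exactly one fold."""
    # Group sample indices by their protein identifier.
    by_protein = defaultdict(list)
    for idx, pid in enumerate(protein_ids):
        by_protein[pid].append(idx)
    # Greedy balancing (an assumption): place the largest proteins first,
    # each into the fold that currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    for pid in sorted(by_protein, key=lambda p: (-len(by_protein[p]), p)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_protein[pid])
    return folds

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical example: nine variants drawn from six proteins.
example_ids = ["P1", "P1", "P2", "P3", "P3", "P3", "P4", "P5", "P6"]
example_folds = protein_level_folds(example_ids, k=3)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes differ greatly in size, which is presumably why the abstract reports both.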


Subjects
Amino Acid Sequence , Amino Acids/chemistry , Disease , Genetic Polymorphism , Amino Acid Sequence Homology , Artificial Intelligence , Conserved Sequence , Protein Databases , Disulfides/chemistry , Humans , Hydrogen Bonding , Protein Secondary Structure , Protein Tertiary Structure , Reproducibility of Results , Protein Sequence Analysis
5.
J Mol Biol ; 359(2): 486-95, 2006 Jun 02.
Article in English | MEDLINE | ID: mdl-16626739

ABSTRACT

Membrane-binding peripheral proteins play important roles in many biological processes, including cell signaling and membrane trafficking. Unlike integral membrane proteins, these proteins bind the membrane mostly in a reversible manner. Since peripheral proteins do not have canonical transmembrane segments, it is difficult to identify them from their amino acid sequences. As a first step toward genome-scale identification of membrane-binding peripheral proteins, we built a kernel-based machine learning protocol. Key features of known membrane-binding proteins, including electrostatic properties and amino acid composition, were calculated from their amino acid sequences and tertiary structures, which were then incorporated into the support vector machine to perform the classification. A data set of 40 membrane-binding proteins and 230 non-membrane-binding proteins was used to construct and validate the protocol. Cross-validation and holdout evaluation of the protocol showed that the accuracy of the prediction reached up to 93.7% and 91.6%, respectively. The protocol was applied to the prediction of membrane-binding properties of four C2 domains from novel protein kinases C. Although these C2 domains have 50% sequence identity, only one of them was predicted to bind the membrane, which was verified experimentally with surface plasmon resonance analysis. These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains and may be further extended to genome-scale identification of membrane-binding peripheral proteins.
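
Editorial sketch: amino acid composition is one of the features the abstract names as input to the support vector machine. A minimal way to compute it from a sequence, assuming the 20 standard one-letter residue codes and ignoring any other characters, might look like the following. The function name and the example sequence are hypothetical, and the electrostatic and structure-derived features mentioned in the abstract are not covered here.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; this fixes the feature order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue."""
    seq = sequence.upper()
    # Count only standard residues; gaps or ambiguity codes such as X are skipped.
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical short peptide, used only to show the call.
features = amino_acid_composition("MKKLLRRA")
```

The resulting fixed-length vector can be concatenated with other per-protein descriptors before being passed to a classifier.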


Subjects
Computational Biology , Membrane Proteins/chemistry , Protein Tertiary Structure , Amino Acid Sequence , Artificial Intelligence , Cell Membrane/metabolism , Protein Databases , Membrane Proteins/metabolism , Molecular Models , Theoretical Models , Reproducibility of Results , Protein Sequence Analysis , Surface Properties
6.
Nucleic Acids Res ; 33(20): 6486-93, 2005.
Article in English | MEDLINE | ID: mdl-16284202

ABSTRACT

DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs where the classifier is a Support Vector Machine (SVM). Information used for classification is derived from characteristics that include surface and overall composition, overall charge and positive potential patches on the protein surface. In total, 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In self-consistency testing, an accuracy of 100% was achieved. For cross-validation (CV) optimization over the entire dataset, we report an accuracy of 90%. Using leave 1-pair holdout evaluation, an accuracy of 86.3% was achieved. When we restrict the dataset to less than 20% sequence identity amongst the proteins, the holdout accuracy is 85.8%. Furthermore, seven DNA-BPs with unbounded structures are all correctly predicted. The current performances are better than results published previously. The higher accuracy achieved here originates from two factors: the ability of the SVM to handle features that demonstrate a wide range of discriminatory power and a different definition of the positive patch. Since our protocol does not lean on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to the known ones.
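
Editorial sketch: the abstract reports a higher accuracy in self-consistency (testing on the training data) than under cross-validation or holdout evaluation. The toy example below illustrates why the two can diverge. It deliberately swaps in a 1-nearest-neighbour classifier and leave-one-out evaluation as stand-ins, not the paper's SVM or its leave 1-pair holdout scheme, and the data are made up.

```python
def nearest_label(x, points, labels, skip=None):
    """Label of the stored point closest to x (squared Euclidean distance).

    `skip` optionally excludes one stored point by index, which is how the
    leave-one-out evaluation hides the query point from the classifier.
    """
    best_label, best_dist = None, float("inf")
    for i, (p, y) in enumerate(zip(points, labels)):
        if i == skip:
            continue
        d = sum((a - b) ** 2 for a, b in zip(x, p))
        if d < best_dist:
            best_label, best_dist = y, d
    return best_label

def self_consistency_accuracy(points, labels):
    """Accuracy when every point is classified with itself still in the training set."""
    hits = sum(nearest_label(x, points, labels) == y
               for x, y in zip(points, labels))
    return hits / len(points)

def leave_one_out_accuracy(points, labels):
    """Accuracy when each point is classified using all the other points only."""
    hits = sum(nearest_label(x, points, labels, skip=i) == y
               for i, (x, y) in enumerate(zip(points, labels)))
    return hits / len(points)

# Made-up one-dimensional data: the point at 6.0 carries label 1
# but lies closer to the label-0 cluster.
toy_points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (6.0,)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

train_acc = self_consistency_accuracy(toy_points, toy_labels)
loo_acc = leave_one_out_accuracy(toy_points, toy_labels)
```

Because each point is its own nearest neighbour, the self-consistency score here is perfect by construction, while the leave-one-out score drops as soon as one point sits nearer the other class. Held-out figures are therefore the more meaningful estimate of performance on new proteins.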


Subjects
Artificial Intelligence , DNA-Binding Proteins/chemistry , Amino Acids/chemistry , Computational Biology/methods , DNA-Binding Proteins/classification , Molecular Models , Reproducibility of Results , Static Electricity
7.
Int J Bioinform Res Appl ; 1(3): 319-34, 2005.
Article in English | MEDLINE | ID: mdl-18048139

ABSTRACT

Because of the relatively large gap between the number of known protein sequences and the number of known protein structures, the ability to construct a computational model predicting structure from sequence information has become an important area of research. The knowledge of a protein's structure is crucial in understanding its biological role. In this work, we present a support vector machine-based method for recognising a protein's fold from sequence information alone, where this sequence has little similarity to sequences of known structures. We have focused on improving multi-class classification, parameter tuning, descriptor design, and feature selection. The current implementation demonstrates better prediction accuracy than previous similar approaches, and has similar performance when compared with straightforward threading.


Subjects
Proteins , Support Vector Machine , Amino Acid Sequence , Proteins/chemistry , Protein Sequence Analysis , Software
8.
Conf Proc IEEE Eng Med Biol Soc ; 2004: 2885-8, 2004.
Article in English | MEDLINE | ID: mdl-17270880

ABSTRACT

Due to the relatively large gap in knowledge between gene identification and gene function, the ability to construct a computational model describing gene function from sequence information has become an important area of research. In order to understand the biological role of a specific gene, we require knowledge of the corresponding protein's structure and function. We present a support vector machine-based method for determining a protein's fold from sequence information alone, where this sequence has little similarity to sequences with known structures. We have focused on improving multiclass classification, parameter tuning, descriptor design, and feature selection. The current implementation showed better performance than previous similar approaches.
