Results 1 - 20 of 64
1.
PLoS Comput Biol ; 17(10): e1009463, 2021 10.
Article in English | MEDLINE | ID: mdl-34710081

ABSTRACT

Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.


Subjects
Crowdsourcing/methods , Gene Ontology , Molecular Sequence Annotation/methods , Computational Biology , Databases, Genetic , Humans , Proteins/genetics , Proteins/physiology
2.
IEEE/ACM Trans Comput Biol Bioinform ; 17(6): 1918-1931, 2020.
Article in English | MEDLINE | ID: mdl-30998480

ABSTRACT

As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties when combined with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application; the neural-network-based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
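
Of the five categories named in the abstract, binary encoding is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation: the function name `one_hot_encode`, the residue ordering, and the all-zero treatment of non-standard residues are assumptions made for illustration.

```python
# Hypothetical sketch of binary (one-hot) amino acid encoding.
# The residue order below is an arbitrary choice, not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 vector per residue of `sequence`.

    Residues outside the standard 20 (for example 'X') are mapped to an
    all-zero vector; this handling is an assumption of the sketch.
    """
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        position = INDEX.get(residue)
        if position is not None:
            vector[position] = 1
        encoded.append(vector)
    return encoded
```

Each residue becomes an independent vector, so the same encoding can feed residue-level predictors directly or be pooled for sequence-level ones.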


Subjects
Amino Acid Sequence/genetics , Amino Acids/chemistry , Computational Biology/methods , Proteins , Sequence Analysis, Protein/methods , Algorithms , Protein Folding , Protein Structure, Secondary/genetics , Proteins/chemistry , Proteins/genetics , Proteins/physiology
3.
Comput Biol Med ; 97: 113-123, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29715596

ABSTRACT

In this work we use complex network theory to provide a statistical model of the connectivity patterns of human proteins and their interaction partners. Our intention is to identify important proteins that may be potential candidates as drug targets for therapeutic interventions. Target proteins usually have more interaction partners than non-target proteins, but there are no hard-and-fast rules for defining the actual number of interactions. We devise a statistical measure for identifying hub proteins, and we score our target proteins with Gene Ontology annotations. The important druggable protein targets are likely to have similar biological functions that can be assessed for their potential therapeutic value. Our system provides a statistical analysis of the local and distant neighborhood protein interactions of the potential targets using complex network measures. This approach builds a more accurate model of drug-to-target activity and therefore of the likely impact on treating diseases. We integrate high-quality protein interaction data from the HINT database and disease-associated proteins from the DrugTarget database. Other sources include biological knowledge from Gene Ontology and drug information from DrugBank. The problem is a very challenging one since the data are highly imbalanced between target proteins and the more numerous non-targets. We use undersampling on the training data and build Random Forest classifier models, which are used to identify previously unclassified target proteins. We validate and corroborate these findings against the available literature.
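
The undersampling step mentioned in the abstract can be sketched as follows. This is an illustrative assumption about how random undersampling is commonly done, not the authors' code: the function name `undersample`, the fixed seed, and the choice to shrink every class to the size of the rarest one are all hypothetical. The Random Forest training that would follow is not shown.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly discard samples so that every class is as small as the rarest one.

    Returns a shuffled list of (sample, label) pairs. The balancing rule
    (match the minority class size) is an assumption of this sketch.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    if not by_class:
        return []

    # Size of the rarest class sets the per-class quota.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced
```

Applying the balancing only to the training split, as the abstract states, keeps the evaluation data at its natural class ratio.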


Subjects
Computational Biology/methods , Drug Discovery/methods , Models, Statistical , Protein Interaction Maps , Databases, Pharmaceutical , Databases, Protein , Humans , Protein Interaction Maps/drug effects , Protein Interaction Maps/physiology , Proteins/genetics , Proteins/physiology
4.
J Bioinform Comput Biol ; 13(5): 1543001, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26493682

ABSTRACT

Automated assignment of protein function has received considerable attention in recent years for genome-wide study. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome. Such large genomics data sets can only be annotated computationally. However, automated assignment of functions to unknown proteins is challenging due to its inherent difficulty and complexity. Previous studies have revealed that solving problems involving complicated objects with multiple semantic meanings using the multi-instance multi-label (MIML) framework is effective. For protein function prediction problems, each protein object in nature may be associated with distinct structural units (instances) and multiple functional properties (class labels), where each unit is described by an instance and each functional property is considered as a class label. Thus, it is convenient and natural to tackle the protein function prediction problem by using the MIML framework. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in the sparse transductive probability graph to seek a sparse steady-state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of an ensemble Hausdorff distance in the graph. Experimental results on seven real-world organism data sets covering three biological domains show that our proposed Sparse-Markov method is able to achieve better performance than four state-of-the-art MIML learning algorithms.
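
To make the distance between two multi-instance objects concrete, here is a minimal sketch of the classic (maximal) Hausdorff distance between two bags of feature vectors. It assumes Euclidean distance between instances and does not reproduce the ensemble of Hausdorff metrics used in the paper; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_hausdorff(bag_a, bag_b):
    """Largest distance from any instance in bag_a to its nearest instance in bag_b."""
    return max(min(euclidean(a, b) for b in bag_b) for a in bag_a)


def hausdorff(bag_a, bag_b):
    """Symmetric Hausdorff distance between two non-empty bags of instances."""
    return max(directed_hausdorff(bag_a, bag_b), directed_hausdorff(bag_b, bag_a))
```

In a multi-instance setting, each protein would be one bag and each structural unit one instance, so a small distance means every unit of one protein has a close counterpart in the other.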


Subjects
Markov Chains , Proteins/chemistry , Proteins/physiology , Supervised Machine Learning , Algorithms , Animals , Computational Biology , Databases, Protein/statistics & numerical data , Genome-Wide Association Study/statistics & numerical data , Proteins/genetics
5.
J Theor Biol ; 377: 10-24, 2015 Jul 21.
Article in English | MEDLINE | ID: mdl-25865524

ABSTRACT

In the protein function prediction (PFP) problem, the goal is to predict the functions of numerous well-sequenced known proteins whose functions are still not precisely known. PFP is a special and complex problem in the machine learning domain in which a protein (regarded as an instance) may have more than one function simultaneously. Furthermore, the functions (regarded as classes) are dependent and are organized in a hierarchical structure in the form of a tree or directed acyclic graph. One of the common learning methods proposed for solving this problem is decision trees, in which data are partitioned into sets with sharp boundaries, so that small changes in the attribute values of a new instance may cause an incorrect change in the predicted label of the instance and ultimately misclassification. In this paper, a Variance Reduction based Binary Fuzzy Decision Tree (VR-BFDT) algorithm is proposed to predict the functions of proteins. This algorithm fuzzifies only the decision boundaries instead of converting the numeric attributes into fuzzy linguistic terms. It can assign multiple functions to each protein simultaneously and preserves the hierarchy consistency between functional classes. It uses label variance reduction as the splitting criterion to select the best "attribute-value" at each node of the decision tree. The experimental results show that the overall performance of the proposed algorithm is promising.
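
The idea of fuzzifying a decision boundary, rather than the attributes themselves, can be sketched as a soft split. This is only an illustration under stated assumptions: the linear ramp, the `width` parameter, and the function name `soft_split` are hypothetical and are not taken from VR-BFDT.

```python
def soft_split(value, threshold, width):
    """Return (left_weight, right_weight) for a fuzzy test `value <= threshold`.

    Inside the band [threshold - width, threshold + width] the instance is
    shared between both branches along a linear ramp; outside it the split
    is crisp. The ramp shape is an assumption of this sketch.
    """
    if width <= 0:
        # Degenerate band: fall back to an ordinary crisp split.
        return (1.0, 0.0) if value <= threshold else (0.0, 1.0)
    right = (value - (threshold - width)) / (2 * width)
    right = min(1.0, max(0.0, right))
    return (1.0 - right, right)
```

With such a split, a value just above the threshold still sends part of its weight down the left branch, so a small change in an attribute shifts the prediction gradually instead of flipping it.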


Subjects
Decision Support Techniques , Models, Theoretical , Proteins/physiology , Algorithms , Animals , Fuzzy Logic , Machine Learning , Proteins/metabolism
6.
BMC Bioinformatics ; 15 Suppl 2: S9, 2014.
Article in English | MEDLINE | ID: mdl-24564855

ABSTRACT

BACKGROUND: Automated assignment of functions to unknown proteins is one of the most important tasks in computational biology. The development of experimental methods for genome-scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain large amounts of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase accuracy in protein function prediction. RESULTS: In this paper, we propose an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two distinct Markov chain classifiers to make separate predictions with regard to attribute features from protein data and relational features from relational information. The ICAM learning algorithm combines the results of the two classifiers to compute the ranks of labels, which indicate the importance of a set of labels to an instance, and uses an ICA framework to iteratively refine the learning models, improving the performance of protein function prediction from PPI networks when labeled data are scarce. CONCLUSION: Experimental results on real-world Yeast protein-protein interaction datasets show that our proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This approach can serve as a valuable tool for the study of protein function prediction from PPI networks.


Subjects
Algorithms , Protein Interaction Mapping/methods , Proteins/physiology , Markov Chains
7.
BMC Bioinformatics ; 15: 5, 2014 Jan 09.
Article in English | MEDLINE | ID: mdl-24405700

ABSTRACT

BACKGROUND: With the development of sequencing technologies, more and more sequence variants are available for investigation. Different classes of variants in the human genome have been identified, including single nucleotide substitutions, insertion and deletion, and large structural variations such as duplications and deletions. Insertion and deletion (indel) variants comprise a major proportion of human genetic variation. However, little is known about their effects on humans. The absence of understanding is largely due to the lack of both biological data and computational resources. RESULTS: This paper presents a new indel functional prediction method HMMvar based on HMM profiles, which capture the conservation information in sequences. The results demonstrate that a scoring strategy based on HMM profiles can achieve good performance in identifying deleterious or neutral variants for different data sets, and can predict the protein functional effects of both single and multiple mutations. CONCLUSIONS: This paper proposed a quantitative prediction method, HMMvar, to predict the effect of genetic variation using hidden Markov models. The HMM based pipeline program implementing the method HMMvar is freely available at https://bioinformatics.cs.vt.edu/zhanglab/hmm.


Subjects
Genetic Variation , Genome, Human/genetics , INDEL Mutation/genetics , INDEL Mutation/physiology , Computational Biology/methods , Genome, Human/physiology , Humans , Markov Chains , Models, Genetic , Models, Statistical , Proteins/genetics , Proteins/metabolism , Proteins/physiology , ROC Curve
8.
BMC Bioinformatics ; 14 Suppl 3: S15, 2013.
Article in English | MEDLINE | ID: mdl-23630983

ABSTRACT

The assignment of gene function remains a difficult but important task in computational biology. The establishment of the first Critical Assessment of Functional Annotation (CAFA) was aimed at increasing progress in the field. We present an independent analysis of the results of CAFA, aimed at identifying challenges in assessment and at understanding trends in prediction performance. We found that well-accepted methods based on sequence similarity (i.e., BLAST) have a dominant effect. Many of the most informative predictions turned out to be either recovering existing knowledge about sequence similarity or were "post-dictions" already documented in the literature. These results indicate that deep challenges remain in even defining the task of function assignment, with a particular difficulty posed by the problem of defining function in a way that is not dependent on either flawed gold standards or the input data itself. In particular, we suggest that using the Gene Ontology (or other similar systematizations of function) as a gold standard is unlikely to be the way forward.


Subjects
Molecular Sequence Annotation , Proteins/genetics , Proteins/physiology , Algorithms , Computational Biology/methods , Databases, Protein , Genes , Proteins/chemistry , Vocabulary, Controlled
9.
Nucleic Acids Res ; 41(Web Server issue): W242-8, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23685612

ABSTRACT

The PhyloFacts 'Fast Approximate Tree Classification' (FAT-CAT) web server provides a novel approach to ortholog identification using subtree hidden Markov model-based placement of protein sequences to phylogenomic orthology groups in the PhyloFacts database. Results on a data set of microbial, plant and animal proteins demonstrate FAT-CAT's high precision at separating orthologs and paralogs and robustness to promiscuous domains. We also present results documenting the precision of ortholog identification based on subtree hidden Markov model scoring. The FAT-CAT phylogenetic placement is used to derive a functional annotation for the query, including confidence scores and drill-down capabilities. PhyloFacts' broad taxonomic and functional coverage, with >7.3 M proteins from across the Tree of Life, enables FAT-CAT to predict orthologs and assign function for most sequence inputs. Four pipeline parameter presets are provided to handle different sequence types, including partial sequences and proteins containing promiscuous domains; users can also modify individual parameters. PhyloFacts trees matching the query can be viewed interactively online using the PhyloScope Javascript tree viewer and are hyperlinked to various external databases. The FAT-CAT web server is available at http://phylogenomics.berkeley.edu/phylofacts/fatcat/.


Subjects
Phylogeny , Proteins/classification , Software , Animals , Classification/methods , Internet , Markov Chains , Molecular Sequence Annotation , Proteins/genetics , Proteins/physiology , Sequence Analysis, Protein
10.
Jpn J Clin Oncol ; 43(4): 357-61, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23493743

ABSTRACT

Takeda's Oncology Discovery Strategy is tightly integrated and focused on first and fast-best-in-class products and product combinations. Core areas of expertise include hormones, protein homeostasis, biotherapeutics and signal transduction. Strategic imperatives for research success are understanding of unmet needs, focus on biological expertise in foundational areas of leadership and flexibility to adapt to new information.


Subjects
Antineoplastic Agents , Drug Industry , Biological Products/therapeutic use , California , Homeostasis/physiology , Hormones/therapeutic use , Japan , Proteins/physiology , Research
11.
Nucleic Acids Res ; 41(Database issue): D387-95, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23197656

ABSTRACT

TIGRFAMs, available online at http://www.jcvi.org/tigrfams is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.


Subjects
Databases, Protein , Proteins/classification , Proteins/genetics , Genome, Archaeal , Genome, Bacterial , Genomics/methods , Internet , Markov Chains , Molecular Sequence Annotation , Proteins/physiology , Sequence Alignment
13.
Article in English | MEDLINE | ID: mdl-21071803

ABSTRACT

In this paper, we propose a communication model of evolution and investigate its information-theoretic bounds. The process of evolution is modeled as the retransmission of information over a protein communication channel, where the transmitted message is the organism's proteome encoded in the DNA. We compute the capacity and the rate distortion functions of the protein communication system for the three domains of life: Archaea, Bacteria, and Eukaryotes. The tradeoff between the transmission rate and the distortion in noisy protein communication channels is analyzed. As expected, comparison between the optimal transmission rate and the channel capacity indicates that the biological fidelity does not reach the Shannon optimal distortion. However, the relationship between the channel capacity and rate distortion achieved for different biological domains provides tremendous insight into the dynamics of the evolutionary processes of the three domains of life. We rely on these results to provide a model of genome sequence evolution based on the two major evolutionary driving forces: mutations and unequal crossovers.


Subjects
Computational Biology/methods , Evolution, Molecular , Information Theory , Models, Biological , Proteins/physiology , Proteome/physiology , Signal Transduction/physiology , Algorithms , Crossing Over, Genetic , Markov Chains , Mutation , Poisson Distribution , Proteins/genetics , Proteins/metabolism , Proteome/genetics , Proteome/metabolism
14.
Article in English | MEDLINE | ID: mdl-19964780

ABSTRACT

Orthopedic tissue engineering strategies have developed rapidly in response to large and growing clinical needs. However, current clinical methods for replacement of natural tissue function have significant limitations, and pragmatic challenges have hindered clinical use of emerging tissue engineering approaches. In addition, current methods are not yet capable of achieving complex spatial and temporal regulation of soluble signaling (e.g. growth factor signaling), which may be required for complex, functional tissue regeneration. We have begun to develop a series of new medical devices, which are designed to temporally and spatially regulate growth factor and cytokine signaling during tissue regeneration. The initial goal of these studies is to regulate the behavior of multipotent stem cells, and to promote formation of clinically relevant tissue interfaces (e.g. bone-tendon interfaces). The ultimate goal is to further understand and recapitulate the complex processes that lead to functional musculoskeletal development and regeneration.


Subjects
Musculoskeletal Physiological Phenomena , Proteins/physiology , Tissue Engineering/methods , Bone and Bones/physiology , Cytokines/physiology , Growth Substances/physiology , Guided Tissue Regeneration , Humans , Kinetics , Ligaments, Articular/surgery , Middle Aged , Musculoskeletal Diseases/economics , Musculoskeletal Diseases/epidemiology , Patella/surgery , Regeneration/physiology , Rotator Cuff/surgery , Signal Transduction , Tendons/physiology , Tendons/surgery , United States/epidemiology
15.
Behav Brain Res ; 200(2): 295-303, 2009 Jun 25.
Article in English | MEDLINE | ID: mdl-19146884

ABSTRACT

Scents play key roles in mediating sexual behaviour in many vertebrates, both in the recognition of opposite sex conspecifics and in assessing the suitability of different individuals as potential mates. The recognition and assessment that underlies female attraction to male scents involves an important interaction between the main and accessory (vomeronasal) olfactory systems. Female mice gain information through the vomeronasal system on nasal contact with a scent source that is essential to stimulate attraction to an individual male's scent. Three highly polymorphic multigene families contribute involatile proteins and peptides to mouse scents that are detected through specific vomeronasal receptors during contact with scent. Major urinary proteins (MUPs) provide an individual genetic identity signature that underlies individual recognition and assessment of male competitive ability, kin recognition to avoid inbreeding, and genetic heterozygosity assessment. Familiar mates are recognised in the context of pregnancy block using MHC peptides, while exocrine-gland secreting peptides (ESPs) are likely to play additional roles in sexual assessment. By associating this involatile information in individual male scents, gained on initial scent contact, with the individual male's airborne volatile signature detected simultaneously through the main olfactory system, females subsequently recognise and are attracted by the individual male's airborne volatile signature alone. This allows much more rapid recognition of scents from familiar animals without requiring physical contact or processing through the vomeronasal system. Nonetheless, key information that induces attraction to a male's scent is held in involatile components detected through the vomeronasal system, allowing assessment of the genetic identity and attractiveness of each individual male.


Subjects
Choice Behavior/physiology , Major Histocompatibility Complex/physiology , Mating Preference, Animal/physiology , Odorants , Recognition, Psychology/physiology , Animals , Exocrine Glands/metabolism , Exocrine Glands/physiology , Female , Heterozygote , Inbreeding , Individuality , Male , Mice , Models, Biological , Olfactory Pathways/physiology , Proteins/physiology , Sexual Behavior, Animal/physiology , Vomeronasal Organ/physiology
16.
Comput Biol Chem ; 32(5): 375-7, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18684672

ABSTRACT

Pairwise comparison of sequence data is intensively used for automated functional protein annotation, while graphical models emerge as promising candidates for an integration of various heterogeneous features. We designed a model, termed hRMN that integrates different genomic features and implemented a variant of belief propagation for functional annotation transfer. hRMN allows the assignment of multiple functional categories while avoiding common problems in annotation transfer from heterogeneous datasets, such as an independency of the investigated datasets. We benchmarked this system with large-scale annotation transfer (based on the MIPS FunCat ontology) to proteins of the prokaryotes Bacillus subtilis, Helicobacter pylori, Listeria monocytogenes, and Listeria innocua. hRMN consistently outperformed two competitors in annotation of four bacterial genomes. The developed code is available for download at http://mips.gsf.de/proj/bfab/hRMN.html.


Subjects
Algorithms , Computational Biology/methods , Models, Statistical , Proteins/physiology , Amino Acid Sequence , Bacillus subtilis/genetics , Bayes Theorem , Genome, Bacterial/genetics , Helicobacter pylori/genetics , Internet , Listeria/genetics , Listeria monocytogenes/genetics , Markov Chains , Proteins/classification , Proteins/genetics , Reproducibility of Results , Sequence Alignment/methods , Software
17.
Curr Biol ; 18(8): 619-23, 2008 Apr 22.
Article in English | MEDLINE | ID: mdl-18424142

ABSTRACT

The role of individual genetic heterozygosity in mate choice is the subject of much current debate. Several recent studies have reported female preference for more heterozygous males, but the mechanisms underlying heterozygote preference remain largely unknown. Females could favor males that are more successful in intrasexual competition, but they could also assess male heterozygosity directly at specific polymorphic genetic markers. Here, we use a breeding program to remove the intrinsic correlation between genome-wide heterozygosity and two highly polymorphic gene clusters that could allow direct assessment of heterozygosity through scent in mice: the major histocompatibility complex (MHC) and the major urinary proteins (MUPs). When other sources of variation are controlled and intrasexual competition is minimized, female mice prefer to associate with MUP heterozygous over MUP homozygous males. MHC heterozygosity does not influence preference, and neither does heterozygosity across the rest of the genome when intrasexual competition between males is restricted. Female mice thus assess male heterozygosity directly through multiple MUP isoforms expressed in scent signals, independently of the effects of genome-wide heterozygosity on male competitiveness. This is the first evidence that animals may use signals of genetic heterozygosity that have no direct association with individual vigour.


Subjects
Heterozygote , Mating Preference, Animal/physiology , Proteins/physiology , Smell/physiology , Aggression/physiology , Animals , Female , Male , Mice , Proteins/genetics
18.
BMC Genomics ; 8: 222, 2007 Jul 09.
Article in English | MEDLINE | ID: mdl-17620139

ABSTRACT

BACKGROUND: Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to the creation and propagation of function assignment errors. Thus, it is important to perform a thorough, systematic analysis of the quality of sequence-based function assignment using large-scale data. RESULTS: We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity, measured by sequence identity or statistical significance of the alignment, and compared this correlation against that of randomly chosen protein pairs. CONCLUSION: Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within-genome versus between-genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.
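
Sequence identity, one of the two similarity measures named in the abstract, can be computed from a pairwise alignment as sketched below. Several conventions exist for the denominator; this sketch assumes identity is counted over columns where neither sequence has a gap, which may differ from the definition used in the study. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two already-aligned sequences of equal length.

    Columns containing a gap in either sequence are skipped, so the
    denominator is the number of gap-free columns (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For example, aligning `ACDE-F` with `ACDQYF` leaves five gap-free columns, four of which match, giving 80% identity under this convention.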


Subjects
Sequence Homology, Amino Acid , Structural Homology, Protein , Animals , Arabidopsis , Caenorhabditis elegans , Computational Biology , Databases, Protein , Drosophila melanogaster , Genome , Molecular Sequence Data , Proteins/chemistry , Proteins/physiology , Saccharomyces cerevisiae , Sequence Analysis, DNA , Structure-Activity Relationship
19.
Proteins ; 69 Suppl 8: 165-74, 2007.
Article in English | MEDLINE | ID: mdl-17654548

ABSTRACT

Here we present a full overview of the Critical Assessment of Protein Structure Prediction (CASP7) function prediction category. Predictions were submitted for Gene Ontology molecular function terms, Enzyme Commission numbers, and ligand binding site residues. The first two categories were difficult to assess because very little new functional information becomes available after the experiment. The majority of the known Gene Ontology terms and all the Enzyme Commission numbers were available a priori to predictors before the experiment, so prediction for these two categories was not blind. Nevertheless, for Gene Ontology terms we were able to demonstrate that some groups made better predictions than others. In the binding residue category, the predictors did not know in advance which ligands were bound and therefore blind evaluation was possible, but there were disappointingly few predictions in this category. After CASP 6 and 7 the need to organize a more effective blind function prediction category is obvious, even if it means focusing on binding site prediction as the only category that can be truly assessed in the CASP spirit.


Subjects
Computational Biology/methods , Protein Conformation , Algorithms , Binding Sites , Models, Molecular , Proteins/chemistry , Proteins/physiology , Structure-Activity Relationship
20.
BMC Bioinformatics ; 7: 405, 2006 Sep 06.
Article in English | MEDLINE | ID: mdl-16956412

ABSTRACT

BACKGROUND: A number of methods that use both protein structural and evolutionary information are available to predict the functional consequences of missense mutations. However, many of these methods break down if either one of the two types of data are missing. Furthermore, there is a lack of rigorous assessment of how important the different factors are to prediction. RESULTS: Here we use Bayesian networks to predict whether or not a missense mutation will affect the function of the protein. Bayesian networks provide a concise representation for inferring models from data, and are known to generalise well to new data. More importantly, they can handle the noisy, incomplete and uncertain nature of biological data. Our Bayesian network achieved comparable performance with previous machine learning methods. The predictive performance of learned model structures was no better than a naïve Bayes classifier. However, analysis of the posterior distribution of model structures allows biologically meaningful interpretation of relationships between the input variables. CONCLUSION: The ability of the Bayesian network to make predictions when only structural or evolutionary data was observed allowed us to conclude that structural information is a significantly better predictor of the functional consequences of a missense mutation than evolutionary information, for the dataset used. Analysis of the posterior distribution of model structures revealed that the top three strongest connections with the class node all involved structural nodes. With this in mind, we derived a simplified Bayesian network that used just these three structural descriptors, with comparable performance to that of an all node network.


Subjects
Bayes Theorem , Models, Biological , Mutation, Missense , Proteins/chemistry , Proteins/physiology , Algorithms , Amino Acids/chemistry , Amino Acids/genetics , Databases, Genetic , Markov Chains , Models, Statistical , Monte Carlo Method , Muramidase/chemistry , Muramidase/genetics , Probability , Protein Conformation , ROC Curve , Repressor Proteins/chemistry , Repressor Proteins/genetics , Structure-Activity Relationship