Search | BVS IEC

Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

Nagy, Alinda; Hegyi, Hédi; Farkas, Krisztina; Tordai, Hedvig; Kozma, Evelin; Bányai, László; Patthy, László.

BMC Bioinformatics ; 9: 353, 2008 Aug 27.

Article in English | MEDLINE | ID: mdl-18752676

ABSTRACT

BACKGROUND: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.
RESULTS: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.
CONCLUSION: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
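
The logic behind routines (i) to (iii) in the abstract above can be illustrated with a minimal Python sketch. It is not the MisPred implementation: the `ProteinEntry` record, the `check_entry` function, the three domain-class labels and the exact rule conditions are all assumptions made for illustration, and the sketch presumes that domain classes, signal peptides and transmembrane segments have already been predicted by external tools. Routines (iv) and (v) are omitted because they need domain-length and genome-mapping data.

```python
from dataclasses import dataclass


@dataclass
class ProteinEntry:
    """Hypothetical record holding pre-computed annotations for one sequence."""
    name: str
    domain_classes: set  # any of "extracellular", "cytoplasmic", "nuclear"
    has_signal_peptide: bool = False
    has_transmembrane_segment: bool = False


def check_entry(entry: ProteinEntry) -> list:
    """Return labels of the conflict rules (i)-(iii) that the entry triggers.

    The conditions are simplified readings of the abstract, not the
    published routines.
    """
    conflicts = []
    classes = entry.domain_classes

    # (i) An extracellular domain implies export, so a secretory signal
    #     (here: signal peptide or transmembrane segment) is expected.
    if ("extracellular" in classes
            and not entry.has_signal_peptide
            and not entry.has_transmembrane_segment):
        conflicts.append("i")

    # (ii) Extracellular plus cytoplasmic domains require a
    #      membrane-spanning segment between them.
    if ({"extracellular", "cytoplasmic"} <= classes
            and not entry.has_transmembrane_segment):
        conflicts.append("ii")

    # (iii) Extracellular and nuclear domains are not expected to co-occur.
    if {"extracellular", "nuclear"} <= classes:
        conflicts.append("iii")

    return conflicts


if __name__ == "__main__":
    suspect = ProteinEntry("toy_entry", {"extracellular", "cytoplasmic"})
    print(suspect.name, check_entry(suspect))
```

An entry that triggers no rule is not thereby shown to be correct; the sketch only flags sequences whose annotated features contradict one another.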

Subjects

Database Management Systems; Databases, Protein; Information Storage and Retrieval/methods; Internet; Natural Language Processing; Proteins/classification; Terminology as Topic; Artifacts; Proteins/chemistry; Proteins/metabolism; Quality Control; Sequence Analysis, Protein/methods

Modules, multidomain proteins and organismic complexity.

Tordai, Hedvig; Nagy, Alinda; Farkas, Krisztina; Bányai, László; Patthy, László.

FEBS J ; 272(19): 5064-78, 2005 Oct.

Article in English | MEDLINE | ID: mdl-16176277

ABSTRACT

Originally the term 'protein module' was coined to distinguish mobile domains that frequently occur as building blocks of diverse multidomain proteins from 'static' domains that usually exist only as stand-alone units of single-domain proteins. Despite the widespread use of the term 'mobile domain', the distinction between static and mobile domains is rather vague as it is not easy to quantify the mobility of domains. In the present work we show that the most appropriate measure of the mobility of domains is the number of types of local environments in which a given domain is present. Ranking of domains with respect to this parameter in different evolutionary lineages highlighted marked differences in the propensity of domains to form multidomain proteins. Our analyses have also shown that there is a correlation between domain size and domain mobility: smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. It is also shown that shuffling of a limited set of modules was facilitated by intronic recombination in the metazoan lineage and this has contributed significantly to the emergence of novel complex multidomain proteins, novel functions and increased organismic complexity of metazoa.
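
The mobility measure described above, the number of types of local environments in which a domain occurs, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: a "local environment" is taken here to be the pair of immediately flanking neighbours (with hypothetical `N-term` and `C-term` markers at the chain ends), the function names are invented, and the architectures are toy data with placeholder domain labels.

```python
from collections import defaultdict


def local_environments(architectures):
    """Map each domain to the set of (left, right) neighbour pairs it occurs in.

    Each architecture is a sequence of domain labels in N- to C-terminal order.
    """
    environments = defaultdict(set)
    for architecture in architectures:
        padded = ["N-term", *architecture, "C-term"]
        for i in range(1, len(padded) - 1):
            environments[padded[i]].add((padded[i - 1], padded[i + 1]))
    return environments


def mobility(architectures):
    """Count distinct local environments per domain (assumed mobility score)."""
    return {
        domain: len(envs)
        for domain, envs in local_environments(architectures).items()
    }


if __name__ == "__main__":
    # Toy architectures; labels A-E are placeholders, not real domain families.
    toy = [("A", "B"), ("C", "B", "A"), ("B", "D"), ("E",)]
    for domain, score in sorted(mobility(toy).items(), key=lambda kv: -kv[1]):
        print(domain, score)
```

Under this definition a domain found only as a stand-alone unit, such as `E` in the toy data, has a single environment, whereas a domain reused next to many different partners scores higher.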

Subjects

Evolution, Molecular; Proteins/chemistry; Proteins/metabolism; Animals; Computational Biology; Exons/genetics; Models, Biological; Protein Structure, Tertiary; Proteins/genetics
