Results 1 - 20 of 33
1.
Autophagy ; 14(12): 2033-2034, 2018.
Article in English | MEDLINE | ID: mdl-30296899

ABSTRACT

I routinely see people use incorrect names for MAP1LC3/LC3 isoforms in scientific papers. In fact, it happens often enough that I decided to investigate the reason for the apparent confusion. It turns out that the sources of misinformation are abundant, including UniProt and antibody supplier web sites.


Subject(s)
Antibodies/classification , Microtubule-Associated Proteins/classification , Terminology as Topic , Autophagy-Related Proteins/chemistry , Autophagy-Related Proteins/immunology , Commerce/standards , Databases, Protein/classification , Databases, Protein/standards , Humans , Microtubule-Associated Proteins/chemistry , Microtubule-Associated Proteins/immunology , Protein Isoforms/classification , Protein Isoforms/immunology
2.
Acta Crystallogr F Struct Biol Commun ; 74(Pt 8): 463-472, 2018 08 01.
Article in English | MEDLINE | ID: mdl-30084395

ABSTRACT

Glycosylation is one of the most common forms of protein post-translational modification, but is also the most complex. Dealing with glycoproteins in structure model building, refinement, validation and PDB deposition is more error-prone than dealing with nonglycosylated proteins owing to limitations of the experimental data and available software tools. Also, experimentalists are typically less experienced in dealing with carbohydrate residues than with amino-acid residues. The results of the reannotation and re-refinement by PDB-REDO of 8114 glycoprotein structure models from the Protein Data Bank are analyzed. The positive aspects of 3620 reannotations and subsequent refinement, as well as the remaining challenges to obtaining consistently high-quality carbohydrate models, are discussed.


Subject(s)
Databases, Protein/classification , Databases, Protein/standards , Glycoproteins/chemistry , Glycoproteins/classification
3.
Sci Rep ; 6: 31971, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27534507

ABSTRACT

Advances in omics technologies have triggered the production of an enormous volume of data from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast number of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) by combining functional and taxonomic information. FunTaxIS is able to define a taxon-specific functional space by exploiting annotation frequencies in order to establish whether a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints cover nearly the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights into the generated taxonomic rules.


Subject(s)
Databases, Protein/classification , Gene Ontology , Proteins/classification , Proteome/classification , Animals , Humans , Proteins/genetics , Species Specificity
4.
Methods ; 93: 15-23, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26318087

ABSTRACT

Argot2.5 (Annotation Retrieval of Gene Ontology Terms) is a web server designed to predict protein function. It is an updated version of the previous Argot2 enriched with new features in order to enhance its usability and its overall performance. The algorithmic strategy exploits the grouping of Gene Ontology terms by means of semantic similarity to infer protein function. The tool has been challenged over two independent benchmarks and compared to Argot2, PANNZER, and a baseline method relying on BLAST, proving to obtain a better performance thanks to the contribution of some key interventions in critical steps of the working pipeline. The most effective changes regard: (a) the selection of the input data from sequence similarity searches performed against a clustered version of UniProt databank and a remodeling of the weights given to Pfam hits, (b) the application of taxonomic constraints to filter out annotations that cannot be applied to proteins belonging to the species under investigation. The taxonomic rules are derived from our in-house developed tool, FunTaxIS, that extends those provided by the Gene Ontology consortium. The web server is free for academic users and is available online at http://www.medcomp.medicina.unipd.it/Argot2-5/.


Subject(s)
Databases, Protein/classification , Gene Ontology , Proteins/classification , Proteins/physiology , Web Browser , Algorithms , Forecasting , Internet
5.
Acta Crystallogr D Biol Crystallogr ; 69(Pt 11): 2209-15, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24189232

ABSTRACT

The estimate of the root-mean-square deviation (r.m.s.d.) in coordinates between the model and the target is an essential parameter for calibrating likelihood functions for molecular replacement (MR). Good estimates of the r.m.s.d. lead to good estimates of the variance term in the likelihood functions, which increases signal to noise and hence success rates in the MR search. Phaser has hitherto used an estimate of the r.m.s.d. that only depends on the sequence identity between the model and target and which was not optimized for the MR likelihood functions. Variance-refinement functionality was added to Phaser to enable determination of the effective r.m.s.d. that optimized the log-likelihood gain (LLG) for a correct MR solution. Variance refinement was subsequently performed on a database of over 21,000 MR problems that sampled a range of sequence identities, protein sizes and protein fold classes. Success was monitored using the translation-function Z-score (TFZ), where a TFZ of 8 or over for the top peak was found to be a reliable indicator that MR had succeeded for these cases with one molecule in the asymmetric unit. Good estimates of the r.m.s.d. are correlated with the sequence identity and the protein size. A new estimate of the r.m.s.d. that uses these two parameters in a function optimized to fit the mean of the refined variance is implemented in Phaser and improves MR outcomes. Perturbing the initial estimate of the r.m.s.d. from the mean of the distribution in steps of standard deviations of the distribution further increases MR success rates.


Subject(s)
Amino Acid Sequence , Amino Acid Substitution , Databases, Protein/trends , Signal-To-Noise Ratio , Amino Acid Sequence/genetics , Amino Acid Substitution/genetics , Crystallography, X-Ray/instrumentation , Crystallography, X-Ray/methods , Databases, Protein/classification , Likelihood Functions , Models, Molecular , Mutation , Protein Folding , Sequence Alignment , Software , X-Ray Diffraction
6.
BMC Bioinformatics ; 11: 530, 2010 Oct 25.
Article in English | MEDLINE | ID: mdl-20973947

ABSTRACT

BACKGROUND: The Gene Ontology project supports categorization of gene products according to their location of action, the molecular functions that they carry out, and the processes that they are involved in. Although the ontologies are intentionally developed to be taxon neutral, and to cover all species, there are inherent taxon specificities in some branches. For example, the process 'lactation' is specific to mammals and the location 'mitochondrion' is specific to eukaryotes. The lack of an explicit formalization of these constraints can lead to errors and inconsistencies in automated and manual annotation. RESULTS: We have formalized the taxonomic constraints implicit in some GO classes, and specified these at various levels in the ontology. We have also developed an inference system that can be used to check for violations of these constraints in annotations. Using the constraints in conjunction with the inference system, we have detected and removed errors in annotations and improved the structure of the ontology. CONCLUSIONS: Detection of inconsistencies in taxon-specificity enables gradual improvement of the ontologies, the annotations, and the formalized constraints. This is progressively improving the quality of our data. The full system is available for download, and new constraints or proposed changes to constraints can be submitted online at https://sourceforge.net/tracker/?atid=605890&group_id=36855.
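The kind of constraint check described in this abstract can be illustrated with a minimal sketch. It is not the authors' inference system: the toy taxonomy, the toy ontology edges, and the names `ONLY_IN_TAXON`, `taxon_lineage`, `term_ancestors` and `violations` are all hypothetical, and only the two examples named in the abstract (lactation in mammals, mitochondrion in eukaryotes) are taken from the source.

```python
# Toy taxonomy, child -> parent (hypothetical and heavily simplified).
TAXON_PARENT = {
    "Homo sapiens": "Mammalia",
    "Mammalia": "Eukaryota",
    "Gallus gallus": "Eukaryota",
    "Escherichia coli": "Bacteria",
    "Eukaryota": None,
    "Bacteria": None,
}

# Toy ontology, term -> is_a parents (hypothetical edges for illustration).
TERM_PARENTS = {
    "lactation": ["secretion"],
    "secretion": [],
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": [],
}

# "Only in taxon" constraints; the two examples come from the abstract.
ONLY_IN_TAXON = {"lactation": "Mammalia", "mitochondrion": "Eukaryota"}


def taxon_lineage(taxon):
    """Return the taxon together with all of its ancestors."""
    lineage = set()
    while taxon is not None:
        lineage.add(taxon)
        taxon = TAXON_PARENT.get(taxon)
    return lineage


def term_ancestors(term):
    """Return the term together with all of its is_a ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TERM_PARENTS.get(current, []))
    return seen


def violations(annotations):
    """List annotations whose term, or an ancestor term, is restricted
    to a taxon that is not in the annotated species' lineage."""
    bad = []
    for gene, term, taxon in annotations:
        lineage = taxon_lineage(taxon)
        for ancestor in term_ancestors(term):
            required = ONLY_IN_TAXON.get(ancestor)
            if required is not None and required not in lineage:
                bad.append((gene, term, taxon, f"{ancestor} only in {required}"))
    return bad
```

Because constraints are looked up on every ancestor term, a restriction placed on `mitochondrion` also applies to `mitochondrial matrix`, which mirrors the idea of specifying constraints at various levels of the ontology.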


Subject(s)
Classification/methods , Molecular Sequence Annotation/methods , Databases, Genetic/classification , Databases, Protein/classification , Terminology as Topic , Vocabulary, Controlled
7.
BMC Struct Biol ; 9: 27, 2009 May 01.
Article in English | MEDLINE | ID: mdl-19409097

ABSTRACT

BACKGROUND: Macromolecular docking is a challenging field of bioinformatics. Developing new algorithms is a slow process generally involving routine tasks that should be found in a robust library and not programmed from scratch for every new software application. RESULTS: We present an object-oriented Python/C++ library to help the development of new docking methods. This library contains low-level routines like PDB-format manipulation functions as well as high-level tools for docking and analyzing results. We also illustrate the ease of use of this library with the detailed implementation of a 3-body docking procedure. CONCLUSION: The PTools library can handle molecules at coarse-grained or atomic resolution and allows users to rapidly develop new software. The library is already in use for protein-protein and protein-DNA docking with the ATTRACT program and for simulation analysis. This library is freely available under the GNU GPL license, together with detailed documentation.


Subject(s)
Computational Biology/methods , Databases, Protein/classification , Proteins/chemistry , Access to Information , Algorithms , Computer Simulation , Information Storage and Retrieval , Libraries , Protein Binding , Protein Interaction Mapping , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein , Software
8.
BMC Struct Biol ; 9: 26, 2009 Apr 30.
Article in English | MEDLINE | ID: mdl-19402914

ABSTRACT

BACKGROUND: In addition to structural domains, most eukaryotic proteins possess intrinsically disordered (ID) regions. Although ID regions often play important functional roles, their accurate identification is difficult. As human transcription factors (TFs) constitute a typical group of proteins with long ID regions, we regarded them as a model of all proteins and attempted to accurately classify TFs into structural domains and ID regions. Although an extremely high fraction of ID regions besides DNA binding and/or other domains was detected in human TFs in our previous investigation, 20% of the residues were left unassigned. In this report, we exploit the generally higher sequence divergence in ID regions than in structural regions to completely divide proteins into structural domains and ID regions. RESULTS: The new dichotomic system first identifies domains of known structures, followed by assignment of structural domains and ID regions with a combination of pre-existing tools and a newly developed program based on sequence divergence, taking un-aligned regions into consideration. The system was found to be highly accurate: its application to a set of proteins with experimentally verified ID regions had an error rate as low as 2%. Application of this system to human TFs (401 proteins) showed that 38% of the residues were in structural domains, while 62% were in ID regions. The preponderance of ID regions makes a sharp contrast to TFs of Escherichia coli (229 proteins), in which only 5% fell in ID regions. The method also revealed that 4.0% and 11.8% of the total length in human and E. coli TFs, respectively, are comprised of structural domains whose structures have not been determined. CONCLUSION: The present system verifies that sequence divergence including information of unaligned regions is a good indicator of ID regions. 
The system for the first time estimates the complete fractioning of structured/un-structured regions in human TFs, also revealing structural domains without homology to known structures. These predicted novel structural domains are good targets of structural genomics. When applied to other proteins, the system is expected to uncover more novel structural domains.


Subject(s)
Bacterial Proteins , Databases, Protein/classification , Protein Folding , Sequence Analysis, Protein , Structure-Activity Relationship , Transcription Factors/chemistry , Artificial Intelligence , Computational Biology , Humans , Pattern Recognition, Automated , Protein Binding , Protein Conformation , Protein Structure, Tertiary , Software , Transcription Factors/genetics
9.
Curr Opin Drug Discov Devel ; 12(3): 408-19, 2009 May.
Article in English | MEDLINE | ID: mdl-19396742

ABSTRACT

The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.


Subject(s)
Databases, Protein/classification , Drug Discovery/methods , Internet/trends , Databases as Topic , Models, Molecular , Mutagenesis/physiology , Protein Binding , Protein Interaction Domains and Motifs/drug effects , Software
10.
Biophys Chem ; 138(1-2): 11-22, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18814947

ABSTRACT

Data reduction techniques are now a vital part of numerical analysis and principal component analysis is often used to identify important molecular features from a set of descriptors. We now take a different approach and apply data reduction techniques directly to protein structure. With this we can reduce the three-dimensional structural data into two-dimensions while preserving the correct relationships. With two-dimensional representations, structural comparisons between proteins are accelerated significantly. This means that protein-protein similarity comparisons are now feasible on a large scale. We show how the approach can help to predict the function of kinase structures according to the Hanks' classification based on their structural similarity to different kinase classes.


Subject(s)
Phosphotransferases/chemistry , Proteins/chemistry , Structural Homology, Protein , Computational Biology , Databases, Protein/classification , Models, Biological , Protein Conformation , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary
12.
J Allergy Clin Immunol ; 121(4): 847-52.e7, 2008 Apr.
Article in English | MEDLINE | ID: mdl-18395549

ABSTRACT

BACKGROUND: Existing allergen databases classify their entries by source and route of exposure, thus lacking an evolutionary, structural, and functional classification of allergens. OBJECTIVE: We sought to build AllFam, a database of allergen families, and use it to extract common structural and functional properties of allergens. METHODS: Allergen data from the Allergome database and protein family definitions from the Pfam database were merged into AllFam, a database that is freely accessible on the Internet at http://www.meduniwien.ac.at/allergens/allfam/. A structural classification of allergens was established by matching Pfam families with families from the Structural Classification of Proteins database. Biochemical functions of allergens were extracted from the Gene Ontology Annotation database. RESULTS: Seven hundred seven allergens were classified by sequence into 134 AllFam families containing 184 Pfam domains (2% of 9318 Pfam families). A random set of 707 sequences with the same taxonomic distribution contained a significantly higher number of different Pfam domains (479 +/- 17). Classifying allergens by structure revealed that 5% of 3012 Structural Classification of Proteins families contained allergens. The biochemical functions of allergens most frequently found were limited to hydrolysis of proteins, polysaccharides, and lipids; binding of metal ions and lipids; storage; and cytoskeleton association. CONCLUSION: The small number of protein families that contain allergens and the narrow functional distribution of most allergens confirm the existence of yet unknown factors that render proteins allergenic.


Subject(s)
Allergens/chemistry , Allergens/physiology , Databases, Protein/classification , Multigene Family/immunology , Proteins/chemistry , Proteins/physiology , Terminology as Topic , Allergens/classification , Allergens/genetics , Animals , Humans , Plant Proteins/chemistry , Plant Proteins/classification , Plant Proteins/genetics , Plant Proteins/physiology , Proteins/classification , Proteins/genetics , Proteome/chemistry , Proteome/classification , Proteome/genetics , Proteome/physiology , Random Allocation , Sequence Analysis, DNA , Structure-Activity Relationship
13.
J Mol Biol ; 377(4): 1265-78, 2008 Apr 04.
Article in English | MEDLINE | ID: mdl-18313074

ABSTRACT

A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.


Subject(s)
Computational Biology , Databases, Protein , Sequence Alignment/methods , Sequence Homology, Amino Acid , Amino Acid Sequence , Databases, Protein/classification , Linear Models , Models, Molecular , Molecular Sequence Data , Probability Theory , Reproducibility of Results , Sequence Analysis, Protein/methods
14.
BMC Bioinformatics ; 9: 35, 2008 Jan 23.
Article in English | MEDLINE | ID: mdl-18215279

ABSTRACT

BACKGROUND: It has repeatedly been shown that interacting protein families tend to have similar phylogenetic trees. These similarities can be used to predict the mapping between two families of interacting proteins (i.e. which proteins from one family interact with which members of the other). The correct mapping will be that which maximizes the similarity between the trees. The two families may comprise both orthologs and paralogs if members of the two families are present in more than one organism. This fact can be exploited to restrict the possible mappings, simply by preventing links between proteins of different organisms. We present here an algorithm to predict the mapping between families of interacting proteins which is able to incorporate information regarding orthologs, or any other assignment of proteins to "classes" that may restrict possible mappings. RESULTS: For the first time in methods for predicting mappings, we have tested this new approach on a large number of interacting protein domains in order to statistically assess its performance. The method accurately predicts around 80% in the most favourable cases. We also analysed in detail the results of the method for a well-defined case of interacting families, the sensor and kinase components of the Ntr-type two-component system, for which up to 98% of the pairings predicted by the method were correct. CONCLUSION: Based on the well-established relationship between tree similarity and interactions, we developed a method for predicting the mapping between two interacting families using genomic information alone. The program is available through a web interface.
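The idea of choosing the mapping that maximizes tree similarity, while forbidding links between proteins of different organisms, can be sketched as follows. This is not the published algorithm: it is a brute-force illustration that assumes tree similarity is measured as the Pearson correlation between the two families' pairwise distance matrices, and the names `pearson` and `best_mapping` are hypothetical. Exhaustive enumeration is only practical for very small families.

```python
import math
from itertools import permutations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def best_mapping(dist_a, dist_b, org_a, org_b):
    """Try every pairing of family A with family B that only links proteins
    from the same organism; return the pairing with the highest correlation
    between the two distance matrices, together with that correlation.

    The result maps protein i of family A to protein mapping[i] of family B.
    """
    n = len(dist_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    reference = [dist_a[i][j] for i, j in pairs]
    best, best_score = None, -2.0
    for perm in permutations(range(n)):
        # Class restriction: skip pairings that cross organisms.
        if any(org_a[i] != org_b[perm[i]] for i in range(n)):
            continue
        candidate = [dist_b[perm[i]][perm[j]] for i, j in pairs]
        score = pearson(reference, candidate)
        if score > best_score:
            best, best_score = perm, score
    return best, best_score
```

The organism check prunes the search space before any correlation is computed, which is the role the abstract assigns to ortholog or other "class" information.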


Subject(s)
Databases, Protein/classification , Information Systems , Proteins/classification , Proteins/genetics , Forecasting , Information Systems/trends , Protein Binding/physiology , Protein Interaction Mapping/classification , Protein Interaction Mapping/methods , Proteins/metabolism , Sequence Alignment/methods , Yeasts/genetics , Yeasts/metabolism
15.
Proteins ; 67(4): 789-94, 2007 Jun 01.
Article in English | MEDLINE | ID: mdl-17380509

ABSTRACT

Searches using position-specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches, the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also identify that a simple parameter, defined by the number of PSSMs corresponding to a family that is hit for a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple-profile search approach.
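The parameter described at the end of this abstract is a simple ratio, and a minimal sketch of it is given below. The input layout (a list of `(family, pssm_id)` hits and a dictionary of family sizes) and the function name `family_hit_fractions` are assumptions made for illustration; the abstract does not state a cutoff for separating true from false positives, so none is applied here.

```python
from collections import Counter


def family_hit_fractions(hits, family_sizes):
    """For one query, compute per family:
    (number of distinct PSSMs of the family that were hit)
    / (total number of PSSMs in the family).

    hits         -- iterable of (family, pssm_id) pairs reported for the query
    family_sizes -- dict mapping family -> total number of PSSMs it contains
    """
    distinct_hits = set(hits)  # count each PSSM once even if reported twice
    counts = Counter(family for family, _ in distinct_hits)
    return {family: counts[family] / family_sizes[family] for family in counts}
```

A family hit through most of its profiles gets a fraction close to 1, while a family hit through a single profile out of many gets a fraction close to 0.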


Subject(s)
Databases, Protein , Amino Acid Sequence , Databases, Protein/classification , Sensitivity and Specificity , Sequence Homology, Amino Acid
17.
Mol Biotechnol ; 34(1): 69-93, 2006 Sep.
Article in English | MEDLINE | ID: mdl-16943573

ABSTRACT

By organizing and making widely accessible the increasing amounts of data from high-throughput analyses, protein interaction databases have become an integral resource for the biological community in relating sequence data with higher-order function. To provide a sense of the use and applicability of these databases, we describe each of the major comprehensive interaction databases as well as some of the more specialized ones. Content description, search/browse functionalities, and data presentation are discussed. A succinct explanation of database contents helps the user quickly identify whether the database contains information applicable to their research interest. Broad levels of search/browse functions as well as descriptions/examples allow users to quickly find and access pertinent data. At this point, clear presentation of search results as well as the primary content is necessary. Many databases display information graphically or divided into smaller digestible parts over a number of tabbed/linked pages. In addition, cross-linking between the databases promotes interconnectivity of the data and is an added layer of relational data for the user. Overall, although these protein interaction databases are under continual improvement, their current state shows that much time and effort has gone into organizing and presenting these large sets of data describing protein interactions.


Subject(s)
Databases, Protein/classification , Documentation/methods , Information Storage and Retrieval/methods , Protein Interaction Mapping/methods , Proteins/classification , Proteins/metabolism , Terminology as Topic , Database Management Systems , Proteins/chemistry , Vocabulary, Controlled
18.
Nat Biotechnol ; 24(7): 852-5, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16823370

ABSTRACT

Antifreeze proteins (AFPs) are found in cold-adapted organisms and have the unusual ability to bind to and inhibit the growth of ice crystals. However, the underlying molecular basis of their ice-binding activity is unclear because of the difficulty of studying the AFP-ice interaction directly and the lack of a common motif, domain or fold among different AFPs. We have formulated a generic ice-binding model and incorporated it into a physicochemical pattern-recognition algorithm. It successfully recognizes ice-binding surfaces for a diverse range of AFPs, and clearly discriminates AFPs from other structures in the Protein Data Bank. The algorithm was used to identify a novel AFP from winter rye, and the antifreeze activity of this protein was subsequently confirmed. The presence of a common and distinct physicochemical pattern provides a structural basis for unifying AFPs from fish, insects and plants.


Subject(s)
Antifreeze Proteins/isolation & purification , Databases, Protein/classification , Algorithms , Antifreeze Proteins/classification , Models, Chemical , Molecular Sequence Data , Protein Conformation , Sequence Homology, Amino Acid , Structure-Activity Relationship
19.
BMC Bioinformatics ; 7: 187, 2006 Apr 04.
Article in English | MEDLINE | ID: mdl-16584572

ABSTRACT

BACKGROUND: The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop computational methods to automatically classify the quaternary structure of proteins from their sequences. RESULTS: To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the PFAM database. The nearest neighbor algorithm (NNA) was used for classifying the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on the non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained is 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the proteins in an independent dataset and achieved an overall success rate of 84.11%. CONCLUSION: Compared with the amino acid composition method and BLAST, the results indicate that the domain composition approach may be a more effective and promising high-throughput method for dealing with this complicated problem in bioinformatics.
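A minimal sketch of the general approach in this abstract (domain-composition vectors, a nearest neighbor classifier, and a jackknife test) is shown below. It is not the authors' implementation: the binary presence/absence encoding, the cosine similarity measure, the placeholder domain identifiers `D1` to `D4`, the toy labels, and all function names are assumptions made for illustration.

```python
import math

# Placeholder domain identifiers and toy labelled proteins (hypothetical data).
VOCABULARY = ["D1", "D2", "D3", "D4"]
TOY_PROTEINS = [
    (["D1", "D2"], "monomer"),
    (["D1"], "monomer"),
    (["D3", "D4"], "homodimer"),
    (["D3"], "homodimer"),
]


def domain_vector(domains, vocabulary):
    """Encode a protein as a 0/1 vector: one slot per known domain."""
    present = set(domains)
    return [1.0 if d in present else 0.0 for d in vocabulary]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor_label(query, training):
    """Return the label of the training vector most similar to the query."""
    _, label = max(training, key=lambda item: cosine(query, item[0]))
    return label


def jackknife_accuracy(samples):
    """Leave each sample out in turn, predict it from the rest, and
    return the fraction of samples predicted correctly."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(vector, rest) == label:
            correct += 1
    return correct / len(samples)


SAMPLES = [(domain_vector(d, VOCABULARY), label) for d, label in TOY_PROTEINS]
```

For example, `jackknife_accuracy(SAMPLES)` estimates the success rate on the toy set, and `nearest_neighbor_label(domain_vector(["D4"], VOCABULARY), SAMPLES)` classifies a new protein from its domain list.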


Subject(s)
Databases, Protein/classification , Protein Structure, Quaternary , Sequence Analysis, Protein/methods , Algorithms , Computational Biology/methods , Protein Structure, Quaternary/genetics , Protein Structure, Tertiary/genetics
20.
Proteins ; 63(3): 527-41, 2006 May 15.
Article in English | MEDLINE | ID: mdl-16456850

ABSTRACT

Protein classification and characterization often rely on the information contained in the protein secondary structure. Protein class assignment is usually based on X-ray diffraction measurements, which need the protein in a crystallized form, or on NMR spectra, to obtain the structure of a protein in solution. Simple spectroscopic techniques, such as circular dichroism (CD) and infrared (IR) spectroscopies, are also known to be related to protein secondary structure, but they have seldom been used for protein classification. To see the potential of CD, IR, and combined CD/IR measurements for protein classification, unsupervised pattern recognition methods, Principal Component Analysis (PCA) and cluster analysis, are proposed first to check for natural grouping tendencies of proteins according to their measured spectra. Partial Least Squares Discriminant Analysis (PLS-DA), a supervised pattern recognition method, is used afterwards to test the possibility to model explicitly each protein class and to test these models in class assignment of unknown proteins. Determination of the protein secondary structure, understood as the prediction of the abundance of the different secondary structure motifs in the biomolecule, was carried out with the local regression method interval Partial Least Squares (iPLS). CD, IR, and CD/IR measurements were correlated to the fraction of the motif to be predicted, determined from X-ray measurements. iPLS builds models extracting the spectral information most correlated to a specific secondary motif and avoids the use of irrelevant spectral regions. Spectral intervals chosen by iPLS models provide structural information which can be used to confirm previous biochemical assignments or identify new motif-related spectral features. The predictive ability of the models built with the selected spectral regions has a quality similar to previous classical approaches.


Subject(s)
Circular Dichroism , Databases, Protein/classification , Protein Structure, Secondary , Spectrophotometry, Infrared/methods , Circular Dichroism/methods , Cluster Analysis , Crystallography, X-Ray , Protein Conformation