Search | Virtual Health Library

1.

The conserved domain database in 2023.

Wang, Jiyao; Chitsaz, Farideh; Derbyshire, Myra K; Gonzales, Noreen R; Gwadz, Marc; Lu, Shennan; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Yang, Mingzhang; Zhang, Dachuan; Zheng, Chanjuan; Lanczycki, Christopher J; Marchler-Bauer, Aron.

Nucleic Acids Res ; 51(D1): D384-D388, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36477806

ABSTRACT

NLM's conserved domain database (CDD) is a collection of protein domain and protein family models constructed as multiple sequence alignments. Its main purpose is to provide annotation for protein and translated nucleotide sequences with the location of domain footprints and associated functional sites, and to define protein domain architecture as a basis for assigning gene product names and putative/predicted function. CDD has been available publicly for over 20 years and has grown substantially during that time. Maintaining an archive of pre-computed annotation continues to be a challenge and has slowed down the cadence of CDD releases. CDD curation staff builds hierarchical classifications of large protein domain families, adds models for novel domain families via surveillance of the protein 'dark matter' that currently lacks annotation, and now spends considerable effort on providing names and attribution for conserved domain architectures. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Protein Databases , Proteins , Humans , Amino Acid Sequence , Conserved Sequence , Tertiary Protein Structure , Proteins/chemistry , Proteins/genetics , Protein Domains

2.

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Li, Wenjun; O'Neill, Kathleen R; Haft, Daniel H; DiCuccio, Michael; Chetvernin, Vyacheslav; Badretdin, Azat; Coulouris, George; Chitsaz, Farideh; Derbyshire, Myra K; Durkin, A Scott; Gonzales, Noreen R; Gwadz, Marc; Lanczycki, Christopher J; Song, James S; Thanki, Narmada; Wang, Jiyao; Yamashita, Roxanne A; Yang, Mingzhang; Zheng, Chanjuan; Marchler-Bauer, Aron; Thibaud-Nissen, Françoise.

Nucleic Acids Res ; 49(D1): D1020-D1028, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33270901

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Subjects

Computational Biology/methods , Genetic Databases , Archaeal Genome/genetics , Bacterial Genome/genetics , Molecular Sequence Annotation/methods , Proteins/genetics , Data Curation/methods , Data Mining/methods , Genomics/methods , Internet , Proteins/classification , User-Computer Interface

3.

CDD/SPARCLE: the conserved domain database in 2020.

Lu, Shennan; Wang, Jiyao; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Yang, Mingzhang; Zhang, Dachuan; Zheng, Chanjuan; Lanczycki, Christopher J; Marchler-Bauer, Aron.

Nucleic Acids Res ; 48(D1): D265-D268, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31777944

ABSTRACT

As NLM's Conserved Domain Database (CDD) enters its 20th year of operations as a publicly available resource, CDD curation staff continues to develop hierarchical classifications of widely distributed protein domain families, and to record conserved sites associated with molecular function, so that they can be mapped onto user queries in support of hypothesis-driven biomolecular research. CDD offers both an archive of pre-computed domain annotations as well as live search services for both single protein or nucleotide queries and larger sets of protein query sequences. CDD staff has continued to characterize protein families via conserved domain architectures and has built up a significant corpus of curated domain architectures in support of naming bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Protein Databases , Protein Domains , Amino Acid Sequence , Conserved Sequence

4.

iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures.

Wang, Jiyao; Youkharibache, Philippe; Zhang, Dachuan; Lanczycki, Christopher J; Geer, Renata C; Madej, Thomas; Phan, Lon; Ward, Minghong; Lu, Shennan; Marchler, Gabriele H; Wang, Yanli; Bryant, Stephen H; Geer, Lewis Y; Marchler-Bauer, Aron.

Bioinformatics ; 36(1): 131-135, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31218344

ABSTRACT

MOTIVATION: Build a web-based 3D molecular structure viewer focusing on interactive structural analysis. RESULTS: iCn3D (I-see-in-3D) can simultaneously show 3D structure, 2D molecular contacts and 1D protein and nucleotide sequences through an integrated sequence/annotation browser. Pre-defined and arbitrary molecular features can be selected in any of the 1D/2D/3D windows as sets of residues and these selections are synchronized dynamically in all displays. Biological annotations such as protein domains, single nucleotide variations, etc. can be shown as tracks in the 1D sequence/annotation browser. These customized displays can be shared with colleagues or publishers via a simple URL. iCn3D can display structure-structure alignments obtained from NCBI's VAST+ service. It can also display the alignment of a sequence with a structure as identified by BLAST, and thus relate 3D structure to a large fraction of all known proteins. iCn3D can also display electron density maps or electron microscopy (EM) density maps, and export files for 3D printing. The following example URL exemplifies some of the 1D/2D/3D representations: https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP&showanno=1&show2d=1&showsets=1. AVAILABILITY AND IMPLEMENTATION: iCn3D is freely available to the public. Its source code is available at https://github.com/ncbi/icn3d. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Base Sequence , Computational Biology , Internet , Molecular Models , Proteins , Software , Computational Biology/methods , Genetic Databases , Molecular Conformation , Proteins/chemistry
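
The example URL quoted in the iCn3D abstract above combines a structure identifier with three display flags. As a minimal sketch of how such a shareable link might be assembled, the Python below rebuilds that exact URL from its parts. The base address and the parameter names `mmdbid`, `showanno`, `show2d` and `showsets` are taken from the abstract; the helper name `icn3d_url`, its keyword arguments, and the idea that leaving a flag out hides the corresponding panel are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Base address as it appears in the abstract's example URL.
ICN3D_BASE = "https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html"


def icn3d_url(mmdbid, show_annotations=True, show_2d=True, show_sets=True):
    """Build an iCn3D link (hypothetical helper, not part of iCn3D itself).

    Only the query parameters seen in the abstract's example are used.
    Omitting a flag to hide a panel is an assumption of this sketch.
    """
    params = {"mmdbid": mmdbid}
    if show_annotations:
        params["showanno"] = 1  # 1D sequence/annotation browser
    if show_2d:
        params["show2d"] = 1    # 2D molecular contacts
    if show_sets:
        params["showsets"] = 1  # defined sets of residues
    return f"{ICN3D_BASE}?{urlencode(params)}"


url = icn3d_url("1TUP")
print(url)
```

Called with the identifier `1TUP`, the sketch reproduces the URL given in the abstract character for character, which is a quick way to check that the parameter order and encoding match.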

5.

CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H.

Nucleic Acids Res ; 45(D1): D200-D203, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899674

ABSTRACT

NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Computational Biology/methods , Protein Databases , Protein Interaction Domains and Motifs , Proteins , Information Dissemination , Internet , Proteins/chemistry , Proteins/classification , Proteins/genetics

6.

CDD: NCBI's conserved domain database.

Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R; Lu, Shennan; Chitsaz, Farideh; Geer, Lewis Y; Geer, Renata C; He, Jane; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Bryant, Stephen H.

Nucleic Acids Res ; 43(Database issue): D222-6, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25414356

ABSTRACT

NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Protein Databases , Tertiary Protein Structure , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Data Curation

7.

MMDB and VAST+: tracking structural similarities between macromolecular complexes.

Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan; Thiessen, Paul A; Geer, Renata C; Marchler-Bauer, Aron; Bryant, Stephen H.

Nucleic Acids Res ; 42(Database issue): D297-303, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24319143

ABSTRACT

The computational detection of similarities between protein 3D structures has become an indispensable tool for the detection of homologous relationships, the classification of protein families and functional inference. Consequently, numerous algorithms have been developed that facilitate structure comparison, including rapid searches against a steadily growing collection of protein structures. To this end, NCBI's Molecular Modeling Database (MMDB), which is based on the Protein Data Bank (PDB), maintains a comprehensive and up-to-date archive of protein structure similarities computed with the Vector Alignment Search Tool (VAST). These similarities have been recorded on the level of single proteins and protein domains, comprising in excess of 1.5 billion pairwise alignments. Here we present VAST+, an extension to the existing VAST service, which summarizes and presents structural similarity on the level of biological assemblies or macromolecular complexes. VAST+ simplifies structure neighboring results and shows, for macromolecular complexes tracked in MMDB, lists of similar complexes ranked by the extent of similarity. VAST+ replaces the previous VAST service as the default presentation of structure neighboring data in NCBI's Entrez query and retrieval system. MMDB and VAST+ can be accessed via http://www.ncbi.nlm.nih.gov/Structure.

Subjects

Protein Databases , Protein Structural Homology , Computer Graphics , Internet , Macromolecular Substances/chemistry , Molecular Models , Software

8.

CDD: conserved domains and protein three-dimensional structure.

Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 41(Database issue): D348-52, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23197659

ABSTRACT

CDD, the Conserved Domain Database, is part of NCBI's Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To this date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.

Subjects

Protein Databases , Protein Conformation , Tertiary Protein Structure , Amino Acid Sequence , Conserved Sequence , Internet , Molecular Models , Molecular Sequence Annotation , Proteins/chemistry , Proteins/classification , Proteins/genetics , Protein Sequence Analysis

9.

SPEER-SERVER: a web server for prediction of protein specificity determining sites.

Chakraborty, Abhijit; Mandloi, Sapan; Lanczycki, Christopher J; Panchenko, Anna R; Chakrabarti, Saikat.

Nucleic Acids Res ; 40(Web Server issue): W242-8, 2012 Jul.

Article in English | MEDLINE | ID: mdl-22689646

ABSTRACT

Sites that show specific conservation patterns within subsets of proteins in a protein family are likely to be involved in the development of functional specificity. These sites, generally termed specificity determining sites (SDS), might play a crucial role in binding to a specific substrate or proteins. Identification of SDS through experimental techniques is a slow, difficult and tedious job. Hence, it is very important to develop efficient computational methods that can more expediently identify SDS. Herein, we present Specificity prediction using amino acids' Properties, Entropy and Evolution Rate (SPEER)-SERVER, a web server that predicts SDS by analyzing quantitative measures of the conservation patterns of protein sites based on their physico-chemical properties and the heterogeneity of evolutionary changes between and within the protein subfamilies. This web server provides an improved representation of results, adds useful input and output options and integrates a wide range of analysis and data visualization tools when compared with the original standalone version of the SPEER algorithm. Extensive benchmarking finds that SPEER-SERVER exhibits sensitivity and precision performance that, on average, meets or exceeds that of other currently available methods. SPEER-SERVER is available at http://www.hpppi.iicb.res.in/ss/.

Subjects

Proteins/chemistry , Software , Algorithms , Amino Acids/chemistry , Internet , Protein Binding , Sequence Alignment , Protein Sequence Analysis , User-Computer Interface

10.

MMDB: 3D structures and macromolecular interactions.

Madej, Thomas; Addess, Kenneth J; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Lanczycki, Christopher J; Liu, Chunlei; Lu, Shennan; Marchler-Bauer, Aron; Panchenko, Anna R; Chen, Jie; Thiessen, Paul A; Wang, Yanli; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 40(Database issue): D461-4, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22135289

ABSTRACT

Close to 60% of protein sequences tracked in comprehensive databases can be mapped to a known three-dimensional (3D) structure by standard sequence similarity searches. Potentially, a great deal can be learned about proteins or protein families of interest from considering 3D structure, and to this day 3D structure data may remain an underutilized resource. Here we present enhancements in the Molecular Modeling Database (MMDB) and its data presentation, specifically pertaining to biologically relevant complexes and molecular interactions. MMDB is tightly integrated with NCBI's Entrez search and retrieval system, and mirrors the contents of the Protein Data Bank. It links protein 3D structure data with sequence data, sequence classification resources and PubChem, a repository of small-molecule chemical structures and their biological activities, facilitating access to 3D structure data not only for structural biologists, but also for molecular biologists and chemists. MMDB provides a complete set of detailed and pre-computed structural alignments obtained with the VAST algorithm, and provides visualization tools for 3D structure and structure/sequence alignment via the molecular graphics viewer Cn3D. MMDB can be accessed at http://www.ncbi.nlm.nih.gov/structure.

Subjects

Protein Databases , Molecular Models , Protein Conformation , Protein Sequence Analysis

11.

CDD: a Conserved Domain Database for the functional annotation of proteins.

Marchler-Bauer, Aron; Lu, Shennan; Anderson, John B; Chitsaz, Farideh; Derbyshire, Myra K; DeWeese-Scott, Carol; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Lu, Fu; Marchler, Gabriele H; Mullokandov, Mikhail; Omelchenko, Marina V; Robertson, Cynthia L; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Zhang, Naigong; Zheng, Chanjuan; Bryant, Stephen H.

Nucleic Acids Res ; 39(Database issue): D225-9, 2011 Jan.

Article in English | MEDLINE | ID: mdl-21109532

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subjects

Protein Databases , Tertiary Protein Structure , Amino Acid Sequence , Conserved Sequence , Biological Models , Proteins/classification , Protein Sequence Analysis

12.

Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures.

Neuwald, Andrew F; Lanczycki, Christopher J; Marchler-Bauer, Aron.

BMC Bioinformatics ; 13: 144, 2012 Jun 22.

Article in English | MEDLINE | ID: mdl-22726767

ABSTRACT

BACKGROUND: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. RESULTS: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. CONCLUSIONS: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.

Subjects

Conserved Sequence , Protein Databases , Tertiary Protein Structure , Proteins , Algorithms , Amino Acid Sequence , Markov Chains , Monte Carlo Method , Phylogeny , Protein Folding , Proteins/chemistry , Proteins/classification , Sequence Alignment

13.

CDD: specific functional annotation with the Conserved Domain Database.

Marchler-Bauer, Aron; Anderson, John B; Chitsaz, Farideh; Derbyshire, Myra K; DeWeese-Scott, Carol; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; He, Siqian; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Liebert, Cynthia A; Liu, Chunlei; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Mullokandov, Mikhail; Song, James S; Tasneem, Asba; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Zhang, Naigong; Bryant, Stephen H.

Nucleic Acids Res ; 37(Database issue): D205-10, 2009 Jan.

Article in English | MEDLINE | ID: mdl-18984618

ABSTRACT

NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either 'specific' (identifying molecular function with high confidence) or as 'non-specific' (identifying superfamily membership only).

Subjects

Protein Databases , Tertiary Protein Structure , Amino Acid Sequence , Conserved Sequence , Proteins/classification , Sequence Alignment , Protein Sequence Analysis

14.

Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments.

Neuwald, Andrew F; Lanczycki, Christopher J; Hodges, Theresa K; Marchler-Bauer, Aron.

Database (Oxford) ; 2020, 2020 01 01.

Article in English | MEDLINE | ID: mdl-32500917

ABSTRACT

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning, big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.

Subjects

Protein Databases , Proteins , Sequence Alignment/methods , Machine Learning , Proteins/chemistry , Proteins/genetics , Protein Sequence Analysis , Software

15.

Prediction of functionally important sites from protein sequences using sparse kernel least squares classifiers.

Tang, Ke; Pugalenthi, Ganesan; Suganthan, P N; Lanczycki, Christopher J; Chakrabarti, Saikat.

Biochem Biophys Res Commun ; 384(2): 155-9, 2009 Jun 26.

Article in English | MEDLINE | ID: mdl-19394310

ABSTRACT

Identification of functionally important sites (FIS) in proteins is a critical problem and can have profound importance where protein structural information is limited. Machine learning techniques have been very useful in successful classification of many important biological problems. In this paper, we adopt the sparse kernel least squares classifiers (SKLSC) approach for classification and/or prediction of FIS using protein sequence derived features. The SKLSC algorithm was applied to 5435 FIS that have been extracted from 312 reliable alignments for a wide range of protein families. We obtained 68.28% sensitivity and 68.66% specificity for the training dataset and 65.34% sensitivity and 66.88% specificity for the testing dataset. Further, a large-scale benchmarking study using alignments of 101 protein families containing 1899 FIS showed that our method achieved an average sensitivity of approximately 70% in predicting different types of FIS, such as active sites, metal, ligand or protein binding sites. Our findings also indicate that active sites and metal binding sites are comparatively easier to predict than ligand and protein binding sites. Despite moderate success, our results suggest the usefulness and potential of the SKLSC approach in prediction of FIS using only protein sequence derived information.

Subjects

Binding Sites , Proteins/chemistry , Protein Sequence Analysis/methods , Amino Acid Sequence , Catalytic Domain , Least-Squares Analysis , Proteins/classification
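
The abstract above reports its results as sensitivity and specificity percentages. For readers less familiar with these measures, the short Python sketch below shows the standard definitions: sensitivity is the fraction of true sites that are recovered, and specificity is the fraction of non-sites that are correctly rejected. The function names and the counts in the usage lines are hypothetical illustrations; they are not data from the paper, whose confusion-matrix counts are not given in the abstract.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that were predicted positive: TP / (TP + FN)."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive examples to evaluate")
    return true_pos / total


def specificity(true_neg, false_pos):
    """Fraction of real negatives that were predicted negative: TN / (TN + FP)."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no negative examples to evaluate")
    return true_neg / total


# Made-up counts, chosen only to illustrate the arithmetic.
sens = sensitivity(true_pos=70, false_neg=30)
spec = specificity(true_neg=90, false_pos=10)
print(f"sensitivity = {sens:.2%}, specificity = {spec:.2%}")
```

Reporting both numbers matters because a predictor can trivially maximize one at the expense of the other, for example by labeling every site as functionally important.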

16.

CDD: a conserved domain database for interactive domain family analysis.

Marchler-Bauer, Aron; Anderson, John B; Derbyshire, Myra K; DeWeese-Scott, Carol; Gonzales, Noreen R; Gwadz, Marc; Hao, Luning; He, Siqian; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Krylov, Dmitri; Lanczycki, Christopher J; Liebert, Cynthia A; Liu, Chunlei; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Mullokandov, Mikhail; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Yin, Jodie J; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 35(Database issue): D237-40, 2007 Jan.

Article in English | MEDLINE | ID: mdl-17135202

ABSTRACT

The conserved domain database (CDD) is part of NCBI's Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. Entrez's global query interface can be accessed at http://www.ncbi.nlm.nih.gov/Entrez and will search CDD and many other databases. Domain annotation for proteins in Entrez has been pre-computed and is readily available in the form of 'Conserved Domain' links. Novel protein sequences can be scanned against CDD using the CD-Search service; this service searches databases of CDD-derived profile models with protein sequence queries using BLAST heuristics, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Protein query sequences submitted to NCBI's protein BLAST search service are scanned for conserved domain signatures by default. The CDD collection contains models imported from Pfam, SMART and COG, as well as domain models curated at NCBI. NCBI curated models are organized into hierarchies of domains related by common descent. Here we report on the status of the curation effort and present a novel helper application, CDTree, which enables users of the CDD resource to examine curated hierarchies. More importantly, CDD and CDTree used in concert, serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.

Subjects

Protein Databases , Tertiary Protein Structure , Amino Acid Sequence , Animals , Conserved Sequence , Internet , Phylogeny , Tertiary Protein Structure/genetics , Proteins/classification , Protein Sequence Analysis , User-Computer Interface

17.

Refining multiple sequence alignments with conserved core regions.

Chakrabarti, Saikat; Lanczycki, Christopher J; Panchenko, Anna R; Przytycka, Teresa M; Thiessen, Paul A; Bryant, Stephen H.

Nucleic Acids Res ; 34(9): 2598-606, 2006.

Article in English | MEDLINE | ID: mdl-16707662

ABSTRACT

Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile and at the same time preserves the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and enhanced sensitivity for database searching using the sequence profiles derived from refined alignments compared with the original alignments. A standalone version of the program is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/REFINER) and will be incorporated into the next release of the Cn3D structure/alignment viewer.

Subjects

Algorithms , Sequence Alignment/methods , Protein Sequence Analysis/methods , Amino Acid Sequence , Conserved Sequence , Internet , Molecular Sequence Data , Quality Control , Reproducibility of Results , Sequence Alignment/standards , Protein Sequence Analysis/standards

18.

Analysis and prediction of functionally important sites in proteins.

Chakrabarti, Saikat; Lanczycki, Christopher J.

Protein Sci ; 16(1): 4-13, 2007 Jan.

Article in English | MEDLINE | ID: mdl-17192586

ABSTRACT

The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.

Subjects

Proteins/chemistry , Proteins/physiology , Binding Sites , Conserved Sequence , Molecular Evolution , Molecular Models , Protein Conformation , Proteins/genetics

19.

CDD: a Conserved Domain Database for protein classification.

Marchler-Bauer, Aron; Anderson, John B; Cherukuri, Praveen F; DeWeese-Scott, Carol; Geer, Lewis Y; Gwadz, Marc; He, Siqian; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Liebert, Cynthia A; Liu, Chunlei; Lu, Fu; Marchler, Gabriele H; Mullokandov, Mikhail; Shoemaker, Benjamin A; Simonyan, Vahan; Song, James S; Thiessen, Paul A; Yamashita, Roxanne A; Yin, Jodie J; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 33(Database issue): D192-6, 2005 Jan 01.

Article in English | MEDLINE | ID: mdl-15608175

ABSTRACT

The Conserved Domain Database (CDD) is the protein classification component of NCBI's Entrez query and retrieval system. CDD is linked to other Entrez databases such as Proteins, Taxonomy and PubMed, and can be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. CD-Search, which is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, is a fast, interactive tool to identify conserved domains in new protein sequences. CD-Search results for protein sequences in Entrez are pre-computed to provide links between proteins and domain models, and computational annotation visible upon request. Protein-protein queries submitted to NCBI's BLAST search service at http://www.ncbi.nlm.nih.gov/BLAST are scanned for the presence of conserved domains by default. While CDD started out as essentially a mirror of publicly available domain alignment collections, such as SMART, Pfam and COG, we have continued an effort to update, and in some cases replace these models with domain hierarchies curated at the NCBI. Here, we report on the progress of the curation effort and associated improvements in the functionality of the CDD information retrieval system.

Subjects

Protein Databases , Tertiary Protein Structure , Proteins/classification , Amino Acid Sequence , Conserved Sequence , Phylogeny , Sequence Alignment , Protein Sequence Analysis , User-Computer Interface

20.

State of the art: refinement of multiple sequence alignments.

Chakrabarti, Saikat; Lanczycki, Christopher J; Panchenko, Anna R; Przytycka, Teresa M; Thiessen, Paul A; Bryant, Stephen H.

BMC Bioinformatics ; 7: 499, 2006 Nov 14.

Article in English | MEDLINE | ID: mdl-17105653

ABSTRACT

BACKGROUND: Accurate multiple sequence alignments of proteins are very important in computational biology today. Despite the numerous efforts made in this field, all alignment strategies have certain shortcomings resulting in alignments that are not always correct. Refinement of existing alignment can prove to be an intelligent choice considering the increasing importance of high quality alignments in large scale high-throughput analysis. RESULTS: We provide an extensive comparison of the performance of the alignment refinement algorithms. The accuracy and efficiency of the refinement programs are compared using the 3D structure-based alignments in the BAliBASE benchmark database as well as manually curated high quality alignments from Conserved Domain Database (CDD). CONCLUSION: Comparison of performance for refined alignments revealed that despite the absence of dramatic improvements, our refinement method, REFINER, which uses conserved regions as constraints performs better in improving the alignments generated by different alignment algorithms. In most cases REFINER produces a higher-scoring, modestly improved alignment that does not deteriorate the well-conserved regions of the original alignment.

Subjects

Computational Biology/methods , Sequence Alignment , Algorithms , Protein Databases , Markov Chains , Programming Languages , Tertiary Protein Structure , Sensitivity and Specificity , Protein Sequence Analysis , Software

