ABSTRACT
NLM's conserved domain database (CDD) is a collection of protein domain and protein family models constructed as multiple sequence alignments. Its main purpose is to provide annotation for protein and translated nucleotide sequences with the location of domain footprints and associated functional sites, and to define protein domain architecture as a basis for assigning gene product names and putative/predicted function. CDD has been available publicly for over 20 years and has grown substantially during that time. Maintaining an archive of pre-computed annotation continues to be a challenge and has slowed down the cadence of CDD releases. CDD curation staff builds hierarchical classifications of large protein domain families, adds models for novel domain families via surveillance of the protein 'dark matter' that currently lacks annotation, and now spends considerable effort on providing names and attribution for conserved domain architectures. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Subject(s)
Databases, Protein , Proteins , Humans , Amino Acid Sequence , Conserved Sequence , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Protein Domains
ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Subject(s)
Databases, Protein , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software
ABSTRACT
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Subject(s)
Computational Biology/methods , Databases, Genetic , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/methods , Proteins/genetics , Data Curation/methods , Data Mining/methods , Genomics/methods , Internet , Proteins/classification , User-Computer Interface
ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Subject(s)
Databases, Protein , Proteins/chemistry , Amino Acid Sequence , COVID-19/metabolism , Internet , Molecular Sequence Annotation , Protein Domains , Protein Interaction Maps , SARS-CoV-2/metabolism , Sequence Alignment
ABSTRACT
As NLM's Conserved Domain Database (CDD) enters its 20th year of operations as a publicly available resource, CDD curation staff continues to develop hierarchical classifications of widely distributed protein domain families, and to record conserved sites associated with molecular function, so that they can be mapped onto user queries in support of hypothesis-driven biomolecular research. CDD offers an archive of pre-computed domain annotations as well as live search services, both for single protein or nucleotide queries and for larger sets of protein query sequences. CDD staff has continued to characterize protein families via conserved domain architectures and has built up a significant corpus of curated domain architectures in support of naming bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Subject(s)
Databases, Protein , Protein Domains , Amino Acid Sequence , Conserved Sequence
ABSTRACT
The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.
Subject(s)
Databases, Protein , Molecular Sequence Annotation , Animals , Databases, Genetic , Gene Ontology , Humans , Internet , Multigene Family , Protein Domains/genetics , Sequence Homology, Amino Acid , Software , User-Computer Interface
ABSTRACT
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Subject(s)
Data Curation , Databases, Nucleic Acid , Genome , Molecular Sequence Annotation , Prokaryotic Cells , Archaea/genetics , Bacteria/genetics , Databases, Protein , Eukaryota/genetics , Forecasting , Humans , Sequence Homology , Software , Viruses/genetics
ABSTRACT
NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Subject(s)
Computational Biology/methods , Databases, Protein , Protein Interaction Domains and Motifs , Proteins , Information Dissemination , Internet , Proteins/chemistry , Proteins/classification , Proteins/genetics
ABSTRACT
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
Subject(s)
Computational Biology/methods , Databases, Protein , Protein Interaction Domains and Motifs , Software , Humans , Molecular Sequence Annotation , Phylogeny
ABSTRACT
NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Data Curation
ABSTRACT
CDD, the Conserved Domain Database, is part of NCBI's Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To this date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.
Subject(s)
Databases, Protein , Protein Conformation , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Internet , Models, Molecular , Molecular Sequence Annotation , Proteins/chemistry , Proteins/classification , Proteins/genetics , Sequence Analysis, Protein
ABSTRACT
NCBI's Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Models, Biological , Proteins/classification , Sequence Analysis, Protein
ABSTRACT
NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either 'specific' (identifying molecular function with high confidence) or as 'non-specific' (identifying superfamily membership only).
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Proteins/classification , Sequence Alignment , Sequence Analysis, Protein
ABSTRACT
The conserved domain database (CDD) is part of NCBI's Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. Entrez's global query interface can be accessed at http://www.ncbi.nlm.nih.gov/Entrez and will search CDD and many other databases. Domain annotation for proteins in Entrez has been pre-computed and is readily available in the form of 'Conserved Domain' links. Novel protein sequences can be scanned against CDD using the CD-Search service; this service searches databases of CDD-derived profile models with protein sequence queries using BLAST heuristics, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Protein query sequences submitted to NCBI's protein BLAST search service are scanned for conserved domain signatures by default. The CDD collection contains models imported from Pfam, SMART and COG, as well as domain models curated at NCBI. NCBI curated models are organized into hierarchies of domains related by common descent. Here we report on the status of the curation effort and present a novel helper application, CDTree, which enables users of the CDD resource to examine curated hierarchies. More importantly, CDD and CDTree, used in concert, serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Amino Acid Sequence , Animals , Conserved Sequence , Internet , Phylogeny , Protein Structure, Tertiary/genetics , Proteins/classification , Sequence Analysis, Protein , User-Computer Interface
ABSTRACT
This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to specifically identify the referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans the newly published literature and proposes related articles of relevance to the existing CDD records. Our approach to address these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature, which were given to curators for review. Through this exercise, we discovered that we were able to map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset that was reviewed by curators, we were able to successfully provide references for 86% of the curator statements. In addition, we suggested new articles for curator review, which were accepted by curators to be added into the reference list at an acceptance rate of 50%. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitute the CDD text similarity corpus. 
This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high) doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
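The two summary statistics quoted above can be illustrated with a short sketch. It assumes two annotators have scored the same sentence pairs on the 1 to 5 scale; the abstract does not say how inter-annotator agreement was computed, so the `agreement` function and its `tolerance` parameter are hypothetical choices, and the scores in the usage note are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def agreement(xs, ys, tolerance=0):
    """Fraction of items whose two scores differ by at most `tolerance`.

    Assumption: this is one plausible definition of agreement, not
    necessarily the one used for the corpus.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(abs(x - y) <= tolerance for x, y in zip(xs, ys)) / len(xs)
```

For example, with made-up scores `a = [1, 2, 3, 4]` and `b = [1, 2, 3, 5]`, `agreement(a, b)` counts three exact matches out of four, while `agreement(a, b, tolerance=1)` also accepts the pair that differs by one point.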
Subject(s)
Algorithms , Data Curation , Databases, Protein , Proteins , PubMed , Sequence Alignment , Protein Domains , Proteins/chemistry , Proteins/genetics
ABSTRACT
The Protein Data Bank (PDB; http://www.pdb.org/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the progress that has been made in validating all data in the PDB archive and in releasing a uniform archive for the community. We have now produced a collection of mmCIF data files for the PDB archive (ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/). A utility application that converts the mmCIF data files to the PDB format (called CIFTr) has also been released to provide support for existing software.
Subject(s)
Databases, Protein , Proteins/chemistry , Amino Acid Sequence , Animals , Archives , Database Management Systems , Enzymes/chemistry , Forecasting , Information Storage and Retrieval , Internet , Ligands , Polymers/chemistry , Protein Conformation , Quality Control , Stereoisomerism , Terminology as Topic , User-Computer Interface
ABSTRACT
An integrated, bioinformatic analysis of three databases comprising tumor-cell-based small molecule screening data, gene expression measurements, and PDB (Protein Data Bank) ligand-target structures has been developed for probing mechanism of drug action (MOA). Clustering analysis of GI50 profiles for the NCI's database of compounds screened across a panel of tumor cells (NCI60) was used to select a subset of unique cytotoxic responses for about 4000 small molecules. Drug-gene-PDB relationships for this test set were examined by correlative analysis of cytotoxic response and differential gene expression profiles within the NCI60 and structural comparisons with known ligand-target crystallographic complexes. A survey of molecular features within these compounds finds thirteen conserved Compound Classes, each class exhibiting chemical features important for interactions with a variety of biological targets. Protein targets for an additional twelve Compound Classes could be directly assigned using drug-protein interactions observed in the crystallographic database. Results from the analysis of constitutive gene expressions established a clear connection between chemo-resistance and overexpression of gene families associated with the extracellular matrix, cytoskeletal organization, and xenobiotic metabolism. Conversely, chemo-sensitivity implicated overexpression of gene families involved in homeostatic functions of nucleic acid repair, aryl hydrocarbon metabolism, heat shock response, proteasome degradation and apoptosis. Correlations between chemo-responsiveness and differential gene expressions distinguished chemotypes with nonselective (i.e., many) molecular targets from those likely to have selective (i.e., few) molecular targets. Applications of data mining strategies that jointly utilize tumor cell screening, genomic, and structural data are presented for hypotheses generation and identifying novel anticancer candidates.
Subject(s)
Gene Expression Profiling , Neoplasms/genetics , Antineoplastic Agents/therapeutic use , Cell Survival/drug effects , Gene Expression Regulation, Neoplastic , Humans , Neoplasms/drug therapy , Neoplasms/pathology , Transcription, Genetic
ABSTRACT
An unsupervised self-organizing map-based clustering strategy has been developed to classify tissue samples from an oligonucleotide microarray patient database. Our method is based on the likelihood that a test data vector may have a gene expression fingerprint that is shared by more than one tumor class and as such can identify datasets that cannot be unequivocally assigned to a single tumor class. Our self-organizing map analysis completely separated the tumor from the normal expression datasets. Within the 14 different tumor types, classification accuracies of approximately 80% were achieved. Nearly perfect classifications were found for leukemia, central nervous system, melanoma, uterine, and lymphoma tumor types, with very poor classifications found for colorectal, ovarian, breast, and lung tumors. Classification results were further analyzed to identify sets of differentially expressed genes between tumor and normal gene expressions and among each tumor class. Within the total pool of 1139 genes most differentially expressed in this dataset, subsets were found that could be vetted according to previously published literature sources to be specific tumor markers. Attempts to classify gene expression datasets from other sources found a wide range of classification accuracies. Discussions about the utility of this method and the quality of data needed for accurate tumor classifications are provided.
Subject(s)
Databases, Factual , Gene Expression Profiling , Neoplasms/classification , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Artificial Intelligence , Biomarkers, Tumor , Computational Biology , DNA, Neoplasm/analysis , Humans , Neoplasms/pathology
ABSTRACT
A hypothetical protein encoded by the gene YjeE of Haemophilus influenzae was selected as part of a structural genomics project for X-ray analysis to assist with the functional assignment. The protein is considered essential to bacteria because the gene is present in virtually all bacterial genomes but not in those of archaea or eukaryotes. The amino acid sequence shows no homology to other proteins except for the presence of the Walker A motif G-X-X-X-X-G-K-T that indicates the possibility of a nucleotide-binding protein. The YjeE protein was cloned, expressed, and the crystal structure determined by the MAD method at 1.7-Å resolution. The protein has a nucleotide-binding fold with a four-stranded parallel beta-sheet flanked by antiparallel beta-strands on each side. The topology of the beta-sheet is unique among P-loop proteins and has features of different families of enzymes. Crystallization of YjeE in the presence of ATP and Mg2+ resulted in the structure with ADP bound in the P-loop. The ATPase activity of YjeE was confirmed by kinetic measurements. The distribution of conserved residues suggests that the protein may work as a "molecular switch" triggered by ATP hydrolysis. The phylogenetic pattern of YjeE suggests its involvement in cell wall biosynthesis.
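As an illustration of the Walker A motif mentioned above, the pattern G-X-X-X-X-G-K-T can be written as a regular expression, where X stands for any residue. This is a minimal sketch and not the method used in the study; the function name `find_walker_a` and the example sequence in the usage note are hypothetical.

```python
import re

# G, any four residues, then G-K-T, as spelled out in the abstract.
# The lookahead lets overlapping occurrences be reported as well.
WALKER_A = re.compile(r"(?=(G.{4}GKT))")


def find_walker_a(sequence):
    """Return (0-based start, matched residues) for each Walker A occurrence."""
    return [(m.start(1), m.group(1)) for m in WALKER_A.finditer(sequence.upper())]
```

On the made-up sequence `"AAGSPGSGKTTL"`, `find_walker_a` reports a single hit starting at position 2 with the residues `GSPGSGKT`. A pattern match of this kind only flags a candidate nucleotide-binding site; it says nothing about fold or activity, which the study established by crystallography and kinetic measurements.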
Subject(s)
Adenosine Triphosphatases/chemistry , Bacterial Proteins/chemistry , Haemophilus influenzae/enzymology , Models, Molecular , Adenosine Triphosphatases/genetics , Adenosine Triphosphatases/physiology , Amino Acid Sequence , Bacterial Proteins/genetics , Bacterial Proteins/physiology , Cell Wall/metabolism , Crystallography, X-Ray , Haemophilus influenzae/growth & development , Molecular Sequence Data , Nucleotides/metabolism , Phylogeny , Sequence Homology, Amino Acid
ABSTRACT
The three-dimensional structures of Haemophilus influenzae proteins whose biological functions are unknown are being determined as part of a structural genomics project to ask whether structural information can assist in assigning the functions of proteins. The structures of the hypothetical proteins are being used to guide further studies and narrow the field of such studies for ultimately determining protein function. An outline of the structural genomics methodological approach is provided along with summaries of a number of completed and in progress crystallographic and NMR structure determinations. With more than twenty-five structures determined at this point and with many more in various stages of completion, the results are encouraging in that some level of functional understanding can be deduced from experimentally solved structures. In addition to aiding in functional assignment, this effort is identifying a number of possible new targets for drug development.