Search | VHL Search Portal

1.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Beck, Jeff; Bolton, Evan E; Brister, J Rodney; Chan, Jessica; Comeau, Donald C; Connor, Ryan; DiCuccio, Michael; Farrell, Catherine M; Feldgarden, Michael; Fine, Anna M; Funk, Kathryn; Hatcher, Eneida; Hoeppner, Marilu; Kane, Megan; Kannan, Sivakumar; Katz, Kenneth S; Kelly, Christopher; Klimke, William; Kim, Sunghwan; Kimchi, Avi; Landrum, Melissa; Lathrop, Stacy; Lu, Zhiyong; Malheiro, Adriana; Marchler-Bauer, Aron; Murphy, Terence D; Phan, Lon; Prasad, Arjun B; Pujar, Shashikant; Sawyer, Amanda; Schmieder, Erin; Schneider, Valerie A; Schoch, Conrad L; Sharma, Shobha; Thibaud-Nissen, Françoise; Trawick, Barton W; Venkatapathi, Thilakam; Wang, Jiyao; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 52(D1): D33-D43, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37994677

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Databases, Genetic , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Databases, Nucleic Acid , Internet , United States
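
The abstract above names the E-utilities as the programming interface to most NCBI databases. As a minimal sketch, the snippet below builds an ESummary request URL for this record. It assumes that the numeric part of the portal's "mdl-" identifier is the PubMed ID; the helper names `pmid_from_record_id` and `esummary_url` are hypothetical and used only for illustration. No network request is made.

```python
from typing import Iterable
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pmid_from_record_id(record_id: str) -> str:
    """Extract the numeric ID from a portal identifier such as 'mdl-37994677'.

    Assumption: the digits after 'mdl-' are the PubMed ID of the record.
    """
    prefix, _, number = record_id.partition("-")
    if prefix != "mdl" or not number.isdigit():
        raise ValueError(f"unexpected record identifier: {record_id!r}")
    return number


def esummary_url(db: str, ids: Iterable[str]) -> str:
    """Build an ESummary URL for one or more IDs in the given database."""
    # safe="," keeps the comma-separated ID list readable in the URL.
    query = urlencode({"db": db, "id": ",".join(ids), "retmode": "json"}, safe=",")
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


url = esummary_url("pubmed", [pmid_from_record_id("mdl-37994677")])
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a JSON document summary, subject to NCBI's usage limits.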

2.

The conserved domain database in 2023.

Wang, Jiyao; Chitsaz, Farideh; Derbyshire, Myra K; Gonzales, Noreen R; Gwadz, Marc; Lu, Shennan; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Yang, Mingzhang; Zhang, Dachuan; Zheng, Chanjuan; Lanczycki, Christopher J; Marchler-Bauer, Aron.

Nucleic Acids Res ; 51(D1): D384-D388, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36477806

ABSTRACT

NLM's conserved domain database (CDD) is a collection of protein domain and protein family models constructed as multiple sequence alignments. Its main purpose is to provide annotation for protein and translated nucleotide sequences with the location of domain footprints and associated functional sites, and to define protein domain architecture as a basis for assigning gene product names and putative/predicted function. CDD has been available publicly for over 20 years and has grown substantially during that time. Maintaining an archive of pre-computed annotation continues to be a challenge and has slowed down the cadence of CDD releases. CDD curation staff builds hierarchical classifications of large protein domain families, adds models for novel domain families via surveillance of the protein 'dark matter' that currently lacks annotation, and now spends considerable effort on providing names and attribution for conserved domain architectures. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subject(s)

Databases, Protein , Proteins , Humans , Amino Acid Sequence , Conserved Sequence , Protein Structure, Tertiary , Proteins/chemistry , Proteins/genetics , Protein Domains

3.

InterPro in 2022.

Paysan-Lafosse, Typhaine; Blum, Matthias; Chuguransky, Sara; Grego, Tiago; Pinto, Beatriz Lázaro; Salazar, Gustavo A; Bileschi, Maxwell L; Bork, Peer; Bridge, Alan; Colwell, Lucy; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex.

Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

Subject(s)

Databases, Protein , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software

4.

Database resources of the National Center for Biotechnology Information in 2023.

Sayers, Eric W; Bolton, Evan E; Brister, J Rodney; Canese, Kathi; Chan, Jessica; Comeau, Donald C; Farrell, Catherine M; Feldgarden, Michael; Fine, Anna M; Funk, Kathryn; Hatcher, Eneida; Kannan, Sivakumar; Kelly, Christopher; Kim, Sunghwan; Klimke, William; Landrum, Melissa J; Lathrop, Stacy; Lu, Zhiyong; Madden, Thomas L; Malheiro, Adriana; Marchler-Bauer, Aron; Murphy, Terence D; Phan, Lon; Pujar, Shashikant; Rangwala, Sanjida H; Schneider, Valerie A; Tse, Tony; Wang, Jiyao; Ye, Jian; Trawick, Barton W; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 51(D1): D29-D38, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36370100

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Databases, Genetic , Databases, Nucleic Acid , United States , National Library of Medicine (U.S.) , Sequence Alignment , Biotechnology , Internet

5.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Bolton, Evan E; Brister, J Rodney; Canese, Kathi; Chan, Jessica; Comeau, Donald C; Connor, Ryan; Funk, Kathryn; Kelly, Chris; Kim, Sunghwan; Madej, Tom; Marchler-Bauer, Aron; Lanczycki, Christopher; Lathrop, Stacy; Lu, Zhiyong; Thibaud-Nissen, Francoise; Murphy, Terence; Phan, Lon; Skripchenko, Yuri; Tse, Tony; Wang, Jiyao; Williams, Rebecca; Trawick, Barton W; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 50(D1): D20-D26, 2022 01 07.

Article in English | MEDLINE | ID: mdl-34850941

ABSTRACT

The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Biotechnology/trends , Databases, Genetic/trends , Databases, Chemical , Databases, Nucleic Acid , Databases, Protein , Humans , Internet , National Library of Medicine (U.S.) , PubMed , United States

6.

The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health.

Bornstein, Kristin; Gryan, Gary; Chang, E Sally; Marchler-Bauer, Aron; Schneider, Valerie A.

BMC Genomics ; 24(1): 575, 2023 Sep 27.

Article in English | MEDLINE | ID: mdl-37759191

ABSTRACT

Comparative genomics is the comparison of genetic information within and across organisms to understand the evolution, structure, and function of genes, proteins, and non-coding regions (Sivashankari and Shanmughavel, Bioinformation 1:376-8, 2007). Advances in sequencing technology and assembly algorithms have resulted in the ability to sequence large genomes and provided a wealth of data that are being used in comparative genomic analyses. Comparative analysis can be leveraged to systematically explore and evaluate the biological relationships and evolution between species, aid in understanding the structure and function of genes, and gain a better understanding of disease and potential drug targets. As our knowledge of genetics expands, comparative genomics can help identify emerging model organisms among a broader span of the tree of life, positively impacting human health. This impact includes, but is not limited to, zoonotic disease research, therapeutics development, microbiome research, xenotransplantation, oncology, and toxicology. Despite advancements in comparative genomics, new challenges have arisen around the quantity, quality assurance, annotation, and interoperability of genomic data and metadata. New tools and approaches are required to meet these challenges and fulfill the needs of researchers. This paper focuses on how the National Institutes of Health (NIH) Comparative Genomics Resource (CGR) can address both the opportunities for comparative genomics to further impact human health and confront an increasingly complex set of challenges facing researchers.

Subject(s)

Algorithms , Genomics , United States , Humans , Comparative Genomic Hybridization , Drug Delivery Systems , National Institutes of Health (U.S.)

7.

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Li, Wenjun; O'Neill, Kathleen R; Haft, Daniel H; DiCuccio, Michael; Chetvernin, Vyacheslav; Badretdin, Azat; Coulouris, George; Chitsaz, Farideh; Derbyshire, Myra K; Durkin, A Scott; Gonzales, Noreen R; Gwadz, Marc; Lanczycki, Christopher J; Song, James S; Thanki, Narmada; Wang, Jiyao; Yamashita, Roxanne A; Yang, Mingzhang; Zheng, Chanjuan; Marchler-Bauer, Aron; Thibaud-Nissen, Françoise.

Nucleic Acids Res ; 49(D1): D1020-D1028, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33270901

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Subject(s)

Computational Biology/methods , Databases, Genetic , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/methods , Proteins/genetics , Data Curation/methods , Data Mining/methods , Genomics/methods , Internet , Proteins/classification , User-Computer Interface

8.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Beck, Jeffrey; Bolton, Evan E; Bourexis, Devon; Brister, James R; Canese, Kathi; Comeau, Donald C; Funk, Kathryn; Kim, Sunghwan; Klimke, William; Marchler-Bauer, Aron; Landrum, Melissa; Lathrop, Stacy; Lu, Zhiyong; Madden, Thomas L; O'Leary, Nuala; Phan, Lon; Rangwala, Sanjida H; Schneider, Valerie A; Skripchenko, Yuri; Wang, Jiyao; Ye, Jian; Trawick, Barton W; Pruitt, Kim D; Sherry, Stephen T.

Nucleic Acids Res ; 49(D1): D10-D17, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33095870

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.

Subject(s)

Databases, Genetic , National Library of Medicine (U.S.) , Computational Biology/methods , Databases, Chemical , Databases, Nucleic Acid , Databases, Protein , Genomics/methods , Humans , PubMed , United States

9.

The InterPro protein families and domains database: 20 years on.

Blum, Matthias; Chang, Hsin-Yu; Chuguransky, Sara; Grego, Tiago; Kandasaamy, Swaathi; Mitchell, Alex; Nuka, Gift; Paysan-Lafosse, Typhaine; Qureshi, Matloob; Raj, Shriya; Richardson, Lorna; Salazar, Gustavo A; Williams, Lowri; Bork, Peer; Bridge, Alan; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Necci, Marco; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex; Finn, Robert D.

Nucleic Acids Res ; 49(D1): D344-D354, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33156333

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Subject(s)

Databases, Protein , Proteins/chemistry , Amino Acid Sequence , COVID-19/metabolism , Internet , Molecular Sequence Annotation , Protein Domains , Protein Interaction Maps , SARS-CoV-2/metabolism , Sequence Alignment
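
The abstract above mentions the release of a REST API for InterPro. As a rough sketch, the snippet below constructs the address of a single InterPro entry. The base path `https://www.ebi.ac.uk/interpro/api` and the `/entry/interpro/<accession>` layout are assumptions not stated in the abstract, and `interpro_entry_url` is a hypothetical helper; check the current InterPro API documentation before relying on either.

```python
import re

# Assumed base address of the InterPro REST API (not given in the abstract).
INTERPRO_API = "https://www.ebi.ac.uk/interpro/api"

# InterPro accessions take the form 'IPR' followed by six digits.
_ACCESSION = re.compile(r"IPR\d{6}")


def interpro_entry_url(accession: str) -> str:
    """Return the assumed API URL for one InterPro entry."""
    if not _ACCESSION.fullmatch(accession):
        raise ValueError(f"not an InterPro accession: {accession!r}")
    return f"{INTERPRO_API}/entry/interpro/{accession}"


entry_url = interpro_entry_url("IPR000001")
print(entry_url)
```

Validating the accession before building the URL keeps malformed input out of the request path.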

10.

CDD/SPARCLE: the conserved domain database in 2020.

Lu, Shennan; Wang, Jiyao; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Yang, Mingzhang; Zhang, Dachuan; Zheng, Chanjuan; Lanczycki, Christopher J; Marchler-Bauer, Aron.

Nucleic Acids Res ; 48(D1): D265-D268, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31777944

ABSTRACT

As NLM's Conserved Domain Database (CDD) enters its 20th year of operations as a publicly available resource, CDD curation staff continues to develop hierarchical classifications of widely distributed protein domain families, and to record conserved sites associated with molecular function, so that they can be mapped onto user queries in support of hypothesis-driven biomolecular research. CDD offers both an archive of pre-computed domain annotations as well as live search services for both single protein or nucleotide queries and larger sets of protein query sequences. CDD staff has continued to characterize protein families via conserved domain architectures and has built up a significant corpus of curated domain architectures in support of naming bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subject(s)

Databases, Protein , Protein Domains , Amino Acid Sequence , Conserved Sequence

11.

iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures.

Wang, Jiyao; Youkharibache, Philippe; Zhang, Dachuan; Lanczycki, Christopher J; Geer, Renata C; Madej, Thomas; Phan, Lon; Ward, Minghong; Lu, Shennan; Marchler, Gabriele H; Wang, Yanli; Bryant, Stephen H; Geer, Lewis Y; Marchler-Bauer, Aron.

Bioinformatics ; 36(1): 131-135, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31218344

ABSTRACT

MOTIVATION: Build a web-based 3D molecular structure viewer focusing on interactive structural analysis. RESULTS: iCn3D (I-see-in-3D) can simultaneously show 3D structure, 2D molecular contacts and 1D protein and nucleotide sequences through an integrated sequence/annotation browser. Pre-defined and arbitrary molecular features can be selected in any of the 1D/2D/3D windows as sets of residues and these selections are synchronized dynamically in all displays. Biological annotations such as protein domains, single nucleotide variations, etc. can be shown as tracks in the 1D sequence/annotation browser. These customized displays can be shared with colleagues or publishers via a simple URL. iCn3D can display structure-structure alignments obtained from NCBI's VAST+ service. It can also display the alignment of a sequence with a structure as identified by BLAST, and thus relate 3D structure to a large fraction of all known proteins. iCn3D can also display electron density maps or electron microscopy (EM) density maps, and export files for 3D printing. The following example URL exemplifies some of the 1D/2D/3D representations: https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP&showanno=1&show2d=1&showsets=1. AVAILABILITY AND IMPLEMENTATION: iCn3D is freely available to the public. Its source code is available at https://github.com/ncbi/icn3d. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Base Sequence , Computational Biology , Internet , Models, Molecular , Proteins , Software , Computational Biology/methods , Databases, Genetic , Molecular Conformation , Proteins/chemistry
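
The abstract above notes that customized iCn3D displays can be shared via a simple URL and gives one example. The sketch below rebuilds that example from its query parameters. The parameter names (`mmdbid`, `showanno`, `show2d`, `showsets`) come from the example URL; their meanings are inferred from the names, and the helper `icn3d_url` is hypothetical.

```python
from urllib.parse import urlencode

# Viewer address taken from the example URL in the abstract.
ICN3D_BASE = "https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html"


def icn3d_url(mmdb_id: str, show_annotations: bool = True,
              show_2d: bool = True, show_sets: bool = True) -> str:
    """Build a shareable iCn3D URL for a structure identifier."""
    params = {"mmdbid": mmdb_id}
    # Each flag is assumed to toggle the panel its name suggests.
    if show_annotations:
        params["showanno"] = 1
    if show_2d:
        params["show2d"] = 1
    if show_sets:
        params["showsets"] = 1
    return f"{ICN3D_BASE}?{urlencode(params)}"


share_url = icn3d_url("1TUP")
print(share_url)
```

With all three flags enabled, the result matches the example URL quoted in the abstract.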

12.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Agarwala, Richa; Bolton, Evan E; Brister, J Rodney; Canese, Kathi; Clark, Karen; Connor, Ryan; Fiorini, Nicolas; Funk, Kathryn; Hefferon, Timothy; Holmes, J Bradley; Kim, Sunghwan; Kimchi, Avi; Kitts, Paul A; Lathrop, Stacy; Lu, Zhiyong; Madden, Thomas L; Marchler-Bauer, Aron; Phan, Lon; Schneider, Valerie A; Schoch, Conrad L; Pruitt, Kim D; Ostell, James.

Nucleic Acids Res ; 47(D1): D23-D28, 2019 01 08.

Article in English | MEDLINE | ID: mdl-30395293

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Labs and a new sequence database search. Resources that were updated in the past year include PubMed, PMC, Bookshelf, genome data viewer, Assembly, prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases, igBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subject(s)

Biotechnology/organization & administration , Databases, Genetic , Animals , Biotechnology/methods , Databases, Chemical , Humans , Software , United States/epidemiology , Web Browser

13.

InterPro in 2019: improving coverage, classification and access to protein sequence annotations.

Mitchell, Alex L; Attwood, Teresa K; Babbitt, Patricia C; Blum, Matthias; Bork, Peer; Bridge, Alan; Brown, Shoshana D; Chang, Hsin-Yu; El-Gebali, Sara; Fraser, Matthew I; Gough, Julian; Haft, David R; Huang, Hongzhan; Letunic, Ivica; Lopez, Rodrigo; Luciani, Aurélien; Madeira, Fabio; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Necci, Marco; Nuka, Gift; Orengo, Christine; Pandurangan, Arun P; Paysan-Lafosse, Typhaine; Pesseat, Sebastien; Potter, Simon C; Qureshi, Matloob A; Rawlings, Neil D; Redaschi, Nicole; Richardson, Lorna J; Rivoire, Catherine; Salazar, Gustavo A; Sangrador-Vegas, Amaia; Sigrist, Christian J A; Sillitoe, Ian; Sutton, Granger G; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Yong, Siew-Yit; Finn, Robert D.

Nucleic Acids Res ; 47(D1): D351-D360, 2019 01 08.

Article in English | MEDLINE | ID: mdl-30398656

ABSTRACT

The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms of new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.

Subject(s)

Databases, Protein , Molecular Sequence Annotation , Animals , Databases, Genetic , Gene Ontology , Humans , Internet , Multigene Family , Protein Domains/genetics , Sequence Homology, Amino Acid , Software , User-Computer Interface

14.

RefSeq: an update on prokaryotic genome annotation and curation.

Haft, Daniel H; DiCuccio, Michael; Badretdin, Azat; Brover, Vyacheslav; Chetvernin, Vyacheslav; O'Neill, Kathleen; Li, Wenjun; Chitsaz, Farideh; Derbyshire, Myra K; Gonzales, Noreen R; Gwadz, Marc; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zheng, Chanjuan; Thibaud-Nissen, Françoise; Geer, Lewis Y; Marchler-Bauer, Aron; Pruitt, Kim D.

Nucleic Acids Res ; 46(D1): D851-D860, 2018 01 04.

Article in English | MEDLINE | ID: mdl-29112715

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule: BlastRules. Continual curation of supporting evidence, and propagation of improved names onto RefSeq proteins ensures that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Subject(s)

Data Curation , Databases, Nucleic Acid , Genome , Molecular Sequence Annotation , Prokaryotic Cells , Archaea/genetics , Bacteria/genetics , Databases, Protein , Eukaryota/genetics , Forecasting , Humans , Sequence Homology , Software , Viruses/genetics

15.

CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H.

Nucleic Acids Res ; 45(D1): D200-D203, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899674

ABSTRACT

NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subject(s)

Computational Biology/methods , Databases, Protein , Protein Interaction Domains and Motifs , Proteins , Information Dissemination , Internet , Proteins/chemistry , Proteins/classification , Proteins/genetics

16.

InterPro in 2017: beyond protein family and domain annotations.

Finn, Robert D; Attwood, Teresa K; Babbitt, Patricia C; Bateman, Alex; Bork, Peer; Bridge, Alan J; Chang, Hsin-Yu; Dosztányi, Zsuzsanna; El-Gebali, Sara; Fraser, Matthew; Gough, Julian; Haft, David; Holliday, Gemma L; Huang, Hongzhan; Huang, Xiaosong; Letunic, Ivica; Lopez, Rodrigo; Lu, Shennan; Marchler-Bauer, Aron; Mi, Huaiyu; Mistry, Jaina; Natale, Darren A; Necci, Marco; Nuka, Gift; Orengo, Christine A; Park, Youngmi; Pesseat, Sebastien; Piovesan, Damiano; Potter, Simon C; Rawlings, Neil D; Redaschi, Nicole; Richardson, Lorna; Rivoire, Catherine; Sangrador-Vegas, Amaia; Sigrist, Christian; Sillitoe, Ian; Smithers, Ben; Squizzato, Silvano; Sutton, Granger; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Xenarios, Ioannis; Yeh, Lai-Su; Young, Siew-Yit; Mitchell, Alex L.

Nucleic Acids Res ; 45(D1): D190-D199, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899635

ABSTRACT

InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.

Subject(s)

Computational Biology/methods , Databases, Protein , Protein Interaction Domains and Motifs , Software , Humans , Molecular Sequence Annotation , Phylogeny

17.

CDD: NCBI's conserved domain database.

Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R; Lu, Shennan; Chitsaz, Farideh; Geer, Lewis Y; Geer, Renata C; He, Jane; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Bryant, Stephen H.

Nucleic Acids Res ; 43(Database issue): D222-6, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25414356

ABSTRACT

NCBI's CDD, the Conserved Domain Database, enters its 15th year as a public resource for the annotation of proteins with the location of conserved domain footprints. Going forward, we strive to improve the coverage and consistency of domain annotation provided by CDD. We maintain a live search system as well as an archive of pre-computed domain annotation for sequences tracked in NCBI's Entrez protein database, which can be retrieved for single sequences or in bulk. We also maintain import procedures so that CDD contains domain models and domain definitions provided by several collections available in the public domain, as well as those produced by an in-house curation effort. The curation effort aims at increasing coverage and providing finer-grained classifications of common protein domains, for which a wealth of functional and structural data has become available. CDD curation generates alignment models of representative sequence fragments, which are in agreement with domain boundaries as observed in protein 3D structure, and which model the structurally conserved cores of domain families as well as annotate conserved features. CDD can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Subject(s)

Databases, Protein , Protein Structure, Tertiary , Amino Acid Motifs , Amino Acid Sequence , Conserved Sequence , Data Curation

18.

cddApp: a Cytoscape app for accessing the NCBI conserved domain database.

Morris, John H; Wu, Allan; Yamashita, Roxanne A; Marchler-Bauer, Aron; Ferrin, Thomas E.

Bioinformatics ; 31(1): 134-6, 2015 Jan 01.

Article in English | MEDLINE | ID: mdl-25212755

ABSTRACT

MOTIVATION: cddApp is a Cytoscape extension that supports the annotation of protein networks with information about domains and specific functional sites from the National Center for Biotechnology Information's conserved domain database (CDD). CDD information is loaded for nodes annotated with NCBI numbers or UniProt identifiers and (optionally) Protein Data Bank structures. cddApp integrates with the Cytoscape apps structureViz2 and enhancedGraphics. Together, these three apps provide powerful tools to annotate nodes with CDD domain and site information and visualize that information in both network and structural contexts. AVAILABILITY AND IMPLEMENTATION: cddApp is written in Java and freely available for download from the Cytoscape app store (http://apps.cytoscape.org). Documentation is provided at http://www.rbvi.ucsf.edu/cytoscape, and the source is publicly available on GitHub at http://github.com/RBVI/cddApp.

Subject(s)

Bacterial Proteins/metabolism , Computational Biology/instrumentation , Metabolic Networks and Pathways , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Software , Algorithms , Bacillus , Bacterial Proteins/chemistry , Conserved Sequence , Databases, Protein , Humans , Protein Conformation , Protein Interaction Mapping

19.

MMDB and VAST+: tracking structural similarities between macromolecular complexes.

Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan; Thiessen, Paul A; Geer, Renata C; Marchler-Bauer, Aron; Bryant, Stephen H.

Nucleic Acids Res ; 42(Database issue): D297-303, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24319143

ABSTRACT

The computational detection of similarities between protein 3D structures has become an indispensable tool for the detection of homologous relationships, the classification of protein families and functional inference. Consequently, numerous algorithms have been developed that facilitate structure comparison, including rapid searches against a steadily growing collection of protein structures. To this end, NCBI's Molecular Modeling Database (MMDB), which is based on the Protein Data Bank (PDB), maintains a comprehensive and up-to-date archive of protein structure similarities computed with the Vector Alignment Search Tool (VAST). These similarities have been recorded on the level of single proteins and protein domains, comprising in excess of 1.5 billion pairwise alignments. Here we present VAST+, an extension to the existing VAST service, which summarizes and presents structural similarity on the level of biological assemblies or macromolecular complexes. VAST+ simplifies structure neighboring results and shows, for macromolecular complexes tracked in MMDB, lists of similar complexes ranked by the extent of similarity. VAST+ replaces the previous VAST service as the default presentation of structure neighboring data in NCBI's Entrez query and retrieval system. MMDB and VAST+ can be accessed via http://www.ncbi.nlm.nih.gov/Structure.

Subject(s)

Databases, Protein , Structural Homology, Protein , Computer Graphics , Internet , Macromolecular Substances/chemistry , Models, Molecular , Software

20.

CDD: conserved domains and protein three-dimensional structure.

Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lanczycki, Christopher J; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Bryant, Stephen H.

Nucleic Acids Res ; 41(Database issue): D348-52, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23197659

ABSTRACT

CDD, the Conserved Domain Database, is part of NCBI's Entrez query and retrieval system and is also accessible via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. CDD provides annotation of protein sequences with the location of conserved domain footprints and functional sites inferred from these footprints. Pre-computed annotation is available via Entrez, and interactive search services accept single protein or nucleotide queries, as well as batch submissions of protein query sequences, utilizing RPS-BLAST to rapidly identify putative matches. CDD incorporates several protein domain and full-length protein model collections, and maintains an active curation effort that aims at providing fine-grained classifications for major and well-characterized protein domain families, as supported by available protein three-dimensional (3D) structure and the published literature. To date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.

Subject(s)

Databases, Protein , Protein Conformation , Protein Structure, Tertiary , Amino Acid Sequence , Conserved Sequence , Internet , Models, Molecular , Molecular Sequence Annotation , Proteins/chemistry , Proteins/classification , Proteins/genetics , Sequence Analysis, Protein
