Results 1 - 20 of 37
1.
Nucleic Acids Res ; 52(D1): D762-D769, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37962425

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.


Subject(s)
Archaea , Bacteria , Databases, Nucleic Acid , Metagenome , Archaea/genetics , Bacteria/genetics , Databases, Nucleic Acid/standards , Databases, Nucleic Acid/trends , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Internet , Molecular Sequence Annotation , Proteins/genetics
2.
Nucleic Acids Res ; 52(D1): D33-D43, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37994677

ABSTRACT

The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.


Subject(s)
Databases, Genetic , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Databases, Nucleic Acid , Internet , United States
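
As an illustration of the E-utilities programming interface mentioned in the abstract above, the following Python sketch builds an ESearch request URL. The helper name `esearch_url`, the example query, and the default `retmax` are illustrative assumptions and do not come from the article. The sketch only constructs the URL; sending the request and handling rate limits are left out.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (hypothetical helper).

    db     -- Entrez database name, e.g. "pubmed"
    term   -- search expression, e.g. "RefSeq[Title]"
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes characters such as "[" and "]".
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("pubmed", "RefSeq[Title]", retmax=5))
```

Keeping URL construction separate from the network call makes the query easy to inspect before any request is sent.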
3.
Article in English | MEDLINE | ID: mdl-36748495

ABSTRACT

The public sequence databases are entrusted with the dual responsibility of providing an accessible archive to all submitters and supporting data reliability and its re-use to all users. Genomes from type materials can act as an unambiguous reference for a taxonomic name and play an important role in comparative genomics, especially for taxon verification or reclassification. The National Center for Biotechnology Information (NCBI) collects and curates information on prokaryotic type strains and genomes from type strains. The average nucleotide identity (ANI)-based quality control processes introduced at NCBI to verify the genomes from type strains and improve related sequence records are detailed here. Using the curated genomes from type strains as reference, the taxonomy of over 1.1 million GenBank genomes was verified, and over 7000 new submissions (before acceptance to GenBank) and over 1800 existing GenBank genomes were reclassified.


Subject(s)
Databases, Nucleic Acid , Fatty Acids , Sequence Analysis, DNA , Reproducibility of Results , RNA, Ribosomal, 16S/genetics , Phylogeny , Base Composition , DNA, Bacterial/genetics , Bacterial Typing Techniques , Fatty Acids/chemistry
4.
Virol J ; 20(1): 31, 2023 02 17.
Article in English | MEDLINE | ID: mdl-36812119

ABSTRACT

BACKGROUND: Since the onset of the SARS-CoV-2 pandemic, bioinformatic analyses have been performed to understand the nucleotide and synonymous codon usage features and mutational patterns of the virus. However, comparatively few have attempted to perform such analyses on a considerably large cohort of viral genomes while organizing the plethora of available sequence data for a month-by-month analysis to observe changes over time. Here, we aimed to perform sequence composition and mutation analysis of SARS-CoV-2, separating sequences by gene, clade, and timepoints, and contrast the mutational profile of SARS-CoV-2 to other comparable RNA viruses. METHODS: Using a cleaned, filtered, and pre-aligned dataset of over 3.5 million sequences downloaded from the GISAID database, we computed nucleotide and codon usage statistics, including calculation of relative synonymous codon usage values. We then calculated codon adaptation index (CAI) changes and a nonsynonymous/synonymous mutation ratio (dN/dS) over time for our dataset. Finally, we compiled information on the types of mutations occurring for SARS-CoV-2 and other comparable RNA viruses, and generated heatmaps showing codon and nucleotide composition at high entropy positions along the Spike sequence. RESULTS: We show that nucleotide and codon usage metrics remain relatively consistent over the 32-month span, though there are significant differences between clades within each gene at various timepoints. CAI and dN/dS values vary substantially between different timepoints and different genes, with Spike gene on average showing both the highest CAI and dN/dS values. Mutational analysis showed that SARS-CoV-2 Spike has a higher proportion of nonsynonymous mutations than analogous genes in other RNA viruses, with nonsynonymous mutations outnumbering synonymous ones by up to 20:1. However, at several specific positions, synonymous mutations were overwhelmingly predominant. 
CONCLUSIONS: Our multifaceted analysis covering both the composition and mutation signature of SARS-CoV-2 gives valuable insight into the nucleotide frequency and codon usage heterogeneity of SARS-CoV-2 over time, and its unique mutational profile compared to other RNA viruses.


Subject(s)
COVID-19 , RNA Viruses , Humans , SARS-CoV-2/genetics , Nucleotides , COVID-19/genetics , Codon , Mutation , Genome, Viral , RNA Viruses/genetics , Evolution, Molecular
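
The relative synonymous codon usage (RSCU) values mentioned in the METHODS above are conventionally defined as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. A minimal Python sketch under that standard definition follows. It is not the authors' pipeline; the function name `rscu`, the use of the standard genetic code, and the example sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequence):
    """Return RSCU per codon for an in-frame coding sequence (hypothetical helper).

    Amino acids that never occur in the sequence are skipped, and stop
    codons are ignored.
    """
    seq = sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)  # count under equal usage
        for c in family:
            values[c] = counts[c] / expected
    return values


if __name__ == "__main__":
    # Four glycine codons: GGT twice, GGC once, GGA once, GGG never.
    print(rscu("GGTGGTGGCGGA")["GGT"])  # 2.0
```

An RSCU above 1 indicates a codon used more often than expected under equal usage, and a value below 1 indicates one used less often.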
5.
STAR Protoc ; 3(3): 101648, 2022 09 16.
Article in English | MEDLINE | ID: mdl-36052345

ABSTRACT

Here, we describe a bioinformatics pipeline that evaluates the interactions between coagulation-related proteins and genetic variants with SARS-CoV-2 proteins. This pipeline searches for host proteins that may bind to viral proteins and identifies and scores the protein genetic variants to predict the disease pathogenesis in specific subpopulations. Additionally, it is able to find structurally similar motifs and identify potential binding sites within the host-viral protein complexes to unveil viral impact on regulated biological processes and/or host-protein impact on viral invasion or reproduction. For complete details on the use and execution of this protocol, please refer to Holcomb et al. (2021).


Subject(s)
COVID-19 , SARS-CoV-2 , Binding Sites , COVID-19/genetics , Host Microbial Interactions , Humans , SARS-CoV-2/genetics , Viral Proteins/genetics
6.
Genome Med ; 13(1): 122, 2021 07 28.
Article in English | MEDLINE | ID: mdl-34321100

ABSTRACT

BACKGROUND: Gene expression is highly variable across tissues of multi-cellular organisms, influencing the codon usage of the tissue-specific transcriptome. Cancer disrupts the gene expression pattern of healthy tissue resulting in altered codon usage preferences. The topic of codon usage changes, as they relate to codon demand and tRNA supply in cancer, is of growing interest. METHODS: We analyzed transcriptome-weighted codon and codon pair usage based on The Cancer Genome Atlas (TCGA) RNA-seq data from 6427 solid tumor samples and 632 normal tissue samples. This dataset represents 32 cancer types affecting 11 distinct tissues. Our analysis focused on tissues that give rise to multiple solid tumor types and cancer types that are present in multiple tissues. RESULTS: We identified distinct patterns of synonymous codon usage changes for different cancer types affecting the same tissue. For example, a substantial increase in GGT-glycine was observed in invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), and mixed invasive ductal and lobular carcinoma (IDLC) of the breast. Change in synonymous codon preference favoring GGT correlated with change in synonymous codon preference against GGC in IDC and IDLC, but not in ILC. Furthermore, we examined the codon usage changes between paired healthy/tumor tissue from the same patient. Using clinical data from TCGA, we conducted a survival analysis of patients based on the degree of change between healthy and tumor-specific codon usage, revealing an association between larger changes and increased mortality. We have also created a database that contains cancer-specific codon and codon pair usage data for cancer types derived from TCGA, which represents a comprehensive tool for codon-usage-oriented cancer research. CONCLUSIONS: Based on data from TCGA, we have highlighted tumor type-specific signatures of codon and codon pair usage. Paired data revealed variable changes to codon usage patterns, which must be considered when designing personalized cancer treatments. The associated database, CancerCoCoPUTs, represents a comprehensive resource for codon and codon pair usage in cancer and is available at https://dnahive.fda.gov/review/cancercocoputs/. These findings are important to understand the relationship between tRNA supply and codon demand in cancer states and could help guide the development of new cancer therapeutics.


Subject(s)
Codon Usage , Codon , Computational Biology/methods , Databases, Genetic , Neoplasms/diagnosis , Neoplasms/genetics , Biomarkers, Tumor , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Genome-Wide Association Study , Genomics/methods , Humans , Kaplan-Meier Estimate , Neoplasms/mortality , Prognosis , Transcriptome
7.
Open Forum Infect Dis ; 8(6): ofab189, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34109257

ABSTRACT

BACKGROUND: The advent of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) provoked researchers to propose multiple antiviral strategies to improve patients' outcomes. Studies provide evidence that cyclosporine A (CsA) decreases SARS-CoV-2 replication in vitro and decreases mortality rates of coronavirus disease 2019 (COVID-19) patients. CsA binds cyclophilins, which isomerize prolines, affecting viral protein activity. METHODS: We investigated the proline composition from various coronavirus proteomes to identify proteins that may critically rely on cyclophilin's peptidyl-proline isomerase activity and found that the nucleocapsid (N) protein significantly depends on cyclophilin A (CyPA). We modeled CyPA and N protein interactions to demonstrate the N protein as a potential indirect therapeutic target of CsA, which we propose may impede coronavirus replication by obstructing nucleocapsid folding. RESULTS: Finally, we analyzed the literature and protein-protein interactions, finding evidence that, by inhibiting CyPA, CsA may impact coagulation proteins and hemostasis. CONCLUSIONS: Despite CsA's promising antiviral characteristics, the interactions between cyclophilins and coagulation factors emphasize risk stratification for COVID patients with thrombosis dispositions.

8.
PLoS Comput Biol ; 17(3): e1008805, 2021 03.
Article in English | MEDLINE | ID: mdl-33730015

ABSTRACT

Thrombosis is a recognized complication of Coronavirus disease of 2019 (COVID-19) and is often associated with poor prognosis. There is a well-recognized link between coagulation and inflammation; however, the extent of thrombotic events associated with COVID-19 warrants further investigation. Poly(A) Binding Protein Cytoplasmic 4 (PABPC4), Serine/Cysteine Proteinase Inhibitor Clade G Member 1 (SERPING1) and Vitamin K epOxide Reductase Complex subunit 1 (VKORC1), which are all proteins linked to coagulation, have been shown to interact with SARS proteins. We computationally examined the interaction of these with SARS-CoV-2 proteins and, in the case of VKORC1, we describe its binding to ORF7a in detail. We examined the occurrence of variants of each of these proteins across populations and interrogated their potential contribution to COVID-19 severity. Potential mechanisms by which some of these variants may contribute to disease are proposed. Some of these variants are prevalent in minority groups that are disproportionally affected by severe COVID-19. Therefore, we are proposing that further investigation around these variants may lead to better understanding of disease pathogenesis in minority groups and more informed therapeutic approaches.


Subject(s)
Blood Coagulation , Blood Proteins/genetics , COVID-19/metabolism , Complement C1 Inhibitor Protein/genetics , Poly(A)-Binding Proteins/genetics , SARS-CoV-2/metabolism , Vitamin K Epoxide Reductases/genetics , Anticoagulants/administration & dosage , Blood Proteins/metabolism , COVID-19/physiopathology , COVID-19/virology , Complement C1 Inhibitor Protein/metabolism , Genome-Wide Association Study , Humans , Models, Molecular , Mutation , Poly(A)-Binding Proteins/metabolism , Protein Binding , SARS-CoV-2/genetics , Severity of Illness Index , Viral Proteins/metabolism , Vitamin K Epoxide Reductases/metabolism , Warfarin/administration & dosage
9.
Nucleic Acids Res ; 49(D1): D1020-D1028, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33270901

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genome, Archaeal/genetics , Genome, Bacterial/genetics , Molecular Sequence Annotation/methods , Proteins/genetics , Data Curation/methods , Data Mining/methods , Genomics/methods , Internet , Proteins/classification , User-Computer Interface
10.
F1000Res ; 9: 174, 2020.
Article in English | MEDLINE | ID: mdl-33014344

ABSTRACT

Ribosome profiling provides the opportunity to evaluate translation kinetics at codon level resolution. Here, we describe ribosome profiling data, generated from two HEK293T cell lines. The ribosome profiling data are composed of Ribo-seq (mRNA sequencing data from ribosome protected fragments) and RNA-seq data (total RNA sequencing). The two HEK293T cell lines each express a version of the F9 gene, both of which are translated into identical proteins in terms of their amino acid sequences. However, these F9 genes vary drastically in their codon usage and predicted mRNA structure. We also provide the pipeline that we used to analyze the data. Further analyzing this dataset holds great potential as it can be used i) to unveil insights into the composition and regulation of the transcriptome, ii) for comparison with other ribosome profiling datasets, iii) to measure the rate of protein synthesis across the proteome and identify differences in elongation rates, iv) to discover previously unidentified translation of peptides, v) to explore the effects of codon usage or codon context in translational kinetics and vi) to investigate cotranslational folding. Importantly, a unique feature of this dataset, compared to other available ribosome profiling data, is the presence of the F9 gene in two very distinct coding sequences.


Subject(s)
Codon/genetics , Factor IX/genetics , Protein Biosynthesis , Ribosomes/genetics , HEK293 Cells , Humans
11.
Sci Rep ; 10(1): 15643, 2020 09 24.
Article in English | MEDLINE | ID: mdl-32973171

ABSTRACT

As the SARS-CoV-2 pandemic is rapidly progressing, the need for the development of an effective vaccine is critical. A promising approach for vaccine development is to generate, through codon pair deoptimization, an attenuated virus. This approach carries the advantage that it only requires limited knowledge specific to the virus in question, other than its genome sequence. Therefore, it is well suited for emerging viruses, for which we may not have extensive data. We performed comprehensive in silico analyses of several features of SARS-CoV-2 genomic sequence (e.g., codon usage, codon pair usage, dinucleotide/junction dinucleotide usage, RNA structure around the frameshift region) in comparison with other members of the Coronaviridae family of viruses, the overall human genome, and the transcriptome of specific human tissues such as lung, which are primarily targeted by the virus. Our analysis identified the spike (S) and nucleocapsid (N) proteins as promising targets for deoptimization and suggests a roadmap for SARS-CoV-2 vaccine development, which can be generalizable to other viruses.


Subject(s)
Betacoronavirus/genetics , Coronavirus Infections/prevention & control , Nucleocapsid Proteins/genetics , Pandemics/prevention & control , Pneumonia, Viral/prevention & control , Spike Glycoprotein, Coronavirus/genetics , Viral Vaccines/immunology , Base Sequence , COVID-19 , COVID-19 Vaccines , Coronavirus Infections/immunology , Coronavirus Nucleocapsid Proteins , Genome, Viral/genetics , Humans , Nucleocapsid Proteins/immunology , Phosphoproteins , SARS-CoV-2 , Spike Glycoprotein, Coronavirus/immunology , Vaccines, Inactivated/immunology , Whole Genome Sequencing
12.
bioRxiv ; 2020 Sep 18.
Article in English | MEDLINE | ID: mdl-32935103

ABSTRACT

Thrombosis has been one of the complications of the Coronavirus disease of 2019 (COVID-19), often associated with poor prognosis. There is a well-recognized link between coagulation and inflammation; however, the extent of thrombotic events associated with COVID-19 warrants further investigation. Poly(A) Binding Protein Cytoplasmic 4 (PABPC4), Serine/Cysteine Proteinase Inhibitor Clade G Member 1 (SERPING1) and Vitamin K epOxide Reductase Complex subunit 1 (VKORC1), which are all proteins linked to coagulation, have been shown to interact with SARS proteins. We computationally examined the interaction of these with SARS-CoV-2 proteins and, in the case of VKORC1, we describe its binding to ORF7a in detail. We examined the occurrence of variants of each of these proteins across populations and interrogated their potential contribution to COVID-19 severity. Potential mechanisms by which some of these variants may contribute to disease are proposed. Some of these variants are prevalent in minority groups that are disproportionally affected by severe COVID-19. Therefore, we are proposing that further investigation around these variants may lead to better understanding of disease pathogenesis in minority groups and more informed therapeutic approaches. AUTHOR SUMMARY: Increased blood clotting, especially in the lungs, is a common complication of COVID-19. Infectious diseases cause inflammation which in turn can contribute to increased blood clotting. However, the extent of clot formation that is seen in the lungs of COVID-19 patients suggests that there may be a more direct link. We identified three human proteins that are involved indirectly in the blood clotting cascade and have been shown to interact with proteins of SARS virus, which is closely related to the novel coronavirus. We examined computationally the interaction of these human proteins with the viral proteins. We looked for genetic variants of these proteins and examined how these variants are distributed across populations. We investigated whether variants of these genes could impact severity of COVID-19. Further investigation around these variants may provide clues for the pathogenesis of COVID-19 particularly in minority groups.

13.
Thromb Haemost ; 120(12): 1668-1679, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32838472

ABSTRACT

Coronavirus disease of 2019 (COVID-19) is the clinical manifestation of the respiratory infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). While primarily recognized as a respiratory disease, it is clear that COVID-19 is a systemic illness impacting multiple organ systems. One defining clinical feature of COVID-19 has been the high incidence of thrombotic events. The underlying processes and risk factors for the occurrence of thrombotic events in COVID-19 remain inadequately understood. While severe bacterial, viral, or fungal infections are well recognized to activate the coagulation system, COVID-19-associated coagulopathy is likely to have unique mechanistic features. Inflammatory-driven processes are likely primary drivers of coagulopathy in COVID-19, but the exact mechanisms linking inflammation to dysregulated hemostasis and thrombosis are yet to be delineated. Cumulative findings of microvascular thrombosis have raised the question of whether the endothelium and microvasculature should be a point of investigative focus. von Willebrand factor (VWF) and its protease, a disintegrin and metalloproteinase with a thrombospondin type 1 motif, member 13 (ADAMTS-13), play an important role in the maintenance of microvascular hemostasis. In inflammatory conditions, imbalanced VWF-ADAMTS-13 characterized by elevated VWF levels and inhibited and/or reduced activity of ADAMTS-13 has been reported. Also, an imbalance between ADAMTS-13 activity and VWF antigen is associated with organ dysfunction and death in patients with systemic inflammation. A thorough understanding of VWF-ADAMTS-13 interactions during early and advanced phases of COVID-19 could help better define the pathophysiology, guide thromboprophylaxis and treatment, and improve clinical prognosis.


Subject(s)
COVID-19/complications , Disseminated Intravascular Coagulation/etiology , Microvessels/pathology , SARS-CoV-2/physiology , Thrombosis/etiology , ADAMTS13 Protein/metabolism , Animals , Blood Coagulation/immunology , Humans , von Willebrand Factor/metabolism
14.
bioRxiv ; 2020 Mar 31.
Article in English | MEDLINE | ID: mdl-32511300

ABSTRACT

As the SARS-CoV-2 pandemic is rapidly progressing, the need for the development of an effective vaccine is critical. A promising approach for vaccine development is to generate, through codon pair deoptimization, an attenuated virus. This approach carries the advantage that it only requires limited knowledge specific to the virus in question, other than its genome sequence. Therefore, it is well suited for emerging viruses for which we may not have extensive data. We performed comprehensive in silico analyses of several features of SARS-CoV-2 genomic sequence (e.g., codon usage, codon pair usage, dinucleotide/junction dinucleotide usage, RNA structure around the frameshift region) in comparison with other members of the Coronaviridae family of viruses, the overall human genome, and the transcriptome of specific human tissues such as lung, which are primarily targeted by the virus. Our analysis identified the spike (S) and nucleocapsid (N) proteins as promising targets for deoptimization and suggests a roadmap for SARS-CoV-2 vaccine development, which can be generalizable to other viruses.

15.
Sci Rep ; 9(1): 15449, 2019 10 29.
Article in English | MEDLINE | ID: mdl-31664102

ABSTRACT

Synonymous codons occur with different frequencies in different organisms, a phenomenon termed codon usage bias. Codon optimization, a common term for a variety of approaches used widely by the biopharmaceutical industry, involves synonymous substitutions to increase protein expression. It had long been presumed that synonymous variants, which, by definition, do not alter the primary amino acid sequence, have no effect on protein structure and function. However, a critical mass of reports suggests that synonymous codon variations may impact protein conformation. To investigate the impact of synonymous codon usage on protein expression and function, we designed an optimized coagulation factor IX (FIX) variant and used multiple methods to compare its properties to the wild-type FIX upon expression in HEK293T cells. We found that the two variants differ in their conformation, even when controlling for the difference in expression levels. Using ribosome profiling, we identified robust changes in the translational kinetics of the two variants and were able to identify a region in the gene that may have a role in altering the conformation of the protein. Our data have direct implications for codon optimization strategies, for production of recombinant proteins and gene therapies.


Subject(s)
Codon , Factor IX/chemistry , Factor IX/genetics , Genetic Therapy , Protein Biosynthesis , Genetic Code , HEK293 Cells , Humans , Protein Conformation
17.
Int J Syst Evol Microbiol ; 68(7): 2386-2392, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29792589

ABSTRACT

Average nucleotide identity analysis is a useful tool to verify taxonomic identities in prokaryotic genomes, for both complete and draft assemblies. Using optimum threshold ranges appropriate for different prokaryotic taxa, we have reviewed all prokaryotic genome assemblies in GenBank with regard to their taxonomic identity. We present the methods used to make such comparisons, the current status of GenBank verifications, and recent developments in confirming species assignments in new genome submissions.


Subject(s)
Databases, Nucleic Acid , Genome, Archaeal , Genome, Bacterial , Nucleotides/genetics , Phylogeny , Base Composition , Prokaryotic Cells , Sequence Analysis, DNA
19.
Nucleic Acids Res ; 46(D1): D851-D860, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112715

ABSTRACT

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Subject(s)
Data Curation , Databases, Nucleic Acid , Genome , Molecular Sequence Annotation , Prokaryotic Cells , Archaea/genetics , Bacteria/genetics , Databases, Protein , Eukaryota/genetics , Forecasting , Humans , Sequence Homology , Software , Viruses/genetics