ABSTRACT
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Subjects
Genes, Human Genome, Molecular Sequence Annotation, Protein Isoforms, Humans, Human Genome/genetics, Molecular Sequence Annotation/standards, Molecular Sequence Annotation/trends, Protein Isoforms/genetics, Human Genome Project, Pseudogenes, RNA/genetics
ABSTRACT
Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE (ref. 1) and RefSeq (ref. 2) launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.
Subjects
Computational Biology, Genetic Databases, Genomics, Genome, Humans, Information Dissemination, Molecular Sequence Annotation, National Library of Medicine (U.S.), United States
ABSTRACT
We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by the National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet the needs of the research community in comparative genome visualization.
Subjects
Genome, Software, Animals, Genome/genetics, Sequence Alignment/methods, Genomics/methods, Algorithms, United States, Humans, Eukaryota/genetics, Genetic Databases, National Library of Medicine (U.S.), Molecular Sequence Annotation/methods
ABSTRACT
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 25 trillion base pairs from over 3.7 billion nucleotide sequences for 557 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include policies for including spatio-temporal metadata, clarified documentation for GenBank data processing, enhanced foreign contamination screening tools, new processes in the Submission Portal, migration of Entrez Genome and Assembly displays into NCBI Datasets, and the impending retirement of tbl2asn, replaced by table2asn.
Subjects
Nucleic Acid Databases, Genomics, Base Sequence, Internet, Humans
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
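As an illustration of how the E-utilities mentioned above can be addressed programmatically, the following minimal sketch builds an ESearch request URL. It is only a sketch under stated assumptions: the helper name esearch_url, the example query string and the retmax value are hypothetical and do not come from the abstract, and no network request is sent.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper; it does not send the request)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query against the Nucleotide database; the search term is an assumption.
url = esearch_url("nucleotide", "BRCA1[Gene Name] AND human[Organism]", retmax=5)
print(url)
```

The resulting URL can be passed to any HTTP client; a real client should also follow NCBI's usage guidelines on request rates and API keys.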
Subjects
Genetic Databases, National Library of Medicine (U.S.), Biotechnology/instrumentation, Nucleic Acid Databases, Internet, United States
ABSTRACT
Eukaryotic genomes contain many nongenic elements that function in gene regulation, chromosome organization, recombination, repair, or replication, and mutation of those elements can affect genome function and cause disease. Although numerous epigenomic studies provide high coverage of gene regulatory regions, those data are not usually exposed in traditional genome annotation and can be difficult to access and interpret without field-specific expertise. The National Center for Biotechnology Information (NCBI) therefore provides RefSeq Functional Elements (RefSeqFEs), which represent experimentally validated human and mouse nongenic elements derived from the literature. The curated data set is comprised of richly annotated sequence records, descriptive records in the NCBI Gene database, reference genome feature annotation, and activity-based interactions between nongenic regions, target genes, and each other. The data set provides succinct functional details and transparent experimental evidence, leverages data from multiple experimental sources, is readily accessible and adaptable, and uses a flexible data model. The data have multiple uses for basic functional discovery, bioinformatics studies, genetic variant interpretation; as known positive controls for epigenomic data evaluation; and as reference standards for functional interactions. Comparisons to other gene regulatory data sets show that the RefSeqFE data set includes a wider range of feature types representing more areas of biology, but it is comparatively smaller and subject to data selection biases. RefSeqFEs thus provide an alternative and complementary resource for experimentally assayed functional elements, with future data set growth expected.
Subjects
Computational Biology, Genome, Animals, Genetic Databases, Eukaryota/genetics, Humans, Mice, Reference Standards
ABSTRACT
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 19.6 trillion base pairs from over 2.9 billion nucleotide sequences for 504 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include resources for data from the SARS-CoV-2 virus, NCBI Datasets, BLAST ClusteredNR, the Submission Portal, table2asn, a Foreign Contamination Screening tool and BioSample.
Subjects
Nucleic Acid Databases, Humans, COVID-19/genetics, Genomics, SARS-CoV-2/genetics
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Genetic Databases, Nucleic Acid Databases, United States, National Library of Medicine (U.S.), Sequence Alignment, Biotechnology, Internet
ABSTRACT
A global international initiative, such as the Earth BioGenome Project (EBP), requires both agreement and coordination on standards to ensure that the collective effort generates rapid progress toward its goals. To this end, the EBP initiated five technical standards committees comprising volunteer members from the global genomics scientific community: Sample Collection and Processing, Sequencing and Assembly, Annotation, Analysis, and IT and Informatics. The current versions of the resulting standards documents are available on the EBP website, with the recognition that opportunities, technologies, and challenges may improve or change in the future, requiring flexibility for the EBP to meet its goals. Here, we describe some highlights from the proposed standards, and areas where additional challenges will need to be met.
Subjects
Base Sequence/genetics, Eukaryota/genetics, Genomics/standards, Animals, Biodiversity, Genomics/methods, Humans, Reference Standards, Reference Values, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards
ABSTRACT
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 15.3 trillion base pairs from over 2.5 billion nucleotide sequences for 504 000 formally described species. Recent updates include resources for data from the SARS-CoV-2 virus, including a SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus and the Submission Portal. We also discuss upcoming changes to GI identifiers, a new data management interface for BioProject, and advice for providing contextual metadata in submissions.
Subjects
Nucleic Acid Databases, Viruses/genetics, Viral Genome, National Library of Medicine (U.S.), SARS-CoV-2/genetics, United States, User-Computer Interface
ABSTRACT
The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Biotechnology/trends, Genetic Databases/trends, Chemical Databases, Nucleic Acid Databases, Protein Databases, Humans, Internet, National Library of Medicine (U.S.), PubMed, United States
ABSTRACT
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 9.9 trillion base pairs from over 2.1 billion nucleotide sequences for 478 000 formally described species. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. Recent updates include new resources for data from the SARS-CoV-2 virus, updates to the NCBI Submission Portal and associated submission wizards for dengue and SARS-CoV-2 viruses, new taxonomy queries for viruses and prokaryotes, and simplified submission processes for EST and GSS sequences.
Subjects
Computational Biology/statistics & numerical data, Nucleic Acid Databases, Genomics/methods, SARS-CoV-2/genetics, DNA Sequence Analysis/methods, Animals, COVID-19/epidemiology, COVID-19/virology, Computational Biology/methods, Humans, Information Storage and Retrieval/methods, Internet, Molecular Sequence Annotation/methods, Pandemics
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI Datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Genetic Databases, National Library of Medicine (U.S.), Computational Biology/methods, Chemical Databases, Nucleic Acid Databases, Protein Databases, Genomics/methods, Humans, PubMed, United States
ABSTRACT
GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains over 6.25 trillion base pairs from over 1.6 billion nucleotide sequences for 450 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include a new version of Genome Workbench that supports GenBank submissions, new submission wizards for viral genomes, enhancements to BankIt and improved handling of taxonomy for sequences from pathogens.
Subjects
Computational Biology/methods, Nucleic Acid Databases, Genomics/methods, Software, Molecular Sequence Annotation, National Institutes of Health (U.S.), United States, Web Browser
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface, a sequence database search and a gene orthologs page. Additional resources that were updated in the past year include PMC, Bookshelf, My Bibliography, Assembly, RefSeq, viral genomes, the prokaryotic genome annotation pipeline, Genome Workbench, dbSNP, BLAST, Primer-BLAST, IgBLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Computational Biology/methods, Computational Biology/organization & administration, Genetic Databases, National Library of Medicine (U.S.), Nucleic Acid Databases, Genomics/methods, Humans, PubMed, United States, Web Browser
ABSTRACT
Recoding of stop codons as amino acid-specifying codons is a co-translational event that enables C-terminal extension of a protein. Synthesis of selenoproteins requires recoding of internal UGA stop codons to the 21st non-standard amino acid selenocysteine (Sec) and plays a vital role in human health and disease. Separately, canonical stop codons can be recoded to specify standard amino acids in a process known as stop codon readthrough (SCR), producing extended protein isoforms with potential novel functions. Conventional computational tools cannot distinguish between the dual functionality of stop codons as stop signals and sense codons, resulting in misannotation of selenoprotein gene products and failure to predict SCR. Manual curation is therefore required to correctly represent recoded gene products and their functions. Our goal was to provide accurately curated and annotated datasets of selenoprotein and SCR transcript and protein records to serve as annotation standards and to promote basic and biomedical research. Gene annotations were curated in nine vertebrate model organisms and integrated into NCBI's Reference Sequence (RefSeq) dataset, resulting in 247 selenoprotein genes encoding 322 selenoproteins, and 93 genes exhibiting SCR encoding 94 SCR isoforms.
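The annotation problem described above can be illustrated with a toy sketch: a conventional translator stops at the first in-frame UGA, whereas a recoding-aware translator reads a curated UGA position as selenocysteine (one-letter code U). The sequence, the deliberately tiny codon table and the sec_positions parameter are all hypothetical and chosen for illustration only; this is not the RefSeq curation method.

```python
# Toy codon table covering only the codons used in this example (an assumption,
# not the full standard genetic code).
CODON_TABLE = {"AUG": "M", "GCU": "A", "AAA": "K", "UGA": "*", "UAA": "*"}


def translate(mrna, sec_positions=frozenset()):
    """Translate codon by codon; a UGA whose codon index is in sec_positions
    is read as selenocysteine ('U') instead of terminating translation."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3]
        if codon == "UGA" and start // 3 in sec_positions:
            protein.append("U")  # recoded UGA: insert Sec and keep going
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":
            break  # ordinary stop codon: terminate
        protein.append(residue)
    return "".join(protein)


mrna = "AUGGCUUGAAAAUAA"  # hypothetical mRNA: AUG GCU UGA AAA UAA
conventional = translate(mrna)               # truncated at the internal UGA
recoded = translate(mrna, sec_positions={2}) # UGA at codon index 2 read as Sec
print(conventional, recoded)
```

In this sketch the conventional call yields a truncated product, which mirrors the misannotation the abstract describes, while supplying the curated position recovers the full-length selenoprotein.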
Subjects
Terminator Codon, Data Curation, Genetic Databases, Molecular Sequence Annotation, Selenoproteins/genetics, Vertebrates/genetics, Animals, Cattle, Humans, Mice, Proteome, Rats
ABSTRACT
GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for 420 000 formally described species. Most GenBank submissions are made using BankIt, the NCBI Submission Portal, or the tool tbl2asn, and are obtained from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Nucleotide database, which links to related information such as taxonomy, genomes, protein sequences and structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. Recent updates include an expansion of sequence identifier formats to accommodate expected database growth, submission wizards for ribosomal RNA, and the transfer of Expressed Sequence Tag (EST) and Genome Survey Sequence (GSS) data into the Nucleotide database.
Subjects
Nucleic Acid Databases, Web Browser, Computational Biology/methods, Nucleic Acid Databases/trends, Genomics/methods, Humans, Information Storage and Retrieval, Software Design
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Labs and a new sequence database search. Resources that were updated in the past year include PubMed, PMC, Bookshelf, Genome Data Viewer, Assembly, prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Biotechnology/organization & administration, Genetic Databases, Animals, Biotechnology/methods, Chemical Databases, Humans, Software, United States/epidemiology, Web Browser
ABSTRACT
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
Subjects
Contig Mapping/methods, Human Genome, Genomics/methods, DNA Sequence Analysis/methods, Software, Contig Mapping/standards, Genomics/standards, Haploidy, Haplotypes, Humans, Genetic Polymorphism, Reference Standards, DNA Sequence Analysis/standards
ABSTRACT
GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for 400 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun and environmental sampling projects. Most submissions are made using BankIt, the National Center for Biotechnology Information (NCBI) Submission Portal, or the tool tbl2asn. GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Nucleotide database, which links to related information such as taxonomy, genomes, protein sequences and structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. Recent updates include changes to sequence identifiers, submission wizards for 16S and Influenza sequences, and an Identical Protein Groups resource.