ABSTRACT
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
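As a rough illustration of the E-utilities as a programming interface, the sketch below builds an ESearch request URL without sending it. The base URL and the `db`, `term`, `retmax`, and `retmode` parameters follow the public E-utilities conventions rather than anything stated in the abstract, and the helper name `esearch_url` and the search term are hypothetical.

```python
from urllib.parse import urlencode

# Public E-utilities base URL (an assumption here; the abstract does not give it).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch request URL. No network call is made."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query against the PubMed database.
url = esearch_url("pubmed", "RefSeq functional elements")
print(url)
```

The returned URL can be fetched with any HTTP client; the choice of `db` selects which of the NCBI databases is searched.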
Subject(s)
Databases, Genetic , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Databases, Nucleic Acid , Internet , United States
ABSTRACT
Eukaryotic genomes contain many nongenic elements that function in gene regulation, chromosome organization, recombination, repair, or replication, and mutation of those elements can affect genome function and cause disease. Although numerous epigenomic studies provide high coverage of gene regulatory regions, those data are not usually exposed in traditional genome annotation and can be difficult to access and interpret without field-specific expertise. The National Center for Biotechnology Information (NCBI) therefore provides RefSeq Functional Elements (RefSeqFEs), which represent experimentally validated human and mouse nongenic elements derived from the literature. The curated data set comprises richly annotated sequence records, descriptive records in the NCBI Gene database, reference genome feature annotation, and activity-based interactions between nongenic regions, target genes, and each other. The data set provides succinct functional details and transparent experimental evidence, leverages data from multiple experimental sources, is readily accessible and adaptable, and uses a flexible data model. The data have multiple uses: for basic functional discovery, bioinformatics studies, and genetic variant interpretation; as known positive controls for epigenomic data evaluation; and as reference standards for functional interactions. Comparisons to other gene regulatory data sets show that the RefSeqFE data set includes a wider range of feature types representing more areas of biology, but it is comparatively smaller and subject to data selection biases. RefSeqFEs thus provide an alternative and complementary resource for experimentally assayed functional elements, with future data set growth expected.
Subject(s)
Computational Biology , Genome , Animals , Databases, Genetic , Eukaryota/genetics , Humans , Mice , Reference Standards
ABSTRACT
The 'Human Immunodeficiency Virus Type 1 (HIV-1), Human Interaction Database', available through the National Library of Medicine at http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses/hiv-1/interactions, serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets. Each HIV-1 human protein interaction can be retrieved without restriction by web-based downloads and ftp protocols and includes: Reference Sequence (RefSeq) protein accession numbers, National Center for Biotechnology Information Gene identification numbers, brief descriptions of the interactions, searchable keywords for interactions and PubMed identification numbers (PMIDs) of journal articles describing the interactions. In addition to specific HIV-1 protein-human protein interactions, the database includes the effects on HIV-1 replication that result when expression of individual human genes is blocked using small interfering RNA (siRNA). A total of 3142 human genes are described as participating in 12,786 protein-protein interactions, along with 1316 replication interactions described for 1250 human genes identified using siRNA. Together, the data identify 4006 human genes involved in 14,102 interactions. With the inclusion of siRNA interactions, we introduce a redesigned web interface to enhance viewing, filtering and downloading of the combined data set.
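The totals quoted above can be checked with a few lines of arithmetic. The interaction counts add up directly. The gene overlap is only an inference: it assumes the 4006 figure is the union of the two gene sets, which the abstract implies but does not state.

```python
# Figures quoted in the abstract.
ppi_genes = 3142            # human genes with protein-protein interactions
ppi_interactions = 12_786
sirna_genes = 1250          # human genes identified in siRNA screens
sirna_interactions = 1316
total_genes = 4006
total_interactions = 14_102

# The two interaction counts sum to the reported total.
interactions_sum = ppi_interactions + sirna_interactions

# Inclusion-exclusion, ASSUMING total_genes is the union of both gene sets:
# |A and B| = |A| + |B| - |A or B|
genes_in_both = ppi_genes + sirna_genes - total_genes

print(interactions_sum == total_interactions)
print(genes_in_both)
```

Under that assumption, 386 genes would appear in both the protein-protein and the siRNA replication data.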
Subject(s)
Databases, Genetic , HIV-1/metabolism , Human Immunodeficiency Virus Proteins/metabolism , HIV-1/genetics , HIV-1/physiology , Humans , Internet , Protein Interaction Mapping , RNA, Small Interfering/metabolism , Virus Replication
ABSTRACT
The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.
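Because Gene records are keyed by integer identifiers and exposed through the E-utilities, a summary request can be sketched as below. The ESummary endpoint and its `db`, `id`, and `retmode` parameters follow the public E-utilities conventions, not this abstract. The helper name `gene_esummary_url` is hypothetical and the two IDs are sample values only.

```python
from typing import Iterable
from urllib.parse import urlencode

# Public E-utilities base URL (an assumption; not given in the abstract).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_esummary_url(gene_ids: Iterable[int]) -> str:
    """Build an ESummary URL for one or more integer Gene IDs. No network call."""
    ids = list(gene_ids)
    if not ids:
        raise ValueError("at least one Gene ID is required")
    for gene_id in ids:
        # Gene identifiers are integers; reject anything else (bool included).
        if isinstance(gene_id, bool) or not isinstance(gene_id, int) or gene_id <= 0:
            raise ValueError(f"not a positive integer Gene ID: {gene_id!r}")
    params = {"db": "gene", "id": ",".join(str(i) for i in ids), "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


gene_url = gene_esummary_url([7157, 672])  # illustrative IDs
print(gene_url)
```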
Subject(s)
Databases, Genetic , Genes , Genetic Variation , Genomics , Internet , National Library of Medicine (U.S.) , Phenotype , United States
ABSTRACT
The National Institutes of Health Genetic Testing Registry (GTR; available online at http://www.ncbi.nlm.nih.gov/gtr/) maintains comprehensive information about testing offered worldwide for disorders with a genetic basis. Information is voluntarily submitted by test providers. The database provides details of each test (e.g. its purpose, target populations, methods, what it measures, analytical validity, clinical validity, clinical utility, ordering information) and laboratory (e.g. location, contact information, certifications and licenses). Each test is assigned a stable identifier of the format GTR000000000, which is versioned when the submitter updates information. Data submitted by test providers are integrated with basic information maintained in National Center for Biotechnology Information's databases and presented on the web and through FTP (ftp.ncbi.nih.gov/pub/GTR/_README.html).
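The identifier format described above lends itself to a small validator. The sketch below accepts `GTR` followed by nine digits. It also accepts an optional `.N` version suffix, which is an assumption about how versioning is written: the abstract says identifiers are versioned but does not give the syntax. The names `GtrId` and `parse_gtr_id` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class GtrId(NamedTuple):
    accession: str           # e.g. "GTR000000000"
    version: Optional[int]   # None when no version suffix is present


# "GTR" + nine digits, then an optional ".<version>" (the suffix form is assumed).
_GTR_RE = re.compile(r"^(GTR\d{9})(?:\.(\d+))?$")


def parse_gtr_id(text: str) -> GtrId:
    """Split a GTR test identifier into accession and optional version."""
    match = _GTR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a GTR identifier: {text!r}")
    version = int(match.group(2)) if match.group(2) else None
    return GtrId(match.group(1), version)


print(parse_gtr_id("GTR000000000"))
print(parse_gtr_id("GTR000000000.2"))
```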
Subject(s)
Databases, Genetic , Genetic Testing , Registries , Genes , Genetic Variation , Humans , Internet , Phenotype
ABSTRACT
Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequence datasets as input. We compare and characterize such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.
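The kind of discrepancy described above can be tallied with a simple column-wise comparison of two consensus genomes. This is a minimal sketch, not the study's method. It assumes both sequences have already been aligned to the same coordinates with `-` marking gaps, and the function name, category labels, and toy sequences are hypothetical.

```python
from collections import Counter


def compare_aligned(seq_a: str, seq_b: str) -> Counter:
    """Count column-wise differences between two pre-aligned consensus sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts: Counter = Counter()
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == base_b:
            continue                      # identical column, including gap/gap
        if base_a == "-":
            counts["gap_only_in_a"] += 1  # indel: base present only in b
        elif base_b == "-":
            counts["gap_only_in_b"] += 1  # indel: base present only in a
        elif "N" in (base_a, base_b):
            counts["ambiguous"] += 1      # one pipeline masked the position
        else:
            counts["snp"] += 1            # two different called bases
    return counts


# Toy sequences standing in for two pipelines' consensus output.
diffs = compare_aligned("ACGT-ACGTN", "ACCTTAC-TA")
print(dict(diffs))
```

Counting masked (`N`) positions separately keeps low-coverage masking from being reported as nucleotide changes.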
Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/epidemiology , Pandemics , Workflow , Computational Biology
ABSTRACT
During the COVID-19 pandemic, SARS-CoV-2 surveillance efforts integrated genome sequencing of clinical samples to identify emergent viral variants and to support rapid experimental examination of genome-informed vaccine and therapeutic designs. Given the broad range of methods applied to generate new viral genomes, it is critical that consensus and variant calling tools yield consistent results across disparate pipelines. Here we examine the impact of sequencing technologies (Illumina and Oxford Nanopore) and 7 different downstream bioinformatic protocols on SARS-CoV-2 variant calling as part of the NIH Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) Tracking Resistance and Coronavirus Evolution (TRACE) initiative, a public-private partnership established to address the COVID-19 outbreak. Our results indicate that bioinformatic workflows can yield consensus genomes with different single nucleotide polymorphisms, insertions, and/or deletions even when using the same raw sequence input datasets. We introduce the use of a specific suite of parameters and protocols that greatly improves the agreement among pipelines developed by diverse organizations. Such consistency among bioinformatic pipelines is fundamental to SARS-CoV-2 and future pathogen surveillance efforts. The application of analysis standards is necessary to more accurately document phylogenomic trends and support data-driven public health responses.
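Agreement among pipelines can be quantified in many ways; the abstract does not say which measure was used. The sketch below uses Jaccard similarity between per-pipeline variant call sets as one plausible choice. The pipeline names and variant lists are illustrative inputs, not results from the study.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared calls divided by all calls; two empty sets count as full agreement."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(calls: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of pipelines."""
    return {
        (p, q): jaccard(calls[p], calls[q])
        for p, q in combinations(sorted(calls), 2)
    }


# Hypothetical variant calls for one sample from three pipelines.
calls = {
    "pipeline_a": {"C241T", "C3037T", "A23403G"},
    "pipeline_b": {"C241T", "C3037T", "A23403G"},
    "pipeline_c": {"C241T", "A23403G", "C14408T"},
}
agreement = pairwise_agreement(calls)
for pair, score in agreement.items():
    print(pair, score)
```

A score of 1.0 means two pipelines reported identical variants for the sample; lower scores flag pairs whose parameters may need harmonizing.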
ABSTRACT
The 'Human Immunodeficiency Virus Type 1 (HIV-1), Human Protein Interaction Database', available through the National Library of Medicine at www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions, was created to catalog all interactions between HIV-1 and human proteins published in the peer-reviewed literature. The database serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets. To facilitate this discovery approach, the following information for each HIV-1 human protein interaction is provided and can be retrieved without restriction by web-based downloads and ftp protocols: Reference Sequence (RefSeq) protein accession numbers, Entrez Gene identification numbers, brief descriptions of the interactions, searchable keywords for interactions and PubMed identification numbers (PMIDs) of journal articles describing the interactions. Currently, 2589 unique HIV-1 to human protein interactions and 5135 brief descriptions of the interactions, with a total of 14,312 PMID references to the original articles reporting the interactions, are stored in this growing database. In addition, all protein-protein interactions documented in the database are integrated into Entrez Gene records and listed in the 'HIV-1 protein interactions' section of Entrez Gene reports. The database is also tightly linked to other databases through Entrez Gene, enabling users to search for an abundance of information related to HIV pathogenesis and replication.
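The fields listed above suggest a simple record structure. The sketch below is one hypothetical way to model a downloaded interaction and filter by keyword. The class name, field names, and placeholder values are all assumptions, not the database's actual schema or data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HivHumanInteraction:
    hiv_protein_accession: str    # RefSeq protein accession (HIV-1 side)
    human_protein_accession: str  # RefSeq protein accession (human side)
    human_gene_id: int            # Entrez Gene identifier
    description: str              # brief description of the interaction
    keywords: Tuple[str, ...]     # searchable interaction keywords
    pmids: Tuple[int, ...]        # PubMed IDs of supporting articles


def find_by_keyword(
    records: List[HivHumanInteraction], keyword: str
) -> List[HivHumanInteraction]:
    """Return records tagged with the keyword (case-insensitive exact match)."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]


# Placeholder records; accessions, IDs, and PMIDs are dummy values.
records = [
    HivHumanInteraction("NP_000000.0", "NP_000001.0", 1, "placeholder A", ("binds",), (1,)),
    HivHumanInteraction("NP_000000.0", "NP_000002.0", 2, "placeholder B", ("upregulates",), (2, 3)),
]
hits = find_by_keyword(records, "Binds")
print(len(hits))
```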
Subject(s)
Databases, Protein , HIV-1/metabolism , Protein Interaction Mapping , Viral Proteins/metabolism , Acquired Immunodeficiency Syndrome/virology , Computer Graphics , Humans , Proteins/metabolism , gag Gene Products, Human Immunodeficiency Virus/metabolism
ABSTRACT
Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community, at once validating submissions while also augmenting sample metadata with reliable, searchable, taxonomic terms.
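To make the MinHash idea concrete, the sketch below estimates the Jaccard similarity of two sequences' k-mer sets from small "bottom-s" sketches. It is a generic illustration, not STAT's implementation. The k-mer length, sketch size, hash function, and all function names are assumptions chosen for brevity.

```python
import hashlib
from typing import FrozenSet, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash64(kmer: str) -> int:
    """Stable 64-bit hash of a k-mer (BLAKE2b truncated to 8 bytes)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minhash_sketch(seq: str, k: int = 8, size: int = 64) -> FrozenSet[int]:
    """Bottom-s sketch: keep only the `size` smallest k-mer hashes."""
    return frozenset(sorted(_hash64(km) for km in kmers(seq, k))[:size])


def estimate_jaccard(a: FrozenSet[int], b: FrozenSet[int], size: int = 64) -> float:
    """Estimate k-mer Jaccard similarity from two bottom-s sketches."""
    merged = sorted(a | b)[:size]        # smallest hashes of the combined sketches
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


# Toy inputs; real reads would be far longer.
same = estimate_jaccard(minhash_sketch("ACGTACGTTAGCCGATAGGCT"),
                        minhash_sketch("ACGTACGTTAGCCGATAGGCT"))
disjoint = estimate_jaccard(minhash_sketch("A" * 50), minhash_sketch("C" * 50))
print(same, disjoint)
```

Because only a fixed number of hashes is kept per sequence, comparison cost is independent of input size, which is what makes sketch-based approaches attractive at archive scale.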
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , DNA Contamination , Humans , Metagenomics/methods , SARS-CoV-2/genetics
ABSTRACT
Although many interactions between HIV-1 and human proteins have been reported in the scientific literature, no publicly accessible source for efficiently reviewing this information was available. Therefore, a project was initiated in an attempt to catalogue all published interactions between HIV-1 and human proteins. HIV-related articles in PubMed were used to develop a database containing names, Entrez GeneIDs, and RefSeq protein accession numbers of interacting proteins. Furthermore, brief descriptions of the interactions, PubMed identification numbers of articles describing the interactions, and keywords for searching the interactions were incorporated. Over 100,000 articles were reviewed, resulting in the identification of 1448 human proteins that interact with HIV-1 comprising 2589 unique HIV-1-to-human protein interactions. Preliminary analysis of the extracted data indicates 32% were direct physical interactions (e.g., binding) and 68% were indirect interactions (e.g., upregulation through activation of signaling pathways). Interestingly, 37% of human proteins in the database were found to interact with more than one HIV-1 protein. For example, the signaling protein mitogen-activated protein kinase 1 has a surprising range of interactions with 10 different HIV-1 proteins. Moreover, large numbers of interactions were published for the HIV-1 regulatory protein Tat and envelope proteins: 30% and 33% of total interactions identified, respectively. The database is accessible at http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/ and is cross-linked to other National Center for Biotechnology Information databases and programs via Entrez Gene. This database represents a unique and continuously updated scientific resource for understanding HIV-1 replication and pathogenesis to assist in accelerating the development of effective therapeutic and vaccine interventions.