ABSTRACT
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, SciENcv, the NIH Comparative Genomics Resource (CGR), NCBI Virus, SRA, RefSeq, foreign contamination screening tools, Taxonomy, iCn3D, ClinVar, GTR, MedGen, dbSNP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Genetic Databases , National Library of Medicine (U.S.) , Biotechnology/instrumentation , Nucleic Acid Databases , Internet , United States
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Genetic Databases , Nucleic Acid Databases , United States , National Library of Medicine (U.S.) , Sequence Alignment , Biotechnology , Internet
ABSTRACT
The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Subjects
Biotechnology/trends , Genetic Databases/trends , Chemical Databases , Nucleic Acid Databases , Protein Databases , Humans , Internet , National Library of Medicine (U.S.) , PubMed , United States
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
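As a minimal sketch of what using the E-utilities as a programming interface can look like, the snippet below builds an ESearch request URL. The helper name `esearch_url` and its defaults are illustrative assumptions and are not taken from the abstract. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public base address of the E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper; builds the URL only)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: a PubMed search for records mentioning iCn3D.
print(esearch_url("pubmed", "iCn3D", retmax=5))
```

The returned URL can be passed to any HTTP client. NCBI's usage guidelines on request rates and API keys apply to real queries.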
Subjects
Genetic Databases , National Library of Medicine (U.S.) , Computational Biology/methods , Chemical Databases , Nucleic Acid Databases , Protein Databases , Genomics/methods , Humans , PubMed , United States
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface, a sequence database search and a gene orthologs page. Additional resources that were updated in the past year include PMC, Bookshelf, My Bibliography, Assembly, RefSeq, viral genomes, the prokaryotic genome annotation pipeline, Genome Workbench, dbSNP, BLAST, Primer-BLAST, IgBLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Computational Biology/methods , Computational Biology/organization & administration , Genetic Databases , National Library of Medicine (U.S.) , Nucleic Acid Databases , Genomics/methods , Humans , PubMed , United States , Web Browser
ABSTRACT
MOTIVATION: Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI's genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants. RESULTS: The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the 'Contextual Allele'. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique 'Canonical Allele' and is used directly to aggregate variants across congruent sequences. AVAILABILITY AND IMPLEMENTATION: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
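The four-attribute model described above can be illustrated with a short sketch. It is not NCBI's implementation: the class and function names are hypothetical, the position is assumed to be a 0-based offset into the reference, and only deletions written out as explicit sequence are handled. It does not perform the overprecision correction or the projection across congruent sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    sequence: str   # reference sequence identifier, e.g. an accession
    position: int   # assumed 0-based start of the deleted subsequence
    deletion: str   # reference bases removed at that position
    insertion: str  # bases written in their place


def parse_spdi(text):
    """Split 'sequence:position:deletion:insertion' into its four attributes."""
    sequence, position, deletion, insertion = text.rsplit(":", 3)
    return Spdi(sequence, int(position), deletion, insertion)


def apply_spdi(reference, variant):
    """Return the reference string with the deletion replaced by the insertion."""
    end = variant.position + len(variant.deletion)
    if reference[variant.position:end] != variant.deletion:
        raise ValueError("deletion does not match the reference at this position")
    return reference[:variant.position] + variant.insertion + reference[end:]


# Toy example on a made-up four-base reference named "toy".
variant = parse_spdi("toy:2:G:TT")
print(apply_spdi("ACGT", variant))  # ACTTT
```

Checking the deleted bases against the reference before applying the change mirrors the point made in the abstract: the variant names exactly the reference subsequence it affects.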
Subjects
Genetic Databases , Genomics , Algorithms , Genome , Controlled Vocabulary
ABSTRACT
MOTIVATION: Build a web-based 3D molecular structure viewer focusing on interactive structural analysis. RESULTS: iCn3D (I-see-in-3D) can simultaneously show 3D structure, 2D molecular contacts and 1D protein and nucleotide sequences through an integrated sequence/annotation browser. Pre-defined and arbitrary molecular features can be selected in any of the 1D/2D/3D windows as sets of residues and these selections are synchronized dynamically in all displays. Biological annotations such as protein domains, single nucleotide variations, etc. can be shown as tracks in the 1D sequence/annotation browser. These customized displays can be shared with colleagues or publishers via a simple URL. iCn3D can display structure-structure alignments obtained from NCBI's VAST+ service. It can also display the alignment of a sequence with a structure as identified by BLAST, and thus relate 3D structure to a large fraction of all known proteins. iCn3D can also display electron density maps or electron microscopy (EM) density maps, and export files for 3D printing. The following example URL exemplifies some of the 1D/2D/3D representations: https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html?mmdbid=1TUP&showanno=1&show2d=1&showsets=1. AVAILABILITY AND IMPLEMENTATION: iCn3D is freely available to the public. Its source code is available at https://github.com/ncbi/icn3d. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
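The shareable-URL idea can be sketched by rebuilding the example link from its query parameters. The parameter names (`mmdbid`, `showanno`, `show2d`, `showsets`) are taken from the URL quoted in the abstract. The helper name and its default flag values are illustrative assumptions, and other parameters the viewer may accept are not covered.

```python
from urllib.parse import urlencode

# Viewer page taken from the example URL in the abstract.
ICN3D_PAGE = "https://www.ncbi.nlm.nih.gov/Structure/icn3d/full.html"


def icn3d_url(mmdbid, showanno=1, show2d=1, showsets=1):
    """Build a shareable iCn3D link (hypothetical helper)."""
    params = {
        "mmdbid": mmdbid,
        "showanno": showanno,  # 1D sequence/annotation browser
        "show2d": show2d,      # 2D molecular contacts
        "showsets": showsets,  # defined selection sets
    }
    return f"{ICN3D_PAGE}?{urlencode(params)}"


print(icn3d_url("1TUP"))
```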
Subjects
Base Sequence , Computational Biology , Internet , Molecular Models , Proteins , Software , Computational Biology/methods , Genetic Databases , Molecular Conformation , Proteins/chemistry
ABSTRACT
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 38 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Labs and a new sequence database search. Resources that were updated in the past year include PubMed, PMC, Bookshelf, genome data viewer, Assembly, prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases, igBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Biotechnology/organization & administration , Genetic Databases , Animals , Biotechnology/methods , Chemical Databases , Humans , Software , United States/epidemiology , Web Browser
ABSTRACT
The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research. These tasks increasingly rely on accessing relevant manually curated information from domain databases (e.g. SwissProt or ClinVar). However, due to the sheer volume of medical literature and high cost of expert curation, curated variant information in existing databases is often incomplete and out-of-date. In addition, the same genetic variant can be mentioned in publications with various names (e.g. 'A146T' versus 'c.436G>A' versus 'rs121913527'). A search in PubMed using only one name usually cannot retrieve all relevant articles for the variant of interest. Hence, to help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar for the search and retrieval of standardized variant information. In addition, LitVar uses advanced text mining techniques to compute and extract relationships between variants and other associated entities such as diseases and chemicals/drugs. LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.
Subjects
Data Curation/methods , Data Mining/methods , Single Nucleotide Polymorphism , Search Engine , User-Computer Interface , Medical Genetics , Human Genome , Genomics/methods , Humans , Internet , PubMed , Semantics
ABSTRACT
Motivation: Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results: We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ~90% and compares favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact: zhiyong.lu@nih.gov.
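For readers unfamiliar with the F-measure reported above, the sketch below computes it from precision and recall using the standard definition. The abstract does not report separate precision and recall, so the input values are purely illustrative and are not results from the study.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values chosen only to illustrate the calculation.
print(round(f_measure(0.92, 0.88), 3))
```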
Subjects
Data Mining/methods , Mutation , Genetic Polymorphism , Precision Medicine/methods , Software , Data Curation , Factual Databases , Genetic Predisposition to Disease , Genomics/methods , Humans , Phenotype , PubMed , Publications
ABSTRACT
BACKGROUND: Prioritization of sequence variants for diagnosis and discovery of Mendelian diseases is challenging, especially in large collections of whole genome sequences (WGS). Fast, scalable solutions are needed for discovery research, for clinical applications, and for curation of massive public variant repositories such as dbSNP and gnomAD. In response, we have developed VVP, the VAAST Variant Prioritizer. VVP is ultrafast, scales to even the largest variant repositories and genome collections, and its outputs are designed to simplify clinical interpretation of variants of uncertain significance. RESULTS: We show that scoring the entire contents of dbSNP (> 155 million variants) requires only 95 min using a machine with 4 cpus and 16 GB of RAM, and that a 60X WGS can be processed in less than 5 min. We also demonstrate that VVP can score variants anywhere in the genome, regardless of type, effect, or location. It does so by integrating sequence conservation, the type of sequence change, allele frequencies, variant burden, and zygosity. Finally, we also show that VVP scores are consistently accurate, and easily interpreted, traits not shared by many commonly used tools such as SIFT and CADD. CONCLUSIONS: VVP provides rapid and scalable means to prioritize any sequence variant, anywhere in the genome, and its scores are designed to facilitate variant interpretation using ACMG and NHS guidelines. These traits make it well suited for operation on very large collections of WGS sequences.
Subjects
Computational Biology/methods , Genetic Variation , Human Genome , Software , Genetic Databases , Humans , Single Nucleotide Polymorphism/genetics , ROC Curve , Time Factors , Whole Genome Sequencing , Zygote/metabolism
ABSTRACT
MOTIVATION: Genetic variants in drug targets and metabolizing enzymes often have important functional implications, including altering the efficacy and toxicity of drugs. Identifying single nucleotide variants (SNVs) that contribute to differences in drug response and understanding their underlying mechanisms are fundamental to successful implementation of the precision medicine model. This work reports an effort to collect, classify and analyze SNVs that may affect the optimal response to currently approved drugs. RESULTS: An integrated approach was taken involving data mining across multiple information resources including databases containing drugs, drug targets, chemical structures, protein-ligand structure complexes, genetic and clinical variations as well as protein sequence alignment tools. We obtained 2640 SNVs of interest, most of which occur rarely in populations (minor allele frequency < 0.01). Clinical significance of only 9.56% of the SNVs is known in ClinVar, although 79.02% are predicted as deleterious. The examples here demonstrate that even if the mapped SNVs predicted as deleterious may not result in significant structural modifications, they can plausibly modify the protein-drug interactions, affecting selectivity and drug-binding affinity. Our analysis identifies potentially deleterious SNVs present on drug-binding residues that are relevant for further studies in the context of precision medicine. AVAILABILITY AND IMPLEMENTATION: Data are available from Supplementary information file. CONTACT: yanli.wang@nih.gov. SUPPLEMENTARY INFORMATION: Supplementary Tables S1-S5 are available at Bioinformatics online.
Subjects
Data Mining/methods , Single Nucleotide Polymorphism , Protein Binding/genetics , Protein Sequence Analysis/methods , Binding Sites , Gene Frequency , Humans , Precision Medicine/methods , DNA Sequence Analysis/methods
ABSTRACT
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Databases as Topic , Genetic Databases , Protein Databases , Gene Expression , Genomics , Internet , Molecular Models , National Library of Medicine (U.S.) , Periodicals as Topic , PubMed , Sequence Alignment , DNA Sequence Analysis , Protein Sequence Analysis , RNA Sequence Analysis , Small Molecule Libraries , United States
ABSTRACT
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Subjects
Genetic Databases , Protein Databases , Gene Expression , Genomics , National Library of Medicine (U.S.) , Tertiary Protein Structure , PubMed , Sequence Alignment , DNA Sequence Analysis , RNA Sequence Analysis , Software , Systems Integration , United States
ABSTRACT
Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.
Subjects
Molecular Cloning/methods , Computational Biology/methods , Complementary DNA/genetics , Gene Library , Genes/genetics , Mammals/genetics , Animals , DNA/biosynthesis , Humans , Mice , National Institutes of Health (U.S.) , Rats , Reverse Transcriptase Polymerase Chain Reaction , United States
ABSTRACT
dbVar houses over 3 million submitted structural variants (SSV) from 120 human studies including copy number variations (CNV), insertions, deletions, inversions, translocations, and complex chromosomal rearrangements. Users can submit multiple SSVs to dbVar that are presumably identical, but were ascertained by different platforms and samples, to calculate whether the variant is rare or common in the population and allow for cross validation. However, because SSV genomic location reporting can vary - including fuzzy locations where the start and/or end points are not precisely known - analysis, comparison, annotation, and reporting of SSVs across studies can be difficult. This project was initiated by the Structural Variant Comparison Group for the purpose of generating a non-redundant set of genomic regions defined by counts of concordance for all human SSVs placed on RefSeq assembly GRCh38 (RefSeq accession GCF_000001405.26). We intend that the availability of these regions, called structural variant clusters (SVCs), will facilitate the analysis, annotation, and exchange of SV data and allow for simplified display in genomic sequence viewers for improved variant interpretation. Sets of SVCs were generated by variant type for each of the 120 studies as well as for a combined set across all studies. Starting from 3.64 million SSVs, 2.5 million and 3.4 million non-redundant SVCs with count >=1 were generated by variant type for each study and across all studies, respectively. In addition, we have developed utilities for annotating, searching, and filtering SVC data in GVF format for computing summary statistics, exporting data for genomic viewers, and annotating the SVC using external data sources.
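The idea of collapsing submitted variants into non-redundant regions with concordance counts can be sketched as follows. The abstract does not give the grouping rule, so this sketch assumes that SSVs are concordant only when variant type, chromosome, start and end match exactly. The record field names are hypothetical, and fuzzy start or end points are not handled.

```python
from collections import Counter


def count_concordant(ssvs):
    """Count SSVs that share the same (type, chrom, start, end) key."""
    return Counter((v["type"], v["chrom"], v["start"], v["end"]) for v in ssvs)


# Toy records with made-up coordinates, for illustration only.
ssvs = [
    {"type": "deletion", "chrom": "chr1", "start": 100, "end": 500},
    {"type": "deletion", "chrom": "chr1", "start": 100, "end": 500},
    {"type": "insertion", "chrom": "chr2", "start": 40, "end": 41},
]

for region, count in count_concordant(ssvs).items():
    print(region, count)
```

Each distinct key stands for one cluster, and its count is the number of submitted variants that support it.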
ABSTRACT
Eukaryotic cells respond to starvation by decreasing the rate of general protein synthesis while inducing translation of specific mRNAs encoding transcription factors GCN4 (yeast) or ATF4 (humans). Both responses are elicited by phosphorylation of translation initiation factor 2 (eIF2) and the attendant inhibition of its nucleotide exchange factor eIF2B, decreasing the binding to 40S ribosomes of methionyl initiator tRNA in the ternary complex (TC) with eIF2 and GTP. The reduction in TC levels enables scanning ribosomes to bypass the start codons of upstream open reading frames in the GCN4 mRNA leader and initiate translation at the authentic GCN4 start codon. We exploited the fact that GCN4 translation is a sensitive reporter of defects in TC recruitment to identify the catalytic and regulatory subunits of eIF2B. More recently, we implicated the C-terminal domain of eIF1A in 40S-binding of TC in vivo. Interestingly, we found that TC resides in a multifactor complex (MFC) with eIF3, eIF1, and the GTPase-activating protein for eIF2, known as eIF5. Our biochemical and genetic analyses indicate that physical interactions between MFC components enhance TC binding to 40S subunits and are required for wild-type translational control of GCN4. MFC integrity and eIF3 function also contribute to post-assembly steps in the initiation pathway that impact GCN4 expression. Thus, apart from its critical role in the starvation response, GCN4 regulation is a valuable tool for dissecting the contributions of multiple translation factors in the eukaryotic initiation pathway.
Subjects
Fungal Gene Expression Regulation , Protein Biosynthesis , Saccharomyces cerevisiae , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Eukaryotic Initiation Factor-1/genetics , Eukaryotic Initiation Factor-1/metabolism , Eukaryotic Initiation Factor-2/metabolism , Eukaryotic Initiation Factor-2B/genetics , Eukaryotic Initiation Factor-2B/metabolism , Humans , Macromolecular Substances , Molecular Models , Protein Binding , Protein Kinases/genetics , Protein Kinases/metabolism , Tertiary Protein Structure , Ribosomes/metabolism , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
ABSTRACT
Rapidly accumulating data from genome-wide association studies (GWASs) and other large-scale studies are most useful when synthesized with existing databases. To address this opportunity, we developed the Phenotype-Genotype Integrator (PheGenI), a user-friendly web interface that integrates various National Center for Biotechnology Information (NCBI) genomic databases with association data from the National Human Genome Research Institute GWAS Catalog and supports downloads of search results. Here, we describe the rationale for and development of this resource. Integrating over 66,000 association records with extensive single nucleotide polymorphism (SNP), gene, and expression quantitative trait loci data already available from the NCBI, PheGenI enables deeper investigation and interrogation of SNPs associated with a wide range of traits, facilitating the examination of the relationships between genetic variation and human diseases.