SPDI: data model for variants and applications at NCBI.

Holmes, J Bradley; Moyer, Eric; Phan, Lon; Maglott, Donna; Kattman, Brandi

Affiliation

Holmes JB; Information Engineering Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
Moyer E; Information Engineering Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
Phan L; Information Engineering Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
Maglott D; Information Engineering Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
Kattman B; Information Engineering Branch, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.

Bioinformatics; 36(6): 1902-1907, 2020 03 01.

Article in En | MEDLINE | ID: mdl-31738401

ABSTRACT

MOTIVATION: Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI's genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants.

RESULTS: The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the 'Contextual Allele'. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique 'Canonical Allele' and is used directly to aggregate variants across congruent sequences.

AVAILABILITY AND IMPLEMENTATION: The SPDI services are available for open access at https://api.ncbi.nlm.nih.gov/variation/v0.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
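The four-attribute model and the idea behind overprecision correction described in the results can be illustrated with a small Python sketch. This is not NCBI's implementation: the class `SPDI`, the functions `apply` and `expand_overprecise`, the toy reference string and the sequence name `toy_seq` are all hypothetical, and the position is assumed to be a 0-based coordinate. The expansion step is a simplified stand-in that trims shared bases and then widens a pure insertion or deletion to the whole stretch of reference over which it could equally be placed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPDI:
    sequence: str   # reference sequence identifier (hypothetical name here)
    position: int   # assumed 0-based start of the deleted reference bases
    deletion: str   # reference bases removed (may be empty)
    insertion: str  # bases placed where the deletion was (may be empty)

    @classmethod
    def parse(cls, text: str) -> "SPDI":
        seq, pos, deleted, inserted = text.split(":")
        return cls(seq, int(pos), deleted, inserted)

    def __str__(self) -> str:
        return f"{self.sequence}:{self.position}:{self.deletion}:{self.insertion}"


def apply(variant: SPDI, reference: str) -> str:
    """Return the alternate sequence produced by the variant."""
    end = variant.position + len(variant.deletion)
    if reference[variant.position:end] != variant.deletion:
        raise ValueError("deletion does not match the reference")
    return reference[:variant.position] + variant.insertion + reference[end:]


def expand_overprecise(variant: SPDI, reference: str) -> SPDI:
    """Simplified sketch: widen an indel over its full ambiguous region."""
    pos, dele, ins = variant.position, variant.deletion, variant.insertion
    # Trim bases shared by deletion and insertion (suffix, then prefix).
    while dele and ins and dele[-1] == ins[-1]:
        dele, ins = dele[:-1], ins[:-1]
    while dele and ins and dele[0] == ins[0]:
        dele, ins, pos = dele[1:], ins[1:], pos + 1
    trimmed = SPDI(variant.sequence, pos, dele, ins)
    # Substitutions (and identity variants) have a single placement.
    if bool(dele) == bool(ins):
        return trimmed
    alt = apply(trimmed, reference)
    # Slide the changed window along whichever sequence contains it.
    longer = reference if dele else alt
    n = len(dele) or len(ins)
    s, e = pos, pos + n
    while s > 0 and longer[s - 1] == longer[e - 1]:
        s, e = s - 1, e - 1
    left = s
    s, e = pos, pos + n
    while e < len(longer) and longer[s] == longer[e]:
        s, e = s + 1, e + 1
    right = e
    ref_end = right if dele else right - n
    alt_end = right - n if dele else right
    return SPDI(variant.sequence, left, reference[left:ref_end], alt[left:alt_end])


reference = "GATTTTC"  # toy reference containing a homopolymer run
a = expand_overprecise(SPDI.parse("toy_seq:2:T:"), reference)
b = expand_overprecise(SPDI.parse("toy_seq:5:T:"), reference)
c = expand_overprecise(SPDI.parse("toy_seq:1:AT:A"), reference)  # padded form
print(a, b, c)
```

In this toy case, three different ways of writing "delete one T from the run" collapse to a single expanded form whose deletion covers the entire run, which is the property that makes aggregation by string comparison possible.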

Subjects

Databases, Genetic; Genomics; Algorithms; Genome; Vocabulary, Controlled


Full text: 1 | Collections: 01-internacional | Database: MEDLINE | Main subject: Genomics / Databases, Genetic | Language: En | Year of publication: 2020 | Document type: Article
