Results 1 - 20 of 429
1.
Nature ; 631(8019): 134-141, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38867047

ABSTRACT

Mosaic loss of the X chromosome (mLOX) is the most common clonal somatic alteration in leukocytes of female individuals [1,2], but little is known about its genetic determinants or phenotypic consequences. Here, to address this, we used data from 883,574 female participants across 8 biobanks; 12% of participants exhibited detectable mLOX in approximately 2% of leukocytes. Female participants with mLOX had an increased risk of myeloid and lymphoid leukaemias. Genetic analyses identified 56 common variants associated with mLOX, implicating genes with roles in chromosomal missegregation, cancer predisposition and autoimmune diseases. Exome-sequence analyses identified rare missense variants in FBXO10 that confer a twofold increased risk of mLOX. Only a small fraction of associations was shared with mosaic Y chromosome loss, suggesting that distinct biological processes drive formation and clonal expansion of sex chromosome missegregation. Allelic shift analyses identified X chromosome alleles that are preferentially retained in mLOX, demonstrating variation at many loci under cellular selection. A polygenic score including 44 allelic shift loci correctly inferred the retained X chromosomes in 80.7% of mLOX cases in the top decile. Our results support a model in which germline variants predispose female individuals to acquiring mLOX, with the allelic content of the X chromosome possibly shaping the magnitude of clonal expansion.


Subjects
Aneuploidy , Human X Chromosomes , Clone Cells , Leukocytes , Mosaicism , Adult , Female , Humans , Male , Middle Aged , Alleles , Autoimmune Diseases/genetics , Biological Specimen Banks , Chromosome Segregation/genetics , Human X Chromosomes/genetics , Human Y Chromosomes/genetics , Clone Cells/metabolism , Clone Cells/pathology , Exome/genetics , F-Box Proteins/genetics , Genetic Predisposition to Disease/genetics , Germ-Line Mutation , Leukemia/genetics , Leukocytes/metabolism , Genetic Models , Multifactorial Inheritance/genetics , Missense Mutation/genetics
2.
Dev Cell ; 2024 Jun 21.
Article in English | MEDLINE | ID: mdl-38942017

ABSTRACT

Recent advances in human genetics have shed light on the genetic factors contributing to inflammatory diseases, particularly Crohn's disease (CD), a prominent form of inflammatory bowel disease. Certain risk genes associated with CD directly influence cytokine biology and cell-specific communication networks. Current CD therapies primarily rely on anti-inflammatory drugs, which are inconsistently effective and lack strategies for promoting epithelial restoration and mucosal balance. To understand CD's underlying mechanisms, we investigated the link between CD and the FGFR1OP gene, which encodes a centrosome protein. FGFR1OP deletion in mouse intestinal epithelial cells disrupted crypt architecture, resulting in crypt loss, inflammation, and fatality. FGFR1OP insufficiency hindered epithelial resilience during colitis. FGFR1OP was crucial for preserving non-muscle myosin II activity, ensuring the integrity of the actomyosin cytoskeleton and crypt cell adhesion. This role of FGFR1OP suggests that its deficiency in genetically predisposed individuals may reduce epithelial renewal capacity, heightening susceptibility to inflammation and disease.

3.
Genome Res ; 34(5): 796-809, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38749656

ABSTRACT

Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.


Subjects
Genetic Databases , Human Genome , Humans , Human Genome Project , High-Throughput Nucleotide Sequencing/methods , Genetic Variation , Genomics/methods
4.
Am J Hum Genet ; 111(6): 1047-1060, 2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38776927

ABSTRACT

Lichen planus (LP) is a T-cell-mediated inflammatory disease affecting squamous epithelia in many parts of the body, most often the skin and oral mucosa. Cutaneous LP is usually transient and oral LP (OLP) is most often chronic, so we performed a large-scale genetic and epidemiological study of LP to address whether the oral and non-oral subgroups have shared or distinct underlying pathologies and their overlap with autoimmune disease. Using lifelong records covering diagnoses, procedures, and clinic identity from 473,580 individuals in the FinnGen study, genome-wide association analyses were conducted on carefully constructed subcategories of OLP (n = 3,323) and non-oral LP (n = 4,356) and on the combined group. We identified 15 genome-wide significant associations in FinnGen and an additional 12 when meta-analyzed with UKBB (27 independent associations at 25 distinct genomic locations), most of which are shared between oral and non-oral LP. Many associations coincide with known autoimmune disease loci, consistent with the epidemiologic enrichment of LP with hypothyroidism and other autoimmune diseases. Notably, a third of the FinnGen associations demonstrate significant differences between OLP and non-OLP. We also observed a 13.6-fold risk for tongue cancer and an elevated risk for other oral cancers in OLP, in agreement with earlier reports that connect LP with higher cancer incidence. In addition to a large-scale dissection of LP genetics and comorbidities, our study demonstrates the use of comprehensive, multidimensional health registry data to address outstanding clinical questions and reveal underlying biological mechanisms in common but understudied diseases.


Subjects
Autoimmune Diseases , Genome-Wide Association Study , Oral Lichen Planus , Mouth Neoplasms , Humans , Autoimmune Diseases/genetics , Oral Lichen Planus/genetics , Oral Lichen Planus/pathology , Mouth Neoplasms/genetics , Mouth Neoplasms/pathology , Female , Male , Genetic Heterogeneity , Middle Aged , Lichen Planus/genetics , Lichen Planus/pathology , Genetic Predisposition to Disease , Aged , Adult , Risk Factors , Single Nucleotide Polymorphism
5.
medRxiv ; 2024 May 09.
Article in English | MEDLINE | ID: mdl-38766240

ABSTRACT

Central serous chorioretinopathy (CSC) is a fluid maculopathy whose etiology is not well understood. Abnormal choroidal veins in CSC patients have been shown to have similarities with varicose veins. To identify potential mechanisms, we analyzed genotype data from 1,477 CSC patients and 455,449 controls in FinnGen. We identified an association for a low-frequency (AF=0.5%) missense variant (rs113791087) in the gene encoding vascular endothelial protein tyrosine phosphatase (VE-PTP) (OR=2.85, P=4.5×10⁻⁹). This was confirmed in a meta-analysis of 2,452 CSC patients and 865,767 controls from 4 studies (OR=3.06, P=7.4×10⁻¹⁵). Rs113791087 was associated with a 56% higher prevalence of retinal abnormalities (35.3% vs 22.6%, P=8.0×10⁻⁴) in 708 UK Biobank participants and, surprisingly, with varicose veins (OR=1.31, P=2.3×10⁻¹¹) and glaucoma (OR=0.82, P=6.9×10⁻⁹). Predicted loss-of-function variants in VEPTP, though rare in number, were associated with CSC in All of Us (OR=17.10, P=0.018). These findings highlight the significance of VE-PTP in diverse ocular and systemic vascular diseases.

6.
medRxiv ; 2024 May 16.
Article in English | MEDLINE | ID: mdl-38798318

ABSTRACT

Understanding the genetic basis of gene expression can help us understand the molecular underpinnings of human traits and disease. Expression quantitative trait locus (eQTL) mapping can help in studying this relationship, but eQTLs have been shown to be very cell-type specific, motivating the use of single-cell RNA sequencing and single-cell eQTLs to obtain a more granular view of genetic regulation. Current methods for single-cell eQTL mapping either rely on the "pseudobulk" approach and traditional pipelines for bulk transcriptomics or do not scale well to large datasets. Here, we propose SAIGE-QTL, a robust and scalable tool that can directly map eQTLs using single-cell profiles without needing aggregation at the pseudobulk level. Additionally, SAIGE-QTL allows for testing the effects of less frequent or rare genetic variation, which is traditionally excluded from eQTL mapping studies, through set-based tests. We evaluate the performance of SAIGE-QTL on both real and simulated data and demonstrate the improved power for eQTL mapping over existing pipelines.

7.
bioRxiv ; 2024 May 03.
Article in English | MEDLINE | ID: mdl-38645134

ABSTRACT

Missense variants can have a range of functional impacts depending on factors such as the specific amino acid substitution and location within the gene. To interpret their deleteriousness, studies have sought to identify regions within genes that are specifically intolerant of missense variation [1-12]. Here, we leverage the patterns of rare missense variation in 125,748 individuals in the Genome Aggregation Database (gnomAD) [13] against a null mutational model to identify transcripts that display regional differences in missense constraint. Missense-depleted regions are enriched for ClinVar [14] pathogenic variants, de novo missense variants from individuals with neurodevelopmental disorders (NDDs) [15,16], and complex trait heritability. Following ClinGen calibration recommendations for the ACMG/AMP guidelines, we establish that regions with less than 20% of their expected missense variation achieve moderate support for pathogenicity. We create a missense deleteriousness metric (MPC) that incorporates regional constraint and outperforms other deleteriousness scores at stratifying case and control de novo missense variation, with a strong enrichment in NDDs. These results provide additional tools to aid in missense variant interpretation.
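The 20% threshold in this abstract can be read as an observed-to-expected ratio per region. The following is a minimal sketch of that reading only, not the authors' pipeline: the function name, the tuple layout, and the example counts are hypothetical, and the expected counts are assumed to come from a mutational model computed elsewhere.

```python
from typing import List, Tuple

def flag_missense_depleted(
    regions: List[Tuple[str, int, float]],
    threshold: float = 0.2,
) -> List[Tuple[str, float]]:
    """Return (region, observed/expected) for regions below the threshold.

    Each input tuple is (region name, observed missense count, expected
    missense count). Names and layout are illustrative assumptions.
    """
    flagged = []
    for name, observed, expected in regions:
        if expected <= 0:
            # No expectation available, so the ratio is undefined; skip.
            continue
        ratio = observed / expected
        if ratio < threshold:
            flagged.append((name, ratio))
    return flagged

# Hypothetical regions: only the first has under 20% of expected variation.
example = [("region_a", 3, 30.0), ("region_b", 25, 30.0), ("region_c", 0, 0.0)]
depleted = flag_missense_depleted(example)
```

A real analysis would also need a significance test for the depletion, since small expected counts make the ratio noisy.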

8.
Eur J Hum Genet ; 32(5): 576-583, 2024 May.
Article in English | MEDLINE | ID: mdl-38467730

ABSTRACT

Intellectual disability (ID) is a common disorder, yet there is a wide spectrum of impairment from mild to profoundly affected individuals. Mild ID is seen as the low extreme of the general distribution of intelligence, while severe ID is often seen as a monogenic disorder caused by rare, pathogenic, highly penetrant variants. To investigate the genetic factors influencing mild and severe ID, we evaluated rare and common variation in the Northern Finland Intellectual Disability cohort (n = 1096 ID patients), a cohort with a high percentage of mild ID (n = 550) and from a population bottleneck enriched in rare, damaging variation. Despite this enrichment, we found only a small percentage of ID was due to recessive Finnish-enriched variants (0.5%). A larger proportion was linked to dominant variation, with a significant burden of rare, damaging variation in both mild and severe ID. This rare variant burden was enriched in more severe ID (p = 2.4e-4), patients without a relative with ID (p = 4.76e-4), and in those with features associated with monogenic disorders. We also found a significant burden of common variants associated with decreased cognitive function, with no difference between mild and more severe ID. When we included common and rare variants in a joint model, the rare and common variants had additive effects in both mild and severe ID. A multimodel inference approach also found that common and rare variants together best explained ID status (ΔAIC = 16.8, ΔBIC = 10.2). Overall, we report evidence for the additivity of rare and common variant burden throughout the spectrum of intellectual disability.


Subjects
Intellectual Disability , Humans , Intellectual Disability/genetics , Intellectual Disability/pathology , Male , Female , Finland , Adult , Genetic Variation
9.
Nat Genet ; 56(3): 377-382, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38182742

ABSTRACT

Gestational diabetes mellitus (GDM) is a common metabolic disorder affecting more than 16 million pregnancies annually worldwide [1,2]. GDM is related to an increased lifetime risk of type 2 diabetes (T2D) [1-3], with over a third of women developing T2D within 15 years of their GDM diagnosis. The diseases are hypothesized to share a genetic predisposition [1-7], but few studies have sought to uncover the genetic underpinnings of GDM. Most studies have evaluated the impact of T2D loci only [8-10], and the three prior genome-wide association studies of GDM [11-13] have identified only five loci, limiting the power to assess to what extent variants or biological pathways are specific to GDM. We conducted the largest genome-wide association study of GDM to date in 12,332 cases and 131,109 parous female controls in the FinnGen study and identified 13 GDM-associated loci, including nine new loci. Genetic features distinct from T2D were identified both at the locus and genomic scale. Our results suggest that the genetics of GDM risk falls into the following two distinct categories: one part conventional T2D polygenic risk and one part predominantly influencing mechanisms disrupted in pregnancy. Loci with GDM-predominant effects map to genes related to islet cells, central glucose homeostasis, steroidogenesis and placental expression.


Subjects
Type 2 Diabetes Mellitus , Gestational Diabetes , Islets of Langerhans , Pregnancy , Female , Humans , Type 2 Diabetes Mellitus/genetics , Gestational Diabetes/genetics , Genome-Wide Association Study , Placenta
10.
Nat Genet ; 56(2): 327-335, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38200129

ABSTRACT

Acquiring a sufficiently powered cohort of control samples matched to a case sample can be time-consuming or, in some cases, impossible. Accordingly, an ability to leverage genetic data from control samples that were already collected elsewhere could dramatically improve power in genetic association studies. Sharing of control samples can pose significant challenges, since most human genetic data are subject to strict sharing regulations. Here, using the properties of singular value decomposition and a subsampling algorithm, we developed a method allowing selection of the best-matching controls in an external pool of samples compliant with personal data protection and eliminating the need for genotype sharing. We provide access to a library of 39,472 exome sequencing controls at http://dnascore.net enabling association studies for case cohorts lacking control subjects. Using this approach, control sets can be selected from this online library with a prespecified matching accuracy, ensuring well-calibrated association analysis for both rare and common variants.


Subjects
Algorithms , Exome , Humans , Exome/genetics , Genotype , Genetic Association Studies , Research
12.
bioRxiv ; 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-36747613

ABSTRACT

Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high quality set of 4,094 whole genomes from HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also demonstrate substantial added value from this dataset compared to the prior versions of the component resources, typically combined via liftover and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared to previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.

13.
Nat Genet ; 56(1): 152-161, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057443

ABSTRACT

Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (that is, are in trans) rather than on the same copy (that is, in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (v2, n = 125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.
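One way to picture phase inference from population genotype counts is an expectation-maximization (EM) estimate of haplotype frequencies for a variant pair, followed by a probability that a double heterozygote carries the variants in trans. The sketch below rests on assumptions: the abstract does not spell out the estimator, and the function names and the 3x3 count layout are hypothetical.

```python
from typing import List, Optional, Tuple

def haplotype_pair(i: int, j: int) -> List[Tuple[int, int]]:
    """Haplotypes (alt at A, alt at B) implied by alt-allele counts i and j.

    Only meaningful when the genotype is not a double heterozygote (1, 1),
    because that is the one genotype whose phase is ambiguous.
    """
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    a, b = alleles[i], alleles[j]
    return [(a[0], b[0]), (a[1], b[1])]

def estimate_p_trans(counts: List[List[int]], n_iter: int = 200) -> Optional[float]:
    """Estimate P(trans) for a double heterozygote.

    counts[i][j] is the number of people with i alt alleles at variant A
    and j alt alleles at variant B (an assumed input layout).
    """
    n_people = sum(sum(row) for row in counts)
    if n_people == 0:
        return None
    # Haplotype counts contributed by phase-unambiguous genotypes.
    fixed = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for i in range(3):
        for j in range(3):
            if (i, j) == (1, 1):
                continue
            for hap in haplotype_pair(i, j):
                fixed[hap] += counts[i][j]
    n_double_het = counts[1][1]
    freq = {hap: 0.25 for hap in fixed}  # uniform starting point
    for _ in range(n_iter):
        # E-step: split double heterozygotes between cis and trans.
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        p_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = dict(fixed)
        expected[(0, 0)] += n_double_het * p_cis
        expected[(1, 1)] += n_double_het * p_cis
        expected[(0, 1)] += n_double_het * (1 - p_cis)
        expected[(1, 0)] += n_double_het * (1 - p_cis)
        # M-step: renormalize over 2 haplotypes per person.
        freq = {hap: c / (2 * n_people) for hap, c in expected.items()}
    cis = freq[(0, 0)] * freq[(1, 1)]
    trans = freq[(0, 1)] * freq[(1, 0)]
    if cis + trans == 0:
        return None  # neither configuration is supported by the data
    return trans / (cis + trans)
```

When the two variants are never seen together in the reference counts, the estimate leans toward trans; when they almost always co-occur, it leans toward cis. For very rare variants the counts are small, so any such estimate should be treated as uncertain.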


Subjects
Exome , High-Throughput Nucleotide Sequencing , Humans , Exome/genetics , Exome Sequencing , Genotype
14.
Nature ; 625(7993): 92-100, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057664

ABSTRACT

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders [1-4], but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.


Subjects
Human Genome , Genomics , Genetic Models , Mutation , Humans , Access to Information , Genetic Databases , Datasets as Topic , Gene Frequency , Human Genome/genetics , Mutation/genetics , Genetic Selection
15.
Nat Genet ; 56(1): 162-169, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38036779

ABSTRACT

Fine-mapping aims to identify causal genetic variants for phenotypes. Bayesian fine-mapping algorithms (for example, SuSiE, FINEMAP, ABF and COJO-ABF) are widely used, but assessing posterior probability calibration remains challenging in real data, where model misspecification probably exists, and true causal variants are unknown. We introduce replication failure rate (RFR), a metric to assess fine-mapping consistency by downsampling. SuSiE, FINEMAP and COJO-ABF show high RFR, indicating potential overconfidence in their output. Simulations reveal that nonsparse genetic architecture can lead to miscalibration, while imputation noise, nonuniform distribution of causal variants and quality control filters have minimal impact. Here we present SuSiE-inf and FINEMAP-inf, fine-mapping methods modeling infinitesimal effects alongside fewer larger causal effects. Our methods show improved calibration, RFR and functional enrichment, competitive recall and computational efficiency. Notably, using our methods' posterior effect sizes substantially increases polygenic risk score accuracy over SuSiE and FINEMAP. Our work improves causal variant identification for complex traits, a fundamental goal of human genetics.
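The replication failure rate described above compares fine-mapping output on a downsampled cohort with output on the full cohort. A minimal sketch of such a comparison follows; the thresholds (posterior inclusion probability above 0.9 in the downsample, below 0.1 in the full sample), the dictionary inputs, and the treatment of missing variants are assumptions made for illustration, since the abstract does not state them.

```python
from typing import Dict, Optional

def replication_failure_rate(
    pip_downsampled: Dict[str, float],
    pip_full: Dict[str, float],
    high: float = 0.9,
    low: float = 0.1,
) -> Optional[float]:
    """Fraction of confidently fine-mapped variants that fail to replicate.

    A variant is "confident" if its posterior inclusion probability (PIP)
    in the downsampled data exceeds `high`; it "fails" if its PIP in the
    full data is below `low`. Variants absent from the full-sample results
    are treated as PIP 0 (an assumption of this sketch).
    """
    confident = [v for v, pip in pip_downsampled.items() if pip > high]
    if not confident:
        return None  # rate is undefined without any confident calls
    failures = sum(1 for v in confident if pip_full.get(v, 0.0) < low)
    return failures / len(confident)

# Hypothetical PIPs for four variants.
down = {"rs1": 0.95, "rs2": 0.99, "rs3": 0.50, "rs4": 0.92}
full = {"rs1": 0.97, "rs2": 0.05, "rs3": 0.01, "rs4": 0.50}
rfr = replication_failure_rate(down, full)
```

Under a well-calibrated method, a variant with a PIP above 0.9 should rarely drop to near zero once more samples are added, so a high rate points to overconfidence.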


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Humans , Bayes Theorem , Multifactorial Inheritance , Algorithms
16.
medRxiv ; 2023 Nov 20.
Article in English | MEDLINE | ID: mdl-38076851

ABSTRACT

Focal segmental glomerulosclerosis (FSGS) is a common cause of nephrotic syndrome, with an annual incidence in the United States of 24 cases per million in African-Americans and 5 cases per million in European-Americans. Among glomerular diseases in Europe and Latin-America, FSGS was the second most frequent diagnosis, and in Asia the fifth. We expand previous efforts in understanding the genetics of FSGS by performing a case-control study involving ethnically diverse groups of FSGS cases (726) and a pool of controls (13,994), using panel sequencing of approximately 2,500 podocyte-expressed genes. Through rare variant association tests, we replicated known risk genes - KANK1, COL4A4, and APOL1. A novel significant association was observed for the gene encoding complement receptor 1 (CR1). High-risk rare variants in CR1 in the European-American cohort were commonly observed in Latin- and African-Americans. Therefore, a combined rare and common variant analysis was used to replicate the CR1 association in non-European populations. The CR1 risk variant, rs17047661, gives rise to the Sl1/Sl2 (R1601G) allele that was previously associated with protection against cerebral malaria. Pleiotropic effects of rs17047661 may explain the difference in allele frequencies across continental ancestries and suggest a possible role for genetically driven alterations of adaptive immunity in the pathogenesis of FSGS.

17.
medRxiv ; 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-38076931

ABSTRACT

A diagnosis of epilepsy has significant consequences for an individual but is often challenging in clinical practice. Novel biomarkers are thus greatly needed. Here, we investigated how common genetic factors (epilepsy polygenic risk scores, [PRSs]) influence epilepsy risk in detailed longitudinal electronic health records (EHRs) of >360k Finns spanning up to 50 years of individuals' lifetimes. Individuals with a high genetic generalized epilepsy PRS (PRS_GGE) in FinnGen had an increased risk for genetic generalized epilepsy (GGE) (hazard ratio [HR] 1.55 per PRS_GGE standard deviation [SD]) across their lifetime and after unspecified seizure events. Effect sizes of epilepsy PRSs were comparable to effect sizes in clinically curated data, supporting our EHR-derived epilepsy diagnoses. Within 10 years after an unspecified seizure, the GGE rate was 37% when PRS_GGE > 2 SD compared to 5.6% when PRS_GGE < -2 SD. The effect of PRS_GGE was even larger on GGE subtypes of idiopathic generalized epilepsy (IGE) (HR 2.1 per SD PRS_GGE). We further report significantly larger effects of PRS_GGE on epilepsy in females and in younger age groups. Analogously, we found a significant but more modest focal epilepsy PRS burden associated with non-acquired focal epilepsy (NAFE). We found PRS_GGE specifically associated with GGE in comparison with >2000 independent diseases, while PRS_NAFE was also associated with diseases other than NAFE, such as back pain. Here, we show that epilepsy-specific PRSs have good discriminative ability after a first seizure event, i.e., in circumstances where the prior probability of epilepsy is high, outlining a potential to serve as biomarkers for an epilepsy diagnosis.

18.
Cell Genom ; 3(12): 100436, 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38116116

ABSTRACT

Genome-wide association studies (GWASs) have identified tens of thousands of genetic loci associated with human complex traits. However, the majority of GWASs were conducted in individuals of European ancestries. Failure to capture global genetic diversity has limited genomic discovery and has impeded equitable delivery of genomic knowledge to diverse populations. Here we report findings from 102,900 individuals across 36 human quantitative traits in the Taiwan Biobank (TWB), a major biobank effort that broadens the population diversity of genetic studies in East Asia. We identified 968 novel genetic loci, pinpointed novel causal variants through statistical fine-mapping, compared the genetic architecture across TWB, Biobank Japan, and UK Biobank, and evaluated the utility of cross-phenotype, cross-population polygenic risk scores in disease risk prediction. These results demonstrated the potential to advance discovery through diversifying GWAS populations and provided insights into the common genetic basis of human complex traits in East Asia.

19.
medRxiv ; 2023 Oct 25.
Article in English | MEDLINE | ID: mdl-37961173

ABSTRACT

Mass General Brigham, an integrated healthcare system based in the Greater Boston area of Massachusetts, annually serves 1.5 million patients. We established the Mass General Brigham Biobank (MGBB), encompassing 142,238 participants, to unravel the intricate relationships among genomic profiles, environmental context, and disease manifestations within clinical practice. In this study, we highlight the impact of ancestral diversity in the MGBB by employing population genetics, geospatial assessment, and association analyses of rare and common genetic variants. The population structures captured by the genetics mirror the sequential immigration to the Greater Boston area throughout American history, highlighting communities tied to shared genetic and environmental factors. Our investigation underscores the potency of unbiased, large-scale analyses in a healthcare-affiliated biobank, elucidating the dynamic interplay across genetics, immigration, structural geospatial factors, and health outcomes in one of the earliest American sites of European colonization.

20.
Am J Hum Genet ; 110(12): 2068-2076, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38000370

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
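The idea of reading contamination from reference reads at homozygous alternate calls can be sketched as follows. This is a simplified illustration under stated assumptions, not the published CHARR definition: the abstract only says the metric uses reference reads within homozygous alternate calls, so the normalization by population reference allele frequency, the depth and frequency filters, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class HomAltCall:
    ref_reads: int          # reads supporting the reference allele
    alt_reads: int          # reads supporting the alternate allele
    ref_allele_freq: float  # population frequency of the reference allele

def contamination_estimate(
    calls: Iterable[HomAltCall],
    min_depth: int = 10,        # assumed filter, not from the source
    min_ref_af: float = 0.1,    # assumed filter, not from the source
    max_ref_af: float = 0.9,    # assumed filter, not from the source
) -> Optional[float]:
    """Average frequency-scaled reference-read fraction at hom-alt calls.

    In an uncontaminated sample, a homozygous alternate site should show
    almost no reference reads. Reads from a contaminating sample carry the
    reference allele roughly in proportion to its population frequency,
    so each site's reference fraction is divided by that frequency.
    """
    ratios = []
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        if depth < min_depth:
            continue
        if not (min_ref_af <= call.ref_allele_freq <= max_ref_af):
            continue
        ratios.append(call.ref_reads / (call.ref_allele_freq * depth))
    if not ratios:
        return None
    return sum(ratios) / len(ratios)

# Hypothetical calls: one stray reference read in 100 at one site, none at another.
sample = [HomAltCall(1, 99, 0.5), HomAltCall(0, 50, 0.5)]
estimate = contamination_estimate(sample)
```

Because only per-site allele depths and a frequency are needed, a calculation of this kind can run on variant-level files without revisiting the read alignments, which matches the cost argument in the abstract.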


Subjects
DNA , Trout , Humans , Animals , DNA Sequence Analysis/methods , Genotype , Homozygote , High-Throughput Nucleotide Sequencing/methods , Software