1.
Nature ; 625(7993): 92-100, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38057664

ABSTRACT

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders [1-4], but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.


Subjects
Human Genome , Genomics , Genetic Models , Mutation , Humans , Access to Information , Genetic Databases , Datasets as Topic , Gene Frequency , Human Genome/genetics , Mutation/genetics , Genetic Selection
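
The idea of "detecting depletions of variation" in the abstract above can be illustrated with a simple observed-versus-expected statistic. The sketch below assumes Poisson-like variant counts per genomic window and uses a hypothetical function name, constraint_z; it is an illustration of the general approach, not necessarily the exact definition of the Gnocchi score.

```python
import math

def constraint_z(observed: int, expected: float) -> float:
    """Depletion score for one genomic window (illustrative, not the published metric).

    Positive values mean fewer variants were observed than the mutational
    model expected, i.e. the window looks constrained.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    # Under a Poisson assumption the standard deviation of the count is sqrt(expected).
    return (expected - observed) / math.sqrt(expected)

# Hypothetical window: the model expects 100 variants, 64 are observed.
print(constraint_z(64, 100.0))
```

A window whose observed count matches the expectation scores zero, and a window with more variants than expected scores below zero.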
2.
Am J Hum Genet ; 110(12): 2068-2076, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38000370

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.


Subjects
DNA , Trout , Humans , Animals , DNA Sequence Analysis/methods , Genotype , Homozygote , High-Throughput Nucleotide Sequencing/methods , Software
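
The abstract above describes estimating contamination from reference reads that appear within homozygous alternate calls. A minimal sketch of that idea follows. The HomAltCall record, the charr_like_estimate function, the min_depth filter, and the scaling by reference allele frequency are all assumptions made for illustration; the published CHARR formula and its filters may differ.

```python
import math
from dataclasses import dataclass
from typing import Iterable

@dataclass
class HomAltCall:
    ref_reads: int   # reads supporting the reference allele at this site
    alt_reads: int   # reads supporting the alternate allele at this site
    ref_af: float    # population frequency of the reference allele (assumed input)

def charr_like_estimate(calls: Iterable[HomAltCall], min_depth: int = 10) -> float:
    """Average reference-read fraction over homozygous alternate calls.

    Each site's fraction is divided by the reference allele frequency, on the
    assumption that a contaminating sample contributes reference reads in
    proportion to how common the reference allele is.
    """
    total = 0.0
    n = 0
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        if depth < min_depth or call.ref_af <= 0.0:
            continue  # skip low-coverage or uninformative sites
        total += call.ref_reads / (depth * call.ref_af)
        n += 1
    return total / n if n else math.nan

# Hypothetical homozygous alternate calls from one sample.
calls = [HomAltCall(1, 99, 0.5), HomAltCall(2, 98, 0.5)]
print(charr_like_estimate(calls))
```

Because only per-site allele depths and an allele frequency are needed, a calculation of this kind can run on variant-level files rather than on read-level BAM/CRAM data, which is the cost advantage the abstract emphasizes.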
3.
Am J Hum Genet ; 110(9): 1454-1469, 2023 09 07.
Article in English | MEDLINE | ID: mdl-37595579

ABSTRACT

Short-read genome sequencing (GS) holds the promise of becoming the primary diagnostic approach for the assessment of autism spectrum disorder (ASD) and fetal structural anomalies (FSAs). However, few studies have comprehensively evaluated its performance against current standard-of-care diagnostic tests: karyotype, chromosomal microarray (CMA), and exome sequencing (ES). To assess the clinical utility of GS, we compared its diagnostic yield against these three tests in 1,612 quartet families including an individual with ASD and in 295 prenatal families. Our GS analytic framework identified a diagnostic variant in 7.8% of ASD probands, almost 2-fold more than CMA (4.3%) and 3-fold more than ES (2.7%). However, when we systematically captured copy-number variants (CNVs) from the exome data, the diagnostic yield of ES (7.4%) was brought much closer to, but did not surpass, GS. Similarly, we estimated that GS could achieve an overall diagnostic yield of 46.1% in unselected FSAs, representing a 17.2% increased yield over karyotype, 14.1% over CMA, and 4.1% over ES with CNV calling or 36.1% increase without CNV discovery. Overall, GS provided an added diagnostic yield of 0.4% and 0.8% beyond the combination of all three standard-of-care tests in ASD and FSAs, respectively. This corresponded to nine GS unique diagnostic variants, including sequence variants in exons not captured by ES, structural variants (SVs) inaccessible to existing standard-of-care tests, and SVs where the resolution of GS changed variant classification. Overall, this large-scale evaluation demonstrated that GS significantly outperforms each individual standard-of-care test while also outperforming the combination of all three tests, thus warranting consideration as the first-tier diagnostic approach for the assessment of ASD and FSAs.


Subjects
Autism Spectrum Disorder , Female , Pregnancy , Humans , Autism Spectrum Disorder/diagnosis , Autism Spectrum Disorder/genetics , First Pregnancy Trimester , Prenatal Ultrasonography , Chromosome Mapping , Exome
4.
Nature ; 581(7809): 444-451, 2020 05.
Article in English | MEDLINE | ID: mdl-32461652

ABSTRACT

Structural variants (SVs) rearrange large segments of DNA [1] and can have profound consequences in evolution and human disease [2,3]. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) [4] have become integral in the interpretation of single-nucleotide variants (SNVs) [5]. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25-29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage [6]. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings [7]. This SV resource is freely distributed via the gnomAD browser [8] and will have broad utility in population genetics, disease-association studies, and diagnostic screening.


Subjects
Disease/genetics , Genetic Variation , Medical Genetics/standards , Population Genetics/standards , Human Genome/genetics , Female , Genetic Testing , Genotyping Techniques , Humans , Male , Middle Aged , Mutation , Single Nucleotide Polymorphism/genetics , Racial Groups/genetics , Reference Standards , Genetic Selection , Whole Genome Sequencing
5.
Nature ; 581(7809): 434-443, 2020 05.
Article in English | MEDLINE | ID: mdl-32461654

ABSTRACT

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes [1]. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.


Subjects
Exome/genetics , Essential Genes/genetics , Genetic Variation/genetics , Human Genome/genetics , Adult , Brain/metabolism , Cardiovascular Diseases/genetics , Cohort Studies , Genetic Databases , Female , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study , Humans , Loss of Function Mutation/genetics , Male , Mutation Rate , Proprotein Convertase 9/genetics , Messenger RNA/genetics , Reproducibility of Results , Exome Sequencing , Whole Genome Sequencing
6.
Genome Res ; 32(3): 569-582, 2022 03.
Article in English | MEDLINE | ID: mdl-35074858

ABSTRACT

Genomic databases of allele frequency are extremely helpful for evaluating clinical variants of unknown significance; however, until now, databases such as the Genome Aggregation Database (gnomAD) have focused on nuclear DNA and have ignored the mitochondrial genome (mtDNA). Here, we present a pipeline to call mtDNA variants that addresses three technical challenges: (1) detecting homoplasmic and heteroplasmic variants, present, respectively, in all or a fraction of mtDNA molecules; (2) circular mtDNA genome; and (3) misalignment of nuclear sequences of mitochondrial origin (NUMTs). We observed that mtDNA copy number per cell varied across gnomAD cohorts and influenced the fraction of NUMT-derived false-positive variant calls, which can account for the majority of putative heteroplasmies. To avoid false positives, we excluded contaminated samples, cell lines, and samples prone to NUMT misalignment due to few mtDNA copies. Furthermore, we report variants with heteroplasmy ≥10%. We applied this pipeline to 56,434 whole-genome sequences in the gnomAD v3.1 database that includes individuals of European (58%), African (25%), Latino (10%), and Asian (5%) ancestry. Our gnomAD v3.1 release contains population frequencies for 10,850 unique mtDNA variants at more than half of all mtDNA bases. Importantly, we report frequencies within each nuclear ancestral population and mitochondrial haplogroup. Homoplasmic variants account for most variant calls (98%) and unique variants (85%). We observed that 1/250 individuals carry a pathogenic mtDNA variant with heteroplasmy above 10%. These mtDNA population allele frequencies are freely accessible and will aid in diagnostic interpretation and research studies.


Subjects
Mitochondrial DNA , Mitochondrial Genome , Cell Nucleus/genetics , Mitochondrial DNA/genetics , Gene Frequency , Genome , Humans , Mitochondria/genetics , DNA Sequence Analysis
9.
Nature ; 550(7675): 244-248, 2017 10 11.
Article in English | MEDLINE | ID: mdl-29022598

ABSTRACT

X chromosome inactivation (XCI) silences transcription from one of the two X chromosomes in female mammalian cells to balance expression dosage between XX females and XY males. XCI is, however, incomplete in humans: up to one-third of X-chromosomal genes are expressed from both the active and inactive X chromosomes (Xa and Xi, respectively) in female cells, with the degree of 'escape' from inactivation varying between genes and individuals. The extent to which XCI is shared between cells and tissues remains poorly characterized, as does the degree to which incomplete XCI manifests as detectable sex differences in gene expression and phenotypic traits. Here we describe a systematic survey of XCI, integrating over 5,500 transcriptomes from 449 individuals spanning 29 tissues from GTEx (v6p release) and 940 single-cell transcriptomes, combined with genomic sequence data. We show that XCI at 683 X-chromosomal genes is generally uniform across human tissues, but identify examples of heterogeneity between tissues, individuals and cells. We show that incomplete XCI affects at least 23% of X-chromosomal genes, identify seven genes that escape XCI with support from multiple lines of evidence and demonstrate that escape from XCI results in sex biases in gene expression, establishing incomplete XCI as a mechanism that is likely to introduce phenotypic diversity. Overall, this updated catalogue of XCI across human tissues helps to increase our understanding of the extent and impact of the incompleteness in the maintenance of XCI.


Subjects
Organ Specificity/genetics , Single-Cell Analysis , X Chromosome Inactivation/genetics , Human X Chromosomes/genetics , Female , X-Linked Genes/genetics , Human Genome/genetics , Genomics , Humans , Male , Phenotype , RNA Sequence Analysis , Transcriptome/genetics
10.
Nature ; 536(7616): 285-91, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27535533

ABSTRACT

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.


Subjects
Exome/genetics , Genetic Variation/genetics , DNA Mutational Analysis , Datasets as Topic , Humans , Phenotype , Proteome/genetics , Rare Diseases/genetics , Sample Size
14.
Nat Methods ; 15(8): 595-597, 2018 08.
Article in English | MEDLINE | ID: mdl-30013044

ABSTRACT

Existing benchmark datasets for use in evaluating variant-calling accuracy are constructed from a consensus of known short-variant callers, and they are thus biased toward easy regions that are accessible by these algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two fully homozygous human cell lines, which provides a relatively more accurate and less biased estimate of small-variant-calling error rates in a realistic context.


Subjects
Genetic Databases/statistics & numerical data , Genetic Variation , Algorithms , Benchmarking , Tumor Cell Line , Genetic Databases/standards , Diploidy , Female , Human Genome , Homozygote , Humans , Hydatidiform Mole/genetics , Pregnancy , Synthetic Biology , Uterine Neoplasms/genetics , Whole Genome Sequencing/statistics & numerical data
15.
Bioinformatics ; 36(7): 2060-2067, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31830260

ABSTRACT

SUMMARY: We investigate convolutional neural networks (CNNs) for filtering small genomic variants in short-read DNA sequence data. Errors created during sequencing and library preparation make variant calling a difficult task. Encoding the reference genome and aligned reads covering sites of genetic variation as numeric tensors allows us to leverage CNNs for variant filtration. Convolutions over these tensors learn to detect motifs useful for classifying variants. Variant filtering models are trained to classify variants as artifacts or real variation. Visualizing the learned weights of the CNN confirmed it detects familiar DNA motifs known to correlate with real variation, like homopolymers and short tandem repeats (STR). After confirmation of the biological plausibility of the learned features we compared our model to current state-of-the-art filtration methods like Gaussian Mixture Models, Random Forests and CNNs designed for image classification, like DeepVariant. We demonstrate improvements in both sensitivity and precision. The tensor encoding was carefully tailored for processing genomic data, respecting the qualitative differences in structure between DNA and natural images. Ablation tests quantitatively measured the benefits of our tensor encoding strategy. Bayesian hyper-parameter optimization confirmed our notion that architectures designed with DNA data in mind outperform off-the-shelf image classification models. Our cross-generalization analysis identified idiosyncrasies in truth resources pointing to the need for new methods to construct genomic truth data. Our results show that models trained on heterogenous data types and diverse truth resources generalize well to new datasets, negating the need to train separate models for each data type. AVAILABILITY AND IMPLEMENTATION: This work is available in the Genome Analysis Toolkit (GATK) with the tool name CNNScoreVariants (https://github.com/broadinstitute/gatk). 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genomics , INDEL Mutation , Bayes Theorem , High-Throughput Nucleotide Sequencing , Computer Neural Networks , Sequence Analysis
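
The abstract above mentions encoding the reference genome and aligned reads as numeric tensors. A minimal sketch of one such encoding is given below, using one-hot vectors over the four bases. The function names one_hot and encode_pileup, the row layout, and the treatment of gaps are assumptions for illustration only; this is not the encoding used by the GATK tool named in the abstract.

```python
from typing import List, Sequence

BASES = "ACGT"

def one_hot(base: str) -> List[float]:
    """One-hot vector over A, C, G, T; any other symbol (gap, N) maps to all zeros."""
    return [1.0 if base == b else 0.0 for b in BASES]

def encode_pileup(reference: str, reads: Sequence[str]) -> List[List[List[float]]]:
    """Build a tensor of shape [1 + number of reads][window length][4].

    Row 0 holds the reference window; each later row holds one read that is
    assumed to be already aligned and padded to the window length.
    """
    rows = [reference, *reads]
    for row in rows:
        if len(row) != len(reference):
            raise ValueError("every read must be padded to the reference window length")
    return [[one_hot(base) for base in row] for row in rows]

# Hypothetical 3-base window with two aligned reads; '-' marks a gap.
tensor = encode_pileup("ACG", ["ACG", "A-T"])
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

A convolution sliding along the position axis of such a tensor sees short stretches of reference and read bases together, which is the kind of input on which motif detectors for homopolymers or short tandem repeats could be learned.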
16.
Int J Mol Sci ; 20(22), 2019 Nov 12.
Article in English | MEDLINE | ID: mdl-31726778

ABSTRACT

Nanoparticles have been extensively studied for drug delivery and targeting to specific organs. The functionalization of the nanoparticle surface by site-specific ligands (antibodies, peptides, saccharides) can ensure efficient recognition and binding with relevant biological targets. One of the main challenges in the development of these decorated nanocarriers is the accurate quantification of the amount of ligands on the nanoparticle surface. In this study, nanostructured lipid carriers (NLC) were functionalized with N-acetyl-D-galactosamine (GalNAc) units, known to target the asialoglycoprotein receptor (ASGPR). Different molar percentages of GalNAc-functionalized surfactant (0%, 2%, 5%, and 14%) were used in the formulation. Based on ultra-high-performance liquid chromatography separation and evaporative light-scattering detection (UPLC-ELSD), an analytical method was developed to specifically quantify the amount of GalNAc units present at the NLC surface. This method allowed the accurate quantification of GalNAc surfactant and therefore gave some insights into the structural parameters of these multivalent ligand systems. Our data show that the GalNAc decorated NLC possess large numbers of ligands at their surface and suitable distances between them for efficient multivalent interaction with the ASGPR, and therefore promising liver-targeting efficiency.


Subjects
Drug Carriers/chemistry , Galactosamine/chemistry , Lipids/chemistry , Nanoparticles/chemistry , Surface-Active Agents/chemistry
17.
J Nat Prod ; 78(8): 2007-22, 2015 Aug 28.
Article in English | MEDLINE | ID: mdl-26244884

ABSTRACT

Raw licorice roots represent heterogeneous materials obtained from mainly three Glycyrrhiza species. G. glabra, G. uralensis, and G. inflata exhibit marked metabolite differences in terms of flavanones (Fs), chalcones (Cs), and other phenolic constituents. The principal objective of this work was to develop complementary chemometric models for the metabolite profiling, classification, and quality control of authenticated licorice. A total of 51 commercial and macroscopically verified samples were DNA authenticated. Principal component analysis and canonical discriminant analysis were performed on 1H NMR spectra and area under the curve values obtained from UHPLC-UV chromatograms, respectively. The developed chemometric models enable the identification and classification of Glycyrrhiza species according to their composition in major Fs, Cs, and species-specific phenolic compounds. Further key outcomes demonstrated that DNA authentication combined with chemometric analyses enabled the characterization of mixtures, hybrids, and species outliers. This study provides a new foundation for the botanical and chemical authentication, classification, and metabolomic characterization of crude licorice botanicals and derived materials. Collectively, the proposed methods offer a comprehensive approach for the quality control of licorice as one of the most widely used botanical dietary supplements.


Subjects
DNA/metabolism , Glycyrrhiza/chemistry , Chalcones/chemistry , Dietary Supplements , Flavanones/chemistry , Glycyrrhiza/classification , Molecular Structure , Biomolecular Nuclear Magnetic Resonance , Phenols/metabolism , Quality Control
18.
Biophys J ; 105(12): 2832-42, 2013 Dec 17.
Article in English | MEDLINE | ID: mdl-24359755

ABSTRACT

It has been observed experimentally that cells from failing hearts exhibit elevated levels of reactive oxygen species (ROS) upon increases in energetic workload. One proposed mechanism for this behavior is mitochondrial Ca(2+) mismanagement that leads to depletion of ROS scavengers. Here, we present a computational model to test this hypothesis. Previously published models of ROS production and scavenging were combined and reparameterized to describe ROS regulation in the cellular environment. Extramitochondrial Ca(2+) pulses were applied to simulate frequency-dependent changes in cytosolic Ca(2+). Model results show that decreased mitochondrial Ca(2+) uptake due to mitochondrial Ca(2+) uniporter inhibition (simulating Ru360) or elevated cytosolic Na(+), as in heart failure, leads to a decreased supply of NADH and NADPH upon increasing cellular workload. Oxidation of NADPH leads to oxidation of glutathione (GSH) and increased mitochondrial ROS levels, validating the Ca(2+) mismanagement hypothesis. The model goes on to predict that the ratio of steady-state [H2O2]m during 3 Hz pacing to [H2O2]m at rest is highly sensitive to the size of the GSH pool. The largest relative increase in [H2O2]m in response to pacing is shown to occur when the total GSH and GSSG is close to 1 mM, whereas pool sizes below 0.9 mM result in high resting H2O2 levels, a quantitative prediction only possible with a computational model.


Subjects
Free Radical Scavengers/metabolism , Glutathione/metabolism , Heart Failure/metabolism , Mitochondria/metabolism , Cardiovascular Models , Reactive Oxygen Species/metabolism , Animals , Calcium/metabolism , Humans , NAD/metabolism , Sodium/metabolism
19.
Biophys J ; 105(4): 1045-56, 2013 Aug 20.
Article in English | MEDLINE | ID: mdl-23972856

ABSTRACT

Elevated levels of reactive oxygen species (ROS) play a critical role in cardiac myocyte signaling in both healthy and diseased cells. Mitochondria represent the predominant cellular source of ROS, specifically the activity of complexes I and III. The model presented here explores the modulation of electron transport chain ROS production for state 3 and state 4 respiration and the role of substrates and respiratory inhibitors. Model simulations show that ROS production from complex III increases exponentially with membrane potential (ΔΨm) when in state 4. Complex I ROS release in the model can occur in the presence of NADH and succinate (reverse electron flow), leading to a highly reduced ubiquinone pool, displaying the highest ROS production flux in state 4. In the presence of ample ROS scavenging, total ROS production is moderate in state 3 and increases substantially under state 4 conditions. The ROS production model was extended by combining it with a minimal model of ROS scavenging. When the mitochondrial redox status was oxidized by increasing the proton permeability of the inner mitochondrial membrane, simulations with the combined model show that ROS levels initially decline as production drops off with decreasing ΔΨm and then increase as scavenging capacity is exhausted. Hence, this mechanistic model of ROS production demonstrates how ROS levels are controlled by mitochondrial redox balance.


Subjects
Computer Simulation , Mitochondria/metabolism , Myocardium/cytology , Reactive Oxygen Species/metabolism , Cell Respiration , Electron Transport Chain Complex Proteins/chemistry , Electron Transport Chain Complex Proteins/metabolism , Mitochondrial Membrane Potential , Biological Models , Molecular Models , Oxidation-Reduction , Protein Conformation
20.
bioRxiv ; 2023 Jun 28.
Article in English | MEDLINE | ID: mdl-37425834

ABSTRACT

DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data, CHARR, Contamination from Homozygous Alternate Reference Reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
