Results 1 - 20 of 116
1.
Nat Rev Genet ; 22(5): 269-283, 2021 05.
Article in English | MEDLINE | ID: mdl-33408383

ABSTRACT

Nearly all genetic variants that influence disease risk have human-specific origins; however, the systems they influence have ancient roots that often trace back to evolutionary events long before the origin of humans. Here, we review how advances in our understanding of the genetic architectures of diseases, recent human evolution and deep evolutionary history can help explain how and why humans in modern environments become ill. Human populations exhibit differences in the prevalence of many common and rare genetic diseases. These differences are largely the result of the diverse environmental, cultural, demographic and genetic histories of modern human populations. Synthesizing our growing knowledge of evolutionary history with genetic medicine, while accounting for environmental and social factors, will help to achieve the promise of personalized genomics and realize the potential hidden in an individual's DNA sequence to guide clinical decisions. In short, precision medicine is fundamentally evolutionary medicine, and integration of evolutionary perspectives into the clinic will support the realization of its full potential.


Subject(s)
Disease/genetics , Evolution, Molecular , Health Status , Genetic Variation , Humans
2.
Cell ; 151(1): 206-20, 2012 Sep 28.
Article in English | MEDLINE | ID: mdl-22981692

ABSTRACT

Heart development is exquisitely sensitive to the precise temporal regulation of thousands of genes that govern developmental decisions during differentiation. However, we currently lack a detailed understanding of how chromatin and gene expression patterns are coordinated during developmental transitions in the cardiac lineage. Here, we interrogated the transcriptome and several histone modifications across the genome during defined stages of cardiac differentiation. We find distinct chromatin patterns that are coordinated with stage-specific expression of functionally related genes, including many human disease-associated genes. Moreover, we discover a novel preactivation chromatin pattern at the promoters of genes associated with heart development and cardiac function. We further identify stage-specific distal enhancer elements and find enriched DNA binding motifs within these regions that predict sets of transcription factors that orchestrate cardiac differentiation. Together, these findings form a basis for understanding developmentally regulated chromatin transitions during lineage commitment and the molecular etiology of congenital heart disease.


Subject(s)
Epigenesis, Genetic , Gene Regulatory Networks , Myocardium/cytology , Animals , Cell Differentiation , Chromatin/metabolism , Embryonic Stem Cells/metabolism , Enhancer Elements, Genetic , Heart/embryology , Humans , Mice , Transcription Factors/metabolism , Transcriptome
3.
Cell ; 145(5): 678-91, 2011 May 27.
Article in English | MEDLINE | ID: mdl-21620135

ABSTRACT

G-quadruplex (G4) DNA structures are extremely stable four-stranded secondary structures held together by noncanonical G-G base pairs. Genome-wide chromatin immunoprecipitation was used to determine the in vivo binding sites of the multifunctional Saccharomyces cerevisiae Pif1 DNA helicase, a potent unwinder of G4 structures in vitro. G4 motifs were a significant subset of the high-confidence Pif1-binding sites. Replication slowed in the vicinity of these motifs, and they were prone to breakage in Pif1-deficient cells, whereas non-G4 Pif1-binding sites did not show this behavior. Introducing many copies of G4 motifs caused slow growth in replication-stressed Pif1-deficient cells, which was relieved by spontaneous mutations that eliminated their ability to form G4 structures, bind Pif1, slow DNA replication, and stimulate DNA breakage. These data suggest that G4 structures form in vivo and that they are resolved by Pif1 to prevent replication fork stalling and DNA breakage.


Subject(s)
DNA Helicases/metabolism , DNA Replication , G-Quadruplexes , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/genetics , DNA Copy Number Variations , DNA Polymerase II/metabolism , S Phase , Saccharomyces cerevisiae/cytology , Saccharomyces cerevisiae/metabolism
4.
Annu Rev Genomics Hum Genet ; 23: 591-612, 2022 08 31.
Article in English | MEDLINE | ID: mdl-35440148

ABSTRACT

Ancient DNA provides a powerful window into the biology of extant and extinct species, including humans' closest relatives: Denisovans and Neanderthals. Here, we review what is known about archaic hominin phenotypes from genomic data and how those inferences have been made. We contend that understanding the influence of variants on lower-level molecular phenotypes, such as gene expression and protein function, is a promising approach to using ancient DNA to learn about archaic hominin traits. Molecular phenotypes have simpler genetic architectures than organism-level complex phenotypes, and this approach enables moving beyond association studies by proposing hypotheses about the effects of archaic variants that are testable in model systems. The major challenge to understanding archaic hominin phenotypes is broadening our ability to accurately map genotypes to phenotypes, but ongoing advances ensure that there will be much more to learn about archaic hominin phenotypes from their genomes.


Subject(s)
Hominidae , Neanderthals , Animals , DNA, Ancient , Genome, Human , Genomics , Hominidae/genetics , Humans , Neanderthals/genetics , Phenotype
5.
Genome Res ; 32(4): 778-790, 2022 04.
Article in English | MEDLINE | ID: mdl-35210353

ABSTRACT

More than 90% of genetic variants are rare in most modern sequencing studies, such as the Alzheimer's Disease Sequencing Project (ADSP) whole-exome sequencing (WES) data. Furthermore, 54% of the rare variants in ADSP WES are singletons. However, both single variant and unit-based tests are limited in their statistical power to detect an association between rare variants and phenotypes. To best use missense rare variants and investigate their biological effect, we examine their association with phenotypes in the context of protein structures. We developed a protein structure-based approach, protein optimized kernel evaluation of missense nucleotides (POKEMON), which evaluates rare missense variants based on their spatial distribution within a protein rather than their allele frequency. The hypothesis behind this test is that the three-dimensional spatial distribution of variants within a protein structure provides functional context to power an association test. POKEMON identified three candidate genes (TREM2, SORL1, and EXOC3L4) and another suggestive gene from the ADSP WES data. For TREM2 and SORL1, two known Alzheimer's disease (AD) genes, the signal from the spatial cluster is stable even if we exclude known AD risk variants, indicating the presence of additional low-frequency risk variants within these genes. EXOC3L4 is a novel AD risk gene that has a cluster of variants primarily shared by case subjects around the Sec6 domain. This cluster is also validated in an independent replication data set and a validation data set with a larger sample size.


Subject(s)
Alzheimer Disease , Alzheimer Disease/genetics , Gene Frequency , Genetic Predisposition to Disease , Humans , LDL-Receptor Related Proteins/genetics , LDL-Receptor Related Proteins/metabolism , Membrane Transport Proteins/genetics , Mutation, Missense , Phenotype , Exome Sequencing
6.
PLoS Genet ; 18(11): e1010494, 2022 11.
Article in English | MEDLINE | ID: mdl-36342969

ABSTRACT

Natural selection shapes the genetic architecture of many human traits. However, the prevalence of different modes of selection on genomic regions associated with variation in traits remains poorly understood. To address this, we developed an efficient computational framework to calculate positive and negative enrichment of different evolutionary measures among regions associated with complex traits. We applied the framework to summary statistics from >900 genome-wide association studies (GWASs) and 11 evolutionary measures of sequence constraint, population differentiation, and allele age while accounting for linkage disequilibrium, allele frequency, and other potential confounders. We demonstrate that this framework yields consistent results across GWASs with variable sample sizes, numbers of trait-associated SNPs, and analytical approaches. The resulting evolutionary atlas maps diverse signatures of selection on genomic regions associated with complex human traits on an unprecedented scale. We detected positive enrichment for sequence conservation among trait-associated regions for the majority of traits (>77% of 290 high power GWASs), which included reproductive traits. Many traits also exhibited substantial positive enrichment for population differentiation, especially among hair, skin, and pigmentation traits. In contrast, we detected widespread negative enrichment for signatures of balancing selection (51% of GWASs) and absence of enrichment for evolutionary signals in regions associated with late-onset Alzheimer's disease. These results support a pervasive role for negative selection on regions of the human genome that contribute to variation in complex traits, but also demonstrate that diverse modes of evolution are likely to have shaped trait-associated loci. 
This atlas of evolutionary signatures across the diversity of available GWASs will enable exploration of the relationship between the genetic architecture and evolutionary processes in the human genome.


Subject(s)
Genome-Wide Association Study , Selection, Genetic , Humans , Linkage Disequilibrium , Phenotype , Genomics , Polymorphism, Single Nucleotide/genetics , Genome, Human/genetics
7.
Proc Natl Acad Sci U S A ; 119(26): e2200551119, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35749358

ABSTRACT

Human genetic variation associates with the composition of the gut microbiome, yet its influence on clinical traits remains largely unknown. We analyzed the consequences of nearly a thousand gut microbiome-associated variants (MAVs) on phenotypes reported in electronic health records from tens of thousands of individuals. We discovered and replicated associations of MAVs with neurological, metabolic, digestive, and circulatory diseases. Five significant MAVs in these categories correlate with the relative abundance of microbes down to the strain level. We also demonstrate that these relationships are independently observed and concordant with microbe by disease associations reported in case-control studies. Moreover, a selective sweep and population differentiation impacted some disease-linked MAVs. Combined, these findings establish triad relationships among the human genome, microbiome, and disease. Consequently, human genetic influences may offer opportunities for precision diagnostics of microbiome-associated diseases but also highlight the relevance of genetic background for microbiome modulation and therapeutics.


Subject(s)
Disease , Gastrointestinal Microbiome , Genetic Variation , Disease/genetics , Genome, Human , Humans , Phenomics , Phenotype
8.
Am J Hum Genet ; 108(2): 269-283, 2021 02 04.
Article in English | MEDLINE | ID: mdl-33545030

ABSTRACT

Topologically associating domains (TADs) are fundamental units of three-dimensional (3D) nuclear organization. The regions bordering TADs (TAD boundaries) contribute to the regulation of gene expression by restricting interactions of cis-regulatory sequences to their target genes. TAD and TAD-boundary disruption have been implicated in rare-disease pathogenesis; however, we have a limited framework for integrating TADs and their variation across cell types into the interpretation of common-trait-associated variants. Here, we investigate an attribute of 3D genome architecture, the stability of TAD boundaries across cell types, and demonstrate its relevance to understanding how genetic variation in TADs contributes to complex disease. By synthesizing TAD maps across 37 diverse cell types with 41 genome-wide association studies (GWASs), we investigate the differences in disease association and evolutionary pressure on variation in TADs versus TAD boundaries. We demonstrate that genetic variation in TAD boundaries contributes more to complex-trait heritability, especially for immunologic, hematologic, and metabolic traits. We also show that TAD boundaries are more evolutionarily constrained than TADs. Next, stratifying boundaries by their stability across cell types, we find substantial variation. Compared to boundaries unique to a specific cell type, boundaries stable across cell types are further enriched for complex-trait heritability, evolutionary constraint, CTCF binding, and housekeeping genes. Thus, considering TAD boundary stability across cell types provides valuable context for understanding the genome's functional landscape and enabling variant interpretation that takes 3D structure into account.


Subject(s)
Chromatin , Evolution, Molecular , Genetic Variation , Genome, Human , Multifactorial Inheritance , Cells, Cultured , Embryonic Stem Cells , Gene Expression Regulation , Genome-Wide Association Study , Humans
9.
Am J Hum Genet ; 108(10): 1946-1963, 2021 10 07.
Article in English | MEDLINE | ID: mdl-34529933

ABSTRACT

Rare diseases affect millions of people worldwide, and discovering their genetic causes is challenging. More than half of the individuals analyzed by the Undiagnosed Diseases Network (UDN) remain undiagnosed. The central hypothesis of this work is that many of these rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating combinations of variants for potential to cause disease is currently infeasible. To address this challenge, we developed the digenic predictor (DiGePred), a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier by using DIDA, the largest available database of known digenic-disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN individuals. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve > 77%), and we further demonstrate its utility by using digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings. Finally, to enable the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work enables the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.


Subject(s)
Disease/genetics , Genomics/methods , Machine Learning , Multifactorial Inheritance , Phenotype , Rare Diseases/diagnosis , Undiagnosed Diseases/diagnosis , Databases, Genetic , Humans , Rare Diseases/genetics , Undiagnosed Diseases/genetics
10.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36655767

ABSTRACT

SUMMARY: GSEL is a computational framework for calculating the enrichment of signatures of diverse evolutionary forces in a set of genomic regions. GSEL can flexibly integrate any sequence-based evolutionary metric and analyze sets of human genomic regions identified by genome-wide assays (e.g. GWAS, eQTL, *-seq). The core of GSEL's approach is the generation of empirical null distributions tailored to the allele frequency and linkage disequilibrium structure of the regions of interest. We illustrate the application of GSEL to variants identified from a GWAS of body mass index, a highly polygenic trait. AVAILABILITY AND IMPLEMENTATION: GSEL is implemented as a fast, flexible and user-friendly python package. It is available with demonstration data at https://github.com/abraham-abin13/gsel_vec. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
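
The idea of an empirical null distribution matched on allele frequency and linkage disequilibrium can be illustrated with a small sketch. This is not GSEL's code or API: the function name `matched_enrichment`, the tuple layout `(maf_bin, ld_bin, value)`, and the binning scheme are all assumptions made for illustration, and the sketch assumes every stratum that occurs in the trait set has at least one control variant.

```python
import random
from collections import defaultdict
from statistics import mean


def matched_enrichment(trait, pool, n_null=1000, seed=0):
    """Compare the mean metric of trait variants to matched control draws.

    trait, pool: lists of (maf_bin, ld_bin, value) tuples. This layout is a
    hypothetical convention for the sketch, not a GSEL data format.
    """
    rng = random.Random(seed)

    # Group control variants by (allele-frequency bin, LD bin) stratum.
    by_stratum = defaultdict(list)
    for maf_bin, ld_bin, value in pool:
        by_stratum[(maf_bin, ld_bin)].append(value)

    observed = mean(value for _, _, value in trait)

    # Each null replicate draws one control per trait variant from the
    # same stratum, so the null shares the trait set's MAF/LD profile.
    null_means = []
    for _ in range(n_null):
        draw = [rng.choice(by_stratum[(m, l)]) for m, l, _ in trait]
        null_means.append(mean(draw))

    # One-sided empirical p-value with the usual +1 correction.
    exceed = sum(nm >= observed for nm in null_means)
    return {
        "observed": observed,
        "null_mean": mean(null_means),
        "p_value": (1 + exceed) / (n_null + 1),
    }
```

Drawing controls stratum by stratum is what keeps the comparison from being confounded by frequency or LD; a pooled random draw would not have that property.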


Subject(s)
Body Mass Index , Genome, Human , Genomics , Software , Humans , Gene Frequency , Genome-Wide Association Study
11.
Am J Hum Genet ; 107(1): 111-123, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32533946

ABSTRACT

Partial or complete loss-of-function variants in SCN5A are the most common genetic cause of the arrhythmia disorder Brugada syndrome (BrS1). However, the pathogenicity of SCN5A variants is often unknown or disputed; 80% of the 1,390 SCN5A missense variants observed in at least one individual to date are variants of uncertain significance (VUSs). The designation of VUS is a barrier to the use of sequence data in clinical care. We selected 83 variants: 10 previously studied control variants, 10 suspected benign variants, and 63 suspected Brugada syndrome-associated variants, selected on the basis of their frequency in the general population and in individuals with Brugada syndrome. We used high-throughput automated patch clamping to study the function of the 83 variants, with the goal of reclassifying variants with functional data. The ten previously studied controls had functional properties concordant with published manual patch clamp data. All 10 suspected benign variants had wild-type-like function. 22 suspected BrS variants had loss of channel function (<10% normalized peak current) and 22 variants had partial loss of function (10%-50% normalized peak current). The previously unstudied variants were initially classified as likely benign (n = 2), likely pathogenic (n = 10), or VUSs (n = 61). After the patch clamp studies, 16 variants were benign/likely benign, 45 were pathogenic/likely pathogenic, and only 12 were still VUSs. Structural modeling identified likely mechanisms for loss of function including altered thermostability and disruptions to alpha helices, disulfide bonds, or the permeation pore. High-throughput patch clamping enabled reclassification of the majority of tested VUSs in SCN5A.
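
The peak-current thresholds and the reclassification tallies reported above can be written out as a short sketch. The function name and the label strings are hypothetical; the abstract gives cutoffs only for loss of function (<10%) and partial loss of function (10%-50%), so the label for values above 50% is a placeholder rather than a category from the study.

```python
def classify_peak_current(pct_of_wild_type):
    """Bin a normalized peak current (percent of wild type) using the
    two cutoffs stated in the abstract."""
    if pct_of_wild_type < 10:
        return "loss of function"
    if pct_of_wild_type <= 50:
        return "partial loss of function"
    # The abstract does not name a category above 50%; placeholder label.
    return "above 50% (not categorized in the abstract)"


# Consistency check on the reported counts for previously unstudied variants:
# 83 tested variants minus 10 previously studied controls leaves 73.
before = {"likely benign": 2, "likely pathogenic": 10, "VUS": 61}
after = {"benign/likely benign": 16, "pathogenic/likely pathogenic": 45, "VUS": 12}
unstudied = 83 - 10
vus_reclassified = before["VUS"] - after["VUS"]  # VUSs that received a new class
```

Both tallies sum to the same 73 variants, and the difference in VUS counts is the number of variants moved out of the uncertain category.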


Subject(s)
NAV1.5 Voltage-Gated Sodium Channel/genetics , Arrhythmias, Cardiac/genetics , Brugada Syndrome/genetics , Cell Line , Female , Genetic Variation , Genotype , HEK293 Cells , High-Throughput Screening Assays/methods , Humans , Male , Phenotype
12.
Adv Exp Med Biol ; 1415: 157-163, 2023.
Article in English | MEDLINE | ID: mdl-37440029

ABSTRACT

Protein function can be impacted by changes in protein structure stability, but determining which change has impact is complex. Stability can be affected by a large change in the tertiary (3D) structure of the protein or due to free-energy changes caused by single amino acid substitutions. Changes in the DNA sequence can have minor or major impact on protein stability, which can lead to disease. Inherited retinal degenerations are generally caused by single mutations which are mostly located in protein-coding regions, while age-related macular degeneration (AMD) is a complex disorder that can be influenced by some genetic variants impacting proteins involved in the disease, although not all AMD risk variants lead to amino acid changes. Here, we review ways that proteins may be affected, the identification and understanding of these changes, and how to identify causal changes that can be targeted to develop treatments to alleviate retinal degenerative disease.


Subject(s)
Macular Degeneration , Retinal Degeneration , Humans , Retinal Degeneration/genetics , Retina , Macular Degeneration/genetics , Mutation , Proteins/chemistry , Protein Stability
13.
Mol Biol Evol ; 38(9): 3681-3696, 2021 08 23.
Article in English | MEDLINE | ID: mdl-33973014

ABSTRACT

Despite the importance of gene regulatory enhancers in human biology and evolution, we lack a comprehensive model of enhancer evolution and function. This substantially limits our understanding of the genetic basis of species divergence and our ability to interpret the effects of noncoding variants on human traits. To explore enhancer sequence evolution and its relationship to regulatory function, we traced the evolutionary origins of transcribed human enhancer sequences with activity across diverse tissues and cellular contexts from the FANTOM5 consortium. The transcribed enhancers are enriched for sequences of a single evolutionary age ("simple" evolutionary architectures) compared with enhancers that are composites of sequences of multiple evolutionary ages ("complex" evolutionary architectures), likely indicating constraint against genomic rearrangements. Complex enhancers are older, more pleiotropic, and more active across species than simple enhancers. Genetic variants within complex enhancers are also less likely to associate with human traits and biochemical activity. Transposable-element-derived sequences (TEDS) have made diverse contributions to enhancers of both architectures; the majority of TEDS are found in enhancers with simple architectures, while a minority have remodeled older sequences to create complex architectures. Finally, we compare the evolutionary architectures of transcribed enhancers with histone-mark-defined enhancers. Our results reveal that most human transcribed enhancers are ancient sequences of a single age, and thus the evolution of most human enhancers was not driven by increases in evolutionary complexity over time. Our analyses further suggest that considering enhancer evolutionary histories provides context that can aid interpretation of the effects of variants on enhancer function. Based on these results, we propose a framework for analyzing enhancer evolutionary architecture.


Subject(s)
Enhancer Elements, Genetic , Genomics , DNA Transposable Elements , Gene Expression Regulation , Humans , Phenotype
14.
BMC Med ; 20(1): 333, 2022 09 28.
Article in English | MEDLINE | ID: mdl-36167547

ABSTRACT

BACKGROUND: Identifying pregnancies at risk for preterm birth, one of the leading causes of worldwide infant mortality, has the potential to improve prenatal care. However, we lack broadly applicable methods to accurately predict preterm birth risk. The dense longitudinal information present in electronic health records (EHRs) is enabling scalable and cost-efficient risk modeling of many diseases, but EHR resources have been largely untapped in the study of pregnancy. METHODS: Here, we apply machine learning to diverse data from EHRs with 35,282 deliveries to predict singleton preterm birth. RESULTS: We find that machine learning models based on billing codes alone can predict preterm birth risk at various gestational ages (e.g., ROC-AUC = 0.75, PR-AUC = 0.40 at 28 weeks of gestation) and outperform comparable models trained using known risk factors (e.g., ROC-AUC = 0.65, PR-AUC = 0.25 at 28 weeks). Examining the patterns learned by the model reveals it stratifies deliveries into interpretable groups, including high-risk preterm birth subtypes enriched for distinct comorbidities. Our machine learning approach also predicts preterm birth subtypes (spontaneous vs. indicated), mode of delivery, and recurrent preterm birth. Finally, we demonstrate the portability of our approach by showing that the prediction models maintain their accuracy on a large, independent cohort (5978 deliveries) from a different healthcare system. CONCLUSIONS: By leveraging rich phenotypic and genetic features derived from EHRs, we suggest that machine learning algorithms have great potential to improve medical care during pregnancy. However, further work is needed before these models can be applied in clinical settings.
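
For readers unfamiliar with the two metrics quoted above, the following sketch computes them from scratch on any list of binary labels and model scores. It is illustrative only: ROC-AUC is computed as the fraction of correctly ordered positive/negative pairs, and PR-AUC is approximated by average precision, which is one common estimator and not necessarily the one used in the study. The function names are hypothetical, and the sketch assumes at least one positive and one negative label.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half). Quadratic in input size; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.
    Tied scores keep their input order, which is an arbitrary choice."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

The two numbers answer different questions. ROC-AUC is insensitive to class balance, while average precision drops when positives are rare, which is why both are typically reported for an outcome such as preterm birth.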


Subject(s)
Premature Birth , Algorithms , Electronic Health Records , Female , Gestational Age , Humans , Infant, Newborn , Machine Learning , Pregnancy , Premature Birth/diagnosis , Premature Birth/epidemiology
15.
J Proteome Res ; 20(8): 4089-4100, 2021 08 06.
Article in English | MEDLINE | ID: mdl-34236204

ABSTRACT

Prediction of residue-level structural attributes and protein-level structural classes helps model protein tertiary structures and understand protein functions. Existing methods are either specialized on only one class of proteins or developed to predict only a specific type of residue-level attribute. In this work, we develop a new deep-learning method, named Membrane Association and Secondary Structure Predictor (MASSP), for accurately predicting both residue-level structural attributes (secondary structure, location, orientation, and topology) and protein-level structural classes (bitopic, α-helical, β-barrel, and soluble). MASSP integrates a multilayer two-dimensional convolutional neural network (2D-CNN) with a long short-term memory (LSTM) neural network into a multitasking framework. Our comparison shows that MASSP performs equally well or better than the state-of-the-art methods in predicting residue-level secondary structures, boundaries of transmembrane segments, and topology. Furthermore, it achieves outstanding accuracy in predicting protein-level structural classes. MASSP automatically distinguishes the structural classes of input sequences and identifies transmembrane segments and topologies if present, making it broadly applicable to different classes of proteins. In summary, MASSP's good performance and broad applicability make it well suited for annotating residue-level attributes and protein-level structural classes at the proteome scale.


Subject(s)
Deep Learning , Computational Biology , Databases, Protein , Protein Structure, Secondary , Proteome
16.
Hum Genet ; 140(4): 667-680, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33469725

ABSTRACT

PURPOSE: Mayer-Rokitansky-Küster-Hauser (MRKH) syndrome consists of congenital absence of the uterus and vagina and is often associated with renal, skeletal, cardiac, and auditory defects. The genetic basis is largely unknown except for rare variants in several genes. Many candidate genes have been suggested by mouse models and human studies. The purpose of this study was to narrow down the number of candidate genes. METHODS: Whole exome sequencing was performed on 111 unrelated individuals with MRKH; variant analysis focused on 72 genes suggested by mouse models, human studies of physiological candidates, or located near translocation breakpoints in t(3;16). Candidate variants (CV) predicted to be deleterious were confirmed by Sanger sequencing. RESULTS: Sanger sequencing verified 54 heterozygous CV from genes identified through mouse (13 CV in 6 genes), human (22 CV in seven genes), and translocation breakpoint (19 CV in 11 genes) studies. Twelve patients had ≥ 2 CVs, including four patients with two variants in the same gene. One likely digenic combination of LAMC1 and MMP14 was identified. CONCLUSION: We narrowed 72 candidate genes to 10 genes that appear more likely implicated. These candidate genes will require further investigation to elucidate their role in the development of MRKH.


Subject(s)
46, XX Disorders of Sex Development/genetics , Congenital Abnormalities/genetics , Mullerian Ducts/abnormalities , Uterus/abnormalities , Vagina/abnormalities , 46, XX Disorders of Sex Development/pathology , Animals , Congenital Abnormalities/pathology , Female , Genetic Variation , Humans , Male , Mice , Mullerian Ducts/pathology , Translocation, Genetic , Exome Sequencing
17.
Am J Hum Genet ; 102(3): 415-426, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29455857

ABSTRACT

The spatial distribution of genetic variation within proteins is shaped by evolutionary constraint and provides insight into the functional importance of protein regions and the potential pathogenicity of protein alterations. Here, we comprehensively evaluate the 3D spatial patterns of human germline and somatic variation in 6,604 experimentally derived protein structures and 33,144 computationally derived homology models covering 77% of all human proteins. Using a systematic approach, we quantify differences in the spatial distributions of neutral germline variants, disease-causing germline variants, and recurrent somatic variants. Neutral missense variants exhibit a general trend toward spatial dispersion, which is driven by constraint on core residues. In contrast, germline disease-causing variants are generally clustered in protein structures and form clusters more frequently than recurrent somatic variants identified from tumor sequencing. In total, we identify 215 proteins with significant spatial constraints on the distribution of disease-causing missense variants in experimentally derived protein structures, only 65 (30%) of which have been previously reported. This analysis identifies many clusters not detectable from sequence information alone; only 12% of proteins with significant clustering in 3D were identified from similar analyses of linear protein sequence. Furthermore, spatial analyses of mutations in homology-based structural models are highly correlated with those from experimentally derived structures, supporting the use of computationally derived models. Our approach highlights significant differences in the spatial constraints on different classes of mutations in protein structure and identifies regions of potential function within individual proteins.


Subject(s)
Mutation, Missense/genetics , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Cluster Analysis , Humans , Models, Molecular
18.
FASEB J ; 34(12): 15946-15960, 2020 12.
Article in English | MEDLINE | ID: mdl-33015868

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the cause of the global pandemic of coronavirus disease-2019 (COVID-19). SARS-CoV-2 is a zoonotic disease, but little is known about variations in species susceptibility that could identify potential reservoir species, animal models, and the risk to pets, wildlife, and livestock. Certain species, such as domestic cats and tigers, are susceptible to SARS-CoV-2 infection, while other species such as mice and chickens are not. Most animal species, including those in close contact with humans, have unknown susceptibility. Hence, methods to predict the infection risk of animal species are urgently needed. SARS-CoV-2 spike protein binding to angiotensin-converting enzyme 2 (ACE2) is critical for viral cell entry and infection. Here we integrate species differences in susceptibility with multiple in-depth structural analyses to identify key ACE2 amino acid positions including 30, 83, 90, 322, and 354 that distinguish susceptible from resistant species. Using differences in these residues across species, we developed a susceptibility score that predicts an elevated risk of SARS-CoV-2 infection for multiple species including horses and camels. We also demonstrate that SARS-CoV-2 is nearly optimal for binding ACE2 of humans compared to other animals, which may underlie the highly contagious transmissibility of this virus among humans. Taken together, our findings define potential ACE2 and SARS-CoV-2 residues for therapeutic targeting and identification of animal species on which to focus research and protection measures for environmental and public health.


Subject(s)
Angiotensin-Converting Enzyme 2/chemistry , COVID-19/genetics , Genetic Predisposition to Disease , Receptors, Virus/chemistry , Amino Acid Sequence , Angiotensin-Converting Enzyme 2/genetics , Animals , Camelus , Glycosylation , Horses , Humans , Models, Molecular , Phylogeny , Protein Binding , Protein Structure, Tertiary , Receptors, Virus/genetics , SARS-CoV-2 , Sequence Alignment , Species Specificity
19.
PLoS Comput Biol ; 16(11): e1008334, 2020 11.
Article in English | MEDLINE | ID: mdl-33137083

ABSTRACT

Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depend on the nature of the prediction task.


Subject(s)
Deep Learning , Gene Regulatory Networks , Genes, Regulator , Animals , Binding Sites/genetics , Computational Biology , Computer Simulation , Databases, Genetic , Enhancer Elements, Genetic , Mice , Models, Genetic , Neural Networks, Computer , Terminology as Topic , Transcription Factors/metabolism
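
The abstract above mentions synthetic datasets containing homotypic and heterotypic TFBS clusters. A minimal sketch of how such a sequence might be simulated is shown below: motif strings are planted at non-overlapping positions in a random DNA background. The function name, the motif strings, and the uniform background are assumptions for illustration and are not the study's simulation procedure.

```python
import random


def embed_motifs(length, motifs, rng, alphabet="ACGT"):
    """Return (sequence, start_positions) with motifs planted in the given order.

    The background is uniform random DNA; motifs never overlap because the
    free positions are split into gaps placed before, between, and after them.
    """
    total = sum(len(m) for m in motifs)
    if total > length:
        raise ValueError("motifs do not fit in the requested length")
    free = length - total
    # Sorted cut points divide the free positions into len(motifs) + 1 gaps.
    cuts = sorted(rng.randint(0, free) for _ in motifs)
    parts, positions = [], []
    prev, cursor = 0, 0
    for cut, motif in zip(cuts, motifs):
        gap = cut - prev
        parts.append("".join(rng.choice(alphabet) for _ in range(gap)))
        cursor += gap
        positions.append(cursor)
        parts.append(motif)
        cursor += len(motif)
        prev = cut
    parts.append("".join(rng.choice(alphabet) for _ in range(free - prev)))
    return "".join(parts), positions


if __name__ == "__main__":
    rng = random.Random(0)
    # Homotypic cluster: repeated copies of one (placeholder) motif.
    homotypic, _ = embed_motifs(60, ["TGACTCA"] * 3, rng)
    # Heterotypic cluster: different (placeholder) motifs in one sequence.
    heterotypic, _ = embed_motifs(60, ["TGACTCA", "CACGTG"], rng)
    print(homotypic)
    print(heterotypic)
```

An enhanceosome-like class could be approximated under the same assumptions by fixing the motif order and the gap sizes instead of sampling them.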
20.
PLoS Comput Biol ; 16(11): e1008291, 2020 11.
Article in English | MEDLINE | ID: mdl-33253214

ABSTRACT

Predicting mutation-induced changes in protein thermodynamic stability (ΔΔG) is of great interest in protein engineering, variant interpretation, and protein biophysics. We introduce ThermoNet, a deep, 3D-convolutional neural network (3D-CNN) designed for structure-based prediction of ΔΔGs upon point mutation. To leverage the image-processing power inherent in CNNs, we treat protein structures as if they were multi-channel 3D images. In particular, the inputs to ThermoNet are uniformly constructed as multi-channel voxel grids based on biophysical properties derived from raw atom coordinates. We train and evaluate ThermoNet with a curated data set that accounts for protein homology and is balanced with direct and reverse mutations; this provides a framework for addressing biases that have likely influenced many previous ΔΔG prediction methods. ThermoNet demonstrates performance comparable to the best available methods on the widely used Ssym test set. In addition, ThermoNet accurately predicts the effects of both stabilizing and destabilizing mutations, while most other methods exhibit a strong bias towards predicting destabilization. We further show that homology between Ssym and widely used training sets like S2648 and VariBench has likely led to overestimated performance in previous studies. Finally, we demonstrate the practical utility of ThermoNet in predicting the ΔΔGs for two clinically relevant proteins, p53 and myoglobin, and for pathogenic and benign missense variants from ClinVar. Overall, our results suggest that 3D-CNNs can model the complex, non-linear interactions perturbed by mutations, directly from biophysical properties of atoms.


Subject(s)
Imaging, Three-Dimensional/methods , Neural Networks, Computer , Point Mutation , Proteins/chemistry , Thermodynamics , Computational Biology , Protein Stability
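
The abstract above describes treating protein structures as multi-channel 3D images built from raw atom coordinates. The sketch below shows one simple way to do that: each atom is binned into the voxel that contains it, with one grid per property channel. The channel names, grid size, spacing, and plain occupancy counting are assumptions for illustration and are not ThermoNet's actual featurization.

```python
import math


def voxelize(atoms, channels, center=(0.0, 0.0, 0.0), grid_size=16, spacing=1.0):
    """Bin atoms into a multi-channel cubic voxel grid.

    atoms    : iterable of (x, y, z, channel_name) tuples
    channels : ordered list of channel names (hypothetical property labels)
    center   : grid center, e.g. the coordinates of a mutated residue
    Returns a nested list indexed as grid[channel][i][j][k].
    """
    index = {name: c for c, name in enumerate(channels)}
    grid = [
        [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
        for _ in channels
    ]
    half = grid_size * spacing / 2.0
    for x, y, z, channel in atoms:
        # Convert each coordinate to a voxel index relative to the grid corner.
        ijk = [
            math.floor((coord - c0 + half) / spacing)
            for coord, c0 in zip((x, y, z), center)
        ]
        # Atoms that fall outside the box are ignored.
        if all(0 <= i < grid_size for i in ijk):
            i, j, k = ijk
            grid[index[channel]][i][j][k] += 1.0
    return grid


if __name__ == "__main__":
    demo_atoms = [(0.2, 0.2, 0.2, "hydrophobic"), (-1.5, 0.0, 0.0, "aromatic")]
    demo = voxelize(demo_atoms, ["hydrophobic", "aromatic"], grid_size=4)
    print(demo[0][2][2][2], demo[1][0][2][2])
```

A 3D-CNN would consume such a grid as a tensor of shape (channels, grid_size, grid_size, grid_size); smoothing each atom over neighbouring voxels is a common refinement that this sketch omits.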