Results 1 - 20 of 36
1.
bioRxiv ; 2023 Nov 13.
Article in English | MEDLINE | ID: mdl-38014075

ABSTRACT

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1-6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

3.
bioRxiv ; 2023 Dec 21.
Article in English | MEDLINE | ID: mdl-38187584

ABSTRACT

Regulatory DNA sequences within enhancers and promoters bind transcription factors to encode cell type-specific patterns of gene expression. However, the regulatory effects and programmability of such DNA sequences remain difficult to map or predict because we have lacked scalable methods to precisely edit regulatory DNA and quantify the effects in an endogenous genomic context. Here we present an approach to measure the quantitative effects of hundreds of designed DNA sequence variants on gene expression, by combining pooled CRISPR prime editing with RNA fluorescence in situ hybridization and cell sorting (Variant-FlowFISH). We apply this method to mutagenize and rewrite regulatory DNA sequences in an enhancer and the promoter of PPIF in two immune cell lines. Of 672 variant-cell type pairs, we identify 497 that affect PPIF expression. These variants appear to act through a variety of mechanisms including disruption or optimization of existing transcription factor binding sites, as well as creation of de novo sites. Disrupting a single endogenous transcription factor binding site often led to large changes in expression (up to -40% in the enhancer, and -50% in the promoter). The same variant often had different effects across cell types and states, demonstrating a highly tunable regulatory landscape. We use these data to benchmark performance of sequence-based predictive models of gene regulation, and find that certain types of variants are not accurately predicted by existing models. Finally, we computationally design 185 small sequence variants (≤10 bp) and optimize them for specific effects on expression in silico. 84% of these rationally designed edits showed the intended direction of effect, and some had dramatic effects on expression (-100% to +202%). Variant-FlowFISH thus provides a powerful tool to map the effects of variants and transcription factor binding sites on gene expression, test and improve computational models of gene regulation, and reprogram regulatory DNA.

4.
Genome Biol ; 23(1): 245, 2022 11 23.
Article in English | MEDLINE | ID: mdl-36419176

ABSTRACT

BACKGROUND: Degradation rate is a fundamental aspect of mRNA metabolism, and the factors governing it remain poorly characterized. Understanding the genetic and biochemical determinants of mRNA half-life would enable more precise identification of variants that perturb gene expression through post-transcriptional gene regulatory mechanisms. RESULTS: We establish a compendium of 39 human and 27 mouse transcriptome-wide mRNA decay rate datasets. A meta-analysis of these data identified a prevalence of technical noise and measurement bias, induced partially by the underlying experimental strategy. Correcting for these biases allowed us to derive more precise, consensus measurements of half-life which exhibit enhanced consistency between species. We trained substantially improved statistical models based upon genetic and biochemical features to better predict half-life and characterize the factors molding it. Our state-of-the-art model, Saluki, is a hybrid convolutional and recurrent deep neural network which relies only upon an mRNA sequence annotated with coding frame and splice sites to predict half-life (r=0.77). The key novel principle learned by Saluki is that the spatial positioning of splice sites, codons, and RNA-binding motifs within an mRNA is strongly associated with mRNA half-life. Saluki predicts the impact of RNA sequences and genetic mutations therein on mRNA stability, in agreement with functional measurements derived from massively parallel reporter assays. CONCLUSIONS: Our work produces a more robust ground truth for transcriptome-wide mRNA half-lives in mammalian cells. Using these revised measurements, we trained Saluki, a model that is over 50% more accurate in predicting half-life from sequence than existing models. Saluki succinctly captures many of the known determinants of mRNA half-life and can be rapidly deployed to predict the functional consequences of arbitrary mutations in the transcriptome.


Subject(s)
Mammals , RNA Stability , Humans , Animals , Mice , Mammals/genetics , RNA, Messenger/genetics , Transcriptome , Biological Assay
5.
Nat Methods ; 19(9): 1088-1096, 2022 09.
Article in English | MEDLINE | ID: mdl-35941239

ABSTRACT

Single-cell assay for transposase-accessible chromatin using sequencing (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain important challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single-cell multiome datasets, including cell clustering, scATAC profile denoising, data integration across assays and transcription factor activity inference.


Subject(s)
Chromatin Immunoprecipitation Sequencing , Chromatin , Chromatin/genetics , Epigenomics , Neural Networks, Computer , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Transposases/genetics
6.
Elife ; 11, 2022 02 04.
Article in English | MEDLINE | ID: mdl-35119359

ABSTRACT

The process wherein dividing cells exhaust proliferative capacity and enter into replicative senescence has become a prominent model for cellular aging in vitro. Despite decades of study, this cellular state is not fully understood in culture and even much less so during aging. Here, we revisit Leonard Hayflick's original observation of replicative senescence in WI-38 human lung fibroblasts equipped with a battery of modern techniques including RNA-seq, single-cell RNA-seq, proteomics, metabolomics, and ATAC-seq. We find evidence that the transition to a senescent state manifests early, increases gradually, and corresponds to a concomitant global increase in DNA accessibility in nucleolar and lamin associated domains. Furthermore, we demonstrate that senescent WI-38 cells acquire a striking resemblance to myofibroblasts in a process similar to the epithelial to mesenchymal transition (EMT) that is regulated by YAP1/TEAD1 and TGF-β2. Lastly, we show that verteporfin inhibition of YAP1/TEAD1 activity in aged WI-38 cells robustly attenuates this gene expression program.


Subject(s)
Cellular Senescence , Epithelial-Mesenchymal Transition , Aged , Aging/physiology , Cell Line , Cellular Senescence/genetics , Fibroblasts/metabolism , Humans
7.
Nat Methods ; 18(10): 1196-1203, 2021 10.
Article in English | MEDLINE | ID: mdl-34608324

ABSTRACT

How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.


Subject(s)
DNA/genetics , Databases, Genetic , Epigenesis, Genetic , Gene Expression Regulation , Machine Learning , Nerve Net , Animals , Cell Line , Genome , Genomics/methods , Humans , Mice , Quantitative Trait Loci
8.
Nat Commun ; 12(1): 5101, 2021 08 24.
Article in English | MEDLINE | ID: mdl-34429411

ABSTRACT

3' untranslated regions (3' UTRs) post-transcriptionally regulate mRNA stability, localization, and translation rate. While 3'-UTR isoforms have been globally quantified in limited cell types using bulk measurements, their differential usage among cell types during mammalian development remains poorly characterized. In this study, we examine a dataset comprising ~2 million nuclei spanning E9.5-E13.5 of mouse embryonic development to quantify transcriptome-wide changes in alternative polyadenylation (APA). We observe a global lengthening of 3' UTRs across embryonic stages in all cell types, although we detect shorter 3' UTRs in hematopoietic lineages and longer 3' UTRs in neuronal cell types within each stage. An analysis of RNA-binding protein (RBP) dynamics identifies ELAV-like family members, which are concomitantly induced in neuronal lineages and developmental stages experiencing 3'-UTR lengthening, as putative regulators of APA. By measuring 3'-UTR isoforms in an expansive single cell dataset, our work provides a transcriptome-wide and organism-wide map of the dynamic landscape of alternative polyadenylation during mammalian organogenesis.


Subject(s)
Embryonic Development/genetics , Embryonic Development/physiology , Polyadenylation , 3' Untranslated Regions , Animals , Gene Expression Regulation, Developmental , Mice , NIH 3T3 Cells , Neurons/metabolism , Organogenesis , Protein Isoforms , RNA Stability , RNA-Binding Proteins/metabolism , Transcriptome
9.
Nat Commun ; 12(1): 3394, 2021 06 07.
Article in English | MEDLINE | ID: mdl-34099641

ABSTRACT

The large majority of variants identified by GWAS are non-coding, motivating detailed characterization of the function of non-coding variants. Experimental methods to assess variants' effects on gene expression in native chromatin context via direct perturbation are low-throughput. Existing high-throughput computational predictors thus have lacked large gold standard sets of regulatory variants for training and validation. Here, we leverage a set of 14,807 putative causal eQTLs in humans obtained through statistical fine-mapping, and we use 6121 features to directly train a predictor of whether a variant modifies nearby gene expression. We call the resulting prediction the expression modifier score (EMS). We validate EMS by comparing its ability to prioritize functional variants with other major scores. We then use EMS as a prior for statistical fine-mapping of eQTLs to identify an additional 20,913 putatively causal eQTLs, and we incorporate EMS into co-localization analysis to identify 310 additional candidate genes across UK Biobank phenotypes.


Subject(s)
Chromosome Mapping/methods , Computational Biology/methods , Quantitative Trait Loci , Supervised Machine Learning , Adult , Cohort Studies , Datasets as Topic , Gene Expression Profiling , Humans , Polymorphism, Single Nucleotide
10.
Cell Rep ; 35(4): 109046, 2021 04 27.
Article in English | MEDLINE | ID: mdl-33910007

ABSTRACT

Skeletal muscle experiences a decline in lean mass and regenerative potential with age, in part due to intrinsic changes in progenitor cells. However, it remains unclear how age-related changes in progenitors manifest across a differentiation trajectory. Here, we perform single-cell RNA sequencing (RNA-seq) on muscle mononuclear cells from young and aged mice and profile muscle stem cells (MuSCs) and fibro-adipose progenitors (FAPs) after differentiation. Differentiation increases the magnitude of age-related change in MuSCs and FAPs, but it also masks a subset of age-related changes present in progenitors. Using a dynamical systems approach and RNA velocity, we find that aged MuSCs follow the same differentiation trajectory as young cells but stall in differentiation near a commitment decision. Our results suggest that differentiation reveals latent features of aging and that fate commitment decisions are delayed in aged myogenic cells in vitro.


Subject(s)
Aging/genetics , Muscle Development/genetics , Animals , Cell Differentiation , Cells, Cultured , Mice
11.
Genome Res ; 31(10): 1781-1793, 2021 10.
Article in English | MEDLINE | ID: mdl-33627475

ABSTRACT

Annotating cell identities is a common bottleneck in the analysis of single-cell genomics experiments. Here, we present scNym, a semisupervised, adversarial neural network that learns to transfer cell identity annotations from one experiment to another. scNym takes advantage of information in both labeled data sets and new, unlabeled data sets to learn rich representations of cell identity that enable effective annotation transfer. We show that scNym effectively transfers annotations across experiments despite biological and technical differences, achieving performance superior to existing methods. We also show that scNym models can synthesize information from multiple training and target data sets to improve performance. We show that in addition to high accuracy, scNym models are well calibrated and interpretable with saliency methods.


Subject(s)
Neural Networks, Computer
12.
Nat Methods ; 17(11): 1111-1117, 2020 11.
Article in English | MEDLINE | ID: mdl-33046897

ABSTRACT

In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
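The in silico saturation mutagenesis mentioned in this abstract can be illustrated with a minimal Python sketch. It is not Akita's code: `predict` stands in for any trained sequence model, and it is assumed here to return a single number per sequence, whereas a genome-folding model would return a contact map that first has to be summarized. `toy_gc_score` is a hypothetical placeholder scorer used only to make the example runnable.

```python
def saturation_mutagenesis(seq, predict, alphabet="ACGT"):
    """Score every single-base substitution of seq against the reference.

    predict is any callable mapping a sequence string to a number
    (an assumption; a real model's output would need summarizing first).
    Returns {(position, ref_base, alt_base): predict(mutant) - predict(seq)}.
    """
    ref_score = predict(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in alphabet:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref_base, alt)] = predict(mutant) - ref_score
    return effects


def toy_gc_score(seq):
    """Hypothetical stand-in for a trained model: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    effects = saturation_mutagenesis("ACGT", toy_gc_score)
    # Print the substitutions with the largest absolute predicted effect first.
    for key, delta in sorted(effects.items(), key=lambda kv: -abs(kv[1]))[:3]:
        print(key, delta)
```

A sequence of length L yields 3L substitutions, so the cost is 3L + 1 model evaluations, which is why fast prediction after training matters for this kind of scan.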


Subject(s)
CCCTC-Binding Factor/genetics , Cell Cycle Proteins/genetics , Chromosomal Proteins, Non-Histone/genetics , DNA-Binding Proteins/genetics , Genome, Human , Neural Networks, Computer , Sequence Analysis, DNA/methods , Gene Expression Regulation , Humans , Models, Genetic , Cohesins
13.
Nat Commun ; 11(1): 4703, 2020 09 17.
Article in English | MEDLINE | ID: mdl-32943643

ABSTRACT

Deep learning models have shown great promise in predicting regulatory effects from DNA sequence, but their informativeness for human complex diseases is not fully understood. Here, we evaluate genome-wide SNP annotations from two previous deep learning models, DeepSEA and Basenji, by applying stratified LD score regression to 41 diseases and traits (average N = 320K), conditioning on a broad set of coding, conserved and regulatory annotations. We aggregated annotations across all (respectively blood or brain) tissues/cell-types in meta-analyses across all (respectively 11 blood or 8 brain) traits. The annotations were highly enriched for disease heritability, but produced only limited conditionally significant results: non-tissue-specific and brain-specific Basenji-H3K4me3 for all traits and brain traits respectively. We conclude that deep learning models have yet to achieve their full potential to provide considerable unique information for complex disease, and that their conditional informativeness for disease cannot be inferred from their accuracy in predicting regulatory annotations.


Subject(s)
Deep Learning , Disease/genetics , Molecular Sequence Annotation , Alleles , Genetic Predisposition to Disease , Genome, Human , Genome-Wide Association Study , Histones/genetics , Humans , Linkage Disequilibrium , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide
14.
PLoS Comput Biol ; 16(7): e1008050, 2020 07.
Article in English | MEDLINE | ID: mdl-32687525

ABSTRACT

Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.


Subject(s)
Gene Expression Regulation , Genetic Variation , Machine Learning , Algorithms , Animals , Computational Biology , Databases, Genetic , Epigenomics , Genome, Human , Genomics , Hepatocytes/metabolism , Humans , Mice , Models, Genetic , Models, Statistical , Mutation , Neural Networks, Computer , Quantitative Trait Loci , Sequence Analysis, DNA , Software , Species Specificity
15.
Cell Syst ; 11(1): 95-101.e5, 2020 07 22.
Article in English | MEDLINE | ID: mdl-32592658

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) measurements of gene expression enable an unprecedented high-resolution view into cellular state. However, current methods often result in two or more cells that share the same cell-identifying barcode; these "doublets" violate the fundamental premise of single-cell technology and can lead to incorrect inferences. Here, we describe Solo, a semi-supervised deep learning approach that identifies doublets with greater accuracy than existing methods. Solo embeds cells unsupervised using a variational autoencoder and then appends a feed-forward neural network layer to the encoder to form a supervised classifier. We train this classifier to distinguish simulated doublets from the observed data. Solo can be applied in combination with experimental doublet detection methods to further purify scRNA-seq data to true single cells. It is freely available from https://github.com/calico/solo. A record of this paper's transparent peer review process is included in the Supplemental Information.
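The training setup described in this abstract, a classifier that separates simulated doublets from observed cells, can be sketched in Python as follows. This is an illustration only and is not taken from the Solo repository: it assumes that a doublet is simulated by summing the count vectors of two randomly chosen observed cells, and the function names `simulate_doublets` and `build_training_set` are hypothetical. The variational autoencoder and the classifier itself are omitted.

```python
import random


def simulate_doublets(counts, n_doublets, seed=0):
    """Create synthetic doublets by summing the gene counts of two distinct,
    randomly chosen cells (an assumed simulation scheme)."""
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        i, j = rng.sample(range(len(counts)), 2)  # two different cells
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
    return doublets


def build_training_set(counts, n_doublets, seed=0):
    """Return (features, labels): observed cells labeled 0, simulated doublets 1."""
    doublets = simulate_doublets(counts, n_doublets, seed)
    features = [list(cell) for cell in counts] + doublets
    labels = [0] * len(counts) + [1] * len(doublets)
    return features, labels


if __name__ == "__main__":
    # Toy cells-by-genes count matrix; real data would have thousands of genes.
    counts = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]
    features, labels = build_training_set(counts, n_doublets=2, seed=1)
    for row, label in zip(features, labels):
        print(label, row)
```

Observed cells carry label 0 even though some of them are real doublets; the classifier is trained against that noisy label and then used to score the observed cells.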


Subject(s)
Deep Learning/standards , RNA-Seq/methods , Single-Cell Analysis/methods , Humans
16.
Genome Res ; 29(12): 2088-2103, 2019 12.
Article in English | MEDLINE | ID: mdl-31754020

ABSTRACT

Aging is a pleiotropic process affecting many aspects of mammalian physiology. Mammals are composed of distinct cell type identities and tissue environments, but the influence of these cell identities and environments on the trajectory of aging in individual cells remains unclear. Here, we performed single-cell RNA-seq on >50,000 individual cells across three tissues in young and old mice to allow for direct comparison of aging phenotypes across cell types. We found transcriptional features of aging common across many cell types, as well as features of aging unique to each type. Leveraging matrix factorization and optimal transport methods, we found that both cell identities and tissue environments exert influence on the trajectory and magnitude of aging, with cell identity influence predominating. These results suggest that aging manifests with unique directionality and magnitude across the diverse cell identities in mammals.


Subject(s)
Aging , RNA-Seq , Sequence Analysis, RNA , Single-Cell Analysis , Aging/genetics , Aging/metabolism , Animals , Male , Mice
17.
Nat Genet ; 50(10): 1483-1493, 2018 10.
Article in English | MEDLINE | ID: mdl-30177862

ABSTRACT

Biological interpretation of genome-wide association study data frequently involves assessing whether SNPs linked to a biological process, for example, binding of a transcription factor, show unsigned enrichment for disease signal. However, signed annotations quantifying whether each SNP allele promotes or hinders the biological process can enable stronger statements about disease mechanism. We introduce a method, signed linkage disequilibrium profile regression, for detecting genome-wide directional effects of signed functional annotations on disease risk. We validate the method via simulations and application to molecular quantitative trait loci in blood, recovering known transcriptional regulators. We apply the method to expression quantitative trait loci in 48 Genotype-Tissue Expression tissues, identifying 651 transcription factor-tissue associations including 30 with robust evidence of tissue specificity. We apply the method to 46 diseases and complex traits (average n = 290 K), identifying 77 annotation-trait associations representing 12 independent transcription factor-trait associations, and characterize the underlying transcriptional programs using gene-set enrichment analyses. Our results implicate new causal disease genes and new disease mechanisms.


Subject(s)
Disease/genetics , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Quantitative Trait Loci , Transcription Factors/metabolism , Binding Sites/genetics , Blood Cells/metabolism , Blood Cells/pathology , Blood Chemical Analysis , Gene Expression Regulation , Genetic Predisposition to Disease , Humans , Linkage Disequilibrium , Phenotype , Polymorphism, Single Nucleotide , Protein Binding , Risk Factors
18.
Genome Res ; 28(5): 739-750, 2018 05.
Article in English | MEDLINE | ID: mdl-29588361

ABSTRACT

Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.


Subject(s)
Chromosomes/genetics , Computational Biology/methods , Neural Networks, Computer , Regulatory Sequences, Nucleic Acid/genetics , Animals , Epigenomics/methods , Gene Expression Profiling/methods , Gene Expression Regulation , Genomics/methods , Humans , Machine Learning , Models, Genetic , Polymorphism, Single Nucleotide , Promoter Regions, Genetic/genetics
19.
Nat Genet ; 50(2): 250-258, 2018 02.
Article in English | MEDLINE | ID: mdl-29358654

ABSTRACT

Transcription factors (TFs) direct developmental transitions by binding to target DNA sequences, influencing gene expression and establishing complex gene-regulatory networks. To systematically determine the molecular components that enable or constrain TF activity, we investigated the genomic occupancy of FOXA2, GATA4 and OCT4 in several cell types. Despite their classification as pioneer factors, all three TFs exhibit cell-type-specific binding, even when supraphysiologically and ectopically expressed. However, FOXA2 and GATA4 can be distinguished by low enrichment at loci that are highly occupied by these factors in alternative cell types. We find that expression of additional cofactors increases enrichment at a subset of these sites. Finally, FOXA2 occupancy and changes to DNA accessibility can occur in G1-arrested cells, but subsequent loss of DNA methylation requires DNA replication.


Subject(s)
DNA/metabolism , Epigenesis, Genetic/physiology , Gene Regulatory Networks/physiology , Transcription Factors/metabolism , A549 Cells , Binding Sites/genetics , Cell Lineage/drug effects , Cell Lineage/genetics , Cells, Cultured , Computational Biology , DNA/genetics , Epistasis, Genetic/physiology , GATA4 Transcription Factor/metabolism , Gene Expression Regulation , Genes, Switch , HEK293 Cells , Hep G2 Cells , Hepatocyte Nuclear Factor 3-beta/metabolism , Humans , Octamer Transcription Factor-3/metabolism , Protein Binding
20.
Elife ; 6, 2017 09 06.
Article in English | MEDLINE | ID: mdl-28875933

ABSTRACT

A substantial fraction of the genome is transcribed in a cell-type-specific manner, producing long non-coding RNAs (lncRNAs), rather than protein-coding transcripts. Here, we systematically characterize transcriptional dynamics during hematopoiesis and in hematological malignancies. Our analysis of annotated and de novo assembled lncRNAs showed many are regulated during differentiation and mis-regulated in disease. We assessed lncRNA function via an in vivo RNAi screen in a model of acute myeloid leukemia. This identified several lncRNAs essential for leukemia maintenance, and found that a number act by promoting leukemia stem cell signatures. Leukemia blasts showed a myeloid differentiation phenotype when these lncRNAs were depleted, and our data indicate that this effect is mediated via effects on the MYC oncogene. Bone marrow reconstitutions showed that a lncRNA expressed across all progenitors was required for the myeloid lineage, whereas the other leukemia-induced lncRNAs were dispensable in the normal setting.


Subject(s)
Cell Differentiation , Gene Expression Regulation , Hematopoiesis , Leukemia, Myeloid, Acute/pathology , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , Animals , Disease Models, Animal , Gene Expression Profiling , Mice