ABSTRACT
To facilitate single-cell multi-omics analysis and improve reproducibility, we present single-cell pipeline for end-to-end data integration (SPEEDI), a fully automated end-to-end framework for batch inference, data integration, and cell-type labeling. SPEEDI introduces data-driven batch inference and transforms the often heterogeneous data matrices obtained from different samples into a uniformly annotated and integrated dataset. Without requiring user input, it automatically selects parameters and executes pre-processing, sample integration, and cell-type mapping. It can also perform downstream analyses of differential signals between treatment conditions and gene functional modules. SPEEDI's data-driven batch-inference method works with widely used integration and cell-typing tools. By developing data-driven batch inference, providing full end-to-end automation, and eliminating parameter selection, SPEEDI improves reproducibility and lowers the barrier to obtaining biological insight from these valuable single-cell datasets. The SPEEDI interactive web application can be accessed at https://speedi.princeton.edu/. A record of this paper's transparent peer review process is included in the supplemental information.
Subjects
Single-Cell Analysis, Single-Cell Analysis/methods, Humans, Software, Computational Biology/methods, Reproducibility of Results, Automation/methods
ABSTRACT
Unraveling the phenotypic and genetic complexity of autism is extremely challenging yet critical for understanding the biology, inheritance, trajectory, and clinical manifestations of the many forms of the condition. Here, we leveraged broad phenotypic data from a large cohort with matched genetics to characterize classes of autism and their patterns of core, associated, and co-occurring traits, ultimately demonstrating that phenotypic patterns are associated with distinct genetic and molecular programs. We used a generative mixture modeling approach to identify robust, clinically-relevant classes of autism which we validate and replicate in a large independent cohort. We link the phenotypic findings to distinct patterns of de novo and inherited variation which emerge from the deconvolution of these genetic signals, and demonstrate that class-specific common variant scores strongly align with clinical outcomes. We further provide insights into the distinct biological pathways and processes disrupted by the sets of mutations in each class. Remarkably, we discover class-specific differences in the developmental timing of genes that are dysregulated, and these temporal patterns correspond to clinical milestone and outcome differences between the classes. These analyses embrace the phenotypic complexity of children with autism, unraveling genetic and molecular programs underlying their heterogeneity and suggesting specific biological dysregulation patterns and mechanistic hypotheses.
ABSTRACT
MOTIVATION: In recent years, deep learning has become one of the central approaches in a number of applications, including many tasks in genomics. However, as models grow in depth and complexity, they either require more data or a strategic initialization technique to improve performance. RESULTS: In this project, we introduce cGen, a novel unsupervised, model-agnostic contrastive pre-training method for sequence-based models. cGen can be used before training to initialize weights, reducing the size of the dataset needed. It works by learning the intrinsic features of the reference genome and makes no assumptions about the underlying structure. We show that the embeddings produced by the unsupervised model are already informative for gene expression prediction and that the sequence features provide a meaningful clustering. We demonstrate that cGen improves model performance in various sequence-based deep learning applications, such as chromatin profiling prediction and gene expression prediction. Our findings suggest that using cGen, particularly in areas constrained by data availability, could improve the performance of deep learning genomic models without the need to modify the model architecture.
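As a rough illustration of what a contrastive pre-training objective can look like, the sketch below computes an InfoNCE-style loss in plain Python. The abstract does not say which loss cGen uses, so the function name `info_nce_loss`, the temperature value, and the toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from cGen).

    Each anchor should be most similar to the positive at the same index;
    the remaining positives in the batch serve as negatives.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: two views of the same sequence should land close together.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when matching pairs are the most similar items in the batch and large otherwise, which is the signal a pre-training step of this kind would use to shape embeddings before any labeled data is seen.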
ABSTRACT
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Subjects
Deep Learning, Gene Expression Regulation, Transcription, Genetic, Humans, Genome, Human, Epigenesis, Genetic
ABSTRACT
Neurons from layer II of the entorhinal cortex (ECII) are the first to accumulate tau protein aggregates and degenerate during prodromal Alzheimer's disease. Gaining insight into the molecular mechanisms underlying this vulnerability will help reveal genes and pathways at play during incipient stages of the disease. Here, we use a data-driven functional genomics approach to model ECII neurons in silico and identify the proto-oncogene DEK as a regulator of tau pathology. We show that epigenetic changes caused by Dek silencing alter activity-induced transcription, with major effects on neuronal excitability. This is accompanied by the gradual accumulation of tau in the somatodendritic compartment of mouse ECII neurons in vivo, reactivity of surrounding microglia, and microglia-mediated neuron loss. These features are all characteristic of early Alzheimer's disease. The existence of a cell-autonomous mechanism linking Alzheimer's disease pathogenic mechanisms in the precise neuron type where the disease starts provides unique evidence that synaptic homeostasis dysregulation is of central importance in the onset of tau pathology in Alzheimer's disease.
Subjects
Alzheimer Disease, Neurons, Proto-Oncogene Mas, tau Proteins, Alzheimer Disease/metabolism, Alzheimer Disease/pathology, Animals, Neurons/metabolism, tau Proteins/metabolism, Mice, Entorhinal Cortex/metabolism, Entorhinal Cortex/pathology, Humans, Mice, Transgenic
ABSTRACT
Protein-protein interactions (PPIs) drive cellular processes and responses to environmental cues, reflecting the cellular state. Here we develop Tapioca, an ensemble machine learning framework for studying global PPIs in dynamic contexts. Tapioca predicts de novo interactions by integrating mass spectrometry interactome data from thermal/ion denaturation or cofractionation workflows with protein properties and tissue-specific functional networks. Focusing on the thermal proximity coaggregation method, we improved the experimental workflow. Finely tuned thermal denaturation afforded increased throughput, while cell lysis optimization enhanced protein detection from different subcellular compartments. The Tapioca workflow was next leveraged to investigate viral infection dynamics. Temporal PPIs were characterized during the reactivation from latency of the oncogenic Kaposi's sarcoma-associated herpesvirus. Together with functional assays, NUCKS was identified as a proviral hub protein, and a broader role was uncovered by integrating PPI networks from alpha- and betaherpesvirus infections. Altogether, Tapioca provides a web-accessible platform for predicting PPIs in dynamic contexts.
Subjects
Herpesvirus 8, Human, Manihot, Sarcoma, Kaposi, Sarcoma, Kaposi/metabolism, Viral Proteins/metabolism, Manihot/metabolism, Virus Latency, Herpesvirus 8, Human/metabolism
ABSTRACT
Single-cell, same-cell RNA-seq/ATAC-seq multiome data provide unparalleled potential to develop high-resolution maps of the cell-type-specific transcriptional regulatory circuitry underlying gene expression. We present CREMA, a framework that recovers the full cis-regulatory circuitry by modeling gene expression and chromatin activity in individual cells without peak-calling or cell-type labeling constraints. We demonstrate that CREMA overcomes the limitations of existing methods, which fail to identify the roughly half of functional regulatory elements that lie outside the called chromatin 'peaks'. These circuit sites outside called peaks are shown to be important cell-type-specific functional regulatory loci, sufficient to distinguish individual cell types. Analysis of mouse pituitary data identifies a Gata2 circuit for the gonadotrope-enriched, disease-associated Pcsk1 gene, which is experimentally validated by reduced gonadotrope expression in a gonadotrope conditional Gata2-knockout model. We present a web-accessible human immune cell regulatory circuit resource and provide CREMA as an R package.
Subjects
Gonadotrophs, Pituitary Gland, Mice, Humans, Animals, Pituitary Gland/metabolism, Gonadotrophs/metabolism, Chromatin/genetics, Chromatin/metabolism, Regulatory Sequences, Nucleic Acid
ABSTRACT
Resolving chromatin-remodeling-linked gene expression changes at cell-type resolution is important for understanding disease states. Here we describe MAGICAL (Multiome Accessibility Gene Integration Calling and Looping), a hierarchical Bayesian approach that leverages paired single-cell RNA sequencing and single-cell transposase-accessible chromatin sequencing from different conditions to map disease-associated transcription factors, chromatin sites, and genes as regulatory circuits. By simultaneously modeling signal variation across cells and conditions in both omics data types, MAGICAL achieved high accuracy on circuit inference. We applied MAGICAL to study Staphylococcus aureus sepsis from peripheral blood mononuclear single-cell data that we generated from subjects with bloodstream infection and uninfected controls. MAGICAL identified sepsis-associated regulatory circuits predominantly in CD14 monocytes, known to be activated by bacterial sepsis. We addressed the challenging problem of distinguishing host regulatory circuit responses to methicillin-resistant and methicillin-susceptible S. aureus infections. Although differential expression analysis failed to show predictive value, MAGICAL identified epigenetic circuit biomarkers that distinguished methicillin-resistant from methicillin-susceptible S. aureus infections.
ABSTRACT
To facilitate single cell multi-omics analysis and improve reproducibility, we present SPEEDI (Single-cell Pipeline for End to End Data Integration), a fully automated end-to-end framework for batch inference, data integration, and cell type labeling. SPEEDI introduces data-driven batch inference and transforms the often heterogeneous data matrices obtained from different samples into a uniformly annotated and integrated dataset. Without requiring user input, it automatically selects parameters and executes pre-processing, sample integration, and cell type mapping. It can also perform downstream analyses of differential signals between treatment conditions and gene functional modules. SPEEDI's data-driven batch inference method works with widely used integration and cell-typing tools. By developing data-driven batch inference, providing full end-to-end automation, and eliminating parameter selection, SPEEDI improves reproducibility and lowers the barrier to obtaining biological insight from these valuable single-cell datasets. The SPEEDI interactive web application can be accessed at https://speedi.princeton.edu/.
ABSTRACT
Finely-tuned enzymatic pathways control cellular processes, and their dysregulation can lead to disease. Creating predictive and interpretable models for these pathways is challenging because of the complexity of the pathways and of the cellular and genomic contexts. Here we introduce Elektrum, a deep learning framework which addresses these challenges with data-driven and biophysically interpretable models for determining the kinetics of biochemical systems. First, it uses in vitro kinetic assays to rapidly hypothesize an ensemble of high-quality Kinetically Interpretable Neural Networks (KINNs) that predict reaction rates. It then employs a novel transfer learning step, where the KINNs are inserted as intermediary layers into deeper convolutional neural networks, fine-tuning the predictions for reaction-dependent in vivo outcomes. Elektrum makes effective use of the limited, but clean in vitro data and the complex, yet plentiful in vivo data that captures cellular context. We apply Elektrum to predict CRISPR-Cas9 off-target editing probabilities and demonstrate that Elektrum achieves state-of-the-art performance, regularizes neural network architectures, and maintains physical interpretability.
ABSTRACT
Endurance exercise is an important health modifier. We studied cell-type specific adaptations of human skeletal muscle to acute endurance exercise using single-nucleus (sn) multiome sequencing in human vastus lateralis samples collected before and 3.5 hours after 40 min exercise at 70% VO2max in four subjects, as well as in matched time of day samples from two supine resting circadian controls. High quality same-cell RNA-seq and ATAC-seq data were obtained from 37,154 nuclei comprising 14 cell types. Among muscle fiber types, both shared and fiber-type specific regulatory programs were identified. Single-cell circuit analysis identified distinct adaptations in fast, slow and intermediate fibers as well as LUM-expressing FAP cells, involving a total of 328 transcription factors (TFs) acting at altered accessibility sites regulating 2,025 genes. These data and circuit mapping provide single-cell insight into the processes underlying tissue and metabolic remodeling responses to exercise.
ABSTRACT
Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. Genetic variation in the enormous noncoding space is linked to the majority of disease risk. To address the problem of linking these variants to expression changes in primary human cells, we introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. We provide models for 105 primary human cell types covering 7 organ systems, demonstrate their accuracy, and then apply them to prioritize relevant cell types for complex human diseases. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell types. We demonstrate the accuracy of our approach through systematic evaluations and apply the models to prioritize ClinVar clinical variants of uncertain significance, verifying our top predictions experimentally.
Subjects
Ascomycota, Humans, Gene Expression/genetics
ABSTRACT
Assays detecting blood transcriptome changes are studied for infectious disease diagnosis. Blood-based RNA alternative splicing (AS) events, which have not been well characterized in pathogen infection, have potential normalization and assay platform stability advantages over gene expression for diagnosis. Here, we present a computational framework for developing AS diagnostic biomarkers. Leveraging a large prospective cohort of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and whole-blood RNA sequencing (RNA-seq) data, we identify a major functional AS program switch upon viral infection. Using an independent cohort, we demonstrate the improved accuracy of AS biomarkers for SARS-CoV-2 diagnosis compared with six reported transcriptome signatures. We then optimize a subset of AS-based biomarkers to develop microfluidic PCR diagnostic assays. This assay achieves nearly perfect test accuracy (61/62 = 98.4%) using a naive principal component classifier, significantly more accurate than a gene expression PCR assay in the same cohort. Therefore, our RNA splicing computational framework enables a promising avenue for host-response diagnosis of infection.
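The reported accuracy can be checked directly from the counts given in the abstract. The sketch below also computes a Wilson score interval as one way to gauge the uncertainty around 61 correct calls out of 62; the interval is an illustrative addition, not a figure reported by the authors, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


correct, total = 61, 62  # counts taken from the abstract
accuracy = correct / total
low, high = wilson_interval(correct, total)

print(f"accuracy = {accuracy:.1%}")
print(f"approx. 95% Wilson interval = ({low:.3f}, {high:.3f})")
```

With only 62 test samples the interval is noticeably wider than the point estimate alone suggests, which is worth keeping in mind when comparing assays evaluated on the same cohort.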
Subjects
COVID-19, Communicable Diseases, Humans, SARS-CoV-2/genetics, COVID-19/diagnosis, Alternative Splicing/genetics, COVID-19 Testing, RNA, Prospective Studies, Biomarkers/analysis
ABSTRACT
DNA methylation comprises a cumulative record of lifetime exposures superimposed on genetically determined markers. Little is known about methylation dynamics in humans following an acute perturbation, such as infection. We characterized the temporal trajectory of blood epigenetic remodeling in 133 participants in a prospective study of young adults before, during, and after asymptomatic and mildly symptomatic SARS-CoV-2 infection. The differential methylation caused by asymptomatic or mildly symptomatic infections was indistinguishable. While differential gene expression largely returned to baseline levels after the virus became undetectable, some differentially methylated sites persisted for months of follow-up, with a pattern resembling autoimmune or inflammatory disease. We leveraged these responses to construct methylation-based machine learning models that distinguished samples from pre-, during-, and postinfection time periods, and quantitatively predicted the time since infection. The clinical trajectory in the young adults and in a diverse cohort with more severe outcomes was predicted by the similarity of methylation before or early after SARS-CoV-2 infection to the model-defined postinfection state. Unlike the phenomenon of trained immunity, the postacute SARS-CoV-2 epigenetic landscape we identify is antiprotective.
Subjects
COVID-19, Young Adult, Humans, COVID-19/genetics, SARS-CoV-2/genetics, Prospective Studies, DNA Methylation/genetics, Protein Processing, Post-Translational
ABSTRACT
Periodontal disease is more common in individuals with rheumatoid arthritis (RA) who have detectable anti-citrullinated protein antibodies (ACPAs), implicating oral mucosal inflammation in RA pathogenesis. Here, we performed paired analysis of human and bacterial transcriptomics in longitudinal blood samples from RA patients. We found that patients with RA and periodontal disease experienced repeated oral bacteremias associated with transcriptional signatures of ISG15(+)HLADR(hi) and CD48(high)S100A2(pos) monocytes, recently identified in inflamed RA synovia and blood of those with RA flares. The oral bacteria observed transiently in blood were broadly citrullinated in the mouth, and their in situ citrullinated epitopes were targeted by extensively somatically hypermutated ACPAs encoded by RA blood plasmablasts. Together, these results suggest that (i) periodontal disease results in repeated breaches of the oral mucosa that release citrullinated oral bacteria into circulation, which (ii) activate inflammatory monocyte subsets that are observed in inflamed RA synovia and blood of RA patients with flares and (iii) activate ACPA B cells, thereby promoting affinity maturation and epitope spreading to citrullinated human antigens.
Subjects
Arthritis, Rheumatoid, Periodontal Diseases, Humans, Autoantibodies, Mouth Mucosa, Antibody Formation, Epitopes, Bacteria
ABSTRACT
Finely tuned enzymatic pathways control cellular processes, and their dysregulation can lead to disease. Developing predictive and interpretable models for these pathways is challenging because of the complexity of the pathways and of the cellular and genomic contexts. Here we introduce Elektrum, a deep learning framework that addresses these challenges with data-driven and biophysically interpretable models for determining the kinetics of biochemical systems. First, it uses in vitro kinetic assays to rapidly hypothesize an ensemble of high-quality kinetically interpretable neural networks (KINNs) that predict reaction rates. It then employs a transfer learning step, where the KINNs are inserted as intermediary layers into deeper convolutional neural networks, fine-tuning the predictions for reaction-dependent in vivo outcomes. We apply Elektrum to predict CRISPR-Cas9 off-target editing probabilities and demonstrate that Elektrum achieves improved performance, regularizes neural network architectures and maintains physical interpretability.
Subjects
CRISPR-Cas Systems, Neural Networks, Computer, CRISPR-Cas Systems/genetics, RNA, Guide, CRISPR-Cas Systems, Genomics, Machine Learning
ABSTRACT
The identification of a COVID-19 host response signature in blood can increase the understanding of SARS-CoV-2 pathogenesis and improve diagnostic tools. Applying a multi-objective optimization framework to both massive public and new multi-omics data, we identified a COVID-19 signature regulated at both transcriptional and epigenetic levels. We validated the signature's robustness in multiple independent COVID-19 cohorts. Using public data from 8,630 subjects and 53 conditions, we demonstrated no cross-reactivity with other viral and bacterial infections, COVID-19 comorbidities, or confounders. In contrast, previously reported COVID-19 signatures were associated with significant cross-reactivity. The signature's interpretation, based on cell-type deconvolution and single-cell data analysis, revealed prominent yet complementary roles for plasmablasts and memory T cells. Although the signal from plasmablasts mediated COVID-19 detection, the signal from memory T cells controlled against cross-reactivity with other viral infections. This framework identified a robust, interpretable COVID-19 signature and is broadly applicable in other disease contexts. A record of this paper's transparent peer review process is included in the supplemental information.
Subjects
COVID-19, Virus Diseases, Humans, SARS-CoV-2
ABSTRACT
Male sex is a major risk factor for SARS-CoV-2 infection severity. To understand the basis for this sex difference, we studied SARS-CoV-2 infection in a young adult cohort of United States Marine recruits. Among 2,641 male and 244 female unvaccinated and seronegative recruits studied longitudinally, SARS-CoV-2 infections occurred in 1,033 males and 137 females. We identified sex differences in symptoms, viral load, blood transcriptome, RNA splicing, and proteomic signatures. Females had higher pre-infection expression of antiviral interferon-stimulated gene (ISG) programs. Causal mediation analysis implicated ISG differences in number of symptoms, levels of ISGs, and differential splicing of CD45 lymphocyte phosphatase during infection. Our results indicate that the antiviral innate immunity set point causally contributes to sex differences in response to SARS-CoV-2 infection. A record of this paper's transparent peer review process is included in the supplemental information.
Subjects
COVID-19, Immunity, Innate, Sex Characteristics, Female, Humans, Male, Young Adult, COVID-19/immunology, Interferons, Proteomics, SARS-CoV-2
ABSTRACT
BACKGROUND: Marine recruits training at Parris Island experienced an unexpectedly high rate of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, despite preventive measures including a supervised, 2-week, pre-entry quarantine. We characterize SARS-CoV-2 transmission in this cohort. METHODS: Between May and November 2020, we monitored 2,469 unvaccinated, mostly male, Marine recruits prospectively during basic training. If participants tested negative for SARS-CoV-2 by quantitative polymerase chain reaction (qPCR) at the end of quarantine, they were transferred to the training site in segregated companies and underwent biweekly testing for 6 weeks. We assessed the effects of coronavirus disease 2019 (COVID-19) prevention measures on other respiratory infections with passive surveillance data, performed phylogenetic analysis, and modeled transmission dynamics and testing regimens. RESULTS: Preventive measures were associated with drastically lower rates of other respiratory illnesses. However, among the trainees, 1,107 (44.8%) tested SARS-CoV-2-positive, with either mild or no symptoms. Phylogenetic analysis of viral genomes from 580 participants revealed that all cases but one were linked to five independent introductions, each characterized by accumulation of mutations across and within companies, and similar viral isolates in individuals from the same company. Variation in company transmission rates (mean reproduction number R0, 5.5 [95% confidence interval (CI), 5.0-6.1]) could be accounted for by multiple initial cases within a company and superspreader events. Simulations indicate that frequent rapid-report testing with case isolation may minimize outbreaks. CONCLUSIONS: Transmission of wild-type SARS-CoV-2 among Marine recruits was approximately twice that seen in the community. Insights from SARS-CoV-2 outbreak dynamics and mutations spread in a remote, congregate setting may inform effective mitigation strategies.
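The attack rate quoted in the RESULTS can be reproduced from the two counts in the abstract. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Counts taken from the abstract.
monitored = 2469   # recruits followed prospectively
positive = 1107    # recruits who tested SARS-CoV-2-positive

attack_rate = positive / monitored
attack_rate_text = f"{attack_rate:.1%}"
print(attack_rate_text)  # matches the 44.8% reported in the abstract
```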