ABSTRACT
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Subject(s)
Artificial Intelligence , Deep Learning , Genomics
ABSTRACT
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially [1]. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
Subject(s)
Cloud Computing , Databases, Genetic , RNA Viruses/genetics , RNA Viruses/isolation & purification , Sequence Alignment/methods , Virology/methods , Virome/genetics , Animals , Archives , Bacteriophages/enzymology , Bacteriophages/genetics , Biodiversity , Coronavirus/classification , Coronavirus/enzymology , Coronavirus/genetics , Evolution, Molecular , Hepatitis Delta Virus/enzymology , Hepatitis Delta Virus/genetics , Humans , Models, Molecular , RNA Viruses/classification , RNA Viruses/enzymology , RNA-Dependent RNA Polymerase/chemistry , RNA-Dependent RNA Polymerase/genetics , Software
ABSTRACT
Regulatory T cell (Treg) therapy is a promising approach to improve outcomes in transplantation and autoimmunity. In conventional T cell therapy, chronic stimulation can result in poor in vivo function, a phenomenon termed exhaustion. Whether or not Tregs are also susceptible to exhaustion, and if so, if this would limit their therapeutic effect, was unknown. To "benchmark" exhaustion in human Tregs, we used a method known to induce exhaustion in conventional T cells: expression of a tonic-signaling chimeric antigen receptor (TS-CAR). We found that TS-CAR-expressing Tregs rapidly acquired a phenotype that resembled exhaustion and had major changes in their transcriptome, metabolism, and epigenome. Similar to conventional T cells, TS-CAR Tregs upregulated expression of inhibitory receptors and transcription factors such as PD-1, TIM3, TOX and BLIMP1, and displayed a global increase in chromatin accessibility-enriched AP-1 family transcription factor binding sites. However, they also displayed Treg-specific changes such as high expression of 4-1BB, LAP, and GARP. DNA methylation analysis and comparison to a CD8+ T cell-based multipotency index showed that Tregs naturally exist in a relatively differentiated state, with further TS-CAR-induced changes. Functionally, TS-CAR Tregs remained stable and suppressive in vitro but were nonfunctional in vivo, as tested in a model of xenogeneic graft-versus-host disease. These data are the first comprehensive investigation of exhaustion in Tregs and reveal key similarities and differences with exhausted conventional T cells. The finding that human Tregs are susceptible to chronic stimulation-driven dysfunction has important implications for the design of CAR Treg adoptive immunotherapy strategies.
Subject(s)
Graft vs Host Disease , Receptors, Chimeric Antigen , Humans , T-Lymphocytes, Regulatory , T-Cell Exhaustion , Immunotherapy, Adoptive/methods , Receptors, Antigen, T-Cell/genetics , Receptors, Antigen, T-Cell/metabolism
ABSTRACT
Deciphering the environmental contexts in which genetic effects are most prominent is central for making full use of GWAS results in follow-up experiment design and treatment development. However, measuring a large number of environmental factors at high granularity might not always be feasible. Instead, here we propose extracting cellular embedding of environmental factors from gene expression data by using latent variable (LV) analysis and taking these LVs as environmental proxies in detecting gene-by-environment (GxE) interaction effects on gene expression, i.e., GxE expression quantitative trait loci (eQTLs). Applying this approach to the two largest brain eQTL datasets (n = 1,100), we show that LVs and GxE eQTLs in one dataset replicate well in the other dataset. Combining the two samples via meta-analysis, 895 GxE eQTLs are identified. On average, GxE effect explains an additional ~4% variation in expression of each gene that displays a GxE effect. Ten of these 52 genes are associated with cell-type-specific eQTLs, and the remaining genes are multi-functional. Furthermore, after substituting LVs with expression of transcription factors (TF), we found 91 TF-specific eQTLs, which demonstrates an important use of our brain GxE eQTLs.
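One common way to formalize a GxE eQTL test of this kind is a linear model with a genotype-by-LV interaction term. The form below is a hedged sketch of that general idea, not necessarily the exact model used in the study; all symbols are assumptions introduced for illustration (y_i: expression of a gene in sample i, g_i: genotype dosage, l_i: value of one latent variable, eps_i: residual).

```latex
y_i = \beta_0 + \beta_G\, g_i + \beta_E\, \ell_i + \beta_{G \times E}\,(g_i \cdot \ell_i) + \varepsilon_i
```

Under this assumed form, a GxE eQTL corresponds to evidence that \beta_{G \times E} differs from zero, and the additional expression variance explained by the interaction can be read as the gain in R^2 over the model without the (g_i \cdot \ell_i) term.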
Subject(s)
Brain/metabolism , Genotype , Transcriptome , Humans , Quantitative Trait Loci
ABSTRACT
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Subject(s)
Genomics , Neural Networks, Computer , Genomics/methods , Chromatin/genetics , Protein Binding
ABSTRACT
Improving methods for human embryonic stem cell differentiation represents a challenge in modern regenerative medicine research. Using drug repurposing approaches, we discover small molecules that regulate the formation of definitive endoderm. Among them are inhibitors of known processes involved in endoderm differentiation (mTOR, PI3K, and JNK pathways) and a new compound, with an unknown mechanism of action, capable of inducing endoderm formation in the absence of growth factors in the media. Optimization of the classical protocol by inclusion of this compound achieves the same differentiation efficiency with a 90% cost reduction. The presented in silico procedure for candidate molecule selection has broad potential for improving stem cell differentiation protocols.
Subject(s)
Endoderm , Human Embryonic Stem Cells , Humans , Cell Differentiation/physiology
ABSTRACT
BACKGROUND: Deep learning has proven to be a powerful technique for transcription factor (TF) binding prediction but requires large training datasets. Transfer learning can reduce the amount of data required for deep learning, while improving overall model performance, compared to training a separate model for each new task. RESULTS: We assess a transfer learning strategy for TF binding prediction consisting of a pre-training step, wherein we train a multi-task model with multiple TFs, and a fine-tuning step, wherein we initialize single-task models for individual TFs with the weights learned by the multi-task model, after which the single-task models are trained at a lower learning rate. We corroborate that transfer learning improves model performance, especially if in the pre-training step the multi-task model is trained with biologically relevant TFs. We show the effectiveness of transfer learning for TFs with ~500 ChIP-seq peak regions. Using model interpretation techniques, we demonstrate that the features learned in the pre-training step are refined in the fine-tuning step to resemble the binding motif of the target TF (i.e., the recipient of transfer learning in the fine-tuning step). Moreover, pre-training with biologically relevant TFs allows single-task models in the fine-tuning step to learn useful features other than the motif of the target TF. CONCLUSIONS: Our results confirm that transfer learning is a powerful technique for TF binding prediction.
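The two-step strategy described above (multi-task pre-training, then single-task fine-tuning at a lower learning rate) can be illustrated with a minimal sketch. This is a toy stand-in, not the authors' implementation: it assumes a tiny one-hidden-layer network in NumPy instead of a CNN, random features instead of ChIP-seq sequences, and a freshly initialized output head for the target TF. All names (`train`, `loss`, `W_pre`, `V_new`, and so on) and all hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, V):
    """Shared hidden layer (W) followed by one logistic output per task (V)."""
    return sigmoid(np.tanh(X @ W) @ V)

def loss(X, Y, W, V):
    """Binary cross-entropy, summed over tasks and averaged over samples."""
    P = np.clip(forward(X, W, V), 1e-9, 1 - 1e-9)
    return float(-np.mean(np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P), axis=1)))

def train(X, Y, W, V, lr, epochs):
    """Full-batch gradient descent; returns updated copies of W and V."""
    W, V = W.copy(), V.copy()
    for _ in range(epochs):
        H = np.tanh(X @ W)
        dZ = (sigmoid(H @ V) - Y) / len(X)      # gradient w.r.t. output logits
        dV = H.T @ dZ
        dW = X.T @ ((dZ @ V.T) * (1 - H ** 2))  # backprop through tanh
        W -= lr * dW
        V -= lr * dV
    return W, V

# Toy data (assumption): 8 numeric features stand in for sequence features.
n_features, n_hidden, n_tasks = 8, 4, 3
B = rng.normal(size=(n_features, n_tasks))      # hidden rules of the pre-training "TFs"
X_multi = rng.normal(size=(500, n_features))
Y_multi = (X_multi @ B > 0).astype(float)

# Target "TF" with few examples; its rule is related to the pre-training tasks.
X_target = rng.normal(size=(40, n_features))
Y_target = (X_target @ B.mean(axis=1, keepdims=True) > 0).astype(float)

# Step 1: pre-train a multi-task model on all pre-training tasks.
W0 = rng.normal(scale=0.1, size=(n_features, n_hidden))
V0 = rng.normal(scale=0.1, size=(n_hidden, n_tasks))
W_pre, V_pre = train(X_multi, Y_multi, W0, V0, lr=0.5, epochs=300)

# Step 2: fine-tune a single-task model that starts from the shared weights,
# with a fresh output head (assumption) and a lower learning rate.
V_new = np.zeros((n_hidden, 1))
loss_before = loss(X_target, Y_target, W_pre, V_new)
W_ft, V_ft = train(X_target, Y_target, W_pre, V_new, lr=0.05, epochs=300)
loss_after = loss(X_target, Y_target, W_ft, V_ft)

print(f"target loss before fine-tuning: {loss_before:.3f}, after: {loss_after:.3f}")
```

The lower learning rate in the second step is the design choice the abstract highlights: it lets the pre-trained shared weights be refined toward the target TF rather than overwritten. Whether the output layer is reinitialized or reused is not stated in the abstract, so the zero-initialized head here is only one plausible choice.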