ABSTRACT
Because of its ability to find complex patterns in high-dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to gain novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.
Subject(s)
Computational Biology/methods , Medical Genetics , Population Genetics , Human Genome , Machine Learning , Humans
ABSTRACT
The ability to predict traits from genome-wide sequence information (i.e., genomic prediction) has improved our understanding of the genetic basis of complex traits and transformed breeding practices. Transcriptome data may also be useful for genomic prediction. However, it remains unclear how well transcript levels can predict traits, particularly when traits are scored at different developmental stages. Using maize (Zea mays) genetic markers and transcript levels from seedlings to predict mature plant traits, we found that transcript and genetic marker models have similar performance. When the transcripts and genetic markers with the greatest weights (i.e., the most important) in those models were used in one joint model, performance increased. Furthermore, genetic markers important for predictions were not close to or identified as regulatory variants for important transcripts. These findings demonstrate that transcript levels are useful for predicting traits and that their predictive power is not simply due to genetic variation in the transcribed genomic regions. Finally, genetic marker models identified only 1 of 14 benchmark flowering-time genes, while transcript models identified 5. These data highlight that, in addition to being useful for genomic prediction, transcriptome data can provide a link between traits and variation that cannot be readily captured at the sequence level.
Subject(s)
Plant Genome/genetics , Multifactorial Inheritance , Transcriptome , Zea mays/genetics , Genetic Markers , Genetic Variation , Genome-Wide Association Study , Genomics , Genetic Models , Phenotype
ABSTRACT
Plant iron deficiency (-Fe) activates a complex regulatory network that coordinates root Fe uptake and distribution to sink tissues. In Arabidopsis (Arabidopsis thaliana), FER-LIKE FE DEFICIENCY-INDUCED TRANSCRIPTION FACTOR (FIT), a basic helix-loop-helix (bHLH) transcription factor (TF), regulates root Fe acquisition genes. Many other -Fe-induced genes are FIT independent and are instead regulated by other bHLH TFs and by as yet unknown TFs. The cis-regulatory code, that is, the cis-regulatory elements (CREs) and their combinations that regulate plant -Fe responses, remains largely elusive. Using Arabidopsis root transcriptome data and coexpression clustering, we identified over 100 putative CREs (pCREs) that predicted -Fe-induced gene expression in computational models. To assess pCRE properties and possible functions, we used large-scale in vitro TF binding data, positional bias, and evolutionary conservation. As one example, our approach uncovered pCREs resembling IDE1 (iron deficiency-responsive element 1), a known grass -Fe response CRE. Arabidopsis IDE1-likes were associated with FIT-dependent gene expression, more specifically with biosynthesis of Fe-chelating compounds. Thus, IDE1 seems to be conserved in grass and nongrass species. Our pCREs matched, among others, in vitro binding sites of B3, NAC, bZIP, and TCP TFs, which might be regulators of -Fe responses. Altogether, our findings provide a comprehensive source of cis-regulatory information for -Fe-responsive genes that advances our mechanistic understanding and informs future efforts in engineering plants with more efficient Fe uptake or transport systems.
Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Plant Roots/metabolism , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Plant Gene Expression Regulation , Plant Roots/genetics , Nucleic Acid Regulatory Sequences/genetics
ABSTRACT
Multicellular organisms have diverse cell types with distinct roles in development and responses to the environment. At the transcriptional level, the differences in the environmental response between cell types are due to differences in regulatory programs. In plants, although cell-type environmental responses have been examined, it is unclear how these responses are regulated. Here, we identify a set of putative cis-regulatory elements (pCREs) enriched in the promoters of genes responsive to high-salinity stress in six Arabidopsis (Arabidopsis thaliana) root cell types. We then use these pCREs to establish cis-regulatory codes (i.e., machine learning models that predict, for each cell type, whether a gene is responsive to high salinity). These pCRE-based models outperform models using in vitro binding data of 758 Arabidopsis transcription factors. Surprisingly, organ pCREs identified based on the whole-root high-salinity response can predict cell-type responses as well as pCREs derived from cell-type data, because organ and cell-type pCREs predict complementary subsets of high-salinity response genes. Our findings not only advance our understanding of the regulatory mechanisms of the plant spatial transcriptional response through cis-regulatory codes but also suggest broad applicability of the approach to any species, particularly those with little or no trans-regulatory data.
Subject(s)
Plant Cells/metabolism , Nucleic Acid Regulatory Sequences/genetics , Salinity , Base Sequence , Plant Gene Expression Regulation , Machine Learning , Organ Specificity/genetics , Plant Roots/genetics , Protein Binding , Transcription Factors/metabolism , Genetic Transcription , Up-Regulation/genetics
ABSTRACT
BACKGROUND: Transcription factors (TFs) play a key role in regulating plant development and response to environmental stimuli. While most genes revert to single copy after a whole-genome duplication (WGD) event, transcription factors are retained at a significantly higher rate. Little is known about how TF duplicates have diverged in their expression and regulation, the answer to which may contribute to a better understanding of the elevated retention rate among TFs. RESULTS: Here we assessed what features may explain differences in the retention of TF duplicates and other genes using Arabidopsis thaliana as a model. We integrated 34 expression, sequence, and conservation features to build a linear model for predicting the extent of duplicate retention following WGD events among TFs and 19 groups of genes with other functions. We found that TFs were the least well predicted, demonstrating that the features of TFs deviate substantially from those of duplicate genes in other functional groups. Consistent with this, the evolution of TF expression patterns and cis-regulatory sites favors the partitioning of ancestral states among the resulting duplicates: one "ancestral" TF duplicate retains most ancestral expression and cis-regulatory sites, while the "non-ancestral" duplicate is enriched for novel regulatory sites. By modeling the retention of ancestral expression and cis-regulatory states in duplicate pairs using a system of differential equations, we found that TF duplicate pairs in a partitioned state are preferentially maintained. CONCLUSIONS: These TF duplicates with asymmetrically partitioned ancestral states are likely maintained because one copy retains ancestral functions while the other, at least in some cases, acquires novel cis-regulatory sites that may be important for novel, adaptive traits.
Subject(s)
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Gene Duplication , Plant Genome , Transcription Factors/genetics , Molecular Evolution , Plant Gene Expression Regulation , Duplicate Genes , Linear Models , Odds Ratio
ABSTRACT
Plants are exposed to a variety of environmental conditions, and their ability to respond to environmental variation depends on the proper regulation of gene expression in an organ-, tissue-, and cell type-specific manner. Although our knowledge of how stress responses are regulated is accumulating, a genome-wide model of how plant transcription factors (TFs) and cis-regulatory elements control spatially specific stress response has yet to emerge. Using Arabidopsis (Arabidopsis thaliana) as a model, we identified a set of 1,894 putative cis-regulatory elements (pCREs) that are associated with high-salinity (salt) up-regulated genes in the root or the shoot. We used these pCREs to develop computational models that can better predict salt up-regulated genes in the root and shoot compared with models based on known TF binding motifs. In addition, we incorporated TF binding sites identified via large-scale in vitro assays, chromatin accessibility, evolutionary conservation, and pCRE combinatorial relationships in machine learning models and found that only consideration of pCRE combinations led to better performance in salt up-regulation prediction in the root and shoot. Our results suggest that the plant organ transcriptional response to high salinity is regulated by a core set of pCREs and provide a genome-wide view of the cis-regulatory code of plant spatial transcriptional responses to environmental stress.
Subject(s)
Arabidopsis/genetics , Plant Gene Expression Regulation , Genetic Models , Salinity , Arabidopsis/metabolism , Arabidopsis Proteins/metabolism , Base Sequence , Binding Sites/genetics , Computer Simulation , Gene Regulatory Networks , Plant Genome/genetics , Plant Roots/genetics , Plant Roots/metabolism , Plant Shoots/genetics , Plant Shoots/metabolism , Protein Binding , Transcriptional Regulatory Elements/genetics , Physiological Stress , Transcription Factors/metabolism
ABSTRACT
Common genetic variants confer substantial risk for chronic lung diseases, including pulmonary fibrosis. Defining the genetic control of gene expression in a cell-type-specific and context-dependent manner is critical for understanding the mechanisms through which genetic variation influences complex traits and disease pathobiology. To this end, we performed single-cell RNA sequencing of lung tissue from 66 individuals with pulmonary fibrosis and 48 unaffected donors. Using a pseudobulk approach, we mapped expression quantitative trait loci (eQTLs) across 38 cell types, observing both shared and cell-type-specific regulatory effects. Furthermore, we identified disease interaction eQTLs and demonstrated that this class of associations is more likely to be cell-type-specific and linked to cellular dysregulation in pulmonary fibrosis. Finally, we connected lung disease risk variants to their regulatory targets in disease-relevant cell types. These results indicate that cellular context determines the impact of genetic variation on gene expression and implicate context-specific eQTLs as key regulators of lung homeostasis and disease.
Subject(s)
Pulmonary Fibrosis , Quantitative Trait Loci , Humans , Quantitative Trait Loci/genetics , Pulmonary Fibrosis/genetics , Gene Expression Regulation/genetics , Lung , Multifactorial Inheritance , Genome-Wide Association Study/methods , Single Nucleotide Polymorphism
ABSTRACT
Common genetic variants confer substantial risk for chronic lung diseases, including pulmonary fibrosis (PF). Defining the genetic control of gene expression in a cell-type-specific and context-dependent manner is critical for understanding the mechanisms through which genetic variation influences complex traits and disease pathobiology. To this end, we performed single-cell RNA sequencing of lung tissue from 67 PF and 49 unaffected donors. Employing a pseudobulk approach, we mapped expression quantitative trait loci (eQTL) across 38 cell types, observing both shared and cell-type-specific regulatory effects. Further, we identified disease-interaction eQTL and demonstrated that this class of associations is more likely to be cell-type-specific and linked to cellular dysregulation in PF. Finally, we connected PF risk variants to their regulatory targets in disease-relevant cell types. These results indicate that cellular context determines the impact of genetic variation on gene expression and implicate context-specific eQTL as key regulators of lung homeostasis and disease.
ABSTRACT
Population-scale single-cell RNA sequencing (scRNA-seq) is now viable, enabling finer-resolution functional genomics studies and leading to a rush to adapt bulk methods and develop new single-cell-specific methods to perform these studies. Simulations are useful for developing, testing, and benchmarking methods, but current scRNA-seq simulation frameworks do not simulate population-scale data with genetic effects. Here, we present splatPop, a model for flexible, reproducible, and well-documented simulation of population-scale scRNA-seq data with known expression quantitative trait loci. splatPop can also simulate complex batch, cell group, and conditional effects between individuals from different cohorts as well as genetically driven co-expression.
Subject(s)
RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Benchmarking , Cluster Analysis , Computer Simulation , Gene Expression Profiling/methods , Genomics , Humans , Quantitative Trait Loci , Software
ABSTRACT
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) has enabled the unbiased, high-throughput quantification of gene expression specific to cell types and states. With the cost of scRNA-seq decreasing and techniques for sample multiplexing improving, population-scale scRNA-seq, and thus single-cell expression quantitative trait locus (sc-eQTL) mapping, is increasingly feasible. Mapping of sc-eQTL provides additional resolution to study the regulatory role of common genetic variants on gene expression across a plethora of cell types and states and promises to improve our understanding of genetic regulation across tissues in both health and disease. RESULTS: While previously established methods for bulk eQTL mapping can, in principle, be applied to sc-eQTL mapping, there are a number of open questions about how best to process scRNA-seq data and adapt bulk methods to optimize sc-eQTL mapping. Here, we evaluate the role of different normalization and aggregation strategies, covariate adjustment techniques, and multiple testing correction methods to establish best practice guidelines. We use both real and simulated datasets across single-cell technologies to systematically assess the impact of these different statistical approaches. CONCLUSION: We provide recommendations for future single-cell eQTL studies that can yield up to twice as many eQTL discoveries as default approaches ported from bulk studies.
Subject(s)
Chromosome Mapping/statistics & numerical data , Human Genome , Induced Pluripotent Stem Cells/metabolism , Quantitative Trait Loci , Single-Cell Analysis/methods , Alleles , Cell Line , Gene Expression Profiling , Gene Expression Regulation , Humans , Induced Pluripotent Stem Cells/cytology , RNA Sequence Analysis , Software , Exome Sequencing
ABSTRACT
Plant cells constantly alter their gene expression profiles to respond to environmental fluctuations. These continuous adjustments are regulated by multi-hierarchical networks of transcription factors. To understand how such gene regulatory networks (GRNs) have stabilized evolutionarily while allowing for species-specific responses, we compare the GRNs underlying salt response in the early-diverging and late-diverging plants Marchantia polymorpha and Arabidopsis thaliana. Salt-responsive GRNs, constructed on the basis of the temporal transcriptional patterns in the two species, share common trans-regulators but exhibit an evolutionary divergence in cis-regulatory sequences and in the overall network sizes. In both species, WRKY-family transcription factors and their feedback loops serve as central nodes in salt-responsive GRNs. The divergent cis-regulatory sequences of WRKY-target genes are probably associated with the expansion in network size, linking salt stress to tissue-specific developmental and physiological responses. The WRKY modules and highly linked WRKY feedback loops have been preserved widely in other plants, including rice, while keeping their binding-motif sequences mutable. Together, the conserved trans-regulators and the quickly evolving cis-regulatory sequences allow salt-responsive GRNs to adapt over a long evolutionary timescale while maintaining some consistent regulatory structure. This strategy may benefit plants as they adapt to changing environments.
Subject(s)
Arabidopsis/genetics , Gene Regulatory Networks , Marchantia/genetics , Plant Proteins/genetics , Salt Stress/genetics , Physiological Adaptation , Arabidopsis Proteins/genetics , Biological Evolution , Plant Gene Expression Regulation , Mutation , Oryza/genetics , Phylogeny , Transcription Factors/genetics
ABSTRACT
Plants respond to their environment by dynamically modulating gene expression. A powerful approach for understanding how these responses are regulated is to integrate information about cis-regulatory elements (CREs) into models called cis-regulatory codes. Transcriptional response to combined stress is typically not the sum of the responses to the individual stresses. However, cis-regulatory codes underlying combined stress response have not been established. Here we modeled transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana. We grouped genes by their pattern of response (independent, antagonistic and synergistic) and trained machine learning models to predict their response using putative CREs (pCREs) as features (median F-measure = 0.64). We then developed a deep learning approach to integrate additional omics information (sequence conservation, chromatin accessibility and histone modification) into our models, improving performance by 6.2%. While pCREs important for predicting independent and antagonistic responses tended to resemble binding motifs of transcription factors associated with heat and/or drought stress, important synergistic pCREs resembled binding motifs of transcription factors not known to be associated with stress. These findings demonstrate how in silico approaches can improve our understanding of the complex codes regulating response to combined stress and help us identify prime targets for future characterization.
ABSTRACT
The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
Subject(s)
Genomics/methods , Machine Learning , Plants/genetics , Benchmarking , Genotype , Computer Neural Networks , Phenotype
ABSTRACT
Extensive transcriptional activity occurring in intergenic regions of genomes has raised the question of whether intergenic transcription represents the activity of novel genes or noisy expression. To address this, we evaluated cross-species and post-duplication sequence and expression conservation of intergenic transcribed regions (ITRs) in four Poaceae species. Among 43,301 ITRs across the four species, 34,460 (80%) are species-specific. ITRs found across species tend to be more divergent in expression and have more recent duplicates compared to annotated genes. To assess whether ITRs are functional (under selection), machine learning models were established in Oryza sativa (rice) that could accurately distinguish between phenotype genes and pseudogenes (area under the receiver operating characteristic curve = 0.94). Based on the models, 584 (8%) and 4,391 (61%) rice ITRs are classified as likely functional and likely nonfunctional with high confidence, respectively. ITRs with conserved expression and ancient retained duplicates, features that were not part of the model, are frequently classified as likely functional, suggesting these characteristics could serve as pragmatic rules of thumb for identifying candidate sequences likely to be under selection. This study also provides a framework to identify novel genes using comparative transcriptomic data to improve genome annotation, which is fundamental for connecting genotype to phenotype in crop and model systems.
Subject(s)
Intergenic DNA , Plant Genes , Poaceae/genetics , Genetic Transcription , Biological Evolution , Plant Genome , Machine Learning , Genetic Models , Phenotype , Pseudogenes , Species Specificity
ABSTRACT
The origin of sea lamprey (Petromyzon marinus) in Lake Champlain has been heavily debated over the past decade. Given the lack of historical documentation, two competing hypotheses have emerged in the literature. First, it has been argued that the relatively recent population size increase and concomitant rise in wounding rates on prey populations are indicative of an invasive population that entered the lake through the Champlain Canal. Second, recent genetic evidence suggests a post-glacial colonization at the end of the Pleistocene, approximately 11,000 years ago. One limitation to resolving the origin of sea lamprey in Lake Champlain is a lack of historical and current measures of population size. In this study, the issue of population size was explicitly addressed using nuclear (nDNA) and mitochondrial DNA (mtDNA) markers to estimate historical demography with genetic models. Haplotype network analysis, mismatch analysis, and summary statistics based on mtDNA noncoding sequences for NCI (479 bp) and NCII (173 bp) all indicate a recent population expansion. Coalescent models based on mtDNA and nDNA identified two potential demographic events: a population decline followed by a very recent population expansion. The decline in effective population size may correlate with land-use and fishing pressure changes post-European settlement, while the recent expansion may be associated with the implementation of the salmonid stocking program in the 1970s. These results are most consistent with the hypothesis that sea lamprey are native to Lake Champlain; however, the credibility intervals around parameter estimates demonstrate that there is uncertainty regarding the magnitude and timing of past demographic events.