ABSTRACT
Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
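The idea behind a perturbation interval can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes ridge regression as the model family, the penalty values as the model perturbations, bootstrap resampling as the data perturbation, and a validation-error ranking as the predictability screen. The function names (`ridge_fit`, `perturbation_interval`) and all parameter values are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression coefficients (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def perturbation_interval(X, y, X_val, y_val, lams, n_boot=100,
                          keep_frac=0.5, level=0.9, seed=0):
    """Illustrative interval over data and model perturbations (assumed design)."""
    rng = np.random.default_rng(seed)
    # Predictability screen: keep the penalties with the lowest validation error.
    val_err = [np.mean((X_val @ ridge_fit(X, y, lam) - y_val) ** 2)
               for lam in lams]
    n_keep = max(1, int(np.ceil(keep_frac * len(lams))))
    kept = [lams[i] for i in np.argsort(val_err)[:n_keep]]
    # Data perturbation (bootstrap) crossed with model perturbation (kept penalties).
    n = X.shape[0]
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        for lam in kept:
            estimates.append(ridge_fit(X[idx], y[idx], lam))
    # Summarize the spread of each coefficient with percentile bounds.
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(np.asarray(estimates),
                                 [tail, 100 - tail], axis=0)
    return lower, upper, kept


# Toy usage on synthetic data (not from the paper).
rng = np.random.default_rng(1)
beta = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
X_all = rng.normal(size=(300, 5))
y_all = X_all @ beta + rng.normal(scale=0.5, size=300)
lower, upper, kept = perturbation_interval(
    X_all[:200], y_all[:200], X_all[200:], y_all[200:],
    lams=[0.1, 1.0, 10.0, 100.0])
print(np.round(lower, 2), np.round(upper, 2), kept)
```

In this sketch the interval width reflects both resampling variability and the analyst's choice of penalty, which is the kind of judgment call the stability principle asks one to examine.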
ABSTRACT
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
ABSTRACT
Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
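The stability criterion mentioned above (an interaction is kept if it is returned in more than half of bootstrap replicates) can be sketched as follows. This is only an illustration of the counting step, not the iRF algorithm itself. The function name `stable_interactions` and the toy replicate contents are hypothetical. Only the factor names come from the abstract.

```python
from collections import Counter


def stable_interactions(replicate_interactions, threshold=0.5):
    """Return interactions recovered in more than `threshold` of replicates.

    `replicate_interactions` holds one collection of interactions per
    bootstrap replicate; each interaction is an iterable of feature names.
    """
    n_reps = len(replicate_interactions)
    counts = Counter()
    for interactions in replicate_interactions:
        # Feature order is irrelevant, and an interaction counts at most
        # once per replicate, so deduplicate with a set of frozensets.
        counts.update({frozenset(inter) for inter in interactions})
    scores = {inter: c / n_reps for inter, c in counts.items()}
    return {inter: s for inter, s in scores.items() if s > threshold}


# Toy replicates (made up for illustration, not results from the paper).
replicates = [
    [("Zld", "Gt"), ("Zld", "Gt", "Twi")],
    [("Gt", "Zld")],
    [("Zld", "Gt", "Twi"), ("Gt", "Twi")],
    [("Zld", "Gt"), ("Twi", "Gt", "Zld")],
]
stable = stable_interactions(replicates)
for inter, score in stable.items():
    print(sorted(inter), score)
```

Representing each interaction as a `frozenset` makes pairwise and higher-order interactions comparable with the same code path, whatever their order.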
Subject(s)
Drosophila/genetics, Genetic Models, Algorithms, Alternative Splicing, Animals, Computational Biology, Developmental Gene Expression Regulation, Gene Regulatory Networks, Genome-Wide Association Study
ABSTRACT
Amyotrophic lateral sclerosis (ALS) is a rapidly progressing, highly heterogeneous neurodegenerative disease, underscoring the importance of obtaining information to personalize clinical decisions quickly after diagnosis. Here, we investigated whether ALS-relevant signatures can be detected directly from biopsied patient fibroblasts. We profiled familial ALS (fALS) fibroblasts, representing a range of mutations in the fused in sarcoma (FUS) gene and ages of onset. To differentiate FUS fALS and healthy control fibroblasts, machine-learning classifiers were trained separately on high-content imaging and transcriptional profiles. "Molecular ALS phenotype" scores, derived from these classifiers, captured a spectrum from disease to health. Interestingly, these scores negatively correlated with age of onset, identified several pre-symptomatic individuals and sporadic ALS (sALS) patients with FUS-like fibroblasts, and quantified "movement" of FUS fALS and "FUS-like" sALS toward health upon FUS ASO treatment. Taken together, these findings provide evidence that non-neuronal patient fibroblasts can be used for rapid, personalized assessment in ALS.
Subject(s)
Amyotrophic Lateral Sclerosis, Fibroblasts, RNA-Binding Protein FUS, Humans, Amyotrophic Lateral Sclerosis/genetics, Amyotrophic Lateral Sclerosis/metabolism, Amyotrophic Lateral Sclerosis/pathology, Fibroblasts/metabolism, Fibroblasts/pathology, RNA-Binding Protein FUS/metabolism, RNA-Binding Protein FUS/genetics, Mutation/genetics, Male, Female, Skin/pathology, Skin/metabolism, Machine Learning, Middle Aged
ABSTRACT
Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.
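The last step, quantifying improvement in prediction accuracy by bootstrap sampling on the test set, can be illustrated with a generic sketch. It is not the stabilized likelihood ratio test or the PCS epistasis p-value of the paper. It assumes per-sample test losses are already available for a no-interaction (null) model and an interaction (alternative) model. The function name `bootstrap_improvement_pvalue` is hypothetical.

```python
import numpy as np


def bootstrap_improvement_pvalue(loss_null, loss_alt, n_boot=2000, seed=0):
    """Fraction of bootstrap test sets on which the alternative model
    does not improve mean loss over the null model (assumed definition)."""
    rng = np.random.default_rng(seed)
    # Positive entries mark test samples the interaction model predicts better.
    diff = np.asarray(loss_null, dtype=float) - np.asarray(loss_alt, dtype=float)
    n = diff.size
    # Resample test-set indices with replacement, one row per replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diff[idx].mean(axis=1)
    return float(np.mean(boot_means <= 0))


# Toy per-sample losses (made up for illustration).
p_better = bootstrap_improvement_pvalue([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
p_same = bootstrap_improvement_pvalue([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(p_better, p_same)
```

A small value indicates that the improvement persists across resampled test sets. A value near one indicates that the interaction adds nothing on held-out data.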
Subject(s)
Genetic Epistasis, Genome-Wide Association Study, Humans, Genome-Wide Association Study/methods, Phenotype, Multifactorial Inheritance/genetics, Logistic Models, Single Nucleotide Polymorphism
ABSTRACT
High-content microscopy offers a scalable approach to screen against multiple targets in a single pass. Prior work has focused on methods to select "optimal" cellular readouts in microscopy screens. However, methods to select optimal cell line models have garnered much less attention. Here, we provide a roadmap for how to select the cell line or lines that are best suited to identify bioactive compounds and their mechanism of action (MOA). We test our approach on compounds targeting cancer-relevant pathways, ranking cell lines in two tasks: detecting compound activity ("phenoactivity") and grouping compounds with similar MOA by similar phenotype ("phenosimilarity"). Evaluating six cell lines across 3214 well-annotated compounds, we show that optimal cell line selection depends on both the task of interest (e.g., detecting phenoactivity vs. inferring phenosimilarity) and the distribution of MOAs within the compound library. Given a task of interest and a set of compounds, we provide a systematic framework for choosing optimal cell line(s). Our framework can be used to reduce the number of cell lines required to identify hits within a compound library and help accelerate the pace of early drug discovery.
Subject(s)
Drug Discovery, Cell Line, Phenotype, Drug Discovery/methods
ABSTRACT
Parkinson's disease-causing leucine-rich repeat kinase 2 (LRRK2) mutations lead to varying degrees of Rab GTPase hyperphosphorylation. Puzzlingly, LRRK2 GTPase-inactivating mutations, which do not affect intrinsic kinase activity, lead to higher levels of cellular Rab phosphorylation than kinase-activating mutations. Here, we investigate whether mutation-dependent differences in LRRK2 cellular localization could explain this discrepancy. We discover that blocking endosomal maturation leads to the rapid formation of mutant LRRK2+ endosomes on which LRRK2 phosphorylates substrate Rabs. LRRK2+ endosomes are maintained through positive feedback, which mutually reinforces membrane localization of LRRK2 and phosphorylated Rab substrates. Furthermore, across a panel of mutants, cells expressing GTPase-inactivating mutants form strikingly more LRRK2+ endosomes than cells expressing kinase-activating mutants, resulting in higher total cellular levels of phosphorylated Rabs. Our study suggests that the increased probability that LRRK2 GTPase-inactivating mutants are retained on intracellular membranes compared to kinase-activating mutants leads to higher substrate phosphorylation.
Subject(s)
Protein Serine-Threonine Kinases, rab GTP-Binding Proteins, Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/genetics, Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/metabolism, Protein Serine-Threonine Kinases/genetics, Protein Serine-Threonine Kinases/metabolism, Phosphorylation, Mutation/genetics, rab GTP-Binding Proteins/genetics, rab GTP-Binding Proteins/metabolism
ABSTRACT
The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci not prioritized by univariate genome-wide association analysis are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we show that these interactions are preserved at the level of the cardiac transcriptome. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.
ABSTRACT
Increasing occurrence of harmful algal blooms across the land-water interface poses significant risks to coastal ecosystem structure and human health. Defining significant drivers and their interactive impacts on blooms allows for more effective analysis and identification of specific conditions supporting phytoplankton growth. A novel iterative Random Forests (iRF) machine-learning model was developed and applied to two example cases along the California coast to identify key stable interactions: (1) phytoplankton abundance in response to various drivers due to coastal conditions and land-sea nutrient fluxes, and (2) microbial community structure during algal blooms. In Example 1, watershed-derived nutrients were identified as the least significant interacting variable associated with Monterey Bay phytoplankton abundance. In Example 2, through iRF analysis of field-based 16S OTU bacterial community and algae datasets, we independently found stable interactions of prokaryote abundance patterns associated with phytoplankton abundance that have been previously identified in laboratory-based studies. Our study represents the first iRF application to marine algal blooms that helps to identify ocean, microbial, and terrestrial conditions that are considered dominant causal factors in bloom dynamics.