ABSTRACT
Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
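One common way to avoid the kind of leakage described above is to split data by dependency group rather than by individual sample. The following is a minimal sketch under stated assumptions, not the procedure from the paper: the function name `group_aware_split` and the toy patient grouping are hypothetical, and what counts as a "group" (patient, protein family, sequence-similarity cluster) depends on the dataset.

```python
import random
from collections import defaultdict


def group_aware_split(sample_ids, group_of, test_fraction=0.2, seed=0):
    """Split samples so that no group contributes to both training and test data.

    `group_of` maps each sample ID to a (hypothetical) dependency group,
    e.g. a patient, a protein family, or a sequence-similarity cluster.
    """
    groups = defaultdict(list)
    for sample in sample_ids:
        groups[group_of[sample]].append(sample)

    # Shuffle whole groups, not individual samples.
    group_keys = sorted(groups)
    random.Random(seed).shuffle(group_keys)

    target = test_fraction * len(sample_ids)
    train, test = [], []
    for key in group_keys:
        # Fill the test set group by group until it reaches the target size.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


# Toy data (assumed): six measurements from three patients, two per patient.
group_of = {"s1": "p1", "s2": "p1", "s3": "p2", "s4": "p2", "s5": "p3", "s6": "p3"}
train, test = group_aware_split(list(group_of), group_of, test_fraction=0.33)
shared = {group_of[s] for s in train} & {group_of[s] for s in test}
print(f"train={train} test={test} shared groups={shared}")
```

With a purely random split of the six samples, two measurements of the same patient could end up on both sides of the split, which is exactly the kind of dependency that inflates performance estimates.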
Subjects
Machine Learning; Humans; Computational Biology/methods; Algorithms
ABSTRACT
Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the 'dark' protein interactome and better computational methods are needed.
Subjects
Machine Learning
ABSTRACT
The EMT-transcription factor ZEB1 is heterogeneously expressed in tumor cells and in cancer-associated fibroblasts (CAFs) in colorectal cancer (CRC). While ZEB1 in tumor cells regulates metastasis and therapy resistance, its role in CAFs is largely unknown. Combining fibroblast-specific Zeb1 deletion with immunocompetent mouse models of CRC, we observe that inflammation-driven tumorigenesis is accelerated, whereas invasion and metastasis in sporadic cancers are reduced. Single-cell transcriptomics, histological characterization, and in vitro modeling reveal a crucial role of ZEB1 in CAF polarization, promoting myofibroblastic features by restricting inflammatory activation. Zeb1 deficiency impairs collagen deposition and CAF barrier function but increases NFκB-mediated cytokine production, jointly promoting lymphocyte recruitment and immune checkpoint activation. Strikingly, the Zeb1-deficient CAF repertoire sensitizes to immune checkpoint inhibition, offering a therapeutic opportunity of targeting ZEB1 in CAFs and its usage as a prognostic biomarker. Collectively, we demonstrate that ZEB1-dependent plasticity of CAFs suppresses anti-tumor immunity and promotes metastasis.
Subjects
Cancer-Associated Fibroblasts; Colorectal Neoplasms; Immunotherapy; Inflammation; Zinc Finger E-box-Binding Homeobox 1; Zinc Finger E-box-Binding Homeobox 1/metabolism; Zinc Finger E-box-Binding Homeobox 1/genetics; Colorectal Neoplasms/pathology; Colorectal Neoplasms/genetics; Colorectal Neoplasms/metabolism; Colorectal Neoplasms/therapy; Colorectal Neoplasms/immunology; Animals; Mice; Cancer-Associated Fibroblasts/metabolism; Cancer-Associated Fibroblasts/pathology; Humans; Inflammation/metabolism; Inflammation/genetics; Inflammation/pathology; Immunotherapy/methods; Gene Expression Regulation, Neoplastic; Fibroblasts/metabolism; Cell Line, Tumor; Immune Checkpoint Inhibitors/pharmacology; Immune Checkpoint Inhibitors/therapeutic use; Epithelial-Mesenchymal Transition/genetics
ABSTRACT
Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1-3). Existing statistical computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease; additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.
Subjects
Epistasis, Genetic; Polymorphism, Single Nucleotide; Humans; Quantum Theory; Multifactorial Inheritance/genetics; Disease/genetics; Computational Biology/methods; Algorithms; Genetic Predisposition to Disease
ABSTRACT
In recent decades, the development of new drugs has become increasingly expensive and inefficient, and the molecular mechanisms of most pharmaceuticals remain poorly understood. In response, computational systems and network medicine tools have emerged to identify potential drug repurposing candidates. However, these tools often require complex installation and lack intuitive visual network mining capabilities. To tackle these challenges, we introduce Drugst.One, a platform that assists specialized computational medicine tools in becoming user-friendly, web-based utilities for drug repurposing. With just three lines of code, Drugst.One turns any systems biology software into an interactive web tool for modeling and analyzing complex protein-drug-disease networks. Demonstrating its broad adaptability, Drugst.One has been successfully integrated with 21 computational systems medicine tools. Available at https://drugst.one, Drugst.One has significant potential for streamlining the drug discovery process, allowing researchers to focus on essential aspects of pharmaceutical treatment research.
Subjects
Drug Repositioning; Software; Drug Repositioning/methods; Humans; Internet; Drug Discovery/methods; Systems Biology/methods; Computational Biology/methods
ABSTRACT
Gene regulatory networks (GRNs) and gene co-expression networks (GCNs) allow genome-wide exploration of molecular regulation patterns in health and disease. The standard approach for obtaining GRNs and GCNs is to infer them from gene expression data, using computational network inference methods. However, since network inference methods are usually applied to aggregate data, distortion of the networks by demographic confounders might remain undetected, especially because gene expression patterns are known to vary between different demographic groups. In this paper, we present a computational framework to systematically evaluate the influence of demographic confounders on network inference from gene expression data. Our framework compares similarities between networks inferred for different demographic groups with similarity distributions obtained for random splits of the expression data. Moreover, it quantifies the extent to which demographic groups are represented by networks inferred from the aggregate data in a confounder-agnostic way. We apply our framework to test four widely used GRN and GCN inference methods for their robustness with respect to confounding by age, ethnicity and sex in cancer. Our findings, based on more than 44,000 inferred networks, indicate that age and sex confounders play an important role in network inference for certain cancer types, emphasizing the importance of incorporating an assessment of the effect of demographic confounders into network inference workflows. Our framework is available as a Python package on GitHub: https://github.com/bionetslab/grn-confounders.
Subjects
Gene Regulatory Networks; Neoplasms; Humans; Neoplasms/genetics; Demography; Algorithms
ABSTRACT
As the development of new drugs reaches its physical and financial limits, drug repurposing has become more important than ever. For mechanistically grounded drug repurposing, it is crucial to uncover the disease mechanisms and to detect clusters of mechanistically related diseases. Various methods for computing candidate disease mechanisms and disease clusters exist. However, in the absence of ground truth, in silico validation is challenging. This constitutes a major hurdle toward the adoption of in silico prediction tools by experimentalists who are often hesitant to carry out wet-lab validations for predicted candidate mechanisms without clearly quantified initial plausibility. To address this problem, we present DIGEST (in silico validation of disease and gene sets, clusterings or subnetworks), a Python-based validation tool available as a web interface (https://digest-validation.net), as a stand-alone package or over a REST API. DIGEST greatly facilitates in silico validation of gene and disease sets, clusterings or subnetworks via fully automated pipelines comprising disease and gene ID mapping, enrichment analysis, comparisons of shared genes and variants and background distribution estimation. Moreover, functionality is provided to automatically update the external databases used by the pipelines. DIGEST hence allows the user to assess the statistical significance of candidate mechanisms with regard to functional and genetic coherence and enables the computation of empirical P-values with just a few mouse clicks.
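The idea of an empirical P-value obtained from a background distribution can be illustrated with a short sketch. This is not DIGEST's implementation or API: the coherence score (mean pairwise Jaccard index of annotation sets), the function names, and the toy annotations are all assumptions made for illustration.

```python
import random
from itertools import combinations


def coherence(genes, annotations):
    """Mean pairwise Jaccard index of the genes' annotation sets (assumed score)."""
    pairs = list(combinations(genes, 2))
    total = 0.0
    for a, b in pairs:
        union = annotations[a] | annotations[b]
        total += len(annotations[a] & annotations[b]) / len(union) if union else 0.0
    return total / len(pairs)


def empirical_p_value(candidate, annotations, n_random=999, seed=0):
    """Fraction of random gene sets at least as coherent as the candidate set.

    The +1 in numerator and denominator counts the observed set itself, so the
    estimate can never be exactly zero.
    """
    rng = random.Random(seed)
    observed = coherence(candidate, annotations)
    universe = sorted(annotations)
    hits = 0
    for _ in range(n_random):
        background = rng.sample(universe, len(candidate))
        if coherence(background, annotations) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)


# Toy annotations (assumed): four genes share two terms, all others are unique.
annotations = {f"g{i}": {f"unique_term_{i}"} for i in range(20)}
for gene in ("g0", "g1", "g2", "g3"):
    annotations[gene] = {"shared_term_a", "shared_term_b"}

p_value = empirical_p_value(["g0", "g1", "g2", "g3"], annotations)
print(f"empirical P-value: {p_value:.3f}")
```

The smallest value this estimator can return is 1 / (n_random + 1), so the number of background samples bounds the resolution of the reported P-value.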
Subjects
Software; Cluster Analysis; Databases, Factual
ABSTRACT
SUMMARY: We present ROBUST-Web, which implements our recently presented ROBUST disease module mining algorithm in a user-friendly web application. ROBUST-Web features seamless downstream disease module exploration via integrated gene set enrichment analysis, tissue expression annotation, and visualization of drug-protein and disease-gene links. Moreover, ROBUST-Web includes bias-aware edge costs for the underlying Steiner tree model as a new algorithmic feature, which makes it possible to correct for study bias in protein-protein interaction networks and further improves the robustness of the computed modules. AVAILABILITY AND IMPLEMENTATION: Web application: https://robust-web.net. Source code of web application and Python package with new bias-aware edge costs: https://github.com/bionetslab/robust-web, https://github.com/bionetslab/robust_bias_aware.
Subjects
Algorithms; Software; Protein Interaction Maps
ABSTRACT
In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly, by projecting gene expression data onto generic protein-protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI network is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.
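Random networks with the same node degrees, as used for the comparison above, are commonly generated by repeated double edge swaps. The sketch below illustrates that general technique on a toy graph; it is an assumption-laden illustration, not the randomization code used in the study, and the function names are hypothetical.

```python
import random
from collections import Counter


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by double edge swaps.

    Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    so every node keeps its degree while the wiring changes.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        new_1, new_2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = new_1, new_2
        done += 1
    return edge_list


def degrees(edges):
    """Node degree counts of an undirected edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Toy network (assumed): a ring of eight nodes plus two chords.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
random_edges = degree_preserving_rewire(edges, n_swaps=20)
print("degrees preserved:", degrees(random_edges) == degrees(edges))
```

If a method performs equally well on such rewired networks, its results are driven by node degrees rather than by which specific proteins interact.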
Subjects
Algorithms; Gene Expression; Protein Interaction Mapping/methods; Protein Interaction Maps/genetics; Systems Biology/methods; Amyotrophic Lateral Sclerosis/genetics; Amyotrophic Lateral Sclerosis/metabolism; Carcinoma, Non-Small-Cell Lung/genetics; Carcinoma, Non-Small-Cell Lung/metabolism; Colitis, Ulcerative/genetics; Colitis, Ulcerative/metabolism; Crohn Disease/genetics; Crohn Disease/metabolism; Humans; Huntington Disease/genetics; Huntington Disease/metabolism; Lung Neoplasms/genetics; Lung Neoplasms/metabolism; Phenotype; Proteins/genetics; Proteins/metabolism
ABSTRACT
MOTIVATION: Disease module mining methods (DMMMs) extract subgraphs that constitute candidate disease mechanisms from molecular interaction networks such as protein-protein interaction (PPI) networks. Irrespective of the employed models, DMMMs typically include non-robust steps in their workflows, i.e. the computed subnetworks vary when running the DMMMs multiple times on equivalent input. This lack of robustness has a negative effect on the trustworthiness of the obtained subnetworks and is hence detrimental for the widespread adoption of DMMMs in the biomedical sciences. RESULTS: To overcome this problem, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). In a large-scale empirical evaluation, we show that ROBUST outperforms competing methods in terms of robustness, scalability and, in most settings, functional relevance of the produced modules, measured via KEGG (Kyoto Encyclopedia of Genes and Genomes) gene set enrichment scores and overlap with DisGeNET disease genes. AVAILABILITY AND IMPLEMENTATION: A Python 3 implementation and scripts to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust, https://github.com/bionetslab/robust-eval. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Algorithms; Trees; Computational Biology/methods; Protein Interaction Maps
ABSTRACT
MOTIVATION: Recently, various tools for detecting single nucleotide polymorphisms (SNPs) involved in epistasis have been developed. However, no studies evaluate the employed statistical epistasis models such as the χ²-test or quadratic regression independently of the tools that use them. Such an independent evaluation is crucial for developing improved epistasis detection tools, because it makes it possible to decide whether a tool's performance should be attributed to the epistasis model or to the optimization strategy run on top of it. RESULTS: We present a protocol for evaluating epistasis models independently of the tools they are used in and generalize existing models designed for dichotomous phenotypes to the categorical and quantitative case. In addition, we propose a new model which scores candidate SNP sets by computing maximum likelihood distributions for the observed phenotypes in the cells of their penetrance tables. Extensive experiments show that the proposed maximum likelihood model outperforms three widely used epistasis models in most cases. The experiments also provide valuable insights into the properties of existing models, for instance, that quadratic regression performs particularly well on instances with quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION: The evaluation protocol and all compared models are implemented in C++ and are supported under Linux and macOS. They are available at https://github.com/baumbachlab/genepiseeker/, along with test datasets and scripts to reproduce the experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
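To make the notion of a χ²-based epistasis model concrete, the sketch below computes a Pearson χ² statistic between the joint genotype of a SNP set and a categorical phenotype. It is a simplified Python illustration under stated assumptions, not the C++ implementation from the repository above; the function name and the toy XOR-style data are hypothetical.

```python
from collections import Counter


def chi_square_statistic(genotypes, phenotypes):
    """Pearson chi-square statistic of genotype vs. categorical phenotype.

    `genotypes` holds one hashable entry per individual, e.g. a tuple of
    minor-allele counts for a candidate SNP set. Only genotype combinations
    that are actually observed contribute rows, so no expected count is zero.
    """
    cells = Counter(zip(genotypes, phenotypes))
    row_totals = Counter(genotypes)
    col_totals = Counter(phenotypes)
    n = len(phenotypes)
    stat = 0.0
    for g in row_totals:
        for p in col_totals:
            expected = row_totals[g] * col_totals[p] / n
            stat += (cells[(g, p)] - expected) ** 2 / expected
    return stat


# Toy data (assumed): the phenotype is the XOR of two SNPs, five individuals
# per genotype combination, so neither SNP has an effect on its own.
pairs = [(a, b) for a in (0, 1) for b in (0, 1) for _ in range(5)]
phenotypes = [(a + b) % 2 for a, b in pairs]

joint = chi_square_statistic(pairs, phenotypes)
marginal = chi_square_statistic([a for a, _ in pairs], phenotypes)
print(f"joint statistic: {joint}, single-SNP statistic: {marginal}")
```

The joint statistic is large while the single-SNP statistic vanishes, which is the signature of a purely epistatic interaction that marginal tests would miss.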
Subjects
Epistasis, Genetic; Polymorphism, Single Nucleotide; Phenotype; Probability
ABSTRACT
MOTIVATION: Unsupervised learning approaches are frequently used to stratify patients into clinically relevant subgroups and to identify biomarkers such as disease-associated genes. However, clustering and biclustering techniques are oblivious to the functional relationship of genes and are thus not ideally suited to pinpoint molecular mechanisms along with patient subgroups. RESULTS: We developed the network-constrained biclustering approach Biclustering Constrained by Networks (BiCoN) which (i) restricts biclusters to functionally related genes connected in molecular interaction networks and (ii) maximizes the difference in gene expression between two subgroups of patients. This allows BiCoN to simultaneously pinpoint molecular mechanisms responsible for the patient grouping. Network-constrained clustering of genes makes BiCoN more robust to noise and batch effects than typical clustering and biclustering methods. BiCoN can faithfully reproduce known disease subtypes as well as novel, clinically relevant patient subgroups, as we demonstrate using breast and lung cancer datasets. In summary, BiCoN is a novel systems medicine tool that combines several heuristic optimization strategies for robust disease mechanism extraction. BiCoN is well-documented and freely available as a Python package or a web interface. AVAILABILITY AND IMPLEMENTATION: PyPI package: https://pypi.org/project/bicon. WEB INTERFACE: https://exbio.wzw.tum.de/bicon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
SUMMARY: Simulated data are crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited: they do not account for linkage disequilibrium (LD), support only limited interaction models of single nucleotide polymorphisms (SNPs) and only dichotomous phenotypes, or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns and generates both categorical and quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION: EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Epistasis, Genetic; Genome-Wide Association Study; Computer Simulation; Epigen; Polymorphism, Single Nucleotide; Software
ABSTRACT
Summary: Diseases can be caused by molecular perturbations that induce specific changes in regulatory interactions and their coordinated expression, also referred to as network rewiring. However, the detection of complex changes in regulatory connections remains a challenging task and would benefit from the development of novel nonparametric approaches. We develop a new ensemble method called BoostDiff (boosted differential regression trees) to infer a differential network discriminating between two conditions. BoostDiff builds an adaptively boosted (AdaBoost) ensemble of differential trees with respect to a target condition. To build the differential trees, we propose differential variance improvement as a novel splitting criterion. Variable importance measures derived from the resulting models are used to reflect changes in gene expression predictability and to build the output differential networks. BoostDiff outperforms existing differential network methods on simulated data evaluated in four different complexity settings. We then demonstrate the power of our approach when applied to real transcriptomics data in COVID-19, Crohn's disease, breast cancer, prostate adenocarcinoma, and stress response in Bacillus subtilis. BoostDiff identifies context-specific networks that are enriched with genes of known disease-relevant pathways and complements standard differential expression analyses. Availability and implementation: BoostDiff is available at https://github.com/scibiome/boostdiff_inference.
ABSTRACT
A long-term objective of network medicine is to replace our current, mainly phenotype-based disease definitions by subtypes of health conditions corresponding to distinct pathomechanisms. For this, molecular and health data are modeled as networks and are mined for pathomechanisms. However, many such studies rely on large-scale disease association data where diseases are annotated using the very phenotype-based disease definitions the network medicine field aims to overcome. This raises the question of to what extent the biases that mechanistically inadequate disease annotations introduce into disease association data distort the results of studies that use such data for pathomechanism mining. We address this question using global- and local-scale analyses of networks constructed from disease association data of various types. Our results indicate that large-scale disease association data should be used with care for pathomechanism mining and that analyses of such data should be accompanied by close-up analyses of molecular data for well-characterized patient cohorts.
ABSTRACT
Focal Cortical Dysplasia (FCD) is a frequent cause of drug-resistant focal epilepsy in children and young adults. The international FCD classifications of 2011 and 2022 have identified several clinico-pathological subtypes, either occurring isolated, i.e., FCD ILAE Type 1 or 2, or in association with a principal cortical lesion, i.e., FCD Type 3. Here, we addressed the DNA methylation signature of a previously described new subtype of FCD 3D occurring in the occipital lobe of very young children and microscopically defined by neuronal cell loss in cortical layer 4. We studied the DNA methylation profile using 850K BeadChip arrays in a retrospective cohort of 104 patients with FCD 1A, 2A, 2B, 3D, TLE without FCD, and 16 postmortem specimens without neurological disorders as controls, operated in China or Germany. DNA was extracted from formalin-fixed paraffin-embedded tissue blocks with microscopically confirmed lesions, and DNA methylation profiles were bioinformatically analyzed with a recently developed deep learning algorithm. Our results revealed a distinct position of FCD 3D in the DNA methylation map of common FCD subtypes, also different from non-FCD epilepsy surgery controls or non-epileptic postmortem controls. Within the FCD 3D cohort, the DNA methylation signature separated three histopathology subtypes, i.e., glial scarring around porencephalic cysts, loss of layer 4, and Rasmussen encephalitis. Differential methylation in FCD 3D with loss of layer 4 mapped explicitly to biological pathways related to neurodegeneration, biogenesis of the extracellular matrix (ECM) components, axon guidance, and regulation of the actin cytoskeleton. Our data suggest that DNA methylation signatures in cortical malformations are not only of diagnostic value but also phenotypically relevant, providing the molecular underpinnings of structural and histopathological features associated with epilepsy. Further studies will be necessary to confirm these results and clarify their functional relevance and epileptogenic potential in these difficult-to-treat children.
Subjects
Drug Resistant Epilepsy; Epilepsy; Focal Cortical Dysplasia; Malformations of Cortical Development; Child; Young Adult; Humans; Child, Preschool; Retrospective Studies; Malformations of Cortical Development/diagnostic imaging; Malformations of Cortical Development/genetics; DNA Methylation; Epilepsy/genetics; Drug Resistant Epilepsy/pathology; Magnetic Resonance Imaging
ABSTRACT
Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1-3). Existing statistical computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease; additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL is the first application that demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.
ABSTRACT
BACKGROUND: Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. OBJECTIVES: However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. METHOD: This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. CONCLUSION: As the most promising direction, we suggest combining federated machine learning, as a more scalable approach, with additional privacy-preserving techniques. This would make it possible to combine their advantages and provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.