Search | BVS Regional Portal

Guiding questions to avoid data leakage in biological machine learning applications.

Bernett, Judith; Blumenthal, David B; Grimm, Dominik G; Haselbeck, Florian; Joeres, Roman; Kalinina, Olga V; List, Markus.

Nat Methods ; 21(8): 1444-1453, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39122953

ABSTRACT

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
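One common form of the leakage described above arises when dependent samples (for example, measurements from the same patient or proteins from the same family) end up on both sides of a random split. The sketch below is not taken from the paper and does not reproduce its seven questions; it only illustrates, under the assumption that each sample carries a known group label, how whole groups can be assigned to one side. The names `group_split` and `shared_groups` are hypothetical.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.25, seed=0):
    """Split sample indices so that no group label occurs in both sets.

    `groups[i]` is the (assumed known) group label of sample i.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    # Shuffle the group labels, not the individual samples.
    labels = sorted(by_group)
    random.Random(seed).shuffle(labels)

    target = test_fraction * len(groups)
    train, test = [], []
    for label in labels:
        # Fill the test set group by group until the target size is reached.
        if len(test) < target:
            test.extend(by_group[label])
        else:
            train.extend(by_group[label])
    return sorted(train), sorted(test)


def shared_groups(groups, train, test):
    """Return the group labels present in both train and test (ideally empty)."""
    return {groups[i] for i in train} & {groups[i] for i in test}


if __name__ == "__main__":
    patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]
    train_idx, test_idx = group_split(patient_ids)
    print("train:", train_idx, "test:", test_idx)
    print("groups on both sides:", shared_groups(patient_ids, train_idx, test_idx))
```

Because whole groups are moved, the realized test fraction can overshoot the requested one; that trade-off is usually acceptable when the alternative is an optimistic performance estimate.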

Subject(s)

Machine Learning; Humans; Computational Biology/methods; Algorithms

Cracking the black box of deep sequence-based protein-protein interaction prediction.

Bernett, Judith; Blumenthal, David B; List, Markus.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38446741

ABSTRACT

Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the 'dark' protein interactome and better computational methods are needed.
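To make the idea of a topology-only baseline concrete, the following sketch scores a candidate protein pair by the product of the two proteins' degrees in the training network. This is an assumption chosen for illustration (a preferential-attachment style score), not the baseline models evaluated in the paper; `degree_scores` and the toy protein names are hypothetical.

```python
from collections import Counter


def degree_scores(train_edges, candidate_pairs):
    """Score each candidate pair by the product of training-set node degrees.

    Proteins that never occur in the training edges have degree 0, so any
    pair involving them gets a score of 0.
    """
    degree = Counter()
    for a, b in train_edges:
        degree[a] += 1
        degree[b] += 1
    return {(a, b): degree[a] * degree[b] for a, b in candidate_pairs}


if __name__ == "__main__":
    # Toy training network: A is a hub, D is peripheral.
    train = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
    candidates = [("A", "E"), ("B", "D"), ("C", "D")]
    for pair, score in degree_scores(train, candidates).items():
        print(pair, score)
```

A score like this uses no sequence information at all, which is why comparing a deep model against it can reveal how much of the reported performance comes from hub proteins appearing in both training and test sets.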

Subject(s)

Machine Learning

Robust disease module mining via enumeration of diverse prize-collecting Steiner trees.

Bernett, Judith; Krupke, Dominik; Sadegh, Sepideh; Baumbach, Jan; Fekete, Sándor P; Kacprowski, Tim; List, Markus; Blumenthal, David B.

Bioinformatics ; 38(6): 1600-1606, 2022 03 04.

Article in English | MEDLINE | ID: mdl-34984440

ABSTRACT

MOTIVATION: Disease module mining methods (DMMMs) extract subgraphs that constitute candidate disease mechanisms from molecular interaction networks such as protein-protein interaction (PPI) networks. Irrespective of the employed models, DMMMs typically include non-robust steps in their workflows, i.e. the computed subnetworks vary when running the DMMMs multiple times on equivalent input. This lack of robustness has a negative effect on the trustworthiness of the obtained subnetworks and is hence detrimental for the widespread adoption of DMMMs in the biomedical sciences.
RESULTS: To overcome this problem, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). In a large-scale empirical evaluation, we show that ROBUST outperforms competing methods in terms of robustness, scalability and, in most settings, functional relevance of the produced modules, measured via KEGG (Kyoto Encyclopedia of Genes and Genomes) gene set enrichment scores and overlap with DisGeNET disease genes.
AVAILABILITY AND IMPLEMENTATION: A Python 3 implementation and scripts to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust, https://github.com/bionetslab/robust-eval.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
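The abstract does not state how robustness is quantified. One plausible way to check whether repeated runs on equivalent input return the same module is the mean pairwise Jaccard index of the returned node sets, sketched below. The metric choice, the function names and the toy gene sets are assumptions for illustration and are not taken from the ROBUST repository.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two node sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(modules):
    """Average Jaccard index over all pairs of modules from repeated runs."""
    pairs = list(combinations(modules, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical node sets returned by three runs of the same method.
    runs = [
        {"TP53", "BRCA1", "EGFR"},
        {"TP53", "BRCA1", "MYC"},
        {"TP53", "BRCA1", "EGFR"},
    ]
    print(mean_pairwise_jaccard(runs))
```

A value of 1.0 means every run returned the same node set; lower values indicate that the output depends on run-to-run variation such as input ordering or random seeds.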

Subject(s)

Algorithms; Trees; Computational Biology/methods; Protein Interaction Maps

A systematic comparison of novel and existing differential analysis methods for CyTOF data.

Arend, Lis; Bernett, Judith; Manz, Quirin; Klug, Melissa; Lazareva, Olga; Baumbach, Jan; Bongiovanni, Dario; List, Markus.

Brief Bioinform ; 23(1), 2022 01 17.

Article in English | MEDLINE | ID: mdl-34850807

ABSTRACT

Cytometry techniques are widely used to discover cellular characteristics at single-cell resolution. Many data analysis methods for cytometry data focus solely on identifying subpopulations via clustering and testing for differential cell abundance. For differential expression analysis of markers between conditions, only few tools exist. These tools either reduce the data distribution to medians, discarding valuable information, or have underlying assumptions that may not hold for all expression patterns. Here, we systematically evaluated existing and novel approaches for differential expression analysis on real and simulated CyTOF data. We found that methods using median marker expressions compute fast and reliable results when the data are not strongly zero-inflated. Methods using all data detect changes in strongly zero-inflated markers, but partially suffer from overprediction or cannot handle big datasets. We present a new method, CyEMD, based on calculating the earth mover's distance between expression distributions that can handle strong zero-inflation without being too sensitive. Additionally, we developed CYANUS - CYtometry ANalysis Using Shiny - a user-friendly R Shiny App allowing the user to analyze cytometry data with state-of-the-art tools, including well-performing methods from our comparison. A public web interface is available at https://exbio.wzw.tum.de/cyanus/.
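For one-dimensional data such as a single marker's expression values, the earth mover's distance equals the area between the two empirical cumulative distribution functions. The sketch below computes that quantity directly on raw values. It is only the plain distance under this assumption; any binning, normalization or significance testing that CyEMD applies is not described in the abstract and is not reproduced here. The function name `emd_1d` and the toy values are hypothetical.

```python
def emd_1d(x, y):
    """Earth mover's distance between two 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical cumulative distribution functions of the samples.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))

    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Count how many values of each sample are <= left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


if __name__ == "__main__":
    # A zero-inflated marker: the medians of the two conditions are both 0,
    # yet the distributions clearly differ.
    condition_a = [0, 0, 0, 0, 0, 0, 0, 1]
    condition_b = [0, 0, 0, 0, 0, 3, 4, 5]
    print(emd_1d(condition_a, condition_b))
```

The toy example reflects the point made in the abstract: a median-based comparison sees no difference between the two conditions, whereas a distance over the full distributions does.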

Subject(s)

Cluster Analysis; Biomarkers

Network medicine for disease module identification and drug repurposing with the NeDRex platform.

Sadegh, Sepideh; Skelton, James; Anastasi, Elisa; Bernett, Judith; Blumenthal, David B; Galindez, Gihanna; Salgado-Albarrán, Marisol; Lazareva, Olga; Flanagan, Keith; Cockell, Simon; Nogales, Cristian; Casas, Ana I; Schmidt, Harald H H W; Baumbach, Jan; Wipat, Anil; Kacprowski, Tim.

Nat Commun ; 12(1): 6848, 2021 11 25.

Article in English | MEDLINE | ID: mdl-34824199

ABSTRACT

Traditional drug discovery faces a severe efficacy crisis. Repurposing of registered drugs provides an alternative with lower costs and faster drug development timelines. However, the data necessary for the identification of disease modules, i.e. pathways and sub-networks describing the mechanisms of complex diseases which contain potential drug targets, are scattered across independent databases. Moreover, existing studies are limited to predictions for specific diseases or non-translational algorithmic approaches. There is an unmet need for adaptable tools allowing biomedical researchers to employ network-based drug repurposing approaches for their individual use cases. We close this gap with NeDRex, an integrative and interactive platform for network-based drug repurposing and disease module discovery. NeDRex integrates ten different data sources covering genes, drugs, drug targets, disease annotations, and their relationships. NeDRex allows for constructing heterogeneous biological networks, mining them for disease modules, prioritizing drugs targeting disease mechanisms, and statistical validation. We demonstrate the utility of NeDRex in five specific use-cases.

Subject(s)

Databases, Factual; Drug Repositioning/methods; Algorithms; Computational Biology; Disease/classification; Disease/genetics; Humans; Knowledge Bases; Workflow

