Search | BVS Aleitamento Materno

Guiding questions to avoid data leakage in biological machine learning applications.

Bernett, Judith; Blumenthal, David B; Grimm, Dominik G; Haselbeck, Florian; Joeres, Roman; Kalinina, Olga V; List, Markus.

Nat Methods ; 21(8): 1444-1453, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39122953

ABSTRACT

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.

Subjects

Machine Learning, Humans, Computational Biology/methods, Algorithms

Cracking the black box of deep sequence-based protein-protein interaction prediction.

Bernett, Judith; Blumenthal, David B; List, Markus.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38446741

ABSTRACT

Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the 'dark' protein interactome and better computational methods are needed.
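
The leakage described in this abstract arises when the same proteins appear on both sides of a random split of interaction pairs. As a minimal sketch (not the authors' procedure, which additionally minimizes sequence similarity between the sets), the following splits pairs so that no protein is shared between training and test data. The function name `protein_disjoint_split`, its parameters, and the toy protein labels are hypothetical and chosen only for illustration.

```python
import random


def protein_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Split interaction pairs so that train and test share no protein.

    Assumption: proteins are hashable, sortable IDs. Pairs that connect a
    training protein with a test protein are dropped, because keeping them
    on either side would leak a protein across the split.
    """
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)

    # Reserve a fraction of the proteins (not of the pairs) for testing.
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])

    train, test, dropped = [], [], []
    for a, b in pairs:
        n_in_test = (a in test_proteins) + (b in test_proteins)
        if n_in_test == 0:
            train.append((a, b))
        elif n_in_test == 2:
            test.append((a, b))
        else:
            dropped.append((a, b))
    return train, test, dropped


if __name__ == "__main__":
    toy_pairs = [("A", "B"), ("B", "C"), ("C", "D"),
                 ("D", "E"), ("E", "A"), ("A", "C")]
    train, test, dropped = protein_disjoint_split(toy_pairs, test_fraction=0.4)
    print(len(train), len(test), len(dropped))
```

A protein-disjoint split removes identical proteins from the overlap but not homologous ones; a similarity-aware split would first cluster the sequences and then assign whole clusters to one side.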

Subjects

Machine Learning

A systematic comparison of novel and existing differential analysis methods for CyTOF data.

Arend, Lis; Bernett, Judith; Manz, Quirin; Klug, Melissa; Lazareva, Olga; Baumbach, Jan; Bongiovanni, Dario; List, Markus.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34850807

ABSTRACT

Cytometry techniques are widely used to discover cellular characteristics at single-cell resolution. Many data analysis methods for cytometry data focus solely on identifying subpopulations via clustering and testing for differential cell abundance. For differential expression analysis of markers between conditions, only few tools exist. These tools either reduce the data distribution to medians, discarding valuable information, or have underlying assumptions that may not hold for all expression patterns. Here, we systematically evaluated existing and novel approaches for differential expression analysis on real and simulated CyTOF data. We found that methods using median marker expressions compute fast and reliable results when the data are not strongly zero-inflated. Methods using all data detect changes in strongly zero-inflated markers, but partially suffer from overprediction or cannot handle big datasets. We present a new method, CyEMD, based on calculating the earth mover's distance between expression distributions that can handle strong zero-inflation without being too sensitive. Additionally, we developed CYANUS - CYtometry ANalysis Using Shiny - a user-friendly R Shiny App allowing the user to analyze cytometry data with state-of-the-art tools, including well-performing methods from our comparison. A public web interface is available at https://exbio.wzw.tum.de/cyanus/.
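
To make concrete why a distribution-level distance can detect changes that medians miss in zero-inflated markers, here is a minimal sketch of the one-dimensional earth mover's distance between two empirical samples. It is not the CyEMD implementation; the abstract states only that CyEMD is based on this distance. The function name `emd_1d` and the toy expression values are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import median


def emd_1d(xs, ys):
    """Earth mover's distance between two 1D empirical samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical cumulative distribution functions of the samples.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))

    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at left.
        f_x = bisect_right(xs, left) / len(xs)
        f_y = bisect_right(ys, left) / len(ys)
        total += abs(f_x - f_y) * (right - left)
    return total


if __name__ == "__main__":
    # Toy zero-inflated marker: most cells show no expression in either condition.
    control = [0, 0, 0, 0, 0]
    treated = [0, 0, 0, 3, 3]
    print("medians:", median(control), median(treated))
    print("EMD:", emd_1d(control, treated))
```

In this toy case both medians are zero, so a median-based comparison sees no difference, while the distance between the full distributions is positive.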

Subjects

Cluster Analysis, Biomarkers

Robust disease module mining via enumeration of diverse prize-collecting Steiner trees.

Bernett, Judith; Krupke, Dominik; Sadegh, Sepideh; Baumbach, Jan; Fekete, Sándor P; Kacprowski, Tim; List, Markus; Blumenthal, David B.

Bioinformatics ; 38(6): 1600-1606, 2022 03 04.

Article in English | MEDLINE | ID: mdl-34984440

ABSTRACT

MOTIVATION: Disease module mining methods (DMMMs) extract subgraphs that constitute candidate disease mechanisms from molecular interaction networks such as protein-protein interaction (PPI) networks. Irrespective of the employed models, DMMMs typically include non-robust steps in their workflows, i.e. the computed subnetworks vary when running the DMMMs multiple times on equivalent input. This lack of robustness has a negative effect on the trustworthiness of the obtained subnetworks and is hence detrimental for the widespread adoption of DMMMs in the biomedical sciences. RESULTS: To overcome this problem, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). In a large-scale empirical evaluation, we show that ROBUST outperforms competing methods in terms of robustness, scalability and, in most settings, functional relevance of the produced modules, measured via KEGG (Kyoto Encyclopedia of Genes and Genomes) gene set enrichment scores and overlap with DisGeNET disease genes. AVAILABILITY AND IMPLEMENTATION: A Python 3 implementation and scripts to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust, https://github.com/bionetslab/robust-eval. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Trees, Computational Biology/methods, Protein Interaction Maps

Network medicine for disease module identification and drug repurposing with the NeDRex platform.

Sadegh, Sepideh; Skelton, James; Anastasi, Elisa; Bernett, Judith; Blumenthal, David B; Galindez, Gihanna; Salgado-Albarrán, Marisol; Lazareva, Olga; Flanagan, Keith; Cockell, Simon; Nogales, Cristian; Casas, Ana I; Schmidt, Harald H H W; Baumbach, Jan; Wipat, Anil; Kacprowski, Tim.

Nat Commun ; 12(1): 6848, 2021 11 25.

Article in English | MEDLINE | ID: mdl-34824199

ABSTRACT

Traditional drug discovery faces a severe efficacy crisis. Repurposing of registered drugs provides an alternative with lower costs and faster drug development timelines. However, the data necessary for the identification of disease modules, i.e. pathways and sub-networks describing the mechanisms of complex diseases which contain potential drug targets, are scattered across independent databases. Moreover, existing studies are limited to predictions for specific diseases or non-translational algorithmic approaches. There is an unmet need for adaptable tools allowing biomedical researchers to employ network-based drug repurposing approaches for their individual use cases. We close this gap with NeDRex, an integrative and interactive platform for network-based drug repurposing and disease module discovery. NeDRex integrates ten different data sources covering genes, drugs, drug targets, disease annotations, and their relationships. NeDRex allows for constructing heterogeneous biological networks, mining them for disease modules, prioritizing drugs targeting disease mechanisms, and statistical validation. We demonstrate the utility of NeDRex in five specific use-cases.

Subjects

Factual Databases, Drug Repositioning/methods, Algorithms, Computational Biology, Disease/classification, Disease/genetics, Humans, Knowledge Bases, Workflow
