Search | Biblioteca Virtual em Saúde

The impact of imputation quality on machine learning classifiers for datasets with missing values.

Shadbahr, Tolou; Roberts, Michael; Stanczuk, Jan; Gilbey, Julian; Teare, Philip; Dittmer, Sören; Thorpe, Matthew; Torné, Ramon Viñas; Sala, Evis; Lió, Pietro; Patel, Mishal; Preller, Jacobus; Rudd, James H F; Mirtti, Tuomas; Rannikko, Antti Sakari; Aston, John A D; Tang, Jing; Schönlieb, Carola-Bibiane.

Commun Med (Lond) ; 3(1): 139, 2023 Oct 06.

Article in English | MEDLINE | ID: mdl-37803172

ABSTRACT

BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.

METHODS: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.

RESULTS: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

CONCLUSIONS: It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Many artificial intelligence (AI) methods aim to classify samples of data into groups, e.g., patients with disease vs. those without. This often requires datasets to be complete, i.e., that all data has been collected for all samples. However, in clinical practice this is often not the case and some data can be missing. One solution is to 'complete' the dataset using a technique called imputation to replace those missing values. However, assessing how well the imputation method performs is challenging. In this work, we demonstrate why people should care about imputation, develop a new method for assessing imputation quality, and demonstrate that if we build AI models on poorly imputed data, the model can give different results to those we would hope for. Our findings may improve the utility and quality of AI models in the clinic.
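The abstract above mentions discrepancy scores based on the sliced Wasserstein distance but does not define them. As a rough illustration only, the sketch below computes a generic Monte Carlo estimate of the sliced 2-Wasserstein distance between a complete dataset and an imputed copy of it. The function name `sliced_wasserstein`, its parameters, the toy data, and the use of mean imputation are all assumptions made for this example; this is not the authors' scoring method.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, seed=0):
    """Estimate the sliced 2-Wasserstein distance between two samples.

    Hypothetical helper for illustration. x and y must have the same
    shape (n_samples, n_features).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")

    rng = np.random.default_rng(seed)
    # Draw random directions and normalise them onto the unit sphere.
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both samples onto every direction and sort each projection.
    # For equally sized 1D samples, the squared 2-Wasserstein distance is
    # the mean squared difference between the sorted values.
    x_proj = np.sort(x @ directions.T, axis=0)
    y_proj = np.sort(y @ directions.T, axis=0)

    # Average over samples and directions, then take the square root.
    return float(np.sqrt(np.mean((x_proj - y_proj) ** 2)))


# Toy example (assumed data): hide 40% of one feature, then mean-impute it.
rng = np.random.default_rng(1)
complete = rng.normal(size=(500, 3))
missing = rng.random(500) < 0.4

mean_imputed = complete.copy()
mean_imputed[missing, 0] = complete[~missing, 0].mean()

print("complete vs complete:    ", sliced_wasserstein(complete, complete))
print("complete vs mean-imputed:", sliced_wasserstein(complete, mean_imputed))
```

Mean imputation collapses the hidden values onto a single point, so the imputed sample's distribution is narrower than the original along that feature, and a distributional distance of this kind is designed to register such distortion. On real data with missing values the complete reference is unavailable, so a score like this is mainly usable on simulated or held-out data.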

Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis.

Fu, Yu; Jung, Alexander W; Torne, Ramon Viñas; Gonzalez, Santiago; Vöhringer, Harald; Shmatko, Artem; Yates, Lucy R; Jimenez-Linan, Mercedes; Moore, Luiza; Gerstung, Moritz.

Nat Cancer ; 1(8): 800-810, 2020 Aug.

Article in English | MEDLINE | ID: mdl-35122049

ABSTRACT

We use deep transfer learning to quantify histopathological patterns across 17,355 hematoxylin and eosin-stained histopathology slide images from 28 cancer types and correlate these with matched genomic, transcriptomic and survival data. This approach accurately classifies cancer types and provides spatially resolved tumor and normal tissue distinction. Automatically learned computational histopathological features correlate with a large range of recurrent genetic aberrations across cancer types. This includes whole-genome duplications, which display universal features across cancer types, individual chromosomal aneuploidies, focal amplifications and deletions, as well as driver gene mutations. There are widespread associations between bulk gene expression levels and histopathology, which reflect tumor composition and enable the localization of transcriptomically defined tumor-infiltrating lymphocytes. Computational histopathology augments prognosis based on histopathological subtyping and grading, and highlights prognostically relevant areas such as necrosis or lymphocytic aggregates. These findings show the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology.

Subjects

Deep Learning , Neoplasms , Hematoxylin , Humans , Mutation , Neoplasms/diagnosis , Prognosis
