Large-scale benchmark study of survival prediction methods using multi-omics data.

Herrmann, Moritz; Probst, Philipp; Hornung, Roman; Jurinovic, Vindi; Boulesteix, Anne-Laure

Affiliations

Herrmann M; Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany.
Probst P; Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany.
Hornung R; Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany.
Jurinovic V; Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany.
Boulesteix AL; Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany.

Brief Bioinform; 22(3), 2021 May 20.

Article in English | MEDLINE | ID: mdl-32823283

ABSTRACT

Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate for deriving such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and methods that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking the multi-omics structure into account have slightly better prediction performance. Taking this structure into account can help ensure that the predictive information in low-dimensional groups (especially clinical variables) is actually exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. All analyses are reproducible using R code freely available on GitHub.

Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198

Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
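
The evaluation design described in the abstract (repeated 5-fold cross-validation scored with a concordance index) can be sketched as follows. This is only an illustrative Python sketch, not the authors' R code. Two simplifications are assumptions made here: it uses Harrell's C-index rather than Uno's C-index (Uno's version additionally applies inverse-probability-of-censoring weights), and it scores a fixed toy risk score on synthetic data instead of fitting a model on each training fold. The function names `repeated_kfold` and `harrell_c` are hypothetical.

```python
import math
import random


def repeated_kfold(n, k=5, repeats=2, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # fresh random partition for every repetition
        folds = [idx[i::k] for i in range(k)]
        for f, test in enumerate(folds):
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield r, f, train, sorted(test)


def harrell_c(times, events, risk):
    """Harrell's C-index (NOT Uno's): share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when i has an observed event and
    times[i] < times[j]; it is concordant when risk[i] > risk[j].
    Ties in the risk score count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Synthetic toy data (assumption): 40 subjects, roughly 70% observed events.
    rng = random.Random(1)
    n = 40
    times = [rng.uniform(1, 100) for _ in range(n)]
    events = [rng.random() < 0.7 for _ in range(n)]
    # Toy risk score: noisy negative survival time. A real benchmark would
    # fit a model on the training fold and predict risk for the test fold.
    risk = [-t + rng.gauss(0, 20) for t in times]

    scores = []
    for _, _, _, test in repeated_kfold(n, k=5, repeats=2, seed=0):
        c = harrell_c([times[i] for i in test],
                      [events[i] for i in test],
                      [risk[i] for i in test])
        if not math.isnan(c):
            scores.append(c)
    if scores:
        print("mean C-index over folds:", sum(scores) / len(scores))
```

Averaging the per-fold scores over all repetitions gives one summary value per method, which is the kind of quantity a benchmark like this compares across methods and datasets.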

Subjects

Benchmarking; Female; Humans; Machine Learning; Male; Neoplasms/genetics; Neoplasms/pathology; Proportional Hazards Models; Survival Analysis

Keywords

benchmark; machine learning; multi-omics data; prediction models; statistics; survival analysis


Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Benchmarking | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Female / Humans / Male | Language: En | Journal: Brief Bioinform | Journal subject: BIOLOGY / MEDICAL INFORMATICS | Year of publication: 2021 | Document type: Article | Country of affiliation: Germany
