Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings.

Gilhodes, Julia; Dalenc, Florence; Gal, Jocelyn; Zemmour, Christophe; Leconte, Eve; Boher, Jean-Marie; Filleron, Thomas


Affiliation

Gilhodes J; Department of Biostatistics, Institut Claudius Regaud, IUCT-O, Toulouse, France.
Dalenc F; Department of Medical Oncology, Institut Claudius Regaud, IUCT-O, Toulouse, France.
Gal J; Department of Biostatistics, Centre Antoine-Lacassagne, Nice, France.
Zemmour C; Department of Clinical Research and Investigation, Biostatistics and Methodology Unit, Institut Paoli-Calmettes, Aix-Marseille University, INSERM, IRD, SESSTIM, Marseille, France.
Leconte E; TSE-R, Université de Toulouse, France.
Boher JM; Department of Clinical Research and Investigation, Biostatistics and Methodology Unit, Institut Paoli-Calmettes, Aix-Marseille University, INSERM, IRD, SESSTIM, Marseille, France.
Filleron T; Department of Biostatistics, Institut Claudius Regaud, IUCT-O, Toulouse, France.

Comput Math Methods Med ; 2020: 6795392, 2020.

Article in English | MEDLINE | ID: mdl-32670394

ABSTRACT


Over the last decades, molecular signatures have become increasingly important in oncology and are opening up a new area of personalized medicine. Nevertheless, the biological relevance of these signatures and the statistical tools used to develop them have been called into question in the literature. Here, we investigate six typical selection methods for high-dimensional settings and survival endpoints, including LASSO and some of its extensions, component-wise boosting, and random survival forests (RSF). A resampling algorithm based on data splitting was applied to nine high-dimensional simulated datasets to assess selection stability on the training sets and the intersection between the lists of predictors chosen by the different selection methods. Prognostic performance was evaluated on the corresponding validation sets. Finally, the methods were applied to a real breast cancer dataset. The false discovery rate (FDR) was high for each selection method, and the intersection between lists of predictors was very poor. RSF selects many more variables than the other methods and is therefore less efficient on validation sets. Because of the complex correlation structure in genomic data, the selection of predictors is generally unstable, but stability can be improved with a larger training sample. In a very high-dimensional setting, we recommend the LASSO-pcvl method, since it outperforms the other methods by reducing the number of selected genes and minimizing the FDR in most scenarios. Nevertheless, this method still yields a high rate of false positives. Further work is thus necessary to develop new methods that overcome this issue when numerous predictors are present. Multidisciplinary discussion between clinicians and statisticians is necessary to ensure both the statistical and the biological relevance of the predictors included in molecular signatures.
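The evaluation quantities named in the abstract (FDR, selection stability across data splits, and the intersection between lists of predictors) can be illustrated with a small sketch. This is not the authors' code: the function names, the gene labels, the Jaccard index as a measure of intersection, and the hand-written selections are all assumptions made for illustration, and no actual selection method (LASSO, boosting, RSF) is fitted here.

```python
import random
from collections import Counter


def false_discovery_rate(selected, true_predictors):
    """Share of selected variables that are not true predictors.

    Returns 0.0 when nothing is selected (a convention assumed here).
    """
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_predictors)) / len(selected)


def selection_frequencies(selections):
    """Fraction of resampling runs in which each variable was selected."""
    counts = Counter(v for run in selections for v in set(run))
    return {v: c / len(selections) for v, c in counts.items()}


def jaccard(a, b):
    """Overlap between two lists of predictors: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_indices(n, train_fraction, rng):
    """One random training/validation split of n observation indices."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


# Toy illustration with hypothetical gene names.
true_predictors = {"g1", "g2", "g3"}

# Pretend a selection method was run on four different training splits.
runs = [["g1", "g2"], ["g1", "g7"], ["g1"], ["g2", "g9"]]
freq = selection_frequencies(runs)  # g1 is selected in 3 of 4 runs

# FDR of a single selected list: g7 and g9 are false discoveries.
fdr = false_discovery_rate(["g1", "g2", "g7", "g9"], true_predictors)  # 0.5

# Intersection between the lists produced by two hypothetical methods.
overlap = jaccard(["g1", "g2"], ["g2", "g3"])

train, valid = split_indices(10, 0.5, random.Random(0))
```

In a simulation the set of true predictors is known, so the FDR can be computed directly; on a real dataset only the selection frequencies and the overlap between methods are available.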

Subjects

Algorithms; Precision Medicine/methods; Breast Neoplasms/genetics; Computational Biology; Computer Simulation; Databases, Genetic/statistics & numerical data; Female; Humans; Likelihood Functions; Precision Medicine/statistics & numerical data; Prognosis; Proportional Hazards Models; Statistics, Nonparametric


Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Algorithms / Precision Medicine | Type of study: Prognostic_studies | Limits: Female / Humans | Language: En | Publication year: 2020 | Document type: Article
