Optimal ensemble construction for multistudy prediction with applications to mortality estimation.

Loewinger, Gabriel; Nunez, Rolando Acosta; Mazumder, Rahul; Parmigiani, Giovanni

Affiliation

Loewinger G; Machine Learning Team, National Institute on Mental Health, Bethesda, Maryland, USA.
Nunez RA; Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA.
Mazumder R; Regeneron Pharmaceuticals Inc., Tarrytown, New York, USA.
Parmigiani G; Operations Research Center and MIT Center for Statistics, MIT Sloan School of Management, Cambridge, Massachusetts, USA.

Stat Med ; 43(9): 1774-1789, 2024 Apr 30.

Article in English | MEDLINE | ID: mdl-38396313

ABSTRACT

It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two-stage stacking strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID-attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study-specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID-19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data-driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID-19 data application and in a range of simulation settings.
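The contrast between two-stage stacking and joint estimation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the ridge-penalized objective `||y - X B w||^2 + lam * ||B||_F^2 + mu * ||w||^2`, the function names (`ridge`, `two_stage_stacking`, `joint_fit`), the unconstrained weights, and the simulated data. The sketch starts from a two-stage fit (study-specific ridge models, then weights) and refines it by block coordinate descent, alternating exact updates of each study-specific coefficient vector and of the weight vector.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def two_stage_stacking(studies, lam, mu):
    """Stage 1: one ridge model per study. Stage 2: weights fit on pooled data.
    (Simplified: no non-negativity or sum-to-one constraint on the weights.)"""
    B = np.column_stack([ridge(X, y, lam) for X, y in studies])  # p x K
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    w = ridge(X_all @ B, y_all, mu)  # regress outcomes on model predictions
    return B, w, X_all, y_all

def objective(B, w, X, y, lam, mu):
    """Hypothetical joint loss (an assumption, not the paper's loss)."""
    resid = y - X @ (B @ w)
    return resid @ resid + lam * np.sum(B ** 2) + mu * (w @ w)

def joint_fit(studies, lam=1.0, mu=1.0, n_iter=20):
    """Block coordinate descent, initialized at the two-stage solution."""
    B, w, X, y = two_stage_stacking(studies, lam, mu)
    p, K = B.shape
    gram = X.T @ X
    history = [objective(B, w, X, y, lam, mu)]
    for _ in range(n_iter):
        for k in range(K):
            # Partial residual with model k's contribution added back.
            r = y - X @ (B @ w) + w[k] * (X @ B[:, k])
            # Exact minimizer over beta_k with all other blocks held fixed.
            B[:, k] = np.linalg.solve(w[k] ** 2 * gram + lam * np.eye(p),
                                      w[k] * (X.T @ r))
        # Exact minimizer over the weights with all coefficients held fixed.
        w = ridge(X @ B, y, mu)
        history.append(objective(B, w, X, y, lam, mu))
    return B, w, history

# Simulated heterogeneous studies: shared coefficients plus study-level noise.
rng = np.random.default_rng(0)
studies = []
for _ in range(3):
    X = rng.normal(size=(40, 4))
    beta = np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=4)
    studies.append((X, X @ beta + 0.1 * rng.normal(size=40)))

B, w, history = joint_fit(studies)
```

Because each block update minimizes the assumed objective exactly over its own block, `history` is non-increasing, and its first entry is the loss of the two-stage solution that the joint fit starts from.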

Subjects

Algorithms; COVID-19; Humans; Computer Simulation

Keywords

COVID19 excess mortality; domain adaptation; domain generalization; transfer learning

Full text: 1 | Database: MEDLINE | Main subject: Algorithms / COVID-19 | Language: En | Publication year: 2024 | Document type: Article
