ProJect: a powerful mixed-model missing value imputation method.

Kong, Weijia; Wong, Bertrand Jern Han; Hui, Harvard Wai Hann; Lim, Kai Peng; Wang, Yulan; Wong, Limsoon; Goh, Wilson Wen Bin

Affiliation

Kong W; School of Biological Sciences, Nanyang Technological University, Singapore.
Wong BJH; Department of Computer Science, National University of Singapore, Singapore.
Hui HWH; Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore.
Lim KP; School of Biological Sciences, Nanyang Technological University, Singapore.
Wang Y; School of Biological Sciences, Nanyang Technological University, Singapore.
Wong L; School of Biological Sciences, Nanyang Technological University, Singapore.
Goh WWB; Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore.

Brief Bioinform; 24(4), 2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37419612

ABSTRACT

Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we used renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than the other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared error (Procrustes SS) (79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also achieves the highest correlation coefficient across all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle the different types of MVs commonly found in real-world data. Unlike most MVI methods, which are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines whether an MV is missing at random or missing not at random. It then applies a targeted imputation strategy to each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
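
The abstract reports normalized root mean square error (NRMSE) as a percentage reduction relative to the closest competing method but does not give the formula. The following is a minimal Python sketch of one common definition, assuming NRMSE is the root of the mean squared error divided by the variance of the true values, evaluated only on entries that were masked out. The function name `nrmse`, the toy values and the `reduction_pct` calculation are illustrative assumptions; they are not taken from the paper or from the ProJect R code.

```python
import math
from statistics import pvariance


def nrmse(truth, imputed, mask):
    """NRMSE over the masked (artificially removed) entries only.

    Assumed definition: sqrt(mean squared error / variance of true values),
    with both terms computed on the entries where mask is True.
    """
    true_vals = [t for t, m in zip(truth, mask) if m]
    imp_vals = [x for x, m in zip(imputed, mask) if m]
    if len(true_vals) < 2:
        raise ValueError("need at least two masked entries")
    mse = sum((t - x) ** 2 for t, x in zip(true_vals, imp_vals)) / len(true_vals)
    var = pvariance(true_vals)
    if var == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var)


# Toy data, made up for illustration only.
truth = [10.0, 12.0, 14.0, 16.0, 18.0]
mask = [False, True, False, True, True]      # True = value was hidden
method_a = [10.0, 12.5, 14.0, 15.5, 18.5]    # hypothetical imputation A
method_b = [10.0, 15.3, 14.0, 15.3, 15.3]    # hypothetical constant fill

score_a = nrmse(truth, method_a, mask)
score_b = nrmse(truth, method_b, mask)

# One plausible reading of "X% less error relative to a competitor".
reduction_pct = 100 * (score_b - score_a) / score_b
print(f"A={score_a:.3f}  B={score_b:.3f}  reduction={reduction_pct:.1f}%")
```

A lower score means the imputed values sit closer to the hidden true values. Scoring only the masked entries keeps observed values, which every method reproduces exactly, from diluting the comparison.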

Subject(s)

Algorithms; Genomics; Bayes Theorem; Oligonucleotide Array Sequence Analysis/methods; Mass Spectrometry/methods

Keywords

bioinformatics; missing at random (MAR); missing not at random (MNAR); missing value imputation (MVI); statistics

Full text: 1
Collection: 01-international
Database: MEDLINE
Main subject: Algorithms / Genomics
Type of study: Prognostic_studies
Language: En
Journal: Brief Bioinform
Journal subject: BIOLOGY / MEDICAL INFORMATICS
Year: 2023
Document type: Article
Country of affiliation: Singapore
