ABSTRACT
BACKGROUND: Colorectal cancer has a high incidence and mortality rate due to a low rate of early diagnosis. Therefore, efficient diagnostic methods are urgently needed. PURPOSE: This study assesses the diagnostic effectiveness of Carbohydrate Antigen 19-9 (CA19-9), Carcinoembryonic Antigen (CEA), Alpha-fetoprotein (AFP), and Cancer Antigen 125 (CA125) serum tumor markers for colorectal cancer (CRC) and investigates a machine learning-based diagnostic model incorporating these markers with blood biochemical indices for improved CRC detection. METHOD: Between January 2019 and December 2021, data from 800 CRC patients and 697 controls were collected; 52 patients and 63 controls attending the same hospital in 2022 were collected as an external validation set. Markers' effectiveness was analyzed individually and collectively, using metrics like ROC curve AUC and F1 score. Variables chosen through backward regression, including demographics and blood tests, were tested on six machine learning models using these metrics. RESULT: In the case group, the levels of CEA, CA199, and CA125 were found to be higher than those in the control group. Combining these with a fourth serum marker significantly improved predictive efficacy over using any single marker alone, achieving an Area Under the Curve (AUC) value of 0.801. Using stepwise regression (backward), 17 variables were meticulously selected for evaluation in six machine learning models. Among these models, the Gradient Boosting Machine (GBM) emerged as the top performer in the training set, test set, and external validation set, boasting an AUC value of over 0.9, indicating its superior predictive power. CONCLUSION: Machine learning models integrating tumor markers and blood indices offer superior CRC diagnostic accuracy, potentially enhancing clinical practice.