ABSTRACT
BACKGROUND AND OBJECTIVE: Colorectal cancer is one of the most common malignancies in the general population. Artificial intelligence methods based on serum parameters are under continuous development as less expensive tools for highly sensitive diagnosis. This study proposes a predictive system based on serum biomarkers and ensemble learning to predict the presence of colorectal cancer and the related TNM stage in patients. METHODS: We selected 17 significant plasma proteins, i.e., Carcinoembryonic Antigen, CA 19-9, CA 125, CA 50, CA 72-4, Tissue Polypeptide Antigen, C-Reactive Protein, Ceruloplasmin, Haptoglobin, Transferrin, Ferritin, α-1-Antitrypsin, α-2-Macroglobulin, α-1 Acid Glycoprotein, Complement C4, Complement C3, and Retinol Binding Protein, measured in 345 patients (248 of whom were affected by the neoplastic disease). The proposed system consists of two predictors, a binary predictor and a staging predictor; the former predicts the presence or absence of cancer, while the latter identifies the related TNM stage (I, II, III, or IV). The experiments deployed and compared Random Forest, XGBoost, Support Vector Machine, and Multilayer Perceptron, with feature selection based on Gini Importance and with dimensionality reduction via PCA. RESULTS: The system that uses XGBoost as both binary and staging predictor reaches 91.30% accuracy, 90% sensitivity, and 93.33% specificity for the presence/absence outcome, and 66.66% accuracy for the staging outcome. After expanding the training set with additional positive patients and applying majority voting, the combination of Support Vector Machine, XGBoost, and Multilayer Perceptron as the binary predictor reaches 98.03% accuracy, 100% sensitivity, and 92.30% specificity, while the combination of Random Forest, XGBoost, and Multilayer Perceptron as the staging predictor achieves 60% accuracy.
The final system reaches accuracies of 98.03% and 66.66% for the binary and staging predictors, respectively. The biomarkers that contribute most to the binary decision are Ceruloplasmin and α-2-Macroglobulin, while the least significant features are CA 50 and α-1-Antitrypsin; for the staging decision, Carcinoembryonic Antigen and α-1 Acid Glycoprotein are the most significant. CONCLUSIONS: The present study demonstrates the effectiveness of using serum biomarkers as features for early colorectal cancer diagnosis and of using majority voting to reduce noise in the prediction.
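
As an illustration of the majority-voting step and of the metrics reported above, the following is a minimal sketch rather than the study's implementation. It assumes each classifier has already produced one label per patient (1 = cancer present, 0 = absent). The function names `majority_vote` and `binary_metrics` and all toy labels are hypothetical and are not taken from the study's data.

```python
def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` is a list of equal-length label lists, one per model.
    If several labels tie, the label from the earliest listed model wins
    (an assumption of this sketch; the abstract does not state a tie rule).
    """
    combined = []
    for votes in zip(*predictions):
        # max() returns the first element with the highest vote count
        combined.append(max(votes, key=votes.count))
    return combined


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Toy labels (hypothetical): each model makes mistakes on different patients.
y_true = [1, 1, 1, 1, 0, 0]
svm_pred = [1, 1, 0, 1, 0, 1]
xgb_pred = [1, 0, 1, 1, 0, 0]
mlp_pred = [1, 1, 1, 0, 1, 0]

voted = majority_vote([svm_pred, xgb_pred, mlp_pred])
metrics = binary_metrics(y_true, voted)
print(voted, metrics)
```

In this toy case every individual model misclassifies at least one patient, but no two models err on the same patient, so the vote recovers the true label each time. This is the sense in which majority voting can reduce noise in the prediction. The same `majority_vote` function applies unchanged to the four-class staging labels.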