ABSTRACT
BACKGROUND: The field of epigenomics holds great promise for understanding and treating disease, and advances in machine learning (ML) and artificial intelligence are vitally important to this pursuit. Research increasingly uses DNA methylation measures at cytosine-guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Because DNA methylation data are high-dimensional, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons.
RESULTS: We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. In contrast, the baseline model of elastic net regression with no prior feature-reduction stage performed less well in general, suggesting that a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few CpGs in common, are associated with many of the same biological entities.
CONCLUSIONS: The variance in performance across the tested approaches shows that estimators are sensitive to data set heterogeneity, and that the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study. Moreover, our methodology, which utilises a range of feature-selection approaches and ML algorithms, could be applied to other biological markers and disease phenotypes to examine their relationship with DNA methylation and their predictive value.
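The abstract reports correlations with 83.4% confidence intervals but does not say how they were computed or why that level was chosen. The following is a minimal sketch under two assumptions that are not taken from the source: that the interval comes from the Fisher z-transformation, and that the 83.4% level follows the common rationale that two non-overlapping 83.4% intervals with similar standard errors roughly correspond to a two-sided test at the 0.05 level. The function names and the sample size `n=200` are hypothetical and for illustration only.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.834):
    """Confidence interval for a Pearson correlation via Fisher's z-transform.

    Assumption: this is one standard way to build such an interval; the
    abstract does not state which method the authors used.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                    # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    return math.tanh(z - q * se), math.tanh(z + q * se)


def non_overlap_level(alpha=0.05):
    """Coverage at which non-overlap of two equal-SE intervals
    approximates a two-sided test of their difference at level alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * NormalDist().cdf(z_alpha / math.sqrt(2.0)) - 1.0


if __name__ == "__main__":
    # n=200 is a hypothetical sample size, not the size of the EXTEND test set.
    low, high = fisher_ci(0.295, n=200)
    print(f"83.4% CI: [{low:.3f}, {high:.3f}]")
    print(f"coverage for alpha=0.05: {non_overlap_level():.3f}")
```

Under these assumptions, comparing whether two estimators' intervals overlap gives a quick visual check for a difference in correlation, although a formal comparison of dependent correlations on the same test set would need a dedicated test.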
Subject(s)
DNA Methylation, Epigenomics, Telomere Homeostasis, Algorithms, Epigenomics/methods, Regression Analysis, Machine Learning, Humans
ABSTRACT
The diagnosis of prostate cancer is challenging due to the heterogeneity of its presentations, leading to the overdiagnosis and treatment of clinically unimportant disease. Accurate diagnosis can directly benefit a patient's quality of life and prognosis. To address this issue, we present a learning model for the automatic identification of prostate cancer. While many prostate cancer studies have adopted Raman spectroscopy approaches, none have utilised the combination of Raman Chemical Imaging (RCI) and other imaging modalities. This study uses multimodal images formed from stained Digital Histopathology (DP) and unstained RCI. The approach was developed and tested on a set of 178 clinical samples from 32 patients, containing a range of non-cancerous, Gleason grade 3 (G3) and grade 4 (G4) tissue microarray samples. For each histological sample, there is a pathologist-labelled DP-RCI image pair. The hypothesis tested was whether multimodal image models can outperform single-modality baseline models in terms of diagnostic accuracy. Binary non-cancer/cancer models and the more challenging G3/G4 differentiation were investigated. For G3/G4 classification, the multimodal approach achieved a sensitivity of 73.8% and a specificity of 88.1%, while the baseline DP model showed a sensitivity and specificity of 54.1% and 84.7%, respectively. The multimodal approach demonstrated a statistically significant AUC advantage of 12.7 percentage points over the baseline (85.8% compared to 73.1%), and also outperformed models based solely on RCI and on mean and median Raman spectra. Feature fusion of DP and RCI does not improve the simpler task of tumour identification but does deliver an observed advantage in G3/G4 discrimination. Building on these promising findings, future work could include the acquisition of larger datasets for enhanced model generalisation.
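The sensitivity, specificity and AUC figures quoted above can be illustrated with a short sketch. It assumes binary labels in which 1 marks the positive class (for example G4) and 0 the negative class (for example G3); the abstract does not say which grade was treated as positive. The function names and the toy labels and scores are invented for illustration and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    sample scores higher than a random negative one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example only: three positive and four negative samples.
    labels = [1, 1, 1, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.6]
    sens, spec = sensitivity_specificity(labels, predicted)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
    print(f"AUC={auc(labels, scores):.3f}")
```

Sensitivity and specificity depend on the decision threshold applied to the model scores, whereas AUC summarises ranking quality across all thresholds, which is why the abstract reports both kinds of figure.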