ABSTRACT
The availability of high-throughput sequencing data creates opportunities to understand human diseases comprehensively, but also makes it challenging to train machine learning models on such high-dimensional data. Here, we propose a denoised multi-omics integration framework with two components: a distribution-based feature denoising algorithm, Feature Selection with Distribution (FSD), for dimension reduction, and a multi-omics integration framework, Attention Multi-Omics Integration (AttentionMOI), to predict cancer prognosis and identify cancer subtypes. We demonstrated that FSD improved model performance with both single-omics and multi-omics data for survival prediction in 15 cancer types from The Cancer Genome Atlas (TCGA) and for kidney cancer subtype identification. Our integration framework, AttentionMOI, outperformed machine learning models and current multi-omics integration algorithms on high-dimensional feature sets. Furthermore, FSD identified features that were associated with cancer prognosis and could be considered as biomarkers.
Subject(s)
Genomics , Neoplasms , Humans , Genomics/methods , Multiomics , Neoplasms/genetics , Algorithms
ABSTRACT
AIMS/INTRODUCTION: Clinical guidelines for the management of individuals with type 2 diabetes mellitus endorse the systematic assessment of atherosclerotic cardiovascular disease risk to enable early interventions. In this study, we aimed to develop machine learning models to predict 3-year atherosclerotic cardiovascular disease risk in Chinese patients with type 2 diabetes mellitus. MATERIALS AND METHODS: Clinical records of 4,722 individuals with type 2 diabetes mellitus admitted to 94 hospitals were used. The features included demographic information, disease histories, laboratory tests and physical examinations. Logistic regression, support vector machine, gradient boosting decision tree, random forest and adaptive boosting were applied for model construction. The performance of these models was evaluated using the area under the receiver operating characteristic curve. Additionally, we applied SHapley Additive exPlanation values to explain the prediction model. RESULTS: All five models achieved good performance in both internal and external test sets (area under the receiver operating characteristic curve >0.8). Random forest showed the highest discrimination ability, with a sensitivity of 0.838 and a specificity of 0.814. The SHapley Additive exPlanation analyses showed that a history of diabetic peripheral vascular disease, older age and longer diabetes duration were the three most influential predictors. CONCLUSIONS: The prediction models offer opportunities to personalize treatment and maximize the benefits of medical interventions.
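The evaluation metrics named in this abstract (area under the receiver operating characteristic curve, sensitivity and specificity) can be illustrated with a minimal sketch. The labels, scores and the 0.5 decision threshold below are invented toy values, not data or settings from the study, and the function names are hypothetical; the study's own evaluation code is not described in the abstract.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    A case is predicted positive when its score is >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (assumed values): 1 = event within follow-up, 0 = no event.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC={auc:.4f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The pairwise definition of AUC is quadratic in the number of cases, which is fine for illustration; for a cohort of several thousand records, a rank-based implementation from an established library would be the more practical choice.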