ABSTRACT
BACKGROUND: The temporal relationships across cardiometabolic diseases (CMDs) were recently conceptualized as the cardiometabolic continuum (CMC), a sequence of cardiovascular events that stem from gene-environment interactions, unhealthy lifestyle influences, and metabolic diseases such as diabetes and hypertension. While the physiological pathways linking metabolic and cardiovascular diseases have been investigated, sex and population differences in the CMC have not yet been described. METHODS: We present a machine learning approach to model the CMC and investigate sex and population differences in two distinct cohorts: the UK Biobank (17,700 participants) and the Brazilian Longitudinal Study of Adult Health (ELSA-Brasil) (7,162 participants). We consider the following CMDs: hypertension (Hyp), diabetes (DM), heart diseases (HD: angina, myocardial infarction, or heart failure), and stroke (STK). To identify the CMC patterns, individual trajectories with the time of disease occurrence were clustered using k-means. Based on clinical, sociodemographic, and lifestyle characteristics, we built multiclass random forest classifiers and used the SHAP methodology to evaluate feature importance. RESULTS: Five CMC patterns were identified across both sexes and cohorts: EarlyHyp, FirstDM, FirstHD, Healthy, and LateHyp, named according to the disease prevalence and occurrence time that characterized around 95%, 78%, 75%, 88%, and 99% of individuals, respectively. Within the UK Biobank, more women were classified in the Healthy cluster and more men in all others. In the EarlyHyp and LateHyp clusters, isolated hypertension occurred earlier among women. Smoking habits and education had high importance and clear directionality for both sexes. In ELSA-Brasil, more men were classified in the Healthy cluster and more women in the FirstDM cluster. When diabetes was followed by hypertension, diabetes occurred earlier among women. Education and ethnicity had high importance and clear directionality for women, while for men these features were smoking, alcohol, and coffee consumption. CONCLUSIONS: There are clear sex differences in the CMC, and they varied across the UK and Brazilian cohorts. In particular, women's disadvantages in disease incidence and time to onset were more pronounced in Brazil. The results show the need to strengthen public health policies to prevent and control the time course of CMDs, with an emphasis on women.
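The clustering step described in METHODS can be illustrated with a minimal sketch of k-means (Lloyd's algorithm) on toy onset-age trajectories. The data, the two-disease encoding, the `NEVER` placeholder for a disease that did not occur, and the fixed initial centroids are all assumptions made for illustration; the abstract does not state how trajectories were encoded.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

NEVER = 100.0  # hypothetical placeholder age for "disease never occurred"

# Toy trajectories: (age at hypertension onset, age at diabetes onset).
trajectories = [
    (40.0, NEVER), (42.0, NEVER),    # hypertension only, early
    (NEVER, 45.0), (70.0, 47.0),     # diabetes first
    (NEVER, NEVER), (NEVER, NEVER),  # no disease
]
initial = [(40.0, NEVER), (NEVER, 45.0), (NEVER, NEVER)]  # fixed for reproducibility

labels, centroids = kmeans(trajectories, initial)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

In practice the initial centroids would be chosen at random or with a seeding scheme, and the number of clusters would be selected with a validity criterion rather than fixed in advance.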
Subjects
Cardiovascular Diseases , Machine Learning , Adult , Aged , Female , Humans , Male , Middle Aged , Brazil/epidemiology , Cardiometabolic Risk Factors , Cardiovascular Diseases/epidemiology , Cohort Studies , Longitudinal Studies , Sex Factors , UK Biobank , United Kingdom/epidemiology
ABSTRACT
The expansion of urban areas contributes to the growth of impervious surfaces, leading to increased pollution and altering the configuration, composition, and context of land covers. This study employed machine learning methods (partial least squares regression and Shapley Additive exPlanations) to explore the intricate relationships between urban expansion, land cover changes, and water quality in a watershed with a park and lake. To this end, we first evaluated the spatio-temporal variation of selected physicochemical and microbiological water quality variables, generated yearly land cover maps of the basin using several machine learning classifiers, and computed the landscape metrics that best represent the land cover. The main results highlighted the importance of spatial arrangement and the size of the contributing watershed for water quality. Compact urban forms appeared to mitigate the impact on pollutant levels. This research provides valuable insights into the relationship between landscape characteristics and water quality dynamics, informing targeted watershed management strategies aimed at mitigating pollution and ensuring the health and resilience of aquatic ecosystems.
Subjects
Environmental Monitoring , Machine Learning , Water Quality , Environmental Monitoring/methods , Uruguay , Urbanization , Ecosystem
ABSTRACT
PURPOSE: Machine learning (ML) models have shown excellent performance in prognosis prediction. However, the black-box nature of ML models has limited their clinical application. Here, we aimed to establish explainable and visualizable ML models to predict biochemical recurrence (BCR) of prostate cancer (PCa). MATERIALS AND METHODS: A total of 647 PCa patients were retrospectively evaluated. Clinical parameters were selected using LASSO regression. The cohort was then split into training and validation datasets at a ratio of 0.75:0.25, and BCR-related features were included in Cox regression and five ML algorithms to construct BCR prediction models. The clinical utility of each model was evaluated by concordance index (C-index) values and decision curve analyses (DCA). In addition, Shapley Additive Explanation (SHAP) values were used to explain the features in the models. RESULTS: We identified 11 BCR-related features using LASSO regression and then established five ML-based models, namely random survival forest (RSF), survival support vector machine (SSVM), survival tree (sTree), gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost), together with a Cox regression model. The C-index values were 0.846 (95%CI 0.796-0.894), 0.774 (95%CI 0.712-0.834), 0.757 (95%CI 0.694-0.818), 0.820 (95%CI 0.765-0.869), 0.793 (95%CI 0.735-0.852), and 0.807 (95%CI 0.753-0.858), respectively. The DCA showed that the RSF model had significant advantages over all other models. Regarding interpretability, the SHAP values demonstrated the tangible contribution of each feature in the RSF model. CONCLUSIONS: Our scoring system provides a reference for identifying BCR and for building a framework for personalized therapeutic decisions in PCa.
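For readers unfamiliar with the C-index used to compare these models, the following is a minimal sketch of Harrell's concordance index for right-censored data. The toy follow-up times, event indicators, and risk scores are invented for illustration, and the abstract does not state which C-index variant the authors computed.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had the event and did so
    strictly before subject j's last observed time. Ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: months to recurrence or censoring, event flag (1 = recurrence),
# and a hypothetical model risk score (higher = earlier expected recurrence).
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risk)
print(c_index)  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.757 to 0.846 should be read.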
Subjects
Machine Learning , Neoplasm Recurrence, Local , Prostatic Neoplasms , Humans , Male , Prostatic Neoplasms/blood , Prostatic Neoplasms/pathology , Neoplasm Recurrence, Local/blood , Neoplasm Recurrence, Local/pathology , Retrospective Studies , Aged , Middle Aged , Prognosis , Decision Trees , Proportional Hazards Models , Algorithms , Support Vector Machine , Prostate-Specific Antigen/blood
ABSTRACT
Identifying risk factors associated with COVID-19 lethality is crucial in combating the ongoing pandemic. In this study, we developed lethality predictive models for each epidemiological wave and for the overall dataset using the Extreme Gradient Boosting technique and analyzed them using Shapley values to determine the contribution levels of various features, including demographics, comorbidities, medical units, and recent medical information from confirmed COVID-19 cases in Mexico between February 23, 2020, and April 15, 2022. The results showed that pneumonia and advanced age were the most important factors predicting patient death in all cohorts. Additionally, the medical unit where the patient received care acted as a risk or protective factor. IMSS medical units were identified as high-risk factors in all cohorts, except in wave four, while SSA medical units generally were moderate protective factors. We also found that intubation was a high-risk factor in the first epidemiological wave and a moderate-risk factor in the following waves. Female gender was a protective factor of moderate-high importance in all cohorts, while being between 18 and 29 years old was a moderate protective factor and being between 50 and 59 years old was a moderate risk factor. Additionally, diabetes (all cohorts), obesity (third wave), and hypertension (fourth wave) were identified as moderate risk factors. Finally, residing in municipalities with the lowest Human Development Index level represented a moderate risk factor. In conclusion, this study identified several significant risk factors associated with COVID-19 lethality in Mexico, which could aid policymakers in developing targeted interventions to reduce mortality rates.
Subjects
COVID-19 , Humans , Female , Adolescent , Young Adult , Adult , Middle Aged , COVID-19/epidemiology , Mexico/epidemiology , Risk Factors , Obesity , Machine Learning
ABSTRACT
Introduction: Gut microbiota (GM) dysbiosis is one of the causal factors in the progression of several chronic metabolic diseases, including type 2 diabetes mellitus (T2D). Understanding the basis of this association may lead to new therapeutic strategies for preventing and treating T2D, such as probiotics, prebiotics, and fecal microbiota transplants. It may also help identify potential early detection biomarkers and develop personalized interventions based on an individual's gut microbiota profile. Here, we explore how supervised Machine Learning (ML) methods help to distinguish taxa for individuals with prediabetes or T2D. Methods: To this aim, we analyzed the GM profile (16S rRNA gene sequencing) in a cohort of 410 naïve Mexican patients stratified into normoglycemic, prediabetes, and T2D individuals. We then compared six different ML algorithms and found that Random Forest had the highest predictive performance in classifying T2D and prediabetes patients versus controls. Results: We identified a set of taxa for predicting patients with T2D compared to normoglycemic individuals, including Allisonella, Slackia, Ruminococcus_2, Megasphaera, Escherichia/Shigella, and Prevotella, among others. In addition, we found that Anaerostipes, Intestinibacter, Prevotella_9, Blautia, Granulicatella, and Veillonella were the relevant genera in patients with prediabetes compared to normoglycemic subjects. Discussion: These findings allow us to postulate that the GM is a distinctive signature in prediabetes and T2D patients during the development and progression of the disease. Our study highlights the role of the GM and opens a window toward the rational design of new preventive and personalized strategies for the control of this disease.
Subjects
Diabetes Mellitus, Type 2 , Gastrointestinal Microbiome , Prediabetic State , Humans , Diabetes Mellitus, Type 2/diagnosis , Prediabetic State/diagnosis , Dysbiosis , RNA, Ribosomal, 16S/genetics , Machine Learning
ABSTRACT
The immunohistochemical (IHC) evaluation of human epidermal growth factor receptor 2 (HER2) for the diagnosis of breast cancer is still qualitative, with a high degree of inter-observer variability, and thus requires complementary techniques such as fluorescence in situ hybridization (FISH) to resolve the diagnosis. Implementing automatic algorithms to classify IHC biomarkers is crucial for typifying the tumor and deciding on therapy for each patient with better performance. The present study aims to demonstrate that, using an explainable Machine Learning (ML) model for the classification of HER2 photomicrographs, it is possible to determine criteria that improve the value of IHC analysis. We trained a logistic regression-based supervised ML model with 393 IHC microscopy images from 131 patients to discriminate between upregulated and normal expression of the HER2 protein. Pathologists' diagnoses (IHC only) and the final diagnosis complemented with FISH (IHC + FISH) were each used as training outputs. Basic performance metrics and receiver operating characteristic curve analysis were used together with an explainability algorithm based on Shapley Additive exPlanations (SHAP) values to understand training differences. The model discriminated amplified from normal expression with better performance when the training output was the IHC + FISH final diagnosis (area under the curve, 0.94 for IHC + FISH vs. 0.81 for IHC only). According to the SHAP value interpretation, this may be explained by the increased analytical impact of the membrane distribution criteria over the global intensity of the signal. The classification model improved its performance when the training output was the final diagnosis, downplaying the weight of the IHC signal intensity, which suggests that improving pathological diagnosis before FISH consultation requires emphasizing subcellular patterns of staining.
ABSTRACT
The large amount of data generated during the COVID-19 pandemic requires advanced tools for the long-term prediction of risk factors associated with COVID-19 mortality with higher accuracy. Machine learning (ML) methods directly address this topic and are essential tools to guide public health interventions. Here, we used ML to investigate the importance of demographic and clinical variables for COVID-19 mortality. We also analyzed how comorbidity networks are structured according to age group. We conducted a retrospective study of COVID-19 mortality in hospitalized patients from Londrina, Paraná, Brazil, registered in the database for severe acute respiratory infections (SIVEP-Gripe) from January 2021 to February 2022. We tested four ML models to predict the COVID-19 outcome: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost. We also constructed a comorbidity network to investigate the impact of co-occurring comorbidities on COVID-19 mortality. Our study comprised 8358 hospitalized patients, of whom 2792 (33.40%) died. The XGBoost model achieved excellent performance (ROC-AUC = 0.90). Both the permutation method and SHAP values highlighted age, ventilatory support status, and intensive care unit admission as key features in predicting COVID-19 outcomes. The comorbidity networks of older deceased patients were denser than those of younger patients. In addition, the co-occurrence of heart disease and diabetes may be the most important combination for predicting COVID-19 mortality, regardless of age and sex. This work presents a valuable combination of machine learning and comorbidity network analysis to predict COVID-19 outcomes. Reliable evidence on this topic is crucial for guiding the post-pandemic response and assisting in COVID-19 care planning and provision.
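The statement that one comorbidity network is denser than another can be made concrete with a minimal sketch. It assumes a simple construction in which two comorbidities are linked when they co-occur in at least `min_count` patients, and it measures density as the share of possible links that are present. The patient records, the `min_count` threshold, and the construction itself are assumptions for illustration; the abstract does not describe how the networks were built.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(patients, min_count=1):
    """Count how often each pair of comorbidities appears in the same patient."""
    counts = Counter()
    for comorbidities in patients:
        for pair in combinations(sorted(set(comorbidities)), 2):
            counts[pair] += 1
    # Keep only pairs seen in at least min_count patients.
    return {pair: c for pair, c in counts.items() if c >= min_count}

def density(edges, nodes):
    """Undirected graph density: present edges / possible edges."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))

nodes = ["asthma", "diabetes", "heart disease", "obesity"]

# Toy records: one set of comorbidities per patient (hypothetical data).
older = [
    {"diabetes", "heart disease"},
    {"diabetes", "heart disease", "obesity"},
    {"obesity", "asthma"},
]
younger = [{"obesity"}, {"asthma", "obesity"}, {"diabetes"}]

older_edges = cooccurrence_edges(older)
younger_edges = cooccurrence_edges(younger)
print(round(density(older_edges, nodes), 2))    # 0.67
print(round(density(younger_edges, nodes), 2))  # 0.17
```

The edge counts also expose the most frequent pairs, here diabetes with heart disease in the toy older group, which is the kind of co-occurrence the abstract singles out.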
ABSTRACT
In landscape epidemiology, machine learning (ML) is a good alternative for modeling epidemiological risk scenarios. This study aims to break with the "black box" paradigm that underlies the application of automatic learning techniques by using SHAP to determine the contribution of each variable in ML models applied to geospatial health. As a case study, we use the prevalence of hookworms, intestinal parasites, in Ethiopia, where they are widely distributed; the country bears the third-highest burden of hookworm in Sub-Saharan Africa. XGBoost, a very popular ML model, was used to fit and analyze the data. The Python SHAP library was used to understand the importance of the variables for the predictions of the trained model. The contribution of these variables to a particular prediction was described using different types of plots. The results show that the ML models are superior to classical statistical models: they not only yield similar results but also explain, through the SHAP package, the influence of and interactions between the variables in the generated models. This analysis provides information to help understand the epidemiological problem presented and offers a tool for similar studies.
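To show what a per-variable contribution means here, the following minimal sketch computes exact Shapley values by brute force for a tiny hand-written model. It is not the SHAP library's API and not the authors' XGBoost model: the `predict` function, the three feature names, the input values, and the all-zero baseline are hypothetical, and absent features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def predict(z):
    """Hypothetical prevalence score with one interaction term."""
    temperature, rainfall, elevation = z
    return 2.0 * temperature + rainfall + 0.5 * temperature * rainfall - elevation

x = [2.0, 4.0, 1.0]         # hypothetical location to explain
baseline = [0.0, 0.0, 0.0]  # hypothetical reference location

phi = shapley_values(predict, x, baseline)
print([round(v, 6) for v in phi])  # [6.0, 6.0, -1.0]
```

The contributions sum to the difference between the prediction and the baseline prediction (11.0 here), and the interaction term is split equally between the two interacting features. Brute force scales exponentially with the number of features, which is why tree-specific algorithms are used in practice.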
ABSTRACT
Characterizing the spatiotemporal variability of the Urban Heat Island (UHI) and its drivers is a key step in leveraging thermal comfort not only to create healthier cities but also to enhance urban resilience to climate change. In this study, we developed specific daytime and nighttime multiple linear regression (MLR) and random forest (RF) models to analyze and predict the spatiotemporal evolution of the Urban Heat Island intensity (UHII), using the air temperature (Tair) as the response variable. We drew on a wealth of in situ Tair data and a comprehensive pool of predictor variables, including land cover, population, traffic, urban geometry, weather data, and atmospheric vertical indices. Cluster analysis divided the study period into three main groups, each dominated by a combination of weather systems that, in turn, influenced the onset and strength of the UHII. Anticyclonic circulations favored the emergence of the largest UHII (hourly mean of 5.06 °C), while cyclonic circulations dampened its development. The MLR models were only able to explain a modest percentage of the variance (64% and 34% for daytime and nighttime, respectively), which we attribute in part to their inability to capture key factors controlling Tair. The RF models, on the other hand, performed considerably better, explaining over 96% of the variance for both daytime and nighttime conditions and capturing and mapping the fine-scale spatiotemporal variability of Tair in both periods and under each cluster condition. The feature importance analysis showed that the meteorological variables and the land cover were the main predictors of Tair. Urban planners could benefit from these results, using the high-performing RF models as a robust framework for forecasting and mitigating the effects of the UHI.
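The percentages of variance explained quoted for the MLR and RF models are most naturally read as coefficients of determination. The following minimal sketch computes R² for a set of predictions, under the assumption that this is the metric behind those figures; the observed and predicted temperatures are invented.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1 - ss_res / ss_tot

# Toy air temperatures in degrees Celsius (hypothetical stations).
observed = [20.0, 22.0, 24.0, 26.0]
predicted = [21.0, 22.0, 24.0, 25.0]

r2 = r_squared(observed, predicted)
print(round(r2, 3))  # 0.9
```

An R² of 0.9 means that the residual error is 10% of the variance of the observations, so a jump from 34% to over 96% corresponds to a very large reduction in unexplained nighttime variability.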
Subjects
Hot Temperature , Meteorology , Cities , Linear Models , Temperature
ABSTRACT
Both reverse transcription-PCR (RT-PCR) and chest X-rays are used for the diagnosis of coronavirus disease 2019 (COVID-19). However, COVID-19 pneumonia does not have a defined set of radiological findings. Our work investigates radiomic features and classification models to differentiate chest X-ray images of COVID-19 pneumonia from other types of lung patterns. The goal is to provide grounds for understanding the distinctive COVID-19 radiographic texture features using supervised tree-based ensemble machine learning methods interpreted through the Shapley Additive Explanations (SHAP) approach. We use 2,611 COVID-19 chest X-ray images and 2,611 non-COVID-19 chest X-rays. After segmenting the lung in three zones and laterally, a histogram normalization is applied, and radiomic features are extracted. SHAP recursive feature elimination with cross-validation is used to select features. Hyperparameters of the XGBoost and Random Forest ensemble tree models are optimized using random search. The best classification model was XGBoost, with an accuracy of 0.82 and a sensitivity of 0.82. The explainable model showed the importance of the middle left and superior right lung zones in distinguishing COVID-19 pneumonia from other lung patterns.
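The two reported metrics can be read off a confusion matrix. The following minimal sketch computes accuracy and sensitivity for a binary classifier, with COVID-19 as the positive class; the labels and predictions are toy values and do not reproduce the study's results.

```python
def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    if tp + fn == 0:
        raise ValueError("no positive cases in y_true")
    accuracy = correct / len(pairs)
    sensitivity = tp / (tp + fn)  # share of true positives that were detected
    return accuracy, sensitivity

# Toy labels: 1 = COVID-19 pneumonia, 0 = other lung pattern.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

acc, sens = accuracy_and_sensitivity(y_true, y_pred)
print(acc, sens)  # 0.75 0.75
```

With balanced classes, as in the 2,611 versus 2,611 image set, accuracy is the average of sensitivity and specificity, so equal accuracy and sensitivity imply an equal specificity.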