ABSTRACT
Transportation is widely recognized as a significant contributor to heavy metal (HM) pollution in roadside soils. A better understanding of HM pollution in soils near expressways is crucial, particularly given the rapid expansion of expressway transportation in China in recent years. In this study, 329 roadside topsoil samples were collected along the Beijing-Tianjin Expressway, which connects two megacities in China. Chemical analysis showed that HM concentrations in the soil samples were generally below national limits. The mean pollution index (Pi) values for As, Cr, Cu, Ni, Pb, and Zn ranged from 0.94 to 1.01, while Cd and Hg exhibited slightly higher mean Pi values of 1.19 and 1.13, respectively. The Nemerow integrated pollution index values for all samples ranged from 0.71 to 4.97, with a mean of 1.26. This suggests a slight enrichment of HM above natural background levels, especially for Cd and Hg. Source apportionment using positive matrix factorization revealed that natural sources contributed the most to soil HMs (64.51 %), followed by agricultural sources (19.15 %), traffic sources (9.77 %), and industrial sources (6.57 %). The Shapley additive explanation analysis, based on the random forest model, identified soil organic carbon, deep soil HM content, altitude, total soil K2O, urbanization composite impact index, and total soil P as primary influencing factors. This indicates that the impact of transportation on roadside soils along the Beijing-Tianjin Expressway is currently relatively limited. The prominent influence of soil properties and altitude underscored the importance of "transport" and "receptor" in the soil HMs accumulation process at the local scale. These findings provide critical data and a scientific basis for decision-makers to develop policies for expressway design and roadside soil protection.
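The abstract does not state the formulas behind the two indices. As a minimal sketch, assuming the commonly used definitions (single-factor index Pi = Ci / Si, and Nemerow integrated index equal to the square root of the average of the squared mean Pi and the squared maximum Pi), the calculation could look as follows. The concentrations and reference values are hypothetical and are not taken from the study.

```python
import math


def single_pollution_index(concentration, reference):
    """Single-factor index Pi = Ci / Si (measured value over reference value)."""
    return concentration / reference


def nemerow_index(pi_values):
    """Nemerow integrated index: sqrt((mean(Pi)^2 + max(Pi)^2) / 2)."""
    mean_pi = sum(pi_values) / len(pi_values)
    max_pi = max(pi_values)
    return math.sqrt((mean_pi ** 2 + max_pi ** 2) / 2)


# Hypothetical sample: metal -> (measured mg/kg, reference mg/kg).
sample = {"Cd": (0.24, 0.20), "Hg": (0.09, 0.08), "Pb": (25.0, 25.0)}
indices = {m: single_pollution_index(c, s) for m, (c, s) in sample.items()}
integrated = nemerow_index(list(indices.values()))
print(indices, round(integrated, 3))
```

Because the maximum Pi enters the formula directly, a single strongly enriched element raises the integrated index even when the mean stays close to 1.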
ABSTRACT
Heat stress poses a significant challenge to livestock farming, particularly affecting the health and productivity of high-yield dairy cows. This study develops a machine learning framework for predicting the core body temperature (CBT) of dairy cows to enable more effective heat stress management and enhance animal welfare. The dataset includes 3005 records of physiological data from real-world production environments, encompassing environmental parameters, individual animal characteristics, and infrared temperature measurements. The machine learning algorithms employed include elastic net (EN), artificial neural networks (ANN), random forests (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and CatBoost, alongside optimization algorithms such as Bayesian optimization (BO) and grey wolf optimizer (GWO) that refine model performance through hyperparameter tuning. Comparative analysis of various feature sets reveals that the feature set incorporating the average infrared temperature of the trunk (IRTave_TK) excels in CBT prediction, achieving a coefficient of determination (R2) of 0.516, a mean absolute error (MAE) of 0.239 °C, and a root mean square error (RMSE) of 0.302 °C. Further analysis shows that the GWO-XGBoost model surpasses the others in predictive accuracy, with an R2 of 0.540, an RMSE of 0.294 °C, and an MAE of 0.232 °C, and leads in computational efficiency with an optimization time of only 2.41 s, approximately 4500 times faster than the highest-accuracy model. SHAP (SHapley Additive exPlanations) analysis identifies IRTave_TK, time zone (TZ), days in lactation (DOL), and body posture (BP) as the four most critical factors in predicting CBT, and reveals interaction effects of IRTave_TK with other features such as body posture and time period.
This study provides technological support for livestock management, facilitating the development and optimization of predictive models to implement timely and effective interventions, thereby maintaining the health and productivity of dairy cows.
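The three reported metrics (R2, MAE, RMSE) follow standard definitions. The sketch below computes them in plain Python. The temperature values are hypothetical and serve only to show the arithmetic.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


# Hypothetical core body temperatures in degrees Celsius.
measured = [38.5, 38.9, 39.2, 38.7]
predicted = [38.6, 38.8, 39.0, 38.9]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

Since RMSE squares each error before averaging, it is never smaller than MAE on the same data, which matches the ordering of the values reported in the abstract.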
ABSTRACT
Three-dimensional (3D) printing is a rapid prototyping technology that is widely used in manufacturing. Because the printing parameters strongly affect the printing results, they need to be optimized to obtain the best outcome. This study uses machine learning to predict the impact of 3D printing parameters on the printing results, to explain that impact through mathematical models, and to clarify whether parameters regarded as important in previous experience are in fact influential. We used four machine learning algorithms. Support vector regression (SVR) is a regression method that uses the principle of structural risk minimization to find the hyperplane in a high-dimensional space that best fits the data, with the goal of minimizing the generalization error bound. Random forest is an ensemble learning method that constructs a multitude of decision trees and outputs the mode of the classes predicted by the individual trees (classification) or their mean prediction (regression). Gradient boosting decision tree (GBDT) is an iterative ensemble technique that combines multiple weak prediction models (decision trees) into a strong one by sequentially minimizing the loss function; each subsequent tree is built to correct the errors of the previous trees. Extreme gradient boosting (XGB) is an optimized and efficient implementation of gradient boosting that incorporates techniques such as regularization and sparsity-aware splitting algorithms. Feature importance and SHAP (Shapley additive explanation) values were compared to determine which parameters have the greatest impact on the printing results. In the experiment, we used a dataset with multiple parameters and divided it into a training set and a test set. Through Bayesian optimization and grid search, we determined the best hyperparameters for each algorithm and used the best model to make predictions for the test set. We compared the predictive performance of the models and confirmed that the extrusion expansion ratio, elastic modulus, and elongation at break have the greatest influence on the printing results, which is consistent with previous experience. In the future, we will continue to investigate methods for optimizing 3D printing parameters and explore how interpretable machine learning can be applied to the 3D printing process to achieve more efficient and reliable printing results.
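To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a small toy function by enumerating all feature subsets. It is an illustration under stated assumptions, not the procedure used in the study: "missing" features are replaced by a single baseline value (a simplification of what SHAP libraries do), and `toy_model`, the input `x`, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical response: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is shared equally between the two features involved. Exhaustive enumeration grows exponentially with the number of features, which is why practical SHAP implementations rely on model-specific or sampling-based approximations.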
ABSTRACT
BACKGROUND: Electroencephalography (EEG) and electrocorticography (ECoG) recordings have been used to decode finger movements by analyzing brain activity. Traditional methods focused on single bandpass power changes for movement decoding, utilizing machine learning models requiring manual feature extraction. NEW METHOD: This study introduces a 3D convolutional neural network (3D-CNN) model to decode finger movements using ECoG data. The model employs adaptive, explainable AI (xAI) techniques to interpret the physiological relevance of brain signals. ECoG signals from epilepsy patients during awake craniotomy were processed to extract power spectral density across multiple frequency bands. These data formed a 3D matrix used to train the 3D-CNN to predict finger trajectories. RESULTS: The 3D-CNN model showed significant accuracy in predicting finger movements, with root-mean-square error (RMSE) values of 0.26-0.38 for single finger movements and 0.20-0.24 for combined movements. Explainable AI techniques, Grad-CAM and SHAP, identified the high gamma (HG) band as crucial for movement prediction, showing specific cortical regions involved in different finger movements. These findings highlighted the physiological significance of the HG band in motor control. COMPARISON WITH EXISTING METHODS: The 3D-CNN model outperformed traditional machine learning approaches by effectively capturing spatial and temporal patterns in ECoG data. The use of xAI techniques provided clearer insights into the model's decision-making process, unlike the "black box" nature of standard deep learning models. CONCLUSIONS: The proposed 3D-CNN model, combined with xAI methods, enhances the decoding accuracy of finger movements from ECoG data. This approach offers a more efficient and interpretable solution for brain-computer interface (BCI) applications, emphasizing the HG band's role in motor control.
Subjects
Electrocorticography , Fingers , Movement , Neural Networks, Computer , Humans , Fingers/physiology , Electrocorticography/methods , Movement/physiology , Adult , Male , Female , Epilepsy/physiopathology , Young Adult , Machine Learning , Signal Processing, Computer-Assisted
ABSTRACT
PURPOSE: To determine the factors influencing the likelihood of biochemical pregnancy loss (BPL) after transfer of a euploid embryo from preimplantation genetic testing for aneuploidy (PGT-A) cycles. METHODS: The study employed an observational, retrospective cohort design, encompassing 6020 embryos from 2879 PGT-A cycles conducted between February 2013 and September 2021. Trophectoderm biopsies in day 5 (D5) or day 6 (D6) blastocysts were analyzed by next generation sequencing (NGS). Only single embryo transfers (SET) were considered, totaling 1161 transfers. Of these, 49.9% resulted in positive pregnancy tests, with 18.3% experiencing BPL. To establish a predictive model for BPL, both classical statistical methods and five different supervised classification machine learning algorithms were used. A total of forty-seven factors were incorporated as predictor variables in the machine learning models. RESULTS: Throughout the optimization process for each model, various performance metrics were computed. Random Forest model emerged as the best model, boasting the highest area under the ROC curve (AUC) value of 0.913, alongside an accuracy of 0.830, positive predictive value of 0.857, and negative predictive value of 0.807. For the selected model, SHAP (SHapley Additive exPlanations) values were determined for each of the variables to establish which had the best predictive ability. Notably, variables pertaining to embryo biopsy demonstrated the greatest predictive capacity, followed by factors associated with ovarian stimulation (COS), maternal age, and paternal age. CONCLUSIONS: The Random Forest model had a higher predictive power for identifying BPL occurrences in PGT-A cycles. Specifically, variables associated with the embryo biopsy procedure (biopsy day, number of biopsied embryos, and number of biopsied cells) and ovarian stimulation (number of oocytes retrieved and duration of stimulation), exhibited the strongest predictive power.
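Accuracy, positive predictive value (PPV), and negative predictive value (NPV) are all derived from the confusion matrix of a binary classifier. A minimal sketch follows. The counts are hypothetical and do not reproduce the study's confusion matrix.

```python
def classification_summary(tp, fp, tn, fn):
    """Accuracy, PPV and NPV from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),   # share of predicted positives that are true
        "npv": tn / (tn + fn),   # share of predicted negatives that are true
    }


# Hypothetical counts for a loss / no-loss classifier.
summary = classification_summary(tp=40, fp=10, tn=45, fn=5)
print(summary)
```

Unlike AUC, all three quantities depend on the chosen decision threshold, and PPV and NPV also depend on how common the outcome is in the evaluated sample.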
Subjects
Abortion, Spontaneous , Aneuploidy , Genetic Testing , Machine Learning , Preimplantation Diagnosis , Humans , Female , Pregnancy , Preimplantation Diagnosis/methods , Retrospective Studies , Adult , Genetic Testing/methods , Abortion, Spontaneous/diagnosis , Abortion, Spontaneous/genetics , Abortion, Spontaneous/epidemiology , Embryo Transfer/methods , Blastocyst
ABSTRACT
BACKGROUND: Acute respiratory distress syndrome (ARDS) after cardiac surgery is a severe respiratory complication with high mortality and morbidity. Traditional clinical approaches may lead to under-recognition of this heterogeneous syndrome, potentially resulting in delayed diagnosis. This study aims to develop and externally validate seven machine learning (ML) models, trained on electronic health records data, for predicting ARDS after cardiac surgery. METHODS: This multicenter, observational cohort study included patients who underwent cardiac surgery in the training and testing cohorts (data from Nanjing First Hospital), as well as patients who had cardiac surgery in a validation cohort (data from Shanghai General Hospital). The number of important features was determined using the sliding windows sequential forward feature selection method (SWSFS). We developed a set of tree-based ML models, including Decision Tree, GBDT, AdaBoost, XGBoost, LightGBM, Random Forest, and Deep Forest. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC) and the Brier score. The SHapley Additive exPlanation (SHAP) technique was employed to interpret the ML model. Furthermore, the ML models were compared with traditional scoring systems. ARDS was defined according to the Berlin definition. RESULTS: A total of 1996 patients who had cardiac surgery were included in the study. The top five important features identified by the SWSFS were chronic obstructive pulmonary disease, preoperative albumin, central venous pressure_T4, cardiopulmonary bypass time, and left ventricular ejection fraction. Among the seven ML models, Deep Forest demonstrated the best performance, with an AUC of 0.882 and a Brier score of 0.809 in the validation cohort. Notably, the SHAP values effectively illustrated the contribution of the 13 features to the model output and each individual feature's effect on model prediction.
In addition, the ensemble ML models demonstrated better performance than six traditional scoring systems. CONCLUSIONS: Our study identified 13 important features and provided multiple ML models to enhance risk stratification for ARDS after cardiac surgery. Using these predictors and ML models might provide a basis for early diagnostic and preventive strategies in the perioperative management of ARDS patients.
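The Brier score used alongside AUC is, in its standard form, the mean squared difference between the predicted probability and the observed binary outcome. A minimal sketch follows, assuming that standard definition. The outcomes and probabilities are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = ARDS, 0 = no ARDS) and predicted probabilities.
outcomes = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.4]
score = brier_score(outcomes, probabilities)
print(round(score, 4))
```

Under this definition the score lies between 0 and 1, lower values indicate better probabilistic predictions, and a model that always predicts 0.5 scores 0.25.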
Subjects
Cardiac Surgical Procedures , Machine Learning , Respiratory Distress Syndrome , Humans , Respiratory Distress Syndrome/etiology , Male , Female , Middle Aged , Cohort Studies , Cardiac Surgical Procedures/adverse effects , Aged , ROC Curve , Area Under Curve
ABSTRACT
Predicting type 2 diabetes mellitus (T2DM) by using phenotypic data with machine learning (ML) techniques has received significant attention in recent years. PyCaret, a low-code automated ML tool that enables the simultaneous application of 16 different algorithms, was used to predict T2DM by using phenotypic variables from the "Nurses' Health Study" and "Health Professionals' Follow-up Study" datasets. Ridge Classifier, Linear Discriminant Analysis, and Logistic Regression (LR) were the best-performing models for the male-only data subset. For the female-only data subset, LR, Gradient Boosting Classifier, and CatBoost Classifier were the strongest models. The AUC, accuracy, and precision were approximately 0.77, 0.70, and 0.70 for males and 0.79, 0.70, and 0.71 for females, respectively. The feature importance plot showed that family history of diabetes (famdb), never having smoked, and high blood pressure (hbp) were the most influential features in females, while famdb, hbp, and currently being a smoker were the major variables in males. In conclusion, PyCaret was used successfully for the prediction of T2DM by simplifying complex ML tasks. Gender differences are important to consider for T2DM prediction. Despite this comprehensive ML tool, phenotypic variables alone may not be sufficient for early T2DM prediction; genotypic variables could also be used in combination for future studies.
ABSTRACT
BACKGROUND: Suicide is the second-leading cause of death among adolescents and is associated with clusters of suicides. Despite numerous studies on this preventable cause of death, the focus has primarily been on single nations and traditional statistical methods. OBJECTIVE: This study aims to develop a predictive model for adolescent suicidal thinking using multinational data sets and machine learning (ML). METHODS: We used data from the Korea Youth Risk Behavior Web-based Survey with 566,875 adolescents aged between 13 and 18 years and conducted external validation using the Youth Risk Behavior Survey with 103,874 adolescents and Norway's University National General Survey with 19,574 adolescents. Several tree-based ML models were developed, and feature importance and Shapley additive explanations values were analyzed to identify risk factors for adolescent suicidal thinking. RESULTS: When trained on the Korea Youth Risk Behavior Web-based Survey data from South Korea, the XGBoost model achieved an area under the receiver operating characteristic curve (AUROC) of 90.06% (95% CI 89.97-90.16), outperforming the other models. In external validation using the Youth Risk Behavior Survey data from the United States and the University National General Survey from Norway, the XGBoost model achieved AUROCs of 83.09% and 81.27%, respectively. Across all data sets, XGBoost consistently had the highest AUROC and was selected as the optimal model. Among predictors of suicidal thinking, feelings of sadness and despair were the most influential, accounting for 57.4% of the impact, followed by stress status at 19.8%. These were followed by age (5.7%), household income (4%), academic achievement (3.4%), sex (2.1%), and others, which contributed less than 2% each. CONCLUSIONS: This study used ML and integrated diverse data sets from 3 countries to address adolescent suicide.
The findings highlight the important role of emotional health indicators in predicting suicidal thinking among adolescents. Specifically, sadness and despair were identified as the most significant predictors, followed by stressful conditions and age. These findings emphasize the critical need for early diagnosis and prevention of mental health issues during adolescence.
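The AUROC values above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that pairwise definition, counting ties as one half. The labels and scores are hypothetical.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = suicidal thinking reported) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

This pairwise form runs in quadratic time, so for survey-scale data a rank-based or library implementation would be preferable. It is shown here only because it makes the interpretation explicit.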
Subjects
Machine Learning , Suicidal Ideation , Humans , Adolescent , Female , Male , Republic of Korea , Algorithms , Cohort Studies , Adolescent Behavior/psychology , Suicide/statistics & numerical data , Suicide/psychology , Norway , Surveys and Questionnaires , Risk Factors , Risk-Taking
ABSTRACT
Determining the non-linear relationship between ozone (O3) and its precursors is the basis of, and a key step in, developing O3 prevention and control measures. Based on online observations of O3, volatile organic compounds (VOCs), nitrogen oxides (NOx), and meteorological elements from April to September 2020 at an urban site in Beijing, we analyzed the pollution characteristics of O3 and its precursors, explored key factors affecting O3 using the random forest (RF) model combined with SHAP values, and explored the O3-VOCs-NOx sensitivity through a multi-scenario analysis. The correlation analysis showed that the hourly concentration of O3 was significantly positively correlated with temperature (T) and negatively correlated with TVOCs and NOx. However, in terms of the daily values, O3 was significantly positively correlated with T, TVOCs, and NOx. The O3 values simulated by the RF model agreed with the measured values. The SHAP values of each characteristic variable were further calculated. The results suggested that T and NOx had the two largest effects on O3, positive and negative, respectively. Based on the average NOx and VOCs on O3 pollution days during the observation period (the base scenario), multiple scenarios with different NOx and VOCs levels were set up. The RF model was used to calculate O3 under the different scenarios and obtain the O3 isopleth (EKMA curve). The results showed that the O3-VOCs-NOx sensitivity in urban areas of Beijing was in the VOCs-limited regime, which was consistent with the results obtained from the observation-based box model (OBM). This indicated that the RF model could be used as a complementary method for O3-VOCs-NOx sensitivity analysis.
ABSTRACT
PURPOSE: Machine learning (ML) models have shown excellent performance in prognosis prediction. However, the black-box nature of ML models limits their clinical application. Here, we aimed to establish explainable and visualizable ML models to predict biochemical recurrence (BCR) of prostate cancer (PCa). MATERIALS AND METHODS: A total of 647 PCa patients were retrospectively evaluated. Clinical parameters were selected using LASSO regression. The cohort was then split into training and validation datasets at a ratio of 0.75:0.25, and BCR-related features were included in Cox regression and five ML algorithms to construct BCR prediction models. The clinical utility of each model was evaluated by concordance index (C-index) values and decision curve analyses (DCA). In addition, Shapley Additive Explanation (SHAP) values were used to explain the features in the models. RESULTS: We identified 11 BCR-related features using LASSO regression and then established five ML-based models, namely random survival forest (RSF), survival support vector machine (SSVM), survival tree (sTree), gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost), as well as a Cox regression model. Their C-index values were 0.846 (95% CI 0.796-0.894), 0.774 (95% CI 0.712-0.834), 0.757 (95% CI 0.694-0.818), 0.820 (95% CI 0.765-0.869), 0.793 (95% CI 0.735-0.852), and 0.807 (95% CI 0.753-0.858), respectively. The DCA showed that the RSF model had significant advantages over all other models. Regarding interpretability, the SHAP values demonstrated the tangible contribution of each feature in the RSF model. CONCLUSIONS: Our scoring system provides a reference for identifying BCR and a framework for making personalized therapeutic decisions for PCa.
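The C-index used to compare the survival models measures how often the model assigns the higher risk to the patient who recurs earlier. The sketch below follows Harrell's pairwise definition under a simplifying assumption: pairs with tied follow-up times are skipped. The follow-up times, event indicators, and risk scores are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index for right-censored data (tied times ignored)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event has the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk counts as half
    return concordant / comparable


# Hypothetical follow-up (months), event flags (1 = BCR observed), risk scores.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.3, 0.1, 0.2]
print(concordance_index(times, events, risks))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Censored patients contribute only as the later member of a pair, because their true event time is unknown.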
Subjects
Machine Learning , Neoplasm Recurrence, Local , Prostatic Neoplasms , Humans , Male , Prostatic Neoplasms/blood , Prostatic Neoplasms/pathology , Neoplasm Recurrence, Local/blood , Neoplasm Recurrence, Local/pathology , Retrospective Studies , Aged , Middle Aged , Prognosis , Decision Trees , Proportional Hazards Models , Algorithms , Support Vector Machine , Prostate-Specific Antigen/blood
ABSTRACT
Ambient ammonia (NH3) is an important compound in the formation of particulate matter (PM), and it is therefore crucial to understand the properties of NH3 in order to better reduce PM. However, this goal is difficult to achieve because the NH3 data monitored by air quality stations are limited in range and real-time availability. While other studies have predicted NH3 and apportioned its sources, this manuscript provides a novel method (Geo-AI) to examine NH3 predictions and their contributing sources. This study represents a pioneering effort in the application of a novel geospatial-artificial intelligence (Geo-AI) base model with parcel tracking functions. This approach integrates various machine learning algorithms and geographic predictor variables to estimate NH3 concentrations, marking the first instance of such a comprehensive methodology. The Shapley additive explanation (SHAP) was used, together with domain knowledge, to further analyze the source contributions of NH3. From 2016 to 2018, Taichung's hourly average NH3 values were predicted with an explained variance of up to 96%. SHAP values revealed that, among all the characteristics, waterbody, traffic, and agriculture emissions were the most significant factors affecting NH3 concentrations in Taichung. Our methodology is a vital first step for shaping future policies and regulations and is adaptable to regions with limited monitoring sites.
Subjects
Air Pollutants , Air Pollution , Air Pollutants/analysis , Artificial Intelligence , Environmental Monitoring/methods , Air Pollution/analysis , Particulate Matter/analysis
ABSTRACT
Glutathione (GSH) production is of great industrial interest due to its essential properties. This study aimed to use machine learning (ML) methods to model GSH production under different growth conditions of Saccharomyces cerevisiae, namely cultivation time, culture volume, pressure, and magnetic field application. Different ML and regression models were compared on their performance statistics to select the most robust model. Results showed that eXtreme Gradient Boosting (XGB) had the best predictive performance. Additive explanation techniques were then applied to the best model to identify the importance of the process variables. According to the variable analysis, the best conditions to obtain the highest GSH concentrations would be cultivation times of 72-96 h, low magnetic field intensity (3.02 mT), low pressure (0.5 kgf·cm⁻²), and high culture volume (3.5 L). The use of XGB with additive explanation techniques proved promising for determining process optimization conditions and selecting the essential process variables.
Subjects
Glutathione , Saccharomyces cerevisiae , Industry , Light , Machine Learning
ABSTRACT
Construction workers face a high risk of various occupational accidents, many of which can result in fatalities. This study aims to develop a prediction model for nine prevalent types of construction accidents, utilizing construction tasks, activities, and tools/materials as input features, through the application of machine learning-based multi-class classification algorithms. A total of 152,867 construction accident summary reports, composed of both structured data (construction task, construction activity, accident type) and unstructured data (tools/materials), were used for the study. The study employed several data processing techniques, including keyword extraction through text mining, Boruta feature selection, and SMOTE data resampling, to enhance model accuracy. Three performance metrics (multi-class area under the receiver operating characteristic curve (MAUC), multi-class Matthews correlation coefficient (MMCC), and geometric mean (G-mean)) were used to compare the predictive performance of four machine learning algorithms: Decision Tree, Random Forest, Naïve Bayes, and XGBoost. Of the four algorithms, XGBoost showed the highest performance in predicting accident type (MAUC: 0.8603, MMCC: 0.3523, G-mean: 0.5009). Furthermore, a Shapley additive explanation (SHAP) analysis was conducted to visualize feature importance. The findings of this study make a valuable contribution to improving construction safety by presenting a prediction model for accident types derived from real-world big data.
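Of the three metrics, the G-mean is the simplest to state: in its usual multi-class form it is the geometric mean of the per-class recalls, so it rewards balanced performance across accident types. A minimal sketch under that assumption follows (the abstract does not give the exact formula). The class labels and predictions are hypothetical.

```python
import math
from collections import Counter


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall for a multi-class classifier."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]  # recall of each class
    return math.prod(recalls) ** (1.0 / len(recalls))


# Hypothetical accident types and predictions.
actual = ["fall", "fall", "struck", "struck", "caught", "caught"]
predicted = ["fall", "fall", "struck", "fall", "caught", "struck"]
print(round(g_mean(actual, predicted), 4))
```

Because the recalls are multiplied, a single class that is never predicted correctly drives the G-mean to zero. This is one reason resampling methods such as SMOTE are often paired with this metric on imbalanced data.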
Subjects
Accidents, Occupational , Construction Industry , Data Mining , Machine Learning , Data Mining/methods , Humans , Republic of Korea , Accidents, Occupational/prevention & control , Algorithms , Bayes Theorem
ABSTRACT
Pentachlorophenol (PCP) is a commonly found recalcitrant and toxic groundwater contaminant that resists degradation, bioaccumulates, and has a potential for long-range environmental transport. Taking proper action against the pollutant while accounting for life cycle consequences requires a better understanding of its behavior in the subsurface. We recognize the huge potential for enhancing decision-making at contaminated groundwater sites with the arrival of machine learning (ML) techniques in environmental applications. We used ML to enhance the understanding of the dynamics of PCP transport properties in the subsurface, and to determine key hydrochemical and hydrogeological drivers affecting its transport and fate. We demonstrate how this complementary knowledge, provided by data-driven methods, may enable more targeted planning of monitoring and remediation at two highly contaminated Swedish groundwater sites, where the method was validated. We evaluated 6 interpretable ML methods, 3 linear regressors and 3 non-linear (i.e., tree-based) regressors, to predict PCP concentration in the groundwater. The modeling results indicate that simple linear ML models were useful for predicting observations in datasets without missing values, while tree-based regressors were more suitable for datasets containing missing values. Considering that missing values are common in datasets collected during contaminated site investigations, this could be of significant importance for contaminated site planners and managers, ultimately reducing site investigation and monitoring costs. Furthermore, we interpreted the proposed models using the SHAP (SHapley Additive exPlanations) approach to decipher the importance of different drivers in the prediction and simulation of critical hydrogeochemical variables. Among these, the sum of chlorophenols is of highest significance in the analyses.
When that variable is set aside from the model, tetrachlorophenols, dissolved organic carbon, and conductivity are found to be of highest importance. Accordingly, ML methods could potentially be used to improve the understanding of groundwater contamination transport dynamics, filling gaps in knowledge that remain when using more sophisticated deterministic modeling approaches.
Subjects
Chlorophenols , Groundwater , Pentachlorophenol , Groundwater/chemistry , Environmental Pollution
ABSTRACT
Coastal harmful algal blooms (HABs) have become one of the challenging environmental problems in the world's thriving coastal cities due to the interference of multiple stressors from human activities and climate change. Past HAB predictions primarily relied on single-source data, overlooked upstream land use, and typically used a single prediction algorithm. To address these limitations, this study aims to develop predictive models to establish the relationship between the HAB indicator - chlorophyll-a (Chl-a) and various environmental stressors, under appropriate lagging predictive scenarios. To achieve this, we first applied the partial autocorrelation function (PACF) to Chl-a to precisely identify two prediction scenarios. We then combined multi-source data and several machine learning algorithms to predict harmful algae, using SHapley Additive exPlanations (SHAP) to extract key features influencing output from the prediction models. Our findings reveal an apparent 1-month autoregressive characteristic in Chl-a, leading us to create two scenarios: 1-month lead prediction and current-month prediction. The Extra Tree Regressor (ETR), with an R2 of 0.92, excelled in 1-month lead predictions, while the Random Forest Regressor (RFR) was most effective for current-month predictions with an R2 of 0.69. Additionally, we identified current month Chl-a, developed land use, total phosphorus, and nitrogen oxides (NOx) as critical features for accurate predictions. Our predictive framework, which can be applied to coastal regions worldwide, provides decision-makers with crucial tools for effectively predicting and mitigating HAB threats in major coastal cities.
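The lag selection described above rests on the partial autocorrelation function. As a sketch of how PACF values can be obtained, the code below estimates the sample autocorrelations and applies the Durbin-Levinson recursion. The abstract does not say which estimator was used, so this choice is an assumption, and the monthly Chl-a series is hypothetical.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r_0..r_max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = autocorrelation(series, max_lag)
    result = [1.0]
    prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_{k,k}.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


# Hypothetical monthly Chl-a concentrations.
chl_a = [2.1, 2.4, 3.0, 3.8, 3.5, 2.9, 2.2, 2.0, 2.6, 3.3, 3.9, 3.4]
print([round(v, 3) for v in pacf(chl_a, 3)])
```

A partial autocorrelation that is large at lag 1 and small at higher lags is the usual signature of first-order autoregressive behaviour. That pattern corresponds to the 1-month dependence the authors report.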
Subjects
Climate Change , Harmful Algal Bloom , Humans , Chlorophyll A , Cities , Phosphorus
ABSTRACT
The detection of Parkinson's disease (PD) in its early stages is of great importance for its treatment and management, but consensus is lacking on what information is necessary and what models should be used to best predict PD risk. In our study, we first grouped PD-associated factors based on their cost and accessibility, and then gradually incorporated them into risk predictions, which were built using eight commonly used machine learning models to allow for comprehensive assessment. Finally, the Shapley Additive Explanations (SHAP) method was used to investigate the contributions of each factor. We found that models built with demographic variables, hospital admission examinations, clinical assessment, and polygenic risk score achieved the best prediction performance, and the inclusion of invasive biomarkers could not further enhance its accuracy. Among the eight machine learning models considered, penalized logistic regression and XGBoost were the most accurate algorithms for assessing PD risk, with penalized logistic regression achieving an area under the curve of 0.94 and a Brier score of 0.08. Olfactory function and polygenic risk scores were the most important predictors for PD risk. Our research has offered a practical framework for PD risk assessment, where necessary information and efficient machine learning tools were highlighted.
Subjects
Parkinson Disease, Humans, Parkinson Disease/diagnosis, Parkinson Disease/genetics, Algorithms, Genetic Risk Score, Hospitalization, Machine Learning
ABSTRACT
BACKGROUND: The goal of this study was to assess the effectiveness of machine learning models and create an interpretable machine learning model that adequately explained 3-year all-cause mortality in patients with chronic heart failure. METHODS: The data in this paper were selected from patients with chronic heart failure (cardiac function class III-IV) who were hospitalized at the First Affiliated Hospital of Kunming Medical University from 2017 to 2019. The dataset was explored using six different machine learning models: logistic regression, naive Bayes, random forest classifier, extreme gradient boosting, K-nearest neighbor, and decision tree. Finally, interpretability methods for machine learning, such as SHAP values, permutation importance, and partial dependence plots (PDPs), were used to estimate the 3-year all-cause mortality risk and produce individual interpretations of the model's conclusions. RESULTS: Random forest was identified as the optimal algorithm for this dataset. We also incorporated relevant machine learning interpretability tools and techniques to improve disease prognosis, including permutation importance, PDPs, and SHAP values. The number of hospitalizations, age, glomerular filtration rate, BNP, NYHA cardiac function classification, absolute lymphocyte count, serum albumin, hemoglobin, total cholesterol, and pulmonary artery systolic pressure, among others, were important for providing an optimal risk assessment and were important predictive factors in chronic heart failure. CONCLUSION: The machine learning-based cardiovascular risk models could be used to accurately assess and stratify the 3-year risk of all-cause mortality among CHF patients. Machine learning in combination with permutation importance, PDPs, and SHAP values could offer a clear explanation of individual risk prediction and give doctors an intuitive understanding of the roles of important model features.
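Permutation importance, one of the interpretability methods named above, can be sketched as follows. This is an illustrative toy under stated assumptions, not the study's pipeline: the data are random, the "model" is a hypothetical fixed rule (`toy_model`) rather than a trained random forest, and accuracy is used as the score.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (assumption): the label depends only on the first feature.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier."""
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling a feature the model relies on breaks its link to the outcome and lowers the score, while shuffling a feature the model ignores leaves the score unchanged, which is why the second feature here receives an importance of zero.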
Subjects
Heart Failure, Humans, Bayes Theorem, Chronic Disease, Cluster Analysis, Machine Learning
ABSTRACT
Developing and evaluating statistical prediction models is challenging, and many pitfalls can arise. This article identifies what the authors believe are some common methodologic concerns that may be encountered. We describe each problem and make suggestions regarding how to address them. The hope is that this article will result in higher-quality publications of statistical prediction models.
Subjects
Statistical Models, Humans, ROC Curve
ABSTRACT
Introduction: Gut microbiota (GM) dysbiosis is one of the causal factors in the progression of several chronic metabolic diseases, including type 2 diabetes mellitus (T2D). Understanding the basis of this association may lead to new therapeutic strategies for preventing and treating T2D, such as probiotics, prebiotics, and fecal microbiota transplants. It may also help identify potential early detection biomarkers and develop personalized interventions based on an individual's gut microbiota profile. Here, we explore how supervised Machine Learning (ML) methods help to distinguish taxa for individuals with prediabetes or T2D. Methods: To this aim, we analyzed the GM profile (16S rRNA gene sequencing) in a cohort of 410 Mexican naïve patients stratified into normoglycemic, prediabetes, and T2D individuals. Then, we compared six different ML algorithms and found that Random Forest had the highest predictive performance in classifying T2D and prediabetes patients versus controls. Results: We identified a set of taxa for predicting patients with T2D compared to normoglycemic individuals, including Allisonella, Slackia, Ruminococcus_2, Megasphaera, Escherichia/Shigella, and Prevotella, among others. In addition, we concluded that Anaerostipes, Intestinibacter, Prevotella_9, Blautia, Granulicatella, and Veillonella were the relevant genera in patients with prediabetes compared to normoglycemic subjects. Discussion: These findings allow us to postulate that GM is a distinctive signature in prediabetes and T2D patients during the development and progression of the disease. Our study highlights the role of GM and opens a window toward the rational design of new preventive and personalized strategies for the control of this disease.
Subjects
Type 2 Diabetes Mellitus, Gastrointestinal Microbiome, Prediabetic State, Humans, Type 2 Diabetes Mellitus/diagnosis, Prediabetic State/diagnosis, Dysbiosis, 16S Ribosomal RNA/genetics, Machine Learning
ABSTRACT
Mammography is considered the gold standard for breast cancer screening. Multiple risk factors that affect breast cancer development have been identified; however, there is an ongoing debate regarding the significance of these factors. Machine learning (ML) models and Shapley Additive Explanation (SHAP) methodology can rank risk factors and provide explanatory model results. This study used ML algorithms with SHAP to analyze the risk factors between two different age groups and evaluate the impact of each factor in predicting positive mammography. The ML model was built using data from the risk factor questionnaires of women participating in a breast cancer screening program from 2017 to 2021. Three ML models, least absolute shrinkage and selection operator (lasso) logistic regression, extreme gradient boosting (XGBoost), and random forest (RF), were applied. RF generated the best performance. The SHAP values were then applied to the RF model for further analysis. The model identified age at menarche, education level, parity, breast self-examination, and BMI as the top five significant risk factors affecting mammography outcomes. The differences between age groups ranked by reproductive lifespan and BMI were higher in the younger and older age groups, respectively. The use of SHAP frameworks allows us to understand the relationships between risk factors and generate individualized risk factor rankings. This study provides avenues for further research and individualized medicine.
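The idea behind the individualized SHAP rankings mentioned above can be shown with an exact Shapley value calculation on a tiny model. This is a simplified sketch: it uses a single baseline input to represent "absent" features (the SHAP library typically averages over a background dataset instead), and the model `toy_model`, the input, and the baseline are all invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline input.

    Features outside a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this suits toy cases only.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

def toy_model(v):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * v[0] + v[1] * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # contributions sum to toy_model(x) - toy_model(baseline)
```

Each value is the feature's average marginal contribution over all orders in which features could be added, so the contributions for one individual always sum to the gap between that individual's prediction and the baseline prediction. Sorting them by magnitude gives a per-person ranking of factors.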