ABSTRACT
Soil potassium is a crucial nutrient for crop growth, and measuring it efficiently is essential for developing rational fertilization plans and optimizing crop growth benefits. Data mining based on near-infrared (NIR) spectroscopy has proven to be a powerful tool for real-time monitoring of soil potassium content. However, as technology and instruments improve, the curse-of-dimensionality problem grows accordingly, so efficient variable selection methods suited to NIR spectroscopy are urgently needed. In this study, we propose a three-step progressive hybrid variable selection strategy that leverages the respective strengths of several high-performance variable selection methods. Synergy interval partial least squares (SiPLS), random forest variable importance measurement (RF(VIM)), and the improved mean impact value algorithm (IMIV) are applied sequentially within a fusion framework, yielding a method for selecting important soil potassium variables, termed SiPLS-RF(VIM)-IMIV. The selected variables are then fitted with a partial least squares (PLS) model. Experimental results showed that the PLS model combined with the hybrid strategy improved prediction performance while reducing model complexity. The RMSET and RT on the test set were 0.01181% and 0.88246, respectively, better than those of the full-spectrum PLS, SiPLS, and SiPLS-RF(VIM) methods. This study shows that the hybrid strategy, which combines NIR spectroscopy data with the SiPLS-RF(VIM)-IMIV method, can quantitatively analyze soil potassium content and could potentially address other problems in data-driven dynamic soil monitoring.
ABSTRACT
Eucalyptus plantations are widespread in the highlands of northern Ethiopia. The species has been used for centuries for various purposes. However, there are controversies surrounding the species with excessive soil nutrient and water consumption. Modelling the spatial distribution of the species is fundamental to understand its ecological and hydrological effects in the region for policy inputs. Therefore, the purpose of this study is to develop a model for mapping the spatial distribution of Eucalyptus globulus. We used the spectral bands of Sentinel-2 data, vegetation indices, and environmental data as predictor variables and three machine learning algorithms (Random Forest, Support Vector Machine, and Boosted Regression Trees) to model the current distribution of Eucalyptus globulus. Eleven of the twenty-five predictor variables were filtered using a variance inflation factor (VIF). 419 in situ georeferenced data points were used for training, and validating the models. The area under the curve (AUC), kappa statistic (K), true skill statistic (TSS), Root Mean Squared Error and coefficient of determination (R2) were used to validate the models' performance. The model validation metrics confirmed the highest performance of Random Forest. The prediction map of Random Forest revealed that Eucalyptus globulus was fairly detected in non-Eucalyptus globulus woody vegetation (R2 = 0.86, P < 0.001; RMSE = 0.31). We found that the Green Normalized Difference Vegetation Index and environmental variables, such as elevation and distance from the road, were the most important predictor variables in explaining the distribution of Eucalyptus globulus. Our findings demonstrate that machine learning algorithms with Sentinel-2 spectral bands and vegetation indices compounded with environmental data can effectively model the spatial distribution of Eucalyptus globulus.
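Two of the validation metrics named in the abstract above, the kappa statistic and the true skill statistic (TSS), follow directly from a binary confusion matrix. The sketch below is illustrative only and is not the authors' code; it assumes presence/absence labels coded as 1 and 0, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels coded 1 (presence) and 0 (absence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def true_skill_statistic(tp, fp, fn, tn):
    """TSS = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Made-up example: 8 sites, 6 classified correctly.
counts = confusion_counts([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
tss = true_skill_statistic(*counts)
kappa = cohen_kappa(*counts)
```

Both statistics equal 0 for a classifier that performs at chance level and 1 for perfect agreement, which is why they are often reported alongside AUC for presence/absence models.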
ABSTRACT
The analysis of DNA methylation (DNAm) levels at specific CpG sites is one of the most promising molecular techniques for estimating an individual's age. To date, a considerable number of studies have reported age prediction models based on DNAm in body fluids, but only a few have used buccal swabs. The objective of this study was to identify age-dependent methylation CpG sites in three genes (HOXC4, TRIM59, and ELOVL2) in buccal swab samples from the Chinese Han population. A total of 461 buccal swabs, from individuals aged 0.4-80.8 years, were divided into a training set (n = 325) and a validation set (n = 136). Samples were analyzed by pyrosequencing, and age-related genes were identified using correlation coefficients. A random forest regression model including eight CpGs in three genes was ultimately proposed, with a mean absolute error (MAE) of 2.119 years. On the independent validation set, the model achieved an MAE of 4.391 years. Our findings illustrate that buccal swabs are a suitable alternative to other biological traces for age prediction based on DNAm patterns using pyrosequencing and random forest regression, with the additional advantage that they can be collected noninvasively.
ABSTRACT
Introduction: The primary objective of this study was to identify variables that significantly influence the implementation of math Response to Intervention (RTI) at the school level, utilizing the ECLS-K: 2011 dataset. Methods: Due to missing values in the original dataset, a Random Forest algorithm was employed for data imputation, generating a total of 10 imputed datasets. Elastic net logistic regression, combined with nested cross-validation, was applied to each imputed dataset, potentially resulting in 10 models with different variables. Variables for the models derived from the imputed datasets were selected using four methods, leading to four candidate models for final selection. These models were assessed based on their performance of prediction accuracy, culminating in the selection of the final model that outperformed the others. Results and discussion: Method50 and Methodcoef emerged as the most effective, achieving a balanced accuracy of 0.852. The ultimate model selected relevant variables that effectively predicted RTI. The predictive accuracy of the final model was also demonstrated by the receiver operating characteristic (ROC) plot and the corresponding area under the curve (AUC) value, indicating its ability to accurately forecast math RTI implementation in schools for the following year.
ABSTRACT
Background: The early prediction of cerebral edema changes in patients with spontaneous intracerebral hemorrhage (SICH) may facilitate earlier interventions and result in improved outcomes. This study aimed to develop and validate machine learning models to predict cerebral edema changes within 72 h, using readily available clinical parameters, and to identify relevant influencing factors. Methods: An observational study was conducted between April 2021 and October 2023 at the Quzhou Affiliated Hospital of Wenzhou Medical University. After preprocessing the data, the study population was randomly divided into training and internal validation cohorts in a 7:3 ratio (training: N = 150; validation: N = 65). The most relevant variables were selected using Support Vector Machine Recursive Feature Elimination (SVM-RFE) and Least Absolute Shrinkage and Selection Operator (LASSO) algorithms. The predictive performance of random forest (RF), GDBT, linear regression (LR), and XGBoost models was evaluated using the area under the receiver operating characteristic curve (AUROC), precision-recall curve (AUPRC), accuracy, F1-score, precision, recall, sensitivity, and specificity. Feature importance was calculated, and the SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) methods were employed to explain the top-performing model. Results: A total of 84 (39.1%) patients developed cerebral edema changes. In the validation cohort, GDBT outperformed LR and RF, achieving an AUC of 0.654 (95% CI: 0.611-0.699) compared to LR of 0.578 (95% CI, 0.535-0.623, DeLong: p = 0.197) and RF of 0.624 (95% CI, 0.588-0.687, DeLong: p = 0.236). XGBoost also demonstrated similar performance with an AUC of 0.660 (95% CI, 0.611-0.711, DeLong: p = 0.963). However, in the training set, GDBT still outperformed XGBoost, with an AUC of 0.603 ± 0.100 compared to XGBoost of 0.575 ± 0.096. 
SHAP analysis revealed that serum sodium, HDL, subarachnoid hemorrhage volume, sex, and left basal ganglia hemorrhage volume were the top five most important features for predicting cerebral edema changes in the GDBT model. Conclusion: The GDBT model demonstrated the best performance in predicting 72-h changes in cerebral edema. It has the potential to assist clinicians in identifying high-risk patients and guiding clinical decision-making.
ABSTRACT
Preeclampsia is a pregnancy syndrome characterized by complex symptoms which cause maternal and fetal problems and deaths. The aim of this study is to achieve preeclampsia risk prediction and early risk prediction in Xinjiang, China, based on the placental growth factor measured using the SiMoA or Elecsys platform. A novel reliable calibration modeling method and missing data imputing method are proposed, in which different strategies are used to adapt to small samples, training data, test data, independent features, and dependent feature pairs. Multiple machine learning algorithms were applied to train models using various datasets, such as single-platform versus bi-platform data, early pregnancy versus early plus non-early pregnancy data, and real versus real plus augmented data. It was found that a combination of two types of mono-platform data could improve risk prediction performance, and non-early pregnancy data could enhance early risk prediction performance when limited early pregnancy data were available. Additionally, the inclusion of augmented data resulted in achieving a high but unstable performance. The models in this study significantly reduced the incidence of preeclampsia in the region from 7.2% to 2.0%, and the mortality rate was reduced to 0%.
Subjects
Machine Learning, Pre-Eclampsia, Pre-Eclampsia/diagnosis, Pregnancy, Female, Humans, Prospective Studies, Calibration, Adult, China/epidemiology, Risk Assessment/methods, Placenta Growth Factor/blood, Placenta Growth Factor/metabolism, Risk Factors, Algorithms
ABSTRACT
Accurate and robust positioning has become increasingly essential for emerging applications and services. While GPS (global positioning system) is widely used for outdoor environments, indoor positioning remains a challenging task. This paper presents a novel architecture for indoor positioning, leveraging machine learning techniques and a divide-and-conquer strategy to achieve low error estimates. The proposed method achieves an MAE (mean absolute error) of approximately 1 m for latitude and longitude. Our approach provides a precise and practical solution for indoor positioning. Additionally, some insights on the best machine learning techniques for these tasks are also envisaged.
ABSTRACT
Assessing vegetation changes in alpine arid and fragile ecosystems is imperative for informed ecological restoration initiatives and adaptive ecosystem management. Previous studies primarily employed the Normalized Difference Vegetation Index (NDVI) to reveal vegetation dynamics, ignoring the spatial heterogeneity alterations caused by bare soil. In this study, we used a comprehensive analysis of NDVI and its spatial heterogeneity to examine the vegetation changes across the Three-River Headwaters Region (TRHR) over the past two decades. A random forest model was used to elucidate the underlying causes of these changes. We found that between 2000 and 2022, 9.4% of the regions exhibited significant changes in both NDVI and its spatial heterogeneity. These regions were categorized into six distinct types of vegetation change: improving conditions (62.1%), regrowing conditions (11.0%), slight degradation (16.2%), medium degradation (8.4%), severe degradation (2.0%), and desertification (0.3%). In comparison with steppe regions, meadows showed a greater proportion of improved conditions and medium degradation, whereas steppes had more instances of regrowth and slight degradation. Climate variables are the dominant factors that caused vegetation changes, with contributions to NDVI and spatial heterogeneity reaching 68.9% and 73.2%, respectively. Temperature is the primary driver of vegetation dynamics across the different types of change, with a more pronounced impact in meadows. In severely degraded steppe and meadow regions, grazing intensity emerged as the predominant driver of NDVI change, with an importance value exceeding 0.50. Notably, as degradation progressed from slight to severe, the significance of this factor correspondingly increased. Our findings can provide effective information for guiding the implementation of ecological restoration projects and the sustainable management of alpine arid ecosystems.
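For reference, the index on which the abstract above is built has a standard definition. The symbols below are notation chosen here, not taken from the abstract: the two reflectance terms denote surface reflectance in the near-infrared and red bands.

```latex
\[
\mathrm{NDVI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{Red}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{Red}}},
\qquad -1 \le \mathrm{NDVI} \le 1
\]
```

Dense green vegetation reflects strongly in the near-infrared and absorbs red light, so it yields values near the upper end of the range, while bare soil yields values close to zero.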
ABSTRACT
BACKGROUND/OBJECTIVES: This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance. METHODS: A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison. RESULTS: The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. CONCLUSIONS: This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
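The best-performing resampler in the abstract above, SMOTEENN, combines SMOTE oversampling with Edited Nearest Neighbours cleaning. The sketch below illustrates only the SMOTE half, using the standard library; it is not the implementation used in the study, the cleaning step is omitted, and the function name and parameters are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Made-up minority class: the four corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies rather than duplicating existing rows.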
ABSTRACT
Baijiu is popular, with a long history and a balanced flavor. Flavor type is the most widely used classification scheme for Baijiu. However, the evolutionary relationships among Baijiu flavor types and the markers that differentiate them remain unclear, which significantly affects the development of the Baijiu industry. In this study, a total of 319 trace components were identified using gas chromatography-olfactometry-mass spectrometry and gas chromatography-mass spectrometry. Among them, 91 trace components with high odor activity values or taste activity values were recognized as flavor components. Random forest analysis was then used to screen differential markers between the derived and basic flavor types, while principal component analysis assessed how effectively they distinguish the flavor types of Baijiu. Finally, 19 differential markers (3-methylbutyric acid, pentanoic acid, 2-butanol, 2,3-butanediol, ethyl propanoate, isobutyl acetate, ethyl butanoate, ethyl hexanoate, ethyl heptanoate, ethyl lactate, ethyl 2-hydroxybutanoate, isopentyl hexanoate, ethyl nonanoate, isopropyl myristate, ethyl tetradecanoate, ethyl benzoate, 2,4-di-t-butylphenol, 2-methylbutanal, and 3-octanone) were screened and shown to effectively reveal the evolution of Baijiu flavor types; these were further verified as key differential markers using addition tests and correlation analysis.
ABSTRACT
This work applied three machine learning (ML) models (linear regression (LR), random forest (RF), and support vector regression (SVR)) to predict the lattice parameters of the monoclinic B19' phase in two distinct training datasets: previously published ZrO2-based shape-memory ceramics (SMCs) and NiTi-based high-entropy shape-memory alloys (HESMAs). Our findings showed that LR provided the most accurate predictions for ac, am, bm, and cm in NiTi-based HESMAs, while RF excelled in computing βm for both datasets. SVR showed the largest deviation between the predicted and actual lattice parameters for both training datasets. A combined approach using the RF and LR models enhanced the accuracy of predicting lattice parameters of martensitic phases in various shape-memory materials for stable high-temperature applications.
ABSTRACT
This study focuses on the development and evaluation of soft sensor models for predicting NH3-N values in a wastewater treatment process. It compares the performance of linear regression (LR), neural network (NN), and random forest regression (RFR) models. The proposed methodology optimizes the sequencing batch reactor process using artificial intelligence and an automatic control system. Real-time NH3-N values are obtained by feeding data from electrical conductivity and temperature sensors into the prediction models. Once the predicted NH3-N value falls below the effluent standard, the cycle ends, improving energy efficiency and sustainability by cutting down on agitator and aerator operation. The results show that the RNN-based NH3-N soft sensor built in this study performs best, which is promising for wastewater treatment process optimization and evaluation. In particular, sensor model NNR[0.5Y]H, which uses a recurrent neural network with 5-step input delays, achieves an R² of 0.921, an RMSE of 6.110, and an MAE of 4.558. Based on these findings, recurrent neural network (RNN) variants emerge as the most effective modeling technique because they can capture temporal dependencies and handle variable-length sequences. The study reports satisfactory performance for the NNR[0.5Y]H soft sensor model in NH3-N monitoring and process optimization in wastewater treatment, highlighting the effectiveness of recurrent neural networks and their contribution to improving the interpretability, accuracy, and adaptability of soft sensor models.
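The "5-step input delays" mentioned above can be pictured as a sliding window over the sensor readings. The sketch below is a guess at that data layout rather than the authors' pipeline: it assumes each reading is a (conductivity, temperature) pair, that the window contains the current reading plus the four before it, and that the function name is hypothetical.

```python
def make_lagged_dataset(features, targets, n_lags=5):
    """Pair each target with the n_lags most recent feature readings.

    features: list of per-time-step tuples, e.g. (conductivity, temperature)
    targets:  list of NH3-N values, one per time step
    Returns a list of (flattened_window, target) pairs, starting at the first
    time step for which a full window is available.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    rows = []
    for t in range(n_lags - 1, len(features)):
        window = features[t - n_lags + 1 : t + 1]  # readings t-4 .. t
        flat = [value for reading in window for value in reading]
        rows.append((flat, targets[t]))
    return rows


# Made-up readings for seven time steps.
readings = [(i, 10 * i) for i in range(7)]
nh3n = list(range(100, 107))
dataset = make_lagged_dataset(readings, nh3n, n_lags=5)
```

A recurrent network would normally consume the window as a sequence instead of a flattened vector; flattening is used here only to keep the sketch free of dependencies.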
ABSTRACT
Objectives: This study aimed to develop and validate a machine learning prediction model for post-dispatch cancellation of a physician-staffed rapid car. Materials: Data were extracted from the physician-staffed rapid response car database at our hospital between April 2017 and March 2019. Methods: After obtaining 2019 cases, we divided the dataset into a training set for developing the model and a test set for validation using stratified random sampling with an 8:2 allocation ratio. We selected random forest as the machine learning classifier. The outcome was the post-dispatch cancellation of a rapid car. The model was trained using predictor variables including 18 different reasons for the rapid car request, the age and gender of the patient, date (month), and distance from the hospital. Results: The model predicted post-dispatch cancellation of rapid cars with an accuracy of 75.5% [95% confidence interval (CI): 71.0-79.6], a sensitivity of 81.5% (CI: 75.0-86.9), a specificity of 70.8% (CI: 64.4-76.6), and an area under the receiver operating characteristic curve of 0.83 (CI: 0.79-0.87). The important features were distance from the hospital to the scene, age, suspicion of non-witnessed cardiac arrest, farthest geographic area, and date (month). Conclusions: We developed a favorable machine learning model to predict post-dispatch cancellation of rapid cars in a local district. This study suggests the potential of machine learning models for improving the efficiency of dispatching physicians outside hospitals.
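The stratified 8:2 allocation described in the Methods above can be sketched with the standard library alone. This illustrates the sampling idea and is not the authors' procedure; the function name, the seed, and the example labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices so each class keeps about the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise within each class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Made-up outcome labels: 1 = dispatch cancelled, 0 = not cancelled.
outcome = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(outcome, train_fraction=0.8)
```

Splitting within each class keeps the cancellation rate in the test set close to that of the full dataset, which matters when sensitivity and specificity are reported on the held-out cases.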
ABSTRACT
Background: Low breast cancer survival rates in developing countries are mainly due to the lack of early detection programs and of adequate diagnosis and treatment facilities. Objectives: This study aimed to apply machine learning techniques to identify the most important breast cancer risk factors. Methods: This case-control study included women aged 17-75 years who were referred to medical centers affiliated with Mashhad University of Medical Science between March 21, 2015, and March 19, 2016. The study had two datasets: one with 516 samples (258 cases and 258 controls) and another with 606 samples (303 cases and 303 controls). Written informed consent was obtained. Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Principal Component Analysis (PCA) were applied using RStudio software. Results: With DT and RF, the most important features affecting breast cancer were family cancer, individual history of breast cancer, biopsy sampling, and rare consumption of dairy, fruit, and vegetable meals, whereas with PCA and LR they were family cancer, pregnancy number, pregnancy tendency, abortion, first menstruation, age at first childbirth, and childbirth number. Conclusions: Machine learning algorithms can be used to extract the most important factors in the diagnosis of breast cancer in developing countries such as Iran.
Subjects
Breast Neoplasms, Machine Learning, Humans, Female, Breast Neoplasms/epidemiology, Breast Neoplasms/diagnosis, Iran/epidemiology, Middle Aged, Adult, Risk Factors, Case-Control Studies, Adolescent, Aged, Young Adult, Logistic Models, Decision Trees, Principal Component Analysis
ABSTRACT
Grain crops are vulnerable to anthropogenic climate change and extreme temperature events. Despite this, previous studies have often neglected the impact of the spatio-temporal distribution of extreme temperature events on regional grain output. This research focuses on the Middle-Lower Yangtze Plains (MLYP) and aims to address this gap as well as to provide a renewed projection of climate-induced grain production variability for the rest of the century. The proposed model significantly outperforms the benchmark multilinear grain production model. By 2100, grain production in the MLYP is projected to decrease by over 100 tons under the low-radiative-forcing/sustainable development scenario (SSP126) and the medium-radiative-forcing scenario (SSP245), and by about 270 tons under the high-radiative-forcing/fossil-fueled development scenario (SSP585). Grain production may decline less than previously projected by studies using Representative Concentration Pathways. This difference is likely due to a decrease in coldwave frequency, which can offset the effects of more frequent heatwaves on grain production, combined with changes in supply-side policies. Notably, the frequency of encoded heatwaves and coldwaves has a stronger impact on grain production than precipitation and labor indicators; higher projected heatwave frequency corresponds to increased output variability over time. This study emphasizes the need to develop crop-specific mitigation and adaptation strategies against heat and cold stress amid global warming.
Subjects
Climate Change, Deep Learning, Edible Grain, China, Edible Grain/growth & development, Crops, Agricultural/growth & development
ABSTRACT
Headache is the most common type of pain following mild traumatic brain injury. Roughly half of those with persistent post-traumatic headache (PPTH) also report neck pain which is associated with greater severity and functional impact of headache. This observational cohort study aimed to identify biological phenotypes to help inform mechanism-based approaches in the management of PPTH with and without concomitant neck pain. Thirty-three military Veterans (mean (SD) = 37±16 years, 29 males) with PPTH completed a clinical assessment, quantitative sensory testing, and magnetic resonance imaging of the brain and cervical spine. Multidimensional phenotyping was performed using a Random Forest analysis and Partitioning Around Medoids (PAM) clustering of input features from three biologic domains: 1) resting state functional connectivity (rsFC) of the periaqueductal gray (PAG), 2) quality and size of cervical muscles, and 3) mechanical pain sensitivity and central modulation of pain. Two subgroups were distinguished by biological features that included forehead pressure pain threshold and rsFC between the PAG and selected nodes within the default mode, salience, and sensorimotor networks. Compared to the High Pain Coping group, the Low Pain Coping group exhibited higher pain-related anxiety (p=0.009), higher pain catastrophizing (p=0.004), lower pain self-efficacy (p=0.010), and greater headache-related disability (p=0.012). Findings suggest that greater functional connectivity of pain modulation networks involving the PAG combined with impairments in craniofacial pain sensitivity, but not cervical muscle health, distinguish a clinically important subgroup of individuals with PPTH who are less able to cope with pain and more severely impacted by headache.
ABSTRACT
BACKGROUND: Depression is a major global public health concern and often co-occurs with Non-Suicidal Self-Injury (NSSI). Focusing on adolescents with depression, this study aimed to quantify the importance of factors in predicting NSSI and to compare them between only-child and non-only-child groups, providing knowledge to support tailored intervention strategies. METHODS: A large multicenter survey was conducted in China. A total of 2510 adolescents diagnosed with Major Depressive Disorder (MDD) volunteered for the study. Thirty-six factors were included to train random forest models for NSSI prediction in the only-child and non-only-child groups separately. The SHapley Additive exPlanations (SHAP) method was used to compute the relative importance of each factor in the two groups. RESULTS: Adolescents with MDD exhibited a rather high prevalence of NSSI (52.0%), of whom 66.9% were non-only children. Self-esteem was the most significant factor for both groups, but critical differences between the groups were also found. In the only-child group, factors such as family support, parental overprotection, drinking alcohol, sleep conditions, and romantic relationship involvement showed greater importance, while higher depression degree, anxiety level, and emotional abuse were more important for non-only children. LIMITATIONS: The use of cross-sectional data from Chinese adolescents may limit deeper analysis of NSSI mechanisms and the generalizability to Western cultures. CONCLUSIONS: Only-child and non-only-child family structures may influence the factors related to NSSI occurrence in adolescents with MDD differently. Only children were more susceptible to vulnerable family environments, alcohol abuse, and romantic experience, while non-only children were more disturbed by abnormal mental states.
ABSTRACT
The benefits of Glycyrrhiza uralensis include removing heat, detoxifying, moistening the lungs, easing coughs, replenishing the spleen, and balancing other medications. This paper is expected to provide theoretical guidance for the development of the G. uralensis industry and for rural revitalization planning, as well as basic data for planning the industry's production layout at the county level, steering the direction of the cultivation industry, and establishing a high-quality G. uralensis cultivation technology system. The Maximum Entropy (MaxEnt) model was used to simulate the potential distribution of G. uralensis, a Chinese medicine resource, in Naiman Banner. Distribution information was acquired through a field survey and a broad assessment of the available Chinese medicine resources. The random forest technique was used to classify G. uralensis. Long-term time-series analysis elucidated the phenological cycle and development pattern of the vegetation, whose distinct temporal traits aid identification. The random forest classification algorithm based on multiple features showed high accuracy in remote sensing (RS) recognition of G. uralensis. Comparative analysis of the MaxEnt and RS results showed that the planting area of G. uralensis was smaller than its potential distribution. Expansion of planting into high-suitability areas should be prioritized. The dual regional and remote sensing analysis not only demonstrated the great potential of using geographic information to predict the distribution of G. uralensis but also verified the great potential of extracting its distribution from GF-6 images. These results will guide the planting and development of G. uralensis in Naiman Banner and provide a scientific basis for developing the G. uralensis economy, helping to optimize the ecological environment and promote rural revitalization programs.
Subjects
Glycyrrhiza uralensis, Remote Sensing Technology, Glycyrrhiza uralensis/growth & development, Remote Sensing Technology/methods, Algorithms, Models, Theoretical
ABSTRACT
The presence of adverse drug reactions (ADRs) is an ongoing public health concern. While traditional methods to discover ADRs are very costly and limited, it is prudent to predict ADRs through non-invasive methods such as machine learning based on existing data. Although various studies exist regarding ADR prediction using non-clinical data, a process that leverages both demographic and non-clinical data for ADR prediction is missing. In addition, the importance of individual features in ADR prediction has yet to be fully explored. This study aims to develop an ADR prediction model based on demographic and non-clinical data, where we identify the highest contributing factors. We focus our efforts on 30 common and severe ADRs reported to the Food and Drug Administration (FDA) between 2012 and 2023. We have developed a random forest (RF) and deep learning (DL) machine learning model that ingests demographic data (e.g., Age and Gender of patients) and non-clinical data, which includes chemical, molecular, and biological drug characteristics. We successfully unified both demographic and non-clinical data sources within a complete dataset regarding ADR prediction. Model performances were assessed via the area under the receiver operating characteristic curve (AUC) and the mean average precision (MAP). We demonstrated that our parsimonious models, which include only the top 20 most important features comprising 5 demographic features and 15 non-clinical features (13 molecular and 2 biological), achieve ADR prediction performance comparable to a less practical, feature-rich model consisting of all 2,315 features. Specifically, our models achieved an AUC of 0.611 and 0.674 for RF and DL algorithms, respectively. We hope our research provides researchers and clinicians with valuable insights and facilitates future research designs by identifying top ADR predictors (including demographic information) and practical parsimonious models.
Subjects
Drug-Related Side Effects and Adverse Reactions, Machine Learning, Humans, Drug-Related Side Effects and Adverse Reactions/epidemiology, Male, Female, United States, Middle Aged, United States Food and Drug Administration, Adverse Drug Reaction Reporting Systems/statistics & numerical data, Adult, Aged, Adolescent, Young Adult, Child, Deep Learning
ABSTRACT
BACKGROUND: Globally, pre-eclampsia (PE) is a leading cause of maternal and perinatal morbidity and mortality. PE prediction using routinely collected data has the advantage of being widely applicable, particularly in low-resource settings. Early intervention for high-risk women might reduce PE incidence and related complications. We aimed to replicate our machine learning (ML) published work predicting another maternal condition (gestational diabetes) to (1) predict PE using routine health data, (2) identify the optimal ML model, and (3) compare it with logistic regression approach. METHODS: Data were from a large health service network with 48,250 singleton pregnancies between January 2016 and June 2021. Supervised ML models were employed. Maternal clinical and medical characteristics were the feature variables (predictors), and a 70/30 data split was used for training and testing the model. Predictive performance was assessed using area under the curve (AUC) and calibration plots. Shapley value analysis assessed the contribution of feature variables. RESULTS: The random forest approach provided excellent discrimination with an AUC of 0.84 (95% CI: 0.82-0.86) and highest prediction accuracy (0.79); however, the calibration curve (slope of 1.21, 95% CI 1.13-1.30) was acceptable only for a threshold of 0.3 or less. The next best approach was extreme gradient boosting, which provided an AUC of 0.77 (95% CI: 0.76-0.79) and well-calibrated (slope of 0.93, 95% CI 0.85-1.01). Logistic regression provided good discrimination performance with an AUC of 0.75 (95% CI: 0.74-0.76) and perfect calibration. Nulliparous, pre-pregnancy body mass index, previous pregnancy with prior PE, maternal age, family history of hypertension, and pre-existing hypertension and diabetes were the top-ranked features in Shapley value analysis. 
CONCLUSION: Two ML models created the highest-performing prediction using routinely collected data to identify women at high risk of PE, with acceptable discrimination. However, to confirm this result and also examine model generalisability, external validation studies are needed in other settings, utilising standardised prognostic factors.