Results 1 - 20 of 55
1.
JMIR Public Health Surveill ; 10: e53322, 2024 08 15.
Article in English | MEDLINE | ID: mdl-39146534

ABSTRACT

BACKGROUND: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology of this disorder. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. OBJECTIVE: Using a sample of 55,257 patients (at a ratio of 1 patient with PASC to 4 matched controls) from the National COVID Cohort Collaborative, as part of the National Institutes of Health Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. The National COVID Cohort Collaborative includes electronic health records for more than 22 million patients from 84 sites across the United States. METHODS: We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal combination of gradient boosting and random forest algorithms to maximize the area under the receiver operating characteristic curve. We evaluated variable importance (Shapley values) at 3 levels: individual features, temporal windows, and clinical domains. We externally validated these findings using a holdout set of randomly selected study sites. RESULTS: We were able to predict individual PASC diagnoses accurately (area under the curve 0.874). The individual features of the length of observation period, number of health care interactions during acute COVID-19, and viral lower respiratory infection were the most predictive of subsequent PASC diagnosis. Temporally, we found that baseline characteristics were the most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after acute COVID-19. We found that the clinical domains of health care use, demographics or anthropometry, and respiratory factors were the most predictive of PASC diagnosis. CONCLUSIONS: The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings. Across individual predictors and clinical domains, we consistently found that factors related to health care use were the strongest predictors of PASC diagnosis. This indicates that any observational studies using PASC diagnosis as a primary outcome must rigorously account for heterogeneous health care use. Our temporal findings support the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients before acute COVID-19 diagnosis, which could improve early interventions and preventive care. Our findings also highlight the importance of respiratory characteristics in PASC risk assessment. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2023.07.27.23293272.
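
As a concrete illustration of the stacking approach described above, the sketch below combines gradient boosting and random forest through a cross-validated meta-learner and scores the result by ROC AUC, as in the study. It uses scikit-learn's StackingClassifier on synthetic data; the learners, data, and settings are illustrative stand-ins, not the authors' actual pipeline.

```python
# Minimal Super Learner-style stack: gradient boosting + random forest,
# combined by a logistic-regression meta-learner fit on out-of-fold
# predictions. Synthetic data stands in for the N3C cohort.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("gbm", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,  # out-of-fold predictions keep the meta-learner honest
)
stack.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))
```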


Subject(s)
COVID-19 , Post-Acute COVID-19 Syndrome , Humans , COVID-19/epidemiology , Cohort Studies , Female , Male , United States/epidemiology , Middle Aged , Aged , Adult , Risk Factors , Machine Learning
2.
J Environ Manage ; 359: 121040, 2024 May.
Article in English | MEDLINE | ID: mdl-38718609

ABSTRACT

This study comprehensively analyzes the impact of different economic and demographic factors, which affect economic development, on environmental performance. In this context, the study considers the Environmental Performance Index as the response variable; uses GDP per capita, tariff rate, tax burden, government expenditure, inflation, unemployment, population, income tax rate, public debt, FDI inflow, and corporate tax rate as the explanatory variables; examines 181 countries; applies a novel Super Learner (SL) algorithm that combines six machine learning (ML) algorithms; and uses data for the years 2018, 2020, and 2022. The results demonstrate that (i) the SL algorithm has superior predictive capacity compared with the individual ML algorithms; (ii) gross domestic product per capita is the most important factor for environmental performance, followed in order by tariff rate, tax burden, government expenditure, and inflation; (iii) the corporate tax rate has the lowest importance for environmental performance, followed by foreign direct investment, public debt, income tax rate, population, and unemployment; (iv) there are critical thresholds at which the impact of the factors on environmental performance changes. Overall, the study reveals the nonlinear impact of the variables on environmental performance as well as their relative importance and critical thresholds. The study thus provides policymakers with valuable insights for reformulating their environmental policies to improve environmental performance, and various policy options are discussed accordingly.


Subject(s)
Algorithms , Machine Learning , Environment , Economic Development , Gross Domestic Product
3.
BMC Med Inform Decis Mak ; 24(1): 97, 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38627734

ABSTRACT

BACKGROUND & AIM: Cardiovascular disease (CVD) is the leading cause of death in the world and has a potential impact on health care costs. This study aimed to evaluate the performance of machine learning survival models and determine the optimal model for predicting CVD-related mortality. METHOD: The research population comprised all participants in the Tehran Lipid and Glucose Study (TLGS) aged over 30 years. We used the Gradient Boosting model (GBM), Support Vector Machine (SVM), Super Learner (SL), and Cox proportional hazards (Cox-PH) models to predict CVD-related mortality using 26 features. The dataset was randomly divided into training (80%) and testing (20%) sets. To evaluate the performance of the methods, we used the Brier Score (BS), Prediction Error (PE), Concordance Index (C-index), and time-dependent Area Under the Curve (TD-AUC) criteria. Four different clinical models were also fitted to improve the performance of the methods. RESULTS: Of 9258 participants with a mean (SD; range) age of 43.74 (15.51; 20-91) years, 56.60% were female. The CVD death proportion was 2.5% (228 participants), and deaths occurred disproportionately in men (67.98% men vs 32.02% women). Based on predefined selection criteria, the SL method had the best performance in predicting CVD-related mortality (TD-AUC > 93.50%), while among the machine learning (ML) methods, the SVM had the worst (TD-AUC = 90.13%). According to the relative effects, age, fasting blood sugar, systolic blood pressure, smoking, taking aspirin, diastolic blood pressure, type 2 diabetes mellitus, hip circumference, body mass index (BMI), and triglycerides were identified as the most influential variables in predicting CVD-related mortality. CONCLUSION: Compared with the Cox-PH model, machine learning models showed promising and sometimes better performance in predicting CVD-related mortality. This finding is based on the analysis of a large and diverse urban population from Tehran, Iran.
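
For readers unfamiliar with the survival metrics named above, the hedged sketch below fits a Cox proportional hazards model with the lifelines package and reports its concordance index (C-index); the bundled Rossi dataset is a stand-in for the TLGS data, which are not public.

```python
# Illustrative Cox-PH fit and C-index with lifelines; the dataset is a
# placeholder, not the Tehran Lipid and Glucose Study.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                      # week = duration, arrest = event
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
print("C-index:", cph.concordance_index_)
```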


Subject(s)
Cardiovascular Diseases , Diabetes Mellitus, Type 2 , Male , Humans , Female , Adult , Cardiovascular Diseases/epidemiology , Glucose , Iran/epidemiology , Lipids
4.
BMC Med Res Methodol ; 24(1): 59, 2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38459490

ABSTRACT

BACKGROUND: The primary treatment for patients with myocardial infarction (MI) is percutaneous coronary intervention (PCI). Despite this, the incidence of major adverse cardiovascular events (MACEs) remains a significant concern. Our study seeks to optimize PCI predictive modeling by employing an ensemble learning approach to identify the most effective combination of predictive variables. METHODS AND RESULTS: We conducted a retrospective, non-interventional analysis of MI patient data from 2018 to 2021, focusing on those who underwent PCI. Our principal metric was the occurrence of 1-year postoperative MACEs. Variable selection was performed using lasso regression, and predictive models were developed using the Super Learner (SL) algorithm. Model performance was appraised by the area under the receiver operating characteristic curve (AUC) and the average precision (AP) score. Our cohort included 3,880 PCI patients, with 475 (12.2%) experiencing MACEs within one year. The SL model exhibited superior discriminative performance, achieving a validated AUC of 0.982 and an AP of 0.971, markedly surpassing traditional logistic regression models (AUC: 0.826, AP: 0.626) in the test cohort. Thirteen variables were significantly associated with the occurrence of 1-year MACEs. CONCLUSION: Implementing the Super Learner algorithm substantially enhanced the predictive accuracy for the risk of MACEs in MI patients. This advancement presents a promising tool for clinicians to craft individualized, data-driven interventions to improve patient outcomes.
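
The lasso-based variable selection step described above is easy to illustrate; the sketch below uses an L1-penalized logistic regression inside scikit-learn's SelectFromModel as a front end to a Super Learner-style ensemble. Data and the penalty strength are illustrative, not the study's settings.

```python
# Sketch of lasso-based variable selection ahead of ensemble modeling;
# synthetic data stands in for the PCI cohort.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3880, n_features=40, random_state=0)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector = SelectFromModel(lasso).fit(X, y)  # keep nonzero-coefficient features
X_selected = selector.transform(X)
print("variables retained:", X_selected.shape[1])
```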


Subject(s)
Acute Coronary Syndrome , Myocardial Infarction , Percutaneous Coronary Intervention , Humans , Percutaneous Coronary Intervention/adverse effects , Acute Coronary Syndrome/complications , Acute Coronary Syndrome/surgery , Retrospective Studies , Myocardial Infarction/surgery , Risk Factors
5.
Math Biosci Eng ; 21(1): 1413-1444, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38303471

ABSTRACT

The green concrete industry benefits from using gels to replace part of the cement in concrete. However, measuring the compressive strength of geopolymer concrete (CSGPoC) requires significant labor and expense, so predicting CSGPoC with a high level of accuracy is desirable. To this end, base learner and super learner machine learning models were proposed in this study to predict CSGPoC. The decision tree (DT) was applied as the base learner, and the random forest and extreme gradient boosting (XGBoost) techniques were used as the super learner system. A database of 259 CSGPoC data samples was assembled, of which four-fifths were used for training and one-fifth for testing. The values of fly ash, ground-granulated blast-furnace slag (GGBS), Na2SiO3, NaOH, fine aggregate, gravel 4/10 mm, gravel 10/20 mm, water/solids ratio, and NaOH molarity were used as model inputs to estimate CSGPoC. To evaluate the reliability and performance of the DT, XGBoost, and random forest (RF) models, 12 performance evaluation metrics were computed. The highest accuracy was achieved by the XGBoost model, with mean absolute error (MAE) of 2.073, mean absolute percentage error (MAPE) of 5.547, Nash-Sutcliffe (NS) of 0.981, correlation coefficient (R) of 0.991, R2 of 0.982, root mean square error (RMSE) of 2.458, Willmott's index (WI) of 0.795, weighted mean absolute percentage error (WMAPE) of 0.046, bias of 2.073, square index (SI) of 0.054, p of 0.027, mean relative error (MRE) of -0.014, and a20 of 0.983 for the training set, and MAE of 2.06, MAPE of 6.553, NS of 0.985, R of 0.993, R2 of 0.986, RMSE of 2.307, WI of 0.818, WMAPE of 0.05, bias of 2.06, SI of 0.056, p of 0.028, MRE of -0.015, and a20 of 0.949 for the testing set. Applying the testing set to the trained models yielded R2 values of 0.8969, 0.9857, and 0.9424 for the DT, XGBoost, and RF models, respectively, confirming the superiority of the XGBoost model in CSGPoC estimation. In conclusion, the XGBoost model predicts CSGPoC more accurately than the DT and RF models.
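
The sketch below shows the kind of gradient-boosted regression and error metrics (MAE, RMSE, R2) the abstract reports; the synthetic data merely stand in for the 259-sample geopolymer concrete dataset, and the hyper-parameters are placeholders.

```python
# Illustrative XGBoost regression with MAE/RMSE/R2 evaluation.
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(259, 9))             # 9 mix-design inputs, as in the study
y = X @ rng.normal(size=9) + rng.normal(scale=0.5, size=259)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("R2  :", r2_score(y_te, pred))
```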

6.
BioData Min ; 17(1): 2, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273386

ABSTRACT

BACKGROUND: Nowadays, the chance of discovering the best antibody candidates for predicting clinical malaria has notably increased due to the availability of multi-sera data. The analysis of these data is typically divided into a feature selection phase followed by a predictive one, in which several models are constructed for predicting the outcome of interest. A key question in the analysis is which antibodies should be included in the predictive stage and whether they should be included on the original or a transformed scale (i.e., binary/dichotomized). METHODS: To answer this question, we developed three approaches for antibody selection in the context of predicting clinical malaria: (i) a basic and simple approach based on selecting antibodies via the nonparametric Mann-Whitney-Wilcoxon test; (ii) an optimal dichotomization approach in which each antibody was selected according to the optimal cut-off via maximization of the chi-squared (χ2) statistic for two-way tables; (iii) a hybrid parametric/nonparametric approach that integrates a Box-Cox transformation followed by a t-test, together with the use of finite mixture models and the Mann-Whitney-Wilcoxon test as a last resort. We illustrate the application of these three approaches with published serological data on 36 Plasmodium falciparum antigens for predicting clinical malaria in 121 Kenyan children. The predictive analysis was based on a Super Learner in which predictions from multiple classifiers, including the Random Forest, were pooled together. RESULTS: The three approaches yielded similar areas under the Receiver Operating Characteristic curve: 0.72 (95% CI = [0.62, 0.82]), 0.80 (95% CI = [0.71, 0.89]), and 0.79 (95% CI = [0.70, 0.88]) for the simple, dichotomization, and hybrid approaches, respectively, based on 6, 20, and 16 antibodies. CONCLUSIONS: The three feature selection strategies provided better predictive performance than previous results relying on a Random Forest including all 36 antibodies (AUC = 0.68, 95% CI = [0.57, 0.79]). Given the similar predictive performance, we recommend that the three strategies be used in conjunction on the same dataset and selected according to their complexity.
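
Approach (ii) lends itself to a compact implementation: scan candidate cut-offs for an antibody and keep the one that maximizes the chi-squared statistic of the resulting two-way table against the clinical outcome. The sketch below is a generic rendering of that description; function and variable names are ours, not the authors'.

```python
# Optimal dichotomization by chi-squared maximization over a 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

def optimal_cutoff(antibody, outcome):
    """Return the cut-off maximizing the 2x2 chi-squared statistic."""
    best_cut, best_chi2 = None, -np.inf
    for cut in np.unique(antibody)[:-1]:          # candidate thresholds
        high = antibody > cut
        table = np.array(
            [[np.sum(high & (outcome == 1)), np.sum(high & (outcome == 0))],
             [np.sum(~high & (outcome == 1)), np.sum(~high & (outcome == 0))]])
        if table.min() == 0:                      # skip degenerate tables
            continue
        chi2 = chi2_contingency(table)[0]
        if chi2 > best_chi2:
            best_cut, best_chi2 = cut, chi2
    return best_cut, best_chi2
```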

7.
Pharmacoepidemiol Drug Saf ; 33(1): e5678, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37609668

ABSTRACT

PURPOSE: High-dimensional propensity score (hdPS) is a semiautomated method that leverages a vast number of covariates available in healthcare databases to improve confounding adjustment. A novel combined Super Learner (SL)-hdPS approach was proposed to assist with selecting the number of covariates for propensity score inclusion, and was found in plasmode simulation studies to improve bias reduction and precision compared to hdPS alone. However, the approach has not been examined in the applied setting. METHODS: We compared SL-hdPS's performance with that of several hdPS models, each with prespecified covariates and a different number of empirically-identified covariates, using a cohort study comparing real-world bleeding rates between ibrutinib- and bendamustine-rituximab (BR)-treated individuals with chronic lymphocytic leukemia in Optum's de-identified Clinformatics® Data Mart commercial claims database (2013-2020). We used inverse probability of treatment weighting for confounding adjustment and Cox proportional hazards regression to estimate hazard ratios (HRs) for bleeding outcomes. Parameters of interest included prespecified and empirically-identified covariate balance (absolute standardized difference [ASD] thresholds of <0.10 and <0.05) and outcome HR precision (95% confidence intervals). RESULTS: We identified 2423 ibrutinib- and 1102 BR-treated individuals. Including >200 empirically-identified covariates in the hdPS model compromised covariate balance at both ASD thresholds. SL-hdPS balanced more covariates than all individual hdPS models at both ASD thresholds. The bleeding HR 95% confidence intervals were generally narrower with SL-hdPS than with individual hdPS models. CONCLUSION: In a real-world application, hdPS was sensitive to the number of covariates included, while use of SL for covariate selection resulted in improved covariate balance and possibly improved precision.
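
For context, the covariate-balance criterion used above is the absolute standardized difference (ASD); the sketch below shows a weighted version appropriate after inverse probability of treatment weighting. It is a generic illustration, not code from the study.

```python
# Weighted absolute standardized difference (ASD) for one covariate.
import numpy as np

def weighted_asd(x, treated, w):
    """ASD of covariate x between arms, under weights w (e.g., IPTW)."""
    t = treated.astype(bool)
    m1 = np.average(x[t], weights=w[t])
    m0 = np.average(x[~t], weights=w[~t])
    v1 = np.average((x[t] - m1) ** 2, weights=w[t])
    v0 = np.average((x[~t] - m0) ** 2, weights=w[~t])
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)

# Balance check against the thresholds used in the abstract:
# balanced = weighted_asd(x, treated, w) < 0.10   (or the stricter 0.05)
```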


Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell , Humans , Propensity Score , Cohort Studies , Leukemia, Lymphocytic, Chronic, B-Cell/drug therapy , Proportional Hazards Models , Computer Simulation
8.
Biometrics ; 79(4): 2815-2829, 2023 12.
Article in English | MEDLINE | ID: mdl-37641532

ABSTRACT

We consider the problem of optimizing treatment allocation for statistical efficiency in randomized clinical trials. Optimal allocation has been studied previously for simple treatment effect estimators such as the sample mean difference, which are not fully efficient in the presence of baseline covariates. More efficient estimators can be obtained by incorporating covariate information, and modern machine learning methods make it increasingly feasible to approach full efficiency. Accordingly, we derive the optimal allocation ratio by maximizing the design efficiency of a randomized trial, assuming that an efficient estimator will be used for analysis. We then expand the scope of optimization by considering covariate-dependent randomization (CDR), which has some flavor of an observational study but provides the same level of scientific rigor as a standard randomized trial. We describe treatment effect estimators that are consistent, asymptotically normal, and (nearly) efficient under CDR, and derive the optimal propensity score by maximizing the design efficiency of a CDR trial (under the assumption that an efficient estimator will be used for analysis). Our optimality results translate into optimal designs that improve upon standard practice. Real-world examples and simulation results demonstrate that the proposed designs can produce substantial efficiency improvements in realistic settings.
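
For background, the classical result for the simple difference-in-means estimator mentioned above is the Neyman allocation; the abstract's contribution is to generalize this style of result to efficient, covariate-adjusted estimators and covariate-dependent randomization. A brief statement of the classical case:

```latex
% Neyman allocation for the unadjusted difference-in-means: with
% potential-outcome standard deviations \sigma_1 (treated) and
% \sigma_0 (control), the variance-minimizing treatment share is
\pi^{*} \;=\; \frac{\sigma_1}{\sigma_1 + \sigma_0}.
```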


Subject(s)
Models, Statistical , Randomized Controlled Trials as Topic , Computer Simulation , Propensity Score
9.
Biostatistics ; 2023 Aug 02.
Article in English | MEDLINE | ID: mdl-37531621

ABSTRACT

Cluster randomized trials (CRTs) often enroll large numbers of participants; yet due to resource constraints, only a subset of participants may be selected for outcome assessment, and those sampled may not be representative of all cluster members. Missing data also present a challenge: if sampled individuals with measured outcomes are dissimilar from those with missing outcomes, unadjusted estimates of arm-specific endpoints and the intervention effect may be biased. Further, CRTs often enroll and randomize few clusters, limiting statistical power and raising concerns about finite sample performance. Motivated by SEARCH-TB, a CRT aimed at reducing incident tuberculosis (TB) infection, we demonstrate interlocking methods to handle these challenges. First, we extend Two-Stage targeted minimum loss-based estimation to account for three sources of missingness: (i) subsampling; (ii) measurement of baseline status among those sampled; and (iii) measurement of final status among those in the incidence cohort (persons known to be at risk at baseline). Second, we critically evaluate the assumptions under which subunits of the cluster can be considered the conditionally independent unit, improving precision and statistical power but also causing the CRT to behave like an observational study. Our application to SEARCH-TB highlights the real-world impact of different assumptions on measurement and dependence; estimates relying on unrealistic assumptions suggested the intervention increased the incidence of TB infection by 18% (risk ratio [RR] = 1.18, 95% confidence interval [CI]: 0.85-1.63), while estimates accounting for the sampling scheme, missingness, and within-community dependence found the intervention decreased incident TB infection by 27% (RR = 0.73, 95% CI: 0.57-0.92).

10.
Stat Med ; 42(23): 4147-4176, 2023 10 15.
Article in English | MEDLINE | ID: mdl-37532119

ABSTRACT

There has been growing interest in using nonparametric machine learning approaches for propensity score estimation in order to foster robustness against misspecification of the propensity score model. However, the vast majority of studies focused on single-level data settings, and research on nonparametric propensity score estimation in clustered data settings is scarce. In this article, we extend existing research by describing a general algorithm for incorporating random effects into a machine learning model, which we implemented for generalized boosted modeling (GBM). In a simulation study, we investigated the performance of logistic regression, GBM, and Bayesian additive regression trees for inverse probability of treatment weighting (IPW) when the data are clustered, the treatment exposure mechanism is nonlinear, and unmeasured cluster-level confounding is present. For each approach, we compared fixed and random effects propensity score models to single-level models and evaluated their use in both marginal and clustered IPW. We additionally investigated the performance of the standard Super Learner and the balance Super Learner. The results showed that when there was no unmeasured confounding, logistic regression resulted in moderate bias in both marginal and clustered IPW, whereas the nonparametric approaches were unbiased. In presence of cluster-level confounding, fixed and random effects models greatly reduced bias compared to single-level models in marginal IPW, with fixed effects GBM and fixed effects logistic regression performing best. Finally, clustered IPW was overall preferable to marginal IPW and the balance Super Learner outperformed the standard Super Learner, though neither worked as well as their best candidate model.
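
The inverse probability of treatment weights evaluated throughout the abstract take a standard form regardless of how the propensity score is estimated; a minimal sketch (with optional stabilization) follows, independent of the specific GBM or BART models compared.

```python
# Standard ATE-type IPW weights from an estimated propensity score.
import numpy as np

def iptw_weights(treatment, propensity, stabilized=True):
    """IPW weights; optionally stabilized by the marginal treatment rate."""
    p = np.clip(propensity, 1e-6, 1 - 1e-6)   # guard against extreme scores
    w = treatment / p + (1 - treatment) / (1 - p)
    if stabilized:
        pt = treatment.mean()
        w = np.where(treatment == 1, pt, 1 - pt) * w
    return w
```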


Subject(s)
Multilevel Analysis , Observational Studies as Topic , Propensity Score , Humans , Bayes Theorem , Bias , Computer Simulation , Logistic Models
11.
Clin Res Cardiol ; 112(9): 1288-1301, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37131096

ABSTRACT

BACKGROUND: In suspected myocardial infarction (MI), guidelines recommend using high-sensitivity cardiac troponin (hs-cTn)-based approaches. These require fixed assay-specific thresholds and timepoints, without directly integrating clinical information. Using machine-learning techniques including hs-cTn and clinical routine variables, we aimed to build a digital tool to directly estimate the individual probability of MI, allowing for numerous hs-cTn assays. METHODS: In 2,575 patients presenting to the emergency department with suspected MI, two ensembles of machine-learning models using single or serial concentrations of six different hs-cTn assays were derived to estimate the individual MI probability (ARTEMIS model). Discriminative performance of the models was assessed using area under the receiver operating characteristic curve (AUC) and logLoss. Model performance was validated in an external cohort with 1688 patients and tested for global generalizability in 13 international cohorts with 23,411 patients. RESULTS: Eleven routinely available variables including age, sex, cardiovascular risk factors, electrocardiography, and hs-cTn were included in the ARTEMIS models. In the validation and generalization cohorts, excellent discriminative performance was confirmed, superior to hs-cTn only. For the serial hs-cTn measurement model, AUC ranged from 0.92 to 0.98. Good calibration was observed. Using a single hs-cTn measurement, the ARTEMIS model allowed direct rule-out of MI with very high and similar safety but up to tripled efficiency compared to the guideline-recommended strategy. CONCLUSION: We developed and validated diagnostic models to accurately estimate the individual probability of MI, which allow for variable hs-cTn use and flexible timing of resampling. Their digital application may provide rapid, safe and efficient personalized patient care. TRIAL REGISTRATION NUMBERS: Data of following cohorts were used for this project: BACC ( www. CLINICALTRIALS: gov ; NCT02355457), stenoCardia ( www. CLINICALTRIALS: gov ; NCT03227159), ADAPT-BSN ( www.australianclinicaltrials.gov.au ; ACTRN12611001069943), IMPACT ( www.australianclinicaltrials.gov.au , ACTRN12611000206921), ADAPT-RCT ( www.anzctr.org.au ; ANZCTR12610000766011), EDACS-RCT ( www.anzctr.org.au ; ANZCTR12613000745741); DROP-ACS ( https://www.umin.ac.jp , UMIN000030668); High-STEACS ( www. CLINICALTRIALS: gov ; NCT01852123), LUND ( www. CLINICALTRIALS: gov ; NCT05484544), RAPID-CPU ( www. CLINICALTRIALS: gov ; NCT03111862), ROMI ( www. CLINICALTRIALS: gov ; NCT01994577), SAMIE ( https://anzctr.org.au ; ACTRN12621000053820), SEIGE and SAFETY ( www. CLINICALTRIALS: gov ; NCT04772157), STOP-CP ( www. CLINICALTRIALS: gov ; NCT02984436), UTROPIA ( www. CLINICALTRIALS: gov ; NCT02060760).


Subject(s)
Myocardial Infarction , Troponin I , Humans , Angina Pectoris , Biomarkers , Myocardial Infarction/diagnosis , ROC Curve , Troponin T , Clinical Studies as Topic
12.
Stat Med ; 42(13): 2116-2133, 2023 06 15.
Article in English | MEDLINE | ID: mdl-37004994

ABSTRACT

Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood-based loss function. K-fold cross-validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out-of-sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open-source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM.
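
The convex-combination step has a compact expression: choose simplex weights over candidate precision-matrix estimates to maximize held-out Gaussian log-likelihood. The sketch below is our hedged rendering of that idea in Python with scipy; the authors' actual implementation is the ensembleGGM R package linked above.

```python
# Convex combination of candidate precision matrices by held-out
# Gaussian log-likelihood; a sketch of the SpiderLearner idea.
import numpy as np
from scipy.optimize import minimize

def neg_loglik(weights, thetas, S):
    """Negative Gaussian log-likelihood of a weighted precision matrix."""
    theta = sum(w * t for w, t in zip(weights, thetas))
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:                       # combination must stay positive definite
        return np.inf
    return -(logdet - np.trace(S @ theta))

def spider_weights(thetas, X_holdout):
    S = np.cov(X_holdout, rowvar=False)          # held-out sample covariance
    k = len(thetas)
    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1},)
    res = minimize(neg_loglik, np.full(k, 1 / k), args=(thetas, S),
                   bounds=[(0, 1)] * k, constraints=cons, method="SLSQP")
    return res.x
```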


Subject(s)
Algorithms , Normal Distribution , Humans , Likelihood Functions , Software , Gene Expression , Ovarian Neoplasms/genetics
13.
Int J Epidemiol ; 52(4): 1276-1285, 2023 08 02.
Article in English | MEDLINE | ID: mdl-36905602

ABSTRACT

Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
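
At the heart of the workflow this article teaches is the comparison of learners by cross-validated risk; a minimal sketch follows, ending with the "discrete" super learner step of picking the single learner with the lowest CV risk. The candidate library and data are illustrative, not the article's recommendations.

```python
# Cross-validated risk for a small learner library, then discrete SL selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)
learners = {
    "logistic": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}
cv_risk = {}
for name, est in learners.items():
    # out-of-fold probabilities give an honest estimate of each learner's risk
    p = cross_val_predict(est, X, y, cv=10, method="predict_proba")[:, 1]
    cv_risk[name] = log_loss(y, p)
best = min(cv_risk, key=cv_risk.get)   # the discrete super learner selection
print(cv_risk, "->", best)
```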


Subject(s)
Algorithms , Machine Learning , Humans
14.
J Appl Stat ; 50(3): 744-760, 2023.
Article in English | MEDLINE | ID: mdl-36819084

ABSTRACT

Causal inference under the potential outcomes framework relies on the strongly ignorable treatment assignment assumption. This assumption is usually questionable in observational studies, and unmeasured confounding is one of the fundamental challenges in causal inference. To this end, we propose a new sensitivity analysis method to evaluate the impact of an unmeasured confounder by leveraging ideas from doubly robust estimators, the exponential tilt method, and the super learner algorithm. Compared with other existing sensitivity analysis methods that parameterize the unmeasured confounder as a latent variable in the working models, the exponential tilting method does not impose any restrictions on the structure or models of the unmeasured confounders. In addition, to reduce the modeling bias of traditional parametric methods, we incorporate the super learner machine learning algorithm to perform nonparametric model estimation and the corresponding sensitivity analysis. Furthermore, most existing sensitivity analysis methods require multivariate sensitivity parameters, whose choice is difficult and subjective in practice. In comparison, the new method has a univariate sensitivity parameter with a simple interpretation as a log-odds ratio for binary outcomes, which makes its choice and the application of the method straightforward for practitioners.
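
For intuition about the exponential tilt, a generic single-parameter specification is shown below; the paper's exact model may differ, so treat this as background rather than the authors' formulation.

```latex
% Generic exponential tilt with a single (univariate) sensitivity
% parameter \eta: the unobserved counterfactual density is a tilted,
% renormalized version of the observed one.
f_{Y(a)\mid A=1-a}(y)
  \;=\; \frac{f_{Y(a)\mid A=a}(y)\, e^{\eta y}}
             {\int f_{Y(a)\mid A=a}(t)\, e^{\eta t}\, dt}
% For binary Y, this shifts the log-odds of Y = 1 by exactly \eta,
% matching the log-odds-ratio interpretation noted in the abstract.
```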

15.
J Appl Stat ; 50(3): 805-826, 2023.
Article in English | MEDLINE | ID: mdl-36819087

ABSTRACT

Multi-parametric MRI (mpMRI) is a critical tool in prostate cancer (PCa) diagnosis and management. To further advance the use of mpMRI in patient care, computer aided diagnostic methods are under continuous development for supporting/supplanting standard radiological interpretation. While voxel-wise PCa classification models are the gold standard, few if any approaches have incorporated the inherent structure of the mpMRI data, such as spatial heterogeneity and between-voxel correlation, into PCa classification. We propose a machine learning-based method to fill in this gap. Our method uses an ensemble learning approach to capture regional heterogeneity in the data, where classifiers are developed at multiple resolutions and combined using the super learner algorithm, and further account for between-voxel correlation through a Gaussian kernel smoother. It allows any type of classifier to be the base learner and can be extended to further classify PCa sub-categories. We introduce the algorithms for binary PCa classification, as well as for classifying the ordinal clinical significance of PCa for which a weighted likelihood approach is implemented to improve the detection of less prevalent cancer categories. The proposed method has shown important advantages over conventional modeling and machine learning approaches in simulations and application to our motivating patient data.
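
The between-voxel correlation step described above amounts to smoothing the voxel-wise classifier outputs with a Gaussian kernel; a minimal sketch with scipy follows (array shape and bandwidth are invented for illustration).

```python
# Gaussian-kernel smoothing of a voxel-wise cancer-probability map.
import numpy as np
from scipy.ndimage import gaussian_filter

prob_map = np.random.rand(64, 64, 24)            # per-voxel classifier outputs
smoothed = gaussian_filter(prob_map, sigma=1.5)  # borrows strength from neighbors
```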

16.
Photodiagnosis Photodyn Ther ; 42: 103351, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36849089

ABSTRACT

BACKGROUND: Diabetic retinopathy (DR) is a serious consequence of diabetes that can result in permanent vision loss. Diabetes-related vision impairment can largely be avoided with timely screening and treatment in its initial phase. The earliest and most noticeable indications on the surface of the retina are micro-aneurysms and haemorrhages, which appear as dark patches. Therefore, automatic detection of retinopathy begins with the identification of these dark lesions. METHOD: In our study, we developed a clinical-knowledge-based segmentation built on the Early Treatment Diabetic Retinopathy Study (ETDRS), a gold standard for identifying all red lesions, using an adaptive-thresholding approach combined with several pre-processing steps. The lesions are classified using a super-learning approach to improve multi-class detection accuracy. The ensemble-based super-learning approach finds optimal weights for the base learners by minimizing the cross-validated risk function, promising improved performance over the base learners' individual predictions. For multi-class classification, an informative feature set based on colour, intensity, shape, size, and texture was developed. We also addressed the data imbalance problem and compared the final accuracy across different synthetic-data creation ratios. RESULT: The suggested approach uses publicly available resources to perform quantitative assessments at the lesion level. The overall accuracy of red lesion classification is 93.5%, which increases to 97.88% when the data imbalance problem is addressed. CONCLUSION: Our system achieves performance competitive with other modern approaches, and handling the data imbalance further improves it.
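
A rough sketch of the adaptive-thresholding step for dark-lesion extraction is given below, using OpenCV on the green channel of a fundus image; the file name and parameter values are placeholders, not the paper's tuned settings.

```python
# Adaptive thresholding to pick out dark red lesions (microaneurysms,
# haemorrhages); inverted so dark lesions become foreground.
import cv2

img = cv2.imread("fundus.png")     # hypothetical input path
green = img[:, :, 1]               # green channel: best lesion contrast
green = cv2.medianBlur(green, 5)   # suppress noise before thresholding
lesions = cv2.adaptiveThreshold(
    green, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY_INV,         # invert: dark regions -> foreground
    blockSize=51, C=8)
```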


Subject(s)
Diabetic Retinopathy , Photochemotherapy , Humans , Image Interpretation, Computer-Assisted/methods , Photochemotherapy/methods , Photosensitizing Agents , Fundus Oculi , Diabetic Retinopathy/diagnostic imaging , Algorithms
17.
Med Biol Eng Comput ; 61(3): 785-797, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36602674

ABSTRACT

Diabetes mellitus has become a rapidly growing chronic health problem worldwide, with a noticeable increase in cases over the last two decades. Recent advances in ensemble machine learning methods play an important role in the early detection of diabetes mellitus, being both faster and less costly than traditional methods. This study proposes a new super ensemble learning model to enable early diagnosis of diabetes mellitus. The super learner is a cross-validation-based approach that makes better predictions by combining the prediction results of more than one machine learning algorithm. The proposed super learner model was created with four base learners (logistic regression, decision tree, random forest, gradient boosting) and a meta-learner (support vector machine) as a result of a case study. Three different datasets were used to measure the robustness of the proposed model. Chi-square was selected as the optimal feature-selection technique from among five candidates, and hyper-parameters were tuned with grid search. The proposed super learner model achieved the best accuracy in detecting diabetes mellitus compared with the base learners on the early-stage diabetes risk prediction (99.6%), PIMA (92%), and Diabetes 130-US Hospitals (98%) datasets, respectively. This study shows that super learner algorithms can be used effectively in the detection of diabetes mellitus, and the high and consistent statistical scores demonstrate the robustness of the proposed model.
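
The architecture described above maps directly onto a scikit-learn pipeline: chi-square feature selection feeding a stack of the four named base learners with an SVM meta-learner. The sketch below is a plausible reconstruction, not the authors' code; dataset and hyper-parameters are placeholders.

```python
# Chi-square selection + four base learners + SVM meta-learner.
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

super_learner = Pipeline([
    ("select", SelectKBest(chi2, k=10)),       # chi-square feature selection
    ("stack", StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=SVC(probability=True),  # meta-learner
        cv=5)),
])
# super_learner.fit(X_train, y_train)  # chi2 requires non-negative features
```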


Subject(s)
Algorithms , Diabetes Mellitus , Humans , Machine Learning , Logistic Models , Random Forest , Diabetes Mellitus/diagnosis
18.
Chemosphere ; 311(Pt 2): 137125, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36347347

ABSTRACT

Chronic lead (Pb) exposure causes long-term health effects. While recent exposure can be assessed by measuring blood lead (half-life 30 days), chronic exposure can be assessed by measuring lead in bone (half-life of many years to decades). Bone lead, in turn, has been measured non-invasively in large population-based studies using X-ray fluorescence techniques, but the method remains limited due to technical availability, expense, and the need to license the radioactive materials used by the instruments. We therefore developed prediction models for bone lead concentrations using a flexible machine learning approach, Super Learner, which combines the predictions from a set of machine learning algorithms for better prediction performance. The study population included 695 men in the Normative Aging Study, aged 48 years and older, whose bone (patella and tibia) lead concentrations were directly measured using K-shell X-ray fluorescence. Ten predictors (blood lead, age, education, job type, weight, height, body mass index, waist circumference, cumulative cigarette smoking (pack-years), and smoking status) were selected for patella lead and 11 (the same 10 predictors plus serum phosphorus) for tibia lead using the Boruta algorithm. We implemented Super Learner to predict bone lead concentrations by calculating a weighted combination of predictions from 8 algorithms. In nested cross-validation, the correlation coefficients between measured and predicted bone lead concentrations were 0.58 for patella lead and 0.52 for tibia lead, improving on the correlations obtained by previously published linear regression-based prediction models. We evaluated the applicability of these prediction models to the National Health and Nutrition Examination Survey by examining the associations between predicted bone lead concentrations and blood pressure, and positive associations were observed. These bone lead prediction models provide reasonable accuracy and can be used to evaluate the health effects of cumulative lead exposure in studies where bone lead is not measured.


Subject(s)
Aging , Lead , Male , Humans , Nutrition Surveys , Linear Models , Algorithms
19.
Anaesth Crit Care Pain Med ; 42(1): 101172, 2023 02.
Article in English | MEDLINE | ID: mdl-36375781

ABSTRACT

BACKGROUND: Post-cardiotomy low cardiac output syndrome (PC-LCOS) is a life-threatening complication after cardiac surgery involving cardiopulmonary bypass (CPB). Mechanical circulatory support with veno-arterial extracorporeal membrane oxygenation (VA-ECMO) may be necessary in the case of refractory shock. The objective of the study was to develop a machine-learning algorithm to predict the need for VA-ECMO implantation in patients with PC-LCOS. PATIENTS AND METHODS: Patients with moderate to severe PC-LCOS (defined as a vasoactive inotropic score (VIS) > 10 with clinical or biological markers of impaired organ perfusion, or the need for mechanical circulatory support after cardiac surgery) were included from two university hospitals in Paris, France. The Deep Super Learner, an ensemble machine learning algorithm, was trained to predict VA-ECMO implantation using features readily available at the end of CPB. Feature importance was estimated using Shapley values. RESULTS: Between January 2016 and December 2019, 285 patients were included in the development dataset and 190 patients in the external validation dataset. The primary outcome, the need for VA-ECMO implantation, occurred in 16% (n = 46) and 10% (n = 19) of the development and external validation datasets, respectively. The Deep Super Learner algorithm achieved an ROC AUC of 0.863 (0.793-0.928) for predicting the primary outcome in the external validation dataset. The most important features were the first postoperative arterial lactate value, intraoperative VIS, the absence of angiotensin-converting enzyme treatment, body mass index, and EuroSCORE II. CONCLUSIONS: We developed an explainable ensemble machine learning algorithm that could help clinicians predict the risk of deterioration and the need for VA-ECMO implantation in patients with moderate to severe PC-LCOS.
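
Shapley-value feature importance of the kind reported above is commonly computed with the shap package; the sketch below applies it to a stand-in tree ensemble, since the study's Deep Super Learner and data are not reproduced here.

```python
# Shapley-value feature importance with the shap package on a placeholder model.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # efficient for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)       # global importance ranking
```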


Subject(s)
Cardiac Output, Low , Cardiac Surgical Procedures , Extracorporeal Membrane Oxygenation , Humans , Cardiac Output, Low/etiology , Cardiac Output, Low/therapy , Cardiac Surgical Procedures/adverse effects , Machine Learning , Algorithms
20.
Biostatistics ; 24(2): 502-517, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34939083

ABSTRACT

Cluster randomized trials (CRTs) randomly assign an intervention to groups of individuals (e.g., clinics or communities) and measure outcomes on individuals in those groups. While offering many advantages, this experimental design introduces challenges that are only partially addressed by existing analytic approaches. First, outcomes are often missing for some individuals within clusters. Failing to appropriately adjust for differential outcome measurement can result in biased estimates and inference. Second, CRTs often randomize limited numbers of clusters, resulting in chance imbalances on baseline outcome predictors between arms. Failing to adaptively adjust for these imbalances and other predictive covariates can result in efficiency losses. To address these methodological gaps, we propose and evaluate a novel two-stage targeted minimum loss-based estimator to adjust for baseline covariates in a manner that optimizes precision, after controlling for baseline and postbaseline causes of missing outcomes. Finite sample simulations illustrate that our approach can nearly eliminate bias due to differential outcome measurement, while existing CRT estimators yield misleading results and inferences. Application to real data from the SEARCH community randomized trial demonstrates the gains in efficiency afforded through adaptive adjustment for baseline covariates, after controlling for missingness on individual-level outcomes.


Subject(s)
Outcome Assessment, Health Care , Research Design , Humans , Randomized Controlled Trials as Topic , Probability , Bias , Cluster Analysis , Computer Simulation