Results 1-20 of 55
1.
Biostatistics ; 24(2): 502-517, 2023 04 14.
Article in English | MEDLINE | ID: mdl-34939083

ABSTRACT

Cluster randomized trials (CRTs) randomly assign an intervention to groups of individuals (e.g., clinics or communities) and measure outcomes on individuals in those groups. While offering many advantages, this experimental design introduces challenges that are only partially addressed by existing analytic approaches. First, outcomes are often missing for some individuals within clusters. Failing to appropriately adjust for differential outcome measurement can result in biased estimates and inference. Second, CRTs often randomize limited numbers of clusters, resulting in chance imbalances on baseline outcome predictors between arms. Failing to adaptively adjust for these imbalances and other predictive covariates can result in efficiency losses. To address these methodological gaps, we propose and evaluate a novel two-stage targeted minimum loss-based estimator to adjust for baseline covariates in a manner that optimizes precision, after controlling for baseline and postbaseline causes of missing outcomes. Finite sample simulations illustrate that our approach can nearly eliminate bias due to differential outcome measurement, while existing CRT estimators yield misleading results and inferences. Application to real data from the SEARCH community randomized trial demonstrates the gains in efficiency afforded through adaptive adjustment for baseline covariates, after controlling for missingness on individual-level outcomes.
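The two-stage logic described in this abstract can be sketched in deliberately simplified form (data layout and function names hypothetical; the actual estimator is a targeted minimum loss-based estimator with adaptive covariate adjustment, which this sketch omits): stage 1 adjusts each cluster's outcome mean for missing outcomes via inverse probability weights, and stage 2 contrasts the average cluster-level summaries between arms.

```python
# Illustrative two-stage cluster-level estimator (not the paper's TMLE).
def cluster_mean_ipw(outcomes, measured, p_measured):
    """Stage 1: IPW-adjusted cluster mean. Unmeasured outcomes can be coded 0;
    the measured indicator zeroes them out of the sum."""
    n = len(outcomes)
    return sum(m * y / p for y, m, p in zip(outcomes, measured, p_measured)) / n

def two_stage_effect(clusters):
    """Stage 2: difference in average cluster-level means between arms.
    clusters: list of (arm, outcomes, measured, p_measured)."""
    means = {0: [], 1: []}
    for arm, y, m, p in clusters:
        means[arm].append(cluster_mean_ipw(y, m, p))
    return sum(means[1]) / len(means[1]) - sum(means[0]) / len(means[0])
```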


Subjects
Outcome Assessment, Health Care; Research Design; Humans; Randomized Controlled Trials as Topic; Probability; Bias; Cluster Analysis; Computer Simulation
2.
Biostatistics ; 2023 Aug 02.
Article in English | MEDLINE | ID: mdl-37531621

ABSTRACT

Cluster randomized trials (CRTs) often enroll large numbers of participants; yet due to resource constraints, only a subset of participants may be selected for outcome assessment, and those sampled may not be representative of all cluster members. Missing data also present a challenge: if sampled individuals with measured outcomes are dissimilar from those with missing outcomes, unadjusted estimates of arm-specific endpoints and the intervention effect may be biased. Further, CRTs often enroll and randomize few clusters, limiting statistical power and raising concerns about finite sample performance. Motivated by SEARCH-TB, a CRT aimed at reducing incident tuberculosis infection, we demonstrate interlocking methods to handle these challenges. First, we extend Two-Stage targeted minimum loss-based estimation to account for three sources of missingness: (i) subsampling; (ii) measurement of baseline status among those sampled; and (iii) measurement of final status among those in the incidence cohort (persons known to be at risk at baseline). Second, we critically evaluate the assumptions under which subunits of the cluster can be considered the conditionally independent unit, improving precision and statistical power but also causing the CRT to behave like an observational study. Our application to SEARCH-TB highlights the real-world impact of different assumptions on measurement and dependence; estimates relying on unrealistic assumptions suggested the intervention increased the incidence of TB infection by 18% (risk ratio [RR]=1.18, 95% confidence interval [CI]: 0.85-1.63), while estimates accounting for the sampling scheme, missingness, and within-community dependence found the intervention decreased the incidence of TB infection by 27% (RR=0.73, 95% CI: 0.57-0.92).

3.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35383362

ABSTRACT

Nuclear receptors (NRs) are important biological targets of endocrine-disrupting chemicals (EDCs). Identifying chemicals that can act as EDCs and modulate the function of NRs is difficult because of the time and cost of in vitro and in vivo screening to determine the potential hazards of the hundreds of thousands of chemicals that humans are exposed to. Hence, there is a need for computational approaches to prioritize chemicals for biological testing. Machine learning (ML) techniques are alternative methods that can quickly screen millions of chemicals and identify those that may be EDCs. Computational models of chemical binding to multiple NRs have begun to emerge. Recently, a Nuclear Receptor Activity (NuRA) dataset, describing experimentally derived small-molecule activity against various NRs, has been created. We have used the NuRA dataset to develop an ensemble of ML-based models to predict the agonism, antagonism, binding, and effector binding of small molecules to nine different human NRs. We defined the applicability domain of the ML models as a measure of Tanimoto similarity to the molecules in the training set, which enhanced the performance of the developed classifiers. We further developed a user-friendly web server named 'NR-ToxPred' to predict the binding of chemicals to the nine NRs using the best-performing models for each receptor. This web server is freely accessible at http://nr-toxpred.cchem.berkeley.edu. Users can upload individual chemicals using Simplified Molecular-Input Line-Entry System (SMILES) strings or CAS numbers, or can sketch the molecule in the provided space, to predict the compound's activity against the different NRs and the binding mode for each.
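The Tanimoto-similarity applicability domain mentioned here reduces to a simple set computation on binary fingerprints; a minimal sketch, with fingerprints represented as sets of on-bit indices and a hypothetical threshold value:

```python
# Applicability-domain sketch: a query molecule is "in domain" if its maximum
# Tanimoto similarity to any training-set fingerprint exceeds a threshold.
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def in_applicability_domain(query_fp, training_fps, threshold=0.3):
    """threshold is illustrative; the paper tunes this choice empirically."""
    return max(tanimoto(query_fp, fp) for fp in training_fps) >= threshold
```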


Subjects
Endocrine Disruptors; Receptors, Cytoplasmic and Nuclear; Endocrine Disruptors/chemistry; Endocrine Disruptors/metabolism; Humans; Machine Learning; Receptors, Cytoplasmic and Nuclear/genetics
4.
BMC Med Res Methodol ; 24(1): 59, 2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38459490

ABSTRACT

BACKGROUND: The primary treatment for patients with myocardial infarction (MI) is percutaneous coronary intervention (PCI). Despite this, the incidence of major adverse cardiovascular events (MACEs) remains a significant concern. Our study seeks to optimize PCI predictive modeling by employing an ensemble learning approach to identify the most effective combination of predictive variables. METHODS AND RESULTS: We conducted a retrospective, non-interventional analysis of MI patient data from 2018 to 2021, focusing on those who underwent PCI. Our principal metric was the occurrence of 1-year postoperative MACEs. Variable selection was performed using lasso regression, and predictive models were developed using the Super Learner (SL) algorithm. Model performance was appraised by the area under the receiver operating characteristic curve (AUC) and the average precision (AP) score. Our cohort included 3,880 PCI patients, with 475 (12.2%) experiencing MACEs within one year. The SL model exhibited superior discriminative performance, achieving a validated AUC of 0.982 and an AP of 0.971, which markedly surpassed the traditional logistic regression model (AUC: 0.826, AP: 0.626) in the test cohort. Thirteen variables were significantly associated with the occurrence of 1-year MACEs. CONCLUSION: Implementing the Super Learner algorithm substantially enhanced the predictive accuracy for the risk of MACEs in MI patients. This advancement presents a promising tool for clinicians to craft individualized, data-driven interventions to improve patient outcomes.
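The two reported metrics, AUC and average precision, can be computed directly from predicted scores and binary labels; a minimal self-contained sketch:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    fraction of positive/negative pairs ranked correctly, ties counting 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """AP: precision evaluated at the rank of each true positive, averaged."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```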


Subjects
Acute Coronary Syndrome; Myocardial Infarction; Percutaneous Coronary Intervention; Humans; Percutaneous Coronary Intervention/adverse effects; Acute Coronary Syndrome/complications; Acute Coronary Syndrome/surgery; Retrospective Studies; Myocardial Infarction/surgery; Risk Factors
5.
Pharmacoepidemiol Drug Saf ; 33(1): e5678, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37609668

ABSTRACT

PURPOSE: High-dimensional propensity score (hdPS) is a semiautomated method that leverages a vast number of covariates available in healthcare databases to improve confounding adjustment. A novel combined Super Learner (SL)-hdPS approach was proposed to assist with selecting the number of covariates for propensity score inclusion, and was found in plasmode simulation studies to improve bias reduction and precision compared to hdPS alone. However, the approach has not been examined in the applied setting. METHODS: We compared SL-hdPS's performance with that of several hdPS models, each with prespecified covariates and a different number of empirically-identified covariates, using a cohort study comparing real-world bleeding rates between ibrutinib- and bendamustine-rituximab (BR)-treated individuals with chronic lymphocytic leukemia in Optum's de-identified Clinformatics® Data Mart commercial claims database (2013-2020). We used inverse probability of treatment weighting for confounding adjustment and Cox proportional hazards regression to estimate hazard ratios (HRs) for bleeding outcomes. Parameters of interest included prespecified and empirically-identified covariate balance (absolute standardized difference [ASD] thresholds of <0.10 and <0.05) and outcome HR precision (95% confidence intervals). RESULTS: We identified 2423 ibrutinib- and 1102 BR-treated individuals. Including >200 empirically-identified covariates in the hdPS model compromised covariate balance at both ASD thresholds. SL-hdPS balanced more covariates than all individual hdPS models at both ASD thresholds. The bleeding HR 95% confidence intervals were generally narrower with SL-hdPS than with individual hdPS models. CONCLUSION: In a real-world application, hdPS was sensitive to the number of covariates included, while use of SL for covariate selection resulted in improved covariate balance and possibly improved precision.
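The covariate-balance diagnostic used here, the absolute standardized difference (ASD), has a standard closed form; a sketch for a single continuous or binary covariate, compared against the <0.10 and <0.05 thresholds the abstract cites:

```python
import math

def absolute_standardized_difference(x_treated, x_control):
    """ASD = |mean_t - mean_c| / sqrt((var_t + var_c) / 2),
    using sample variances; values below 0.10 (or 0.05) suggest balance."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled_sd = math.sqrt((var(x_treated) + var(x_control)) / 2)
    return abs(mean(x_treated) - mean(x_control)) / pooled_sd
```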


Subjects
Leukemia, Lymphocytic, Chronic, B-Cell; Humans; Propensity Score; Cohort Studies; Leukemia, Lymphocytic, Chronic, B-Cell/drug therapy; Proportional Hazards Models; Computer Simulation
6.
BMC Med Inform Decis Mak ; 24(1): 97, 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38627734

ABSTRACT

BACKGROUND & AIM: Cardiovascular disease (CVD) is the most important cause of death worldwide and has a potential impact on health care costs. This study aimed to evaluate the performance of machine learning survival models and determine the optimum model for predicting CVD-related mortality. METHOD: In this study, the research population was all participants in the Tehran Lipid and Glucose Study (TLGS) aged over 30 years. We used the Gradient Boosting model (GBM), Support Vector Machine (SVM), Super Learner (SL), and Cox proportional hazards (Cox-PH) models to predict CVD-related mortality using 26 features. The dataset was randomly divided into training (80%) and testing (20%) sets. To evaluate the performance of the methods, we used the Brier Score (BS), Prediction Error (PE), Concordance Index (C-index), and time-dependent Area Under the Curve (TD-AUC) criteria. Four different clinical models were also fitted to improve the performance of the methods. RESULTS: Of 9258 participants with a mean age of 43.74 years (SD 15.51; range 20-91), 56.60% were female. The CVD death proportion was 2.5% (228 participants), and deaths occurred significantly more often in men (67.98% of deaths in men vs. 32.02% in women). Based on predefined selection criteria, the SL method had the best performance in predicting CVD-related mortality (TD-AUC > 93.50%). Among the machine learning (ML) methods, the SVM had the worst performance (TD-AUC = 90.13%). According to the relative effect, age, fasting blood sugar, systolic blood pressure, smoking, taking aspirin, diastolic blood pressure, type 2 diabetes mellitus, hip circumference, body mass index (BMI), and triglyceride were identified as the most influential variables in predicting CVD-related mortality. CONCLUSION: Compared to the Cox-PH model, machine learning models showed promising and sometimes better performance in predicting CVD-related mortality. This finding is based on the analysis of a large and diverse urban population from Tehran, Iran.
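Of the evaluation criteria listed, the Brier score is the simplest; a sketch of its time-fixed form (the survival version additionally uses inverse-probability-of-censoring weighting, which this sketch omits):

```python
def brier_score(probs, events):
    """Mean squared difference between predicted probability and the
    observed 0/1 outcome; 0 is perfect, 0.25 matches an uninformative 0.5."""
    return sum((p - y) ** 2 for p, y in zip(probs, events)) / len(probs)
```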


Subjects
Cardiovascular Diseases; Diabetes Mellitus, Type 2; Male; Humans; Female; Adult; Cardiovascular Diseases/epidemiology; Glucose; Iran/epidemiology; Lipids
7.
J Environ Manage ; 359: 121040, 2024 May.
Article in English | MEDLINE | ID: mdl-38718609

ABSTRACT

This study comprehensively analyzes the impact on environmental performance of different economic and demographic factors that affect economic development. In this context, the study considers the Environmental Performance Index as the response variable; uses GDP per capita, tariff rate, tax burden, government expenditure, inflation, unemployment, population, income tax rate, public debt, FDI inflow, and corporate tax rate as the explanatory variables; examines 181 countries; applies a novel Super Learner (SL) algorithm, which combines a total of six machine learning (ML) algorithms; and uses data for the years 2018, 2020, and 2022. The results demonstrate that (i) the SL algorithm has superior capacity relative to the other ML algorithms; (ii) gross domestic product per capita is the most crucial factor in environmental performance, followed by tariff rates, tax burden, government expenditure, and inflation, in that order; (iii) the corporate tax rate has the lowest importance for environmental performance, followed by foreign direct investment, public debt, income tax rate, population, and unemployment; (iv) there are critical thresholds at which the impact of the factors on environmental performance changes. Overall, the study reveals the nonlinear impact of the variables on environmental performance as well as their relative importance and critical thresholds. Thus, the study provides policymakers with valuable insights for re-formulating their environmental policies to increase environmental performance. Accordingly, various policy options are discussed.


Subjects
Algorithms; Machine Learning; Environment; Economic Development; Gross Domestic Product
8.
Biometrics ; 79(4): 2815-2829, 2023 12.
Article in English | MEDLINE | ID: mdl-37641532

ABSTRACT

We consider the problem of optimizing treatment allocation for statistical efficiency in randomized clinical trials. Optimal allocation has been studied previously for simple treatment effect estimators such as the sample mean difference, which are not fully efficient in the presence of baseline covariates. More efficient estimators can be obtained by incorporating covariate information, and modern machine learning methods make it increasingly feasible to approach full efficiency. Accordingly, we derive the optimal allocation ratio by maximizing the design efficiency of a randomized trial, assuming that an efficient estimator will be used for analysis. We then expand the scope of optimization by considering covariate-dependent randomization (CDR), which has some flavor of an observational study but provides the same level of scientific rigor as a standard randomized trial. We describe treatment effect estimators that are consistent, asymptotically normal, and (nearly) efficient under CDR, and derive the optimal propensity score by maximizing the design efficiency of a CDR trial (under the assumption that an efficient estimator will be used for analysis). Our optimality results translate into optimal designs that improve upon standard practice. Real-world examples and simulation results demonstrate that the proposed designs can produce substantial efficiency improvements in realistic settings.
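For the simple difference-in-means estimator, the classical optimal allocation (Neyman allocation) assigns the treatment share in proportion to the arm-specific outcome standard deviations; a sketch of that baseline result only (the paper derives the analogue for covariate-efficient estimators, which this does not reproduce):

```python
def neyman_allocation(sd_treat, sd_control):
    """Share of participants assigned to treatment that minimizes
    Var = sd_t^2 / (n * pi) + sd_c^2 / (n * (1 - pi)) over pi in (0, 1),
    namely pi* = sd_t / (sd_t + sd_c)."""
    return sd_treat / (sd_treat + sd_control)
```

Equal variances recover the usual 1:1 randomization; a treatment arm with twice the outcome variability receives two-thirds of the sample.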


Subjects
Models, Statistical; Randomized Controlled Trials as Topic; Computer Simulation; Propensity Score
9.
Stat Med ; 42(23): 4147-4176, 2023 10 15.
Article in English | MEDLINE | ID: mdl-37532119

ABSTRACT

There has been growing interest in using nonparametric machine learning approaches for propensity score estimation in order to foster robustness against misspecification of the propensity score model. However, the vast majority of studies have focused on single-level data settings, and research on nonparametric propensity score estimation in clustered data settings is scarce. In this article, we extend existing research by describing a general algorithm for incorporating random effects into a machine learning model, which we implemented for generalized boosted modeling (GBM). In a simulation study, we investigated the performance of logistic regression, GBM, and Bayesian additive regression trees for inverse probability of treatment weighting (IPW) when the data are clustered, the treatment exposure mechanism is nonlinear, and unmeasured cluster-level confounding is present. For each approach, we compared fixed and random effects propensity score models to single-level models and evaluated their use in both marginal and clustered IPW. We additionally investigated the performance of the standard Super Learner and the balance Super Learner. The results showed that when there was no unmeasured confounding, logistic regression resulted in moderate bias in both marginal and clustered IPW, whereas the nonparametric approaches were unbiased. In the presence of cluster-level confounding, fixed and random effects models greatly reduced bias compared to single-level models in marginal IPW, with fixed effects GBM and fixed effects logistic regression performing best. Finally, clustered IPW was overall preferable to marginal IPW, and the balance Super Learner outperformed the standard Super Learner, though neither worked as well as its best candidate model.
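The IPW weights compared throughout this study have a standard construction once a propensity score is in hand; a minimal sketch of weight computation and a Hajek-style weighted contrast (inputs hypothetical):

```python
def ipw_weights(treatment, propensity):
    """Inverse probability of treatment weights:
    1/e(X) for treated units, 1/(1 - e(X)) for controls."""
    return [1.0 / e if z == 1 else 1.0 / (1.0 - e)
            for z, e in zip(treatment, propensity)]

def weighted_mean_difference(outcomes, treatment, propensity):
    """Hajek-style IPW estimate of the average treatment effect:
    difference of weight-normalized outcome means."""
    w = ipw_weights(treatment, propensity)
    t = [(wi * y, wi) for y, z, wi in zip(outcomes, treatment, w) if z == 1]
    c = [(wi * y, wi) for y, z, wi in zip(outcomes, treatment, w) if z == 0]
    mean_t = sum(x for x, _ in t) / sum(wi for _, wi in t)
    mean_c = sum(x for x, _ in c) / sum(wi for _, wi in c)
    return mean_t - mean_c
```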


Subjects
Multilevel Analysis; Observational Studies as Topic; Propensity Score; Humans; Bayes Theorem; Bias; Computer Simulation; Logistic Models
10.
Stat Med ; 42(13): 2116-2133, 2023 06 15.
Article in English | MEDLINE | ID: mdl-37004994

ABSTRACT

Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood-based loss function. K-fold cross-validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out-of-sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open-source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM.


Subjects
Algorithms; Normal Distribution; Humans; Likelihood Functions; Software; Gene Expression; Ovarian Neoplasms/genetics
11.
Neuroimage ; 257: 119296, 2022 08 15.
Article in English | MEDLINE | ID: mdl-35561944

ABSTRACT

The exclusion of high-motion participants can reduce the impact of motion in functional Magnetic Resonance Imaging (fMRI) data. However, the exclusion of high-motion participants may change the distribution of clinically relevant variables in the study sample, and the resulting sample may not be representative of the population. Our goals are two-fold: 1) to document the biases introduced by common motion exclusion practices in functional connectivity research and 2) to introduce a framework to address these biases by treating excluded scans as a missing data problem. We use a study of autism spectrum disorder in children without an intellectual disability to illustrate the problem and the potential solution. We aggregated data from 545 children (8-13 years old) who participated in resting-state fMRI studies at Kennedy Krieger Institute (173 autistic and 372 typically developing) between 2007 and 2020. We found that autistic children were more likely to be excluded than typically developing children, with 28.5% and 16.1% of autistic and typically developing children excluded, respectively, using a lenient criterion and 81.0% and 60.1% with a stricter criterion. The resulting sample of autistic children with usable data tended to be older, have milder social deficits, better motor control, and higher intellectual ability than the original sample. These measures were also related to functional connectivity strength among children with usable data. This suggests that the generalizability of previous studies reporting naïve analyses (i.e., based only on participants with usable data) may be limited by the selection of older children with less severe clinical profiles because these children are better able to remain still during a resting-state fMRI scan. We adapt doubly robust targeted minimum loss-based estimation with an ensemble of machine learning algorithms to address these data losses and the resulting biases. The proposed approach selects more edges that differ in functional connectivity between autistic and typically developing children than the naïve approach, supporting this as a promising solution to improve the study of heterogeneous populations in which motion is common.


Subjects
Autism Spectrum Disorder; Autistic Disorder; Adolescent; Autism Spectrum Disorder/diagnostic imaging; Brain/diagnostic imaging; Brain Mapping/methods; Child; Cognition; Humans; Magnetic Resonance Imaging/methods
12.
Am J Epidemiol ; 190(8): 1483-1487, 2021 08 01.
Article in English | MEDLINE | ID: mdl-33751059

ABSTRACT

In this issue of the Journal, Mooney et al. (Am J Epidemiol. 2021;190(8):1476-1482) discuss machine learning as a tool for causal research in the style of Internet headlines. Here we comment by adapting famous literary quotations, including the one in our title (from "Sonnet 43" by Elizabeth Barrett Browning (Sonnets From the Portuguese, Adelaide Hanscom Leeson, 1850)). We emphasize that any use of machine learning to answer causal questions must be founded on a formal framework for both causal and statistical inference. We illustrate the pitfalls that can occur without such a foundation. We conclude with some practical recommendations for integrating machine learning into causal analyses in a principled way and highlight important areas of ongoing work.


Subjects
Love; Machine Learning; Causality; Humans
13.
Am J Epidemiol ; 2021 Jul 15.
Article in English | MEDLINE | ID: mdl-34268553

ABSTRACT

In this issue, Naimi et al. (Am J Epidemiol. XXXX;XXX(XX):XXXX-XXXX) discuss a critical topic in public health and beyond: obtaining valid statistical inference when using machine learning in causal research. In doing so, the authors review recent prominent methodological work and recommend: (i) double robust estimators, such as targeted maximum likelihood estimation (TMLE); (ii) ensemble methods, such as Super Learner, to combine predictions from a diverse library of algorithms; and (iii) sample-splitting to reduce bias and improve inference. We largely agree with these recommendations. In this commentary, we highlight the critical importance of the Super Learner library. Specifically, in both simulation settings considered by the authors, we demonstrate that low bias and valid statistical inference can be achieved using TMLE without sample-splitting and with a Super Learner library that excludes tree-based methods but includes regression splines. Whether extremely data-adaptive algorithms and sample-splitting are needed depends on the specific problem and should be informed by simulations reflecting the specific application. More research is needed on practical recommendations for selecting among these options in common situations arising in epidemiology.

14.
BMC Public Health ; 21(1): 1219, 2021 06 24.
Article in English | MEDLINE | ID: mdl-34167500

ABSTRACT

OBJECTIVES: The relationship between reproductive factors and breast cancer (BC) risk has been investigated in previous studies. Considering the discrepancies in the results, the aim of this study was to estimate the causal effect of reproductive factors on BC risk in a case-control study using the double robust approach of targeted maximum likelihood estimation. METHODS: This is a causal reanalysis of a case-control study done between 2005 and 2008 in Shiraz, Iran, in which 787 confirmed BC cases and 928 controls were enrolled. Targeted maximum likelihood estimation along with the Super Learner were used to analyze the data, and the risk ratio (RR), risk difference (RD), and population attributable fraction (PAF) were reported. RESULTS: Our findings did not support parity and age at the first pregnancy as risk factors for BC. The risk of BC was higher among postmenopausal women (RR = 3.3, 95% confidence interval (CI) = (2.3, 4.6)), women with age at first marriage ≥20 years (RR = 1.6, 95% CI = (1.3, 2.1)), and women with a history of oral contraceptive (OC) use (RR = 1.6, 95% CI = (1.3, 2.1)) or breastfeeding duration ≤60 months (RR = 1.8, 95% CI = (1.3, 2.5)). The PAFs for menopause status, breastfeeding duration, and OC use were 40.3% (95% CI = 39.5, 40.6), 27.3% (95% CI = 23.1, 30.8), and 24.4% (95% CI = 10.5, 35.5), respectively. CONCLUSIONS: Postmenopausal women, and women with a higher age at first marriage, shorter duration of breastfeeding, or a history of OC use, are at higher risk of BC.
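The reported quantities RR, RD, and PAF have simple closed forms; a sketch using Levin's formula for the PAF (inputs hypothetical; the study itself estimates these via TMLE rather than directly from crude risks):

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """RR: ratio of outcome risk in exposed vs. unexposed."""
    return risk_exposed / risk_unexposed

def risk_difference(risk_exposed, risk_unexposed):
    """RD: absolute excess risk attributable to exposure."""
    return risk_exposed - risk_unexposed

def paf_levin(prevalence_exposed, rr):
    """Levin's population attributable fraction:
    PAF = p(RR - 1) / (1 + p(RR - 1)), with p the exposure prevalence."""
    excess = prevalence_exposed * (rr - 1.0)
    return excess / (1.0 + excess)
```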


Subjects
Breast Neoplasms; Breast Neoplasms/epidemiology; Breast Neoplasms/etiology; Case-Control Studies; Female; Humans; Iran/epidemiology; Likelihood Functions; Parity; Pregnancy; Reproductive History; Risk Factors
15.
Stat Med ; 39(23): 3059-3073, 2020 10 15.
Article in English | MEDLINE | ID: mdl-32578905

ABSTRACT

Human immunodeficiency virus (HIV) pre-exposure prophylaxis (PrEP) protects high-risk patients from becoming infected with HIV. Clinicians need help identifying candidates for PrEP based on information routinely collected in electronic health records (EHRs). The greatest statistical challenge in developing a risk prediction model is that HIV acquisition is extremely rare. METHODS: Data consisted of 180 covariates (demographics, diagnoses, treatments, prescriptions) extracted from records on 399,385 patients (150 cases) seen at Atrius Health (2007-2015), a clinical network in Massachusetts. Super Learner is an ensemble machine learning algorithm that uses k-fold cross-validation to evaluate and combine predictions from a collection of algorithms. We trained 42 variants of sophisticated algorithms, using different sampling schemes that more evenly balanced the ratio of cases to controls. We compared the Super Learner's cross-validated area under the receiver operating characteristic curve (cv-AUC) with that of each individual algorithm. RESULTS: The least absolute shrinkage and selection operator (lasso) using a 1:20 class ratio outperformed the Super Learner (cv-AUC = 0.86 vs. 0.84). A traditional logistic regression model restricted to 23 clinician-selected main terms was slightly inferior (cv-AUC = 0.81). CONCLUSION: Machine learning was successful at developing a model to predict 1-year risk of acquiring HIV based on a physician-curated set of predictors extracted from EHRs.


Subjects
HIV Infections; Pre-Exposure Prophylaxis; Electronic Health Records; HIV; HIV Infections/prevention & control; Humans; Machine Learning
16.
Jpn J Clin Oncol ; 50(10): 1133-1140, 2020 Sep 28.
Article in English | MEDLINE | ID: mdl-32596714

ABSTRACT

OBJECTIVE: Improved prognostic prediction for patients with colorectal cancer remains an important challenge. This study aimed to develop an effective prognostic model for predicting survival in resected colorectal cancer patients through implementation of the Super Learner. METHODS: A total of 2333 patients who met the inclusion criteria were enrolled in the cohort. We used multivariate Cox regression analysis to identify significant prognostic factors and the Super Learner to construct prognostic models. Prediction models were internally validated by 10-fold cross-validation and externally validated with a dataset from The Cancer Genome Atlas. Discrimination and calibration were evaluated by the Harrell concordance index (C-index) and calibration plots, respectively. RESULTS: Age, T stage, N stage, histological type, tumor location, lymphovascular invasion, preoperative carcinoembryonic antigen, and sampled lymph nodes were integrated into the prediction models. The C-index of the Super Learner-based prediction model (SLM) was 0.792 (95% confidence interval: 0.767-0.818), higher than that of the seventh edition American Joint Committee on Cancer tumor-node-metastasis (TNM) staging system (0.689; 95% confidence interval: 0.672-0.703) for predicting overall survival (P < 0.05). In external validation, the C-index of the SLM for predicting overall survival was also higher than that of the TNM staging system (0.764 vs. 0.682, respectively; P < 0.001). In addition, the SLM showed good calibration properties. CONCLUSIONS: We developed and externally validated an effective prognostic prediction model based on the Super Learner, which offered more reliable and accurate prognosis prediction and may be used to more accurately identify high-risk patients who need more active surveillance after colorectal cancer resection.
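Harrell's C-index used here counts concordant pairs among usable pairs; a minimal sketch for right-censored data (a pair is usable when the earlier time is an observed event, and ties in predicted risk count as half-concordant):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among usable pairs, the fraction where the subject who
    fails earlier has the higher predicted risk."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair is usable only if subject i has an observed event
            # strictly before subject j's time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / usable
```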


Subjects
Asian People; Colorectal Neoplasms/mortality; Colorectal Neoplasms/surgery; Ethnicity; Models, Biological; Aged; Calibration; Cohort Studies; Colorectal Neoplasms/pathology; Early Detection of Cancer; Female; Humans; Kaplan-Meier Estimate; Male; Middle Aged; Neoplasm Staging; Nomograms; Prognosis; Proportional Hazards Models; Reproducibility of Results
17.
J Thromb Thrombolysis ; 49(1): 1-9, 2020 01.
Article in English | MEDLINE | ID: mdl-31535314

ABSTRACT

Traditional statistical models allow population-based inferences and comparisons. Machine learning (ML) explores datasets to develop algorithms that do not assume linear relationships between variables and outcomes and that may account for higher-order interactions to make individualized outcome predictions. The aim was to evaluate the performance of machine learning models against traditional risk stratification methods for the prediction of major adverse cardiovascular events (MACE) and bleeding in patients with acute coronary syndrome (ACS) treated with antithrombotic therapy. Data on 24,178 ACS patients were pooled from four randomized controlled trials. The super learner ensemble algorithm selected weights for 23 machine learning models and was compared to traditional models. The efficacy endpoint was a composite of cardiovascular death, myocardial infarction, or stroke. The safety endpoint was a composite of TIMI major and minor bleeding or bleeding requiring medical attention. For the MACE outcome, the super learner model produced a higher c-statistic (0.734) than logistic regression (0.714), the TIMI risk score (0.489), and a new cardiovascular risk score developed in the dataset (0.644). For the bleeding outcome, the super learner demonstrated a c-statistic similar to that of the logistic regression model (0.670 vs. 0.671). The machine learning risk estimates were highly calibrated with observed efficacy and bleeding outcomes (Hosmer-Lemeshow p value = 0.692 and 0.970, respectively). The super learner algorithm was highly calibrated on both efficacy and safety outcomes and produced the highest c-statistic for prediction of MACE compared to traditional risk stratification methods. This analysis demonstrates a contemporary application of machine learning to guide patient-level antithrombotic therapy treatment decisions.
Clinical Trial Registration: ATLAS ACS-2 TIMI 46: https://clinicaltrials.gov/ct2/show/NCT00402597. Unique Identifier: NCT00402597. ATLAS ACS-2 TIMI 51: https://clinicaltrials.gov/ct2/show/NCT00809965. Unique Identifier: NCT00809965. GEMINI ACS-1: https://clinicaltrials.gov/ct2/show/NCT02293395. Unique Identifier: NCT02293395. PIONEER-AF PCI: https://clinicaltrials.gov/ct2/show/NCT01830543. Unique Identifier: NCT01830543.
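The c-statistics compared above are concordance probabilities. As a minimal pure-Python sketch (toy inputs, not data from the pooled trials), a c-statistic can be computed from predicted risks like this:

```python
def c_statistic(y_true, y_score):
    """Concordance probability: the fraction of (event, non-event) pairs
    in which the event receives the higher predicted risk; ties count 0.5.
    Equivalent to the area under the ROC curve."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    nonevents = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = sum(1.0 if e > ne else 0.5 if e == ne else 0.0
                     for e in events for ne in nonevents)
    return concordant / (len(events) * len(nonevents))

# Hypothetical outcomes and predicted risks for six patients.
y = [1, 0, 1, 0, 0, 1]
risk = [0.9, 0.2, 0.7, 0.4, 0.1, 0.4]
print(round(c_statistic(y, risk), 3))  # → 0.944
```

A c-statistic of 0.5 corresponds to a model no better than chance, which is why the reported TIMI risk score value of 0.489 indicates essentially no discrimination in this pooled dataset.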


Subjects
Acute Coronary Syndrome , Fibrinolytic Agents/adverse effects , Hemorrhage , Machine Learning , Acute Coronary Syndrome/drug therapy , Acute Coronary Syndrome/epidemiology , Aged , Female , Fibrinolytic Agents/administration & dosage , Hemorrhage/chemically induced , Hemorrhage/epidemiology , Humans , Male , Middle Aged , Models, Cardiovascular , Randomized Controlled Trials as Topic , Risk Assessment
18.
Stat Med ; 38(10): 1703-1714, 2019 05 10.
Article in English | MEDLINE | ID: mdl-30474289

ABSTRACT

Clinical trials are widely considered the gold standard for treatment evaluation, but they can be highly expensive in terms of time and money. The efficiency of clinical trials can be improved by incorporating information from baseline covariates that are related to clinical outcomes. This can be done by modifying an unadjusted treatment effect estimator with an augmentation term that involves a function of covariates. The optimal augmentation is well characterized in theory but must be estimated in practice. In this article, we investigate the use of machine learning methods to estimate the optimal augmentation. We consider and compare an indirect approach based on an estimated regression function and a direct approach that aims to directly minimize the asymptotic variance of the treatment effect estimator. Theoretical considerations and simulation results indicate that the direct approach is generally preferable to the indirect approach. The direct approach can be implemented using any existing prediction algorithm that can minimize a weighted sum of squared prediction errors. Many such prediction algorithms are available, and the super learning principle can be used to combine multiple algorithms into a super learner under the direct approach. The resulting direct super learner has a desirable oracle property, is easy to implement, and performs well in realistic settings. The proposed methodology is illustrated with real data from a stroke trial.
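The covariate augmentation described above can be sketched in a few lines. In this hypothetical illustration, `g` stands in for the estimated optimal covariate function (which the paper fits with a super learner); here it is simply supplied by the caller, and the data are toy values rather than the stroke-trial data:

```python
def augmented_ate(y, a, x, g):
    """Augmented difference-in-means treatment effect estimate.
    Subtracts the mean of (A_i - Abar) * g(X_i), a term with expectation
    ~0 under randomization that can cancel chance covariate imbalance
    and reduce variance when g is chosen well."""
    n = len(y)
    abar = sum(a) / n
    unadj = (sum(yi for yi, ai in zip(y, a) if ai) / sum(a)
             - sum(yi for yi, ai in zip(y, a) if not ai) / (n - sum(a)))
    augmentation = sum((ai - abar) * g(xi) for ai, xi in zip(a, x)) / n
    return unadj - augmentation

# Toy data: outcomes y, treatment indicators a, a scalar baseline covariate x.
y, a, x = [3, 1, 4, 2], [1, 0, 1, 0], [2.0, 1.0, 3.0, 2.0]
print(augmented_ate(y, a, x, g=lambda xi: xi))   # adjusted estimate
print(augmented_ate(y, a, x, g=lambda xi: 0.0))  # g ≡ 0 recovers the unadjusted estimate
```

With `g ≡ 0` the estimator reduces exactly to the unadjusted difference in means; the direct approach in the paper chooses `g` to minimize the estimator's asymptotic variance.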


Subjects
Machine Learning , Models, Statistical , Randomized Controlled Trials as Topic/statistics & numerical data , Computer Simulation , Efficiency , Fibrinolytic Agents/therapeutic use , Humans , Outcome Assessment, Health Care , Research Design , Stroke/drug therapy , Tissue Plasminogen Activator/therapeutic use
19.
Stat Med ; 38(9): 1690-1702, 2019 04 30.
Article in English | MEDLINE | ID: mdl-30586681

ABSTRACT

In investigations of the effect of treatment on outcome, the propensity score is a tool to eliminate imbalance in the distribution of confounding variables between treatment groups. Recent work has suggested that Super Learner, an ensemble method, outperforms logistic regression in nonlinear settings; however, experience with real-data analyses tends to show overfitting of the propensity score model using this approach. We investigated a wide range of simulated settings of varying complexity, including simulations based on real data, to compare the performance of logistic regression, generalized boosted models, and Super Learner in providing balance and in estimating the average treatment effect via propensity score regression, propensity score matching, and inverse probability of treatment weighting. We found that Super Learner and logistic regression are comparable in terms of covariate balance, bias, and mean squared error (MSE); however, Super Learner is computationally very expensive, leaving no clear advantage to the more complex approach. Propensity scores estimated by generalized boosted models were inferior to the other two estimation approaches. We also found that propensity score regression adjustment was superior to either matching or inverse weighting when the form of the dependence of the outcome on the treatment is correctly specified.
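As one concrete instance of the weighting approach compared in this study, a normalized (Hájek) inverse-probability-of-treatment-weighted estimator can be sketched as below. The propensity scores are taken as given (toy values); fitting the propensity model itself, whether by logistic regression, boosting, or Super Learner, is omitted:

```python
def iptw_ate(y, a, e):
    """Hajek (normalized) IPTW estimate of the average treatment effect.
    Each treated subject is weighted by 1/e_i and each control by
    1/(1 - e_i), where e_i is the estimated propensity score."""
    w1 = [ai / ei for ai, ei in zip(a, e)]
    w0 = [(1 - ai) / (1 - ei) for ai, ei in zip(a, e)]
    mu1 = sum(wi * yi for wi, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(wi * yi for wi, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0

# Toy data: outcomes, treatment indicators, and estimated propensity scores.
y = [5, 3, 4, 1]
a = [1, 1, 0, 0]
e = [0.8, 0.4, 0.5, 0.25]
print(iptw_ate(y, a, e))
```

Normalizing the weights (dividing by their sum rather than by n) is a standard stabilization that keeps the estimate within the range of the observed outcomes when some propensity scores are extreme.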


Subjects
Causality , Nonlinear Dynamics , Propensity Score , Bias , Computer Simulation , Confounding Factors, Epidemiologic , Data Interpretation, Statistical , Humans , Logistic Models
20.
Biometrics ; 74(4): 1271-1281, 2018 12.
Article in English | MEDLINE | ID: mdl-29701875

ABSTRACT

A common scientific problem is to determine a surrogate outcome for a long-term outcome so that future randomized studies can restrict themselves to only collecting the surrogate outcome. We consider the setting that we observe n independent and identically distributed observations of a random variable consisting of baseline covariates, a treatment, a vector of candidate surrogate outcomes at an intermediate time point, and the final outcome of interest at a final time point. We assume the treatment is randomized, conditional on the baseline covariates. The goal is to use these data to learn a most-promising surrogate for use in future trials for inference about a mean contrast treatment effect on the final outcome. We define an optimal surrogate for the current study as the function of the data generating distribution collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter. We show that this optimal surrogate is a conditional mean and present super-learner and targeted super-learner based estimators, whose predicted outcomes are used as the surrogate in applications. We demonstrate a number of desirable properties of this optimal surrogate and its estimators, and study the methodology in simulations and an application to dengue vaccine efficacy trials.
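The optimal surrogate above is a conditional mean, so the estimation step reduces to fitting a regression and using its predictions as the surrogate outcome. The sketch below substitutes simple stratum means for the paper's (targeted) super-learner fit, on hypothetical discrete data, purely to make the mechanics concrete:

```python
from collections import defaultdict

def fit_conditional_mean(records):
    """Crude stand-in for the super-learner fit of E[Y | A, S]:
    stratum means over discrete (treatment, candidate-surrogate) cells.
    `records` is a list of (a, s, y) tuples -- hypothetical toy data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for a, s, y in records:
        sums[(a, s)] += y
        counts[(a, s)] += 1
    means = {k: sums[k] / counts[k] for k in sums}
    return lambda a, s: means[(a, s)]

# Toy data: treatment a, intermediate candidate surrogate s, final outcome y.
records = [(1, "hi", 2.0), (1, "hi", 4.0), (1, "lo", 1.0),
           (0, "hi", 2.0), (0, "lo", 0.0), (0, "lo", 1.0)]
surrogate = fit_conditional_mean(records)
# The fitted conditional mean, evaluated at each subject's (a, s), is what
# would serve as the surrogate outcome in a future shortened trial.
print(surrogate(1, "hi"))  # → 3.0
```

Because the prediction function is learned on the current study, a future trial need only collect (a, s) up to the intermediate time point and plug them into `surrogate` in place of the long-term outcome.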


Subjects
Biomarkers , Biometry/methods , Computer Simulation/statistics & numerical data , Randomized Controlled Trials as Topic/statistics & numerical data , Dengue Vaccines/standards , Humans , Likelihood Functions , Outcome Assessment, Health Care