ABSTRACT
Understanding treatment effects on health-related outcomes using real-world data requires defining a causal parameter and imposing relevant identification assumptions to translate it into a statistical estimand. Semiparametric methods, like the targeted maximum likelihood estimator (TMLE), have been developed to construct asymptotically linear estimators of these parameters. To further establish the asymptotic efficiency of these estimators, two conditions must be met: 1) the relevant components of the data likelihood must fall within a Donsker class, and 2) the estimates of nuisance parameters must converge to their true values at a rate faster than n^{-1/4}. The Highly Adaptive LASSO (HAL) satisfies these criteria by acting as an empirical risk minimizer within a class of càdlàg functions with a bounded sectional variation norm, which is known to be Donsker. HAL achieves the desired rate of convergence, thereby guaranteeing the estimators' asymptotic efficiency. The function class over which HAL minimizes its risk is flexible enough to capture realistic functions while maintaining the conditions for establishing efficiency. Additionally, HAL enables robust inference for non-pathwise differentiable parameters, such as the conditional average treatment effect (CATE) and the causal dose-response curve, which are important in precision health. While these parameters are often considered in the machine learning literature, such applications typically lack formal statistical inference. HAL addresses this gap by providing reliable statistical uncertainty quantification, which is essential for informed decision-making in health research.
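A compact way to state the efficiency claim, using standard notation assumed here rather than drawn from the abstract: for a target parameter ψ₀ = Ψ(P₀) with efficient influence curve D*(P₀), the two conditions ensure the estimator admits the expansion

```latex
\hat{\psi}_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} D^*(P_0)(O_i) + o_P\!\left(n^{-1/2}\right),
\qquad
\sqrt{n}\,\bigl(\hat{\psi}_n - \psi_0\bigr) \rightsquigarrow N\!\bigl(0,\ \operatorname{Var}\, D^*(P_0)(O)\bigr).
```

The Donsker condition controls the empirical-process remainder, while the faster-than-n^{-1/4} nuisance rates make the second-order remainder o_P(n^{-1/2}).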
ABSTRACT
BACKGROUND: Long COVID, also known as post-acute sequelae of COVID-19 (PASC), is a poorly understood condition with symptoms across a range of biological domains that often have debilitating consequences. Some have recently suggested that lingering SARS-CoV-2 virus particles in the gut may impede serotonin production and that low serotonin may drive many Long COVID symptoms across a range of biological systems. Therefore, selective serotonin reuptake inhibitors (SSRIs), which increase synaptic serotonin availability, may be used to prevent or treat Long COVID. SSRIs are commonly prescribed for depression; therefore, restricting a study sample to include only patients with depression can reduce the concern of confounding by indication. METHODS: In an observational sample of electronic health records from patients in the National COVID Cohort Collaborative (N3C) with a COVID-19 diagnosis between September 1, 2021, and December 1, 2022, and a comorbid depressive disorder, the leading indication for SSRI use, we evaluated the relationship between SSRI use during acute COVID-19 and subsequent 12-month risk of Long COVID (defined by ICD-10 code U09.9). We defined SSRI use as a prescription for SSRI medication beginning at least 30 days before acute COVID-19 and not ending before SARS-CoV-2 infection. To minimize bias, we estimated relationships using nonparametric targeted maximum likelihood estimation to aggressively adjust for high-dimensional covariates. RESULTS: We analyzed a sample (n = 302,626) of patients with a diagnosis of a depressive condition before COVID-19 diagnosis, of whom 100,803 (33%) were using an SSRI. We found that SSRI users had a significantly lower risk of Long COVID compared to nonusers (adjusted causal relative risk 0.92, 95% CI (0.86, 0.99)), and we found a similar relationship comparing new SSRI users (first SSRI prescription 1 to 4 months before acute COVID-19 with no prior history of SSRI use) to nonusers (adjusted causal relative risk 0.89, 95% CI (0.80, 0.98)). CONCLUSIONS: These findings suggest that SSRI use during acute COVID-19 may be protective against Long COVID, supporting the hypothesis that serotonin may be a key mechanistic biomarker of Long COVID.
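The abstract does not spell out the estimation steps. The following is a minimal sketch of a standard TMLE targeting step for a causal risk ratio with binary treatment and outcome; the learners, variable names, and truncation levels are illustrative assumptions, not those of the study:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def tmle_risk_ratio(W, A, Y):
    """Toy TMLE of the causal risk ratio E[Y(1)] / E[Y(0)] for binary A and Y."""
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    logit = lambda p: np.log(p / (1.0 - p))
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6)
    # 1) Initial fits: outcome regression Q(A, W) and propensity score g(W).
    Q_fit = GradientBoostingClassifier().fit(np.column_stack([A, W]), Y)
    g1 = LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1]
    g1 = np.clip(g1, 0.025, 0.975)                      # bound the propensity scores
    Q_A = clip(Q_fit.predict_proba(np.column_stack([A, W]))[:, 1])
    Q_1 = clip(Q_fit.predict_proba(np.column_stack([np.ones_like(A), W]))[:, 1])
    Q_0 = clip(Q_fit.predict_proba(np.column_stack([np.zeros_like(A), W]))[:, 1])
    # 2) Targeting: logistic fluctuation on the "clever covariates" with offset logit(Q).
    H = np.column_stack([A / g1, (1 - A) / (1 - g1)])
    eps1, eps0 = sm.GLM(Y, H, family=sm.families.Binomial(),
                        offset=logit(Q_A)).fit().params
    Q_1_star = expit(logit(Q_1) + eps1 / g1)
    Q_0_star = expit(logit(Q_0) + eps0 / (1 - g1))
    # 3) Substitution (plug-in) estimates of the counterfactual risks and their ratio.
    return Q_1_star.mean() / Q_0_star.mean()
```

In practice the initial fits would come from a flexible ensemble rather than a single learner, and inference would use the efficient influence curve; this sketch only shows the targeting logic.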
Subject(s)
COVID-19 , SARS-CoV-2 , Selective Serotonin Reuptake Inhibitors , Humans , COVID-19/epidemiology , COVID-19/complications , Selective Serotonin Reuptake Inhibitors/therapeutic use , Female , Male , Middle Aged , SARS-CoV-2/drug effects , Adult , Aged , Depression/drug therapy , Pandemics , Post-Acute COVID-19 Syndrome , Coronavirus Infections/drug therapy , Coronavirus Infections/epidemiology , Coronavirus Infections/complications , Betacoronavirus/drug effects , Pneumonia, Viral/drug therapy , Pneumonia, Viral/epidemiology , Risk Factors
ABSTRACT
BACKGROUND: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology of this disorder. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. OBJECTIVE: Using a sample of 55,257 patients (at a ratio of 1 patient with PASC to 4 matched controls) from the National COVID Cohort Collaborative, as part of the National Institutes of Health Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. The National COVID Cohort Collaborative includes electronic health records for more than 22 million patients from 84 sites across the United States. METHODS: We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal combination of gradient boosting and random forest algorithms to maximize the area under the receiver operating characteristic curve. We evaluated variable importance (Shapley values) at 3 levels: individual features, temporal windows, and clinical domains. We externally validated these findings using a holdout set of randomly selected study sites. RESULTS: We were able to predict individual PASC diagnoses accurately (area under the curve 0.874). The individual features of the length of observation period, number of health care interactions during acute COVID-19, and viral lower respiratory infection were the most predictive of subsequent PASC diagnosis. Temporally, we found that baseline characteristics were the most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after acute COVID-19. We found that the clinical domains of health care use, demographics or anthropometry, and respiratory factors were the most predictive of PASC diagnosis. CONCLUSIONS: The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings. Across individual predictors and clinical domains, we consistently found that factors related to health care use were the strongest predictors of PASC diagnosis. This indicates that any observational studies using PASC diagnosis as a primary outcome must rigorously account for heterogeneous health care use. Our temporal findings support the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients before acute COVID-19 diagnosis, which could improve early interventions and preventive care. Our findings also highlight the importance of respiratory characteristics in PASC risk assessment. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2023.07.27.23293272.
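A hedged sketch of the stacking idea described above, using scikit-learn's StackingClassifier as a stand-in for the Super Learner software actually used; the logistic meta-learner and the synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Candidate learners mirroring the abstract's library (gradient boosting, random forest).
candidates = [("gbm", GradientBoostingClassifier()),
              ("rf", RandomForestClassifier(n_estimators=500))]

# Stacking: the meta-learner is fit on out-of-fold candidate predictions.
stack = StackingClassifier(estimators=candidates,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)

# Stand-in data; in the study, X would hold the curated EHR covariates and y PASC status.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
auc = cross_val_score(stack, X, y, scoring="roc_auc", cv=5).mean()
print(f"cross-validated AUC: {auc:.3f}")
```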
Subject(s)
COVID-19 , Post-Acute COVID-19 Syndrome , Humans , COVID-19/epidemiology , Cohort Studies , Female , Male , United States/epidemiology , Middle Aged , Aged , Adult , Risk Factors , Machine Learning
ABSTRACT
We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic results are the first to allow for dependence of each observation on a growing number of other units as sample size increases. In addition, while previous methods have implicitly permitted only one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties and dependence due to latent similarities among nodes sharing ties. We propose new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. We use our methods to reanalyze an influential and controversial study that estimated causal peer effects of obesity using social network data from the Framingham Heart Study; after accounting for network structure we find no evidence for causal peer effects.
ABSTRACT
Strategic test allocation is important for control of both emerging and existing pandemics (eg, COVID-19, HIV). It supports effective epidemic control by (1) reducing transmission via identifying cases and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest (positive infection status) is often a latent variable. In addition, the presence of both network and temporal dependence reduces data to a single observation. In this work, we study an adaptive sequential design, which allows for unspecified dependence among individuals and across time. Our causal parameter is the mean latent outcome we would have obtained if, starting at time t given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. The key strength of the method is that we do not have to model network and time dependence: an Online Super Learner based on short-term performance is used to select among dependence models and randomization schemes. The proposed strategy learns the optimal choice of testing over time while adapting to the current state of the outbreak and learning across samples, through time, or both. We demonstrate the superior performance of the proposed strategy in an agent-based simulation modeling a residential university environment during the COVID-19 pandemic.
Subject(s)
COVID-19 , Communicable Diseases , Humans , Pandemics/prevention & control , COVID-19/epidemiology , Computer Simulation , Disease Outbreaks
ABSTRACT
Sustainable Development Goal 2.2 (to end malnutrition by 2030) includes the elimination of child wasting, defined as a weight-for-length z-score that is more than two standard deviations below the median of the World Health Organization standards for child growth1. Prevailing methods to measure wasting rely on cross-sectional surveys that cannot measure onset, recovery and persistence, key features that inform preventive interventions and estimates of disease burden. Here we analyse 21 longitudinal cohorts and show that wasting is a highly dynamic process of onset and recovery, with incidence peaking between birth and 3 months. Many more children experience an episode of wasting at some point during their first 24 months than prevalent cases at a single point in time suggest. For example, at the age of 24 months, 5.6% of children were wasted, but by the same age, 29.2% of children had experienced at least one wasting episode and 10.0% had experienced two or more episodes. Children who were wasted before the age of 6 months had a faster recovery and shorter episodes than did children who were wasted at older ages; however, early wasting increased the risk of later growth faltering, including concurrent wasting and stunting (low length-for-age z-score), and thus increased the risk of mortality. In diverse populations with high seasonal rainfall, the population average weight-for-length z-score varied substantially (more than 0.5 z in some cohorts), with the lowest mean z-scores occurring during the rainiest months; this indicates that seasonally targeted interventions could be considered. Our results show the importance of establishing interventions to prevent wasting from birth to the age of 6 months, probably through improved maternal nutrition, to complement current programmes that focus on children aged 6-59 months.
Subject(s)
Cachexia , Developing Countries , Growth Disorders , Malnutrition , Child, Preschool , Humans , Infant , Infant, Newborn , Cachexia/epidemiology , Cachexia/mortality , Cachexia/prevention & control , Cross-Sectional Studies , Growth Disorders/epidemiology , Growth Disorders/mortality , Growth Disorders/prevention & control , Incidence , Longitudinal Studies , Malnutrition/epidemiology , Malnutrition/mortality , Malnutrition/prevention & control , Rain , Seasons
ABSTRACT
Globally, 149 million children under 5 years of age are estimated to be stunted (length more than 2 standard deviations below international growth standards)1,2. Stunting, a form of linear growth faltering, increases the risk of illness, impaired cognitive development and mortality. Global stunting estimates rely on cross-sectional surveys, which cannot provide direct information about the timing of onset or persistence of growth faltering, a key consideration for defining critical windows in which to deliver preventive interventions. Here we completed a pooled analysis of longitudinal studies in low- and middle-income countries (n = 32 cohorts, 52,640 children, ages 0-24 months), allowing us to identify the typical age of onset of linear growth faltering and to investigate recurrent faltering in early life. The highest incidence of stunting onset occurred from birth to the age of 3 months, with substantially higher stunting at birth in South Asia. From 0 to 15 months, stunting reversal was rare; children who reversed their stunting status frequently relapsed, and relapse rates were substantially higher among children born stunted. Early onset and low reversal rates suggest that improving children's linear growth will require life-course interventions for women of childbearing age and a greater emphasis on interventions for children under 6 months of age.
Subject(s)
Developing Countries , Growth Disorders , Adult , Child, Preschool , Female , Humans , Infant , Infant, Newborn , South Asia/epidemiology , Cognition , Cross-Sectional Studies , Developing Countries/statistics & numerical data , Developmental Disabilities/epidemiology , Developmental Disabilities/mortality , Developmental Disabilities/prevention & control , Growth Disorders/epidemiology , Growth Disorders/mortality , Growth Disorders/prevention & control , Longitudinal Studies , Mothers
ABSTRACT
Growth faltering in children (low length for age or low weight for length) during the first 1,000 days of life (from conception to 2 years of age) influences short-term and long-term health and survival1,2. Interventions such as nutritional supplementation during pregnancy and the postnatal period could help prevent growth faltering, but programmatic action has been insufficient to eliminate the high burden of stunting and wasting in low- and middle-income countries. Identification of age windows and population subgroups on which to focus will benefit future preventive efforts. Here we use a population intervention effects analysis of 33 longitudinal cohorts (83,671 children, 662,763 measurements) and 30 separate exposures to show that improving maternal anthropometry and child condition at birth accounted for population increases in length-for-age z-scores of up to 0.40 and weight-for-length z-scores of up to 0.15 by 24 months of age. Boys had consistently higher risk of all forms of growth faltering than girls. Early postnatal growth faltering predisposed children to subsequent and persistent growth faltering. Children with multiple growth deficits exhibited higher mortality rates from birth to 2 years of age than children without growth deficits (hazard ratios 1.9 to 8.7). The importance of prenatal causes and severe consequences for children who experienced early growth faltering support a focus on pre-conception and pregnancy as a key opportunity for new preventive interventions.
Subject(s)
Cachexia , Developing Countries , Growth Disorders , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Male , Pregnancy , Cachexia/economics , Cachexia/epidemiology , Cachexia/etiology , Cachexia/prevention & control , Cohort Studies , Developing Countries/economics , Developing Countries/statistics & numerical data , Dietary Supplements , Growth Disorders/epidemiology , Growth Disorders/prevention & control , Longitudinal Studies , Mothers , Sex Factors , Malnutrition/economics , Malnutrition/epidemiology , Malnutrition/etiology , Malnutrition/prevention & control , Anthropometry
ABSTRACT
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In numerical experiments, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimension reduction application to single-cell transcriptome sequencing data.
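A simplified sketch of cross-validated estimator selection for the covariance matrix, using held-out Gaussian log-likelihood as a stand-in for the paper's proposed loss class and scikit-learn's built-in candidate estimators (the data and candidates are illustrative assumptions):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf, OAS
from sklearn.model_selection import KFold

def cv_select_covariance(X, candidates, n_splits=5):
    """Choose the covariance estimator with the best held-out Gaussian log-likelihood."""
    scores = {name: [] for name, _ in candidates}
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        for name, est in candidates:
            est.fit(X[train])
            scores[name].append(est.score(X[test]))   # log-likelihood of held-out rows
    return max(scores, key=lambda name: np.mean(scores[name]))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))                    # n only modestly larger than p
best = cv_select_covariance(X, [("sample", EmpiricalCovariance()),
                                ("ledoit_wolf", LedoitWolf()),
                                ("oas", OAS())])
print("selected estimator:", best)
```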
ABSTRACT
Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the causal effect of interest (eg, at the individual level or at the cluster level). Second, the theoretical and practical performance of common methods for CRT analysis remains poorly understood. Here, we present a general framework to formally define an array of causal effects in terms of summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of CRT estimators, including the t-test, generalized estimating equations (GEE), augmented-GEE, and targeted maximum likelihood estimation (TMLE). Using finite sample simulations, we illustrate the practical performance of these estimators for different causal effects and when, as commonly occurs, there are limited numbers of clusters of different sizes. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world impact of varying cluster sizes and of targeting effects at the cluster level or at the individual level. Specifically, the relative effect of the PTBi intervention was 0.81 at the cluster level, corresponding to a 19% reduction in outcome incidence, and was 0.66 at the individual level, corresponding to a 34% reduction in outcome risk. Given its flexibility to estimate a variety of user-specified effects and its ability to adaptively adjust for covariates for precision gains while maintaining Type-I error control, we conclude that TMLE is a promising tool for CRT analysis.
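A toy numerical illustration (made-up numbers, not PTBi data) of why cluster-level and individual-level summaries generally differ when cluster sizes vary:

```python
import pandas as pd

# Made-up counterfactual risks for four individuals in three clusters ("a" is largest).
df = pd.DataFrame({"cluster": ["a", "a", "b", "c"],
                   "risk1":   [0.10, 0.12, 0.30, 0.20],   # risk under treatment
                   "risk0":   [0.15, 0.15, 0.35, 0.40]})  # risk under control

# Cluster-level effect: average within clusters first, so every cluster counts equally.
by_cluster = df.groupby("cluster")[["risk1", "risk0"]].mean()
rr_cluster = by_cluster["risk1"].mean() / by_cluster["risk0"].mean()

# Individual-level effect: pool individuals, so larger clusters carry more weight.
rr_individual = df["risk1"].mean() / df["risk0"].mean()
print(round(rr_cluster, 3), round(rr_individual, 3))      # the two summaries differ
```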
Subject(s)
Premature Birth , Infant, Newborn , Female , Humans , Computer Simulation , Randomized Controlled Trials as Topic , Sample Size , Causality , Cluster Analysis
ABSTRACT
In this work we introduce the personalized online super learner (POSL), an online personalizable ensemble machine learning algorithm for streaming data. POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized, that is, optimization with respect to subject ID, to many individuals, that is, optimization with respect to common baseline covariates. As an online algorithm, POSL learns in real time. As a super learner, POSL is grounded in statistical optimality theory and can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed/offline algorithms that are not updated during POSL's fitting procedure, pooled algorithms that learn from many individuals' time series, and individualized algorithms that learn from within a single time series. POSL's ensembling of the candidates can depend on the amount of data collected, the stationarity of the time series, and the mutual characteristics of a group of time series. Depending on the underlying data-generating process and the information available in the data, POSL is able to adapt to learning across samples, through time, or both. For a range of simulations that reflect realistic forecasting scenarios and in a medical application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for both short and long time series and that it is able to adjust to changing data-generating environments. We further extend POSL to settings where time series dynamically enter and exit, enhancing its practicality.
Subject(s)
Algorithms , Machine Learning , Humans
ABSTRACT
Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
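As a minimal illustration of two of those specification choices (the candidate library and the cross-validated risk used to compare learners), the sketch below computes the cross-validated risk of each candidate; selecting the minimizer is the "discrete" super learner, while the full SL learns a weighted combination. The data, library, and loss here are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data and a small candidate library; in practice the library, the loss,
# and the cross-validation scheme are the key SL specification choices.
X, y = make_classification(n_samples=1000, n_features=15, random_state=1)
library = {"glm": LogisticRegression(max_iter=1000),
           "rf": RandomForestClassifier(n_estimators=300),
           "gbm": GradientBoostingClassifier()}

# Cross-validated risk (log loss) per candidate; picking the minimizer is the
# "discrete" super learner.
cv_risk = {name: -cross_val_score(est, X, y, scoring="neg_log_loss", cv=10).mean()
           for name, est in library.items()}
print(cv_risk, "->", min(cv_risk, key=cv_risk.get))
```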
Subject(s)
Algorithms , Machine Learning , Humans
ABSTRACT
This work considers targeted maximum likelihood estimation (TMLE) of treatment effects on absolute risk and survival probabilities in classical time-to-event settings characterized by right-censoring and competing risks. TMLE is a general methodology combining flexible ensemble learning and semiparametric efficiency theory in a two-step procedure for substitution estimation of causal parameters. We specialize and extend the continuous-time TMLE methods for competing risks settings, proposing a targeting algorithm that iteratively updates cause-specific hazards to solve the efficient influence curve equation for the target parameter. As part of the work, we further detail and implement the recently proposed highly adaptive lasso estimator for continuous-time conditional hazards with L1-penalized Poisson regression. The resulting estimation procedure benefits from relying solely on very mild nonparametric restrictions on the statistical model, thus providing a novel tool for machine-learning-based semiparametric causal inference for continuous-time time-to-event data. We apply the methods to a publicly available dataset on follicular cell lymphoma where subjects are followed over time until disease relapse or death without relapse. The data display important time-varying effects that can be captured by the highly adaptive lasso. In our simulations that are designed to imitate the data, we compare our methods to a similar approach based on random survival forests and to the discrete-time TMLE.
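For reference, with standard competing-risks notation (assumed here, not quoted from the article), the absolute risk of cause j is obtained from the cause-specific hazards and the overall survival function via

```latex
F_j(t \mid a, w) = \int_0^t S(s^- \mid a, w)\,\lambda_j(s \mid a, w)\,\mathrm{d}s,
\qquad
S(t \mid a, w) = \exp\!\Bigl(-\sum_{k}\int_0^t \lambda_k(s \mid a, w)\,\mathrm{d}s\Bigr),
```

which is why a targeting step that updates the cause-specific hazards also updates the absolute risk and survival parameters of interest.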
Subject(s)
Algorithms , Models, Statistical , Humans , Likelihood Functions , Machine Learning , Recurrence
ABSTRACT
The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
Subject(s)
Biology , Research Design , Sample Size , Causality
ABSTRACT
Inverse-probability-weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudopopulation in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse-probability-weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at a nearly n^{-1/3} rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse-probability-weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.
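The classical form of the estimator being generalized is, for the treatment-specific mean E[Y(1)] and in standard notation assumed here,

```latex
\psi_n^{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}(A_i = 1)}{g_n(W_i)}\, Y_i,
\qquad g_n(W) \approx g_0(W) = P_0(A = 1 \mid W),
```

where the article's proposal estimates g_n with an undersmoothed highly adaptive lasso rather than a parametric model.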
Subject(s)
Models, Statistical , Probability , Computer Simulation , Selection Bias , Causality
ABSTRACT
We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. The highly adaptive lasso (HAL) estimator of the functional parameter is defined as the minimizer of the empirical risk over a class of càdlàg functions with finite sectional variation norm, where the functional parameter is parametrized in terms of such a class of functions. In this article we establish that this HAL estimator yields an asymptotically efficient estimator of any smooth feature of the functional parameter under a global undersmoothing condition. It is formally shown that the L1-restriction in HAL does not obstruct it from solving the score equations along paths that do not enforce this restriction. Therefore, from an asymptotic point of view, the only reason for undersmoothing is that the true target function might not be complex, so that the HAL fit leaves out key basis functions that are needed to span the desired efficient influence curve of the smooth target parameter. Nonetheless, in practice undersmoothing appears to be beneficial, and a simple targeted method is proposed and practically verified to perform well. We demonstrate our general result for the HAL estimator of a treatment-specific mean and of the integrated squared density. We also present simulations for these two examples confirming the theory.
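A toy sketch of the zero-order HAL construction implied above: indicator basis functions are generated at observed knot points for every subset of coordinates, and an L1-penalized (lasso) fit bounds the sectional variation norm. Selecting the penalty by cross-validation, as below, is not the undersmoothing advocated in the article, the brute-force basis is only feasible for a handful of covariates, and production implementations (e.g., the hal9001 R package) handle this far more efficiently:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def hal_basis(X, knots):
    """Zero-order HAL basis: indicators 1(x_s >= knot_s) for every nonempty subset s
    of coordinates, with knots placed at the observed data points (small d only)."""
    n, d = X.shape
    cols = []
    for size in range(1, d + 1):
        for s in combinations(range(d), size):
            for k in knots:
                cols.append(np.all(X[:, list(s)] >= k[list(s)], axis=1).astype(float))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(200)

H = hal_basis(X, knots=X)            # knots at the observed values themselves
fit = LassoCV(cv=5).fit(H, y)        # L1 bound plays the role of the variation norm
```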
Subject(s)
Employment , Likelihood Functions
ABSTRACT
An endeavor central to precision medicine is predictive biomarker discovery: such biomarkers define patient subpopulations which stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher than expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development. Patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is doubly robust and asymptotically linear under loose conditions on the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized controlled trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.
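One common way to operationalize such a variable importance parameter, shown purely as an illustrative sketch and not as the uniCATE algorithm itself, is to regress a doubly robust (AIPW) pseudo-outcome for the treatment effect on each biomarker separately and treat the slope as that biomarker's effect-modification importance; all names below are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def univariate_importance(W, A, Y, biomarkers):
    """Slope of a doubly robust (AIPW) treatment-effect pseudo-outcome on each
    biomarker, used as a simple effect-modification importance measure."""
    g = LogisticRegression(max_iter=1000).fit(W, A).predict_proba(W)[:, 1]
    g = np.clip(g, 0.05, 0.95)                               # bound propensity scores
    Q = RandomForestRegressor(random_state=0).fit(np.column_stack([A, W]), Y)
    Q1 = Q.predict(np.column_stack([np.ones_like(A), W]))
    Q0 = Q.predict(np.column_stack([np.zeros_like(A), W]))
    QA = np.where(A == 1, Q1, Q0)
    phi = (A / g - (1 - A) / (1 - g)) * (Y - QA) + Q1 - Q0   # AIPW pseudo-outcome
    Bc = biomarkers - biomarkers.mean(axis=0)                # center each biomarker
    return Bc.T @ (phi - phi.mean()) / (Bc ** 2).sum(axis=0) # per-biomarker slopes
```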
Subject(s)
Biomedical Research , Carcinoma, Renal Cell , Kidney Neoplasms , Humans , Carcinoma, Renal Cell/diagnosis , Carcinoma, Renal Cell/genetics , Kidney Neoplasms/diagnosis , Kidney Neoplasms/genetics , Biomarkers , Computer Simulation
ABSTRACT
In biomedical science, analyzing treatment effect heterogeneity plays an essential role in assisting personalized medicine. The main goals of analyzing treatment effect heterogeneity include estimating treatment effects in clinically relevant subgroups and predicting whether a patient subpopulation might benefit from a particular treatment. Conventional approaches often evaluate the subgroup treatment effects via parametric modeling and can thus be susceptible to model misspecification. In this paper, we take a model-free semiparametric perspective and aim to efficiently evaluate the heterogeneous treatment effects of multiple subgroups simultaneously under the one-step targeted maximum likelihood estimation (TMLE) framework. When the number of subgroups is large, we further expand this path of research by considering a variation of the one-step TMLE that is robust to the presence of small estimated propensity scores in finite samples. In our simulations, our method demonstrates substantial finite-sample improvements compared to conventional methods. In a case study, our method unveils the potential treatment effect heterogeneity of the rs12916-T allele (a proxy for statin usage) in decreasing Alzheimer's disease risk.