Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 253
Filtrar
1.
Epidemiology ; 2024 Aug 01.
Artículo en Inglés | MEDLINE | ID: mdl-39087681

RESUMEN

The Causal Roadmap outlines a systematic approach to asking and answering questions of cause and effect: define the quantity of interest, evaluate needed assumptions, conduct statistical estimation, and carefully interpret results. To protect research integrity, it is essential that the algorithm for statistical estimation and inference be prespecified prior to conducting any effectiveness analyses. However, it is often unclear which algorithm will perform optimally for the real-data application. Instead, there is a temptation to simply implement one's favorite algorithm, recycling prior code or relying on the default settings of a computing package. Here, we call for the use of simulations that realistically reflect the application, including key characteristics such as strong confounding and dependent or missing outcomes, to objectively compare candidate estimators and facilitate full specification of the statistical analysis plan. Such simulations are informed by the Causal Roadmap and conducted after data collection but prior to effect estimation. We illustrate with two worked examples. First, in an observational longitudinal study, we use outcome-blind simulations to inform nuisance parameter estimation and variance estimation for longitudinal targeted minimum loss-based estimation. Second, in a cluster randomized trial with missing outcomes, we use treatment-blind simulations to examine type-I error control in two-stage targeted minimum loss-based estimation. In both examples, realistic simulations empower us to prespecify an estimation approach that is expected to have strong finite sample performance and also yield quality-controlled computing code for the actual analysis. Together, this process helps to improve the rigor and reproducibility of our research.

2.
JMIR Public Health Surveill ; 10: e53322, 2024 08 15.
Artículo en Inglés | MEDLINE | ID: mdl-39146534

RESUMEN

BACKGROUND: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology of this disorder. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. OBJECTIVE: Using a sample of 55,257 patients (at a ratio of 1 patient with PASC to 4 matched controls) from the National COVID Cohort Collaborative, as part of the National Institutes of Health Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. The National COVID Cohort Collaborative includes electronic health records for more than 22 million patients from 84 sites across the United States. METHODS: We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal combination of gradient boosting and random forest algorithms to maximize the area under the receiver operator curve. We evaluated variable importance (Shapley values) based on 3 levels: individual features, temporal windows, and clinical domains. We externally validated these findings using a holdout set of randomly selected study sites. RESULTS: We were able to predict individual PASC diagnoses accurately (area under the curve 0.874). The individual features of the length of observation period, number of health care interactions during acute COVID-19, and viral lower respiratory infection were the most predictive of subsequent PASC diagnosis. Temporally, we found that baseline characteristics were the most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after acute COVID-19. We found that the clinical domains of health care use, demographics or anthropometry, and respiratory factors were the most predictive of PASC diagnosis. CONCLUSIONS: The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings. Across individual predictors and clinical domains, we consistently found that factors related to health care use were the strongest predictors of PASC diagnosis. This indicates that any observational studies using PASC diagnosis as a primary outcome must rigorously account for heterogeneous health care use. Our temporal findings support the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients before acute COVID-19 diagnosis, which could improve early interventions and preventive care. Our findings also highlight the importance of respiratory characteristics in PASC risk assessment. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2023.07.27.23293272.


Asunto(s)
COVID-19 , Síndrome Post Agudo de COVID-19 , Humanos , COVID-19/epidemiología , Estudios de Cohortes , Femenino , Masculino , Estados Unidos/epidemiología , Persona de Mediana Edad , Anciano , Adulto , Factores de Riesgo , Aprendizaje Automático
3.
Nat Med ; 2024 Jul 04.
Artículo en Inglés | MEDLINE | ID: mdl-38965434

RESUMEN

Malaria-elimination interventions aim to extinguish hotspots and prevent transmission to nearby areas. Here, we re-analyzed a cluster-randomized trial of reactive, focal interventions (chemoprevention using artemether-lumefantrine and/or indoor residual spraying with pirimiphos-methyl) delivered within 500 m of confirmed malaria index cases in Namibia to measure direct effects (among intervention recipients within 500 m) and spillover effects (among non-intervention recipients within 3 km) on incidence, prevalence and seroprevalence. There was no or weak evidence of direct effects, but the sample size of intervention recipients was small, limiting statistical power. There was the strongest evidence of spillover effects of combined chemoprevention and indoor residual spraying. Among non-recipients within 1 km of index cases, the combined intervention reduced malaria incidence by 43% (95% confidence interval, 20-59%). In analyses among non-recipients within 3 km of interventions, the combined intervention reduced infection prevalence by 79% (6-95%) and seroprevalence, which captures recent infections and has higher statistical power, by 34% (20-45%). Accounting for spillover effects increased the cost-effectiveness of the combined intervention by 42%. Targeting hotspots with combined chemoprevention and vector-control interventions can indirectly benefit non-recipients up to 3 km away.

4.
J Am Stat Assoc ; 119(545): 597-611, 2024.
Artículo en Inglés | MEDLINE | ID: mdl-38800714

RESUMEN

We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic results are the first to allow for dependence of each observation on a growing number of other units as sample size increases. In addition, while previous methods have implicitly permitted only one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties and for dependence due to latent similarities among nodes sharing ties. We propose new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. We use our methods to reanalyze an influential and controversial study that estimated causal peer effects of obesity using social network data from the Framingham Heart Study; after accounting for network structure we find no evidence for causal peer effects.

5.
Am J Epidemiol ; 2024 Mar 21.
Artículo en Inglés | MEDLINE | ID: mdl-38517025

RESUMEN

Lasso regression is widely used for large-scale propensity score (PS) estimation in healthcare database studies. In these settings, previous work has shown that undersmoothing (overfitting) Lasso PS models can improve confounding control, but it can also cause problems of non-overlap in covariate distributions. It remains unclear how to select the degree of undersmoothing when fitting large-scale Lasso PS models to improve confounding control while avoiding issues that can result from reduced covariate overlap. Here, we used simulations to evaluate the performance of using collaborative-controlled targeted learning to data-adaptively select the degree of undersmoothing when fitting large-scale PS models within both singly and doubly robust frameworks to reduce bias in causal estimators. Simulations showed that collaborative learning can data-adaptively select the degree of undersmoothing to reduce bias in estimated treatment effects. Results further showed that when fitting undersmoothed Lasso PS-models, the use of cross-fitting was important for avoiding non-overlap in covariate distributions and reducing bias in causal estimates.

6.
Biometrics ; 80(1)2024 Jan 29.
Artículo en Inglés | MEDLINE | ID: mdl-38281772

RESUMEN

Strategic test allocation is important for control of both emerging and existing pandemics (eg, COVID-19, HIV). It supports effective epidemic control by (1) reducing transmission via identifying cases and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest (positive infection status) is often a latent variable. In addition, presence of both network and temporal dependence reduces data to a single observation. In this work, we study an adaptive sequential design, which allows for unspecified dependence among individuals and across time. Our causal parameter is the mean latent outcome we would have obtained, if, starting at time t given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. The key strength of the method is that we do not have to model network and time dependence: a short-term performance Online Super Learner is used to select among dependence models and randomization schemes. The proposed strategy learns the optimal choice of testing over time while adapting to the current state of the outbreak and learning across samples, through time, or both. We demonstrate the superior performance of the proposed strategy in an agent-based simulation modeling a residential university environment during the COVID-19 pandemic.


Asunto(s)
COVID-19 , Enfermedades Transmisibles , Humanos , Pandemias/prevención & control , COVID-19/epidemiología , Simulación por Computador , Brotes de Enfermedades
7.
J Clin Transl Sci ; 7(1): e231, 2023.
Artículo en Inglés | MEDLINE | ID: mdl-38028337

RESUMEN

Introduction: Increasing interest in real-world evidence has fueled the development of study designs incorporating real-world data (RWD). Using the Causal Roadmap, we specify three designs to evaluate the difference in risk of major adverse cardiovascular events (MACE) with oral semaglutide versus standard-of-care: (1) the actual sequence of non-inferiority and superiority randomized controlled trials (RCTs), (2) a single RCT, and (3) a hybrid randomized-external data study. Methods: The hybrid design considers integration of the PIONEER 6 RCT with RWD controls using the experiment-selector cross-validated targeted maximum likelihood estimator. We evaluate 95% confidence interval coverage, power, and average patient time during which participants would be precluded from receiving a glucagon-like peptide-1 receptor agonist (GLP1-RA) for each design using simulations. Finally, we estimate the effect of oral semaglutide on MACE for the hybrid PIONEER 6-RWD analysis. Results: In simulations, Designs 1 and 2 performed similarly. The tradeoff between decreased coverage and patient time without the possibility of a GLP1-RA for Designs 1 and 3 depended on the simulated bias. In real data analysis using Design 3, external controls were integrated in 84% of cross-validation folds, resulting in an estimated risk difference of -1.53%-points (95% CI -2.75%-points to -0.30%-points). Conclusions: The Causal Roadmap helps investigators to minimize potential bias in studies using RWD and to quantify tradeoffs between study designs. The simulation results help to interpret the level of evidence provided by the real data analysis in support of the superiority of oral semaglutide versus standard-of-care for cardiovascular risk reduction.

8.
J Clin Transl Sci ; 7(1): e212, 2023.
Artículo en Inglés | MEDLINE | ID: mdl-37900353

RESUMEN

Increasing emphasis on the use of real-world evidence (RWE) to support clinical policy and regulatory decision-making has led to a proliferation of guidance, advice, and frameworks from regulatory agencies, academia, professional societies, and industry. A broad spectrum of studies use real-world data (RWD) to produce RWE, ranging from randomized trials with outcomes assessed using RWD to fully observational studies. Yet, many proposals for generating RWE lack sufficient detail, and many analyses of RWD suffer from implausible assumptions, other methodological flaws, or inappropriate interpretations. The Causal Roadmap is an explicit, itemized, iterative process that guides investigators to prespecify study design and analysis plans; it addresses a wide range of guidance within a single framework. By supporting the transparent evaluation of causal assumptions and facilitating objective comparisons of design and analysis choices based on prespecified criteria, the Roadmap can help investigators to evaluate the quality of evidence that a given study is likely to produce, specify a study to generate high-quality RWE, and communicate effectively with regulatory agencies and other stakeholders. This paper aims to disseminate and extend the Causal Roadmap framework for use by clinical and translational researchers; three companion papers demonstrate applications of the Causal Roadmap for specific use cases.

9.
medRxiv ; 2023 Nov 30.
Artículo en Inglés | MEDLINE | ID: mdl-37790419

RESUMEN

Malaria elimination interventions in low-transmission settings aim to extinguish hot spots and prevent transmission to nearby areas. In malaria elimination settings, the World Health Organization recommends reactive, focal interventions targeted to the area near malaria cases shortly after they are detected. A key question is whether these interventions reduce transmission to nearby uninfected or asymptomatic individuals who did not receive interventions. Here, we measured direct effects (among intervention recipients) and spillover effects (among non-recipients) of reactive, focal interventions delivered within 500m of confirmed malaria index cases in a cluster-randomized trial in Namibia. The trial delivered malaria chemoprevention (artemether lumefantrine) and vector control (indoor residual spraying with Actellic) separately and in combination using a factorial design. We compared incidence, infection prevalence, and seroprevalence between study arms among intervention recipients (direct effects) and non-recipients (spillover effects) up to 3 km away from index cases. We calculated incremental cost-effectiveness ratios accounting for spillover effects. The combined chemoprevention and vector control intervention produced direct effects and spillover effects. In the primary analysis among non-recipients within 1 km from index cases, the combined intervention reduced malaria incidence by 43% (95% CI 20%, 59%). In secondary analyses among non-recipients 500m-3 km from interventions, the combined intervention reduced infection by 79% (6%, 95%) and seroprevalence 34% (20%, 45%). Accounting for spillover effects increased the cost-effectiveness of the combined intervention by 37%. Our findings provide the first evidence that targeting hot spots with combined chemoprevention and vector control interventions can indirectly benefit non-recipients up to 3 km away.

12.
Nature ; 621(7979): 558-567, 2023 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-37704720

RESUMEN

Sustainable Development Goal 2.2-to end malnutrition by 2030-includes the elimination of child wasting, defined as a weight-for-length z-score that is more than two standard deviations below the median of the World Health Organization standards for child growth1. Prevailing methods to measure wasting rely on cross-sectional surveys that cannot measure onset, recovery and persistence-key features that inform preventive interventions and estimates of disease burden. Here we analyse 21 longitudinal cohorts and show that wasting is a highly dynamic process of onset and recovery, with incidence peaking between birth and 3 months. Many more children experience an episode of wasting at some point during their first 24 months than prevalent cases at a single point in time suggest. For example, at the age of 24 months, 5.6% of children were wasted, but by the same age (24 months), 29.2% of children had experienced at least one wasting episode and 10.0% had experienced two or more episodes. Children who were wasted before the age of 6 months had a faster recovery and shorter episodes than did children who were wasted at older ages; however, early wasting increased the risk of later growth faltering, including concurrent wasting and stunting (low length-for-age z-score), and thus increased the risk of mortality. In diverse populations with high seasonal rainfall, the population average weight-for-length z-score varied substantially (more than 0.5 z in some cohorts), with the lowest mean z-scores occurring during the rainiest months; this indicates that seasonally targeted interventions could be considered. Our results show the importance of establishing interventions to prevent wasting from birth to the age of 6 months, probably through improved maternal nutrition, to complement current programmes that focus on children aged 6-59 months.


Asunto(s)
Caquexia , Países en Desarrollo , Trastornos del Crecimiento , Desnutrición , Preescolar , Humanos , Lactante , Recién Nacido , Caquexia/epidemiología , Caquexia/mortalidad , Caquexia/prevención & control , Estudios Transversales , Trastornos del Crecimiento/epidemiología , Trastornos del Crecimiento/mortalidad , Trastornos del Crecimiento/prevención & control , Incidencia , Estudios Longitudinales , Desnutrición/epidemiología , Desnutrición/mortalidad , Desnutrición/prevención & control , Lluvia , Estaciones del Año
13.
Nature ; 621(7979): 550-557, 2023 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-37704719

RESUMEN

Globally, 149 million children under 5 years of age are estimated to be stunted (length more than 2 standard deviations below international growth standards)1,2. Stunting, a form of linear growth faltering, increases the risk of illness, impaired cognitive development and mortality. Global stunting estimates rely on cross-sectional surveys, which cannot provide direct information about the timing of onset or persistence of growth faltering-a key consideration for defining critical windows to deliver preventive interventions. Here we completed a pooled analysis of longitudinal studies in low- and middle-income countries (n = 32 cohorts, 52,640 children, ages 0-24 months), allowing us to identify the typical age of onset of linear growth faltering and to investigate recurrent faltering in early life. The highest incidence of stunting onset occurred from birth to the age of 3 months, with substantially higher stunting at birth in South Asia. From 0 to 15 months, stunting reversal was rare; children who reversed their stunting status frequently relapsed, and relapse rates were substantially higher among children born stunted. Early onset and low reversal rates suggest that improving children's linear growth will require life course interventions for women of childbearing age and a greater emphasis on interventions for children under 6 months of age.


Asunto(s)
Países en Desarrollo , Trastornos del Crecimiento , Adulto , Preescolar , Femenino , Humanos , Lactante , Recién Nacido , Sur de Asia/epidemiología , Cognición , Estudios Transversales , Países en Desarrollo/estadística & datos numéricos , Discapacidades del Desarrollo/epidemiología , Discapacidades del Desarrollo/mortalidad , Discapacidades del Desarrollo/prevención & control , Trastornos del Crecimiento/epidemiología , Trastornos del Crecimiento/mortalidad , Trastornos del Crecimiento/prevención & control , Estudios Longitudinales , Madres
14.
Nature ; 621(7979): 568-576, 2023 Sep.
Artículo en Inglés | MEDLINE | ID: mdl-37704722

RESUMEN

Growth faltering in children (low length for age or low weight for length) during the first 1,000 days of life (from conception to 2 years of age) influences short-term and long-term health and survival1,2. Interventions such as nutritional supplementation during pregnancy and the postnatal period could help prevent growth faltering, but programmatic action has been insufficient to eliminate the high burden of stunting and wasting in low- and middle-income countries. Identification of age windows and population subgroups on which to focus will benefit future preventive efforts. Here we use a population intervention effects analysis of 33 longitudinal cohorts (83,671 children, 662,763 measurements) and 30 separate exposures to show that improving maternal anthropometry and child condition at birth accounted for population increases in length-for-age z-scores of up to 0.40 and weight-for-length z-scores of up to 0.15 by 24 months of age. Boys had consistently higher risk of all forms of growth faltering than girls. Early postnatal growth faltering predisposed children to subsequent and persistent growth faltering. Children with multiple growth deficits exhibited higher mortality rates from birth to 2 years of age than children without growth deficits (hazard ratios 1.9 to 8.7). The importance of prenatal causes and severe consequences for children who experienced early growth faltering support a focus on pre-conception and pregnancy as a key opportunity for new preventive interventions.


Asunto(s)
Caquexia , Países en Desarrollo , Trastornos del Crecimiento , Preescolar , Femenino , Humanos , Lactante , Recién Nacido , Masculino , Embarazo , Caquexia/economía , Caquexia/epidemiología , Caquexia/etiología , Caquexia/prevención & control , Estudios de Cohortes , Países en Desarrollo/economía , Países en Desarrollo/estadística & datos numéricos , Suplementos Dietéticos , Trastornos del Crecimiento/epidemiología , Trastornos del Crecimiento/prevención & control , Estudios Longitudinales , Madres , Factores Sexuales , Desnutrición/economía , Desnutrición/epidemiología , Desnutrición/etiología , Desnutrición/prevención & control , Antropometría
15.
BMC Med Res Methodol ; 23(1): 178, 2023 08 02.
Artículo en Inglés | MEDLINE | ID: mdl-37533017

RESUMEN

BACKGROUND: The Targeted Learning roadmap provides a systematic guide for generating and evaluating real-world evidence (RWE). From a regulatory perspective, RWE arises from diverse sources such as randomized controlled trials that make use of real-world data, observational studies, and other study designs. This paper illustrates a principled approach to assessing the validity and interpretability of RWE. METHODS: We applied the roadmap to a published observational study of the dose-response association between ritodrine hydrochloride and pulmonary edema among women pregnant with twins in Japan. The goal was to identify barriers to causal effect estimation beyond unmeasured confounding reported by the study's authors, and to explore potential options for overcoming the barriers that robustify results. RESULTS: Following the roadmap raised issues that led us to formulate alternative causal questions that produced more reliable, interpretable RWE. The process revealed a lack of information in the available data to identify a causal dose-response curve. However, under explicit assumptions the effect of treatment with any amount of ritodrine versus none, albeit a less ambitious parameter, can be estimated from data. CONCLUSIONS: Before RWE can be used in support of clinical and regulatory decision-making, its quality and reliability must be systematically evaluated. The TL roadmap prescribes how to carry out a thorough, transparent, and realistic assessment of RWE. We recommend this approach be a routine part of any decision-making process.


Asunto(s)
Proyectos de Investigación , Femenino , Humanos , Reproducibilidad de los Resultados , Japón , Ensayos Clínicos Controlados Aleatorios como Asunto
16.
Artículo en Inglés | MEDLINE | ID: mdl-37398941

RESUMEN

Statistical causal inference of mixed exposures has been limited by reliance on parametric models and, until recently, by researchers considering only one exposure at a time, usually estimated as a beta coefficient in a generalized linear regression model (GLM). This independent assessment of exposures poorly estimates the joint impact of a collection of the same exposures in a realistic exposure setting. Marginal methods for mixture variable selection such as ridge/lasso regression are biased by linear assumptions and the interactions modeled are chosen by the user. Clustering methods such as principal component regression lose both interpretability and valid inference. Newer mixture methods such as quantile g-computation (Keil et al., 2020) are biased by linear/additive assumptions. More flexible methods such as Bayesian kernel machine regression (BKMR)(Bobb et al., 2014) are sensitive to the choice of tuning parameters, are computationally taxing and lack an interpretable and robust summary statistic of dose-response relationships. No methods currently exist which finds the best flexible model to adjust for covariates while applying a non-parametric model that targets for interactions in a mixture and delivers valid inference for a target parameter. Non-parametric methods such as decision trees are a useful tool to evaluate combined exposures by finding partitions in the joint-exposure (mixture) space that best explain the variance in an outcome. However, current methods using decision trees to assess statistical inference for interactions are biased and are prone to overfitting by using the full data to both identify nodes in the tree and make statistical inference given these nodes. Other methods have used an independent test set to derive inference which does not use the full data. The CVtreeMLE R package provides researchers in (bio)statistics, epidemiology, and environmental health sciences with access to state-of-the-art statistical methodology for evaluating the causal effects of a data-adaptively determined mixed exposure using decision trees. Our target audience are those analysts who would normally use a potentially biased GLM based model for a mixed exposure. Instead, we hope to provide users with a non-parametric statistical machine where users simply specify the exposures, covariates and outcome, CVtreeMLE then determines if a best fitting decision tree exists and delivers interpretable results.

17.
Stat Med ; 42(19): 3443-3466, 2023 08 30.
Artículo en Inglés | MEDLINE | ID: mdl-37308115

RESUMEN

Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the causal effect of interest (eg, at the individual-level or at the cluster-level). Second, the theoretical and practical performance of common methods for CRT analysis remain poorly understood. Here, we present a general framework to formally define an array of causal effects in terms of summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of CRT estimators, including the t-test, generalized estimating equations (GEE), augmented-GEE, and targeted maximum likelihood estimation (TMLE). Using finite sample simulations, we illustrate the practical performance of these estimators for different causal effects and when, as commonly occurs, there are limited numbers of clusters of different sizes. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world impact of varying cluster sizes and targeting effects at the cluster-level or at the individual-level. Specifically, the relative effect of the PTBi intervention was 0.81 at the cluster-level, corresponding to a 19% reduction in outcome incidence, and was 0.66 at the individual-level, corresponding to a 34% reduction in outcome risk. Given its flexibility to estimate a variety of user-specified effects and ability to adaptively adjust for covariates for precision gains while maintaining Type-I error control, we conclude TMLE is a promising tool for CRT analysis.


Asunto(s)
Nacimiento Prematuro , Recién Nacido , Femenino , Humanos , Simulación por Computador , Ensayos Clínicos Controlados Aleatorios como Asunto , Tamaño de la Muestra , Causalidad , Análisis por Conglomerados
18.
J Comput Graph Stat ; 32(2): 601-612, 2023.
Artículo en Inglés | MEDLINE | ID: mdl-37273839

RESUMEN

The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In numerical experiments, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimension reduction application to single-cell transcriptome sequencing data.

19.
Int J Epidemiol ; 52(4): 1276-1285, 2023 08 02.
Artículo en Inglés | MEDLINE | ID: mdl-36905602

RESUMEN

Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.


Asunto(s)
Algoritmos , Aprendizaje Automático , Humanos
20.
Stat Med ; 42(7): 1013-1044, 2023 03 30.
Artículo en Inglés | MEDLINE | ID: mdl-36897184

RESUMEN

In this work we introduce the personalized online super learner (POSL), an online personalizable ensemble machine learning algorithm for streaming data. POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized, that is, optimization with respect to subject ID, to many individuals, that is, optimization with respect to common baseline covariates. As an online algorithm, POSL learns in real time. As a super learner, POSL is grounded in statistical optimality theory and can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed/offline algorithms that are not updated during POSL's fitting procedure, pooled algorithms that learn from many individuals' time series, and individualized algorithms that learn from within a single time series. POSL's ensembling of the candidates can depend on the amount of data collected, the stationarity of the time series, and the mutual characteristics of a group of time series. Depending on the underlying data-generating process and the information available in the data, POSL is able to adapt to learning across samples, through time, or both. For a range of simulations that reflect realistic forecasting scenarios and in a medical application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for both short and long time series, and it's able to adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time series dynamically enter and exit.


Asunto(s)
Algoritmos , Aprendizaje Automático , Humanos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA
...