ABSTRACT
Causally interpretable meta-analysis combines information from a collection of randomized controlled trials to estimate treatment effects in a target population in which experimentation may not be possible but from which covariate information can be obtained. In such analyses, a key practical challenge is the presence of systematically missing data when some trials have collected data on one or more baseline covariates, but other trials have not, such that the covariate information is missing for all participants in the latter. In this article, we provide identification results for potential (counterfactual) outcome means and average treatment effects in the target population when covariate data are systematically missing from some of the trials in the meta-analysis. We propose three estimators for the average treatment effect in the target population, examine their asymptotic properties, and show that they have good finite-sample performance in simulation studies. We use the estimators to analyze data from two large lung cancer screening trials and target population data from the National Health and Nutrition Examination Survey (NHANES). To accommodate the complex survey design of the NHANES, we modify the methods to incorporate survey sampling weights and allow for clustering.
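The abstract does not spell out the estimators, but the general idea behind transporting a trial result to a target population can be sketched with inverse-odds-of-participation weighting, one standard transportability approach. The sketch below is an illustration only, not the authors' method; the data-generating model and all names are invented, and survey sampling weights (which the paper incorporates for NHANES) would simply multiply the inverse-odds weights `w`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trial, n_target = 4000, 4000

# Covariate (an effect modifier) distributed differently in trial vs. target
x_trial = rng.normal(0.0, 1.0, n_trial)
x_target = rng.normal(1.0, 1.0, n_target)

# Randomized treatment in the trial; outcome Y = 1 + X + (1 + X)*A + noise,
# so the true target-population ATE is E[1 + X | target] = 2
a = rng.binomial(1, 0.5, n_trial)
y = 1 + x_trial + (1 + x_trial) * a + rng.normal(0, 1, n_trial)

# Fit P(S = 1 | X), the trial-participation model, by Newton-Raphson logistic regression
x_all = np.concatenate([x_trial, x_target])
s = np.concatenate([np.ones(n_trial), np.zeros(n_target)])
X = np.column_stack([np.ones_like(x_all), x_all])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (s - p))

# Weight each trial participant by the inverse odds of participation,
# P(S = 0 | X) / P(S = 1 | X), to re-balance the trial toward the target
p_trial = 1 / (1 + np.exp(-(beta[0] + beta[1] * x_trial)))
w = (1 - p_trial) / p_trial

# Weighted difference in arm means estimates the target-population ATE
ate = (np.sum(w * a * y) / np.sum(w * a)
       - np.sum(w * (1 - a) * y) / np.sum(w * (1 - a)))
print(ate)  # should land near the true target ATE of 2
```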
Subject(s)
Early Detection of Cancer, Lung Neoplasms, Humans, Nutrition Surveys, Lung Neoplasms/epidemiology, Computer Simulation, Research Design
ABSTRACT
Evidence synthesis involves drawing conclusions from trial samples that may differ from the target population of interest, and there is often heterogeneity among trials in sample characteristics, treatment implementation, study design, and assessment of covariates. Stitching together this patchwork of evidence requires subject-matter knowledge, a clearly defined target population, and guidance on how to weigh evidence from different trials. Transportability analysis has provided formal identifiability conditions required to make unbiased causal inference in the target population. In this manuscript, we review these conditions along with an additional assumption required to address systematically missing data. The identifiability conditions highlight the importance of accounting for differences in treatment effect modifiers between the populations underlying the trials and the target population. We perform simulations to evaluate the bias of conventional random-effects models and multiply imputed estimates using the pooled trial sample, and we describe causal estimators that explicitly address trial-to-target differences in key covariates in the context of systematically missing data. Results indicate that the causal transportability estimators are unbiased when treatment effect modifiers are accounted for in the analyses. Results also highlight the importance of carefully evaluating the identifiability conditions for each trial to reduce bias due to differences in participant characteristics between the trials and the target population. Bias can be limited by adjusting for covariates that are strongly correlated with missing treatment effect modifiers, by including data from trials that do not differ from the target on treatment effect modifiers, and by removing trials that differ from the target and did not assess a modifier.
Subject(s)
Health Services Needs and Demand, Research Design, Humans, Bias, Causality, Knowledge
ABSTRACT
Ultrahigh- and high-dimensional data are common in regression analyses across fields such as omics, finance, and biological engineering. Beyond the problem of dimension, the data may also be contaminated, and there are two main types of contamination: outliers and model misspecification. We develop a unique method that handles the ultrahigh- or high-dimensional setting together with both types of contamination. In this article, we propose a framework for feature screening and selection based on minimum Lq-likelihood estimation (MLqE), which accounts for model misspecification and has also been shown to be robust to outliers. In numerical analyses, we explore the robustness of this framework under different outlier and model misspecification scenarios. To examine its performance, we conduct a real data analysis using skin cutaneous melanoma data. Compared with traditional screening and feature selection methods, the proposed method shows superior variable identification effectiveness and parameter estimation accuracy.
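The core of MLqE is replacing the log-likelihood with the Lq-transform ℓ_q(u) = (u^(1-q) - 1)/(1 - q), which recovers the MLE as q → 1. For a normal location parameter this yields a fixed-point iteration: a weighted mean with weights f(x_i; μ)^(1-q), which automatically downweights tail observations. The sketch below illustrates only this robustness mechanism on a toy location problem, not the paper's screening-and-selection framework; the function name and settings are invented.

```python
import numpy as np

def mlqe_location(x, q=0.9, sigma=1.0, n_iter=50):
    """Minimum Lq-likelihood estimate of a normal location parameter.

    The estimating equation sum_i f(x_i; mu)^(1-q) * (x_i - mu) = 0 is a
    weighted mean, solved here by fixed-point iteration.  Points far in the
    tails get density weights near zero, so gross outliers barely count.
    """
    mu = np.median(x)  # robust starting value
    for _ in range(n_iter):
        dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        w = dens ** (1 - q)
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
x[:10] = 20.0  # contaminate 5% of the sample with gross outliers

print(np.mean(x))        # the MLE (sample mean) is dragged toward the outliers
print(mlqe_location(x))  # the MLqE stays near the true location of 0
```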
Subject(s)
Melanoma, Skin Neoplasms, Humans, Regression Analysis, Probability, Cutaneous Malignant Melanoma
ABSTRACT
To increase power and minimize bias in statistical analyses, quantitative outcomes are often adjusted for precision and confounding variables using standard regression approaches. The outcome is modeled as a linear function of the precision variables and confounders; however, for many complex phenotypes, the assumptions of the linear regression models are not always met. As an alternative, we used neural networks for the modeling of complex phenotypes and covariate adjustments. We compared the prediction accuracy of the neural network models to that of classical approaches based on linear regression. Using data from the UK Biobank, COPDGene study, and Childhood Asthma Management Program (CAMP), we examined the features of neural networks in this context and compared them with traditional regression approaches for prediction of three outcomes: forced expiratory volume in one second (FEV1), age at smoking cessation, and log transformation of age at smoking cessation (due to age at smoking cessation being right-skewed). We used mean squared error to compare neural network and regression models, and found the models performed similarly unless the observed distribution of the phenotype was skewed, in which case the neural network had smaller mean squared error. Our results suggest neural network models have an advantage over standard regression approaches when the phenotypic distribution is skewed. However, when the distribution is not skewed, the approaches performed similarly. Our findings are relevant to studies that analyze phenotypes that are skewed by nature or where the phenotype of interest is skewed as a result of the ascertainment condition.
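The comparison described above can be sketched in miniature: fit ordinary least squares and a small one-hidden-layer network to a right-skewed (log-normal) outcome and compare held-out mean squared error. This is a hedged illustration of the comparison pipeline only, not the authors' models or data; the architecture, learning rate, and data-generating process are all invented, and which model wins on any given dataset depends on the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 5
X = rng.normal(size=(n, p))
# Right-skewed phenotype via a multiplicative (log-normal) noise model
y = np.exp(0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, n))

train, test = slice(0, 1500), slice(1500, n)

# Baseline: linear regression with intercept, fit by least squares
Xd = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xd[train], y[train], rcond=None)[0]
mse_ols = np.mean((y[test] - Xd[test] @ beta) ** 2)

# One-hidden-layer ReLU network trained by full-batch gradient descent
h, lr = 16, 0.005
W1 = rng.normal(0, 0.3, (p, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.3, h);      b2 = 0.0
for _ in range(4000):
    Z = np.maximum(X[train] @ W1 + b1, 0.0)   # hidden activations
    pred = Z @ W2 + b2
    g = 2 * (pred - y[train]) / len(pred)     # gradient of MSE w.r.t. pred
    gW2 = Z.T @ g; gb2 = g.sum()
    gZ = np.outer(g, W2) * (Z > 0)            # backprop through ReLU
    gW1 = X[train].T @ gZ; gb1 = gZ.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

Z = np.maximum(X[test] @ W1 + b1, 0.0)
mse_nn = np.mean((y[test] - (Z @ W2 + b2)) ** 2)
print(mse_ols, mse_nn)  # compare held-out MSE of the two models
```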
Subject(s)
Neural Networks (Computer), Smoking, Forced Expiratory Volume/genetics, Phenotype, Spirometry
ABSTRACT
In correlated data settings, analysts typically choose between fitting conditional and marginal models, whose parameters come with distinct interpretations, and as such the choice between the two should be made on scientific grounds. For settings where interest lies in marginal (or population-averaged) parameters, the question of how best to estimate those parameters is a statistical one, and analysts have at their disposal two distinct modeling frameworks: generalized estimating equations (GEE) and marginalized multilevel models (MMMs). The two have been contrasted theoretically and in large-sample settings, but asymptotic theory provides no guarantees in the small-sample settings that are commonplace. In a comprehensive series of simulation studies, we shed light on the relative performance of GEE and MMMs in small-sample settings to help guide analysis decisions in practice. We find that both GEE and MMMs exhibit similar small-sample bias when the correct correlation structure is adopted (i.e., when the random effects distribution is correctly specified or moderately misspecified), but MMMs can be sensitive to misspecification of the correlation structure. When there are a small number of clusters, MMMs only slightly underestimate standard errors (SEs) for within-cluster associations but can severely underestimate SEs for between-cluster associations. By contrast, while GEE severely underestimates SEs, the Mancl and DeRouen correction provides approximately valid inference.
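The Mancl and DeRouen correction mentioned above modifies the usual cluster-robust sandwich covariance by premultiplying each cluster's residual vector by (I - H_i)^(-1), where H_i is the cluster's leverage block, to offset the downward bias of the sandwich with few clusters. As a hedged sketch, the version below implements this for a linear model with an independence working correlation (the simplest GEE); the simulated data and all names are invented for illustration.

```python
import numpy as np

def cluster_sandwich(X, y, cluster, bias_corrected=True):
    """Cluster-robust covariance for least-squares (GEE-independence) coefficients.

    With bias_corrected=True, applies the Mancl-DeRouen small-sample
    correction: each cluster's residuals are premultiplied by (I - H_i)^{-1},
    inflating the empirical 'meat' matrix.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        Xi, ri = X[idx], y[idx] - X[idx] @ beta
        if bias_corrected:
            Hi = Xi @ XtX_inv @ Xi.T                       # cluster leverage block
            ri = np.linalg.solve(np.eye(len(ri)) - Hi, ri)  # (I - H_i)^{-1} r_i
        u = Xi.T @ ri
        meat += np.outer(u, u)
    return beta, XtX_inv @ meat @ XtX_inv

# Small-sample check: 10 clusters of 8 with cluster-level random intercepts
rng = np.random.default_rng(3)
cluster = np.repeat(np.arange(10), 8)
x = rng.normal(size=80)
y = 1 + 0.5 * x + rng.normal(size=10)[cluster] + rng.normal(size=80)
X = np.column_stack([np.ones(80), x])

beta, cov_md = cluster_sandwich(X, y, cluster, bias_corrected=True)
_, cov_lz = cluster_sandwich(X, y, cluster, bias_corrected=False)
print(np.sqrt(np.diag(cov_lz)))  # uncorrected sandwich SEs
print(np.sqrt(np.diag(cov_md)))  # Mancl-DeRouen corrected SEs (typically larger)
```

For real analyses, packaged implementations of GEE with bias-corrected covariances are preferable to hand-rolled code; the sketch is meant only to make the correction concrete.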
Subject(s)
Statistical Models, Bias, Cluster Analysis, Computer Simulation, Humans, Multilevel Analysis
ABSTRACT
In research on complex diseases, gene expression (GE) data have been extensively used for clustering samples. The clusters so generated can serve as the basis for disease subtype identification, risk stratification, and many other purposes. With the small sample sizes of genetic profiling studies and the noisy nature of GE data, clustering results are often unsatisfactory. In recent studies, a prominent trend is to conduct multidimensional profiling, which collects data on GEs and their regulators (copy number alterations, microRNAs, methylation, etc.) on the same subjects. Through the regulation relationships, regulators carry important information on the properties of GEs. We develop a novel assisted clustering method that effectively uses regulator information to improve clustering analysis of GE data. To account for the fact that not all GEs are informative, we propose a weighted strategy, where the weights are determined data-dependently and can discriminate informative GEs from noise. The proposed method is built on the NCut technique and is effectively realized using a simulated annealing algorithm. Simulations demonstrate that it clearly outperforms multiple direct competitors. In the analysis of TCGA cutaneous melanoma and lung adenocarcinoma data, the method yields biologically sensible findings that differ from those of the alternatives.
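The NCut step at the heart of the method can be made concrete via its standard spectral relaxation: build a sample-affinity matrix, form the normalized Laplacian, and split samples by the sign of the second-smallest eigenvector. The sketch below shows only this generic two-cluster NCut step on simulated data; the paper's weighted, regulator-assisted extension and the simulated annealing optimizer are not reproduced, and all names and settings are invented.

```python
import numpy as np

def ncut_two_way(X, sigma=1.0):
    """Two-cluster normalized cut via its spectral relaxation.

    Gaussian affinities between samples -> normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} -> split by the sign of the eigenvector
    for the second-smallest eigenvalue (the relaxed NCut solution).
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                   # affinity matrix
    d = W.sum(axis=1)
    L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))     # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                       # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)                  # sign split of Fiedler vector

# Two well-separated simulated "subtypes" of 30 samples x 10 expression features
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 10)), rng.normal(2, 0.3, (30, 10))])
labels = ncut_two_way(X, sigma=1.0)
# Each simulated group should land entirely in one cluster
```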