ABSTRACT
Gene-environment (GE) interactions are essential to understanding human complex traits, and identifying these interactions is necessary for deciphering the biological basis of such traits. In this study, we review state-of-the-art methods for estimating the proportion of phenotypic variance explained by genome-wide GE interactions and introduce a novel statistical method, Linkage-Disequilibrium Eigenvalue Regression for Gene-Environment interactions (LDER-GE). LDER-GE improves the accuracy of estimating the phenotypic variance component explained by genome-wide GE interactions using large-scale biobank association summary statistics. LDER-GE leverages the complete Linkage Disequilibrium (LD) matrix, as opposed to only the diagonal squared LD matrix utilized by LDSC (Linkage Disequilibrium Score regression)-based methods. Our extensive simulation studies demonstrate that LDER-GE performs better than LDSC-based approaches, enhancing statistical efficiency by ~23%, which is equivalent to a sample size increase of around 51%. Additionally, LDER-GE effectively controls the type-I error rate and produces unbiased results. We conducted an analysis using UK Biobank data, comprising 307 259 unrelated European-ancestry subjects and 966 766 variants, across 217 environmental covariate-phenotype (E-Y) pairs. LDER-GE identified 34 significant E-Y pairs, whereas the LDSC-based method identified only 23 significant E-Y pairs, 22 of which overlapped with the LDER-GE findings. Furthermore, we employed LDER-GE to estimate the aggregated variance component attributable to multiple GE interactions, which increased the explained phenotypic variance compared with models considering main genetic effects only. Our results underscore the important impact of GE interactions on human complex traits.
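The reported correspondence between a ~23% efficiency gain and a ~51% larger sample can be sanity-checked with the usual variance-sample-size relationship, assuming the efficiency gain is quoted on the standard-error scale (an interpretation not stated explicitly in the abstract):

```latex
% Back-of-the-envelope check; assumption: "efficiency" is the ratio of standard errors
\[
  \frac{N_{\text{equivalent}}}{N}
    = \left(\frac{\mathrm{SE}_{\mathrm{LDSC}}}{\mathrm{SE}_{\mathrm{LDER\text{-}GE}}}\right)^{2}
    \approx 1.23^{2} \approx 1.51
\]
```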
Subjects
Gene-Environment Interaction, Linkage Disequilibrium, Phenotype, Humans, Multifactorial Inheritance, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism, Genetic Models
ABSTRACT
Stepped wedge cluster randomized trials (SW-CRTs) with binary outcomes are increasingly used in prevention and implementation studies. Marginal models represent a flexible tool for analyzing SW-CRTs with population-averaged interpretations, but the joint estimation of the mean and intraclass correlation coefficients (ICCs) can be computationally intensive due to large cluster-period sizes. Motivated by the need for marginal inference in SW-CRTs, we propose a simple and efficient estimating equations approach to analyze cluster-period means. We show that the quasi-score for the marginal mean defined from individual-level observations can be reformulated as the quasi-score for the same marginal mean defined from the cluster-period means. An additional mapping of the individual-level ICCs into correlations for the cluster-period means further provides a rigorous justification for the cluster-period approach. The proposed approach addresses a long-recognized computational burden associated with estimating equations defined based on individual-level observations, and enables fast point and interval estimation of the intervention effect and correlations. We further propose matrix-adjusted estimating equations to improve the finite-sample inference for ICCs. By providing a valid approach to estimate ICCs within the class of generalized linear models for correlated binary outcomes, this article operationalizes key recommendations from the CONSORT extension to SW-CRTs, including the reporting of ICCs.
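As a rough, self-contained illustration of the cluster-period idea (a sketch under simplifying assumptions, not the authors' estimating equations or software), the snippet below simulates a small stepped wedge trial with a binary outcome and equal cluster-period sizes, collapses the individual observations to cluster-period means, and fits a marginal logistic mean model to those means with an exchangeable working correlation via statsmodels GEE; all names and parameter values are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_clusters, n_periods, m = 12, 4, 50            # clusters, periods, subjects per cluster-period

rows = []
for c in range(n_clusters):
    crossover = 1 + c % (n_periods - 1)         # period at which cluster c switches to intervention
    u = rng.normal(0, 0.3)                      # cluster effect inducing the within-cluster correlation
    for t in range(n_periods):
        treated = int(t >= crossover)
        p = 1 / (1 + np.exp(-(-1.0 + 0.1 * t + 0.4 * treated + u)))
        y = rng.binomial(1, p, size=m)          # individual-level binary outcomes
        rows.append({"cluster": c, "period": t, "treated": treated, "mean_y": y.mean()})

cp = pd.DataFrame(rows)                         # one row per cluster-period, not per subject

# Marginal mean model fitted to the cluster-period means (equal sizes, so unweighted),
# with an exchangeable working correlation across periods within a cluster.
cov = sm.cov_struct.Exchangeable()
model = sm.GEE.from_formula(
    "mean_y ~ C(period) + treated",
    groups="cluster",
    data=cp,
    family=sm.families.Binomial(),
    cov_struct=cov,
)
result = model.fit()
print(result.summary())
print(cov.summary())                            # estimated working correlation of cluster-period means
```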
Subjects
Research Design, Cluster Analysis, Humans, Linear Models, Sample Size
ABSTRACT
Cluster randomization results in an increase in sample size compared to individual randomization, referred to as an efficiency loss. This efficiency loss is typically presented under an assumption of no contamination in the individually randomized trial. An alternative comparator is the sample size needed under individual randomization to detect the attenuated treatment effect due to contamination. A general framework is provided for determining the extent of contamination that can be tolerated in an individually randomized trial before it requires a larger sample size than a cluster randomized design. Results are presented for a variety of cluster trial designs including parallel arm, stepped-wedge and cluster crossover trials. Results reinforce what is expected: individually randomized trials can tolerate a surprisingly large amount of contamination before they become less efficient than cluster designs. We determine the point at which contamination causes an individually randomized design, powered to detect the attenuated effect, to require a larger sample size than cluster randomization powered for the nonattenuated effect. This critical rate is a simple function of the design effect for clustering and the design effect for multiple periods, as well as design effects for stratification or repeated measures under individual randomization. These findings are important for pragmatic comparisons between a novel treatment and usual care, as any bias due to contamination will only attenuate the true treatment effect. This is a bias that operates in a predictable direction. Yet cluster randomized designs with post-randomization recruitment and without blinding are at high risk of bias due to differential recruitment across treatment arms. This sort of bias operates in an unpredictable direction. Thus, given that cluster randomized trials are generally at greater risk of biases that operate in an unpredictable direction, the results presented here suggest that even in situations where there is a risk of contamination, individual randomization might still be the design of choice.
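The kind of tolerance calculation described above can be sketched with the textbook attenuation model, in which contamination dilutes the treatment effect by a factor 1 - c (a simplification, not the paper's full framework): for a parallel cluster design, the critical contamination rate satisfies 1/(1 - c)^2 = DE, i.e. c* = 1 - 1/sqrt(DE).

```python
import numpy as np

def cluster_design_effect(m, icc):
    """Standard design effect for a parallel cluster randomized trial."""
    return 1 + (m - 1) * icc

def critical_contamination(design_effect):
    """
    Contamination rate c* at which an individually randomized trial, powered for the
    attenuated effect (1 - c) * delta, needs the same sample size as a cluster trial
    powered for the full effect delta.
    Toy derivation: N_ind / (1 - c)^2 = N_ind * DE  =>  c* = 1 - 1 / sqrt(DE).
    """
    return 1 - 1 / np.sqrt(design_effect)

for m, icc in [(20, 0.01), (20, 0.05), (50, 0.05), (100, 0.02)]:
    de = cluster_design_effect(m, icc)
    print(f"m={m:>3}, ICC={icc:.2f}: design effect={de:.2f}, "
          f"tolerable contamination ~ {100 * critical_contamination(de):.0f}%")
```

With 20 subjects per cluster and an ICC of 0.05, for example, this toy calculation tolerates roughly 28% contamination before individual randomization loses its advantage.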
Subjects
Randomized Controlled Trials as Topic, Research Design, Cluster Analysis, Cross-Over Studies, Humans, Sample Size
ABSTRACT
BACKGROUND: When conducting a survival analysis, researchers might consider two broad classes of models: nonparametric models and parametric models. While nonparametric models are more flexible because they make few assumptions regarding the shape of the data distribution, parametric models are more efficient. Here we sought to make concrete the difference in efficiency between these two model types using effective sample size. METHODS: We compared cumulative risk of AIDS or death estimated using four survival models - nonparametric, generalized gamma, Weibull, and exponential - and data from 1164 HIV patients who were alive and AIDS-free in 1995. We added pseudo-observations to the sample until the spread of the 95% confidence limits for the nonparametric model became less than that for the parametric models. RESULTS: We found the 3-parameter generalized gamma to be a good fit to the nonparametric risk curve, but the 1-parameter exponential both underestimated and overestimated the risk at different times. Using two-year risk as an example, we had to add 354, 593, and 3960 observations for the nonparametric model to be as efficient as the generalized gamma, Weibull, and exponential models, respectively. CONCLUSIONS: These added observations represent the hidden observations underlying the efficiency gained through parametric model form assumptions. If the model is correctly specified, the efficiency gain may be justified, as appeared to be the case for the generalized gamma model. Otherwise, precision will be improved, but at the cost of specification bias, as was the case for the exponential model.
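The pseudo-observation bookkeeping can be illustrated in a deliberately simplified setting (complete follow-up, exponential data, delta-method variances, and a made-up event rate; not the authors' analysis): keep adding observations to the nonparametric estimator until its variance for the two-year risk matches the parametric variance.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, t_star, rate = 1164, 2.0, 0.15               # hypothetical event rate; 2-year risk of interest
times = rng.exponential(1 / rate, size=n)       # fully observed event times (no censoring)

# Nonparametric: with complete follow-up, the 2-year risk is a simple proportion.
p_np = np.mean(times <= t_star)
var_np = p_np * (1 - p_np) / n

# Parametric (exponential): rate MLE and delta-method variance of the 2-year risk.
lam = 1 / times.mean()                          # MLE for uncensored exponential data
risk_exp = 1 - np.exp(-lam * t_star)
var_lam = lam ** 2 / n                          # asymptotic variance of the rate MLE
var_exp = (t_star * np.exp(-lam * t_star)) ** 2 * var_lam

# "Hidden" observations: extra observations the nonparametric estimator would need
# before its variance shrinks to the parametric one.
n_equiv = p_np * (1 - p_np) / var_exp
print(f"nonparametric 2-year risk {p_np:.3f}  (variance {var_np:.2e})")
print(f"exponential   2-year risk {risk_exp:.3f}  (variance {var_exp:.2e})")
print(f"pseudo-observations needed: {n_equiv - n:.0f}")
```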
Subjects
Acquired Immunodeficiency Syndrome/diagnosis, Algorithms, HIV Infections/diagnosis, Biological Models, Acquired Immunodeficiency Syndrome/complications, Acquired Immunodeficiency Syndrome/mortality, Cohort Studies, Female, Follow-Up Studies, HIV/physiology, HIV Infections/complications, HIV Infections/virology, Humans, Prognosis, Risk Factors, Survival Analysis, Survival Rate, Time Factors
ABSTRACT
Assessing performance of diagnostic markers is a necessary step for their use in decision making regarding various conditions of interest in diagnostic medicine and other fields. Globally useful markers could, however, have ranges of values that are "diagnostically non-informative". This paper demonstrates that the presence of marker values from diagnostically non-informative ranges could lead to a loss in statistical efficiency during nonparametric evaluation and shows that grouping non-informative values provides a natural resolution to this problem. These points are theoretically proven and an extensive simulation study is conducted to illustrate the possible benefits of using grouped marker values in a number of practically reasonable scenarios. The results contradict the common conjecture regarding the detrimental effect of grouped marker values during performance assessments. Specifically, contrary to the common assumption that grouped marker values lead to bias, grouping non-informative values does not introduce bias and could substantially reduce sampling variability. The proven concept that grouped marker values could be statistically beneficial without detrimental consequences implies that in practice, tied values do not always require resolution whereas the use of continuous diagnostic results without addressing diagnostically non-informative ranges could be statistically detrimental. Based on these findings, more efficient methods for evaluating diagnostic markers could be developed.
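A toy simulation (not one of the paper's scenarios) illustrates the central claim: when both classes share the same distribution over a range of marker values, collapsing that range into a single tied value leaves the empirical AUC unbiased while shrinking its sampling variability.

```python
import numpy as np

rng = np.random.default_rng(7)

def empirical_auc(neg, pos):
    """Mann-Whitney AUC with ties counted as 1/2."""
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

n_reps, n_neg, n_pos = 2000, 100, 100
auc_raw, auc_grouped = [], []
for _ in range(n_reps):
    # Values in (0, 1) are diagnostically non-informative: both classes are
    # Uniform(0, 1) there; only diseased subjects can produce values above 1.
    neg = rng.uniform(0, 1, n_neg)
    informative = rng.random(n_pos) < 0.4
    pos = np.where(informative, rng.uniform(1, 2, n_pos), rng.uniform(0, 1, n_pos))

    auc_raw.append(empirical_auc(neg, pos))
    # Grouping: collapse every value in the non-informative range to one tied value.
    auc_grouped.append(empirical_auc(np.where(neg < 1, 0.5, neg),
                                     np.where(pos < 1, 0.5, pos)))

print("true AUC = 0.70")
print(f"continuous values: mean {np.mean(auc_raw):.3f}, SD {np.std(auc_raw):.4f}")
print(f"grouped values   : mean {np.mean(auc_grouped):.3f}, SD {np.std(auc_grouped):.4f}")
```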
ABSTRACT
OBJECTIVES: Observational work sampling is often used in occupational studies to assess categorical biomechanical exposures and occurrence of specific work tasks. The statistical performance of data obtained by work sampling is, however, not well understood, impeding informed measurement strategy design. The purpose of this study was to develop a procedure for assessing the statistical properties of work sampling strategies evaluating categorical exposure variables and to illustrate the usefulness of this procedure to examine bias and precision of exposure estimates from samples of different sizes. METHODS: From a parent data set of observations on 10 construction workers performing a single operation, the probabilities were determined for each worker of performing four component tasks and working in four mutually exclusive trunk posture categories (neutral, mild flexion, severe flexion, twisted). Using these probabilities, 5000 simulated data sets were created via probability-based resampling for each of six sampling strategies, ranging from 300 to 4500 observations. For each strategy, mean exposure and exposure variability metrics were calculated at both the operation level and task level and for each metric, bias and precision were assessed across the 5000 simulations. RESULTS: Estimates of exposure variability were substantially more uncertain at all sample sizes than estimates of mean exposures and task proportions. Estimates at small sample sizes were also biased. With only 600 samples, proportions of the different tasks and of working with a neutral trunk posture (the most common) were within 10% of the true target value in at least 80% of all the simulated data sets; rarer exposures required at least 1500 samples. For most task-level mean exposure variables and for all operation-level and task-level estimates of exposure variability, performance was low, even with 4500 samples. In general, the precision of mean exposure estimates did not depend on the exposure variability between workers. CONCLUSIONS: The suggested probability-based simulation approach proved to be versatile and generally suitable for assessing bias and precision of data collection strategies using work sampling to estimate categorical data. The approach can be used in both real and hypothetical scenarios, in ergonomics, as well as in other areas of occupational epidemiology and intervention research. The reported statistical properties associated with sample size are likely widely relevant to studies using work sampling to assess categorical variables.
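A stripped-down version of the probability-based resampling idea is sketched below with made-up posture probabilities and no between-worker variability: simulate repeated work-sampling campaigns of a given size from known category probabilities and summarize bias and the share of estimates within 10% of the truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true probabilities of four trunk posture categories
postures = ["neutral", "mild flexion", "severe flexion", "twisted"]
true_p = np.array([0.55, 0.25, 0.15, 0.05])

def simulate(n_obs, n_sim=5000):
    """Proportion estimates from n_sim simulated campaigns of n_obs observations each."""
    counts = rng.multinomial(n_obs, true_p, size=n_sim)
    return counts / n_obs

for n_obs in [300, 600, 1500, 4500]:
    est = simulate(n_obs)
    bias = est.mean(axis=0) - true_p
    # Share of simulations within +/-10% (relative) of the true value, per category
    within10 = np.mean(np.abs(est - true_p) <= 0.10 * true_p, axis=0)
    print(f"n = {n_obs}")
    for k, name in enumerate(postures):
        print(f"  {name:<15} bias {bias[k]:+.4f}   within 10%: {100 * within10[k]:.0f}%")
```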
Subjects
Biometry, Computer Simulation, Data Collection/methods, Occupational Exposure/statistics & numerical data, Probability, Biomechanical Phenomena, Construction Industry, Humans, Posture, Workplace
ABSTRACT
Cancer is a heterogeneous disease, and rapid progress in sequencing and -omics technologies has enabled researchers to characterize tumors comprehensively. This has stimulated intensive interest in studying how risk factors are associated with various heterogeneous tumor features. The Cancer Prevention Study-II (CPS-II) cohort is one of the largest prospective studies and is particularly valuable for elucidating associations between cancer and risk factors. In this paper, we investigate the association of smoking with novel colorectal tumor markers obtained from targeted sequencing. However, due to cost and logistic difficulties, only a limited number of tumors can be assayed, which limits our ability to study these associations. Meanwhile, there are extensive studies assessing the association of smoking with overall cancer risk and with established colorectal tumor markers. Importantly, such summary information is readily available from the literature. By linking this summary information to the parameters of interest through proper constraints, we develop a generalized integration approach for the polytomous logistic regression model with outcomes characterized by tumor features. The proposed approach gains efficiency by maximizing the joint likelihood of individual-level tumor data and external summary information under constraints that narrow the parameter search space. We apply the proposed method to the CPS-II data and identify associations of smoking with colorectal cancer risk that differ by the mutational status of the APC and RNF43 genes, neither of which is identified by the conventional analysis of the CPS-II individual-level data alone. These results help better understand the role of smoking in the etiology of colorectal cancer.
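The constrained-likelihood idea can be caricatured as follows (a toy sketch with invented parameters; it treats the external estimate as fixed, whereas the actual method also propagates its uncertainty): fit a polytomous logistic model for two tumor subtypes while forcing the overall exposure log-odds ratio implied by the subtype-specific parameters to match an externally reported value.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(3)

# Toy data: control (0) vs two tumor subtypes (1, 2), binary exposure x (e.g. smoking)
n = 2000
x = rng.binomial(1, 0.3, n)
alpha_true, beta_true = np.array([-1.5, -2.0]), np.array([0.6, 0.1])

def probs(alpha, beta, x):
    """Polytomous logistic probabilities with the control group as reference."""
    eta = alpha[None, :] + np.outer(x, beta)            # shape (n, 2)
    denom = 1 + np.exp(eta).sum(axis=1)
    return np.column_stack([1 / denom, np.exp(eta) / denom[:, None]])

p = probs(alpha_true, beta_true, x)
y = np.array([rng.choice(3, p=pi) for pi in p])

def negloglik(theta):
    alpha, beta = theta[:2], theta[2:]
    pr = probs(alpha, beta, x)
    return -np.sum(np.log(pr[np.arange(n), y]))

# External summary information: overall (any subtype vs control) log-OR of x
gamma_ext = 0.45                                        # invented "literature" value

def implied_overall_logor(theta):
    alpha, beta = theta[:2], theta[2:]
    # odds of "any tumor" given x is sum_k exp(alpha_k + beta_k * x)
    return logsumexp(alpha + beta) - logsumexp(alpha)

theta0 = np.zeros(4)
unconstrained = minimize(negloglik, theta0, method="BFGS")
constrained = minimize(
    negloglik, theta0, method="SLSQP",
    constraints=[{"type": "eq",
                  "fun": lambda th: implied_overall_logor(th) - gamma_ext}],
)
print("unconstrained subtype-specific log-ORs:", unconstrained.x[2:])
print("constrained   subtype-specific log-ORs:", constrained.x[2:])
```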
ABSTRACT
Vibratory function of the vocal folds is largely determined by the rheological properties, or viscoelastic shear properties, of the vocal fold lamina propria. To date, sample size estimation and statistical experimental design for vocal fold rheological studies have not been investigated. The current work provides closed-form sample size formulas for the two major study designs (i.e., paired and two-group designs) in vocal fold research. Our results demonstrate that the paired design can greatly increase statistical power compared with the two-group design. By comparing the variance of the estimated treatment effect, this study also confirms that ignoring within-subject and within-vocal-fold correlations during rheological data analysis is likely to inflate the type I error rate. Finally, viscoelastic shear properties of intact and scarred rabbit vocal fold lamina propria were measured and used to illustrate the theoretical findings in a realistic scenario and to project sample size requirements for future studies.
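The paired-versus-two-group contrast can be illustrated with the standard normal-approximation sample size formulas for a continuous outcome (generic stand-ins, not the paper's vocal-fold-specific derivations): pairing replaces the unpaired variance 2*sigma^2 with the variance of within-pair differences 2*sigma^2*(1 - rho).

```python
from scipy.stats import norm

def n_two_group(delta, sigma, alpha=0.05, power=0.8):
    """Subjects per group for an unpaired two-group comparison of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sigma / delta) ** 2

def n_paired(delta, sigma, rho, alpha=0.05, power=0.8):
    """Number of pairs when both conditions are measured on the same unit."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    sigma_d_sq = 2 * sigma ** 2 * (1 - rho)    # variance of within-pair differences
    return (z ** 2 * sigma_d_sq) / delta ** 2

# Hypothetical effect size and variability (not the rabbit vocal fold data)
delta, sigma = 0.5, 1.0
print(f"two-group design: {n_two_group(delta, sigma):.0f} per group")
for rho in (0.3, 0.6, 0.9):                    # within-subject / within-fold correlation
    print(f"paired design, rho={rho}: {n_paired(delta, sigma, rho):.0f} pairs")
```

With a within-pair correlation of 0.6, for instance, the paired design needs roughly 25 pairs where the two-group design needs about 63 units per group, which is the power gain the abstract describes.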
Subjects
Rheology/methods, Vocal Cords/physiology, Animals, Biomechanical Phenomena, Female, Male, Rabbits, Vibration
ABSTRACT
OBJECTIVES: Statistical techniques currently used in musculoskeletal research often inefficiently account for paired-limb measurements or the relationship between measurements taken from multiple regions within limbs. This study compared three commonly used analysis methods with a mixed-models approach that appropriately accounted for the association between limbs, regions, and trials and that utilised all information available from repeated trials. METHOD: Four analyses were applied to an existing data set containing plantar pressure data, which was collected for seven masked regions on right and left feet, over three trials, across three participant groups. Methods 1-3 averaged data over trials and analysed right foot data (Method 1), data from a randomly selected foot (Method 2), and averaged right and left foot data (Method 3). Method 4 used all available data in a mixed-effects regression that accounted for repeated measures taken for each foot, foot region, and trial. Confidence interval widths for the mean differences between groups for each foot region were used as the criterion for comparing statistical efficiency. RESULTS: Mean differences in pressure between groups were similar across methods for each foot region, while the confidence interval widths were consistently smaller for Method 4. Method 4 also revealed significant between-group differences that were not detected by Methods 1-3. CONCLUSION: A mixed-effects linear model approach improves efficiency and power by producing more precise estimates compared with alternative approaches that discard information in the process of accounting for paired-limb measurements. This approach is recommended for generating more clinically sound and statistically efficient research outputs.
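A minimal sketch of a Method 4 style analysis (using simulated data and statsmodels rather than the original data or software) fits one linear mixed model with a random intercept per participant and a foot-level variance component, so both feet, all regions, and all trials contribute to the group contrasts; every variable name and effect size below is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Hypothetical long-format data: participant x foot x region x trial
participants = [f"p{i:02d}" for i in range(30)]
group_of = {p: g for p, g in zip(participants,
                                 np.repeat(["control", "gout", "hyperuricemia"], 10))}
rows = []
for p in participants:
    subj = rng.normal(0, 20)                        # participant-level effect
    for foot in ("left", "right"):
        foot_eff = rng.normal(0, 10)                # foot-within-participant effect
        for region in ("heel", "midfoot", "hallux"):
            base = {"heel": 300, "midfoot": 150, "hallux": 250}[region]
            shift = 25 if group_of[p] != "control" and region == "hallux" else 0
            for trial in range(3):
                rows.append({"participant": p, "group": group_of[p], "foot": foot,
                             "region": region, "trial": trial,
                             "pressure": base + shift + subj + foot_eff + rng.normal(0, 15)})
df = pd.DataFrame(rows)

# Method-4 style analysis: all feet, regions, and trials enter a single mixed model
model = smf.mixedlm(
    "pressure ~ C(group) * C(region)",              # fixed effects of interest
    data=df,
    groups="participant",                           # random intercept per participant
    re_formula="1",
    vc_formula={"foot": "0 + C(foot)"},             # variance component for foot within participant
)
fit = model.fit(reml=True)
print(fit.summary())
```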
Subjects
Biomedical Research/statistics & numerical data, Data Collection/statistics & numerical data, Statistical Data Interpretation, Foot/physiopathology, Functional Laterality/physiology, Musculoskeletal Physiological Phenomena, Weight-Bearing/physiology, Adult, Female, Gait/physiology, Gout/physiopathology, Heel/physiopathology, Humans, Hyperuricemia/physiopathology, Male, Metatarsal Bones/physiopathology, Multivariate Analysis, Reference Values, Regression Analysis, Reproducibility of Results, Research Design
ABSTRACT
OBJECTIVES: This study reviews simulation studies of discrete choice experiments (DCEs) to determine (i) how survey design features affect statistical efficiency and (ii) to appraise their reporting quality. OUTCOMES: Statistical efficiency was measured using relative design (D-) efficiency, D-optimality, or D-error. METHODS: For this systematic survey, we searched Journal Storage (JSTOR), Science Direct, PubMed, and OVID, which included a search within EMBASE. Searches were conducted up to 2016 for simulation studies investigating the impact of DCE design features on statistical efficiency. Studies were screened and data were extracted independently and in duplicate. Results for each included study were summarized by design characteristic. Previously developed criteria for the reporting quality of simulation studies were also adapted and applied to each included study. RESULTS: Of 371 potentially relevant studies, 9 were eligible, with objectives varying across studies. Statistical efficiency improved when increasing the number of choice tasks or alternatives; decreasing the number of attributes or attribute levels; using an unrestricted continuous "manipulator" attribute; using model-based approaches with covariates incorporating response behaviour; using sampling approaches that incorporate previous knowledge of response behaviour; incorporating heterogeneity in a model-based design; correctly specifying Bayesian priors; minimizing parameter prior variances; and using an appropriate method to create the DCE design for the research question. The simulation studies performed well in terms of reporting quality, although improvement is needed in clearly specifying study objectives, the number of failures, random number generators, starting seeds, and the software used. CONCLUSION: These results identify the best approaches to structuring a DCE. An investigator can manipulate design characteristics to help reduce response burden and increase statistical efficiency. Because studies varied in their objectives, conclusions were drawn on several design characteristics; however, the validity of each conclusion is limited. Further research should explore these conclusions in various design settings and scenarios, and additional reviews covering other statistical efficiency outcomes and databases could strengthen the conclusions identified here.
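One of the efficiency measures listed above can be made concrete with a short sketch (not taken from any reviewed study): the local D-error of a multinomial logit design under assumed prior coefficients, computed from the model's information matrix. The candidate profiles, designs, and priors below are invented.

```python
import numpy as np
from itertools import product

def mnl_d_error(design, beta):
    """
    D-error of a discrete choice design under the multinomial logit model.
    design: array (n_choice_sets, n_alternatives, n_attributes) of coded attribute levels.
    beta:   assumed prior coefficients (length n_attributes).
    """
    k = len(beta)
    info = np.zeros((k, k))
    for x in design:                              # x: (n_alternatives, n_attributes)
        u = x @ beta
        p = np.exp(u - u.max())
        p /= p.sum()
        info += x.T @ (np.diag(p) - np.outer(p, p)) @ x
    return np.linalg.det(info) ** (-1 / k)        # lower D-error = more efficient design

# Toy example: two effects-coded binary attributes, two alternatives per choice task
levels = [-1, 1]
profiles = np.array(list(product(levels, levels)), dtype=float)   # 4 candidate profiles
beta_prior = np.array([0.5, -0.3])

# Compare two 4-task designs built from the candidate profiles
design_a = np.array([[profiles[0], profiles[3]],
                     [profiles[1], profiles[2]],
                     [profiles[0], profiles[2]],
                     [profiles[1], profiles[3]]])
design_b = np.array([[profiles[0], profiles[1]],
                     [profiles[0], profiles[2]],
                     [profiles[0], profiles[3]],
                     [profiles[0], profiles[1]]])
print("D-error, design A:", mnl_d_error(design_a, beta_prior))
print("D-error, design B:", mnl_d_error(design_b, beta_prior))
```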
ABSTRACT
Several studies have shown that our visual system may construct a "summary statistical representation" over groups of visual objects. Although there is a general understanding that human observers can accurately represent sets of a variety of features, many questions about how summary statistics, such as an average, are computed remain unanswered. This study investigated the sampling properties of the visual information used by human observers to extract two types of summary statistics of item sets: the average and the variance. We present three ideal-observer models for extracting these summary statistics: a global sampling model without sampling noise, a global sampling model with sampling noise, and a limited sampling model. We compared the performance of an ideal observer under each model with that of human observers using statistical efficiency analysis. The results suggest that summary statistics of items in a set may be computed without representing individual items, which makes it possible to discard the limited sampling account. Moreover, the extraction of summary statistics may not necessarily require the representation of individual objects with focused attention when sets contain more than four items.
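The notion of statistical efficiency used in this literature can be sketched with a simple simulation (made-up parameters, not the study's stimuli or its exact efficiency definition): compare the error variance of the set-average estimate from an observer that pools every item against observers that either add internal noise or sample only a few items, and take the variance ratio relative to the ideal observer as the efficiency.

```python
import numpy as np

rng = np.random.default_rng(11)
set_size, n_trials, item_sd = 12, 20000, 15.0    # 12 items per set; item spread = external noise

def estimation_errors(n_sampled, internal_noise_sd=0.0):
    """Error in estimating each set's generative mean from a (noisy) subset of items."""
    errors = np.empty(n_trials)
    for t in range(n_trials):
        mu = rng.uniform(-45, 45)                        # true mean of this set (e.g. orientation)
        items = rng.normal(mu, item_sd, set_size)        # external variability across items
        idx = rng.choice(set_size, size=n_sampled, replace=False)
        readout = items[idx] + rng.normal(0.0, internal_noise_sd, n_sampled)
        errors[t] = readout.mean() - mu
    return errors

ideal = estimation_errors(set_size)                      # global sampling, no internal noise
noisy_global = estimation_errors(set_size, internal_noise_sd=10.0)
limited = estimation_errors(2)                           # limited sampling: 2 of 12 items

for name, err in [("ideal global", ideal), ("noisy global", noisy_global), ("limited (k=2)", limited)]:
    efficiency = np.var(ideal) / np.var(err)             # variance ratio relative to the ideal observer
    print(f"{name:<14} error SD {err.std():6.2f}   efficiency {efficiency:.2f}")
```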