ABSTRACT
We commonly encounter the problem of identifying an optimally weight-adjusted version of the empirical distribution of observed data, adhering to predefined constraints on the weights. Such constraints often manifest as restrictions on the moments, tail behavior, shapes, number of modes, etc., of the resulting weight-adjusted empirical distribution. In this article, we substantially enhance the flexibility of such a methodology by introducing a nonparametrically imbued distributional constraint on the weights and developing a general framework leveraging the maximum entropy principle and tools from optimal transport. The key idea is to ensure that the maximum entropy weight-adjusted empirical distribution of the observed data is close to a pre-specified probability distribution in terms of the optimal transport metric, while allowing for subtle departures. The proposed scheme for the re-weighting of observations subject to constraints is reminiscent of the empirical likelihood and related ideas, but offers greater flexibility in applications where parametric distribution-guided constraints arise naturally. The versatility of the proposed framework is demonstrated in the context of three disparate applications where data re-weighting is warranted to satisfy side constraints on the optimization problem at the heart of the statistical task, namely, portfolio allocation, semi-parametric inference for complex surveys, and ensuring algorithmic fairness in machine learning algorithms.
ABSTRACT
Small area estimation (SAE) entails estimating characteristics of interest for domains, often geographical areas, in which there may be few or no samples available. SAE has a long history and a wide variety of methods have been suggested, from a bewildering range of philosophical standpoints. We describe design-based and model-based approaches and models that are specified at the area-level and at the unit-level, focusing on health applications and fully Bayesian spatial models. The use of auxiliary information is a key ingredient for successful inference when response data are sparse and we discuss a number of approaches that allow the inclusion of covariate data. SAE for HIV prevalence, using data collected from a Demographic Health Survey in Malawi in 2015-2016, is used to illustrate a number of techniques. The potential use of SAE techniques for outcomes related to COVID-19 is discussed.
ABSTRACT
This article analyzes the efficacy of the randomized response technique (RRT) in achieving honest self-reporting about sexual behavior, compared with traditional survey techniques. A complex survey was conducted of 1,246 university students in Spain, who were asked sensitive quantitative questions about their sexual behavior, either via the RRT (n = 754) or by direct questioning (DQ) (n = 492). The RRT estimates of the number of times that the students were unable to restrain their inappropriate sexual behavior were significantly higher than the DQ estimates, among both male and female students. The results obtained suggest that the RRT method elicits higher values of self-stigmatizing reports of sexual experiences by increasing privacy in the data collection process. The RRT is shown to be a useful method for investigating sexual behavior.
Subjects
Problem Behavior/psychology , Self Report , Sexual Behavior/psychology , Students/statistics & numerical data , Surveys and Questionnaires/statistics & numerical data , Adult , Female , Humans , Male , Sexual Dysfunctions, Psychological/psychology , Social Participation , Spain , Young Adult
ABSTRACT
OBJECTIVE: To describe the methodology of the 2014 Ontario Child Health Study (OCHS): a province-wide, cross-sectional, epidemiologic study of child health and mental disorder among 4- to 17-year-olds living in household dwellings. METHOD: Implemented by Statistics Canada, the 2014 OCHS was led by academic researchers at the Offord Centre for Child Studies (McMaster University). Eligible households included families with children aged 4 to 17 years, who were listed on the 2014 Canadian Child Tax Benefit File. The survey design included area and household stratification by income and 3-stage cluster sampling of areas and households to yield a probability sample of families. RESULTS: The 2014 OCHS included 6,537 responding households (50.8%) with 10,802 children aged 4 to 17 years. Lower income families living in low-income neighbourhoods were less likely to participate. In addition to measures of childhood mental disorder assessed by the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID) and OCHS Emotional Behavioural Scales (OCHS-EBS), the survey contains measures of neighbourhoods, schools, families and children, and includes administrative data held by the Ministries of Education and Health and Long-Term Care. CONCLUSIONS: The complex survey design and differential non-response of the 2014 OCHS required the use of sampling weights and adjustment for design effects. The study is available throughout Canada in the Statistics Canada Research Data Centres (RDCs). We urge external investigators to access the study through the RDCs or to contact us directly to collaborate on future secondary analysis studies based on the OCHS.
Subjects
Child Health , Health Surveys/methods , Mental Disorders/epidemiology , Adolescent , Child , Child, Preschool , Cross-Sectional Studies , Female , Humans , Male , Ontario/epidemiology
ABSTRACT
BACKGROUND AND AIMS: The prevalence of cannabis use based on self-reports is likely to be underestimated in population surveys, especially in contexts where its use is a criminal offence. Indirect survey methods ask sensitive questions ensuring that answers cannot be identified with an individual respondent, therefore potentially resulting in more reliable estimates. We aimed to measure whether the indirect survey method 'randomized response technique' (RRT) increased response rate and/or increased disclosure of cannabis use among young adults compared with a traditional survey. DESIGN: We conducted two parallel nation-wide surveys during the spring and the summer of 2021. The first survey was a traditional questionnaire-based one (focusing on substance use and gambling). The second survey applied an indirect survey method known as 'the cross-wise model' to questions related to cannabis use. The two surveys employed identical procedures (e.g. invitations, reminders and wording of the questions). SETTING AND PARTICIPANTS: The participants were young adults (aged 18-29 years) living in Sweden. The traditional survey had 1200 respondents (56.9% women) and the indirect survey had 2951 respondents (53.6% women). MEASUREMENTS: In both surveys, cannabis use was assessed according to three time-frames: life-time use; use during the past year; and use during the past 30 days. FINDINGS: The estimated prevalence of cannabis use was two- to threefold higher on all measures when estimated using the indirect survey method compared with the traditional survey: use during life-time (43.2 versus 27.3%); during the past year (19.2 versus 10.4%); and during the past 30 days (13.2 versus 3.7%). The discrepancy was larger among males and individuals with an education shorter than 10 years, who were unemployed, and who were born in non-European countries. CONCLUSIONS: Indirect survey methods may provide more accurate estimates than traditional surveys on prevalence of self-reported cannabis use.
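As a rough illustration of how a crosswise-model prevalence estimate can be computed, the sketch below uses the textbook crosswise estimator. It assumes each respondent reports only whether their answers to the sensitive question and to an unrelated question with known "yes" probability p are the same or different. The function name, the value of p, and the counts are hypothetical and are not taken from the Swedish surveys; the abstract does not state which estimator or variance formula the authors used.

```python
import math


def crosswise_estimate(n_same, n, p):
    """Textbook crosswise-model estimate of a sensitive prevalence.

    n_same: number of respondents reporting that both answers are the same
    n:      total number of respondents
    p:      known probability of "yes" on the non-sensitive question (p != 0.5)
    """
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p must differ from 0.5 for the estimator to be identified")
    lam = n_same / n                      # observed share of "same" answers
    # P(same) = pi * p + (1 - pi) * (1 - p); solve for pi.
    pi_hat = (lam + p - 1) / (2 * p - 1)
    # Simple-random-sampling standard error (ignores any complex design).
    se = math.sqrt(lam * (1 - lam) / n) / abs(2 * p - 1)
    return pi_hat, se


# Hypothetical numbers for illustration only.
pi_hat, se = crosswise_estimate(n_same=650, n=1000, p=0.25)
print(f"estimated prevalence: {pi_hat:.3f} (SE {se:.3f})")
```

The standard error here assumes simple random sampling; a survey with weights or clustering would need a design-based variance instead.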
Subjects
Cannabis , Substance-Related Disorders , Female , Humans , Male , Young Adult , Prevalence , Substance-Related Disorders/epidemiology , Surveys and Questionnaires , Sweden/epidemiology , Randomized Controlled Trials as Topic
ABSTRACT
We discuss a two-step approach to test for a mediated effect using data gathered via complex sampling. The approach incorporates design-based multiple linear regressions and a generalized Sobel's method to test for significance of a mediated effect. We illustrate the applications to a study of nicotine dependence, race/ethnicity and cigarette purchase price among daily smokers in the U.S. The study goal was to assess significance of cigarette purchase price as a mediator in the association between race/ethnicity (non-Hispanic Black/African American, non-Hispanic White) and nicotine dependence measured in terms of the average number of cigarettes smoked per day. The single-mediator model incorporated 18 covariates as control factors. The results indicated a significant mediated effect of cigarette purchase price on the association. However, the relative effect size of 5% indicated low practical significance of the cigarette purchase price as a mediator in the association between race/ethnicity and nicotine dependence. The approach can be modified to studies where data are gathered via other types of complex sampling.
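For readers unfamiliar with the second step, the sketch below shows the classic first-order Sobel test, not the generalized version the authors describe. It assumes the path coefficients a (exposure to mediator) and b (mediator to outcome, adjusted for exposure) and their standard errors have already been obtained from design-based regressions. The function name and all numeric inputs are hypothetical.

```python
import math


def sobel_test(a, se_a, b, se_b):
    """Classic first-order Sobel test for a mediated (indirect) effect a*b."""
    indirect = a * b
    # First-order delta-method standard error of the product a*b.
    se = math.sqrt(a**2 * se_b**2 + b**2 * se_a**2)
    z = indirect / se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return indirect, se, z, p_value


# Hypothetical coefficients, for illustration only.
indirect, se, z, p_value = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect effect = {indirect:.3f}, SE = {se:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

With complex sampling, the standard errors passed in should come from a design-based variance estimator rather than from ordinary least squares output.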
ABSTRACT
BACKGROUND: Many studies in psychological and educational research aim to estimate population average treatment effects (PATE) using data from large complex survey samples, and many of these studies use propensity score methods. Recent advances have investigated how to incorporate survey weights with propensity score methods. However, to this point, that work had not been well summarized, and it was not clear how much difference the different PATE estimation methods would make empirically. PURPOSE: The purpose of this study is to systematically summarize the appropriate use of survey weights in propensity score analysis of complex survey data and use a case study to empirically compare the PATE estimates using multiple analysis methods that include ordinary least squares regression, weighted least squares regression, and various propensity score applications. METHODS: We first summarize various propensity score methods that handle survey weights. We then demonstrate the performance of various analysis methods using a nationally representative data set, the Early Childhood Longitudinal Study-Kindergarten to estimate the effects of preschool on children's academic achievement. The correspondence of the results was evaluated using multiple criteria. RESULTS AND CONCLUSIONS: It is important for researchers to think carefully about their estimand of interest and use methods appropriate for that estimand. If interest is in drawing inferences to the survey target population, it is important to take the survey weights into account, particularly in the outcome analysis stage for estimating the PATE. The case study shows, however, not much difference among various analysis methods in one applied example.
Subjects
Health Surveys , Propensity Score , Treatment Outcome , Humans , Longitudinal Studies , Monte Carlo Method
ABSTRACT
Accurate estimates of the under-five mortality rate in a developing world context are a key barometer of the health of a nation. This paper describes a new model to analyze survey data on mortality in this context. We are interested in both spatial and temporal description, that is, we wish to estimate the under-five mortality rate across regions and years and to investigate the association between the under-five mortality rate and spatially varying covariate surfaces. We illustrate the methodology by producing yearly estimates for subnational areas in Kenya over the period 1980-2014 using data from the Demographic and Health Surveys, which use stratified cluster sampling. We use a binomial likelihood with fixed effects for the urban/rural strata and random effects for the clustering to account for the complex survey design. Smoothing is carried out using Bayesian hierarchical models with continuous spatial and temporally discrete components. A key component of the model is an offset to adjust for bias due to the effects of HIV epidemics. Substantively, there has been a sharp decline in Kenya in the under-five mortality rate in the period 1980-2014, but large variability in estimated subnational rates remains. A priority for future research is understanding this variability. In exploratory work, we examine whether a variety of spatial covariate surfaces can explain the variability in under-five mortality rate. Temperature, precipitation, a measure of malaria infection prevalence, and a measure of nearness to cities were candidates for inclusion in the covariate model, but the interplay between space, time, and covariates is complex.
Subjects
Bayes Theorem , Child Mortality/trends , Developing Countries , Infant Mortality/trends , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Kenya/epidemiology , Male , Space-Time Clustering
ABSTRACT
The most widespread method of computing confidence intervals (CIs) in complex surveys is to add and subtract the margin of error (MOE) from the point estimate, where the MOE is the estimated standard error multiplied by the suitable Gaussian quantile. This Wald-type interval is used by the American Community Survey (ACS), the largest US household sample survey. For inferences on small proportions with moderate sample sizes, this method often results in marked under-coverage and lower CI endpoint less than 0. We assess via simulation the coverage and width, in complex sample surveys, of seven alternatives to the Wald interval for a binomial proportion with sample size replaced by the 'effective sample size,' that is, the sample size divided by the design effect. Building on previous work by the present authors, our simulations address the impact of clustering, stratification, different stratum sampling fractions, and stratum-specific proportions. We show that all intervals undercover when there is clustering and design effects are computed from a simple design-based estimator of sampling variance. Coverage can be better calibrated for the alternatives to Wald by improving estimation of the effective sample size through superpopulation modeling. This approach is more effective in our simulations than previously proposed modifications of effective sample size. We recommend intervals of the Wilson or Bayes uniform prior form, with the Jeffreys prior interval not far behind.
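To make the contrast concrete, here is a minimal sketch of a Wald interval and a Wilson interval computed with the effective sample size (sample size divided by design effect), as the abstract describes. The sample size, design effect, and estimated proportion are hypothetical; the paper's simulation settings and its superpopulation-based estimate of the effective sample size are not reproduced here.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_interval(p_hat, n_eff, z=Z_95):
    """Point estimate plus/minus margin of error; can fall outside [0, 1]."""
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n_eff)
    return p_hat - moe, p_hat + moe


def wilson_interval(p_hat, n_eff, z=Z_95):
    """Wilson score interval with n replaced by the effective sample size."""
    denom = 1 + z**2 / n_eff
    center = (p_hat + z**2 / (2 * n_eff)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_eff + z**2 / (4 * n_eff**2)) / denom
    return center - half, center + half


# Hypothetical survey: 400 respondents, design effect 2, estimated proportion 1%.
n, deff, p_hat = 400, 2.0, 0.01
n_eff = n / deff
wald = wald_interval(p_hat, n_eff)
wilson = wilson_interval(p_hat, n_eff)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

In this small-proportion example the Wald lower endpoint is negative, while the Wilson interval stays inside (0, 1), which is the behavior the abstract points to.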
ABSTRACT
Cannabis is the most widely used illicit drug in developed countries, and has a significant impact on mental and physical health in the general population. Although the evaluation of levels of substance use is difficult, a method such as the randomized response technique (RRT), which includes both a personal component and an assurance of confidentiality, provides a combination which can achieve a considerable degree of accuracy. Various RRT surveys have been conducted to measure the prevalence of drug use, but to date no studies have been made of the effectiveness of this approach in surveys with respect to quantitative variables related to drug use. This paper describes a probabilistic, stratified sample of 1146 university students asking sensitive quantitative questions about cannabis use in Spanish universities, conducted using the RRT. On comparing the results of the direct question (DQ) survey and those of the randomized response (RR) survey, we find that the number of cannabis cigarettes consumed during the past year (DQ = 3, RR = 17 approximately), and the number of days when consumption took place (DQ = 1, RR = 7) are much higher with RRT. The advantages of RRT, reported previously and corroborated in our study, make it a useful method for investigating cannabis use. Copyright © 2016 John Wiley & Sons, Ltd.
Subjects
Marijuana Use/epidemiology , Students/statistics & numerical data , Surveys and Questionnaires , Adolescent , Female , Humans , Male , Spain/epidemiology , Young Adult
ABSTRACT
Small area estimation (SAE) is an important endeavor in many fields and is used for resource allocation by both public health and government organizations. Often, complex surveys are carried out within areas, in which case it is common for the data to consist only of the response of interest and an associated sampling weight, reflecting the design. While it is appealing to use spatial smoothing models, and many approaches have been suggested for this endeavor, it is rare for spatial models to incorporate the weighting scheme, leaving the analysis potentially subject to bias. To examine the properties of various approaches to estimation we carry out a simulation study, looking at bias due to both non-response and non-random sampling. We also carry out SAE of smoking prevalence in Washington State, at the zip code level, using data from the 2006 Behavioral Risk Factor Surveillance System. The computation times for the methods we compare are short, and all approaches are implemented in R using currently available packages.