ABSTRACT
The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges for identifying and countering hostile influence operations (IOs). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The framework integrates natural language processing, machine learning, graph analytics, and a network causal inference approach to quantify the impact of individual actors in spreading IO narratives. We demonstrate its capability on real-world hostile IO campaigns, using Twitter datasets collected during the 2017 French presidential elections and known IO accounts disclosed by Twitter, spanning a broad range of IO campaigns (May 2007 to February 2020), over 50,000 accounts, 17 countries, and different account types, including both trolls and bots. Our system detects IO accounts with 96% precision, 79% recall, and 96% area under the precision-recall (P-R) curve; maps out salient network communities; and discovers high-impact accounts that escape the lens of traditional impact statistics based on activity counts and network centrality. Results are corroborated with independent sources of known IO accounts from US Congressional reports, investigative journalism, and IO datasets provided by Twitter.
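The abstract reports only aggregate performance numbers; purely as a point of reference, the sketch below shows how a classification-and-evaluation step of this kind could look (TF-IDF text features, a logistic-regression classifier, and the precision, recall, and area-under-the-P-R-curve metrics quoted above). It is a hypothetical toy, not the authors' framework, which additionally uses graph analytics and network causal inference.

```python
# Hypothetical toy sketch of an account-classification and evaluation step.
# Each account is represented by the text it posts; a classifier is trained
# on labeled IO / non-IO accounts and scored with precision-recall metrics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, average_precision_score

# Toy data: one concatenated text blob per account and a 0/1 IO label.
account_texts = ["vote for candidate A, share widely", "great weather today",
                 "share this leaked document now", "photos of my cat"]
io_labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    account_texts, io_labels, test_size=0.5, random_state=0, stratify=io_labels)

vectorizer = TfidfVectorizer().fit(X_train)
clf = LogisticRegression().fit(vectorizer.transform(X_train), y_train)

probs = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
preds = (probs >= 0.5).astype(int)

print("precision:", precision_score(y_test, preds, zero_division=0))
print("recall:", recall_score(y_test, preds, zero_division=0))
print("area under the P-R curve:", average_precision_score(y_test, probs))
```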
Subjects
Communications Media/trends, Information Dissemination/methods, Politics, Social Media/trends, Communication, Humans, Social Network Analysis, Social Networking
ABSTRACT
Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, assumptions that are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism, apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or are designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey's representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
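For concreteness, the representation referred to above can be written as follows for a scalar outcome Y with missingness indicator R (R = 1 observed, R = 0 missing); the notation is ours, not the paper's, and the identity follows directly from Bayes' rule.

```latex
% Tukey's representation: the missing-data distribution equals the
% observed-data distribution tilted by the odds of being missing.
f(y \mid R = 0)
  \;=\;
  f(y \mid R = 1)\,
  \frac{\Pr(R = 0 \mid y)}{\Pr(R = 1 \mid y)}\,
  \frac{\Pr(R = 1)}{\Pr(R = 0)}.
```

Roughly speaking, the factor f(y | R = 1) is estimable from the observed data, so the substantive, unverifiable assumptions are concentrated in the missingness-odds factor; in an exponential-family development, one convenient choice is an exponential tilt such as exp(γy), which keeps f(y | R = 0) within the same family (a sketch of the idea, not necessarily the paper's exact specification).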
ABSTRACT
A catalytic prior distribution is designed to stabilize a high-dimensional "working model" by shrinking it toward a "simplified model." The shrinkage is achieved by supplementing the observed data with a small amount of "synthetic data" generated from a predictive distribution under the simpler model. We apply this framework to generalized linear models, where we propose various strategies for the specification of a tuning parameter governing the degree of shrinkage and study the resulting theoretical properties. In simulations, posterior estimation using such a catalytic prior outperforms maximum likelihood estimation from the working model and is generally comparable with, or superior to, existing competitive methods in terms of frequentist prediction accuracy of point estimation and coverage accuracy of interval estimation. The catalytic priors have simple interpretations and are easy to formulate.
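As a concrete, hedged rendering of the construction described above: if f(y | x, θ) denotes the working model's likelihood and (X*_i, Y*_i), i = 1, …, M, are synthetic observations generated from the simpler model's predictive distribution, the catalytic prior takes the following form, with τ > 0 the tuning parameter governing the degree of shrinkage (interpretable as the prior's total weight in units of observations).

```latex
% Catalytic prior: the working-model likelihood evaluated on M synthetic
% observations, raised to the power tau/M (total prior weight tau), and the
% resulting posterior given the n real observations.
\pi_{\mathrm{cat}}(\theta \mid \tau)
  \;\propto\;
  \left\{ \prod_{i=1}^{M} f\!\left(Y_i^{*} \mid X_i^{*}, \theta\right) \right\}^{\tau / M},
\qquad
\pi(\theta \mid \text{data}, \tau)
  \;\propto\;
  \pi_{\mathrm{cat}}(\theta \mid \tau)
  \prod_{i=1}^{n} f\!\left(Y_i \mid X_i, \theta\right).
```

Under this form, posterior-mode estimation reduces to a weighted maximum-likelihood fit of the working model to the real data augmented with the synthetic data, each synthetic observation carrying weight τ/M.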
Subjects
Computer Simulation/statistics & numerical data, Linear Models, Bayes Theorem, Computer Simulation/trends, Data Analysis, Data Collection, Sample Size, Statistics as Topic
ABSTRACT
We describe a new method to combine propensity-score matching with regression adjustment in treatment-control studies when outcomes are binary by multiply imputing potential outcomes under control for the matched treated subjects. This enables the estimation of clinically meaningful measures of effect such as the risk difference. We used Monte Carlo simulation to explore the effect of the number of imputed potential outcomes under control for the matched treated subjects on inferences about the risk difference. We found that imputing potential outcomes under control (either single imputation or multiple imputation) resulted in a substantial reduction in bias compared with what was achieved using conventional nearest-neighbor matching alone. Increasing the number of imputed potential outcomes under control resulted in more efficient estimation of the risk difference. The greatest relative increase in efficiency was achieved by imputing five potential outcomes; once 20 outcomes under control were imputed for each matched treated subject, further improvements in efficiency were negligible. We also examined the effect of the number of these imputed potential outcomes on: (i) estimated standard errors; (ii) mean squared error; and (iii) coverage of estimated confidence intervals. We illustrate the application of the method by estimating the effect on the risk of death within 1 year of prescribing beta-blockers to patients discharged from hospital with a diagnosis of heart failure.
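The method is described above only in words; the sketch below is a simplified, hypothetical rendering of the imputation-and-combination step for a sample that has already been matched: a logistic model for the outcome fit to the matched controls, repeated Bernoulli draws of each treated subject's potential outcome under control, and Rubin's combining rules for the risk difference. The data, variable names, and the way parameter uncertainty is propagated are all illustrative simplifications.

```python
# Hypothetical sketch: multiply impute Y(0) for matched treated subjects from
# a logistic model fit to matched controls, estimate the risk difference in
# each completed data set, and combine the estimates with Rubin's rules.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Toy matched sample: covariates and observed binary outcomes by group.
X_treated = rng.normal(size=(n, 2)); y_treated = rng.binomial(1, 0.4, n)
X_control = rng.normal(size=(n, 2)); y_control = rng.binomial(1, 0.3, n)

# Outcome model for Y(0), fit to the matched controls only.
fit0 = sm.Logit(y_control, sm.add_constant(X_control)).fit(disp=0)
beta_hat, beta_cov = fit0.params, fit0.cov_params()

M = 20                                   # number of imputations
rd_draws, var_draws = [], []
for _ in range(M):
    # Draw model parameters (approximate posterior), then impute each matched
    # treated subject's missing potential outcome under control.
    beta = rng.multivariate_normal(beta_hat, beta_cov)
    p0 = 1.0 / (1.0 + np.exp(-sm.add_constant(X_treated) @ beta))
    y0_imputed = rng.binomial(1, p0)

    # Risk difference among the treated and its estimated sampling variance.
    rd_draws.append(y_treated.mean() - y0_imputed.mean())
    var_draws.append((y_treated.var(ddof=1) + y0_imputed.var(ddof=1)) / n)

# Rubin's rules: combine point estimates and variances across imputations.
rd_draws, var_draws = np.array(rd_draws), np.array(var_draws)
rd_mi = rd_draws.mean()
total_var = var_draws.mean() + (1 + 1 / M) * rd_draws.var(ddof=1)
print(f"risk difference: {rd_mi:.3f} +/- {1.96 * np.sqrt(total_var):.3f}")
```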
Subjects
Research Design, Bias, Computer Simulation, Humans, Monte Carlo Method, Propensity Score
ABSTRACT
Although complete randomization ensures covariate balance on average, the chance of observing significant differences between treatment and control covariate distributions increases with many covariates. Rerandomization discards randomizations that do not satisfy a predetermined covariate balance criterion, generally resulting in better covariate balance and more precise estimates of causal effects. Previous work has derived finite-sample theory for rerandomization under the assumptions of equal treatment group sizes, Gaussian covariate and outcome distributions, or additive causal effects, but not for the general sampling distribution of the difference-in-means estimator for the average causal effect. We develop asymptotic theory for rerandomization without these assumptions, which reveals a non-Gaussian asymptotic distribution for this estimator, specifically a linear combination of a Gaussian random variable and truncated Gaussian random variables. This distribution follows because rerandomization affects only the projection of potential outcomes onto the covariate space but does not affect the corresponding orthogonal residuals. We demonstrate that, compared with complete randomization, rerandomization reduces the asymptotic quantile ranges of the difference-in-means estimator. Moreover, our work constructs accurate large-sample confidence intervals for the average causal effect.
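In the notation adopted here (which may differ from the paper's), the result has roughly the following form for the difference-in-means estimator τ̂ of the average causal effect τ: V_ττ is the asymptotic variance under complete randomization, R² is the squared multiple correlation between the difference in outcome means and the difference in covariate means, ε₀ is standard Gaussian, and L_{K,a} is the first coordinate of a K-dimensional standard Gaussian vector truncated to the Mahalanobis acceptance region, independent of ε₀.

```latex
% Asymptotic sampling distribution under rerandomization (schematic form):
% a Gaussian part from the residuals plus a truncated-Gaussian part from the
% covariate projection constrained by the acceptance criterion.
\sqrt{n}\,\bigl(\hat{\tau} - \tau\bigr)
  \;\overset{d}{\longrightarrow}\;
  \sqrt{V_{\tau\tau}}
  \Bigl( \sqrt{1 - R^{2}}\;\varepsilon_{0} \;+\; \sqrt{R^{2}}\;L_{K,a} \Bigr),
\qquad
L_{K,a} \;\sim\; \bigl( D_{1} \bigm| D^{\top} D \le a \bigr),
\quad D \sim \mathcal{N}(0, I_{K}).
```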
Subjects
Theoretical Models, Random Allocation
ABSTRACT
The wise use of statistical ideas in practice essentially requires some Bayesian thinking, in contrast to the classical rigid frequentist dogma. This dogma too often has seemed to influence the applications of statistics, even at agencies like the FDA. Greg Campbell was one of the most important advocates there for more nuanced modes of thought, especially Bayesian statistics. Because two brilliant statisticians, Ronald Fisher and Jerzy Neyman, are often credited with instilling the traditional frequentist approach in current practice, I argue that both men were actually seeking very Bayesian answers, and neither would have endorsed the rigid application of their ideas.
Subjects
Bayes Theorem, Statistics as Topic, United States Food and Drug Administration, Humans, United States
ABSTRACT
Estimation of causal effects in non-randomized studies comprises two distinct phases: design, without outcome data, and analysis of the outcome data according to a specified protocol. Recently, Gutman and Rubin (2013) proposed a new analysis-phase method for estimating treatment effects when the outcome is binary and there is only one covariate, which viewed causal effect estimation explicitly as a missing data problem. Here, we extend this method to situations with continuous outcomes and multiple covariates and compare it with other commonly used methods (such as matching, subclassification, weighting, and covariance adjustment). We show, using an extensive simulation, that in many of the experimental conditions examined, our new 'multiple imputation using two subclassification splines' method appears to be the most efficient of the methods considered and has coverage levels closest to nominal. In addition, it can estimate finite population average causal effects as well as non-linear causal estimands. This type of analysis also allows the identification of subgroups of units for which the effect appears to be especially beneficial or harmful.
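The precise two-spline construction is not spelled out in the abstract; purely to illustrate the impute-the-missing-potential-outcomes idea it builds on, the sketch below fits separate spline regressions of the outcome on an estimated propensity score within each treatment arm and imputes each unit's unobserved potential outcome from the other arm's fit. All names and modeling choices here are ours, the imputation shown is a single deterministic mean imputation rather than the paper's multiple imputation, and no subclassification is shown.

```python
# Hypothetical stand-in for the general idea behind the method above:
# spline regressions of a continuous outcome on an estimated propensity
# score, fit separately by treatment arm, used to impute missing potential
# outcomes (single mean imputation only, for brevity).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))                                 # covariates
w = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))             # treatment indicator
y = 1 + w + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)    # continuous outcome

# Propensity score estimated from the covariates only.
ps = LogisticRegression().fit(X, w).predict_proba(X)[:, 1].reshape(-1, 1)

# Spline regression of the outcome on the propensity score, per arm.
def spline_reg():
    return make_pipeline(SplineTransformer(degree=3, n_knots=5), LinearRegression())

m1 = spline_reg().fit(ps[w == 1], y[w == 1])
m0 = spline_reg().fit(ps[w == 0], y[w == 0])

# Impute each unit's missing potential outcome from the other arm's fit.
y1 = np.where(w == 1, y, m1.predict(ps))
y0 = np.where(w == 0, y, m0.predict(ps))
print("estimated average causal effect:", (y1 - y0).mean())
```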
Subjects
Statistical Models, Research Design, Therapeutics, Observational Studies as Topic, Randomized Controlled Trials as Topic, Treatment Outcome
ABSTRACT
By 'partially post hoc' subgroup analyses, we mean analyses that compare existing data from a randomized experiment, from which a subgroup specification is derived, to new, subgroup-only experimental data. We describe a motivating example in which partially post hoc subgroup analyses instigated statistical debate about a medical device's efficacy. We clarify the source of such analyses' invalidity and then propose a randomization-based approach for generating valid posterior predictive p-values for such partially post hoc subgroups. Lastly, we investigate the approach's operating characteristics in a simple illustrative setting through a series of simulations, showing that it can have desirable properties under both null and alternative hypotheses.
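The paper's randomization-based construction is not reproduced in the abstract. Purely to fix ideas, the toy computation below shows a generic posterior predictive p-value in a binary-outcome setting: a null (no-effect) model is fit to the subgroup data from the original experiment, replicates of the new subgroup-only experiment are simulated from the posterior predictive distribution, and the observed discrepancy is compared against them. All counts, the prior, and the discrepancy measure are hypothetical, and this sketch ignores the subgroup-selection issue that the paper's procedure is designed to handle.

```python
# Toy posterior predictive p-value: null model fit to the original
# experiment's subgroup data, predictive replicates of the new subgroup-only
# experiment, and a comparison of a simple discrepancy measure.
import numpy as np

rng = np.random.default_rng(0)

# Original experiment, subgroup only: successes and totals by arm (made up).
x1, n1, y1, m1 = 18, 40, 15, 42          # treated, control
# New subgroup-only experiment (made up).
x2, n2, y2, m2 = 25, 50, 14, 48

# Null model: a common success probability p in both arms, Beta(1, 1) prior,
# updated on the original experiment's subgroup data.
post_a = 1 + x1 + y1
post_b = 1 + (n1 - x1) + (m1 - y1)

obs = x2 / n2 - y2 / m2                  # observed effect in the new experiment

draws = 100_000
p = rng.beta(post_a, post_b, size=draws)
rep = rng.binomial(n2, p) / n2 - rng.binomial(m2, p) / m2

print("posterior predictive p-value:", np.mean(rep >= obs))
```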
Subjects
Patient Selection, Randomized Controlled Trials as Topic/methods, Research Design, Biometry, Computer Simulation, Equipment and Supplies, Gels/therapeutic use, Humans, Knee Osteoarthritis/drug therapy, United States, United States Food and Drug Administration
ABSTRACT
Health and medical data are increasingly being generated, collected, and stored in electronic form in healthcare facilities and administrative agencies. Such data hold a wealth of information vital to effective health policy development and evaluation, as well as to enhanced clinical care through evidence-based practice and safety and quality monitoring. These initiatives are aimed at improving individuals' health and well-being. Nevertheless, analyses of health data archives must be conducted in such a way that individuals' privacy is not compromised. One important aspect of protecting individuals' privacy is protecting the confidentiality of their data. This paper reviews a number of approaches to reducing disclosure risk when making data available for research and presents a taxonomy for such approaches. Some of these methods are widely used, whereas others are still in development. It is important to have a range of methods available because there is a corresponding range of data-use scenarios, and the method chosen should suit the scenario at hand. In practice, it is necessary to find a balance between allowing the use of health and medical data for research and protecting confidentiality. This balance is often presented as a trade-off between disclosure risk and data utility, because methods that reduce disclosure risk, in general, also reduce data utility.
Subjects
Biomedical Research/legislation & jurisprudence, Confidentiality/legislation & jurisprudence, Statistical Data Interpretation, Evidence-Based Medicine/legislation & jurisprudence, Health Policy/legislation & jurisprudence, Australia, Biomedical Research/methods, Biomedical Research/statistics & numerical data, Computer Security/legislation & jurisprudence, Computer Security/standards, Computer Security/statistics & numerical data, Confidentiality/standards, European Union, Evidence-Based Medicine/methods, Evidence-Based Medicine/statistics & numerical data, Health Insurance Portability and Accountability Act, Humans, United States
ABSTRACT
Although recent guidelines for dealing with missing data emphasize the need for sensitivity analyses, and such analyses have a long history in statistics, universal recommendations for conducting and displaying these analyses are scarce. We propose graphical displays that help formalize and visualize the results of sensitivity analyses, building upon the idea of 'tipping-point' analysis for randomized experiments with a binary outcome and a dichotomous treatment. The resulting 'enhanced tipping-point displays' are convenient summaries of conclusions obtained from making different modeling assumptions about missingness mechanisms. The primary goal of the displays is to make formal sensitivity analyses more comprehensible to practitioners, thereby helping them assess the robustness of the experiment's conclusions to plausible missingness mechanisms. We also present a recent example of these enhanced displays in a medical device clinical trial that helped lead to FDA approval.
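The enhanced displays themselves cannot be reconstructed from the abstract alone; as a minimal, hypothetical sketch of the underlying tipping-point idea for a two-arm trial with a binary outcome and missing outcomes in both arms, the grid below assumes, cell by cell, a number of successes among the missing subjects in each arm and marks where the trial's significance conclusion would 'tip'. All counts are invented.

```python
# Minimal, hypothetical tipping-point grid: each cell imputes a possible
# number of successes among the missing subjects in each arm and records
# whether the treatment comparison stays significant.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical observed data: (successes, failures, missing) per arm.
succ_t, fail_t, miss_t = 40, 30, 10      # treatment arm
succ_c, fail_c, miss_c = 28, 42, 10      # control arm

pvals = np.empty((miss_t + 1, miss_c + 1))
for i in range(miss_t + 1):              # i successes among missing treated
    for j in range(miss_c + 1):          # j successes among missing controls
        table = [[succ_t + i, fail_t + miss_t - i],
                 [succ_c + j, fail_c + miss_c - j]]
        pvals[i, j] = stats.fisher_exact(table)[1]

plt.imshow(pvals < 0.05, origin="lower", cmap="Greys", aspect="auto")
plt.xlabel("imputed successes among missing controls")
plt.ylabel("imputed successes among missing treated")
plt.title("Cells where the comparison remains significant (p < 0.05)")
plt.show()
```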
Subjects
Statistical Data Interpretation, Statistical Models, Randomized Controlled Trials as Topic/methods, Computer Simulation, Compression Fractures/surgery, Humans, Kyphoplasty/adverse effects, Kyphoplasty/standards, Pain/prevention & control, Spinal Fractures/surgery, United States
ABSTRACT
A number of mixture modeling approaches assume both normality and independent observations. However, these two assumptions are at odds with the reality of many data sets, which are often characterized by an abundance of zero-valued or highly skewed observations as well as observations from biologically related (i.e., non-independent) subjects. We present here a finite mixture model with a zero-inflated Poisson regression component that may be applied to both types of data. This flexible approach allows the use of covariates to model both the Poisson mean and rate of zero inflation and can incorporate random effects to accommodate non-independent observations. We demonstrate the utility of this approach by applying these models to a candidate endophenotype for schizophrenia, but the same methods are applicable to other types of data characterized by zero inflation and non-independence.
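In symbols (our notation, not necessarily the paper's), the zero-inflated Poisson regression component of the mixture can be written as follows, with covariates entering both the Poisson mean and the zero-inflation probability, and with optional random effects b_i and c_i to accommodate non-independent observations from related subjects.

```latex
% Zero-inflated Poisson: a point mass at zero mixed with a Poisson count,
% with covariates on both the zero-inflation probability and the mean,
% and optional random effects b_i, c_i for clustered subjects.
\Pr(Y_i = 0) = \pi_i + (1 - \pi_i)\, e^{-\lambda_i},
\qquad
\Pr(Y_i = k) = (1 - \pi_i)\,\frac{e^{-\lambda_i}\lambda_i^{k}}{k!},
\quad k = 1, 2, \ldots,
\qquad
\log \lambda_i = x_i^{\top}\beta + b_i,
\quad
\operatorname{logit}(\pi_i) = z_i^{\top}\gamma + c_i.
```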
Subjects
Datasets as Topic/statistics & numerical data, Statistical Models, Poisson Distribution, Adult, Endophenotypes, Humans, Middle Aged, Odds Ratio, Schizophrenia/genetics
ABSTRACT
Existing methods that use propensity scores for heterogeneous treatment effect estimation on non-experimental data do not readily extend to the case of more than two treatment options. In this work, we develop a new propensity score-based method for heterogeneous treatment effect estimation when there are three or more treatment options, and prove that it generates unbiased estimates. We demonstrate our method on a real registry of patients with diabetic dyslipidemia in Singapore. On this dataset, our method generates heterogeneous treatment recommendations for patients among three options (statins, fibrates, and non-pharmacological treatment) to control patients' lipid ratios (total cholesterol divided by high-density lipoprotein level). In our numerical study, our proposed method generated more stable estimates than a benchmark method based on a multi-dimensional propensity score.
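The estimator itself is not specified in the abstract; for reference only, the sketch below shows the standard generalized-propensity-score building block for three treatment options (a multinomial model for treatment assignment followed by inverse-probability weighting of a continuous outcome). It is not the paper's heterogeneous-effect method, and all data are simulated.

```python
# Reference sketch, not the paper's method: generalized propensity scores for
# three treatment options via a multinomial model, then inverse-probability-
# weighted mean outcomes (a lipid-ratio-like continuous outcome).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))                    # patient covariates
treatment = rng.integers(0, 3, size=n)         # 0=non-pharma, 1=statin, 2=fibrate
outcome = (4.0 - 0.5 * (treatment == 1) - 0.3 * (treatment == 2)
           + X[:, 0] + rng.normal(scale=0.5, size=n))

# Generalized propensity scores: P(treatment = t | X) for each option.
gps = LogisticRegression(max_iter=1000).fit(X, treatment).predict_proba(X)

# Inverse-probability-weighted mean outcome under each treatment option.
for t, label in enumerate(["non-pharmacological", "statins", "fibrates"]):
    w = (treatment == t) / gps[:, t]
    print(f"{label}: weighted mean outcome = {np.sum(w * outcome) / np.sum(w):.3f}")
```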
Subjects
Dyslipidemias, Hydroxymethylglutaryl-CoA Reductase Inhibitors, Propensity Score, Humans, Dyslipidemias/drug therapy, Hydroxymethylglutaryl-CoA Reductase Inhibitors/therapeutic use, Singapore, Causality, Statistical Models, Fibric Acids/therapeutic use, Hypolipidemic Agents/therapeutic use
ABSTRACT
Matching on an estimated propensity score is frequently used to estimate the effects of treatments from observational data. Since the 1970s, different authors have proposed methods to combine matching at the design stage with regression adjustment at the analysis stage when estimating treatment effects for continuous outcomes. Previous work has consistently shown that the combination has statistical properties generally superior to those of either method by itself. In biomedical and epidemiological research, survival or time-to-event outcomes are common. We propose a method to combine regression adjustment and propensity score matching to estimate survival curves and hazard ratios, based on imputing a potential outcome under control for each successfully matched treated subject using either an accelerated failure time parametric survival model or a Cox proportional hazards model fit to the matched control subjects. The fitted model is then applied to the matched treated subjects to simulate the missing potential outcome under control for each treated subject. Conventional survival analyses (e.g., estimation of survival curves and hazard ratios) can then be conducted using the observed outcome under treatment and the imputed outcome under control. We evaluated the repeated-sampling bias of the proposed methods using simulations. When using nearest-neighbor matching, the proposed method resulted in decreased bias compared to crude analyses in the matched sample. We illustrate the method with an example involving the prescription of beta-blockers at hospital discharge to patients hospitalized with heart failure.
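A minimal, hypothetical sketch of the imputation step only (the matching is assumed to have been done already): a Cox model is fit to the matched controls with the lifelines library, and each matched treated subject's potential survival time under control is simulated by inverting its predicted survival curve, with administrative censoring at the largest observed time for simplicity. The data and simplifications are ours, not the paper's.

```python
# Hypothetical sketch of the imputation step: fit a Cox model to the matched
# controls, then simulate each matched treated subject's potential survival
# time under control by inverse-CDF sampling from its predicted survival curve.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300

# Toy matched controls: covariate x, observed time, event indicator.
controls = pd.DataFrame({"x": rng.normal(size=n)})
controls["time"] = rng.exponential(scale=np.exp(-0.5 * controls["x"]))
controls["event"] = (rng.uniform(size=n) < 0.8).astype(int)

# Matched treated subjects: covariates only (their outcome under control is missing).
treated = pd.DataFrame({"x": rng.normal(size=n)})

cph = CoxPHFitter().fit(controls, duration_col="time", event_col="event")

# Predicted control survival curves for the treated subjects:
# a DataFrame indexed by time, one column per subject.
surv = cph.predict_survival_function(treated)
times = surv.index.to_numpy()

# Inverse-CDF draw: first time at which S(t) drops below a uniform draw;
# if the curve never drops that far, censor at the largest observed time.
u = rng.uniform(size=len(treated))
imputed_t0 = np.empty(len(treated))
for k, col in enumerate(surv.columns):
    below = surv[col].to_numpy() <= u[k]
    imputed_t0[k] = times[np.argmax(below)] if below.any() else times[-1]

print("mean imputed survival time under control:", imputed_t0.mean())
```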
Subjects
Propensity Score, Bias, Humans, Monte Carlo Method, Proportional Hazards Models, Survival Analysis
ABSTRACT
Consider a statistical analysis that draws causal inferences from an observational dataset, inferences that are presented as being valid in the standard frequentist senses; i.e., the analysis produces: (1) consistent point estimates; (2) valid p-values, valid in the sense of rejecting true null hypotheses at the nominal level or less often; and/or (3) confidence intervals, which are presented as having at least their nominal coverage for their estimands. For these statements to hold, even hypothetically, the analysis must embed the observational study in a hypothetical randomized experiment that created the observed data, or a subset of that hypothetical randomized dataset. This multistage effort involves thought-provoking tasks: (1) a purely conceptual stage that precisely formulates the causal question in terms of a hypothetical randomized experiment in which the exposure is assigned to units; (2) a design stage that approximates a randomized experiment before any outcome data are observed; (3) a statistical analysis stage comparing the outcomes of interest in the exposed and non-exposed units of the hypothetical randomized experiment; and (4) a summary stage providing conclusions about the statistical evidence for the sizes of possible causal effects. Stages 2 and 3 may rely on modern computing to implement the effort, whereas Stage 1 demands careful scientific argumentation to make the embedding plausible to scientific readers of the proffered statistical analysis. Otherwise, the resulting analysis is vulnerable to criticism for being simply a presentation of scientifically meaningless arithmetic calculations. The conceptually most demanding tasks are often the most scientifically interesting to the dedicated researcher and readers of the resulting statistical analyses. This perspective is rarely implemented with any rigor; for example, analyses often completely eschew the first stage. We illustrate our approach using an example examining the effect of parental smoking on children's lung function, with data collected from families living in East Boston in the 1970s.
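As one small, hedged illustration of the design stage (stage 2) described above: propensity scores and covariate balance diagnostics can be computed entirely without outcome data. The sketch below uses inverse-probability weighting and standardized mean differences; matching or subclassification would serve the same design-stage purpose, and all variable names and data are invented.

```python
# Illustration of the design stage only: estimate propensity scores and check
# covariate balance without ever loading the outcome variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
covariates = rng.normal(size=(n, 4))                       # e.g. age, height, ...
exposed = rng.binomial(1, 1 / (1 + np.exp(-covariates[:, 0])))

pscore = LogisticRegression().fit(covariates, exposed).predict_proba(covariates)[:, 1]
weights = np.where(exposed == 1, 1 / pscore, 1 / (1 - pscore))

def std_mean_diff(x, z, w=None):
    """Standardized mean difference of covariate x between exposure groups."""
    w = np.ones_like(x) if w is None else w
    m1 = np.average(x[z == 1], weights=w[z == 1])
    m0 = np.average(x[z == 0], weights=w[z == 0])
    pooled_sd = np.sqrt((x[z == 1].var(ddof=1) + x[z == 0].var(ddof=1)) / 2)
    return (m1 - m0) / pooled_sd

for j in range(covariates.shape[1]):
    print(f"covariate {j}: SMD raw = {std_mean_diff(covariates[:, j], exposed):+.3f}, "
          f"weighted = {std_mean_diff(covariates[:, j], exposed, weights):+.3f}")
```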
Subjects
Causality, Statistical Models, Observational Studies as Topic/statistics & numerical data, Randomized Controlled Trials as Topic/statistics & numerical data, Research Design, Adult, Female, Humans, Male, Parents, Smoking Cessation/statistics & numerical data
ABSTRACT
The seminal work of Morgan & Rubin (2012) considers rerandomization for all the units at one time. In practice, however, experimenters may have to rerandomize units sequentially. For example, a clinician studying a rare disease may be unable to wait to perform an experiment until all the experimental units are recruited. Our work offers a mathematical framework for sequential rerandomization designs, where the experimental units are enrolled in groups. We formulate an adaptive rerandomization procedure for balancing treatment/control assignments over some continuous or binary covariates, using Mahalanobis distance as the imbalance measure. Our key result proves that, under certain mild assumptions and given the same number of rerandomizations, sequential rerandomization achieves better covariate balance in expectation than rerandomization performed at one time.
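A simplified, hypothetical sketch of the sequential procedure described above: units arrive in groups, and each new group's treatment/control assignment is rerandomized, holding earlier groups' assignments fixed, until the Mahalanobis distance between the cumulative treatment and control covariate means falls below a threshold. The threshold and group sizes are arbitrary, and the paper's formal procedure and optimality result are more refined.

```python
# Simplified sketch of sequential rerandomization: rerandomize each newly
# enrolled group until the cumulative covariate imbalance (Mahalanobis
# distance between treatment and control means) falls below a threshold.
import numpy as np

rng = np.random.default_rng(0)

def mahalanobis_imbalance(X, assign):
    """Mahalanobis distance between treated and control covariate means."""
    diff = X[assign == 1].mean(axis=0) - X[assign == 0].mean(axis=0)
    cov = np.cov(X, rowvar=False) * (1 / (assign == 1).sum() + 1 / (assign == 0).sum())
    return float(diff @ np.linalg.inv(cov) @ diff)

threshold = 2.0                       # acceptance criterion (arbitrary here)
group_size, n_groups = 20, 5
X_so_far = np.empty((0, 3))
assign_so_far = np.empty(0, dtype=int)

for g in range(n_groups):
    X_new = rng.normal(size=(group_size, 3))          # covariates of the new group
    X_all = np.vstack([X_so_far, X_new])
    while True:                                       # rerandomize the new group only
        a_new = rng.permutation([1] * (group_size // 2) + [0] * (group_size // 2))
        a_all = np.concatenate([assign_so_far, a_new])
        if mahalanobis_imbalance(X_all, a_all) <= threshold:
            break
    X_so_far, assign_so_far = X_all, a_all

print("final Mahalanobis imbalance:", mahalanobis_imbalance(X_so_far, assign_so_far))
```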
ABSTRACT
Blinded randomized controlled trials (RCTs) require participants to be uncertain whether they are receiving a treatment or placebo. Although uncertainty is ideal for isolating the treatment effect from all other potential effects, it is poorly suited for estimating the treatment effect under actual conditions of intended use, when individuals are certain that they are receiving a treatment. We propose an experimental design, randomization to randomization probabilities (R2R), which significantly improves estimates of treatment effects under actual conditions of use by manipulating participant expectations about receiving treatment. In the R2R design, participants are first randomized to a value, π, denoting their probability of receiving treatment (vs. placebo). Subjects are then told their value of π and randomized to either treatment or placebo with probabilities π and 1-π, respectively. Analysis of the treatment effect includes statistical controls for π (necessary for causal inference) and typically a π-by-treatment interaction. Random assignment of subjects to π and disclosure of its value to subjects manipulates subject expectations about receiving the treatment without deception. This method offers a better treatment effect estimate under actual conditions of use than does a conventional RCT. Design properties, guidelines for power analyses, and limitations of the approach are discussed. We illustrate the design by implementing an RCT of caffeine effects on mood and vigilance and show that some of the actual effects of caffeine differ depending on the expectation that one is receiving the active drug.
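A hypothetical simulation of the R2R design as described above, with two disclosed assignment probabilities and an analysis model containing the treatment indicator, π, and their interaction; the π values and effect sizes are invented for illustration.

```python
# Hypothetical simulation of the R2R design: randomize subjects to a disclosed
# treatment probability, then to treatment with that probability, and analyze
# with statistical controls for the probability and its interaction with treatment.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

p_assign = rng.choice([0.3, 0.7], size=n)     # stage 1: randomize to pi, disclose it
treat = rng.binomial(1, p_assign)             # stage 2: randomize to treatment w.p. pi

# Outcome model in which the treatment effect depends on the subject's
# expectation of receiving treatment (an expectancy-by-drug interaction).
y = 1.0 + 0.5 * treat + 0.4 * p_assign + 0.6 * treat * p_assign + rng.normal(size=n)

df = pd.DataFrame({"y": y, "treat": treat, "p_assign": p_assign})
fit = smf.ols("y ~ treat * p_assign", data=df).fit()
print(fit.params)    # includes treat, p_assign, and the treat:p_assign interaction
```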
Subjects
Biomedical Research/methods, Outcome Assessment (Health Care)/methods, Random Allocation, Randomized Controlled Trials as Topic/methods, Research Design, Adult, Affect/drug effects, Arousal/drug effects, Caffeine/pharmacology, Central Nervous System Stimulants/pharmacology, Humans
ABSTRACT
Prior research has focused on the latent structure of endophenotypic markers of schizophrenia liability, or schizotypy. The work supports the existence of 2 relatively distinct latent classes and derives largely from the taxometric analysis of psychometric values. The present study used finite mixture modeling as a technique for discerning latent structure and the laboratory-measured endophenotypes of sustained attention deficits and eye-tracking dysfunction as endophenotype indexes. In a large adult community sample (N=311), finite mixture analysis of the sustained attention index d' and 2 eye-tracking indexes (gain and catch-up saccade rate) revealed evidence for 2 latent components. A putative schizotypy class accounted for 27% of the sample. A supplementary maximum covariance taxometric analysis yielded highly consistent results. Subjects in the schizotypy component displayed higher rates of schizotypal personality features and an increased rate of treated schizophrenia in their 1st-degree biological relatives compared with subjects in the other component. Implications of these results are examined in light of major theories of schizophrenia liability, and methodological advantages of finite mixture modeling for psychopathology research, with particular emphasis on genomic issues, are discussed.
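The study's data are not available from the abstract; the sketch below only illustrates the core analytic step, fitting a two-component finite mixture to three continuous endophenotype indicators (simulated here, loosely standing in for d', pursuit gain, and catch-up saccade rate) and reading off the mixing proportions and component assignments.

```python
# Illustration of the core analytic step: a two-component Gaussian finite
# mixture fit to three continuous endophenotype indicators (simulated data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n0, n1 = 230, 80                      # hypothetical class sizes
indicators = np.vstack([
    rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(n0, 3)),
    rng.normal(loc=[-1.0, -0.8, 0.9], scale=1.0, size=(n1, 3)),   # shifted class
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(indicators)
labels = gmm.predict(indicators)

print("estimated mixing proportions:", np.round(gmm.weights_, 2))
print("component means:\n", np.round(gmm.means_, 2))
print("size of smaller (putative schizotypy-like) component:", np.bincount(labels).min())
```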