Results 1 - 20 of 31
1.
Proc Natl Acad Sci U S A ; 120(21): e2207185120, 2023 May 23.
Article in English | MEDLINE | ID: mdl-37192169

ABSTRACT

Collecting complete network data is expensive, time-consuming, and often infeasible. Aggregated Relational Data (ARD), which ask respondents questions of the form "How many people with trait X do you know?", provide a low-cost option when collecting complete network data is not possible. Rather than asking about connections between each pair of individuals directly, ARD collect the number of contacts the respondent knows with a given trait. Despite widespread use and a growing literature on ARD methodology, there is still no systematic understanding of when and why ARD should accurately recover features of the unobserved network. This paper provides such a characterization by deriving conditions under which statistics about the unobserved network (or functions of these statistics like regression coefficients) can be consistently estimated using ARD. We first provide consistent estimates of network model parameters for three commonly used probabilistic models: the beta-model with node-specific unobserved effects, the stochastic block model with unobserved community structure, and latent geometric space models with unobserved latent locations. A key observation is that cross-group link probabilities for a collection of (possibly unobserved) groups identify the model parameters, meaning ARD are sufficient for parameter estimation. With these estimated parameters, it is possible to simulate graphs from the fitted distribution and analyze the distribution of network statistics. We can then characterize conditions under which the simulated networks based on ARD will allow for consistent estimation of the unobserved network statistics, such as eigenvector centrality, or response functions by or of the unobserved network, such as regression coefficients.

2.
Proc Natl Acad Sci U S A ; 117(48): 30266-30275, 2020 12 01.
Article in English | MEDLINE | ID: mdl-33208538

ABSTRACT

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.


Subjects
Machine Learning; Cause of Death; Computer Simulation; Humans; Organ Specificity
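
The three-stage workflow in this abstract (train a predictor, learn how observed outcomes relate to predictions on the testing set, then correct downstream inference on the validation set) can be illustrated with a minimal Python sketch. It is not the postpi R package's API: the function names, the toy data, and the choice of a straight-line relationship model are assumptions made for illustration, and only the point estimate is corrected here, not the variance.

```python
def ols(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_relationship(observed, predicted):
    """Testing-set step: regress observed outcomes on predicted outcomes."""
    return ols(predicted, observed)


def corrected_slope(covariate, predicted, relationship):
    """Validation-set step: map predictions through the relationship model,
    then regress the mapped outcomes on the covariate of interest."""
    intercept, slope = relationship
    mapped = [intercept + slope * p for p in predicted]
    return ols(covariate, mapped)[1]


# Hypothetical toy data: the predictor shrinks every outcome by half.
test_observed = [2.0, 4.0, 6.0, 8.0]
test_predicted = [1.0, 2.0, 3.0, 4.0]
val_covariate = [1.0, 2.0, 3.0, 4.0]
val_predicted = [0.5, 1.0, 1.5, 2.0]  # the true outcomes equal the covariate

relationship = fit_relationship(test_observed, test_predicted)
naive = ols(val_covariate, val_predicted)[1]
corrected = corrected_slope(val_covariate, val_predicted, relationship)
```

On this toy data the naive slope is attenuated toward zero, while the corrected slope matches the slope of the true outcomes on the covariate.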
3.
Popul Health Metr ; 20(1): 3, 2022 01 10.
Article in English | MEDLINE | ID: mdl-35012587

ABSTRACT

BACKGROUND: The mortality pattern from birth to age five is known to vary by underlying cause of mortality, which has been documented in multiple instances. Many countries without high-functioning vital registration systems could benefit from estimates of age- and cause-specific mortality to inform health programming; however, to date the causes of under-five death have only been described for broad age categories such as neonates (0-27 days), infants (0-11 months), and children aged 12-59 months. METHODS: We adapt the log quadratic model to mortality patterns for children under five to all-cause child mortality and then to age- and cause-specific mortality (U5ACSM). We apply these methods to empirical sample registration system mortality data in China from 1996 to 2015. Based on these empirical data, we simulate probabilities of mortality in the case when the true relationships between age and mortality by cause are known. RESULTS: We estimate U5ACSM within 0.1-0.7 deaths per 1000 livebirths in hold-out strata for life tables constructed from the China sample registration system, representing considerable improvement compared to an error of 1.2 per 1000 livebirths using a standard approach. This improved prediction error for U5ACSM is consistently demonstrated for all-cause as well as pneumonia- and injury-specific mortality. We also consistently identified cause-specific mortality patterns in simulated mortality scenarios. CONCLUSION: The log quadratic model is a significant improvement over the standard approach for deriving U5ACSM based on both simulation and empirical results.


Subjects
Child Mortality; Infant Mortality; Cause of Death; Child; Child, Preschool; China/epidemiology; Humans; Infant; Infant, Newborn; Life Tables
4.
Biostatistics ; 20(4): 549-564, 2019 10 01.
Article in English | MEDLINE | ID: mdl-29741607

ABSTRACT

In many clinical settings, a patient outcome takes the form of a scalar time series with a recovery curve shape, which is characterized by a sharp drop due to a disruptive event (e.g., surgery) and a subsequent smooth, monotonic rise towards an asymptotic level not exceeding the pre-event value. We propose a Bayesian model that predicts recovery curves based on information available before the disruptive event. A recovery curve of interest is the quantified sexual function of prostate cancer patients after prostatectomy surgery. We illustrate the utility of our model as a pre-treatment medical decision aid, producing personalized predictions that are both interpretable and accurate. We uncover covariate relationships that agree with and supplement those in the existing medical literature.


Subjects
Decision Support Techniques; Models, Statistical; Outcome Assessment, Health Care/statistics & numerical data; Prostatectomy/statistics & numerical data; Aged; Bayes Theorem; Humans; Male; Middle Aged; Prostatectomy/adverse effects
5.
BMC Med ; 18(1): 69, 2020 03 26.
Article in English | MEDLINE | ID: mdl-32213178

ABSTRACT

BACKGROUND: A verbal autopsy (VA) is an interview conducted with the caregivers of someone who has recently died to describe the circumstances of the death. In recent years, several algorithmic methods have been developed to classify cause of death using VA data. The performance of one method, InSilicoVA, was evaluated in a study by Flaxman et al., published in BMC Medicine in 2018. The results of that study are different from those previously published by our group. METHODS: Based on the description of methods in the Flaxman et al. study, we attempt to replicate the analysis to understand why the published results differ from those of our previous work. RESULTS: We failed to reproduce the results published in Flaxman et al. Most of the discrepancies we find likely result from undocumented differences in data pre-processing, and/or values assigned to key parameters governing the behavior of the algorithm. CONCLUSION: This finding highlights the importance of making replication code available along with published results. All code necessary to replicate the work described here is freely available on GitHub.


Subjects
Autopsy/methods; Cause of Death/trends; Humans; Research Design; Validation Studies as Topic
6.
Am Econ Rev ; 110(8): 2454-2484, 2020 Aug.
Article in English | MEDLINE | ID: mdl-34526729

ABSTRACT

Social network data are often prohibitively expensive to collect, limiting empirical network research. We propose an inexpensive and feasible strategy for network elicitation using Aggregated Relational Data (ARD): responses to questions of the form "how many of your links have trait k?" Our method uses ARD to recover parameters of a network formation model, which permits sampling from a distribution over node- or graph-level statistics. We replicate the results of two field experiments that used network data and draw similar conclusions with ARD alone.

7.
BMC Med ; 17(1): 116, 2019 06 27.
Article in English | MEDLINE | ID: mdl-31242925

ABSTRACT

BACKGROUND: Verbal autopsies with physician assignment of cause of death (COD) are commonly used in settings where medical certification of deaths is uncommon. It remains unanswered if automated algorithms can replace physician assignment. METHODS: We randomized verbal autopsy interviews for deaths in 117 villages in rural India to either physician or automated COD assignment. Twenty-four trained lay (non-medical) surveyors applied the allocated method using a laptop-based electronic system. Two of 25 physicians were allocated randomly to independently code the deaths in the physician assignment arm. Six algorithms (Naïve Bayes Classifier (NBC), King-Lu, InSilicoVA, InSilicoVA-NT, InterVA-4, and SmartVA) coded each death in the automated arm. The primary outcome was concordance with the COD distribution in the standard physician-assigned arm. Four thousand six hundred fifty-one (4651) deaths were allocated to physician (standard), and 4723 to automated arms. RESULTS: The two arms were nearly identical in demographics and key symptom patterns. The average concordances of automated algorithms with the standard were 62%, 56%, and 59% for adult, child, and neonatal deaths, respectively. Automated algorithms showed inconsistent results, even for causes that are relatively easy to identify such as road traffic injuries. Automated algorithms underestimated the number of cancer and suicide deaths in adults and overestimated other injuries in adults and children. Across all ages, average weighted concordance with the standard was 62% (range 79-45%) with the best to worst ranking automated algorithms being InterVA-4, InSilicoVA-NT, InSilicoVA, SmartVA, NBC, and King-Lu. Individual-level sensitivity for causes of adult deaths in the automated arm was low between the algorithms but high between two independent physicians in the physician arm. CONCLUSIONS: While desirable, automated algorithms require further development and rigorous evaluation. Lay reporting of deaths paired with physician COD assignment of verbal autopsies, despite some limitations, remains a practicable method to document the patterns of mortality reliably for unattended deaths. TRIAL REGISTRATION: ClinicalTrials.gov, NCT02810366. Submitted on 11 April 2016.


Subjects
Autopsy/methods; Data Collection/methods; Physicians/standards; Adult; Child; Death; Female; Humans; India; Male
8.
Proc Natl Acad Sci U S A ; 113(51): 14668-14673, 2016 12 20.
Article in English | MEDLINE | ID: mdl-27930328

ABSTRACT

Respondent-driven sampling (RDS) is a network-based form of chain-referral sampling used to estimate attributes of populations that are difficult to access using standard survey tools. Although it has grown quickly in popularity since its introduction, the statistical properties of RDS estimates remain elusive. In particular, the sampling variability of these estimates has been shown to be much higher than previously acknowledged, and even methods designed to account for RDS result in misleadingly narrow confidence intervals. In this paper, we introduce a tree bootstrap method for estimating uncertainty in RDS estimates based on resampling recruitment trees. We use simulations from known social networks to show that the tree bootstrap method not only outperforms existing methods but also captures the high variability of RDS, even in extreme cases with high design effects. We also apply the method to data from injecting drug users in Ukraine. Unlike other methods, the tree bootstrap depends only on the structure of the sampled recruitment trees, not on the attributes being measured on the respondents, so correlations between attributes can be estimated as well as variability. Our results suggest that it is possible to accurately assess the high level of uncertainty inherent in RDS.


Subjects
HIV Infections/epidemiology; HIV Infections/transmission; Patient Selection; Social Support; Adolescent; Adolescent Behavior; Algorithms; Centers for Disease Control and Prevention, U.S.; Colorado; Computer Simulation; Female; Heterosexuality; Humans; Longitudinal Studies; Male; Models, Statistical; Probability; Risk-Taking; Schools; Sex Workers; Sexual Behavior; Substance Abuse, Intravenous; Surveys and Questionnaires; Ukraine; Uncertainty; United States
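
Resampling recruitment trees, as described in this abstract, can be sketched as a recursive bootstrap: draw seeds with replacement, then for each drawn respondent draw from their recruits with replacement, and so on down the tree. The Python sketch below is an illustrative reading of that procedure, not the authors' code. The names `children` and `trait` are hypothetical, and an unweighted sample mean stands in for whatever RDS estimator is actually used.

```python
import random


def resample_tree(node, children, rng):
    """Return respondent ids in one bootstrap replicate of the subtree at node."""
    sample = [node]
    recruits = children.get(node, [])
    # Draw as many recruits as were observed, with replacement.
    for _ in recruits:
        pick = rng.choice(recruits)
        sample.extend(resample_tree(pick, children, rng))
    return sample


def tree_bootstrap(seeds, children, rng):
    """Resample seeds with replacement, then resample each seed's tree."""
    sample = []
    for _ in seeds:
        sample.extend(resample_tree(rng.choice(seeds), children, rng))
    return sample


def bootstrap_interval(seeds, children, trait, reps=1000, level=0.95, seed=0):
    """Percentile interval for the (unweighted) mean of a 0/1 trait."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        ids = tree_bootstrap(seeds, children, rng)
        means.append(sum(trait[i] for i in ids) / len(ids))
    means.sort()
    lo_idx = int((1 - level) / 2 * (reps - 1))
    hi_idx = int((1 + level) / 2 * (reps - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical recruitment data: two seeds and their recruitment chains.
seeds = ["s1", "s2"]
children = {"s1": ["a", "b"], "a": ["c"], "s2": ["d"]}
trait = {"s1": 1, "s2": 0, "a": 1, "b": 0, "c": 1, "d": 0}
low, high = bootstrap_interval(seeds, children, trait)
```

Because only tree membership is resampled, the same replicates can be reused for any attribute measured on the respondents, which is the property the abstract highlights.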
9.
Demography ; 55(5): 1979-1999, 2018 10.
Article in English | MEDLINE | ID: mdl-30276667

ABSTRACT

The digital traces that we leave online are increasingly fruitful sources of data for social scientists, including those interested in demographic research. The collection and use of digital data also present numerous statistical, computational, and ethical challenges, motivating the development of new research approaches to address these burgeoning issues. In this article, we argue that researchers with formal training in demography, those who have a history of developing innovative approaches to using challenging data, are well positioned to contribute to this area of work. We discuss the benefits and challenges of using digital trace data for social and demographic research, and we review examples of current demographic literature that creatively use digital trace data to study processes related to fertility, mortality, and migration. Focusing on Facebook data for advertisers, a novel "digital census" that has largely been untapped by demographers, we provide illustrative and empirical examples of how demographic researchers can manage issues such as bias and representation when using digital trace data. We conclude by offering our perspective on the road ahead regarding demography and its role in the data revolution.


Subjects
Big Data; Data Collection/methods; Demography/methods; Research; Social Media/statistics & numerical data; Bias; Birth Rate/trends; Data Collection/ethics; Demography/ethics; Ethics, Research; Humans; Mortality/trends; Privacy; Racial Groups/statistics & numerical data; Social Media/ethics
11.
Sociol Methods Res ; 46(3): 390-421, 2017 08.
Article in English | MEDLINE | ID: mdl-29033471

ABSTRACT

Despite recent and growing interest in using Twitter to examine human behavior and attitudes, there is still significant room for growth regarding the ability to leverage Twitter data for social science research. In particular, gleaning demographic information about Twitter users, a key component of much social science research, remains a challenge. This article develops an accurate and reliable data processing approach for social science researchers interested in using Twitter data to examine behaviors and attitudes, as well as the demographic characteristics of the populations expressing or engaging in them. Using information gathered from Twitter users who state an intention to not vote in the 2012 presidential election, we describe and evaluate a method for processing data to retrieve demographic information reported by users that is not encoded as text (e.g., details of images) and evaluate the reliability of these techniques. We end by assessing the challenges of this data collection strategy and discussing how large-scale social media data may benefit demographic researchers.

12.
Nat Commun ; 15(1): 4626, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816383

ABSTRACT

The human infectious reservoir of Plasmodium falciparum is governed by transmission efficiency during vector-human contact and mosquito biting preferences. Understanding biting bias in a natural setting can help target interventions to interrupt transmission. In a 15-month cohort in western Kenya, we detected P. falciparum in indoor-resting Anopheles and human blood samples by qPCR and matched mosquito bloodmeals to cohort participants using short-tandem repeat genotyping. Using risk factor analyses and discrete choice models, we assessed mosquito biting behavior with respect to parasite transmission. Biting was highly unequal; 20% of people received 86% of bites. Biting rates were higher on males (biting rate ratio (BRR): 1.68; CI: 1.28-2.19), children 5-15 years (BRR: 1.49; CI: 1.13-1.98), and P. falciparum-infected individuals (BRR: 1.25; CI: 1.01-1.55). In aggregate, P. falciparum-infected school-age (5-15 years) boys accounted for 50% of bites potentially leading to onward transmission and had an entomological inoculation rate 6.4x higher than any other group. Additionally, infectious mosquitoes were nearly 3x more likely than non-infectious mosquitoes to bite P. falciparum-infected individuals (relative risk ratio 2.76, 95% CI 1.65-4.61). Thus, persistent P. falciparum transmission was characterized by disproportionate onward transmission from school-age boys and by the preference of infected mosquitoes to feed upon infected people.


Subjects
Anopheles; Insect Bites and Stings; Malaria, Falciparum; Mosquito Vectors; Plasmodium falciparum; Humans; Anopheles/parasitology; Anopheles/physiology; Animals; Plasmodium falciparum/physiology; Plasmodium falciparum/isolation & purification; Plasmodium falciparum/genetics; Malaria, Falciparum/transmission; Malaria, Falciparum/parasitology; Male; Adolescent; Child; Child, Preschool; Female; Kenya/epidemiology; Mosquito Vectors/parasitology; Mosquito Vectors/physiology; Adult; Feeding Behavior; Young Adult; Infant
13.
Biometrics ; 68(1): 23-30, 2012 Mar.
Article in English | MEDLINE | ID: mdl-21838812

ABSTRACT

We propose an online binary classification procedure for cases when there is uncertainty about the model to use and parameters within a model change over time. We account for model uncertainty through dynamic model averaging, a dynamic extension of Bayesian model averaging in which posterior model probabilities may also change with time. We apply a state-space model to the parameters of each model and we allow the data-generating model to change over time according to a Markov chain. Calibrating a "forgetting" factor accommodates different levels of change in the data-generating mechanism. We propose an algorithm that adjusts the level of forgetting in an online fashion using the posterior predictive distribution, and so accommodates various levels of change at different times. We apply our method to data from children with appendicitis who receive either a traditional (open) appendectomy or a laparoscopic procedure. Factors associated with which children receive a particular type of procedure changed substantially over the 7 years of data collection, a feature that is not captured using standard regression modeling. Because our procedure can be implemented completely online, future data collection for similar studies would require storing sensitive patient information only temporarily, reducing the risk of a breach of confidentiality.


Subjects
Appendectomy/statistics & numerical data; Appendicitis/epidemiology; Appendicitis/surgery; Laparoscopy/statistics & numerical data; Logistic Models; Pattern Recognition, Automated/methods; Child; Computer Simulation; Humans; Prevalence; Regression Analysis; Treatment Outcome
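
The forgetting idea in this abstract can be illustrated with the model-probability recursion commonly used in dynamic model averaging: raise the previous model probabilities to a power `alpha` slightly below 1 (which flattens them and so discounts old evidence), renormalize, and then update with each model's predictive likelihood for the new observation. This is a sketch under the assumption that the paper follows that standard recursion. The function name and the example numbers are hypothetical, and the per-model state-space updates are omitted.

```python
def update_model_probabilities(weights, likelihoods, alpha=0.99):
    """One online step of dynamic model averaging over a fixed set of models.

    weights:      posterior model probabilities after the previous observation
    likelihoods:  each model's predictive probability of the new observation
    alpha:        forgetting factor in [0, 1]; smaller values forget faster
    """
    # Prediction step: flatten the old probabilities with the forgetting factor.
    flattened = [w ** alpha for w in weights]
    total = sum(flattened)
    predicted = [f / total for f in flattened]

    # Update step: reweight by how well each model predicted the new data point.
    unnormalized = [p * l for p, l in zip(predicted, likelihoods)]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


# Hypothetical example: the second model predicts the new observation better,
# so probability mass shifts toward it.
probs = update_model_probabilities([0.9, 0.1], [0.2, 0.6], alpha=0.95)
```

Setting `alpha=1` recovers ordinary Bayesian model averaging, while `alpha=0` discards all past evidence at every step. Adjusting `alpha` online, as the abstract describes, amounts to choosing between such values using the predictive performance at each time point.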
14.
R J ; 14(4): 316-334, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37974934

ABSTRACT

Verbal autopsy (VA) is a survey-based tool widely used to infer cause of death (COD) in regions without complete-coverage civil registration and vital statistics systems. In such settings, many deaths happen outside of medical facilities and are not officially documented by a medical professional. VA surveys, consisting of signs and symptoms reported by a person close to the decedent, are used to infer the COD for an individual, and to estimate and monitor the COD distribution in the population. Several classification algorithms have been developed and widely used to assign causes of death using VA data. However, the incompatibility between different idiosyncratic model implementations and required data structures makes it difficult to systematically apply and compare different methods. The openVA package provides the first standardized framework for analyzing VA data that is compatible with all openly available methods and data structures. It provides an open-source R implementation of several of the most widely used VA methods. It supports different data input and output formats, and customizable information about the associations between causes and symptoms. The paper discusses the relevant algorithms and their implementations in R packages under the openVA suite, and demonstrates the pipeline of model fitting, summary, comparison, and visualization in the R environment.

15.
Ann Appl Stat ; 16(1): 124-143, 2022 Mar.
Article in English | MEDLINE | ID: mdl-37621750

ABSTRACT

In order to implement disease-specific interventions in young age groups, policy makers in low- and middle-income countries require timely and accurate estimates of age- and cause-specific child mortality. High-quality data is not available in settings where these interventions are most needed, but there is a push to create sample registration systems that collect detailed mortality information. Current methods that estimate mortality from this data employ multistage frameworks without rigorous statistical justification that separately estimate all-cause and cause-specific mortality and are not sufficiently adaptable to capture important features of the data. We propose a flexible Bayesian modeling framework to estimate age- and cause-specific child mortality from sample registration data. We provide a theoretical justification for the framework, explore its properties via simulation, and use it to estimate mortality trends using data from the Maternal and Child Health Surveillance System in China.

16.
17.
Epidemics ; 36: 100477, 2021 09.
Article in English | MEDLINE | ID: mdl-34171509

ABSTRACT

The novel SARS-CoV-2 virus, as it manifested in India in April 2020, showed marked heterogeneity in its transmission. Here, we used data collected from contact tracing during the lockdown in response to the first wave of COVID-19 in Punjab, a major state in India, to quantify this heterogeneity, and to examine implications for transmission dynamics. We found evidence of heterogeneity acting at multiple levels: in the number of potentially infectious contacts per index case, and in the per-contact risk of infection. Incorporating these findings in simple mathematical models of disease transmission reveals that these heterogeneities act in combination to strongly influence transmission dynamics. Standard approaches, such as representing heterogeneity through secondary case distributions, could be biased by neglecting these underlying interactions between heterogeneities. We discuss implications for policy, and for more efficient contact tracing in resource-constrained settings such as India. Our results highlight how contact tracing, an important public health measure, can also provide important insights into epidemic spread and control.


Subjects
COVID-19; SARS-CoV-2; Communicable Disease Control; Contact Tracing; Humans; India/epidemiology
18.
Ann Appl Stat ; 14(1): 241-256, 2020 Mar.
Article in English | MEDLINE | ID: mdl-33520049

ABSTRACT

The distribution of deaths by cause provides crucial information for public health planning, response and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data collection.

19.
medRxiv ; 2020 Sep 15.
Article in English | MEDLINE | ID: mdl-32995809

ABSTRACT

The novel SARS-CoV-2 virus shows marked heterogeneity in its transmission. Here, we used data collected from contact tracing during the lockdown in Punjab, a major state in India, to quantify this heterogeneity, and to examine implications for transmission dynamics. We found evidence of heterogeneity acting at multiple levels: in the number of potentially infectious contacts per index case, and in the per-contact risk of infection. Incorporating these findings in simple mathematical models of disease transmission reveals that these heterogeneities act in combination to strongly influence transmission dynamics. Standard approaches, such as representing heterogeneity through secondary case distributions, could be biased by neglecting these underlying interactions between heterogeneities. We discuss implications for policy, and for more efficient contact tracing in resource-constrained settings such as India. Our results highlight how contact tracing, an important public health measure, can also provide important insights into epidemic spread and control.

20.
J Comput Graph Stat ; 28(4): 767-777, 2019.
Article in English | MEDLINE | ID: mdl-33033426

ABSTRACT

Bayesian graphical models are a useful tool for understanding dependence relationships among many variables, particularly in situations with external prior information. In high-dimensional settings, the space of possible graphs becomes enormous, rendering even state-of-the-art Bayesian stochastic search computationally infeasible. We propose a deterministic alternative to estimate Gaussian and Gaussian copula graphical models using an Expectation Conditional Maximization (ECM) algorithm, extending the EM approach from Bayesian variable selection to graphical model estimation. We show that the ECM approach enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information.
