Results 1 - 20 of 41
1.
NPJ Digit Med ; 7(1): 109, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38698174

ABSTRACT

In the most comprehensive population surveys, mental health is only broadly captured through questionnaires asking about "mentally unhealthy days" or feelings of "sadness." Further, population mental health estimates are predominantly consolidated to yearly estimates at the state level, which is considerably coarser than the best estimates of physical health. Through the large-scale analysis of social media, robust estimation of population mental health is feasible at finer resolutions. In this study, we created a pipeline that used ~1 billion Tweets from 2 million geo-located users to estimate mental health levels and changes for depression and anxiety, the two leading mental health conditions. Language-based mental health assessments (LBMHAs) had substantially higher levels of reliability across space and time than available survey measures. This work presents reliable assessments of depression and anxiety down to the county-weeks level. Where surveys were available, we found moderate to strong associations between the LBMHAs and survey scores for multiple levels of granularity, from the national level down to weekly county measurements (fixed effects β = 0.34 to 1.82; p < 0.001). LBMHAs demonstrated temporal validity, showing clear absolute increases after a list of major societal events (+23% absolute change for depression assessments). LBMHAs showed improved external validity, evidenced by stronger correlations with measures of health and socioeconomic status than population surveys. This study shows that the careful aggregation of social media data yields spatiotemporal estimates of population mental health that exceed the granularity achievable by existing population surveys, and does so with generally greater reliability and validity.

2.
Proc Natl Acad Sci U S A ; 121(14): e2319837121, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38530887

ABSTRACT

Depression has robust natural language correlates and can increasingly be measured in language using predictive models. However, despite evidence that language use varies as a function of individual demographic features (e.g., age, gender), previous work has not systematically examined whether and how depression's association with language varies by race. We examine how race moderates the relationship between language features (i.e., first-person pronouns and negative emotions) from social media posts and self-reported depression, in a matched sample of Black and White English speakers in the United States. Our findings reveal moderating effects of race: While depression severity predicts I-usage in White individuals, it does not in Black individuals. White individuals use more belongingness and self-deprecation-related negative emotions. Machine learning models trained on similar amounts of data to predict depression severity performed poorly when tested on Black individuals, even when they were trained exclusively using the language of Black individuals. In contrast, analogous models tested on White individuals performed relatively well. Our study reveals surprising race-based differences in the expression of depression in natural language and highlights the need to understand these effects better, especially before language-based models for detecting psychological phenomena are integrated into clinical practice.


Subject(s)
Depression , Social Media , Humans , United States , Depression/psychology , Emotions , Language
3.
Emotion ; 24(1): 106-115, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37199938

ABSTRACT

Many scholars have proposed that feeling what we believe others are feeling, often known as "empathy," is essential for other-regarding sentiments and plays an important role in our moral lives. Caring for and about others (without necessarily sharing their feelings), often known as "compassion," is also frequently discussed as a relevant force for prosocial motivation and action. Here, we explore the relationship between empathy and compassion using the methods of computational linguistics. Analyses of 2,356,916 Facebook posts suggest that individuals (N = 2,781) high in empathy use different language than those high in compassion, after accounting for shared variance between these constructs. Empathic people, controlling for compassion, often use self-focused language and write about negative feelings, social isolation, and feeling overwhelmed. Compassionate people, controlling for empathy, often use other-focused language and write about positive feelings and social connections. In addition, high empathy without compassion is related to negative health outcomes, while high compassion without empathy is related to positive health outcomes, positive lifestyle choices, and charitable giving. Such findings favor an approach to moral motivation that is grounded in compassion rather than empathy. (PsycInfo Database Record (c) 2024 APA, all rights reserved).


Subject(s)
Emotions , Empathy , Humans , Motivation , Morals , Linguistics
5.
Front Public Health ; 11: 1275975, 2023.
Article in English | MEDLINE | ID: mdl-38074754

ABSTRACT

Introduction: Substances and the people who use them have been dehumanized for decades. As a result, lawmakers and healthcare providers have implemented policies that subjected millions to criminalization, incarceration, and inadequate resources to support health and wellbeing. While there have been recent shifts in public opinion on issues such as legalization, in the case of marijuana in the U.S., or addiction as a disease, dehumanization and stigma are still leading barriers for individuals seeking treatment. Integral to the narrative of "substance users" as thoughtless zombies or violent criminals is their portrayal in popular media, such as films and news. Methods: This study attempts to quantify the dehumanization of people who use substances (PWUS) across time using a large corpus of over 3 million news articles. We apply a computational linguistic framework for measuring dehumanization across three decades of New York Times articles. Results: We show that (1) levels of dehumanization remain high and (2) while marijuana has become less dehumanized over time, attitudes toward other substances such as heroin and cocaine remain stable. Discussion: This work highlights the importance of a holistic view of substance use that places all substances within the context of addiction as a disease, prioritizes the humanization of PWUS, and centers around harm reduction.


Subject(s)
Behavior, Addictive , Cannabis , Substance-Related Disorders , Humans , Dehumanization , Social Stigma
6.
Hepatol Commun ; 7(12)2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38055637

ABSTRACT

BACKGROUND: Sensors within smartphones, such as accelerometer and location, can describe longitudinal markers of behavior as represented through devices in a method called digital phenotyping. This study aimed to assess the feasibility of digital phenotyping for patients with alcohol-associated liver disease and alcohol use disorder, determine correlations between smartphone data and alcohol craving, and establish power assessment for future studies to prognosticate clinical outcomes. METHODS: A total of 24 individuals with alcohol-associated liver disease and alcohol use disorder were instructed to download the AWARE application to collect continuous sensor data and complete daily ecological momentary assessments on alcohol craving and mood for up to 30 days. Data from sensor streams were processed into features like accelerometer magnitude, number of calls, and location entropy, which were used for statistical analysis. We used repeated measures correlation for longitudinal data to evaluate associations between sensors and ecological momentary assessments and standard Pearson correlation to evaluate within-individual relationships between sensors and craving. RESULTS: Alcohol craving significantly correlated with mood obtained from ecological momentary assessments. Across all sensors, features associated with craving were also significantly correlated with all moods (e.g., loneliness and stress) except boredom. Individual-level analysis revealed significant relationships between craving and features of location entropy and average accelerometer magnitude. CONCLUSIONS: Smartphone sensors may serve as markers for alcohol craving and mood in alcohol-associated liver disease and alcohol use disorder. Findings suggest that location-based and accelerometer-based features may be associated with alcohol craving. However, data missingness and low participant retention remain challenges. Future studies are needed for further digital phenotyping of relapse risk and progression of liver disease.


Subject(s)
Alcoholism , Liver Diseases, Alcoholic , Humans , Craving , Alcoholism/diagnosis , Smartphone , Alcohol Drinking
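
The abstract above names location entropy as one of the sensor-derived features but does not define it. A minimal sketch, assuming the common definition (Shannon entropy of the share of observations at each distinct place) and using hypothetical place labels rather than anything from the AWARE application or the study data:

```python
import math
from collections import Counter

def location_entropy(visits):
    """Shannon entropy (in nats) of the distribution of observations over places.

    `visits` is a list of place labels, one per location sample.
    This definition is an assumption; the abstract does not specify one.
    """
    counts = Counter(visits)
    total = sum(counts.values())
    # Sum -p * log(p) over the share p of samples at each distinct place.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Hypothetical day of location samples for one participant.
stay_home = ["home", "home", "home", "home"]
split_day = ["home", "work", "home", "work"]
```

Under this definition a participant who stays in one place scores zero, and a day split evenly between two places scores log(2); higher values indicate time spread across more places.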
7.
Article in English | MEDLINE | ID: mdl-38125747

ABSTRACT

Full national coverage below the state level is difficult to attain through survey-based data collection. Even the largest survey-based data collections, such as the CDC's Behavioral Risk Factor Surveillance System or the Gallup-Healthways Well-being Index (both with more than 300,000 responses p.a.) only allow for the estimation of annual averages for about 260 out of roughly 3,000 U.S. counties when a threshold of 300 responses per county is used. Using a relatively high threshold of 300 responses gives substantially higher convergent validity (higher correlations with health variables) than lower thresholds but covers a reduced and biased sample of the population. We present principled methods to interpolate spatial estimates and show that including large-scale geotagged social media data can increase interpolation accuracy. In this work, we focus on Gallup-reported life satisfaction, a widely-used measure of subjective well-being. We use Gaussian Processes (GP), a formal Bayesian model, to interpolate life satisfaction, which we optimally combine with estimates from low-count data. We interpolate over several spaces (geographic and socioeconomic) and extend these evaluations to the space created by variables encoding language frequencies of approximately 6 million geotagged Twitter users. We find that Twitter language use can serve as a rough aggregate measure of socioeconomic and cultural similarity, and improves upon estimates derived from a wide variety of socioeconomic, demographic, and geographic similarity measures. We show that applying Gaussian Processes to the limited Gallup data allows us to generate estimates for a much larger number of counties while maintaining the same level of convergent validity with external criteria (i.e., N = 1,133 vs. 2,954 counties). This work suggests that spatial coverage of psychological variables can be reliably extended through Bayesian techniques while maintaining out-of-sample prediction accuracy and that Twitter language adds important information about cultural similarity over and above traditional socio-demographic and geographic similarity measures. Finally, to facilitate the adoption of these methods, we have also open-sourced an online tool that researchers can freely use to interpolate their data across geographies.

8.
Drug Alcohol Depend Rep ; 8: 100186, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37692907

ABSTRACT

Background: Americans reported significant increases in mental health and substance use problems after the COVID-19 pandemic outbreak. This can be a product of the pandemic disruptions in everyday life, with some populations being more impacted than others. Objectives: To assess the ongoing impact of the COVID-19 pandemic on mental health and substance use in U.S. adults from September 2020 to August 2021. Methods: Participants included 1056 adults (68.5% women) who participated in a national longitudinal online survey assessing the perceived impact of COVID-19 on daily life, stress, depression and anxiety symptoms, and alcohol and cannabis use at 3 time points from September 2020 to August 2021. Results: Individuals with lower self-reported social status reported the highest perceived impact. Participants' perceived impact of the COVID-19 pandemic on daily life, stress, anxiety, and alcohol use risk significantly decreased over time but remained high. However, there was no change in depressive symptoms and cannabis use. Higher levels of perceived impact of the pandemic significantly predicted both more baseline mental health concerns and lower decreases over time. Lower self-reported social status predicted more baseline mental health concerns and smaller decreases in those concerns. Black adults reported significantly higher cannabis use rates than non-Hispanic White adults. Conclusion: The impact of COVID-19 on daily life continued to be a risk factor for mental health during the second wave of the pandemic. In addition to infection prevention, public health policies should focus on pandemic-related social factors such as economic concerns and caretaking that continue to affect mental health.

9.
Sci Rep ; 13(1): 9027, 2023 06 03.
Article in English | MEDLINE | ID: mdl-37270657

ABSTRACT

Opioid poisoning mortality is a substantial public health crisis in the United States, with opioids involved in approximately 75% of the nearly 1 million drug related deaths since 1999. Research suggests that the epidemic is driven by both over-prescribing and social and psychological determinants such as economic stability, hopelessness, and isolation. Hindering this research is a lack of measurements of these social and psychological constructs at fine-grained spatial and temporal resolutions. To address this issue, we use a multi-modal data set consisting of natural language from Twitter, psychometric self-reports of depression and well-being, and traditional area-based measures of socio-demographics and health-related risk factors. Unlike previous work using social media data, we do not rely on opioid or substance related keywords to track community poisonings. Instead, we leverage a large, open vocabulary of thousands of words in order to fully characterize communities suffering from opioid poisoning, using a sample of 1.5 billion tweets from 6 million U.S. county mapped Twitter users. Results show that Twitter language predicted opioid poisoning mortality better than factors relating to socio-demographics, access to healthcare, physical pain, and psychological well-being. Additionally, risk factors revealed by the Twitter language analysis included negative emotions, discussions of long work hours, and boredom, whereas protective factors included resilience, travel/leisure, and positive emotions, dovetailing with results from the psychometric self-report data. The results show that natural language from public social media can be used as a surveillance tool for both predicting community opioid poisonings and understanding the dynamic social and psychological nature of the epidemic.


Subject(s)
Social Media , Humans , United States/epidemiology , Analgesics, Opioid , Self Report , Language , Anxiety
10.
Psychol Methods ; 28(6): 1478-1498, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37126041

ABSTRACT

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (https://r-text.org/), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. The text-package is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large data sets. The tutorial describes methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for the advanced users to build on for novel pipelines. The reader learns about three core methods: (1) textEmbed(): to transform text to modern transformer-based word embeddings; (2) textTrain() and textPredict(): to train predictive models with embeddings as input, and use the models to predict from; (3) textSimilarity() and textDistance(): to compute semantic similarity/distance scores between texts. The reader also learns about two extended methods: (1) textProjection()/textProjectionPlot() and (2) textCentrality()/textCentralityPlot(): to examine and visualize text within the embedding space. (PsycInfo Database Record (c) 2024 APA, all rights reserved).


Subject(s)
Language , Natural Language Processing , Humans , Semantics , Social Sciences
11.
Front Public Health ; 11: 1092269, 2023.
Article in English | MEDLINE | ID: mdl-37033081

ABSTRACT

Background: Racial/ethnic minorities are disproportionately impacted by the COVID-19 pandemic, as they are more likely to experience structural and interpersonal racial discrimination, and thus social marginalization. Based on this, we tested for associations between pandemic distress outcomes and four exposures: racial segregation, coronavirus-related racial bias, social status, and social support. Methods: Data were collected as part of a larger longitudinal national study on mental health during the pandemic (n = 1,309). We tested if county-level segregation and individual-level social status, social support, and coronavirus racial bias were associated with pandemic distress using cumulative ordinal regression models, both unadjusted and adjusted for covariates (gender, age, education, and income). Results: Both the segregation index (PR = 1.19; 95% CI 1.03, 1.36) and the coronavirus racial bias scale (PR = 1.17; 95% CI 1.06, 1.29) were significantly associated with pandemic distress. Estimates were similar, after adjusting for covariates, for both segregation (aPR = 1.15; 95% CI 1.01, 1.31) and coronavirus racial bias (PR = 1.12; 95% CI 1.02, 1.24). Higher social status (aPR = 0.74; 95% CI 0.64, 0.86) and social support (aPR = 0.81; 95% CI 0.73, 0.90) were associated with lower pandemic distress after adjustment. Conclusion: Segregation and coronavirus racial bias are relevant pandemic stressors, and thus have implications for minority health. Future research exploring potential mechanisms of this relationship, including specific forms of racial discrimination related to pandemic distress and implications for social justice efforts, is recommended.


Subject(s)
COVID-19 , Racism , Humans , COVID-19/epidemiology , Pandemics , Income , Longitudinal Studies
12.
Neuropsychopharmacology ; 48(11): 1579-1585, 2023 10.
Article in English | MEDLINE | ID: mdl-37095253

ABSTRACT

The reoccurrence of use (relapse) and treatment dropout is frequently observed in substance use disorder (SUD) treatment. In the current paper, we evaluated the predictive capability of an AI-based digital phenotype using the social media language of patients receiving treatment for substance use disorders (N = 269). We found that language phenotypes outperformed a standard intake psychometric assessment scale when predicting patients' 90-day treatment outcomes. We also use a modern deep learning-based AI model, Bidirectional Encoder Representations from Transformers (BERT) to generate risk scores using pre-treatment digital phenotype and intake clinic data to predict dropout probabilities. Nearly all individuals labeled as low-risk remained in treatment while those identified as high-risk dropped out (risk score for dropout AUC = 0.81; p < 0.001). The current study suggests the possibility of utilizing social media digital phenotypes as a new tool for intake risk assessment to identify individuals most at risk of treatment dropout and relapse.


Subject(s)
Behavior, Addictive , Social Media , Substance-Related Disorders , Humans , Behavior, Addictive/therapy , Substance-Related Disorders/therapy , Patient Dropouts , Risk Factors
13.
Alcohol Alcohol ; 58(4): 393-403, 2023 Jul 10.
Article in English | MEDLINE | ID: mdl-37097736

ABSTRACT

This study aimed to examine differences in mental health and alcohol use outcomes across distinct patterns of work, home, and social life disruptions associated with the COVID-19 pandemic. Data from 2093 adult participants were collected from September 2020 to April 2021 as a part of a larger study examining the impacts of the COVID-19 pandemic on substance use. Participants provided data on COVID-19 pandemic experiences, mental health outcomes, media consumption, and alcohol use at baseline. Alcohol use difficulties, including problems related to the use, desire to use alcohol, failure to cut down on alcohol use, and family/friend concern with alcohol use, were measured at 60-day follow-up. Factor mixture modeling followed by group comparisons, multiple linear regressions, and multiple logistic regressions was conducted. A four-profile model was selected. Results indicated that profile membership predicted differences in mental health and alcohol use outcomes above and beyond demographics. Individuals experiencing the most disruption reported the strongest daily impact of COVID-19 and significantly high levels of depression, anxiety, loneliness, overwhelm, alcohol use at baseline, and alcohol use difficulties measured at 60-day follow-up. The findings highlight the need for integrated mental health and/or alcohol services and social services targeting work, home, and social life during public health emergencies in order to respond effectively and comprehensively to the needs of those requiring different types of support.


Subject(s)
COVID-19 , Mental Health , Adult , Humans , Pandemics , COVID-19/epidemiology , Alcohol Drinking/epidemiology , Anxiety/epidemiology , Ethanol
14.
NPJ Digit Med ; 6(1): 35, 2023 Mar 08.
Article in English | MEDLINE | ID: mdl-36882633

ABSTRACT

Targeting of location-specific aid for the U.S. opioid epidemic is difficult due to our inability to accurately predict changes in opioid mortality across heterogeneous communities. AI-based language analyses, having recently shown promise in cross-sectional (between-community) well-being assessments, may offer a way to more accurately longitudinally predict community-level overdose mortality. Here, we develop and evaluate TROP (Transformer for Opioid Prediction), a model for community-specific trend projection that uses community-specific social media language along with past opioid-related mortality data to predict future changes in opioid-related deaths. TROP builds on recent advances in sequence modeling, namely transformer networks, to use changes in yearly language on Twitter and past mortality to project the following year's mortality rates by county. Trained over five years and evaluated over the next two years, TROP demonstrated state-of-the-art accuracy in predicting future county-specific opioid trends. A model built using linear auto-regression and traditional socioeconomic data gave 7% error (MAPE) or within 2.93 deaths per 100,000 people on average; our proposed architecture was able to forecast yearly death rates with less than half that error: 3% MAPE and within 1.15 per 100,000 people.
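
The two kinds of error quoted in the abstract above are a mean absolute percentage error (MAPE) and an average absolute error in deaths per 100,000. A minimal sketch of how such figures are typically computed, assuming the standard textbook definitions (the abstract does not state its exact formula); the county rates below are made up for illustration and are not data from the study:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero (the ratio is undefined otherwise).
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same units as the inputs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical yearly death rates per 100,000 for three counties.
actual_rates = [10.0, 20.0, 40.0]
predicted_rates = [11.0, 19.0, 38.0]

mape_percent = mape(actual_rates, predicted_rates)
mae_per_100k = mean_absolute_error(actual_rates, predicted_rates)
```

MAPE is scale-free, so it is comparable across counties with very different baseline rates, while the absolute error stays in deaths per 100,000; reporting both, as the abstract does, covers relative and absolute accuracy.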

15.
Health Place ; 80: 102997, 2023 03.
Article in English | MEDLINE | ID: mdl-36867991

ABSTRACT

Extensive evidence demonstrates the effects of area-based disadvantage on a variety of life outcomes, such as increased mortality and low economic mobility. Despite these well-established patterns, disadvantage, often measured using composite indices, is inconsistently operationalized across studies. To address this issue, we systematically compared 5 U.S. disadvantage indices at the county-level on their relationships to 24 diverse life outcomes related to mortality, physical health, mental health, subjective well-being, and social capital from heterogeneous data sources. We further examined which domains of disadvantage are most important when creating these indices. Of the five indices examined, the Area Deprivation Index (ADI) and Child Opportunity Index 2.0 (COI) were most related to a diverse set of life outcomes, particularly physical health. Within each index, variables from the domains of education and employment were most important in relationships with life outcomes. Disadvantage indices are being used in real-world policy and resource allocation decisions; an index's generalizability across diverse life outcomes, and the domains of disadvantage which constitute the index, should be considered when guiding such decisions.


Subject(s)
Employment , Mental Health , Child , Humans , United States , Educational Status
16.
Am J Drug Alcohol Abuse ; 49(4): 371-380, 2023 07 04.
Article in English | MEDLINE | ID: mdl-36995266

ABSTRACT

Dehumanization, the perception or treatment of people as subhuman, has been recognized as "endemic" in medicine and contributes to the stigmatization of people who use illegal drugs, in particular. As a result of dehumanization, people who use drugs are subject to systematically biased policies, long-lasting stigma, and suboptimal healthcare. One major contributor to the public opinion of drugs and people who use them is the media, whose coverage of these topics consistently uses negative imagery and language. This narrative review of the literature and American media on the dehumanization of illegal drugs and the people who use them provides a perspective on the components of dehumanization in each case and explores the consequences of dehumanization on health, law, and society. Drawing from language and images from American news outlets, anti-drug campaigns, and academic research, we recommend a shift away from the disingenuous trope of people who use drugs as poor, uneducated, and most likely of color. To this end, positive media portrayals and the humanization of people who use drugs can help form a common identity, engender empathy, and ultimately improve health outcomes.


Subject(s)
Public Opinion , Social Stigma , Humans , United States , Dehumanization
18.
Soc Sci Med ; 317: 115599, 2023 01.
Article in English | MEDLINE | ID: mdl-36525785

ABSTRACT

OBJECTIVE: Black, Asian, and Hispanic/Latino people are disproportionately impacted by the COVID-19 pandemic and were more likely to experience coronavirus-related racial discrimination. This study examined the association between pandemic-related stressors, including employment and housing disruptions, coronavirus-related victimization distress, and perceptions of pandemic-associated increase in societal racial biases, and substance use disorder (SUD) risk among Asian, Black, Hispanic/Latino, and non-Hispanic White adults in the U.S. METHODS: Data were collected as part of a larger national survey on substance use during the pandemic. Eligible participants for the current study were 1336 adults who self-identified as Asian (8.53%), Black (10.55%), Hispanic/Latino (10.93%), and non-Hispanic White (69.99%). Measures included demographic and COVID-19-related employment, housing, and health items, the coronavirus victimization distress scale (CVD), the coronavirus racial bias scale (CRB), and measures of substance use risk. RESULTS: Across racial/ethnic groups, employment disruption distress and housing disruption due to the pandemic were associated with SUD risk. Binary logistic regression analyses controlling for demographic variables indicated CVD was associated with higher odds of tobacco use risk (AOR = 1.36, 95% CI [1.01, 1.81]) and polysubstance use risk (AOR = 1.87, 95% CI [1.14, 3.06]), yet CRB was unrelated to any SUDs. Logistic regressions for each racial/ethnic group found different patterns of relationships between stressors and risk for SUDs. CONCLUSIONS: Results highlight the significance of examining how the current pandemic has exacerbated racial/ethnic systemic inequalities through COVID-19-related victimization. The data also suggest that across all racial/ethnic groups employment and housing disruptions and perceptions of pandemic-instigated increases in societal racial bias are risk factors for SUD. The study calls for further empirical research on substance use prevention and intervention practice sensitive to specific needs of diverse populations during the current and future health crises.


Subject(s)
COVID-19 , Cardiovascular Diseases , Substance-Related Disorders , Adult , Humans , United States/epidemiology , Ethnicity , Hispanic or Latino , Pandemics , Social Determinants of Health , COVID-19/epidemiology , Substance-Related Disorders/epidemiology
19.
AJPM Focus ; 2(1): 100062, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36573174

ABSTRACT

Introduction: Although surveys are a well-established instrument to capture the population prevalence of mental health at a moment in time, public Twitter is a continuously available data source that can provide a broader window into population mental health. We characterized the relationship between COVID-19 case counts, stay-at-home orders because of COVID-19, and anxiety and depression in 7 major U.S. cities utilizing Twitter data. Methods: We collected 18 million Tweets from January to September 2019 (baseline) and 2020 from 7 U.S. cities with large populations and varied COVID-19 response protocols: Atlanta, Chicago, Houston, Los Angeles, Miami, New York, and Phoenix. We applied machine learning‒based language prediction models for depression and anxiety validated in previous work with Twitter data. As an alternative public big data source, we explored Google Trends data using search query frequencies. A qualitative evaluation of trends is presented. Results: Twitter depression and anxiety scores were consistently elevated above their 2019 baselines across all the 7 locations. Twitter depression scores increased during the early phase of the pandemic, with a peak in early summer and a subsequent decline in late summer. The pattern of depression trends was aligned with national COVID-19 case trends rather than with trends in individual states. Anxiety was consistently and steadily elevated throughout the pandemic. Google search trends data showed noisy and inconsistent results. Conclusions: Our study shows the feasibility of using Twitter to capture trends of depression and anxiety during the COVID-19 public health crisis and suggests that social media data can supplement survey data to monitor long-term mental health trends.

20.
Proc Int AAAI Conf Weblogs Soc Media ; 16(1): 228-240, 2022 May 31.
Article in English | MEDLINE | ID: mdl-36467573

ABSTRACT

Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population - a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. In this two-part study, we first evaluate standard, "out-of-the-box" restratification techniques, finding they provide no improvement and often even degrade prediction accuracies across four tasks of estimating U.S. county population health statistics from Twitter. The core reasons for degraded performance seem to be tied to their reliance on either sparse or shrunken estimates of each population's socio-demographics. In the second part of our study, we develop and evaluate Robust Poststratification, which consists of three methods to address these problems: (1) estimator redistribution to account for shrinking, as well as (2) adaptive binning and (3) informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods leads to significant improvement in prediction accuracies over the standard restratification approaches. Taken together, Robust Poststratification enables state-of-the-art prediction accuracies, yielding a 53.0% increase in variance explained (R²) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
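
The reweighting idea described in the abstract above (observations weighted by how under- or over-sampled their socio-demographic group is) can be sketched as follows. This is only the standard, naive form of post-stratification, not the authors' Robust Poststratification; the function names, group labels, population shares, and scores are all hypothetical:

```python
def poststratification_weights(sample_groups, population_share):
    """Weight each observation by population share / sample share of its group.

    `sample_groups` holds one group label per observation; `population_share`
    maps each label to its known share of the target population.
    """
    n = len(sample_groups)
    counts = {}
    for group in sample_groups:
        counts[group] = counts.get(group, 0) + 1
    # Under-sampled groups get weights above 1, over-sampled groups below 1.
    return [population_share[g] / (counts[g] / n) for g in sample_groups]

def weighted_mean(values, weights):
    """Mean of `values` with each entry scaled by its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical county sample: younger users are over-represented (75% vs. 50%).
groups = ["young", "young", "young", "old"]
scores = [1.0, 1.0, 1.0, 5.0]
shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, shares)
raw_estimate = sum(scores) / len(scores)
adjusted_estimate = weighted_mean(scores, weights)
```

With a single observation in the "old" group, the adjusted estimate leans heavily on one data point, which illustrates the sparsity problem the abstract says adaptive binning and informed smoothing are meant to handle.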
