ABSTRACT
Vaccine refusal can lead to renewed outbreaks of previously eliminated diseases and even delay global eradication. Vaccinating decisions exemplify a complex, coupled system where vaccinating behavior and disease dynamics influence one another. Such systems often exhibit critical phenomena, that is, special dynamics close to a tipping point leading to a new dynamical regime. For instance, critical slowing down (declining rate of recovery from small perturbations) may emerge as a tipping point is approached. Here, we collected and geocoded tweets about measles-mumps-rubella vaccine and classified their sentiment using machine-learning algorithms. We also extracted data on measles-related Google searches. We find critical slowing down in the data at the level of California and the United States in the years before and after the 2014-2015 Disneyland, California measles outbreak. Critical slowing down starts growing appreciably several years before the Disneyland outbreak as vaccine uptake declines and the population approaches the tipping point. However, due to the adaptive nature of coupled behavior-disease systems, the population responds to the outbreak by moving away from the tipping point, causing "critical speeding up" whereby resilience to perturbations increases. A mathematical model of measles transmission and vaccine sentiment predicts the same qualitative patterns in the neighborhood of a tipping point to greatly reduced vaccine uptake and large epidemics. These results support the hypothesis that population vaccinating behavior near the disease elimination threshold is a critical phenomenon. Developing new analytical tools to detect these patterns in digital social data might help us identify populations at heightened risk of widespread vaccine refusal.
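As an illustration of how critical slowing down can show up in a time series, the following minimal Python sketch compares the lag-1 autocorrelation of two simulated series. Rising lag-1 autocorrelation is one commonly used indicator of slower recovery from perturbations; the abstract does not state which indicators the study used, so the choice of indicator, the AR(1) toy process, and all function names here are assumptions made for illustration only.

```python
import random


def lag1_autocorrelation(series):
    """Return the lag-1 autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    num = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return num / denom


def simulate_ar1(phi, n, seed):
    """Toy process x[t] = phi * x[t-1] + noise.

    A phi close to 1 means perturbations decay slowly (slow recovery);
    a phi close to 0 means they decay quickly (fast recovery).
    """
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


# Hypothetical comparison: same noise, different recovery rates.
fast_recovery = simulate_ar1(phi=0.2, n=2000, seed=1)
slow_recovery = simulate_ar1(phi=0.9, n=2000, seed=1)
ac_fast = lag1_autocorrelation(fast_recovery)
ac_slow = lag1_autocorrelation(slow_recovery)
print(f"lag-1 autocorrelation, fast recovery: {ac_fast:.2f}")
print(f"lag-1 autocorrelation, slow recovery: {ac_slow:.2f}")
```

In practice such an indicator would be computed in a rolling window over the observed series (for example, weekly sentiment or search volume), so that a rise or fall over time can be seen.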
Subject(s)
Databases, Factual , Machine Learning , Mass Vaccination , Measles-Mumps-Rubella Vaccine/administration & dosage , Social Media , California , Female , Humans , Male
ABSTRACT
BACKGROUND: The discovery of the CRISPR-Cas9-based gene editing method has opened unprecedented new potential for biological and medical engineering, sparking a growing public debate on both the potential and dangers of CRISPR applications. Given the speed of technology development and the almost instantaneous global spread of news, it is important to follow evolving debates without much delay and in sufficient detail, as certain events may have a major long-term impact on public opinion and later influence policy decisions. OBJECTIVE: Social media networks such as Twitter have been shown to be major drivers of news dissemination and public discourse. They provide a vast amount of semistructured data in almost real time and give direct access to the content of the conversations. We can now mine and analyze such data quickly because of recent developments in machine learning and natural language processing. METHODS: Here, we used Bidirectional Encoder Representations from Transformers (BERT), an attention-based transformer model, in combination with statistical methods to analyze all tweets published on CRISPR since the publication of the first gene editing application in 2013. RESULTS: We show that the mean sentiment of tweets was initially very positive, but began to decrease over time, and that this decline was driven by rare peaks of strong negative sentiments. Due to the high temporal resolution of the data, we were able to associate these peaks with specific events and to observe how trending topics changed over time. CONCLUSIONS: Overall, this type of analysis can provide valuable and complementary insights into ongoing public debates, extending the traditional empirical bioethics toolset.
Subject(s)
CRISPR-Cas Systems/physiology , Crowdsourcing/methods , Deep Learning/standards , Public Opinion , Humans
ABSTRACT
The digital revolution has contributed to very large data sets (i.e., big data) relevant for public health. The two major data sources are electronic health records from traditional health systems and patient-generated data. As the two data sources have complementary strengths (high veracity in the data from traditional sources and high velocity and variety in patient-generated data), they can be combined to build more robust public health systems. However, they also have unique challenges. Patient-generated data in particular are often completely unstructured and highly context dependent, posing essentially a machine-learning challenge. Some recent examples from infectious disease surveillance and adverse drug event monitoring demonstrate that the technical challenges can be solved. Despite these advances, the problem of verification remains, and unless traditional and digital epidemiologic approaches are combined, these data sources will be constrained by their intrinsic limits.
Subject(s)
Data Collection/methods , Epidemiological Monitoring , Information Storage and Retrieval/methods , Pharmacovigilance , Humans
ABSTRACT
BACKGROUND: Contact surveys and diaries have conventionally been used to measure contact networks in different settings for elucidating infectious disease transmission dynamics of respiratory infections. More recently, technological advances have permitted the use of wireless sensor devices, which can be worn by individuals interacting in a particular social context to record high-resolution mixing patterns. To date, a direct comparison of these two different methods for collecting contact data has not been performed. METHODS: We studied the contact network at a United States high school in the spring of 2012. All school members (i.e., students, teachers, and other staff) were invited to wear wireless sensor devices for a single school day, and asked to remember and report the name and duration of all of their close proximity conversational contacts for that day in an online contact survey. We compared the two methods in terms of the resulting network densities, nodal degrees, and degree distributions. We also assessed the correspondence between the methods at the dyadic and individual levels. RESULTS: We found limited congruence in recorded contact data between the online contact survey and wireless sensors. In particular, there was only negligible correlation between the two methods for nodal degree, and the degree distribution differed substantially between the two methods. We found that survey underreporting was a significant source of the difference between the two methods, and that this difference could be reduced by excluding individuals who reported only a few contact partners. Additionally, survey reporting was more accurate for contacts of longer duration, and very inaccurate for contacts of shorter duration. Finally, female participants tended to report more accurately than male participants. CONCLUSIONS: Online contact surveys and wireless sensor devices collected incongruent network data from an identical setting. This finding suggests that these two methods cannot be used interchangeably for informing models of infectious disease dynamics.
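The comparison described in the abstract above (network density, nodal degree, and dyad-level correspondence) can be sketched in a few lines of Python. This is a minimal illustration on made-up data, not the study's analysis: the four-person example networks, the function names, and the use of the Jaccard index as a dyadic agreement measure are all assumptions.

```python
def normalize(edges):
    """Treat contacts as undirected: store each pair as a frozenset."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


def density(nodes, edges):
    """Fraction of all possible undirected pairs that are in contact."""
    n = len(nodes)
    possible = n * (n - 1) / 2
    return len(edges) / possible if possible else 0.0


def degrees(nodes, edges):
    """Number of distinct contact partners per node."""
    deg = {v: 0 for v in nodes}
    for e in edges:
        for v in e:
            deg[v] += 1
    return deg


def dyad_agreement(edges_a, edges_b):
    """Jaccard index: dyads recorded by both methods / dyads recorded by either."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


# Hypothetical data: the same four people measured with two methods.
nodes = ["A", "B", "C", "D"]
sensor = normalize([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")])
survey = normalize([("A", "B"), ("C", "D")])

print("density (sensor):", round(density(nodes, sensor), 3))
print("density (survey):", round(density(nodes, survey), 3))
print("degrees (sensor):", degrees(nodes, sensor))
print("degrees (survey):", degrees(nodes, survey))
print("dyad agreement:", dyad_agreement(sensor, survey))
```

In this toy example the survey network is much sparser than the sensor network, which is the pattern one would expect if surveys underreport contacts.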
Subject(s)
Contact Tracing/instrumentation , Contact Tracing/methods , Data Collection/methods , Models, Statistical , Social Behavior , Wireless Technology , Data Collection/instrumentation , Faculty , Female , Humans , Internet , Male , Medical Records , Respiratory Tract Infections/transmission , Schools , Social Environment , Students , Telemetry , United States
ABSTRACT
OBJECTIVES: The role of social media as a source of timely and massive information has become more apparent since the era of Web 2.0. Multiple studies have illustrated the use of information in social media to discover biomedical and health-related knowledge. Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words. These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data, where sparsity and noise are the norm. This paper aims to address the limitations of the traditional bag-of-words-based methods and proposes using heterogeneous features in combination with ensemble machine learning techniques to discover health-related information, which could prove useful to multiple biomedical applications, especially those needing to discover health-related knowledge in large-scale social media data. Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data. METHODOLOGY: Social media data is characterized by an abundance of short social-oriented messages that do not conform to standard languages, both grammatically and syntactically. The problem of discovering health-related knowledge in social media data streams is then transformed into a text classification problem, where a text is identified as positive if it is health-related and negative otherwise. We first identify the limitations of the traditional methods, which train machines with N-gram word features, then propose to overcome such limitations by utilizing the collaboration of machine-learning-based classifiers, each of which is trained to learn a semantically different aspect of the data. The parameter analysis for tuning each classifier is also reported. DATA SETS: Three data sets are used in this research. The first data set comprises approximately 5000 hand-labeled tweets, and is used for cross validation of the classification models in the small-scale experiment, and for training the classifiers in the real-world large-scale experiment. The second data set is a random sample of real-world Twitter data in the US. The third data set is a random sample of real-world Facebook Timeline posts. EVALUATIONS: Two sets of evaluations are conducted to investigate the proposed model's ability to discover health-related information in the social media domain: small-scale and large-scale evaluations. The small-scale evaluation employs 10-fold cross validation on the labeled data, and aims to tune parameters of the proposed models and to compare with the state-of-the-art method. The large-scale evaluation tests the trained classification models on the native, real-world data sets, and is needed to verify the ability of the proposed model to handle the massive heterogeneity in real-world social media. FINDINGS: The small-scale experiment reveals that the proposed method is able to mitigate the limitations of the well-established techniques in the literature, resulting in a performance improvement of 18.61% (F-measure). The large-scale experiment further reveals that the baseline fails to perform well on larger data with higher degrees of heterogeneity, while the proposed method is able to yield reasonably good performance and outperform the baseline by 46.62% (F-measure) on average.
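To make the two technical ingredients of the abstract above concrete (combining several classifiers and scoring with the F-measure), here is a minimal Python sketch. The abstract does not say how the individual classifiers are combined, so simple majority voting is an assumption, and the labels and predictions are invented toy data rather than results from the paper.

```python
def majority_vote(predictions_per_classifier):
    """Combine 0/1 predictions from several classifiers by simple majority.

    Assumption: majority voting stands in for whatever combination rule
    the original work used.
    """
    n_classifiers = len(predictions_per_classifier)
    n_items = len(predictions_per_classifier[0])
    combined = []
    for i in range(n_items):
        votes = sum(preds[i] for preds in predictions_per_classifier)
        combined.append(1 if 2 * votes > n_classifiers else 0)
    return combined


def f_measure(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = health-related, 0 = not health-related.
y_true = [1, 1, 1, 0, 0, 0]
clf_a = [1, 0, 1, 0, 1, 0]  # hypothetical classifier trained on one feature type
clf_b = [1, 1, 0, 0, 0, 0]  # hypothetical classifier trained on another
clf_c = [0, 1, 1, 1, 0, 0]  # hypothetical classifier trained on a third

ensemble = majority_vote([clf_a, clf_b, clf_c])
print("ensemble predictions:", ensemble)
print("F-measure, classifier A:", round(f_measure(y_true, clf_a), 3))
print("F-measure, ensemble:", round(f_measure(y_true, ensemble), 3))
```

Each toy classifier makes different mistakes, so the vote corrects them; real classifiers with correlated errors would benefit less.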
Subject(s)
Health Education , Information Services , Social Media , Algorithms , Artificial Intelligence
ABSTRACT
Online public health discourse is becoming more and more important in shaping public health dynamics. Large Language Models (LLMs) offer a scalable solution for analysing the vast amounts of unstructured text found on online platforms. Here, we explore the effectiveness of LLMs, including GPT models and open-source alternatives, for extracting public stances towards vaccination from social media posts. Using an expert-annotated dataset of social media posts related to vaccination, we applied various LLMs and a rule-based sentiment analysis tool to classify the stance towards vaccination. We assessed the accuracy of these methods through comparisons with expert annotations and annotations obtained through crowdsourcing. Our results demonstrate that few-shot prompting of best-in-class LLMs is the best-performing method, and that all alternatives have significant risks of substantial misclassification. The study highlights the potential of LLMs as a scalable tool for public health professionals to quickly gauge public opinion on health policies and interventions, offering an efficient alternative to traditional data analysis methods. With the continuous advancement in LLM development, the integration of these models into public health surveillance systems could substantially improve our ability to monitor and respond to changing public health attitudes.
ABSTRACT
BACKGROUND: Infectious disease outbreaks in communities can be controlled by early detection and effective prevention measures. Assessing the relative importance of each individual community member with respect to these two processes requires detailed knowledge about the underlying social contact network on which the disease can spread. However, mapping social contact networks is typically too resource-intensive to be a practical possibility for most communities and institutions. METHODS: Here, we describe a simple, low-cost method, called collocation ranking, to assess individual importance for early detection and targeted intervention strategies that are easily implementable in practice. The method is based on knowledge about individual collocation, which is readily available in many community settings such as schools, offices, hospitals, and so on. We computationally validate our method in a school setting by comparing the outcome of the method against computational predictions based on outbreak simulations on an empirical high-resolution contact network. We compare collocation ranking to other methods for assessing the epidemiological importance of the members of a population. To this end, we select subpopulations of the school population by applying these assessment methods to the population and adding individuals to the subpopulation according to their individual rank. Then, we assess how suited these subpopulations are for early detection and targeted intervention strategies. RESULTS: Likelihood and timing of infection during an outbreak are important features for early detection and targeted intervention strategies. Subpopulations selected by the collocation ranking method show a substantially higher average infection probability and an earlier onset of symptoms than randomly selected subpopulations. Furthermore, these subpopulations selected by the collocation ranking method were close to the optimum. CONCLUSIONS: Our results indicate that collocation ranking is a highly effective method to assess individual importance, providing critical low-cost information for the development of sentinel surveillance systems and prevention strategies.
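The abstract above does not give the scoring rule behind collocation ranking, so the following Python sketch shows only one plausible reading of the idea: rank individuals by how many other people they share rooms or classes with, summed over all their collocations. The scoring rule, the roster data, and the function names are assumptions for illustration and should not be taken as the paper's definition.

```python
def collocation_scores(rosters):
    """Score each person by the number of co-located others, summed over groups.

    `rosters` maps a group name (e.g. a class) to the list of its members.
    Assumed rule: each group contributes (group size - 1) to every member.
    """
    scores = {}
    for members in rosters.values():
        unique_members = set(members)
        others = len(unique_members) - 1
        for person in unique_members:
            scores[person] = scores.get(person, 0) + others
    return scores


def rank_by_collocation(rosters):
    """Return people ordered from highest to lowest score (ties alphabetical)."""
    scores = collocation_scores(rosters)
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical school schedule data.
rosters = {
    "math": ["ana", "ben", "cy", "dee"],
    "art": ["ana", "eli"],
    "gym": ["ben", "cy", "dee", "eli", "ana"],
}

scores = collocation_scores(rosters)
ranking = rank_by_collocation(rosters)
print("scores:", scores)
print("ranking:", ranking)
print("top-2 candidate sentinels:", ranking[:2])
```

The appeal of such a rule is that it needs only schedule or roster information, which many institutions already hold, rather than a measured contact network.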
Subject(s)
Communicable Disease Control/economics , Communicable Disease Control/methods , Communicable Diseases/diagnosis , Communicable Diseases/transmission , Disease Outbreaks/prevention & control , Epidemiologic Methods , Adolescent , Adult , Early Diagnosis , Female , Humans , Male , Middle Aged , Young Adult
ABSTRACT
Mobile, social, real-time: the ongoing revolution in the way people communicate has given rise to a new kind of epidemiology. Digital data sources, when harnessed appropriately, can provide local and timely information about disease and health dynamics in populations around the world. The rapid, unprecedented increase in the availability of relevant data from various digital sources creates considerable technical and computational challenges.
Subject(s)
Computational Biology/methods , Epidemiologic Methods , Internet , Software , Algorithms , Cell Phone , Data Mining , Databases, Factual , Humans
ABSTRACT
Highly lethal pathogens (e.g., hantaviruses, Hendra virus, anthrax, or plague) pose unique public-health problems, because they seem to periodically flare into outbreaks before disappearing into long quiescent phases. A key element to their possible control and eradication is being able to understand where they persist in the latent phase and how to identify the conditions that result in sporadic epidemics or epizootics. In American grasslands, plague, caused by Yersinia pestis, exemplifies this quiescent-outbreak pattern, because it sporadically erupts in epizootics that decimate prairie dog (Cynomys ludovicianus) colonies, yet the causes of outbreaks and mechanisms for interepizootic persistence of this disease are poorly understood. Using field data on prairie community ecology, flea behavior, and plague-transmission biology, we find that plague can persist in prairie dog colonies for prolonged periods, because host movement is highly spatially constrained. The abundance of an alternate host for disease vectors, the grasshopper mouse (Onychomys leucogaster), drives plague outbreaks by increasing the connectivity of the prairie dog hosts and thereby permitting percolation of the disease throughout the primary host population. These results offer an alternative perspective on plague's ecology (i.e., disease transmission exacerbated by alternative hosts) and may have ramifications for plague dynamics in Asia and Africa, where a single main host has traditionally been considered to drive Yersinia ecology. Furthermore, abundance thresholds of alternate hosts may be a key phenomenon determining outbreaks of disease in many multihost-disease systems.
Subject(s)
Disease Outbreaks , Plague/transmission , Sciuridae/microbiology , Yersinia pestis , Africa , Animal Migration , Animals , Asia , Mice , Population Dynamics , Siphonaptera
ABSTRACT
The most frequent infectious diseases in humans, and those with the highest potential for rapid pandemic spread, are usually transmitted via droplets during close proximity interactions (CPIs). Despite the importance of this transmission route, very little is known about the dynamic patterns of CPIs. Using wireless sensor network technology, we obtained high-resolution data of CPIs during a typical day at an American high school, permitting the reconstruction of the social network relevant for infectious disease transmission. At 94% coverage, we collected 762,868 CPIs at a maximal distance of 3 m among 788 individuals. The data revealed a high-density network with typical small-world properties and a relatively homogeneous distribution of both interaction time and interaction partners among subjects. Computer simulations of the spread of an influenza-like disease on the weighted contact graph are in good agreement with absentee data during the most recent influenza season. Analysis of targeted immunization strategies suggested that contact network data are required to design strategies that are significantly more effective than random immunization. Immunization strategies based on contact network data were most effective at high vaccination coverage.
Subject(s)
Communicable Diseases/transmission , Computer Simulation , Disease Transmission, Infectious , Influenza, Human/transmission , Models, Biological , Communicable Disease Control , Communicable Diseases/epidemiology , Female , Humans , Immunization , Influenza, Human/epidemiology , Influenza, Human/prevention & control , Male , Pandemics , Schools , United States
ABSTRACT
Introduction: This study presents COVID-Twitter-BERT (CT-BERT), a transformer-based model that is pre-trained on a large corpus of COVID-19 related Twitter messages. CT-BERT is specifically designed to be used on COVID-19 content, particularly from social media, and can be utilized for various natural language processing tasks such as classification, question-answering, and chatbots. This paper aims to evaluate the performance of CT-BERT on different classification datasets and compare it with BERT-LARGE, its base model. Methods: The study utilizes CT-BERT, which is pre-trained on a large corpus of COVID-19 related Twitter messages. The authors evaluated the performance of CT-BERT on five different classification datasets, including one in the target domain. The model's performance is compared to its base model, BERT-LARGE, to measure the marginal improvement. The authors also provide detailed information on the training process and the technical specifications of the model. Results: The results indicate that CT-BERT outperforms BERT-LARGE with a marginal improvement of 10-30% on all five classification datasets. The largest improvements are observed in the target domain. The authors provide detailed performance metrics and discuss the significance of these results. Discussion: The study demonstrates the potential of pre-trained transformer models, such as CT-BERT, for COVID-19 related natural language processing tasks. The results indicate that CT-BERT can improve the classification performance on COVID-19 related content, especially on social media. These findings have important implications for various applications, such as monitoring public sentiment and developing chatbots to provide COVID-19 related information. The study also highlights the importance of using domain-specific pre-trained models for specific natural language processing tasks. Overall, this work provides a valuable contribution to the development of COVID-19 related NLP models.
ABSTRACT
Traditional contact tracing is one of the most powerful weapons people have in the battle against a pandemic, especially when vaccines do not yet exist or do not afford complete protection from infection. But the effectiveness of contact tracing hinges on its ability to find infected people quickly and obtain accurate information from them. Therefore, contact tracing inherits the challenges associated with the fallibilities of memory. Against this backdrop, digital contact tracing is the "dream scenario": an unobtrusive, vigilant, and accurate recorder of danger that should outperform manual contact tracing on every dimension. There is reason to celebrate the success of digital contact tracing. Indeed, epidemiologists report that digital contact tracing probably reduced the incidence of COVID-19 cases by at least 25% in many countries, a feat that would have been hard to match with its manual counterpart. Yet there is also reason to speculate that digital contact tracing delivered on only a fraction of its potential because it almost completely ignored the relevant psychological science. We discuss the strengths and weaknesses of digital contact tracing, its hits and misses in the COVID-19 pandemic, and its need to be integrated with the science of human behavior.
ABSTRACT
Nutrition is a key contributor to health. Recently, several studies have identified associations between factors such as microbiota composition and health-related responses to dietary intake, raising the potential of personalized nutritional recommendations. To further our understanding of personalized nutrition, detailed individual data must be collected from participants in their day-to-day lives. However, this is challenging in conventional studies that require clinical measurements and site visits. So-called digital or remote cohorts allow in situ data collection on a daily basis through mobile applications, online services, and wearable sensors, but they raise questions about study retention and data quality. "Food & You" is a personalized nutrition study implemented as a digital cohort in which participants track food intake, physical activity, gut microbiota, glycemia, and other data for two to four weeks. Here, we describe the study protocol, report on study completion rates, and describe the collected data, focusing on assessing their quality and reliability. Overall, the study collected data from over 1000 participants, including high-resolution data of nutritional intake of more than 46 million kcal collected from 315,126 dishes over 23,335 participant days, 1,470,030 blood glucose measurements, 49,110 survey responses, and 1,024 stool samples for gut microbiota analysis. Retention was high, with over 60% of the enrolled participants completing the study. Various data quality assessment efforts suggest the captured high-resolution nutritional data accurately reflect individual diet patterns, paving the way for digital cohorts as a typical study design for personalized nutrition.
ABSTRACT
BACKGROUND: Antagonistic species interactions can lead to coevolutionary genotype or phenotype frequency oscillations, with important implications for ecological and evolutionary processes. However, direct empirical evidence of such oscillations is rare. The rarity of observations is generally attributed to inherent difficulties of ecological and evolutionary long-term studies, to weak or absent interaction between species, or to the absence of negative frequency dependence. RESULTS: Here, we show that another factor, non-genetic inheritance (mediated, for example, by epigenetic mechanisms), can completely eliminate oscillations in the presence of such negative frequency dependence, even if only a small fraction of offspring are affected. We analytically derive the threshold value of this fraction at which the dynamics change from oscillatory to stable, and investigate how selection, mutation, and generation-time differences between the two species affect the threshold value. These results strongly suggest that the lack of phenotype frequency oscillations should not be attributed to the lack of strong interactions between antagonistic species. CONCLUSIONS: Given increasing evidence of non-genetic effects on the outcomes of antagonistic species interactions, we suggest that these effects should be incorporated into ecological and evolutionary models of interacting species.
Subject(s)
Biological Evolution , Host-Parasite Interactions , Models, Genetic , Animals , Epigenesis, Genetic , Mutation
ABSTRACT
There is great interest in the dynamics of health behaviors in social networks and how they affect collective public health outcomes, but measuring population health behaviors over time and space requires substantial resources. Here, we use publicly available data from 101,853 users of online social media collected over a time period of almost six months to measure the spatio-temporal sentiment towards a new vaccine. We validated our approach by identifying a strong correlation between sentiments expressed online and CDC-estimated vaccination rates by region. Analysis of the network of opinionated users showed that information flows more often between users who share the same sentiments (and less often between users who do not share the same sentiments) than expected by chance alone. We also found that most communities are dominated by either positive or negative sentiments towards the novel vaccine. Simulations of infectious disease transmission show that if clusters of negative vaccine sentiments lead to clusters of unprotected individuals, the likelihood of disease outbreaks is greatly increased. Online social media provide unprecedented access to data allowing for inexpensive and efficient tools to identify target areas for intervention efforts and to evaluate their effectiveness.
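The claim that information flows between like-minded users more often than expected by chance can be illustrated with a small permutation test. The Python sketch below is an assumption-laden toy: the six-user network, the sentiment labels, and the label-shuffling null model are all invented for illustration and are not the study's data or necessarily its statistical procedure.

```python
import random


def same_sentiment_fraction(edges, sentiment):
    """Fraction of links whose two endpoints hold the same sentiment."""
    same = sum(1 for u, v in edges if sentiment[u] == sentiment[v])
    return same / len(edges)


def expected_by_chance(edges, sentiment, n_shuffles=1000, seed=0):
    """Average same-sentiment fraction after randomly reassigning labels.

    Shuffling keeps the network and the number of users per sentiment
    fixed, but breaks any link between position in the network and sentiment.
    """
    rng = random.Random(seed)
    users = list(sentiment)
    labels = [sentiment[u] for u in users]
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        shuffled = dict(zip(users, labels))
        total += same_sentiment_fraction(edges, shuffled)
    return total / n_shuffles


# Hypothetical network: two tight clusters joined by a single link.
sentiment = {
    "u1": "pos", "u2": "pos", "u3": "pos",
    "u4": "neg", "u5": "neg", "u6": "neg",
}
edges = [
    ("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
    ("u4", "u5"), ("u5", "u6"), ("u4", "u6"),
    ("u3", "u4"),
]

observed = same_sentiment_fraction(edges, sentiment)
chance = expected_by_chance(edges, sentiment)
print(f"observed same-sentiment fraction: {observed:.2f}")
print(f"expected by chance:               {chance:.2f}")
```

An observed fraction well above the shuffled baseline indicates clustering of sentiment in the network, which is the condition under which the abstract's simulations predict a higher outbreak likelihood.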
Subject(s)
Attitude to Health , Disease Outbreaks , Immunization/psychology , Social Media , Centers for Disease Control and Prevention, U.S. , Humans , Immunization/statistics & numerical data , United States
ABSTRACT
Introduction: Making epidemiological indicators for COVID-19 publicly available through websites and social media can support public health experts in the near-real-time monitoring of the situation worldwide, and in the establishment of rapid response and public health measures to reduce the consequences of the pandemic. Little is known, however, about the timeliness of such sources. Here, we assess the timeliness of official public COVID-19 sources for the WHO regions of Europe and Africa. Methods: We monitored official websites and social media accounts for updates and calculated the time difference between daily updates on COVID-19 cases. We covered a time period of 52 days and a geographic range of 62 countries, 28 from the WHO African region and 34 from the WHO European region. Results: The most prevalent categories were social media updates only (no website reporting) in the WHO African region (32.7% of the 1,092 entries), and updates in both social media and websites in the WHO European region (51.9% of the 884 entries for EU/EEA countries, and 73.3% of the 884 entries for non-EU/EEA countries), showing a clear overall tendency to use social media as an official source to report on COVID-19 indicators. We further show that the time differences for each source group and geographical region were statistically significant in all WHO regions, indicating a tendency to focus on one of the two sources instead of using both as complementary sources. Discussion: Public health communication via social media platforms has numerous benefits, but it is worthwhile to do it in combination with other, more traditional means of communication, such as websites or offline communication.
Subject(s)
COVID-19 , Humans , COVID-19/epidemiology , SARS-CoV-2 , Pandemics , Communication , Europe/epidemiology
ABSTRACT
Introduction: Online social media have been both a field of research and a source of data for research since the beginning of the COVID-19 pandemic. In this study, we aimed to determine how and whether the content of tweets by Twitter users reporting SARS-CoV-2 infections changed over time. Methods: We built a regular expression to detect users reporting being infected, and we applied several Natural Language Processing methods to assess the emotions, topics, and self-reports of symptoms present in the timelines of the users. Results: Twelve thousand one hundred and twenty-one Twitter users matched the regular expression and were considered in the study. We found that the proportions of health-related, symptom-containing, and emotionally non-neutral tweets increased after users had reported their SARS-CoV-2 infection on Twitter. Our results also show that the number of weeks accounting for the increased proportion of symptoms was consistent with the duration of the symptoms in clinically confirmed COVID-19 cases. Furthermore, we observed a high temporal correlation between self-reports of SARS-CoV-2 infection and officially reported cases of the disease in the largest English-speaking countries. Discussion: This study confirms that automated methods can be used to find digital users publicly sharing information about their health status on social media, and that the associated data analysis may supplement clinical assessments made in the early phases of the spread of emerging diseases. Such automated methods may prove particularly useful for newly emerging health conditions that are not rapidly captured in the traditional health systems, such as the long-term sequelae of SARS-CoV-2 infections.
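To give a sense of what a regular expression for detecting self-reported infections might look like, here is a minimal Python sketch. The study's actual expression is not given in the abstract, so the pattern below is entirely hypothetical: it is deliberately simple, covers only a few English first-person phrasings, and would miss many real reports.

```python
import re

# Hypothetical pattern (not the study's): a first-person subject followed
# by a small set of verbs and a name for the disease.
SELF_REPORT = re.compile(
    r"\b(?:i|we)\s+"
    r"(?:(?:have|just)\s+)?"
    r"(?:tested\s+positive\s+for|got|caught|have)\s+"
    r"(?:covid(?:-19)?|coronavirus)\b",
    re.IGNORECASE,
)


def is_self_report(text):
    """Return True if the text contains a first-person infection report."""
    return SELF_REPORT.search(text) is not None


examples = [
    "I tested positive for COVID-19 yesterday",
    "we just got covid",
    "My neighbour tested positive for covid",
    "Wear a mask to avoid covid",
]
for text in examples:
    print(is_self_report(text), "|", text)
```

Requiring a first-person subject is what separates self-reports from mentions of other people's infections or general commentary; a production pattern would also need to handle negation ("I never got covid") and hypotheticals, which this sketch does not.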
Subject(s)
COVID-19 , Social Media , Humans , COVID-19/epidemiology , SARS-CoV-2 , Pandemics , Social Behavior
ABSTRACT
The automatic recognition of food on images has numerous interesting applications, including nutritional tracking in medical cohorts. The problem has received significant research attention, but an ongoing public benchmark on non-biased (i.e., not scraped from the web) data to develop open and reproducible algorithms has been missing. Here, we report on the setup of such a benchmark using publicly available food images sourced through the mobile MyFoodRepo app used in research cohorts. Through four rounds, the benchmark released the MyFoodRepo-273 dataset comprising 24,119 images and a total of 39,325 segmented polygons categorized in 273 different classes. Models were evaluated on private test sets from the same platform with 5,000 images and 7,865 annotations in the final round. Top-performing models on the 273 food categories reached a mean average precision of 0.568 (round 4) and a mean average recall of 0.885 (round 3), and were deployed in production use of the MyFoodRepo app. We present experimental validation of round 4 results, and discuss implications of the benchmark setup designed to increase the size and diversity of the dataset for future rounds.