ABSTRACT
In recent decades, statistical linguistics has made significant strides, fueled by the availability of data. Leveraging Twitter data, this paper explores the rank diversity of English and Spanish across three scales: temporal intervals (from 3 to 96 h), spatial radii (from 3 km to over 3000 km), and grammatical scale (word ngrams, from 1-grams to 5-grams). The analysis covers a period of 1 year (2014) and eight different countries. Our findings highlight the relevance of all three scales, with the most substantial changes observed at the grammatical level. Specifically, at the monogram level, rank diversity curves are remarkably similar across languages, countries, and temporal or spatial scales. However, as the grammatical scale expands, variations in rank diversity become more pronounced and are influenced by temporal, spatial, linguistic, and national factors. Additionally, we investigate the statistical characteristics of Twitter-specific tokens, including emojis, hashtags, and user mentions, revealing a sigmoid pattern in their rank diversity function. These insights contribute to quantifying universal language statistics while also identifying potential sources of variation.
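As a rough illustration of the quantity studied above, the following minimal sketch computes a rank diversity curve for word ngrams. It assumes the definition commonly used in the rank diversity literature, which the abstract does not spell out: d(k) is the number of distinct items observed at rank k across all time slices, divided by the number of slices. The function names (`ngrams`, `rank_lists`, `rank_diversity`) and the toy token lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as tuples (assumed tokenization)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_lists(slices):
    """For each time slice (a list of items), order items by descending frequency."""
    ranked = []
    for items in slices:
        counts = Counter(items)
        # Ties are broken by the item itself so the ordering is deterministic.
        ranked.append(sorted(counts, key=lambda w: (-counts[w], w)))
    return ranked


def rank_diversity(ranked):
    """d(k): distinct items seen at rank k, divided by the number of slices.

    Only ranks present in every slice are reported (an assumption of this sketch).
    """
    n_slices = len(ranked)
    depth = min(len(r) for r in ranked)
    return [len({r[k] for r in ranked}) / n_slices for k in range(depth)]


# Toy data: three "time slices" of already tokenized text.
slices = [
    ["the", "the", "the", "cat", "cat", "sat"],
    ["the", "the", "the", "dog", "dog", "ran"],
    ["the", "the", "the", "cat", "cat", "ran"],
]
curve = rank_diversity(rank_lists(slices))
```

In this toy example the top rank is always held by the same word, so its diversity is lowest, while deeper ranks change between slices. Passing `ngrams(tokens, n)` for each slice instead of the raw tokens gives the corresponding curve at a larger grammatical scale.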
ABSTRACT
Surveillance plays a crucial role in preventing emerging infectious diseases from becoming epidemic. Where it is possible to monitor the infection status of certain people, transport hubs, or hospitals, early detection of the disease allows interventions to be implemented before most of the damage occurs, or at least allows its impact to be mitigated. This paper addresses the question of which nodes to select as sentinels in a network of individuals susceptible to an infectious disease in order to minimize the number of casualties. By simulating disease outbreaks on a collection of empirical and synthetic networks, we show that the best strategy depends on the topological characteristics of the network. For highly modular or spatially embedded networks, it is better to place the sentinels on nodes distributed across different regions. However, if the degree heterogeneity is high, a strategy that targets network hubs is preferred. We further consider the consequences of having an incomplete sample of the network and demonstrate that the value of new information diminishes as more data is collected. Finally, we find further marginal improvements using two heuristics, informed by known results in graph theory, that exploit the fragmented structure of sparse network data.
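The comparison of sentinel placements described above can be sketched on toy graphs. This is a simplified illustration, not the paper's simulation: it assumes a deterministic spreading process in which the infection reaches every neighbour at each step, and it uses the number of nodes infected by the time the first sentinel is reached as a stand-in for casualties. All function names and both example graphs are hypothetical.

```python
from collections import deque
from itertools import combinations


def from_edges(edges):
    """Build an undirected adjacency mapping from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_infected_at_detection(adj, sentinels):
    """Average, over all outbreak sources, of the nodes infected at detection time."""
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        # Detection happens when the infection first reaches any sentinel.
        t_detect = min(dist.get(s, float("inf")) for s in sentinels)
        total += sum(1 for d in dist.values() if d <= t_detect)
    return total / len(adj)


def top_degree(adj, k):
    """Hub-targeting strategy: the k highest-degree nodes (ties broken by label)."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))[:k]


# High degree heterogeneity: a star with hub 0 and five leaves.
star = from_edges([(0, i) for i in range(1, 6)])

# Modular structure: two 4-node cliques joined by a single bridge (3, 4).
modular = from_edges(
    list(combinations(range(0, 4), 2))
    + list(combinations(range(4, 8), 2))
    + [(3, 4)]
)

hub_score = mean_infected_at_detection(star, top_degree(star, 1))
leaf_score = mean_infected_at_detection(star, [1])
spread_score = mean_infected_at_detection(modular, [0, 7])      # one sentinel per module
clustered_score = mean_infected_at_detection(modular, [0, 1])   # both in one module
```

Under these assumptions, the hub sentinel gives the lower score on the star, and one sentinel per module gives a lower score than two sentinels in the same module on the two-clique graph. These toy results only illustrate the kind of comparison involved; they say nothing about stochastic outbreaks or the empirical networks studied in the paper.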