ABSTRACT
OBJECTIVE: We assessed whether machine learning can be used to efficiently extract infectious disease activity information from online media reports.
MATERIALS AND METHODS: We curated a data set of media reports (n = 8322), each labeled according to whether the article contains an update about disease activity. We trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports.
RESULTS: Our classifier achieved a recall of 88.8% and a precision of 86.1%. The overall surveillance system detected 94% of the WHO-identified outbreaks that were covered by online media (89% of WHO-identified outbreaks), and did so 43.4 days (IQR: 9.5-61) earlier on average.
DISCUSSION: We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. The system still needs to be assessed for bias, representativeness, granularity, and accuracy.
CONCLUSION: Machine learning, natural language processing, and human expertise can be used together to efficiently identify disease activity from digital media reports.
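For readers less familiar with the reported metrics, the following minimal sketch shows how recall and precision are computed on a held-out test set. The labels are toy values invented for illustration, not the study's data, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels.

    Label 1 means the article reports disease activity; 0 means it does not.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy held-out labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recall is the fraction of true disease-activity articles the classifier flags, and precision is the fraction of flagged articles that truly report disease activity.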