Integrating statistical and visual analytic methods for bot identification of health-related survey data.

Chen, Annie T; Komi, Midori; Bessler, Sierrah; Mikles, Sean P; Zhang, Yan

Affiliation

Chen AT; Department of Biomedical Informatics and Medical Education, University of Washington School of Medicine, 850 Republican St., Box 358047, Seattle, WA 98195, United States. Electronic address: atchen@uw.edu.
Komi M; University of Washington, Department of Mathematics Box 354350, Seattle, WA 98195-4350, United States.
Bessler S; University of Washington, Department of Applied Mathematics, 4182 W Stevens Way NE, Seattle, WA 98105, United States. Electronic address: sbessl@uw.edu.
Mikles SP; Lineberger Comprehensive Cancer Outcomes Program, Lineberger Comprehensive Cancer Center, UNC School of Medicine, 450 West Drive, Chapel Hill, NC 27514, United States.
Zhang Y; School of Information, The University of Texas at Austin, 1616 Guadalupe Suite #5.202, Austin, TX 78701-1213, United States. Electronic address: yanz@utexas.edu.

J Biomed Inform; 144: 104439, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37419375

ABSTRACT

OBJECTIVE:

In recent years, we have increasingly observed issues concerning quality of online information due to misinformation and disinformation. Aside from social media, there is growing awareness that questionnaire data collected using online recruitment methods may include suspect data provided by bots. Issues with data quality can be particularly problematic in health and/or biomedical contexts; thus, developing robust methods for suspect data identification and removal is of paramount importance in informatics. In this study, we describe an interactive visual analytics approach to suspect data identification and removal and demonstrate the application of this approach on questionnaire data pertaining to COVID-19 derived from different recruitment venues, including listservs and social media.

METHODS:

We developed a pipeline for data cleaning, pre-processing, analysis, and automated ranking of data to address data quality issues. We then employed the ranking in conjunction with manual review to identify suspect data and remove them from subsequent analyses. Last, we compared differences in the data before and after removal.
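The automated ranking step can be illustrated with a minimal sketch. It assumes each response carries a set of boolean suspect flags and that responses are ranked by how many flags they trip; the flag names (`very_fast_completion`, `duplicate_free_text`) and the simple count-based score are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Response:
    """One survey response with hypothetical boolean suspect flags."""
    response_id: str
    flags: Dict[str, bool] = field(default_factory=dict)

    @property
    def suspect_score(self) -> int:
        # Assumption: the score is simply the number of flags that are True.
        return sum(self.flags.values())


def rank_for_review(responses: List[Response]) -> List[Response]:
    """Order responses so the most suspect ones are reviewed first."""
    return sorted(responses, key=lambda r: r.suspect_score, reverse=True)


# Illustrative data only; not drawn from the study dataset.
responses = [
    Response("r1", {"very_fast_completion": False, "duplicate_free_text": False}),
    Response("r2", {"very_fast_completion": True, "duplicate_free_text": True}),
    Response("r3", {"very_fast_completion": True, "duplicate_free_text": False}),
]

ranked = rank_for_review(responses)
review_order = [r.response_id for r in ranked]
print(review_order)  # ['r2', 'r3', 'r1']
```

In this sketch the ranking only prioritizes responses for manual review; the decision to exclude a response remains with the reviewer.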

RESULTS:

We performed data cleaning, pre-processing, and exploratory analysis on a survey dataset (N = 4,163) collected through multiple recruitment mechanisms on the Qualtrics survey platform. Based on these results, we identified suspect features and used these to generate a suspect feature indicator for each survey response. We excluded survey responses that did not fit the inclusion criteria for the study (n = 29) and then performed manual review of the remaining responses, triangulating with the suspect feature indicator. Based on this review, we excluded 2,921 responses. Additional responses were excluded based on a spam classification by Qualtrics (n = 13) and the percentage of survey completion (n = 328), resulting in a final sample size of 872. We performed additional analyses to demonstrate the extent to which the suspect feature indicator was congruent with eventual inclusion, and compared the characteristics of the included and excluded data.
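The exclusion counts reported above can be checked with a short tally. The counts come from the abstract; the variable names and the step labels are illustrative wording, not the authors' terms.

```python
# Starting sample size and the sequential exclusions reported in the abstract.
initial_n = 4163
exclusions = [
    ("did not fit inclusion criteria", 29),
    ("excluded after manual review", 2921),
    ("Qualtrics spam classification", 13),
    ("percentage of survey completion", 328),
]

remaining = initial_n
flow = []  # (reason, number excluded, number remaining after this step)
for reason, n_excluded in exclusions:
    remaining -= n_excluded
    flow.append((reason, n_excluded, remaining))
    print(f"{reason}: -{n_excluded} -> {remaining} remaining")

final_n = remaining
print(f"final sample size: {final_n}")  # final sample size: 872
```

The tally confirms that the four exclusion steps account for the full difference between the initial 4,163 responses and the final sample of 872.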

CONCLUSION:

Our main contributions are 1) a proposed framework for data quality assessment, including suspect data identification and removal; 2) the analysis of potential consequences in terms of representation bias in the dataset; and 3) recommendations for implementation of this approach in practice.

Subject(s)

COVID-19; Social Media; Humans; COVID-19/epidemiology; Surveys and Questionnaires; Software; Data Accuracy

Keywords

Bot identification; COVID-19; Data quality; Interactive visual analytics; Questionnaire data; Visual data analysis

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Social Media / COVID-19 Study type: Diagnostic_studies / Guideline / Prognostic_studies Limit: Humans Language: En Journal: J Biomed Inform Journal subject: MEDICAL INFORMATICS Year: 2023 Document type: Article
