Search | VHL Regional Portal

1.

Evaluating the efficacy of NoteAid on EHR note comprehension among US Veterans through Amazon Mechanical Turk.

Lalor, John P; Wu, Hao; Mazor, Kathleen M; Yu, Hong.

Int J Med Inform ; 172: 105006, 2023 04.

Article in English | MEDLINE | ID: mdl-36780789

ABSTRACT

OBJECTIVE: Low health literacy is a concern among US Veterans. In this study, we evaluated NoteAid, a system that provides lay definitions to medical jargon terms in EHR notes to help Veterans comprehend EHR notes. We expected that low initial scores for Veterans would be improved by using NoteAid. MATERIALS AND METHODS: We recruited Veterans from the Amazon Mechanical Turk crowd work platform (MTurk). We also recruited non-Veterans from MTurk as a control group for comparison. We randomly split recruited MTurk Veteran participants into control and intervention groups. We recruited non-Veteran participants into mutually exclusive control or intervention tasks on the MTurk platform. We showed participants de-identified EHR notes and asked them to answer comprehension questions related to the notes. We provided participants in the intervention group with EHR note content processed with NoteAid, while NoteAid was not available for participants in the control group. RESULTS: We recruited 94 Veterans and 181 non-Veterans. NoteAid leads to a significant improvement for non-Veterans but not for Veterans. Comparing Veterans recruited via MTurk with non-Veterans recruited via MTurk, we found that without NoteAid, Veterans have significantly higher raw scores than non-Veterans. This difference is not significant with NoteAid. DISCUSSION: That Veterans outperform a comparable population of non-Veterans is a surprising outcome. Without NoteAid, scores on the test are already high for Veterans, therefore, minimizing the ability of an intervention such as NoteAid to improve performance. With regards to Veterans, understanding the health literacy of Veterans has been an open question. We show here that Veterans score higher than a comparable, non-Veteran population. CONCLUSION: Veterans on MTurk do not see improved scores when using NoteAid, but they already score high on the test, significantly higher than non-Veterans. When evaluating NoteAid, population specifics need to be considered, as performance may vary across groups. Future work investigating the effectiveness of NoteAid on improving comprehension with local Veterans and developing a more difficult test to assess groups with higher health literacy is needed.

Subject(s)

Crowdsourcing , Health Literacy , Humans , Comprehension

2.

Evaluating the Effectiveness of NoteAid in a Community Hospital Setting: Randomized Trial of Electronic Health Record Note Comprehension Interventions With Patients.

Lalor, John P; Hu, Wen; Tran, Matthew; Wu, Hao; Mazor, Kathleen M; Yu, Hong.

J Med Internet Res ; 23(5): e26354, 2021 05 13.

Article in English | MEDLINE | ID: mdl-33983124

ABSTRACT

BACKGROUND: Interventions to define medical jargon have been shown to improve electronic health record (EHR) note comprehension among crowdsourced participants on Amazon Mechanical Turk (AMT). However, AMT participants may not be representative of the general population or patients who are most at-risk for low health literacy. OBJECTIVE: In this work, we assessed the efficacy of an intervention (NoteAid) for EHR note comprehension among participants in a community hospital setting. METHODS: Participants were recruited from Lowell General Hospital (LGH), a community hospital in Massachusetts, to take the ComprehENotes test, a web-based test of EHR note comprehension. Participants were randomly assigned to control (n=85) or intervention (n=89) groups to take the test without or with NoteAid, respectively. For comparison, we used a sample of 200 participants recruited from AMT to take the ComprehENotes test (100 in the control group and 100 in the intervention group). RESULTS: A total of 174 participants were recruited from LGH, and 200 participants were recruited from AMT. Participants in both intervention groups (community hospital and AMT) scored significantly higher than participants in the control groups (P<.001). The average score for the community hospital participants was significantly lower than the average score for the AMT participants (P<.001), consistent with the lower education levels in the community hospital sample. Education level had a significant effect on scores for the community hospital participants (P<.001). CONCLUSIONS: Use of NoteAid was associated with significantly improved EHR note comprehension in both community hospital and AMT samples. Our results demonstrate the generalizability of ComprehENotes as a test of EHR note comprehension and the effectiveness of NoteAid for improving EHR note comprehension.

Subject(s)

Comprehension , Electronic Health Records , Hospitals, Community , Humans

3.

Dynamic Data Selection for Curriculum Learning via Ability Estimation.

Lalor, John P; Yu, Hong.

Proc Conf Empir Methods Nat Lang Process ; 2020: 545-555, 2020 Nov.

Article in English | MEDLINE | ID: mdl-33381774

ABSTRACT

Curriculum learning methods typically rely on heuristics to estimate the difficulty of training examples or the ability of the model. In this work, we propose replacing difficulty heuristics with learned difficulty parameters. We also propose Dynamic Data selection for Curriculum Learning via Ability Estimation (DDaCLAE), a strategy that probes model ability at each training epoch to select the best training examples at that point. We show that models using learned difficulty and/or ability outperform heuristic-based curriculum learning models on the GLUE classification tasks.

4.

Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds.

Lalor, John P; Wu, Hao; Yu, Hong.

Proc Conf Empir Methods Nat Lang Process ; 2019: 4240-4250, 2019 Nov.

Article in English | MEDLINE | ID: mdl-31803865

ABSTRACT

Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.

5.

Improving Electronic Health Record Note Comprehension With NoteAid: Randomized Trial of Electronic Health Record Note Comprehension Interventions With Crowdsourced Workers.

Lalor, John P; Woolf, Beverly; Yu, Hong.

J Med Internet Res ; 21(1): e10793, 2019 01 16.

Article in English | MEDLINE | ID: mdl-30664453

ABSTRACT

BACKGROUND: Patient portals are becoming more common, and with them, the ability of patients to access their personal electronic health records (EHRs). EHRs, in particular the free-text EHR notes, often contain medical jargon and terms that are difficult for laypersons to understand. There are many Web-based resources for learning more about particular diseases or conditions, including systems that directly link to lay definitions or educational materials for medical concepts. OBJECTIVE: Our goal is to determine whether use of one such tool, NoteAid, leads to higher EHR note comprehension ability. We use a new EHR note comprehension assessment tool instead of patient self-reported scores. METHODS: In this work, we compare a passive, self-service educational resource (MedlinePlus) with an active resource (NoteAid) where definitions are provided to the user for medical concepts that the system identifies. We use Amazon Mechanical Turk (AMT) to recruit individuals to complete ComprehENotes, a new test of EHR note comprehension. RESULTS: Mean scores for individuals with access to NoteAid are significantly higher than the mean baseline scores, both for raw scores (P=.008) and estimated ability (P=.02). CONCLUSIONS: In our experiments, we show that the active intervention leads to significantly higher scores on the comprehension test as compared with a baseline group with no resources provided. In contrast, there is no significant difference between the group that was provided with the passive intervention and the baseline group. Finally, we analyze the demographics of the individuals who participated in our AMT task and show differences between groups that align with the current understanding of health literacy between populations. This is the first work to show improvements in comprehension using tools such as NoteAid as measured by an EHR note comprehension assessment tool as opposed to patient self-reported scores.

Subject(s)

Comprehension/physiology , Crowdsourcing/methods , Electronic Health Records/standards , Health Literacy/standards , Patient Portals/standards , Female , Humans , Male

6.

ComprehENotes, an Instrument to Assess Patient Reading Comprehension of Electronic Health Record Notes: Development and Validation.

Lalor, John P; Wu, Hao; Chen, Li; Mazor, Kathleen M; Yu, Hong.

J Med Internet Res ; 20(4): e139, 2018 04 25.

Article in English | MEDLINE | ID: mdl-29695372

ABSTRACT

BACKGROUND: Patient portals are widely adopted in the United States and allow millions of patients access to their electronic health records (EHRs), including their EHR clinical notes. A patient's ability to understand the information in the EHR is dependent on their overall health literacy. Although many tests of health literacy exist, none specifically focuses on EHR note comprehension. OBJECTIVE: The aim of this paper was to develop an instrument to assess patients' EHR note comprehension. METHODS: We identified 6 common diseases or conditions (heart failure, diabetes, cancer, hypertension, chronic obstructive pulmonary disease, and liver failure) and selected 5 representative EHR notes for each disease or condition. One note that did not contain natural language text was removed. Questions were generated from these notes using Sentence Verification Technique and were analyzed using item response theory (IRT) to identify a set of questions that represent a good test of ability for EHR note comprehension. RESULTS: Using Sentence Verification Technique, 154 questions were generated from the 29 EHR notes initially obtained. Of these, 83 were manually selected for inclusion in the Amazon Mechanical Turk crowdsourcing tasks and 55 were ultimately retained following IRT analysis. A follow-up validation with a second Amazon Mechanical Turk task and IRT analysis confirmed that the 55 questions test a latent ability dimension for EHR note comprehension. A short test of 14 items was created along with the 55-item test. CONCLUSIONS: We developed ComprehENotes, an instrument for assessing EHR note comprehension from existing EHR notes, gathered responses using crowdsourcing, and used IRT to analyze those responses, thus resulting in a set of questions to measure EHR note comprehension. Crowdsourced responses from Amazon Mechanical Turk can be used to estimate item parameters and select a subset of items for inclusion in the test set using IRT. The final set of questions is the first test of EHR note comprehension.

Subject(s)

Comprehension/physiology , Electronic Health Records/instrumentation , Reading , Electronic Health Records/standards , Female , Humans , Male , Validation Studies as Topic

7.

Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study.

Lalor, John P; Wu, Hao; Munkhdalai, Tsendsuren; Yu, Hong.

Proc Conf Empir Methods Nat Lang Process ; 2018: 4711-4716, 2018.

Article in English | MEDLINE | ID: mdl-33241233

ABSTRACT

Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. We examine the impact of a test set question's difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question's difficulty. As DNNs are trained with more data, easy examples are learned more quickly than hard examples.

8.

Building an Evaluation Scale using Item Response Theory.

Lalor, John P; Wu, Hao; Yu, Hong.

Proc Conf Empir Methods Nat Lang Process ; 2016: 648-657, 2016 Nov.

Article in English | MEDLINE | ID: mdl-28004039

ABSTRACT

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL