Retrieval augmentation of large language models for lay language generation.
Guo, Yue; Qiu, Wei; Leroy, Gondy; Wang, Sheng; Cohen, Trevor.
Affiliation
  • Guo Y; Biomedical and Health Informatics, University of Washington, United States of America. Electronic address: yguo50@uw.edu.
  • Qiu W; Paul G. Allen School of Computer Science, University of Washington, United States of America.
  • Leroy G; Management Information Systems, University of Arizona, United States of America.
  • Wang S; Paul G. Allen School of Computer Science, University of Washington, United States of America.
  • Cohen T; Biomedical and Health Informatics, University of Washington, United States of America.
J Biomed Inform; 149: 104580, 2024 Jan.
Article in En | MEDLINE | ID: mdl-38163514
ABSTRACT
The complex linguistic structures and specialized terminology of expert-authored content limit the accessibility of biomedical literature to the general public. Automated methods have the potential to render this literature more interpretable to readers with different educational backgrounds. Prior work has framed such lay language generation as a summarization or simplification task. However, adapting biomedical text for the lay public includes the additional and distinct task of background explanation: adding external content in the form of definitions, motivation, or examples to enhance comprehensibility. This task is especially challenging because the source document may not include the required background knowledge. Furthermore, background explanation capabilities have yet to be formally evaluated, and little is known about how best to enhance them. To address this problem, we introduce Retrieval-Augmented Lay Language (RALL) generation, which intuitively fits the need for external knowledge beyond that in expert-authored source documents. In addition, we introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. To evaluate RALL, we augmented state-of-the-art text generation models with retrieval of either term definitions from the UMLS and Wikipedia, or embeddings of explanations from Wikipedia documents. Of these, the embedding-based RALL models improved summary quality and simplicity while maintaining factual correctness, suggesting that Wikipedia is a helpful source of background explanations in this context. We also evaluated the ability of an open-source large language model (Llama 2) and a closed-source one (GPT-4) to provide background explanations, with and without retrieval augmentation. Results indicate that these LLMs can generate simplified content, but that summary quality is not ideal. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the way for disseminating scientific knowledge to a broader audience. Our code and data are publicly available at https://github.com/LinguisticAnomalies/pls_retrieval.
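The embedding-based retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the encoder model, the toy background corpus, and the prompt wording below are illustrative assumptions.

```python
# Minimal sketch of embedding-based retrieval augmentation in the RALL style:
# retrieve Wikipedia-like background passages by embedding similarity and
# prepend them to the expert-authored source before lay-language generation.
# Illustrative only; the paper's actual models, corpus, and prompts differ.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical stand-in for a corpus of Wikipedia explanation passages.
wiki_passages = [
    "Gene expression is the process by which information from a gene is used "
    "to synthesize a functional product such as a protein.",
    "A biomarker is a measurable indicator of some biological state or condition.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
passage_vecs = encoder.encode(wiki_passages, normalize_embeddings=True)

def retrieve(source_text: str, k: int = 2) -> list[str]:
    """Return the k background passages most similar to the source document."""
    query_vec = encoder.encode([source_text], normalize_embeddings=True)[0]
    scores = passage_vecs @ query_vec        # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]            # indices of the k best passages
    return [wiki_passages[i] for i in top]

def build_prompt(source_text: str) -> str:
    """Prepend retrieved background explanations to the generation input."""
    background = "\n".join(retrieve(source_text))
    return (
        "Background:\n" + background +
        "\n\nRewrite the following abstract in plain language:\n" + source_text
    )
```

The resulting prompt would then be passed to a text generation model; supplying background the source document lacks is what distinguishes this setup from plain summarization or simplification.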
Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Natural Language Processing / Language Language: En Journal: J Biomed Inform Journal subject: Medical Informatics Year: 2024 Document type: Article Country of publication: United States