1.
Bioinformatics ; 40(Supplement_1): i119-i129, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940167

ABSTRACT

SUMMARY: Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generation. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed that search documents from a knowledge corpus and append them, unconditionally or selectively, to the input of LLMs for generation. However, when existing methods are applied to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a reliable framework for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG so that it can assess its generated explanations with customized reflective tokens. Our work shows that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. On three major medical question-answering benchmark datasets, Self-BioRAG achieves significant performance gains, with a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Similarly, Self-BioRAG outperforms RAG by 8% Rouge-1 score on average in generating more proficient answers on two long-form question-answering benchmarks. Overall, our analysis shows that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge, as a medical expert does.
We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains. AVAILABILITY AND IMPLEMENTATION: Self-BioRAG is available at https://github.com/dmis-lab/self-biorag.
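The selective-retrieval-plus-self-reflection loop described in the abstract can be sketched as follows. This is a minimal illustration of the general idea under stated assumptions, not Self-BioRAG's actual implementation: `needs_retrieval`, `retrieve`, `generate`, and `reflect` are hypothetical stand-ins for the trained model components.

```python
# Sketch of selective RAG with self-reflection: decide whether retrieval
# is needed, fetch documents if so, generate, then self-assess the output.
# All four components below are toy stand-ins, not Self-BioRAG's models.

def needs_retrieval(question: str) -> bool:
    # Stand-in: retrieve when the question references external evidence.
    return any(w in question.lower() for w in ("study", "trial", "evidence"))

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    # Stand-in retriever: rank documents by word overlap with the question.
    q = set(question.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]

def generate(question: str, docs: list[str]) -> str:
    # Stand-in generator: the real framework uses a fine-tuned LLM here.
    context = " ".join(docs) if docs else "(encoded knowledge only)"
    return f"Answer to '{question}' using: {context}"

def reflect(answer: str) -> str:
    # Stand-in reflective token: the real model emits learned critique tokens.
    return "[Supported]" if "using:" in answer else "[Unsupported]"

def answer(question: str, corpus: list[str]) -> tuple[str, str]:
    docs = retrieve(question, corpus) if needs_retrieval(question) else []
    out = generate(question, docs)
    return out, reflect(out)

corpus = ["A randomized trial of statins in adults.",
          "Guidelines for influenza vaccination."]
out, token = answer("What did the statin trial find?", corpus)
```

The key design point the abstract emphasizes is that both the retrieval decision and the self-assessment are learned, domain-specific behaviors rather than fixed heuristics as in this toy version.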


Subject(s)
Information Storage and Retrieval , Humans , Information Storage and Retrieval/methods , Natural Language Processing
2.
Database (Oxford) ; 2023: 2023 03 07.
Article in English | MEDLINE | ID: mdl-36882099

ABSTRACT

The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are among the most searched biomedical entities in PubMed, and, as highlighted during the coronavirus disease 2019 pandemic, their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We therefore organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both their spans [i.e. named entity recognition (NER)] and their normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall).
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
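Each F-score reported above is the harmonic mean of the corresponding precision and recall, which can be verified directly:

```python
def f1(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Reproduce the strict scores reported above (rounded to four decimals).
ner_f1 = round(f1(0.8759, 0.8587), 4)    # strict NER -> 0.8672
norm_f1 = round(f1(0.8621, 0.7702), 4)   # strict normalization -> 0.8136
index_f1 = round(f1(0.7417, 0.5141), 4)  # chemical indexing -> 0.6073
```

The indexing figures also illustrate the abstract's point (ii): the gap between precision (0.7417) and recall (0.5141) drags the harmonic mean well below either value, reflecting how much harder the indexing task is.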


Subject(s)
COVID-19 , United States , Humans , National Library of Medicine (U.S.) , Data Mining , Databases, Factual , MEDLINE
3.
Bioinformatics ; 38(20): 4837-4839, 2022 10 14.
Article in English | MEDLINE | ID: mdl-36053172

ABSTRACT

In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g. diseases and drugs) from the ever-growing biomedical literature. In this article, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves upon the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts for various tasks such as biomedical knowledge graph construction. AVAILABILITY AND IMPLEMENTATION: The web service of BERN2 is publicly available at http://bern2.korea.ac.kr. We also provide a local installation of BERN2 at https://github.com/dmis-lab/BERN2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
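The two-stage pipeline the abstract describes, recognition first, then normalization to a canonical identifier, can be illustrated with a toy example. The regex patterns and the identifier dictionary below are hypothetical stand-ins; BERN2 itself uses neural NER and NEN models, not lookups like these.

```python
import re

# Toy NER -> NEN pipeline: find entity spans, then map each mention to a
# canonical identifier. Patterns and the ID dictionary are hypothetical
# examples, not BERN2's actual models or vocabularies.

NER_PATTERNS = {
    "disease": re.compile(r"\b(diabetes|asthma)\b", re.IGNORECASE),
    "drug": re.compile(r"\b(metformin|aspirin)\b", re.IGNORECASE),
}

NEN_DICT = {  # lowercased mention -> canonical identifier (illustrative)
    "diabetes": "MESH:D003920",
    "metformin": "MESH:D008687",
}

def annotate(text: str) -> list[dict]:
    entities = []
    for etype, pattern in NER_PATTERNS.items():
        for m in pattern.finditer(text):
            mention = m.group(0)
            entities.append({
                "mention": mention,
                "type": etype,
                "span": (m.start(), m.end()),
                # Mentions absent from the vocabulary stay unlinked.
                "id": NEN_DICT.get(mention.lower(), "CUI-less"),
            })
    return sorted(entities, key=lambda e: e["span"])

anns = annotate("Metformin is first-line therapy for diabetes.")
```

The output mirrors the shape of a typical annotation record (mention, type, character span, linked ID), which is what downstream tasks such as knowledge graph construction consume.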


Subject(s)
Neural Networks, Computer , Software , Natural Language Processing
4.
Database (Oxford) ; 2022: 2022 09 28.
Article in English | MEDLINE | ID: mdl-36170114

ABSTRACT

Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated on article titles and abstracts, their effectiveness has not been thoroughly verified on full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We address these limitations with simple training and post-processing methods, such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain an 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 points. Database URL: https://github.com/dmis-lab/bc7-chem-id.
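Mention-wise majority voting, the post-processing step named above for fixing tagging inconsistency, can be sketched as follows. This shows the general idea (relabel every occurrence of a surface form with its majority label across the document); the paper's exact voting scheme may differ in detail.

```python
from collections import Counter, defaultdict

def majority_vote(predictions: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Enforce document-level consistency: every occurrence of the same
    surface form (case-insensitive) gets that form's majority label.

    predictions: (mention, predicted_label) pairs over one document.
    """
    votes = defaultdict(Counter)
    for mention, label in predictions:
        votes[mention.lower()][label] += 1
    majority = {m: c.most_common(1)[0][0] for m, c in votes.items()}
    return [(mention, majority[mention.lower()]) for mention, _ in predictions]

# "aspirin" is tagged Chemical twice and O once, so the lone O is flipped.
preds = [("aspirin", "Chemical"), ("Aspirin", "Chemical"),
         ("aspirin", "O"), ("benzene", "Chemical")]
consistent = majority_vote(preds)
```

In full-text articles the same chemical name recurs many times, so even a simple vote like this can repair isolated tagger errors that abstracts are too short to expose.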


Subject(s)
Data Mining , Data Mining/methods , Databases, Factual
5.
J Assoc Inf Sci Technol ; 73(8): 1065-1078, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35441082

ABSTRACT

Scientific novelty drives efforts to invent new vaccines and solutions during a pandemic. First-time collaboration and international collaboration are two pivotal channels for expanding teams' search activities toward the broader scope of resources required to address a global challenge, which might facilitate the generation of novel ideas. Our analysis of 98,981 coronavirus papers suggests that scientific novelty, measured with a BioBERT model pretrained on 29 million PubMed articles, and first-time collaboration increased after the outbreak of COVID-19, while international collaboration saw a sudden decrease. During COVID-19, papers with more first-time collaboration were found to be more novel, and international collaboration did not hamper novelty as it had in normal periods. The findings suggest the necessity of reaching out for distant resources and the importance of maintaining a collaborative scientific community beyond nationalism during a pandemic.
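One common way an embedding-based novelty score is operationalized, offered here as a hypothetical sketch rather than the paper's exact formula, is to embed the concepts a paper combines and score novelty as the mean cosine distance over all concept pairs: papers pairing semantically distant concepts score higher.

```python
import math

def cosine_distance(u: list[float], v: list[float]) -> float:
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def novelty(embeddings: list[list[float]]) -> float:
    # Mean pairwise cosine distance over all concept pairs in one paper.
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    return sum(cosine_distance(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)

# Toy 2-D vectors standing in for BioBERT embeddings of a paper's concepts.
familiar = novelty([[1.0, 0.0], [0.9, 0.1]])  # similar concepts -> low score
novel = novelty([[1.0, 0.0], [0.0, 1.0]])     # distant concepts -> high score
```

Under this reading, a first-time collaboration that imports concepts from a distant field would mechanically raise the score, which is consistent with the correlation the abstract reports.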
