Results 1 - 20 of 61
1.
Stud Health Technol Inform ; 316: 1487-1491, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176485

ABSTRACT

This article presents our experience in developing an ontological model that can be used to create clinical decision support systems (CDSS). We used the largest international biomedical terminological metathesaurus, the Unified Medical Language System (UMLS), as the basis of our model. This metathesaurus has been adapted into Russian using an automated hybrid translation system with expert control. The resulting product was named the National Unified Terminological System (NUTS). We added more than 33 million scientific and clinical relationships between NUTS terms, extracted from the texts of scientific articles and electronic health records. We also computed weights for each relationship, standardized their values, and built a symptom checker for preliminary diagnostics on this basis. We expect that NUTS will support named entity recognition (NER) and increase the interoperability of terms across different CDSS.


Subject(s)
Electronic Health Records , Knowledge Bases , Unified Medical Language System , Clinical Decision Support Systems , Natural Language Processing , Humans , Russia , Controlled Vocabulary
2.
Stud Health Technol Inform ; 316: 1492-1493, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176486

ABSTRACT

This article presents our experience in constructing the National Unified Terminological System (NUTS), which has an ontological structure based on the international Unified Medical Language System (UMLS). UMLS has been adapted and enriched with formulations from national directories, with relationships extracted from the texts of scientific articles and electronic health records, and with weight coefficients.


Subject(s)
Electronic Health Records , Unified Medical Language System , Natural Language Processing , Terminology as Topic , Controlled Vocabulary
3.
Stud Health Technol Inform ; 316: 1536-1537, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176497

ABSTRACT

Our novel Intelligent Tutoring System (ITS) architecture integrates HL7 Fast Healthcare Interoperability Resources (FHIR) for data exchange and Unified Medical Language System (UMLS) codes for content mapping.


Subject(s)
Health Level Seven , Unified Medical Language System , Health Information Interoperability , Systems Integration , Humans
4.
Stud Health Technol Inform ; 316: 1933-1937, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176870

ABSTRACT

Biomedical data analysis and visualization often demand data experts for each unique health event. There is a clear lack of automatic tools for semantic visualization of the spread of health risks through biomedical data. Illnesses such as coronavirus disease (COVID-19) and Monkeypox spread rampantly around the world before governments could make decisions based on the analysis of such data. We propose the design of a knowledge graph (KG) for spatio-temporal tracking of public health event propagation. To achieve this, we propose the specialization of the Core Propagation Phenomenon Ontology (PropaPhen) into a health-related propagation phenomenon domain ontology. Data from the UMLS and OpenStreetMaps are suggested for instantiating the proposed knowledge graph. Finally, the results of a use case on COVID-19 data from the World Health Organization are analyzed to evaluate the possibilities of our approach.


Subject(s)
COVID-19 , Humans , SARS-CoV-2 , Biological Ontologies , Unified Medical Language System
5.
Stud Health Technol Inform ; 316: 771-775, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176907

ABSTRACT

Ontologies play a key role in representing and structuring domain knowledge. In the biomedical domain, the need for this type of representation is crucial for structuring, coding, and retrieving data. However, available ontologies do not encompass all the relevant concepts and relationships. In this paper, we propose the framework SiMHOMer (Siamese Models for Health Ontologies Merging) to semantically merge and integrate the most relevant ontologies in the healthcare domain, with a first focus on diseases, symptoms, drugs, and adverse events. We propose to rely on the siamese neural models we developed and trained on biomedical data, BioSTransformers, to identify new relevant relations between concepts and to create new semantic relations, the objective being to build a new merging ontology that could be used in applications. To validate the proposed approach and the new relations, we relied on the UMLS Metathesaurus and the Semantic Network. Our first results show promising improvements for future research.


Subject(s)
Biological Ontologies , Semantics , Computer Neural Networks , Humans , Unified Medical Language System
6.
J Biomed Semantics ; 15(1): 14, 2024 Aug 10.
Article in English | MEDLINE | ID: mdl-39123237

ABSTRACT

BACKGROUND: Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. RESULTS: In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context.
We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% on top 1 candidate's accuracy and 90.0% on top 10 candidate's accuracy. CONCLUSION: This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
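
The top 1 and top 10 accuracies quoted in this abstract are ranking metrics: a mention counts as correct if its gold concept appears among the first k ranked candidates. A minimal sketch of that metric follows, assuming ranked candidate lists are already available; the function name `top_k_accuracy` and the identifiers in the toy data are hypothetical placeholders, not real VO IDs and not the authors' evaluation code.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of mentions whose gold concept ID is among the first k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, g in zip(ranked, gold) if g in candidates[:k])
    return hits / len(gold)


# Hypothetical toy data: one ranked candidate list per mention, best candidate first.
ranked_candidates = [
    ["VO_A", "VO_B", "VO_C"],
    ["VO_D", "VO_E", "VO_F"],
    ["VO_G", "VO_H", "VO_I"],
    ["VO_J", "VO_K", "VO_L"],
]
gold_ids = ["VO_A", "VO_E", "VO_I", "VO_Z"]

top1 = top_k_accuracy(ranked_candidates, gold_ids, k=1)
top3 = top_k_accuracy(ranked_candidates, gold_ids, k=3)
```

On this toy data only the first mention is right at rank 1, while three of the four gold IDs appear within the first three candidates, which illustrates why top 10 accuracy is always at least as high as top 1 accuracy.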


Subject(s)
Biological Ontologies , Clinical Trials as Topic , Vaccines , Vaccines/immunology , Humans , Natural Language Processing , Unified Medical Language System
7.
Bioinformatics ; 40(8)2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39067036

ABSTRACT

MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), have achieved remarkable results for the task without requiring manual tuning or the definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). RESULTS: We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION: The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.
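
The KB pre-processing step described in this abstract can be illustrated with a small sketch. The abstract does not say how the disambiguating string is constructed, so the version below simply appends a parenthetical hint to any name shared by more than one entity; the function `expand_homonyms`, the hint field, and the toy entries are all assumptions made for illustration and do not reproduce the BELHD implementation.

```python
from collections import defaultdict


def expand_homonyms(kb):
    """Make KB names unique by appending a hint to names shared by several entities.

    kb: list of (entity_id, name, hint) tuples; the hint is an assumed
    per-entity descriptor such as a category label.
    Returns a list of (entity_id, name) tuples in the original order.
    """
    ids_by_name = defaultdict(set)
    for entity_id, name, _ in kb:
        ids_by_name[name.lower()].add(entity_id)

    expanded = []
    for entity_id, name, hint in kb:
        if len(ids_by_name[name.lower()]) > 1:
            # Homonym: the same name points to several entities, so disambiguate it.
            expanded.append((entity_id, f"{name} ({hint})"))
        else:
            expanded.append((entity_id, name))
    return expanded


# Hypothetical toy KB: "cold" names two different entities.
toy_kb = [
    ("E1", "cold", "disease"),
    ("E2", "cold", "temperature"),
    ("E3", "fever", "finding"),
]
expanded = expand_homonyms(toy_kb)
```

After expansion every name maps to exactly one identifier, so a name-based linker that returns a name also returns an unambiguous entity. Two homonyms with identical hints would still collide in this sketch, so a real system needs a hint that is guaranteed to differ.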


Subject(s)
Natural Language Processing , Computer Neural Networks , Knowledge Bases , Algorithms , Unified Medical Language System , Humans , Computational Biology/methods
8.
Sci Data ; 11(1): 688, 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38926396

ABSTRACT

Automated medical image analysis systems often require large amounts of training data with high quality labels, which are difficult and time consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 new images added to PMC since 2018. It further provides manually curated concepts for imaging modalities with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models, and evaluation of deep learning models for multi-task learning.


Subject(s)
Multimodal Imaging , Radiology , Humans , Computer-Assisted Image Processing , Unified Medical Language System
9.
J Biomed Semantics ; 15(1): 8, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816758

ABSTRACT

BACKGROUND: A huge amount of research is carried out nowadays in Artificial Intelligence to propose automated ways to analyse medical data with the aim of supporting doctors in delivering medical diagnoses. However, a main issue of these approaches is the lack of transparency and interpretability of the achieved results, making it hard to employ such methods for educational purposes. It is therefore necessary to develop new frameworks to enhance explainability in these solutions. RESULTS: In this paper, we present a novel full pipeline to automatically generate natural language explanations for medical diagnoses. The proposed solution starts from a clinical case description associated with a list of correct and incorrect diagnoses and, through the extraction of the relevant symptoms and findings, enriches the information contained in the description with verified medical knowledge from an ontology. Finally, the system returns a pattern-based explanation in natural language which elucidates why the correct (incorrect) diagnosis is the correct (incorrect) one. The main contribution of the paper is twofold: first, we propose two novel linguistic resources for the medical domain (i.e., a dataset of 314 clinical cases annotated with the medical entities from UMLS, and a database of biological boundaries for common findings), and second, a full Information Extraction pipeline to extract symptoms and findings from the clinical cases and match them with the terms in a medical ontology and with the biological boundaries. An extensive evaluation of the proposed approach shows that our method outperforms comparable approaches. CONCLUSIONS: Our goal is to offer an AI-assisted educational support framework that trains clinical residents to formulate sound and exhaustive explanations for their diagnoses to patients.


Subject(s)
Natural Language Processing , Humans , Diagnosis , Artificial Intelligence , Diagnostic Errors , Unified Medical Language System
10.
J Med Internet Res ; 26: e52655, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38814687

ABSTRACT

BACKGROUND: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. OBJECTIVE: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. METHODS: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. 
RESULTS: We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. CONCLUSIONS: Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where documents corpora and curated ontologies are made available.
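
The co-occurrence network in this abstract connects two concepts when their mutual information is relevant. A minimal sketch of that idea follows, using pointwise mutual information (PMI) over per-document concept sets as the relevance score; the choice of PMI, the threshold `min_pmi`, and the function name `cooccurrence_edges` are assumptions for illustration, since the abstract does not give the exact measure or cutoff.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, min_pmi=0.0):
    """Return {(concept_a, concept_b): pmi} for pairs whose PMI exceeds min_pmi.

    docs: list of concept lists, one per annotated abstract.
    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated as document frequencies.
    """
    n_docs = len(docs)
    concept_count = Counter()
    pair_count = Counter()
    for concepts in docs:
        unique = sorted(set(concepts))  # count each concept once per document
        concept_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in pair_count.items():
        pmi = math.log2((n_ab * n_docs) / (concept_count[a] * concept_count[b]))
        if pmi > min_pmi:
            edges[(a, b)] = pmi
    return edges


# Hypothetical annotated corpus of five abstracts.
toy_docs = [
    ["cough", "fever"],
    ["cough", "fever"],
    ["rash", "fever"],
    ["rash", "headache"],
    ["headache"],
]
edges = cooccurrence_edges(toy_docs)
```

In the toy corpus, "fever" and "rash" co-occur once but less often than chance would predict, so that pair is dropped, while the two pairs that co-occur more often than chance become edges.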


Subject(s)
Algorithms , COVID-19 , SARS-CoV-2 , COVID-19/epidemiology , Humans , Pandemics , Information Storage and Retrieval/methods , Biomedical Research/methods , Unified Medical Language System , Search Engine
11.
J Biomed Inform ; 155: 104659, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38777085

ABSTRACT

OBJECTIVE: This study aims to promote interoperability in precision medicine and translational research by aligning the Observational Medical Outcomes Partnership (OMOP) and Phenopackets data models. Phenopackets is an expert knowledge-driven schema designed to facilitate the storage and exchange of multimodal patient data, and support downstream analysis. The first goal of this paper is to explore model alignment by characterizing the common data models using a newly developed data transformation process and evaluation method. Second, using OMOP normalized clinical data, we evaluate the mapping of real-world patient data to Phenopackets. We evaluate the suitability of Phenopackets as a patient data representation for real-world clinical cases. METHODS: We identified mappings between OMOP and Phenopackets and applied them to a real patient dataset to assess the transformation's success. We analyzed gaps between the models and identified key considerations for transforming data between them. Further, to improve ambiguous alignment, we incorporated Unified Medical Language System (UMLS) semantic type-based filtering to direct individual concepts to their most appropriate domain and conducted a domain-expert evaluation of the mapping's clinical utility. RESULTS: The OMOP to Phenopacket transformation pipeline was executed for 1,000 Alzheimer's disease patients and successfully mapped all required entities. However, due to missing values in OMOP for required Phenopacket attributes, 10.2 % of records were lost. The use of UMLS-semantic type filtering for ambiguous alignment of individual concepts resulted in 96 % agreement with clinical thinking, increased from 68 % when mapping exclusively by domain correspondence. CONCLUSION: This study presents a pipeline to transform data from OMOP to Phenopackets. We identified considerations for the transformation to ensure data quality, handling restrictions for successful Phenopacket validation and discrepant data formats. 
We identified unmappable Phenopacket attributes that focus on specialty use cases, such as genomics or oncology, which OMOP does not currently support. We introduce UMLS semantic type filtering to resolve ambiguous alignment to Phenopacket entities to be most appropriate for real-world interpretation. We provide a systematic approach to align OMOP and Phenopackets schemas. Our work facilitates future use of Phenopackets in clinical applications by addressing key barriers to interoperability when deriving a Phenopacket from real-world patient data.
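
The semantic type filtering described in this abstract can be pictured as a lookup that overrides a domain-based default. The sketch below is an assumption-laden illustration: the type identifiers are standard UMLS semantic type codes, but the routing table, the target field names, and the function `route_concept` are hypothetical and do not reproduce the authors' mapping.

```python
# Assumed routing table from UMLS semantic type identifiers (TUIs) to
# Phenopacket-style fields; the assignments are illustrative only.
SEMANTIC_TYPE_TO_FIELD = {
    "T047": "diseases",             # Disease or Syndrome
    "T184": "phenotypic_features",  # Sign or Symptom
    "T033": "phenotypic_features",  # Finding
    "T059": "measurements",         # Laboratory Procedure
    "T061": "medical_actions",      # Therapeutic or Preventive Procedure
    "T121": "medical_actions",      # Pharmacologic Substance
}


def route_concept(semantic_types, fallback_field):
    """Pick a target field from a concept's semantic types.

    semantic_types: list of TUIs attached to the concept.
    fallback_field: field chosen by plain domain correspondence, used when
    no semantic type has a routing rule.
    """
    for tui in semantic_types:
        field = SEMANTIC_TYPE_TO_FIELD.get(tui)
        if field is not None:
            return field
    return fallback_field


# A concept stored in a condition-like domain but typed as a symptom is
# redirected; a concept with an unlisted type keeps its domain default.
symptom_field = route_concept(["T184"], fallback_field="diseases")
default_field = route_concept(["T999"], fallback_field="measurements")
```

The fallback argument mirrors the comparison in the abstract between mapping exclusively by domain correspondence and mapping with semantic type filtering layered on top.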


Subject(s)
Unified Medical Language System , Humans , Semantics , Electronic Health Records , Precision Medicine/methods , Translational Biomedical Research , Medical Informatics/methods , Natural Language Processing , Alzheimer Disease
12.
J Am Med Inform Assoc ; 31(9): 1912-1920, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38777805

ABSTRACT

OBJECTIVES: Biomedical Knowledge Graphs play a pivotal role in various biomedical research domains. Concurrently, term clustering emerges as a crucial step in constructing these knowledge graphs, aiming to identify synonymous terms. Due to a lack of knowledge, previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle at clustering difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge from large language models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering. MATERIALS AND METHODS: The model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning, considering term and explanation embeddings simultaneously, and progressively introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH algorithm is designed for efficient clustering of a new ontology. RESULTS: We established a clustering test set and a hard negative test set, where our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35 580 932 terms from the Biomedical Informatics Ontology System (BIOS) into 22 104 559 clusters with O(N) queries to ChatGPT. Case studies highlight the model's efficacy in handling challenging samples, aided by information from explanations. CONCLUSION: By aligning terms to their explanations, CoRTEx demonstrates superior accuracy over benchmark models and robustness beyond its training set, and it is suitable for clustering terms for large-scale biomedical ontologies.


Subject(s)
Algorithms , Machine Learning , Natural Language Processing , Unified Medical Language System , Cluster Analysis , Biological Ontologies , Biomedical Research , Humans , Terminology as Topic
13.
J Am Med Inform Assoc ; 31(7): 1569-1577, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38718216

ABSTRACT

OBJECTIVE: Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword-matching. This study develops a system to streamline the process of curating colloquial medical dictionaries. We demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets as proof of concept. METHODS: COVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The pipeline includes three modules: a named entity recognition module to detect symptoms in tweets; an entity normalization module to aggregate detected entities; and a mapping module that iteratively maps entities to Unified Medical Language System concepts. A random sample of 500 entities was drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary to a pre-defined lexicon from previous research. RESULTS: We identified 498 480 unique symptom entity expressions from the tweets. Pre-processing reduced the number to 18 226. The final dictionary contains 38 175 unique expressions of symptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders like anxiety and depression, often missed by pre-defined lexicons. CONCLUSIONS: This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.


Subject(s)
COVID-19 , Deep Learning , Social Media , Unified Medical Language System , Humans , Public Health , Information Storage and Retrieval/methods
14.
BMC Med Inform Decis Mak ; 24(Suppl 2): 114, 2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38689287

ABSTRACT

BACKGROUND: Traditional literature-based discovery is based on connecting knowledge pairs extracted from separate publications via a common midpoint to derive previously unseen knowledge pairs. To avoid the overgeneration often associated with this approach, we explore an alternative method based on word evolution. Word evolution examines the changing contexts of a word to identify changes in its meaning or associations. We investigate the possibility of using changing word contexts to detect drugs suitable for repurposing. RESULTS: Word embeddings, which represent a word's context, are constructed from chronologically ordered publications in MEDLINE at bi-monthly intervals, yielding a time series of word embeddings for each word. Focusing on clinical drugs only, any drugs repurposed in the final time segment of the time series are annotated as positive examples. The decision regarding the drug's repurposing is based either on the Unified Medical Language System (UMLS) or on semantic triples extracted using SemRep from MEDLINE. CONCLUSIONS: The annotated data allow deep learning classification, with 5-fold cross-validation, to be performed and multiple architectures to be explored. Performance of 65% using UMLS labels and 81% using SemRep labels is attained, indicating the technique's suitability for the detection of candidate drugs for repurposing. The investigation also shows that different architectures are linked to the quantities of training data available and therefore that different models should be trained for every annotation approach.


Subject(s)
Drug Repositioning , Humans , Unified Medical Language System , MEDLINE , Deep Learning , Natural Language Processing , Semantics
15.
J Am Med Inform Assoc ; 31(9): 1844-1855, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38412333

ABSTRACT

OBJECTIVE: In this study, we investigate the potential of large language models (LLMs) to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. MATERIALS AND METHODS: Drawing on the wealth of the Unified Medical Language System knowledge graph and harnessing cutting-edge LLMs, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of 3 steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. RESULTS: Through rigorous evaluations of diverse downstream tasks, we demonstrate consistent and substantial improvements over the previous state of the art for semantic textual similarity (STS), biomedical concept representation (BCR), and clinically named entity linking, across 15+ datasets. Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. DISCUSSION: Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications. CONCLUSION: In this article, we introduced BioLORD-2023, a state-of-the-art model for STS and BCR designed for the clinical domain.


Subject(s)
Natural Language Processing , Semantics , Unified Medical Language System , Humans
16.
BMC Med Inform Decis Mak ; 23(Suppl 4): 299, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326827

ABSTRACT

BACKGROUND: In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research. Thus, terminology mapping is a necessary step to harmonize heterogeneous data. For example, the mapping between the Medical Dictionary for Regulatory Activities (MedDRA) and the International Classification of Diseases (ICD) is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS: The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. For the 2400 sampled unmapped terms, 56 of the 2400 MedDRA Preferred Terms (PT) could have exact match terms from ICD. CONCLUSION: Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PT terms, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.


Subject(s)
Adverse Drug Reaction Reporting Systems , International Classification of Diseases , Humans , Unified Medical Language System , Pharmacovigilance , Algorithms
17.
J Biomed Inform ; 149: 104580, 2024 01.
Article in English | MEDLINE | ID: mdl-38163514

ABSTRACT

The complex linguistic structures and specialized terminology of expert-authored content limit the accessibility of biomedical literature to the general public. Automated methods have the potential to render this literature more interpretable to readers with different educational backgrounds. Prior work has framed such lay language generation as a summarization or simplification task. However, adapting biomedical text for the lay public includes the additional and distinct task of background explanation: adding external content in the form of definitions, motivation, or examples to enhance comprehensibility. This task is especially challenging because the source document may not include the required background knowledge. Furthermore, background explanation capabilities have yet to be formally evaluated, and little is known about how best to enhance them. To address this problem, we introduce Retrieval-Augmented Lay Language (RALL) generation, which intuitively fits the need for external knowledge beyond that in expert-authored source documents. In addition, we introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. To evaluate RALL, we augmented state-of-the-art text generation models with information retrieval of either term definitions from the UMLS and Wikipedia, or embeddings of explanations from Wikipedia documents. Of these, embedding-based RALL models improved summary quality and simplicity while maintaining factual correctness, suggesting that Wikipedia is a helpful source for background explanation in this context. We also evaluated the ability of both an open-source Large Language Model (Llama 2) and a closed-source Large Language Model (GPT-4) in background explanation, with and without retrieval augmentation. Results indicate that these LLMs can generate simplified content, but that the summary quality is not ideal. 
Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. Our code and data are publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.


Subject(s)
Language , Natural Language Processing , Information Storage and Retrieval , Linguistics , Unified Medical Language System
18.
AMIA Annu Symp Proc ; 2023: 599-607, 2023.
Article in English | MEDLINE | ID: mdl-38222370

ABSTRACT

Biomedical ontologies are a key component in many systems for the analysis of textual clinical data. They are employed to organize information about a certain domain relying on a hierarchy of different classes. Each class maps a concept to items in a terminology developed by domain experts. These mappings are then leveraged to organize the information extracted by Natural Language Processing (NLP) models to build knowledge graphs for inferences. The creation of these associations, however, requires extensive manual review. In this paper, we present an automated approach and repeatable framework to learn a mapping between ontology classes and terminology terms derived from vocabularies in the Unified Medical Language System (UMLS) metathesaurus. According to our evaluation, the proposed system achieves a performance close to humans and provides a substantial improvement over existing systems developed by the National Library of Medicine to assist researchers through this process.


Subject(s)
Biological Ontologies , Unified Medical Language System , United States , Humans , National Library of Medicine (U.S.) , Natural Language Processing
19.
AMIA Annu Symp Proc ; 2023: 977-986, 2023.
Article in English | MEDLINE | ID: mdl-38222357

ABSTRACT

The Unified Medical Language System (UMLS), a large repository of biomedical vocabularies, has been used for supporting various biomedical applications. Ensuring the quality of the UMLS is critical to maintain both the accuracy of its content and the reliability of downstream applications. In this work, we present a Graph Convolutional Network (GCN)-based approach to identify misaligned synonymous terms organized under different UMLS concepts. We used synonymous terms grouped under the same concept as positive samples and top lexically similar terms as negative samples to train the GCN model. We applied the model to a test set and suggested those negative samples predicted to be synonymous as potentially misaligned synonymous terms. A total of 147,625 suggestions were made. A human expert evaluated 100 randomly selected suggestions and agreed with 60 of them. The results indicate that our GCN-based approach shows promise to help improve the synonymy grouping in the UMLS.


Subject(s)
Unified Medical Language System , Humans , Reproducibility of Results