ABSTRACT
This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layers on an augmented training set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach with dictionaries: candidates are generated from a knowledge base containing Unified Medical Language System aliases using the cross-lingual SapBERT, and the top candidates are reranked using GPT-3.5. The entity linking approach shows consistent results across languages, with 0.73 accuracy on the SympTEMIST multilingual dataset, and also achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking.
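The candidate-generation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a toy character-trigram vector stands in for the SapBERT encoder, the knowledge base and its codes (`C_FEVER`, etc.) are hypothetical, and the GPT-3.5 reranking step is omitted; it would take the returned candidates as input.

```python
from collections import Counter
from math import sqrt


def embed(text, n=3):
    """Toy stand-in for a SapBERT encoder: a bag of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def generate_candidates(mention, kb, top_k=3):
    """Return up to top_k (code, alias, score) triples, keeping the best alias per code."""
    query = embed(mention)
    best = {}
    for code, aliases in kb.items():
        for alias in aliases:
            score = cosine(query, embed(alias))
            if code not in best or score > best[code][1]:
                best[code] = (alias, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(code, alias, score) for code, (alias, score) in ranked[:top_k]]


# Hypothetical knowledge base: concept code -> aliases in several languages.
KB = {
    "C_FEVER": ["fever", "fiebre", "pyrexia"],
    "C_COUGH": ["cough", "tos"],
    "C_HEADACHE": ["headache", "cefalea", "dolor de cabeza"],
}

candidates = generate_candidates("fiebre alta", KB)
```

Keeping only the best-scoring alias per code means each concept appears once in the candidate list, which keeps the input to a downstream reranker short.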
Subjects
Multilingualism, Humans, Natural Language Processing, Unified Medical Language System
ABSTRACT
INTRODUCTION: For an interoperable Intelligent Tutoring System (ITS), we used resources from Fast Healthcare Interoperability Resources (FHIR) and mapped learning content with Unified Medical Language System (UMLS) codes to enhance healthcare education. This study addresses the need to enhance the interoperability and effectiveness of ITS in healthcare education. STATE OF THE ART: The current state of the art in ITS involves advanced personalized learning and adaptability techniques, integrating technologies such as machine learning to personalize the learning experience and to create systems that dynamically respond to individual learner needs. However, existing ITS architectures face challenges related to interoperability and integration with healthcare systems. CONCEPT: Our system maps learning content with UMLS codes, each scored for similarity, ensuring consistency and extensibility. FHIR is used to standardize the exchange of medical information and learning content. IMPLEMENTATION: Implemented as a microservice architecture, the system uses a recommender to request FHIR resources, provide questions, and measure learner progress. LESSONS LEARNED: Using international standards, our ITS ensures reproducibility and extensibility, enhancing interoperability and integration with existing platforms.
Subjects
Health Information Interoperability, Health Level Seven, Unified Medical Language System, Humans, Machine Learning, Computer-Assisted Instruction/methods
ABSTRACT
This article presents our experience in developing an ontological model that can be used to create clinical decision support systems (CDSS). We used the largest international biomedical terminological metathesaurus, the Unified Medical Language System (UMLS), as the basis of our model. This metathesaurus was adapted into Russian using an automated hybrid translation system with expert control. The resulting product was named the National Unified Terminological System (NUTS). We added more than 33 million scientific and clinical relationships between NUTS terms, extracted from the texts of scientific articles and electronic health records. We also computed weights for each relationship, standardized their values, and on this basis built a symptom checker for preliminary diagnostics. We expect that NUTS will help solve the task of named entity recognition (NER) and increase the interoperability of terms across different CDSS.
Subjects
Electronic Health Records, Knowledge Bases, Unified Medical Language System, Clinical Decision Support Systems, Natural Language Processing, Humans, Russia, Controlled Vocabulary
ABSTRACT
This article presents experience in constructing the National Unified Terminological System (NUTS), which has an ontological structure based on the international Unified Medical Language System (UMLS). UMLS has been adapted and enriched with formulations from national directories, with relationships extracted from the texts of scientific articles and electronic health records, and with weight coefficients.
Subjects
Electronic Health Records, Unified Medical Language System, Natural Language Processing, Terminology as Topic, Controlled Vocabulary
ABSTRACT
Our novel Intelligent Tutoring System (ITS) architecture integrates HL7 Fast Healthcare Interoperability Resources (FHIR) for data exchange and Unified Medical Language System (UMLS) codes for content mapping.
Subjects
Health Level Seven, Unified Medical Language System, Health Information Interoperability, Systems Integration, Humans
ABSTRACT
Biomedical data analysis and visualization often demand data experts for each unique health event. There is a clear lack of automatic tools for semantic visualization of the spread of health risks through biomedical data. Illnesses such as coronavirus disease (COVID-19) and Monkeypox spread rampantly around the world before governments could make decisions based on the analysis of such data. We propose the design of a knowledge graph (KG) for spatio-temporal tracking of public health event propagation. To achieve this, we propose the specialization of the Core Propagation Phenomenon Ontology (PropaPhen) into a health-related propagation phenomenon domain ontology. Data from the UMLS and OpenStreetMap are suggested for instantiating the proposed knowledge graph. Finally, the results of a use case on COVID-19 data from the World Health Organization are analyzed to evaluate the possibilities of our approach.
Subjects
COVID-19, Humans, SARS-CoV-2, Biological Ontologies, Unified Medical Language System
ABSTRACT
Ontologies play a key role in representing and structuring domain knowledge. In the biomedical domain, the need for this type of representation is crucial for structuring, coding, and retrieving data. However, available ontologies do not encompass all the relevant concepts and relationships. In this paper, we propose the framework SiMHOMer (Siamese Models for Health Ontologies Merging) to semantically merge and integrate the most relevant ontologies in the healthcare domain, with a first focus on diseases, symptoms, drugs, and adverse events. We propose to rely on the siamese neural models we developed and trained on biomedical data, BioSTransformers, to identify new relevant relations between concepts and to create new semantic relations, the objective being to build a new merging ontology that could be used in applications. To validate the proposed approach and the new relations, we relied on the UMLS Metathesaurus and the Semantic Network. Our first results show promising improvements for future research.
Subjects
Biological Ontologies, Semantics, Computer Neural Networks, Humans, Unified Medical Language System
ABSTRACT
BACKGROUND: Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but the vaccine data in it lack standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. RESULTS: In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context.
We ranked the top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy. CONCLUSION: This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
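The top-1 and top-10 figures reported here are instances of top-k accuracy over ranked candidate lists. A minimal sketch of that metric follows; the concept identifiers (`VO_A`, etc.) and the rankings are hypothetical and are not taken from the study.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of mentions whose gold concept appears among the first k candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# Hypothetical ranked candidates for four mentions whose gold concept is "VO_A".
ranked = [
    ["VO_A", "VO_B", "VO_C"],
    ["VO_B", "VO_A", "VO_C"],
    ["VO_C", "VO_B", "VO_A"],
    ["VO_B", "VO_C", "VO_D"],
]
gold = ["VO_A", "VO_A", "VO_A", "VO_A"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
```

Top-k accuracy is non-decreasing in k, which is why a top-10 score is always at least as high as the top-1 score for the same system.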
Subjects
Biological Ontologies, Clinical Trials as Topic, Vaccines, Vaccines/immunology, Humans, Natural Language Processing, Unified Medical Language System
ABSTRACT
OBJECTIVE: Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) to supplant traditional systems poses questions of the quality and extent of the medical knowledge in the models' internal knowledge representations and the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing the UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlations between human judgments and the UMLS-based metrics for generations by LLMs. METHODS: We evaluated diagnoses generated by LLMs from consumer health questions and daily care notes in the electronic health records using the ConsumerQA and Problem Summarization datasets. Probing LLMs for the UMLS knowledge was performed by prompting the LLM to complete the diagnosis-related UMLS knowledge paths. Grounding the predictions was examined in an approach that integrated the UMLS graph paths and clinical notes in prompting the LLMs. The results were compared to prompting without the UMLS paths. The final experiments examined the alignment of different evaluation metrics, UMLS-based and non-UMLS, with human expert evaluation. RESULTS: In probing the UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. Grounding diagnosis predictions with the UMLS paths improved the results for both models on both tasks, with the highest improvement (4%) in SapBERT score. There was a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments. 
CONCLUSION: We found that while popular LLMs contain some medical knowledge in their internal representations, augmentation with the UMLS knowledge provides performance gains around diagnosis generation. The UMLS needs to be tailored for the task to improve the LLMs' predictions. Finding evaluation metrics that are aligned with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.
Subjects
Electronic Health Records, Unified Medical Language System, Humans, Machine Learning, Natural Language Processing, Clinical Decision Support Systems, Computer-Assisted Diagnosis/methods
ABSTRACT
To date, symptom documentation has mostly relied on clinical notes in electronic health records or patient-reported outcomes using disease-specific symptom inventories. To provide a common and precise language for symptom recording, assessment, and research, a comprehensive list of symptom codes is needed. The International Classification of Diseases, Ninth Revision or its clinical modification (International Classification of Diseases, Ninth Revision, Clinical Modification) has a range of codes designated for symptoms, but it does not contain codes for all possible symptoms, and not all codes in that range are symptom related. This study aimed to identify and categorize the first list of International Classification of Diseases, Ninth Revision, Clinical Modification symptom codes for a general population and demonstrate their use to characterize symptoms of patients with type 2 diabetes mellitus in the Cerner database. A list of potential symptom codes was automatically extracted from the Unified Medical Language System Metathesaurus. Two clinical experts in symptom science and diabetes manually reviewed this list to identify and categorize codes as symptoms. A total of 1888 International Classification of Diseases, Ninth Revision, Clinical Modification symptom codes were identified and categorized into 65 categories. The symptom characterization using the newly obtained symptom codes and categories was found to be more reasonable than that using the previous symptom codes and categories on the same Cerner diabetes cohort.
Subjects
Electronic Health Records, International Classification of Diseases, Symptom Assessment, Electronic Health Records/statistics & numerical data, Humans, Symptom Assessment/methods, Type 2 Diabetes Mellitus/diagnosis, Clinical Coding/methods, Clinical Coding/standards, Unified Medical Language System, Female, Male, Middle Aged
ABSTRACT
MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), achieved remarkable results for the task, without requiring manual tuning or definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene). RESULTS: We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION: The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.
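The idea of expanding homonyms so that every KB name maps to exactly one identifier can be sketched as follows. The abstract does not say how BELHD constructs its disambiguating string, so this sketch assumes each entry carries a short description that is appended in parentheses; the function name `expand_homonyms` and all identifiers and entries are hypothetical.

```python
from collections import defaultdict


def expand_homonyms(kb_entries):
    """Map each KB name to one identifier, expanding names shared by several identifiers.

    kb_entries is a list of (identifier, name, description) tuples. The description
    field is an assumption of this sketch, not a detail taken from BELHD.
    """
    identifiers_by_name = defaultdict(set)
    for identifier, name, _ in kb_entries:
        identifiers_by_name[name.lower()].add(identifier)

    name_to_identifier = {}
    for identifier, name, description in kb_entries:
        if len(identifiers_by_name[name.lower()]) > 1:
            # Homonym: make the name unique by appending the description.
            name = f"{name} ({description})"
        name_to_identifier[name] = identifier
    return name_to_identifier


# Hypothetical KB in which "cold" names two different entities.
entries = [
    ("ID1", "cold", "disease"),
    ("ID2", "cold", "temperature sensation"),
    ("ID3", "fever", "finding"),
]
unique_names = expand_homonyms(entries)
```

Because the expansion only rewrites the KB, a name-based linker that returns one of these strings yields a single identifier without any change to the prediction model.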
Subjects
Natural Language Processing, Computer Neural Networks, Knowledge Bases, Algorithms, Unified Medical Language System, Humans, Computational Biology/methods
ABSTRACT
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and includes 35,705 new images added to PMC since 2018. It further provides manually curated concepts for imaging modalities with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models, and evaluation of deep learning models for multi-task learning.
Subjects
Multimodal Imaging, Radiology, Humans, Computer-Assisted Image Processing, Unified Medical Language System
ABSTRACT
OBJECTIVES: Biomedical Knowledge Graphs play a pivotal role in various biomedical research domains. Concurrently, term clustering emerges as a crucial step in constructing these knowledge graphs, aiming to identify synonymous terms. Due to a lack of knowledge, previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle to cluster difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge from large language models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering. MATERIALS AND METHODS: The model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning, considering term and explanation embeddings simultaneously, and progressively introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH algorithm is designed for efficient clustering of a new ontology. RESULTS: We established a clustering test set and a hard negative test set, where our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35 580 932 terms from the Biomedical Informatics Ontology System (BIOS) into 22 104 559 clusters with O(N) queries to ChatGPT. Case studies highlight the model's efficacy in handling challenging samples, aided by information from explanations. CONCLUSION: By aligning terms to their explanations, CoRTEx demonstrates superior accuracy over benchmark models and robustness beyond its training set, and it is suitable for clustering terms for large-scale biomedical ontologies.
Subjects
Algorithms, Machine Learning, Natural Language Processing, Unified Medical Language System, Cluster Analysis, Biological Ontologies, Biomedical Research, Humans, Terminology as Topic
ABSTRACT
OBJECTIVE: Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword-matching. This study develops a system to streamline the process of curating colloquial medical dictionaries. We demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets as proof of concept. METHODS: COVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The pipeline includes three modules: a named entity recognition module to detect symptoms in tweets; an entity normalization module to aggregate detected entities; and a mapping module that iteratively maps entities to Unified Medical Language System concepts. A random sample of 500 entities was drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary to a pre-defined lexicon from previous research. RESULTS: We identified 498 480 unique symptom entity expressions from the tweets. Pre-processing reduced the number to 18 226. The final dictionary contains 38 175 unique expressions of symptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders like anxiety and depression, often missed by pre-defined lexicons. CONCLUSIONS: This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.
Subjects
COVID-19, Deep Learning, Social Media, Unified Medical Language System, Humans, Public Health, Information Storage and Retrieval/methods
ABSTRACT
OBJECTIVE: This study aims to promote interoperability in precision medicine and translational research by aligning the Observational Medical Outcomes Partnership (OMOP) and Phenopackets data models. Phenopackets is an expert knowledge-driven schema designed to facilitate the storage and exchange of multimodal patient data, and support downstream analysis. The first goal of this paper is to explore model alignment by characterizing the common data models using a newly developed data transformation process and evaluation method. Second, using OMOP normalized clinical data, we evaluate the mapping of real-world patient data to Phenopackets. We evaluate the suitability of Phenopackets as a patient data representation for real-world clinical cases. METHODS: We identified mappings between OMOP and Phenopackets and applied them to a real patient dataset to assess the transformation's success. We analyzed gaps between the models and identified key considerations for transforming data between them. Further, to improve ambiguous alignment, we incorporated Unified Medical Language System (UMLS) semantic type-based filtering to direct individual concepts to their most appropriate domain and conducted a domain-expert evaluation of the mapping's clinical utility. RESULTS: The OMOP to Phenopacket transformation pipeline was executed for 1,000 Alzheimer's disease patients and successfully mapped all required entities. However, due to missing values in OMOP for required Phenopacket attributes, 10.2% of records were lost. The use of UMLS-semantic type filtering for ambiguous alignment of individual concepts resulted in 96% agreement with clinical thinking, increased from 68% when mapping exclusively by domain correspondence. CONCLUSION: This study presents a pipeline to transform data from OMOP to Phenopackets. We identified considerations for the transformation to ensure data quality, handling restrictions for successful Phenopacket validation and discrepant data formats.
We identified unmappable Phenopacket attributes that focus on specialty use cases, such as genomics or oncology, which OMOP does not currently support. We introduce UMLS semantic type filtering to resolve ambiguous alignment, so that each concept is directed to the Phenopacket entity most appropriate for real-world interpretation. We provide a systematic approach to align OMOP and Phenopackets schemas. Our work facilitates future use of Phenopackets in clinical applications by addressing key barriers to interoperability when deriving a Phenopacket from real-world patient data.
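Semantic type-based filtering of the kind described in this abstract can be sketched as a lookup that overrides the default domain correspondence. The two routing tables below are assumptions made for illustration, not the mappings used in the study; only the OMOP domain names, the UMLS semantic type names, and the Phenopacket field names are taken from the respective standards.

```python
# Assumed routing from UMLS semantic type to Phenopacket field (illustrative only).
SEMANTIC_TYPE_TO_ENTITY = {
    "Disease or Syndrome": "diseases",
    "Sign or Symptom": "phenotypicFeatures",
    "Finding": "phenotypicFeatures",
    "Pharmacologic Substance": "medicalActions",
    "Laboratory Procedure": "measurements",
}

# Assumed fallback from OMOP domain to Phenopacket field (illustrative only).
DOMAIN_TO_ENTITY = {
    "Condition": "diseases",
    "Drug": "medicalActions",
    "Measurement": "measurements",
    "Observation": "phenotypicFeatures",
}


def route_concept(omop_domain, semantic_types):
    """Choose a Phenopacket field, preferring semantic type over OMOP domain."""
    for semantic_type in semantic_types:
        if semantic_type in SEMANTIC_TYPE_TO_ENTITY:
            return SEMANTIC_TYPE_TO_ENTITY[semantic_type]
    # No known semantic type: fall back to plain domain correspondence.
    return DOMAIN_TO_ENTITY.get(omop_domain)


# A symptom stored in the OMOP Condition domain is routed by its semantic type.
symptom_target = route_concept("Condition", ["Sign or Symptom"])
default_target = route_concept("Condition", [])
```

In this sketch a concept in the Condition domain would default to `diseases`, but a symptom-like semantic type redirects it to `phenotypicFeatures`, which is the kind of ambiguity the abstract describes.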
Subjects
Unified Medical Language System, Humans, Semantics, Electronic Health Records, Precision Medicine/methods, Biomedical Translational Research, Medical Informatics/methods, Natural Language Processing, Alzheimer Disease
ABSTRACT
BACKGROUND: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. OBJECTIVE: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. METHODS: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. 
RESULTS: We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. CONCLUSIONS: Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where document corpora and curated ontologies are made available.
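A co-occurrence network in which edges are kept only when mutual information is relevant can be sketched with pointwise mutual information (PMI) over document-level concept sets. The abstract does not specify the exact measure or threshold, so the PMI formula, the `min_pmi` cutoff, and the toy documents below are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cooccurrence_edges(documents, min_pmi=0.0):
    """Return {(concept_a, concept_b): pmi} for pairs whose PMI exceeds min_pmi.

    documents is a list of concept lists, one per annotated abstract.
    """
    n_docs = len(documents)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in documents:
        unique = sorted(set(concepts))
        concept_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities as document frequencies.
        pmi = log2(n_ab * n_docs / (concept_counts[a] * concept_counts[b]))
        if pmi > min_pmi:
            edges[(a, b)] = pmi
    return edges


# Hypothetical annotated abstracts.
docs = [
    ["covid", "fever"],
    ["covid", "fever"],
    ["covid", "cough"],
    ["asthma", "cough"],
]
edges = cooccurrence_edges(docs)
```

With these toy documents, the pair ("cough", "covid") co-occurs less often than independence would predict, so its PMI is negative and no edge is created.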
Subjects
Algorithms, COVID-19, SARS-CoV-2, COVID-19/epidemiology, Humans, Pandemics, Information Storage and Retrieval/methods, Biomedical Research/methods, Unified Medical Language System, Search Engine
ABSTRACT
BACKGROUND: A huge amount of research is carried out nowadays in Artificial Intelligence to propose automated ways to analyse medical data with the aim of supporting doctors in delivering medical diagnoses. However, a main issue of these approaches is the lack of transparency and interpretability of the achieved results, making it hard to employ such methods for educational purposes. It is therefore necessary to develop new frameworks to enhance explainability in these solutions. RESULTS: In this paper, we present a novel full pipeline to automatically generate natural language explanations for medical diagnoses. The proposed solution starts from a clinical case description associated with a list of correct and incorrect diagnoses and, through the extraction of the relevant symptoms and findings, enriches the information contained in the description with verified medical knowledge from an ontology. Finally, the system returns a pattern-based explanation in natural language which elucidates why the correct (incorrect) diagnosis is the correct (incorrect) one. The main contribution of the paper is twofold: first, we propose two novel linguistic resources for the medical domain (i.e., a dataset of 314 clinical cases annotated with the medical entities from UMLS, and a database of biological boundaries for common findings), and second, a full Information Extraction pipeline to extract symptoms and findings from the clinical cases and match them with the terms in a medical ontology and with the biological boundaries. An extensive evaluation of the proposed approach shows that our method outperforms comparable approaches. CONCLUSIONS: Our goal is to offer an AI-assisted educational support framework that trains clinical residents to formulate sound and exhaustive explanations for their diagnoses to patients.
Subjects
Natural Language Processing, Humans, Diagnosis, Artificial Intelligence, Diagnostic Errors, Unified Medical Language System
ABSTRACT
BACKGROUND: Traditional literature-based discovery is based on connecting knowledge pairs extracted from separate publications via a common midpoint to derive previously unseen knowledge pairs. To avoid the overgeneration often associated with this approach, we explore an alternative method based on word evolution. Word evolution examines the changing contexts of a word to identify changes in its meaning or associations. We investigate the possibility of using changing word contexts to detect drugs suitable for repurposing. RESULTS: Word embeddings, which represent a word's context, are constructed from chronologically ordered publications in MEDLINE at bi-monthly intervals, yielding a time series of word embeddings for each word. Focusing on clinical drugs only, any drugs repurposed in the final time segment of the time series are annotated as positive examples. The decision regarding the drug's repurposing is based either on the Unified Medical Language System (UMLS), or semantic triples extracted using SemRep from MEDLINE. CONCLUSIONS: The annotated data allows deep learning classification, with a 5-fold cross-validation, to be performed and multiple architectures to be explored. Performance of 65% using UMLS labels, and 81% using SemRep labels is attained, indicating the technique's suitability for the detection of candidate drugs for repurposing. The investigation also shows that different architectures are linked to the quantities of training data available and therefore that different models should be trained for every annotation approach.
Subjects
Drug Repositioning, Humans, Unified Medical Language System, MEDLINE, Deep Learning, Natural Language Processing, Semantics
ABSTRACT
BACKGROUND: In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research. Thus, terminology mapping is a necessary step to harmonize heterogeneous data. For example, mapping between the Medical Dictionary for Regulatory Activities (MedDRA) and the International Classification of Diseases (ICD) is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS: The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. For the 2400 sampled unmapped terms, 56 of the 2400 MedDRA Preferred Terms (PT) could have exact match terms from ICD. CONCLUSION: Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PTs, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.
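One common way to derive a cross-terminology mapping through the UMLS is to join terms that share a concept identifier, and then report what share of source terms received a mapping. The sketch below illustrates that join and the coverage figure; it is an assumption about the general technique, not the authors' algorithm, and every code in it (`PT_*`, `ICD_*`, `CUI_*`) is hypothetical.

```python
from collections import defaultdict


def map_via_shared_concept(source_to_concept, target_to_concept):
    """Map each source term to the target codes that share its concept identifier."""
    targets_by_concept = defaultdict(set)
    for target_code, concept in target_to_concept.items():
        targets_by_concept[concept].add(target_code)
    return {
        source_term: sorted(targets_by_concept[concept])
        for source_term, concept in source_to_concept.items()
        if concept in targets_by_concept
    }


def coverage(mapping, all_source_terms):
    """Share of source terms that received at least one mapping."""
    return len(mapping) / len(all_source_terms) if all_source_terms else 0.0


# Hypothetical preferred terms and ICD codes with their concept identifiers.
meddra_terms = {"PT_nausea": "CUI_1", "PT_rash": "CUI_2", "PT_rare_event": "CUI_9", "PT_other": "CUI_8"}
icd_codes = {"ICD_A": "CUI_1", "ICD_B": "CUI_2", "ICD_C": "CUI_2"}

mapping = map_via_shared_concept(meddra_terms, icd_codes)
mapped_share = coverage(mapping, meddra_terms)
```

A join of this kind cannot tell exact matches from broader or narrower ones, which is why a separate quality analysis of the mapped pairs, as reported in the abstract, is still needed.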
Subjects
Adverse Drug Reaction Reporting Systems, International Classification of Diseases, Humans, Unified Medical Language System, Pharmacovigilance, Algorithms
ABSTRACT
OBJECTIVE: In this study, we investigate the potential of large language models (LLMs) to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. MATERIALS AND METHODS: Drawing on the wealth of the Unified Medical Language System knowledge graph and harnessing cutting-edge LLMs, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of 3 steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. RESULTS: Through rigorous evaluations of diverse downstream tasks, we demonstrate consistent and substantial improvements over the previous state of the art for semantic textual similarity (STS), biomedical concept representation (BCR), and clinical named entity linking, across 15+ datasets. Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. DISCUSSION: Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 become a valuable tool for future biomedical applications. CONCLUSION: In this article, we introduced BioLORD-2023, a state-of-the-art model for STS and BCR designed for the clinical domain.