Results 1 - 20 of 1,157
1.
Yearb Med Inform ; 30(1): 185-190, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34479390

ABSTRACT

OBJECTIVE: To select, present and summarize some of the best papers in the field of Knowledge Representation and Management (KRM) published in 2020. METHODS: A comprehensive and standardized review of the medical informatics literature was performed to select the most interesting papers of KRM published in 2020, based on PubMed queries. This review was conducted according to the IMIA Yearbook guidelines. RESULTS: Four best papers were selected among 1,175 publications. In contrast with the papers selected last year, the four best papers of 2020 demonstrated a significant focus on methods and tools for ontology curation and design. The usual KRM application domains (bioinformatics, machine learning, and electronic health records) were also represented. CONCLUSION: In 2020, ontology curation emerges as a significant topic of research interest. Bioinformatics, machine learning, and electronic health records remain significant research areas in the KRM community with various applications. Knowledge representations are key to advancing machine learning by providing context and to developing novel bioinformatics metrics. As in 2019, representations serve a great variety of applications across many medical domains, with actionable results and now with growing adherence to the open science initiative.


Subjects
Biological Ontologies , Health Information Interoperability , Knowledge Management , Genomics , Humans , Medical Informatics , Unified Medical Language System
2.
AMIA Annu Symp Proc ; 2021: 515-524, 2021.
Article in English | MEDLINE | ID: mdl-34457167

ABSTRACT

Natural language is continually changing. Given the prevalence of unstructured, free-text clinical notes in the healthcare domain, understanding the aspects of this change is of critical importance to clinical Natural Language Processing (NLP) systems. In this study, we examine two previously described semantic change laws based on word frequency and polysemy, and analyze how they apply to the clinical domain. We also explore a new facet of change: whether domain-specific clinical terms exhibit different change patterns compared to general-purpose English. Using a corpus spanning eighteen years of clinical notes, we find that the previously described laws of semantic change hold for our data set. We also find that domain-specific biomedical terms change faster compared to general English words.


Subjects
Natural Language Processing , Semantics , Humans , Language , Unified Medical Language System
3.
J Biomed Inform ; 120: 103867, 2021 08.
Article in English | MEDLINE | ID: mdl-34284119

ABSTRACT

BACKGROUND: Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GloVe) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings. OBJECTIVE: Despite advances in contextualized language model based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications. METHODS: We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts. RESULTS: Both in qualitative and quantitative evaluations we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board.
CONCLUSION: We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.


Subjects
Natural Language Processing , Unified Medical Language System , Humans , Language , Medical Subject Headings , Semantics
4.
Artif Intell Med ; 117: 102083, 2021 07.
Article in English | MEDLINE | ID: mdl-34127232

ABSTRACT

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.


Subjects
Natural Language Processing , Systematized Nomenclature of Medicine , Electronic Health Records , Information Storage and Retrieval , Unified Medical Language System
5.
Stud Health Technol Inform ; 281: 178-182, 2021 May 27.
Article in English | MEDLINE | ID: mdl-34042729

ABSTRACT

We present an approach called MTP (multiple translation paths) that aims to assist human translation in SNOMED CT localisation projects using free, web-based machine translation tools. For a chosen target language, MTP generates a scored output of translation candidates (TCs) for each input concept. This paper describes the basic idea of MTP and the distribution of its output TCs, and discusses typical examples with German as the target language. The MTP approach capitalises on combinatorial growth by the combination of input languages, support languages, and translation engines. We applied MTP to the SNOMED CT Starter Set, using Google Translator, DeepL and Systran, together with the four source languages English, Spanish, Swedish and French, and Danish, Dutch, Norwegian, Italian, Portuguese, Polish and Russian as support languages. The descriptive assessment of TC variety, together with an analysis of typical results, is the focus of this paper. MTP defines, for each input concept, translation paths (TPs) by the combination of input languages, support languages and translation engines, resulting in 91 translation results with various degrees of coincidence (cardinality). Most configurations produce an average number of TCs indicating that the same TC is often derived via different translation paths. Combinations of translation engines result in distributions with a higher number of distinct TCs per concept. We present work in progress on using machine translation (MT) for terminology translation, by leveraging several free MT tools fed by different languages and language combinations. A first qualitative analysis was promising and supports our hypothesis that majority voting applied to many translation candidates yields higher quality results than a single engine and input language.
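
Illustrative sketch (not taken from the paper): the abstract describes combining source languages, support languages and engines into translation paths and scoring candidates by how many paths agree. The Python below shows one way that counting could look. The engine names, the toy lookup table standing in for real MT calls, and the choice to model a path as (source, optional support language, engine) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Toy stand-in for real machine translation calls (hypothetical data).
# Key: (source language, support language or None, engine name).
TOY_OUTPUT = {
    ("en", None, "engine_a"): "Herzinfarkt",
    ("en", "nl", "engine_a"): "Herzinfarkt",
    ("es", None, "engine_a"): "Myokardinfarkt",
    ("en", None, "engine_b"): "Herzinfarkt",
    ("en", "nl", "engine_b"): "Herzanfall",
    ("es", None, "engine_b"): "Myokardinfarkt",
}


def translation_paths(sources, supports, engines):
    """Enumerate every (source, support-or-None, engine) combination."""
    return list(product(sources, [None, *supports], engines))


def score_candidates(paths, lookup):
    """Count how many paths produce each translation candidate."""
    return Counter(lookup[p] for p in paths if p in lookup)


paths = translation_paths(["en", "es"], ["nl"], ["engine_a", "engine_b"])
scores = score_candidates(paths, TOY_OUTPUT)
best, votes = scores.most_common(1)[0]
```

In this toy setup the candidate reached by the most paths wins the vote; a real pipeline would replace the lookup table with engine calls and would likely need normalisation (case, spelling variants) before counting.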


Subjects
Language , Systematized Nomenclature of Medicine , Humans , Russia , Translating , Unified Medical Language System
6.
BMC Med Inform Decis Mak ; 21(1): 145, 2021 05 04.
Article in English | MEDLINE | ID: mdl-33947365

ABSTRACT

BACKGROUND: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain such as the UMLS metathesaurus or SNOMED CT are widely used for this purpose, but with limitations such as lexical ambiguity of clinical terms. However, most of them are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical text by the clinical specialty to which it belongs. RESULTS: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical subdomains. We use PubMed queries that generate sub-domain specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining improvements of 6 percentage points in the F-measure compared to the baseline using Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks. CONCLUSION: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.


Subjects
Natural Language Processing , Unified Medical Language System , Humans , Language , Systematized Nomenclature of Medicine , Vocabulary, Controlled
7.
Comput Biol Med ; 134: 104433, 2021 07.
Article in English | MEDLINE | ID: mdl-34004575

ABSTRACT

BACKGROUND: Word vectors or word embeddings are n-dimensional representations of words and form the backbone of Natural Language Processing of textual data. This research experiments with algorithms that augment word vectors with lexical constraints that are popular in NLP research and clinical domain constraints derived from the Unified Medical Language System (UMLS). It also compares the performance of the augmented vectors with Bio + Clinical BERT vectors which have been trained and fine-tuned on clinical datasets. METHODS: Word2vec vectors are generated for words in a publicly available de-identified Electronic Health Records (EHR) dataset and augmented by ontologies using three algorithms that have fundamentally different approaches to vector augmentation. The augmented vectors are then evaluated alongside publicly available Bio + Clinical BERT on their correlation with human-annotated lists using Spearman's correlation coefficient. They are also evaluated on the downstream task of Named Entity Recognition (NER). Quantitative and empirical evaluations are used to highlight the strengths and weaknesses of the different approaches. RESULTS: The counter-fitted word2vec vectors augmented with information from the UMLS ontology produced the best correlation overall with human-annotated evaluation lists (Spearman's correlation of 0.733 with mini mayo-doctors' annotation) while Bio + Clinical BERT produces the best results in the NER task (F1 of 0.87 and 0.811 on the i2b2 2010 and i2b2 2012 datasets respectively) in our experiments. CONCLUSION: Clinically adapted word2vec vectors successfully encapsulate concepts of lexical and clinical synonymy and antonymy and to a smaller extent, hyponymy and hypernymy. Bio + Clinical BERT vectors perform better at NER and avoid out-of-vocabulary words.
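
Illustrative sketch (not taken from the paper): the evaluation above compares model similarity scores with human-annotated lists using Spearman's correlation coefficient. A minimal Python version of that coefficient is shown below. It assumes there are no tied values, and the four score pairs are invented toy data.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy data: human relatedness ratings vs. model cosine similarities
# for the same four term pairs.
human = [4.0, 1.0, 3.0, 2.0]
model = [0.9, 0.2, 0.5, 0.6]
rho = spearman(human, model)
```

With tied ratings, which are common in human-annotated lists, average ranks and the Pearson correlation of the ranks would be needed instead of this shortcut formula.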


Subjects
Natural Language Processing , Unified Medical Language System , Algorithms , Electronic Health Records , Humans
8.
BMC Med Inform Decis Mak ; 21(1): 128, 2021 04 15.
Article in English | MEDLINE | ID: mdl-33858409

ABSTRACT

BACKGROUND: Semantic categorization analysis of clinical trial eligibility criteria based on natural language processing technology is crucial for optimizing clinical trial design and building automated patient recruitment systems. However, most related research has focused on English eligibility criteria, and to the best of our knowledge, no research has studied Chinese eligibility criteria. Thus, in this study, we aimed to explore the semantic categories of Chinese eligibility criteria. METHODS: We downloaded the clinical trial registration files from the website of the Chinese Clinical Trial Registry (ChiCTR) and extracted both the Chinese eligibility criteria and the corresponding English eligibility criteria. We represented the criteria sentences based on the Unified Medical Language System semantic types and applied a hierarchical clustering algorithm to induce semantic categories. Furthermore, to explore the classification performance of Chinese eligibility criteria with our developed semantic categories, we implemented multiple classification algorithms, including four baseline machine learning algorithms (LR, NB, kNN, SVM), three deep learning algorithms (CNN, RNN, FastText) and two pre-trained language models (BERT, ERNIE). RESULTS: In total, we developed 44 semantic categories, summarized 8 topic groups, and investigated the average incidence and prevalence in 272 hepatocellular carcinoma related Chinese clinical trials. Compared with the categories previously proposed for English eligibility criteria, 13 novel categories were identified in Chinese eligibility criteria. The classification results show that most semantic categories performed quite well; the pre-trained language model ERNIE achieved the best performance, with a macro-average F1 score of 0.7980 and a micro-average F1 score of 0.8484.
CONCLUSION: As a pilot study of Chinese eligibility criteria analysis, we developed the 44 semantic categories by hierarchical clustering for the first time, and validated their classification capacity with multiple classification algorithms.
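
Illustrative sketch (not taken from the paper): the results report both macro-average and micro-average F1. The Python below shows how the two averages differ for single-label classification. The category names and predictions are invented toy data, not the study's 44 categories.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_micro_f1(gold, pred):
    """Macro F1 averages per-class F1; micro F1 pools counts over classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative
    labels = set(gold) | set(pred)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["Age", "Age", "Age", "Disease", "Therapy"]
pred = ["Age", "Age", "Age", "Disease", "Disease"]
macro, micro = macro_micro_f1(gold, pred)
```

Here the rare class is missed entirely, so the macro average (about 0.56) falls well below the micro average (0.80). This is the usual reason a micro score exceeds a macro score when categories are imbalanced.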


Subjects
Semantics , Unified Medical Language System , China , Humans , Machine Learning , Natural Language Processing , Pilot Projects
9.
Comput Biol Med ; 133: 104361, 2021 06.
Article in English | MEDLINE | ID: mdl-33872968

ABSTRACT

It is a well-known fact that there are often side effects to the long-term use of certain medications. These side effects can vary from mild dizziness to, at its most serious, death. The main factors that cause these side effects are the chemical composition, the mode of treatment, and the dose. The dynamics that govern the reaction of a drug heavily depend on its structural composition. The structural composition of a drug is defined by the structural arrangement of the corresponding basic chemical functional groups. Hence, it is essential to investigate the effect of chemical functional groups on the side effects to synthesize drugs with minimal side effects. To support this process, we developed a framework named MedFused (Medical Functional Group Side Effects Database), which is composed of drugs (International Union of Pure and Applied Chemistry: IUPAC nomenclature), functional groups, and the side effects along with other valuable information such as STITCH (search tool for interactions of chemicals) compound ID, and the Unified Medical Language System (UMLS) concept ID. We develop a web framework that functions on the MedFused system database on top of the Django web framework. Our web server supports functionalities such as exploring the database and descriptive graph tools, which provide additional exploration capabilities to the framework. These descriptive tools include histograms, pie charts, and association charts, which further explore the system. Above these basic tools, MedFused includes functionality to discover the drug's "chemical functional group" impact on "side effects". The method conducts an association rule analysis on the relationships by considering the MedFused database as a collection of transactions. A specific transaction has a list of the functional groups of a drug and one side effect. Hence, a drug that has more than one side effect forms multiple transactions. 
Next, we generate a binary feature matrix based on the transactions and introduce a pruning mechanism to consider only the potential functional groups and side effects based on their support (frequencies), subjected to a predefined threshold (which can be changed accordingly). As the current version of the MedFused database has a limited number of side effects (hence low support), we restricted the analysis to identify the functional groups which have the most potential of causing a particular side effect, based on a confidence value of 1. Our framework can be further extended with more functions and tools as it supports the model view controller (MVC) architecture, which is inherited from the Django Python web framework.
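
Illustrative sketch (not taken from the paper): the abstract treats each (functional groups of a drug, one side effect) pair as a transaction and keeps rules that pass a support threshold and have a confidence of 1. The Python below computes support and confidence for single-group rules. The transactions, the threshold value and the restriction to single functional groups are assumptions made for illustration.

```python
# Toy transactions: (functional groups of a drug, one side effect).
transactions = [
    ({"hydroxyl", "amine"}, "nausea"),
    ({"hydroxyl"}, "nausea"),
    ({"amine", "carboxyl"}, "dizziness"),
    ({"hydroxyl", "carboxyl"}, "nausea"),
]


def rule_stats(group, effect, transactions):
    """Support and confidence of the rule {group} -> effect."""
    with_group = [e for groups, e in transactions if group in groups]
    both = sum(1 for e in with_group if e == effect)
    support = both / len(transactions)
    confidence = both / len(with_group) if with_group else 0.0
    return support, confidence


def rules_with_confidence_one(transactions, min_support):
    """List (group, effect, support) for rules with confidence exactly 1."""
    groups = set().union(*(g for g, _ in transactions))
    effects = {e for _, e in transactions}
    found = []
    for group in sorted(groups):
        for effect in sorted(effects):
            support, confidence = rule_stats(group, effect, transactions)
            if support >= min_support and confidence == 1.0:
                found.append((group, effect, support))
    return found


rules = rules_with_confidence_one(transactions, min_support=0.5)
```

In the toy data only the hydroxyl group always co-occurs with the same side effect, so it is the only rule that survives.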


Subjects
Drug-Related Side Effects and Adverse Reactions , Pharmaceutical Preparations , Databases, Factual , Humans , Unified Medical Language System
10.
Artif Intell Med ; 113: 102033, 2021 03.
Article in English | MEDLINE | ID: mdl-33685589

ABSTRACT

Sentiments associated with assessments and observations recorded in a clinical narrative can often indicate a patient's health status. To perform sentiment analysis on clinical narratives, domain-specific knowledge concerning meanings of medical terms is required. In this study, semantic types in the Unified Medical Language System (UMLS) are exploited to improve lexicon-based sentiment classification methods. For sentiment classification using SentiWordNet, the overall accuracy is improved from 0.582 to 0.710 by using logistic regression to determine appropriate polarity scores for UMLS 'Disorders' semantic types. For sentiment classification using a trained lexicon, when disorder terms in a training set are replaced with their semantic types, classification accuracies are improved on some data segments containing specific semantic types. To select an appropriate classification method for a given data segment, classifier combination is proposed. Using classifier combination, classification accuracies are improved on most data segments, with the overall accuracy of 0.882 being obtained.


Subjects
Semantics , Unified Medical Language System , Humans
11.
PLoS One ; 16(2): e0246310, 2021.
Article in English | MEDLINE | ID: mdl-33561139

ABSTRACT

Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain performance similar to that of the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entity types are rarer and not specifically covered in UMLS.


Subjects
Artificial Intelligence , Recognition, Psychology , Unsupervised Machine Learning , Humans , Models, Theoretical , Supervised Machine Learning , Terminology as Topic , Transfer, Psychology , Unified Medical Language System
12.
BMC Med Inform Decis Mak ; 21(1): 69, 2021 02 22.
Article in English | MEDLINE | ID: mdl-33618727

ABSTRACT

BACKGROUND: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances the access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus. METHODS: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models. RESULTS: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities (13.98% are nested entities). Regarding IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±00.99) to 86.74% (±00.19) of average F-measure. CONCLUSIONS: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html . The methods are generalizable to other languages with similar available sources.
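
Illustrative sketch (not taken from the paper): inter-annotator agreement is reported above as F-measure under strict and relaxed matching. The Python below shows one common reading of those terms: strict means identical offsets and label, relaxed means the same label with overlapping spans. That definition is an assumption here, since the abstract does not spell it out, and the annotations are toy data.

```python
def f_measure(ann_a, ann_b, relaxed=False):
    """F-measure between two annotation sets of (start, end, label) tuples."""

    def match(x, y):
        if x[2] != y[2]:
            return False
        if relaxed:
            return x[0] < y[1] and y[0] < x[1]  # character spans overlap
        return x[:2] == y[:2]  # identical start and end offsets

    matched_a = sum(any(match(a, b) for b in ann_b) for a in ann_a)
    matched_b = sum(any(match(b, a) for a in ann_a) for b in ann_b)
    # Treat annotator A as the reference and annotator B as the response.
    recall = matched_a / len(ann_a) if ann_a else 0.0
    precision = matched_b / len(ann_b) if ann_b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ann_a = [(0, 9, "DISO"), (15, 24, "CHEM"), (30, 38, "PROC")]
ann_b = [(0, 9, "DISO"), (17, 24, "CHEM"), (40, 45, "ANAT")]
strict = f_measure(ann_a, ann_b)
relaxed = f_measure(ann_a, ann_b, relaxed=True)
```

The CHEM mention differs only in its start offset, so it counts under relaxed matching but not under strict matching, which is why relaxed agreement is the higher of the two.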


Subjects
Natural Language Processing , Unified Medical Language System , Evidence-Based Medicine , Humans , Language , Semantics
13.
J Biomed Inform ; 114: 103684, 2021 02.
Article in English | MEDLINE | ID: mdl-33450387

ABSTRACT

INTRODUCTION: Concept normalization is the task of linking terms from textual medical documents to their concept in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which poses a problem for languages other than English. OBJECTIVE: We present a system for concept normalization in French. We consider textual mentions already extracted and labeled by a named entity recognition system, and we classify these mentions with a UMLS concept unique identifier. We take advantage of the multilingual nature of available terminologies and embedding models to improve concept normalization in French without translation or direct supervision. MATERIALS AND METHODS: We consider the task as a highly-multiclass classification problem. The terms are encoded with contextualized embeddings and classified via cosine similarity and softmax. A first step uses a subset of the terminology to finetune the embeddings and train the model. A second step adds the entire target terminology, and the model is trained further with hard negative selection and softmax sampling. RESULTS: On two corpora from the Quaero FrenchMed benchmark, we show that our approach can lead to good results even with no labeled data at all; and that it outperforms existing supervised methods with labeled data. DISCUSSION: Training the system with both French and English terms improves by a large margin the performance of the system on a French benchmark, regardless of the way the embeddings were pretrained (French, English, multilingual). Our distantly supervised method can be applied to any kind of documents or medical domain, as it does not require any concept-labeled documents. CONCLUSION: These experiments pave the way for simpler and more effective multilingual approaches to processing medical texts in languages other than English.
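
Illustrative sketch (not taken from the paper): the methods describe classifying encoded mentions "via cosine similarity and softmax". The Python below shows that scoring step in isolation with two-dimensional toy vectors. The concept identifiers are placeholders rather than real UMLS CUIs, and the temperature value is an assumption. The encoder and the training procedure are out of scope.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_mention(mention_vec, concept_vecs, temperature=0.1):
    """Return the most probable concept id and its probability."""
    ids = list(concept_vecs)
    sims = [cosine(mention_vec, concept_vecs[c]) for c in ids]
    probs = softmax(sims, temperature)
    best = max(range(len(ids)), key=probs.__getitem__)
    return ids[best], probs[best]


# Placeholder concept embeddings (hypothetical identifiers and values).
concept_vecs = {
    "CUI_A": [1.0, 0.0],
    "CUI_B": [0.0, 1.0],
    "CUI_C": [0.7, 0.7],
}
cui, prob = normalize_mention([0.9, 0.1], concept_vecs)
```

Because cosine similarities lie in a narrow range, a temperature below 1 is often used to sharpen the resulting distribution; whether the authors do so is not stated in the abstract.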


Subjects
Multilingualism , Unified Medical Language System , Language , Natural Language Processing
14.
Injury ; 52(2): 205-212, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33131794

ABSTRACT

INTRODUCTION: Trauma injury severity scores are currently calculated retrospectively from the electronic health record (EHR) using manual annotation by certified trauma coders. Natural language processing (NLP) of clinical documents in the EHR may enable automated injury scoring. We hypothesize that NLP with machine learning can discriminate between cases of severe and non-severe injury to the thorax after trauma. METHODS: Clinical documents from a trauma center were examined between 2014 and 2018. Severe chest injury was defined as a thorax abbreviated injury score (AIS) >2 and served as the reference standard for supervised learning. Free text unigrams and concept unique identifiers (CUIs) from the Unified Medical Language System (UMLS) were extracted from clinical documents collected at one hour, four hours, and eight hours after patient arrival to the emergency department. Logistic regression models with elastic net regularization were tuned to maximize area under the receiver operating characteristic curve (AUROC) using 10-fold cross-validation on the training dataset (80%) and tested on a hold-out 20% dataset. RESULTS: There were 6,891 traumas that met inclusion criteria. The complete data corpus consisted of 473,694 documents. Models trained using the first hour of data had a mean AUROC of 0.88 (95%CI [0.86, 0.89]); model discrimination and reclassification from the first hour significantly improved after eight hours with a mean AUROC of 0.94 (95%CI [0.93, 0.95]). Performance of models using CUIs was similar to that of models using unigrams (p>0.05). Models demonstrated excellent clinical face validity. CONCLUSIONS: Both CUIs and unigrams demonstrated excellent discrimination in predicting severity of chest injury using the first eight hours of clinical documents. Our model demonstrates that automated anatomical injury scoring is feasible and may be used for aggregation of data for trauma research and quality programs.


Subjects
Natural Language Processing , Thoracic Injuries , Electronic Health Records , Humans , Retrospective Studies , Unified Medical Language System
15.
J Biomed Inform ; 113: 103665, 2021 01.
Article in English | MEDLINE | ID: mdl-33333323

ABSTRACT

BACKGROUND: There has been increasing interest in machine learning based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora due to lack of a radiology-specific corpus. PURPOSE: We examined the potential of Radiopaedia to serve as a general radiology corpus to produce radiology specific word embeddings that could be used to enhance performance on a NLP task on radiological text. MATERIALS AND METHODS: Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using a GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. The McNemar's test with continuity and the Benjamini-Hochberg correction and a 5×2 cross validation paired two-tailed t-test were used to assess statistical significance. RESULTS: For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01) whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG based model at 50, 100, 200, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively). CONCLUSION: We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. 
Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
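
Illustrative sketch (not taken from the paper): labeling performance above is reported as exact match accuracy and Hamming loss, two standard multi-label metrics. The Python below computes both on toy data. The label names and predictions are invented for illustration.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of items whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hamming_loss(gold, pred, labels):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(len(g ^ p) for g, p in zip(gold, pred))  # symmetric difference
    return wrong / (len(gold) * len(labels))


labels = ["chest", "neuro", "msk", "abdomen"]
gold = [{"chest"}, {"neuro", "msk"}, {"msk"}]
pred = [{"chest"}, {"neuro"}, {"chest"}]

accuracy = exact_match_accuracy(gold, pred)
loss = hamming_loss(gold, pred, labels)
```

Exact match gives no credit for the partly correct second prediction, while Hamming loss counts only the one missing label there. Reporting both gives a strict and a lenient view of the same predictions.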


Subjects
Natural Language Processing , Radiology , Machine Learning , Semantics , Unified Medical Language System
16.
J Biomed Inform ; 114: 103669, 2021 02.
Article in English | MEDLINE | ID: mdl-33359111

ABSTRACT

Over the last decades clinical research has been driven by informatics changes nourished by distinct research endeavors. Inherent to this evolution, several issues have been the focus of a variety of studies: multi-location patient data access, interoperability between terminological and classification systems, and harmonization of clinical practice and records. With these problems in mind, the Data Safe Haven paradigm emerged to promote a new architecture, better reasoning and safe and easy access to distinct Clinical Data Repositories. This study aims to present a novel solution for clinical search harmonization within a safe environment, making use of a hybrid coding taxonomy that enables researchers to collect information from multiple repositories based on a clinical domain query definition. Results show that it is possible to query multiple repositories using a single query definition based on clinical domains and the capabilities of the Unified Medical Language System, although it leads to deterioration of the framework response times. Participants of a Focus Group and a System Usability Scale questionnaire rated the framework with a median value of 72.5, indicating the hybrid coding taxonomy could be enriched with additional metadata to further improve the refinement of the results and enable the possibility of using this system as a data quality tagging mechanism.


Subjects
Metadata , Unified Medical Language System , Humans , Infant, Newborn
17.
J Am Med Inform Assoc ; 28(3): 516-532, 2021 03 01.
Article in English | MEDLINE | ID: mdl-33319905

ABSTRACT

OBJECTIVES: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity (words or phrases that may refer to different concepts) has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research. MATERIALS AND METHODS: We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language. RESULTS: We found that <15% of strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity. The percentage of strings in common between any pair of datasets ranged from 2% to only 36%; of these, 40% were annotated with different sets of concepts, severely limiting generalization. Finally, we observed 12 distinct types of ambiguity, distributed unequally across the available datasets, reflecting diverse linguistic and medical phenomena. DISCUSSION: Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods.
CONCLUSIONS: Our findings identify 3 opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.


Subjects
Datasets as Topic , Electronic Health Records , Terminology as Topic , Unified Medical Language System , Deep Learning , Natural Language Processing , Semantics , Vocabulary, Controlled
18.
J Am Med Inform Assoc ; 28(4): 812-823, 2021 03 18.
Article in English | MEDLINE | ID: mdl-33367705

ABSTRACT

OBJECTIVE: The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. MATERIALS AND METHODS: We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. RESULTS: UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). CONCLUSIONS: This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.


Subjects
Biomedical Research , Information Storage and Retrieval/methods , Natural Language Processing , Unified Medical Language System , Data Management
19.
BMC Med Inform Decis Mak ; 20(Suppl 10): 301, 2020 12 15.
Article in English | MEDLINE | ID: mdl-33319696

ABSTRACT

Biological and biomedical ontologies and terminologies are used to organize and store various domain-specific knowledge to provide standardization of terminology usage and to improve interoperability. The growing number of such ontologies and terminologies and their increasing adoption in clinical, research and healthcare settings call for effective and efficient quality assurance and semantic enrichment techniques of these ontologies and terminologies. In this editorial, we provide an introductory summary of nine articles included in this supplement issue for quality assurance and enrichment of biological and biomedical ontologies and terminologies. The articles cover a range of standards including SNOMED CT, National Cancer Institute Thesaurus, Unified Medical Language System, North American Association of Central Cancer Registries and OBO Foundry Ontologies.


Subjects
Biological Ontologies , Humans , Semantics , Systematized Nomenclature of Medicine , Unified Medical Language System , Vocabulary, Controlled
20.
J Am Med Inform Assoc ; 27(10): 1538-1546, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33029614

ABSTRACT

OBJECTIVE: The study sought to explore the use of deep learning techniques to measure the semantic relatedness between Unified Medical Language System (UMLS) concepts. MATERIALS AND METHODS: Concept sentence embeddings were generated for UMLS concepts by applying the word embedding models BioWordVec and various flavors of BERT to concept sentences formed by concatenating UMLS terms. Graph embeddings were generated by the graph convolutional networks and 4 knowledge graph embedding models, using graphs built from UMLS hierarchical relations. Semantic relatedness was measured by the cosine between the concepts' embedding vectors. Performance was compared with 2 traditional path-based (shortest path and Leacock-Chodorow) measurements and the publicly available concept embeddings, cui2vec, generated from large biomedical corpora. The concept sentence embeddings were also evaluated on a word sense disambiguation (WSD) task. Reference standards used included the semantic relatedness and semantic similarity datasets from the University of Minnesota, concept pairs generated from the Standardized MedDRA Queries and the MeSH (Medical Subject Headings) WSD corpus. RESULTS: Sentence embeddings generated by BioWordVec outperformed all other methods used individually in semantic relatedness measurements. Graph convolutional network graph embedding uniformly outperformed path-based measurements and was better than some word embeddings for the Standardized MedDRA Queries dataset. When used together, combined word and graph embedding achieved the best performance in all datasets. For WSD, the enhanced versions of BERT outperformed BioWordVec. CONCLUSIONS: Word and graph embedding techniques can be used to harness terms and relations in the UMLS to measure semantic relatedness between concepts. Concept sentence embedding outperforms path-based measurements and cui2vec, and can be further enhanced by combining with graph embedding.


Subjects
Deep Learning , Semantics , Unified Medical Language System , Medical Subject Headings , Natural Language Processing , ROC Curve , Terminology as Topic