1.
J Biomed Inform ; 137: 104252, 2023 01.
Article in English | MEDLINE | ID: mdl-36464228

ABSTRACT

Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing for both translational information extraction applications and providing context for downstream tasks like relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.


Subjects
Data Mining , Text Messaging , Natural Language Processing
2.
J Biomed Inform ; 143: 104362, 2023 07.
Article in English | MEDLINE | ID: mdl-37146741

ABSTRACT

Scientific literature presents a wealth of information yet to be explored. As the number of researchers increases with each passing year and more publications are released, specialized fields of research are becoming more prevalent. As this trend continues, it further separates interdisciplinary publications and makes keeping up to date with the literature a laborious task. Literature-based discovery (LBD) aims to mitigate these concerns by promoting information sharing among non-interacting literature while extracting potentially meaningful information. Furthermore, recent advances in neural network architectures and data representation techniques have fueled their respective research communities in achieving state-of-the-art performance in many downstream tasks. However, neural network-based methods for LBD remain to be explored. We introduce and explore a deep learning neural network-based approach for LBD. Additionally, we investigate various approaches to represent terms as concepts and analyze the effect of feature scaling the representations input into our model. We compare the evaluation performance of our method on five hallmarks of cancer datasets utilized for closed discovery. Our results show that the representation chosen as input to our model affects evaluation performance. We found that feature scaling our input representations increases evaluation performance and decreases the number of epochs needed to achieve model generalization. We also explore two approaches to represent model output. We found that reducing the model's output to capture a subset of concepts improved evaluation performance at the cost of model generalizability. We also compare the efficacy of our method on the five hallmarks of cancer datasets to a set of randomly chosen relations between concepts. We found these experiments confirm our method's suitability for LBD.


Subjects
Deep Learning , Neoplasms , Humans , Neural Networks, Computer , Knowledge Discovery/methods , Publications
3.
J Biomed Inform ; 130: 104062, 2022 06.
Article in English | MEDLINE | ID: mdl-35413440

ABSTRACT

MOTIVATION: Training domain-specific named entity recognition (NER) models requires high-quality, hand-curated gold-standard datasets, which are time-consuming and expensive to create. Furthermore, the storage and memory required to deploy NLP models can be prohibitive when the number of tasks is large. In this work, we explore utilizing multi-task learning to reduce the amount of training data needed to train new domain-specific models. We evaluate our system across 22 distinct biomedical NER datasets and evaluate the extent to which transfer learning helps task performance using two forms of ablation. RESULTS: We found that multitasking models generally do not improve performance, but in many cases perform on par with single-task models. However, we show that in some cases, new unseen tasks can be trained as a single model using less data by starting with weights from a multitask model, and that doing so can improve performance. AVAILABILITY: The software underlying this article is available at: https://github.com/NLPatVCU/multitasking_bert-1.


Subjects
Natural Language Processing , Software
4.
J Biomed Inform ; 126: 103970, 2022 02.
Article in English | MEDLINE | ID: mdl-34920128

ABSTRACT

Systematic reviews are labor-intensive processes to combine all knowledge about a given topic into a coherent summary. Despite the high labor investment, they are necessary to create an exhaustive overview of current evidence relevant to a research question. In this work, we evaluate three state-of-the-art supervised multi-label sequence classification systems to automatically identify 24 different experimental design factors for the categories of Animal, Dose, Exposure, and Endpoint from journal articles describing the experiments related to toxicity and health effects of environmental agents. We then present an in-depth analysis of the results, evaluating the lexical diversity of the design parameters with respect to model performance, evaluating the impact of tokenization and non-contiguous mentions, and finally evaluating the dependencies between entities within the category entities. We demonstrate that, in general, algorithms that use embedded representations of the sequences outperform statistical algorithms, but that even these algorithms struggle with lexically diverse entities.


Subjects
Algorithms , Natural Language Processing , Systematic Reviews as Topic
5.
Molecules ; 27(17), 2022 Aug 31.
Article in English | MEDLINE | ID: mdl-36080376

ABSTRACT

Reducing the use of solvents is an important aim of green chemistry. Using micelles self-assembled from amphiphilic molecules dispersed in water (considered a green solvent) has facilitated reactions of organic compounds. When performing reactions in micelles, the hydrophobic effect can considerably accelerate apparent reaction rates, as well as enhance selectivity. Here, we review micellar reaction media and their potential role in sustainable chemical production. The focus of this review is applications of engineered amphiphilic systems for reactions (surface-active ionic liquids, designer surfactants, and block copolymers) as reaction media. Micelles are a versatile platform for performing a large array of organic chemistries using water as the bulk solvent. Building on this foundation, synthetic sequences combining several reaction steps in one pot have been developed. Telescoping multiple reactions can reduce solvent waste by limiting the volume of solvents, as well as eliminating purification processes. Thus, in particular, we review recent advances in "one-pot" multistep reactions achieved using micellar reaction media with potential applications in medicinal chemistry and agrochemistry. Photocatalyzed reactions in micellar reaction media are also discussed. In addition to the use of micelles, we emphasize the process (steps to isolate the product and reuse the catalyst).


Subjects
Micelles , Polymers , Hydrophobic and Hydrophilic Interactions , Polymers/chemistry , Solvents , Water/chemistry
6.
J Biomed Inform ; 118: 103784, 2021 06.
Article in English | MEDLINE | ID: mdl-33862232

ABSTRACT

Understanding a patient's medical history, such as how long symptoms last or when a procedure was performed, is vital to diagnosing problems and providing good care. Frequently, important information regarding a patient's medical timeline is buried in their Electronic Health Record (EHR) in the form of unstructured clinical notes. This results in care providers spending time reading notes in a patient's record in order to become familiar with their condition prior to developing a diagnosis or treatment plan. Valuable time could be saved if this information were readily accessible for searching and visualization for fast comprehension by the medical team. Clinical Natural Language Processing (NLP) is an area of research that aims to build computational methods to automatically extract medically relevant information from unstructured clinical texts. A key component of Clinical NLP is Temporal Reasoning, as understanding a patient's medical history relies heavily on the ability to identify, assimilate, and reason over temporal information. In this work, we review the current state of Temporal Reasoning in the clinical domain with respect to Clinical Timeline Extraction. While much progress has been made, the current state of the art still has a way to go before practical application in the clinical setting will be possible. Areas such as handling relative and implicit temporal expressions, both in normalization and in identifying temporal relationships, improving co-reference resolution, and building interoperable timeline extraction tools that can integrate multiple types of data are in need of new and innovative solutions to improve performance on clinical data.


Subjects
Electronic Health Records , Natural Language Processing , Humans , Problem Solving , Time
7.
J Biomed Inform ; 110: 103552, 2020 10.
Article in English | MEDLINE | ID: mdl-32890727

ABSTRACT

Adverse drug events (ADEs) are unintended incidents that involve the taking of a medication. ADEs pose significant health and financial problems worldwide. Information about ADEs can inform health care and improve patient safety. However, much of this information is buried in narrative texts and needs to be extracted with Natural Language Processing techniques, in order to be useful to computerized methods. ADEs can be found on drug labels, contained in the different sections such as descriptions of the drug's active components or more prominently in descriptions of studied side-effects. Extracting these automatically could be useful in triaging and processing drug reports. In this paper, we present three base methods consisting of a Conditional Random Field (CRF), a bi-directional Long Short Term Memory unit with a CRF layer (biLSTM+CRF), and a pre-trained Bi-directional Encoder Representations from Transformers (BERT) model. We also present several ensembles of the CRF and biLSTM+CRF methods for extracting ADEs and their Reason from FDA drug labels. We show that all three methods perform well on our task, and that combining the models through different ensemble methods can improve results, providing increases in recall for the majority class and improving precision for all other classes. We also show the potential of framing ADE extraction from drug labels as a multi-class classification task on the Reason, or type, of ADE.


Subjects
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Pharmaceutical Preparations , Drug Labeling , Humans , Natural Language Processing
8.
J Biomed Inform ; 112: 103589, 2020 12.
Article in English | MEDLINE | ID: mdl-33035705

ABSTRACT

Patient-physician communication is an often overlooked yet very important aspect of providing medical care. Positive quality of patient-physician communication within discourse influences various aspects of a consultation, such as a patient's adherence to a prescribed medical regimen and their medical care outcome. As few reference standards exist for exploring semantics within the patient-physician setting and its effects on personalized healthcare, this paper presents a study exploring three methods to capture, model and evaluate patient-physician communication among three distinct data sources. We introduce, compare and contrast these methods for capturing and modeling patient-physician communication quality using relatedness between discourse content within a given consultation. Results are shown for all three data sources, and communication quality scores among physicians are recorded. We found that our models demonstrate the ability to capture positive communication quality between both participants within a consultation. We also evaluate these findings against self-reported questionnaires highlighting various aspects of the consultation and rank communication quality among seventeen physicians who consulted with one hundred thirty-two patients.


Subjects
Physician-Patient Relations , Physicians , Communication , Humans , Patient Satisfaction , Semantics , Surveys and Questionnaires
9.
BMC Bioinformatics ; 20(1): 425, 2019 Aug 15.
Article in English | MEDLINE | ID: mdl-31416434

ABSTRACT

BACKGROUND: Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties. RESULTS: We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank. CONCLUSIONS: Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
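
The evaluation step described above (ranking term pairs and applying a threshold at each rank) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `precision_recall_at_ranks`, the toy scores, and the gold-standard set are all hypothetical and chosen only to show the mechanics.

```python
def precision_recall_at_ranks(scored_pairs, true_pairs):
    """Rank term pairs by score (highest first) and report precision and
    recall when the cut-off is placed after each rank."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    total_true = len(true_pairs)
    hits = 0
    curve = []
    for rank, (pair, _score) in enumerate(ranked, start=1):
        if pair in true_pairs:
            hits += 1
        precision = hits / rank
        recall = hits / total_true if total_true else 0.0
        curve.append((rank, precision, recall))
    return curve


# Hypothetical toy data: candidate (start term, target term) pairs with
# scores from some ranking measure, and the pairs treated as true.
scored = [
    (("A", "C1"), 0.9),
    (("A", "C2"), 0.4),
    (("A", "C3"), 0.7),
    (("A", "C4"), 0.1),
]
gold = {("A", "C1"), ("A", "C2")}

for rank, precision, recall in precision_recall_at_ranks(scored, gold):
    print(rank, round(precision, 3), round(recall, 3))
```

Plotting the precision and recall columns against each other gives one curve per ranking measure, which is how methods could be compared under this assumption.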


Subjects
Knowledge Discovery , Models, Theoretical , Area Under Curve , Databases as Topic , Humans , ROC Curve , Semantics , Statistics, Nonparametric
10.
J Biomed Inform ; 77: 111-119, 2018 01.
Article in English | MEDLINE | ID: mdl-29247788

ABSTRACT

This paper presents a comparison between several multi-word term aggregation methods of distributional context vectors applied to the task of semantic similarity and relatedness in the biomedical domain. We compare the multi-word term aggregation methods of summation of component word vectors, mean of component word vectors, direct construction of compound term vectors using the compoundify tool, and direct construction of concept vectors using the MetaMap tool. Dimensionality reduction is critical when constructing high-quality distributional context vectors, so these baseline co-occurrence vectors are compared against dimensionality-reduced vectors created using singular value decomposition (SVD), and word2vec word embeddings using continuous bag of words (CBOW) and skip-gram models. We also find optimal vector dimensionalities for the vectors produced by these techniques. Our results show that none of the tested multi-word term aggregation methods is statistically significantly better than any other. This allows flexibility when choosing a multi-word term aggregation method, and means expensive corpora preprocessing may be avoided. Results are shown with several standard evaluation datasets, and state-of-the-art results are achieved.
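
Two of the aggregation methods named above, summation and mean of component word vectors, can be sketched as follows. This is an illustrative sketch only: the three-dimensional vectors and the helper names `aggregate` and `cosine` are hypothetical and are not taken from the paper or its tools.

```python
import math


def aggregate(vectors, method="sum"):
    """Combine the word vectors of a multi-word term into one vector."""
    dims = len(vectors[0])
    total = [sum(vec[i] for vec in vectors) for i in range(dims)]
    if method == "sum":
        return total
    if method == "mean":
        return [value / len(vectors) for value in total]
    raise ValueError(f"unknown aggregation method: {method}")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy word vectors (real ones would come from co-occurrence
# counts, SVD, or word2vec).
word_vectors = {
    "heart": [1.0, 0.0, 2.0],
    "attack": [0.0, 3.0, 1.0],
    "infarction": [1.0, 2.0, 2.0],
}

term = [word_vectors["heart"], word_vectors["attack"]]
sum_vec = aggregate(term, "sum")
mean_vec = aggregate(term, "mean")

print(cosine(sum_vec, word_vectors["infarction"]))
print(cosine(mean_vec, word_vectors["infarction"]))
```

Because cosine similarity ignores vector length, the summed and averaged vectors give the same cosine score in this sketch; the two methods would only differ under a length-sensitive comparison.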


Subjects
Biomedical Research , Machine Learning/standards , Natural Language Processing , Semantics , Humans , Reproducibility of Results , Unified Medical Language System
11.
J Biomed Inform ; 74: 20-32, 2017 10.
Article in English | MEDLINE | ID: mdl-28838802

ABSTRACT

OBJECTIVES: This paper provides an introduction and overview of literature based discovery (LBD) in the biomedical domain. It introduces the reader to modern and historical LBD models, key system components, evaluation methodologies, and current trends. After completion, the reader will be familiar with the challenges and methodologies of LBD. The reader will be capable of distinguishing between recent LBD systems and publications, and be capable of designing an LBD system for a specific application. TARGET AUDIENCE: From biomedical researchers curious about LBD, to someone looking to design an LBD system, to an LBD expert trying to catch up on trends in the field. The reader need not be familiar with LBD, but knowledge of biomedical text processing tools is helpful. SCOPE: This paper describes a unifying framework for LBD systems. Within this framework, different models and methods are presented to both distinguish and show overlap between systems. Topics include term and document representation, system components, and an overview of models including co-occurrence models, semantic models, and distributional models. Other topics include uninformative term filtering, term ranking, results display, system evaluation, an overview of the application areas of drug development, drug repurposing, and adverse drug event prediction, and challenges and future directions. A timeline showing contributions to LBD, and a table summarizing the works of several authors is provided. Topics are presented from a high level perspective. References are given if more detailed analysis is required.


Subjects
Knowledge Discovery/methods , Models, Theoretical , Algorithms , Data Mining
12.
J Biomed Inform ; 54: 329-36, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25523466

ABSTRACT

INTRODUCTION: This article explores how measures of semantic similarity and relatedness are impacted by the semantic groups to which the concepts they are measuring belong. Our goal is to determine if there are distinctions between homogeneous comparisons (where both concepts belong to the same group) and heterogeneous ones (where the concepts are in different groups). Our hypothesis is that the similarity measures will be significantly affected since they rely on hierarchical is-a relations, whereas relatedness measures should be less impacted since they utilize a wider range of relations. In addition, we also evaluate the effect of combining different measures of similarity and relatedness. Our hypothesis is that these combined measures will more closely correlate with human judgment, since they better reflect the rich variety of information humans use when assessing similarity and relatedness. METHOD: We evaluate our method on four reference standards. Three of the reference standards were annotated by human judges for relatedness and one was annotated for similarity. RESULTS: We found significant differences in the correlation of semantic similarity and relatedness measures with human judgment, depending on which semantic groups were involved. We also found that combining a definition-based relatedness measure with an information content similarity measure resulted in significant improvements in correlation over individual measures. AVAILABILITY: The semantic similarity and relatedness package is an open source program available from http://umls-similarity.sourceforge.net/. The reference standards are available at http://www.people.vcu.edu/~btmcinnes/downloads.html.


Subjects
Natural Language Processing , Semantics , Unified Medical Language System/classification , Humans , Systematized Nomenclature of Medicine
13.
J Biomed Inform ; 47: 83-90, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24076369

ABSTRACT

Automatic processing of biomedical documents is made difficult by the fact that many of the terms they contain are ambiguous. Word Sense Disambiguation (WSD) systems attempt to resolve these ambiguities and identify the correct meaning. However, the published literature on WSD systems for biomedical documents reports considerable differences in performance for different terms. The development of WSD systems is often expensive with respect to acquiring the necessary training data. It would therefore be useful to be able to predict in advance which terms WSD systems are likely to perform well or badly on. This paper explores various methods for estimating the performance of WSD systems on a wide range of ambiguous biomedical terms (including ambiguous words/phrases and abbreviations). The methods include both supervised and unsupervised approaches. The supervised approaches make use of information from labeled training data while the unsupervised ones rely on the UMLS Metathesaurus. The approaches are evaluated by comparing their predictions about how difficult disambiguation will be for ambiguous terms against the output of two WSD systems. We find the supervised methods are the best predictors of WSD difficulty, but are limited by their dependence on labeled training data. The unsupervised methods all perform well in some situations and can be applied more widely.


Subjects
Knowledge Bases , Semantics , Algorithms , Artificial Intelligence , Humans , Language , MEDLINE , Medical Informatics , Models, Statistical , Natural Language Processing , Reproducibility of Results , Unified Medical Language System , Vocabulary, Controlled
14.
J Biomed Inform ; 46(6): 1116-24, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24012881

ABSTRACT

INTRODUCTION: In this article, we evaluate a knowledge-based word sense disambiguation method that determines the intended concept associated with an ambiguous word in biomedical text using semantic similarity and relatedness measures. These measures quantify the degree of similarity or relatedness between concepts in the Unified Medical Language System (UMLS). The objective of this work is to develop a method that can disambiguate terms in biomedical text by exploiting similarity and relatedness information extracted from biomedical resources and to evaluate the efficacy of these measures on WSD. METHOD: We evaluate our method on a biomedical dataset (MSH-WSD) that contains 203 ambiguous terms and acronyms. RESULTS: We show that information content-based measures derived from either a corpus or taxonomy obtain a higher disambiguation accuracy than path-based measures or relatedness measures on the MSH-WSD dataset. AVAILABILITY: The WSD system is open source and freely available from http://search.cpan.org/dist/UMLS-SenseRelate/. The MSH-WSD dataset is available from the National Library of Medicine http://wsd.nlm.nih.gov.
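
One common way to use a similarity or relatedness measure for knowledge-based disambiguation is to score each candidate sense against the concepts found in the surrounding context and pick the highest-scoring sense. The sketch below assumes that scheme; the abstract does not spell out the scoring rule, and the function `disambiguate`, the placeholder concept labels, and the similarity values are all hypothetical.

```python
def disambiguate(candidate_senses, context_concepts, similarity):
    """Return the candidate sense with the highest total similarity to the
    context concepts, together with every candidate's score."""
    scores = {
        sense: sum(similarity(sense, concept) for concept in context_concepts)
        for sense in candidate_senses
    }
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


# Hypothetical similarity values; real values would come from a similarity
# or relatedness measure computed over the UMLS.
TOY_SIMILARITY = {
    frozenset(["SENSE_COMMON_COLD", "CONCEPT_VIRUS"]): 0.8,
    frozenset(["SENSE_COMMON_COLD", "CONCEPT_COUGH"]): 0.7,
    frozenset(["SENSE_COLD_TEMPERATURE", "CONCEPT_VIRUS"]): 0.1,
    frozenset(["SENSE_COLD_TEMPERATURE", "CONCEPT_COUGH"]): 0.2,
}


def toy_similarity(concept_a, concept_b):
    """Symmetric lookup into the toy table; unknown pairs score 0."""
    return TOY_SIMILARITY.get(frozenset([concept_a, concept_b]), 0.0)


best, scores = disambiguate(
    ["SENSE_COMMON_COLD", "SENSE_COLD_TEMPERATURE"],
    ["CONCEPT_VIRUS", "CONCEPT_COUGH"],
    toy_similarity,
)
print(best)
```

Swapping `toy_similarity` for a path-based, information content-based, or relatedness measure is the only change needed to compare measures under this scheme.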


Subjects
Semantics , Evaluation Studies as Topic , Unified Medical Language System
15.
Front Res Metr Anal ; 7: 1001266, 2022.
Article in English | MEDLINE | ID: mdl-36352893

ABSTRACT

Temporal expression recognition and normalization (TERN) is the foundation for all higher-level temporal reasoning tasks in natural language processing, such as timeline extraction, so it must be performed well to limit error propagation. Achieving new heights in state-of-the-art performance for TERN in clinical texts requires knowledge of where current systems struggle. In this work, we summarize the results of a detailed error analysis for three top-performing state-of-the-art TERN systems that participated in the 2012 i2b2 Clinical Temporal Relation Challenge, and compare them with our own home-grown system, Chrono, to identify specific areas in need of improvement. Performance metrics and an error analysis reveal that all systems have reduced performance in normalization of relative temporal expressions, specifically in disambiguating temporal types and in the identification of the correct anchor time. To address the issue of temporal disambiguation, we developed and integrated a module into Chrono that utilizes temporally fine-tuned contextual word embeddings to disambiguate relative temporal expressions. Chrono now achieves state-of-the-art performance for temporal disambiguation of relative temporal expressions in clinical text, and is the only TERN system to output dual annotations into both TimeML and SCATE schemes.

16.
Proc Int World Wide Web Conf ; 2022: 823-832, 2022 Apr.
Article in English | MEDLINE | ID: mdl-37465200

ABSTRACT

Since the rise of the COVID-19 pandemic, peer-reviewed biomedical repositories have experienced a surge in chemical and disease related queries. These queries have a wide variety of naming conventions and nomenclatures from trademark and generic, to chemical composition mentions. Normalizing or disambiguating these mentions within texts provides researchers and data-curators with more relevant articles returned by their search query. Named entity normalization aims to automate this disambiguation process by linking entity mentions onto their appropriate candidate concepts within a biomedical knowledge base or ontology. We explore several term embedding aggregation techniques in addition to how the term's context affects evaluation performance. We also evaluate our embedding approaches for normalizing term instances containing one or many relations within unstructured texts.

17.
Database (Oxford) ; 2022, 2022 08 11.
Article in English | MEDLINE | ID: mdl-35951425

ABSTRACT

TopEx is a natural language processing application developed to facilitate the exploration of topics and key words in a set of texts through a user interface that requires no programming or natural language processing knowledge, thus enhancing the ability of nontechnical researchers to explore and analyze textual data. The underlying algorithm groups semantically similar sentences together and then performs a topic analysis on each group to identify the key topics discussed in a collection of texts. Implementation is achieved via a Python library back end and a web application front end built with React and D3.js for visualizations. TopEx has been successfully used to identify themes, topics and key words in a variety of corpora, including Coronavirus disease 2019 (COVID-19) discharge summaries and tweets. Feedback from the BioCreative VII Challenge Track 4 concludes that TopEx is a useful tool for text exploration for a variety of users and tasks. DATABASE URL: http://topex.cctr.vcu.edu.


Subjects
COVID-19 , Algorithms , Data Mining/methods , Humans , Natural Language Processing , Software
18.
JMIR Form Res ; 6(9): e32460, 2022 Sep 06.
Article in English | MEDLINE | ID: mdl-36066925

ABSTRACT

BACKGROUND: Community-engaged research (CEnR) is a research approach in which scholars partner with community organizations or individuals with whom they share an interest in the study topic, typically with the goal of supporting that community's well-being. CEnR is well-established in numerous disciplines including the clinical and social sciences. However, universities experience challenges reporting comprehensive CEnR metrics, limiting the development of appropriate CEnR infrastructure and the advancement of relationships with communities, funders, and stakeholders. OBJECTIVE: We propose a novel approach to identifying and categorizing community-engaged studies by applying attention-based deep learning models to human participants protocols that have been submitted to the university's institutional review board (IRB). METHODS: We manually classified a sample of 280 protocols submitted to the IRB using a 3- and 6-level CEnR heuristic. We then trained an attention-based bidirectional long short-term memory unit (Bi-LSTM) on the classified protocols and compared it to transformer models such as Bidirectional Encoder Representations From Transformers (BERT), Bio + Clinical BERT, and Cross-lingual Language Model-Robustly Optimized BERT Pre-training Approach (XLM-RoBERTa). We applied the best-performing models to the full sample of unlabeled IRB protocols submitted in the years 2013-2019 (n>6000). RESULTS: Although transfer learning is superior, receiving a 0.9952 evaluation F1 score for all transformer models implemented compared to the attention-based Bi-LSTM (between 48%-80%), there were key issues with overfitting. This finding is consistent across several methodological adjustments: an augmented data set with and without cross-validation, an unaugmented data set with and without cross-validation, a 6-class CEnR spectrum, and a 3-class one. 
CONCLUSIONS: Transfer learning is a more viable method than the attention-based bidirectional-LSTM for differentiating small data sets characterized by the idiosyncrasies and variability of CEnR descriptions used by principal investigators in research protocols. Despite these issues involving overfitting, BERT and the other transformer models remarkably showed an understanding of our data unlike the attention-based Bi-LSTM model, promising a more realistic path toward solving this real-world application.

19.
ACS Synth Biol ; 11(6): 2043-2054, 2022 06 17.
Article in English | MEDLINE | ID: mdl-35671034

ABSTRACT

Scientific articles contain a wealth of information about experimental methods and results describing biological designs. Due to its unstructured nature and multiple sources of ambiguity and variability, extracting this information from text is a difficult task. In this paper, we describe the development of the synthetic biology knowledge system (SBKS) text processing pipeline. The pipeline uses natural language processing techniques to extract and correlate information from the literature for synthetic biology researchers. Specifically, we apply named entity recognition, relation extraction, concept grounding, and topic modeling to extract information from published literature to link articles to elements within our knowledge system. Our results show the efficacy of each of the components on synthetic biology literature and provide future directions for further advancement of the pipeline.


Subjects
Data Mining , Synthetic Biology , Data Mining/methods , Natural Language Processing
20.
BMC Bioinformatics ; 12: 223, 2011 Jun 02.
Article in English | MEDLINE | ID: mdl-21635749

ABSTRACT

BACKGROUND: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD. METHODS: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH headings to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS Concept Unique Identifier (CUI) linked to the MeSH heading, so each instance has an assigned CUI. We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set. RESULTS: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods. CONCLUSIONS: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
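
The instance-selection rule described in the METHODS above (keep a citation only when the ambiguous term occurs and exactly one of its candidate MeSH headings is indexed, then label it with the CUI linked to that heading) can be sketched as below. This is a simplified illustration, not the published pipeline: the record layout, the function `build_instances`, the placeholder CUIs, and the citation identifiers are hypothetical, and term matching is reduced to a case-insensitive substring test.

```python
def build_instances(term, heading_to_cui, citations, max_per_sense=100):
    """Collect labeled instances for one ambiguous term.

    heading_to_cui maps each candidate MeSH heading to its linked CUI.
    A citation is kept only if the term appears in its text and exactly
    one candidate heading is among its MeSH index terms.
    """
    instances = {cui: [] for cui in heading_to_cui.values()}
    for citation in citations:
        if term.lower() not in citation["text"].lower():
            continue
        matched = [h for h in heading_to_cui if h in citation["mesh"]]
        if len(matched) != 1:
            continue  # zero or several candidate headings: ambiguous label
        cui = heading_to_cui[matched[0]]
        if len(instances[cui]) < max_per_sense:
            instances[cui].append(citation["id"])
    return instances


# Hypothetical placeholder CUIs and citation records.
senses = {"Common Cold": "CUI_A", "Cold Temperature": "CUI_B"}
records = [
    {"id": "cit-1", "text": "Symptoms of a cold in children.", "mesh": {"Common Cold", "Child"}},
    {"id": "cit-2", "text": "Cold exposure and colds.", "mesh": {"Common Cold", "Cold Temperature"}},
    {"id": "cit-3", "text": "Storage in the cold.", "mesh": {"Cold Temperature"}},
    {"id": "cit-4", "text": "Rhinovirus infection.", "mesh": {"Common Cold"}},
]

print(build_instances("cold", senses, records))
```

The `max_per_sense` default mirrors the cap of 100 instances per sense stated in the abstract; the second record is dropped because both headings co-occur, and the fourth because the term itself is absent.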


Subjects
Algorithms , MEDLINE , Medical Subject Headings , Abstracting and Indexing , Humans , Knowledge Bases , Natural Language Processing , Semantics , Unified Medical Language System , United States