1.
Front Res Metr Anal ; 8: 1250930, 2023.
Article in English | MEDLINE | ID: mdl-37841902

ABSTRACT

Biomedical experts are facing challenges in keeping up with the vast amount of biomedical knowledge published daily. With millions of citations added to databases like MEDLINE/PubMed each year, efficiently accessing relevant information becomes crucial. Traditional term-based searches may lead to irrelevant or missed documents due to homonyms, synonyms, abbreviations, or term mismatch. To address this, semantic search approaches employing predefined concepts with associated synonyms and relations have been used to expand query terms and improve information retrieval. The National Library of Medicine (NLM) plays a significant role in this area, indexing citations in the MEDLINE database with topic descriptors from the Medical Subject Headings (MeSH) thesaurus, enabling advanced semantic search strategies to retrieve relevant citations despite the synonymy and polysemy of biomedical terms. Over time, advancements in semantic indexing have been made, with Machine Learning facilitating the transition from manual to automatic semantic indexing in the biomedical literature. The paper highlights the journey of this transition, starting with manual semantic indexing and the initial efforts toward automatic indexing. The BioASQ challenge has served as a catalyst in revolutionizing the domain of semantic indexing, further pushing the boundaries of efficient knowledge retrieval in the biomedical field.

2.
J Assoc Inf Sci Technol ; 74(2): 205-218, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36819642

ABSTRACT

MEDLINE is the National Library of Medicine's (NLM) journal citation database. It contains over 28 million references to biomedical and life science journal articles, and a key feature of the database is that all articles are indexed with NLM Medical Subject Headings (MeSH). The library employs a team of MeSH indexers, and in recent years they have been asked to index close to 1 million articles per year in order to keep MEDLINE up to date. An important part of the MEDLINE indexing process is the assignment of articles to indexers. High quality and timely indexing is only possible when articles are assigned to indexers with suitable expertise. This paper introduces the NLM indexer assignment dataset: a large dataset of 4.2 million indexer article assignments for articles indexed between 2011 and 2019. The dataset is shown to be a valuable testbed for expert matching and assignment algorithms, and indexer article assignment is also found to be useful domain-adaptive pre-training for the closely related task of reviewer assignment.

3.
AMIA Annu Symp Proc ; 2021: 677-686, 2021.
Article in English | MEDLINE | ID: mdl-35308957

ABSTRACT

Sentence boundary detection (SBD) is a fundamental building block in the Natural Language Processing (NLP) pipeline. Incorrect SBD may impact subsequent processing stages resulting in decreased performance. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient for successfully detecting sentence boundaries. However, a corpus like MEDLINE citations presents challenges for SBD due to several syntactic ambiguities, e.g., abbreviation-periods, capital letters in first words of sentences, etc. In this manuscript we present an algorithm to address these challenges based on majority voting among three SBD engines (Python NLTK, pySBD, and Syntok) followed by custom post-processing algorithms that rely on NLP spaCy part-of-speech, abbreviation and capital letter detection, and computing general sentence statistics. Experiments on several thousand MEDLINE citations show that our proposed approach for combining multiple SBD engines and post-processing rules performs better than each individual engine.


Subjects
Language , Natural Language Processing , Algorithms , Humans , MEDLINE
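The majority-voting step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the three "engines" are represented only by hard-coded, hypothetical lists of sentence-end offsets (no NLTK, pySBD, or Syntok calls are made), and the post-processing rules from the paper are omitted. The function names `majority_vote_boundaries` and `split_at` are assumptions made for this example.

```python
from collections import Counter


def majority_vote_boundaries(engine_boundaries):
    """Keep the sentence-end offsets proposed by a strict majority of engines."""
    min_votes = len(engine_boundaries) // 2 + 1
    votes = Counter()
    for boundaries in engine_boundaries:
        votes.update(set(boundaries))  # each engine votes at most once per offset
    return sorted(offset for offset, n in votes.items() if n >= min_votes)


def split_at(text, boundaries):
    """Cut text into sentences at the given end offsets (exclusive)."""
    sentences, start = [], 0
    for end in boundaries:
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


text = "Patients received 5 mg i.v. daily. Outcomes improved."


def end_of(fragment):
    """Offset just past the first occurrence of fragment in text."""
    return text.index(fragment) + len(fragment)


# Hypothetical engine outputs: engine A wrongly splits after the abbreviation "i.v."
engine_a = [end_of("i.v."), end_of("daily."), len(text)]
engine_b = [end_of("daily."), len(text)]
engine_c = [end_of("daily."), len(text)]

sentences = split_at(text, majority_vote_boundaries([engine_a, engine_b, engine_c]))
print(sentences)
```

In this toy case the spurious boundary after the abbreviation receives only one vote of three and is discarded, which is the kind of error the voting scheme is meant to absorb.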
5.
AMIA Jt Summits Transl Sci Proc ; 2020: 561-568, 2020.
Article in English | MEDLINE | ID: mdl-32477678

ABSTRACT

Chemical entity recognition is essential for indexing scientific literature in the MEDLINE database at the National Library of Medicine. However, the tool currently used to suggest terms for indexing, the Medical Text Indexer, was not originally conceived as a chemical recognition tool. It has instead been adapted to the task via its use of MetaMap and the addition of in-house patterns and rules. In order to develop a tool more suitable for chemical recognition, we have created a collection of 200 MEDLINE titles and abstracts annotated with genes, proteins, inorganic and organic chemicals, as well as other biological molecules. We use this collection to evaluate eleven chemical entity recognition systems, where we seek to identify a tool that effectively recognizes chemical entities for indexing and also performs well on chemical recognition beyond the indexing task. We observe the highest performance with a SciBERT ensemble.

6.
Article in English | MEDLINE | ID: mdl-32472120

ABSTRACT

Natural language processing (NLP) plays a vital role in modern medical informatics. It converts narrative text or unstructured data into knowledge by analyzing and extracting concepts. A comprehensive lexical system is the foundation to the success of NLP applications and an essential component at the beginning of the NLP pipeline. The SPECIALIST Lexicon and Lexical Tools, distributed by the National Library of Medicine as one of the Unified Medical Language System Knowledge Sources, provides an underlying resource for many NLP applications. This article reports recent developments of 3 key components in the Lexicon. The core NLP operation of Unified Medical Language System concept mapping is used to illustrate the importance of these developments. Our objective is to provide generic, broad coverage and a robust lexical system for NLP applications. A novel multiword approach and other planned developments are proposed.

7.
AMIA Annu Symp Proc ; 2020: 1031-1040, 2020.
Article in English | MEDLINE | ID: mdl-33936479

ABSTRACT

This year fewer than 200 National Library of Medicine indexers expect to index 1 million articles, and this would not be possible without the assistance of the Medical Text Indexer (MTI) system. MTI is an automated indexing system that provides MeSH main heading/subheading pair recommendations to assist indexers with their heavy workload. Over the years, a lot of research effort has focused on improving main heading prediction performance, but automated fine-grained indexing with main heading/subheading pairs has received much less attention. This work revisits the subheading attachment problem, and demonstrates very significant performance improvements using modern Convolutional Neural Network classifiers. The best performing method is shown to outperform the current MTI implementation with a 3.7% absolute improvement in precision, and a 27.6% absolute improvement in recall. We also conducted a manual review of false positive predictions, and 70% were found to be acceptable indexing.


Subjects
Abstracting and Indexing/methods , Medical Subject Headings , Neural Networks, Computer , Humans , MEDLINE , Machine Learning , National Library of Medicine (U.S.) , Natural Language Processing , Unified Medical Language System , United States
8.
AMIA Annu Symp Proc ; 2019: 727-734, 2019.
Article in English | MEDLINE | ID: mdl-32308868

ABSTRACT

MEDLINE is the National Library of Medicine's premier bibliographic database for biomedical literature. A highly valuable feature of the database is that each record is manually indexed with a controlled vocabulary called MeSH. Most MEDLINE journals are indexed cover-to-cover, but there are about 200 selectively indexed journals for which only articles related to biomedicine and life sciences are indexed. In recent years, the selection process has become an increasing burden for indexing staff, and this paper presents a machine learning based system that offers very significant time savings by semi-automating the task. At the core of the system is a high recall classifier for the identification of journal articles that are in-scope for MEDLINE. The system is shown to reduce the number of articles requiring manual review by 54%, equivalent to approximately 40,000 articles per year.


Subjects
Abstracting and Indexing , MEDLINE , Machine Learning , Neural Networks, Computer , Medical Subject Headings , National Library of Medicine (U.S.) , United States
9.
AMIA Annu Symp Proc ; 2018: 368-376, 2018.
Article in English | MEDLINE | ID: mdl-30815076

ABSTRACT

Medication doses, one of the determining factors in medication safety and effectiveness, are present in the literature, but only in free-text form. We set out to determine if the systems developed for extracting drug prescription information from clinical text would yield comparable results on scientific literature and if sequence-to-sequence learning with neural networks could improve over the current state-of-the-art. We developed a collection of 694 PubMed Central documents annotated with drug dose information using the i2b2 schema. We found that less than half of the drug doses are present in the MEDLINE/PubMed abstracts, and full-text is needed to identify the other half. We identified the differences in the scope and formatting of drug dose information in the literature and clinical text, which require developing new dose extraction approaches. Finally, we achieved 83.9% recall, 87.2% precision and 85.5% F1 score in extracting complete drug prescription information from the literature.


Subjects
Deep Learning , Information Storage and Retrieval/methods , Neural Networks, Computer , Pharmaceutical Preparations/administration & dosage , PubMed , Drug Administration Routes , Drug Administration Schedule , Drug Dosage Calculations , Humans
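The three figures reported in the abstract above are mutually consistent if the F1 score is the usual harmonic mean of precision and recall. The short check below assumes that standard definition; the abstract itself does not spell out the formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(precision=0.872, recall=0.839)
print(f"{f1:.1%}")  # prints 85.5%
```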
10.
PLoS One ; 10(7): e0132735, 2015.
Article in English | MEDLINE | ID: mdl-26207759

ABSTRACT

OBJECTIVE: This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are "invisible" or not deposited in a known repository. METHODS: We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. RESULTS: About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. CONCLUSION: In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a "dataset," determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets.


Subjects
Biomedical Research/economics , National Institutes of Health (U.S.)/organization & administration , Publishing/organization & administration , Access to Information , Biomedical Research/organization & administration , Databases, Bibliographic , Humans , National Institutes of Health (U.S.)/economics , United States
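The range of 200,000 to 235,000 invisible datasets in the abstract above follows from multiplying an article count by the average number of datasets per article. The abstract does not state that article count, so the sketch below back-calculates it from the reported figures; the resulting value is an inference for illustration only, not a number taken from the study.

```python
# Figures reported in the abstract.
avg_datasets_low, avg_datasets_high = 2.9, 3.4
total_low, total_high = 200_000, 235_000

# Implied number of articles with invisible datasets (an inference, not a reported value).
implied_from_low = total_low / avg_datasets_low
implied_from_high = total_high / avg_datasets_high

print(round(implied_from_low), round(implied_from_high))
```

Both ends of the range point to roughly the same article count, which suggests that the two totals were obtained by scaling one article estimate by the lower and upper per-article averages.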
11.
BMC Bioinformatics ; 16: 113, 2015 Apr 08.
Article in English | MEDLINE | ID: mdl-25887792

ABSTRACT

BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. RESULTS: Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. CONCLUSIONS: We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.


Subjects
Abstracting and Indexing/methods , Algorithms , Information Storage and Retrieval , MEDLINE , Medical Subject Headings , Artificial Intelligence , Humans , Semantics
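For readers unfamiliar with the unigram and bigram features that the abstract above uses as its baseline, the following is a minimal sketch of bag-of-words feature extraction. The whitespace tokenizer, the lowercasing, and the function name `ngram_features` are simplifying assumptions for this example; the study's actual preprocessing is not described in the abstract.

```python
from collections import Counter


def ngram_features(text, n_values=(1, 2)):
    """Count word n-grams (unigrams and bigrams by default) in a text."""
    tokens = text.lower().split()  # naive whitespace tokenization, an assumption
    features = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


feats = ngram_features("gene expression in human gene studies")
print(feats["gene"], feats["gene expression"], len(feats))  # prints 2 1 10
```

Each citation becomes a sparse count vector of this kind, and a separate classifier per MeSH heading can then be trained on those vectors.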
12.
AMIA Annu Symp Proc ; 2015: 484-91, 2015.
Article in English | MEDLINE | ID: mdl-26958181

ABSTRACT

Characteristics of the subjects of biomedical research are important in determining if a publication describing the research is relevant to a search. To facilitate finding relevant publications, MEDLINE citations provide Medical Subject Headings that describe the subjects' characteristics, such as their species, gender, and age. We seek to improve the recommendation of these headings by the Medical Text Indexer (MTI) that supports manual indexing of MEDLINE. To that end, we explore the potential of the full text of the publications. Using simple recall-oriented rule-based methods we determined that adding sentences extracted from the methods sections and captions to the abstracts prior to MTI processing significantly improved recall and F1 score with only a slight drop in precision. Improvements were also achieved in directly assigning several headings extracted from the full text. These results indicate the need for further development of automated methods capable of leveraging the full text for indexing.


Subjects
Abstracting and Indexing/methods , Biomedical Research/methods , Data Mining , Demography , Information Storage and Retrieval/methods , Research Subjects , Algorithms , Animals , Humans , Information Storage and Retrieval/standards , MEDLINE
13.
BMC Bioinformatics ; 14: 171, 2013 May 31.
Article in English | MEDLINE | ID: mdl-23725347

ABSTRACT

BACKGROUND: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. RESULTS: We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. CONCLUSIONS: The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.


Subjects
Abstracting and Indexing/methods , Artificial Intelligence , Databases, Genetic , Algorithms , Bayes Theorem , Genes , Humans , MEDLINE
14.
BMC Bioinformatics ; 14: 208, 2013 Jun 26.
Article in English | MEDLINE | ID: mdl-23802936

ABSTRACT

BACKGROUND: MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text for the input to MTI to use in the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high quality of indexing results. RESULTS: We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. CONCLUSIONS: Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.


Subjects
Abstracting and Indexing/methods , Algorithms , Medical Subject Headings , Artificial Intelligence , MEDLINE
15.
AMIA Annu Symp Proc ; 2013: 300-8, 2013.
Article in English | MEDLINE | ID: mdl-24551339

ABSTRACT

This paper presents a two-step approach to generating comprehensive abstractive overviews for biomedical topics. It starts with a sensitivity-maximizing search of MEDLINE/PubMed and MeSH-based filtering of the results that are then processed using NLP methods to extract relations between entities of interest. We evaluate this approach in a case study based on the IOM report on the role of vitamin D in human health. The report defines disorders that serve as health indicators for the role of vitamin D. We evaluate the abstractive overviews generated using MeSH indexing and the extracted relations using the disorders listed in the IOM report as reference standard. We conclude that MeSH-based aggregation and filtering of the results is a useful and easy step in the generation of abstractive overviews. Although our relation extraction achieved 83.6% recall and 92.8% precision, only half of the disorders of interest participated in these relations.


Subjects
Abstracting and Indexing , Data Mining , MEDLINE , Medical Subject Headings , Natural Language Processing , Vitamin D , Calcium/administration & dosage , Dietary Supplements/adverse effects , Humans , Vitamin D/administration & dosage , Vitamin D/adverse effects , Vitamin D/blood , Vitamin D Deficiency
16.
AMIA Annu Symp Proc ; 2013: 709-18, 2013.
Article in English | MEDLINE | ID: mdl-24551371

ABSTRACT

MeSH indexing of MEDLINE is becoming a more difficult task for the group of highly qualified indexing staff at the US National Library of Medicine, due to the large yearly growth of MEDLINE and the increasing size of MeSH. Since 2002, this task has been assisted by the Medical Text Indexer or MTI program. We extend previous machine learning analysis by adding a more diverse set of MeSH headings targeting examples where MTI has been shown to perform poorly. Machine learning algorithms exceed MTI's performance on MeSH headings that are used very frequently and headings for which the indexing frequency is very low. We find that when we combine the MTI suggestions and the prediction of the learning algorithms, the performance improves compared to any single method for most of the evaluated MeSH headings.


Subjects
Abstracting and Indexing/methods , Algorithms , Artificial Intelligence , Medical Subject Headings , Natural Language Processing , MEDLINE
17.
AMIA Annu Symp Proc ; 2011: 1583-92, 2011.
Article in English | MEDLINE | ID: mdl-22195224

ABSTRACT

MEDLINE indexing performed by the US National Library of Medicine staff describes the essence of a biomedical publication in about 14 Medical Subject Headings (MeSH). Since 2002, this task has been assisted by the Medical Text Indexer (MTI) program. We present a bottom-up approach to MEDLINE indexing in which the abstract is searched for indicators for a specific MeSH recommendation in a two-step process. Supervised machine learning combined with triage rules improves sensitivity of recommendations while keeping the number of recommended terms relatively small. Improvement in recommendations observed in this work warrants further exploration of this approach to MTI recommendations on a larger set of MeSH headings.


Subjects
Abstracting and Indexing/methods , Artificial Intelligence , MEDLINE , Medical Subject Headings , Natural Language Processing , Algorithms , Carbohydrate Sequence , Unified Medical Language System
19.
J Am Med Inform Assoc ; 17(5): 536-9, 2010.
Article in English | MEDLINE | ID: mdl-20819859

ABSTRACT

OBJECTIVE: The authors used the i2b2 Medication Extraction Challenge to evaluate their entity extraction methods, contribute to the generation of a publicly available collection of annotated clinical notes, and start developing methods for ontology-based reasoning using structured information generated from the unstructured clinical narrative. DESIGN: Extraction of salient features of medication orders from the text of de-identified hospital discharge summaries was addressed with a knowledge-based approach using simple rules and lookup lists. The entity recognition tool, MetaMap, was combined with dose, frequency, and duration modules specifically developed for the Challenge as well as a prototype module for reason identification. MEASUREMENTS: Evaluation metrics and corresponding results were provided by the Challenge organizers. RESULTS: The results indicate that robust rule-based tools achieve satisfactory results in extraction of simple elements of medication orders, but more sophisticated methods are needed for identification of reasons for the orders and durations. LIMITATIONS: Owing to the time constraints and nature of the Challenge, some obvious follow-on analysis has not been completed yet. CONCLUSIONS: The authors plan to integrate the new modules with MetaMap to enhance its accuracy. This integration effort will provide guidance in retargeting existing tools for better processing of clinical text.


Subjects
Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Pharmaceutical Preparations , Humans , Patient Discharge , Software Design
20.
J Biomed Inform ; 43(4): 587-94, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20152935

ABSTRACT

Identification of medical terms in free text is a first step in such Natural Language Processing (NLP) tasks as automatic indexing of biomedical literature and extraction of patients' problem lists from the text of clinical notes. Many tools developed to perform these tasks use biomedical knowledge encoded in the Unified Medical Language System (UMLS) Metathesaurus. We continue our exploration of automatic approaches to creation of subsets (UMLS content views) which can support NLP processing of either the biomedical literature or clinical text. We found that suppression of highly ambiguous terms in the conservative AutoFilter content view can partially replace manual filtering for literature applications, and suppression of two character mappings in the same content view achieves 89.5% precision at 78.6% recall for clinical applications.


Subjects
Natural Language Processing , Unified Medical Language System/standards , Information Storage and Retrieval/methods , Publications