Search | BVS España Search Portal

1.

Feature engineering for MEDLINE citation categorization with MeSH.

Jimeno Yepes, Antonio Jose; Plaza, Laura; Carrillo-de-Albornoz, Jorge; Mork, James G; Aronson, Alan R.

BMC Bioinformatics ; 16: 113, 2015 Apr 08.

Article in English | MEDLINE | ID: mdl-25887792

ABSTRACT

BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. RESULTS: Traditional features like unigrams and bigrams exhibit strong performance compared to other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus have lower performance compared to more traditional features. Conceptual annotation of the texts by MetaMap shows similar performance compared to unigrams, but adding concepts from the UMLS taxonomy does not improve the performance of using only mapped concepts. The combination of all the features performs largely better than any individual feature set considered. In addition, this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance largely obtain better performance. CONCLUSIONS: We conclude that even though traditional features such as unigrams and bigrams have strong performance compared to other features, it is possible to combine them to effectively improve the performance of the bag-of-words representation. We have also found that the combination of the learning algorithm and feature sets has an influence in the overall performance of the system. Moreover, using learning algorithms resilient to class imbalance largely improves performance. 
However, when using a large set of features, consideration needs to be taken with algorithms due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.

Subject(s)

Abstracting and Indexing/methods , Algorithms , Information Storage and Retrieval , MEDLINE , Medical Subject Headings , Artificial Intelligence , Humans , Semantics

2.

MeSH indexing based on automatically generated summaries.

Jimeno-Yepes, Antonio J; Plaza, Laura; Mork, James G; Aronson, Alan R; Díaz, Alberto.

BMC Bioinformatics ; 14: 208, 2013 Jun 26.

Article in English | MEDLINE | ID: mdl-23802936

ABSTRACT

BACKGROUND: MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text for the input to MTI to use in the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high quality of indexing results. RESULTS: We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. CONCLUSIONS: Our results show that automatic summaries produce better indexing than full text articles. 
Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.

Subject(s)

Abstracting and Indexing/methods , Algorithms , Medical Subject Headings , Artificial Intelligence , MEDLINE

3.

GeneRIF indexing: sentence selection based on machine learning.

Jimeno-Yepes, Antonio J; Sticco, J Caitlin; Mork, James G; Aronson, Alan R.

BMC Bioinformatics ; 14: 171, 2013 May 31.

Article in English | MEDLINE | ID: mdl-23725347

ABSTRACT

BACKGROUND: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. RESULTS: We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. CONCLUSIONS: The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.

Subject(s)

Abstracting and Indexing/methods , Artificial Intelligence , Databases, Genetic , Algorithms , Bayes Theorem , Genes , Humans , MEDLINE

4.

Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation.

Jimeno-Yepes, Antonio J; McInnes, Bridget T; Aronson, Alan R.

BMC Bioinformatics ; 12: 223, 2011 Jun 02.

Article in English | MEDLINE | ID: mdl-21635749

ABSTRACT

BACKGROUND: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD. METHODS: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading. Each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set. RESULTS: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to that of the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods. CONCLUSIONS: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. 
Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
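
The selection rule described in the METHODS of this abstract (keep a citation only when the ambiguous term occurs in it together with exactly one of the candidate MeSH headings) can be illustrated with a minimal Python sketch. The function name `collect_instances`, the placeholder sense identifiers `CUI_A` and `CUI_B`, the toy citations, and the plain substring match are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

# (citation id, title/abstract text, MeSH headings assigned by indexers)
Citation = Tuple[str, str, Set[str]]


def collect_instances(
    term: str,
    sense_headings: Dict[str, str],
    citations: List[Citation],
    max_per_sense: int = 100,
) -> Dict[str, List[str]]:
    """Assign citations to a sense when exactly one candidate heading is present."""
    instances: Dict[str, List[str]] = {cui: [] for cui in sense_headings}
    for cit_id, text, mesh in citations:
        # Simplification: naive substring match instead of proper tokenization.
        if term.lower() not in text.lower():
            continue
        matching = [cui for cui, heading in sense_headings.items() if heading in mesh]
        if len(matching) != 1:
            continue  # none or several candidate headings: instance is unusable
        cui = matching[0]
        if len(instances[cui]) < max_per_sense:
            instances[cui].append(cit_id)
    return instances


# Hypothetical sense identifiers mapped to two MeSH headings for "cold".
senses = {"CUI_A": "Common Cold", "CUI_B": "Cold Temperature"}
toy_citations: List[Citation] = [
    ("toy-1", "Zinc for the treatment of cold symptoms.", {"Common Cold", "Zinc"}),
    ("toy-2", "Cold exposure and metabolic rate.", {"Cold Temperature"}),
    ("toy-3", "Cold weather and the common cold.", {"Common Cold", "Cold Temperature"}),
    ("toy-4", "Influenza vaccination.", {"Common Cold"}),
]
result = collect_instances("cold", senses, toy_citations)
print(result)
```

In this toy run, `toy-3` is discarded because both headings are present and `toy-4` because the term does not occur in the text. The default cap of 100 mirrors the per-sense maximum stated in the abstract.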

Subject(s)

Algorithms , MEDLINE , Medical Subject Headings , Abstracting and Indexing , Humans , Knowledge Bases , Natural Language Processing , Semantics , Unified Medical Language System , United States

5.

Collocation analysis for UMLS knowledge-based word sense disambiguation.

Jimeno-Yepes, Antonio; McInnes, Bridget T; Aronson, Alan R.

BMC Bioinformatics ; 12 Suppl 3: S4, 2011 Jun 09.

Article in English | MEDLINE | ID: mdl-21658291

ABSTRACT

BACKGROUND: The effectiveness of knowledge-based word sense disambiguation (WSD) approaches depends in part on the information available in the reference knowledge resource. Off the shelf, these resources are not optimized for WSD and might lack terms to model the context properly. In addition, they might include noisy terms which contribute to false positives in the disambiguation results. METHODS: We analyzed some collocation types which could improve the performance of knowledge-based disambiguation methods. Collocations are obtained by extracting candidate collocations from MEDLINE and then assigning them to one of the senses of an ambiguous word. We performed this assignment either using semantic group profiles or a knowledge-based disambiguation method. In addition to collocations, we used second-order features from a previously implemented approach. Specifically, we measured the effect of these collocations in two knowledge-based WSD methods. The first method, AEC, uses the knowledge from the UMLS to collect examples from MEDLINE which are used to train a Naïve Bayes approach. The second method, MRD, builds a profile for each candidate sense based on the UMLS and compares the profile to the context of the ambiguous word. We have used two WSD test sets which contain disambiguation cases which are mapped to UMLS concepts. The first one, the NLM WSD set, was developed manually by several domain experts and contains words with high frequency occurrence in MEDLINE. The second one, the MSH WSD set, was developed automatically using the MeSH indexing in MEDLINE. It contains a larger set of words and covers a larger number of UMLS semantic types. RESULTS: The results indicate an improvement after the use of collocations, although the approaches have different performance depending on the data set. In the NLM WSD set, the improvement is larger for the MRD disambiguation method using second-order features. 
Assignment of collocations to a candidate sense based on UMLS semantic group profiles is more effective in the AEC method. In the MSH WSD set, the increment in performance is modest for all the methods. Collocations combined with the MRD disambiguation method have the best performance. The MRD disambiguation method and second-order features provide an insignificant change in performance. The AEC disambiguation method gives a modest improvement in performance. Assignment of collocations to a candidate sense based on knowledge-based methods has better performance. CONCLUSIONS: Collocations improve the performance of knowledge-based disambiguation methods, although results vary depending on the test set and method used. Generally, the AEC method is sensitive to query drift. Using AEC, just a few selected terms provide a large improvement in disambiguation performance. The MRD method handles noisy terms better but requires a larger set of terms to improve performance.

Subject(s)

MEDLINE , Medical Informatics/methods , Natural Language Processing , Terminology as Topic , Unified Medical Language System , Algorithms , Bayes Theorem , Knowledge Bases , Medical Subject Headings , Semantics

6.

Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts.

Plaza, Laura; Jimeno-Yepes, Antonio J; Díaz, Alberto; Aronson, Alan R.

BMC Bioinformatics ; 12: 355, 2011 Aug 26.

Article in English | MEDLINE | ID: mdl-21871110

ABSTRACT

BACKGROUND: Word sense disambiguation (WSD) attempts to solve lexical ambiguities by identifying the correct meaning of a word based on its context. WSD has been demonstrated to be an important step in knowledge-based approaches to automatic summarization. However, the correlation between the accuracy of the WSD methods and the summarization performance has never been studied. RESULTS: We present three existing knowledge-based WSD approaches and a graph-based summarizer. Both the WSD approaches and the summarizer employ the Unified Medical Language System (UMLS) Metathesaurus as the knowledge source. We first evaluate WSD directly, by comparing the prediction of the WSD methods to two reference sets: the NLM WSD dataset and the MSH WSD collection. We next apply the different WSD methods as part of the summarizer, to map documents onto concepts in the UMLS Metathesaurus, and evaluate the summaries that are generated. The results obtained by the different methods in both evaluations are studied and compared. CONCLUSIONS: It has been found that the use of WSD techniques has a positive impact on the results of our graph-based summarizer, and that, when both the WSD and summarization tasks are assessed over large and homogeneous evaluation collections, there exists a correlation between the overall results of the WSD and summarization tasks. Furthermore, the best WSD algorithm in the first task tends to be also the best one in the second. However, we also found that the improvement achieved by the summarizer is not directly correlated with the WSD performance. The most likely reason is that the errors in disambiguation are not equally important but depend on the relative salience of the different concepts in the document to be summarized.

Subject(s)

Algorithms , Natural Language Processing , Unified Medical Language System , Data Mining , Humans , Knowledge Bases

7.

Knowledge-based biomedical word sense disambiguation: comparison of approaches.

Jimeno-Yepes, Antonio J; Aronson, Alan R.

BMC Bioinformatics ; 11: 569, 2010 Nov 22.

Article in English | MEDLINE | ID: mdl-21092226

ABSTRACT

BACKGROUND: Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus to be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes the production of training data infeasible to cover all the domain. METHODS: We present research on existing WSD approaches based on knowledge bases, which complement the studies performed on statistical learning. We compare four approaches which rely on the UMLS Metathesaurus as the source of knowledge. The first approach compares the overlap of the context of the ambiguous word to the candidate senses based on a representation built out of the definitions, synonyms and related terms. The second approach collects training data for each of the candidate senses to perform WSD based on queries built using monosemous synonyms and related terms. These queries are used to retrieve MEDLINE citations. Then, a machine learning approach is trained on this corpus. The third approach is a graph-based method which exploits the structure of the Metathesaurus network of relations to perform unsupervised WSD. This approach ranks nodes in the graph according to their relative structural importance. The last approach uses the semantic types assigned to the concepts in the Metathesaurus to perform WSD. The context of the ambiguous word and semantic types of the candidate concepts are mapped to Journal Descriptors. These mappings are compared to decide among the candidate concepts. Results are provided estimating accuracy of the different methods on the WSD test collection available from the NLM. CONCLUSIONS: We have found that the last approach achieves better results compared to the other methods. 
The graph-based approach, using the structure of the Metathesaurus network to estimate the relevance of the Metathesaurus concepts, does not perform well compared to the first two methods. In addition, the combination of methods improves the performance over the individual approaches. On the other hand, the performance is still below statistical learning trained on manually produced data and below the maximum frequency sense baseline. Finally, we propose several directions to improve the existing methods and to improve the Metathesaurus to be more effective in WSD.
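
The first approach in this abstract compares the context of the ambiguous word with a representation of each candidate sense built from definitions, synonyms and related terms. A minimal Python sketch of that idea follows. The plain shared-word count used as the overlap score, the function names, the sense labels, and the profile strings are assumptions for illustration; the abstract does not state the actual scoring function, and real profiles would be drawn from the UMLS Metathesaurus.

```python
import re
from typing import Dict, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_sense(context: str, profiles: Dict[str, str]) -> Tuple[str, Dict[str, int]]:
    """Return the sense whose profile shares the most words with the context."""
    ctx = tokenize(context)
    scores = {sense: len(ctx & tokenize(profile)) for sense, profile in profiles.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first sense listed
    return best, scores


# Hypothetical sense profiles (stand-ins for definitions and synonyms).
profiles = {
    "common_cold": "viral infection of the upper respiratory tract with cough and sore throat",
    "cold_temperature": "low temperature exposure freezing weather",
}
best, scores = pick_sense("Patients with cold reported cough and sore throat", profiles)
print(best, scores)
```

A raw word count rewards function words such as "with" and "and"; a practical variant would at least remove stop words or weight terms before comparing.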

Subject(s)

Algorithms , Knowledge Bases , Semantics , Artificial Intelligence , MEDLINE , Natural Language Processing , United States , Vocabulary, Controlled

8.

UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text.

Demner-Fushman, Dina; Mork, James G; Shooshan, Sonya E; Aronson, Alan R.

J Biomed Inform ; 43(4): 587-94, 2010 Aug.

Article in English | MEDLINE | ID: mdl-20152935

ABSTRACT

Identification of medical terms in free text is a first step in such Natural Language Processing (NLP) tasks as automatic indexing of biomedical literature and extraction of patients' problem lists from the text of clinical notes. Many tools developed to perform these tasks use biomedical knowledge encoded in the Unified Medical Language System (UMLS) Metathesaurus. We continue our exploration of automatic approaches to creation of subsets (UMLS content views) which can support NLP processing of either the biomedical literature or clinical text. We found that suppression of highly ambiguous terms in the conservative AutoFilter content view can partially replace manual filtering for literature applications, and suppression of two character mappings in the same content view achieves 89.5% precision at 78.6% recall for clinical applications.

Subject(s)

Natural Language Processing , Unified Medical Language System/standards , Information Storage and Retrieval/methods , Publications

9.

A recent advance in the automatic indexing of the biomedical literature.

Névéol, Aurélie; Shooshan, Sonya E; Humphrey, Susanne M; Mork, James G; Aronson, Alan R.

J Biomed Inform ; 42(5): 814-23, 2009 Oct.

Article in English | MEDLINE | ID: mdl-19166973

ABSTRACT

The volume of biomedical literature has experienced explosive growth in recent years. This is reflected in the corresponding increase in the size of MEDLINE, the largest bibliographic database of biomedical citations. Indexers at the US National Library of Medicine (NLM) need efficient tools to help them accommodate the ensuing workload. After reviewing issues in the automatic assignment of Medical Subject Headings (MeSH terms) to biomedical text, we focus more specifically on the new subheading attachment feature for NLM's Medical Text Indexer (MTI). Natural Language Processing, statistical, and machine learning methods of producing automatic MeSH main heading/subheading pair recommendations were assessed independently and combined. The best combination achieves 48% precision and 30% recall. After validation by NLM indexers, a suitable combination of the methods presented in this paper was integrated into MTI as a subheading attachment feature producing MeSH indexing recommendations compliant with current state-of-the-art indexing practice.

Subject(s)

Abstracting and Indexing/methods , Artificial Intelligence , MEDLINE , Medical Subject Headings , Natural Language Processing , Dictionaries, Medical as Topic , Evaluation Studies as Topic , Humans , User-Computer Interface

10.

Spell checker for consumer language (CSpell).

Lu, Chris J; Aronson, Alan R; Shooshan, Sonya E; Demner-Fushman, Dina.

J Am Med Inform Assoc ; 26(3): 211-218, 2019 Mar 01.

Article in English | MEDLINE | ID: mdl-30668712

ABSTRACT

Objective: Automated understanding of consumer health inquiries might be hindered by misspellings. To detect and correct various types of spelling errors in consumer health questions, we developed a distributable spell-checking tool, CSpell, that handles nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of the above. Methods: We developed a novel approach of using dual embedding within Word2vec for context-dependent corrections. This technique was used in combination with dictionary-based corrections in a 2-stage ranking system. We also developed various splitters and handlers to correct word boundary infractions. All correction approaches are integrated to handle errors in consumer health questions. Results: Our approach achieves an F1 score of 80.93% and 69.17% for spelling error detection and correction, respectively. Discussion: The dual-embedding model shows a significant improvement (9.13%) in F1 score compared with the general practice of using cosine similarity with word vectors in Word2vec for context ranking. Our 2-stage ranking system shows a 4.94% improvement in F1 score compared with the best 1-stage ranking system. Conclusion: CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions. The software and the CSpell test set are available at https://umlslex.nlm.nih.gov/cSpell.

Subject(s)

Algorithms , Consumer Health Information , Information Seeking Behavior , Language , Natural Language Processing , Consumer Health Informatics , Humans

11.

Finding medication doses in the literature.

Demner-Fushman, Dina; Mork, James G; Rogers, Willie J; Shooshan, Sonya E; Rodriguez, Laritza; Aronson, Alan R.

AMIA Annu Symp Proc ; 2018: 368-376, 2018.

Article in English | MEDLINE | ID: mdl-30815076

ABSTRACT

Medication doses, one of the determining factors in medication safety and effectiveness, are present in the literature, but only in free-text form. We set out to determine if the systems developed for extracting drug prescription information from clinical text would yield comparable results on scientific literature and if sequence-to-sequence learning with neural networks could improve over the current state-of-the-art. We developed a collection of 694 PubMed Central documents annotated with drug dose information using the i2b2 schema. We found that less than half of the drug doses are present in the MEDLINE/PubMed abstracts, and full-text is needed to identify the other half. We identified the differences in the scope and formatting of drug dose information in the literature and clinical text, which require developing new dose extraction approaches. Finally, we achieved 83.9% recall, 87.2% precision and 85.5% F1 score in extracting complete drug prescription information from the literature.
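
The three figures reported at the end of this abstract are linked by the standard definition of F1 as the harmonic mean of precision and recall. The short Python sketch below checks that the reported precision and recall are consistent with the reported F1 score; the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: 87.2% precision, 83.9% recall.
f1 = f1_score(0.872, 0.839)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.855"
```

The result rounds to 85.5%, matching the F1 score stated in the abstract.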

Subject(s)

Deep Learning , Information Storage and Retrieval/methods , Neural Networks, Computer , Pharmaceutical Preparations/administration & dosage , PubMed , Drug Administration Routes , Drug Administration Schedule , Drug Dosage Calculations , Humans

12.

A dataset of 200 structured product labels annotated for adverse drug reactions.

Demner-Fushman, Dina; Shooshan, Sonya E; Rodriguez, Laritza; Aronson, Alan R; Lang, Francois; Rogers, Willie; Roberts, Kirk; Tonning, Joseph.

Sci Data ; 5: 180001, 2018 Jan 30.

Article in English | MEDLINE | ID: mdl-29381145

ABSTRACT

Adverse drug reactions (ADRs), unintended and sometimes dangerous effects that a drug may have, are one of the leading causes of morbidity and mortality during medical care. To date, there is no structured machine-readable authoritative source of known ADRs. The United States Food and Drug Administration (FDA) partnered with the National Library of Medicine to create a pilot dataset containing standardised information about known adverse reactions for 200 FDA-approved drugs. The Structured Product Labels (SPLs), the documents FDA uses to exchange information about drugs and other products, were manually annotated for adverse reactions at the mention level to facilitate development and evaluation of text mining tools for extraction of ADRs from all SPLs. The ADRs were then normalised to the Unified Medical Language System (UMLS) and to the Medical Dictionary for Regulatory Activities (MedDRA). We present the curation process and the structure of the publicly available database SPL-ADR-200db containing 5,098 distinct ADRs. The database is available at https://bionlp.nlm.nih.gov/tac2017adversereactions/; the code for preparing and validating the data is available at https://github.com/lhncbc/fda-ars.

Subject(s)

Drug Labeling , Drug-Related Side Effects and Adverse Reactions , Databases, Factual , United States , United States Food and Drug Administration

13.

Combining lexical and semantic methods of inter-terminology mapping using the UMLS.

Fung, Kin Wah; Bodenreider, Olivier; Aronson, Alan R; Hole, William T; Srinivasan, Suresh.

Stud Health Technol Inform ; 129(Pt 1): 605-9, 2007.

Article in English | MEDLINE | ID: mdl-17911788

ABSTRACT

The need for inter-terminology mapping is constantly increasing with the growth in the volume of electronically captured biomedical data and the demand to re-use the same data for secondary purposes. Using the UMLS as a knowledge base, semantically-based and lexically-based mappings were generated from SNOMED CT to ICD9CM terms and compared to a gold standard. Semantic mapping performed better than lexical mapping in terms of coverage, recall and precision. As the two mapping methods are orthogonal, the two sets of mappings can be used to validate and enhance each other. A method of combining the mappings based on the precision level of sub-categories in each method was derived. The combined method outperformed both methods, achieving coverage of 91%, recall of 43% and precision of 27%. It is also possible to customize the method of combination to optimize performance according to the task at hand.

Subject(s)

Algorithms , International Classification of Diseases , Unified Medical Language System , Semantics , Vocabulary, Controlled

14.

Using discourse analysis to improve text categorization in MEDLINE.

Ruch, Patrick; Geissbühler, Antoine; Gobeill, Julien; Lisacek, Frederic; Tbahriti, Imad; Veuthey, Anne-Lise; Aronson, Alan R.

Stud Health Technol Inform ; 129(Pt 1): 710-5, 2007.

Article in English | MEDLINE | ID: mdl-17911809

ABSTRACT

PROBLEM: Automatic keyword assignment has been widely studied in medical informatics in the context of the MEDLINE database, both for helping search in MEDLINE and in order to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been proposed using different methods or combinations of methods, including machine learning (naïve Bayes, neural networks, etc.), linguistically-motivated methods (syntactic parsing, semantic tagging), or information retrieval. METHODS: In the present study, we propose to evaluate the impact of the argumentative structures of scientific articles to improve the categorization effectiveness of a categorizer, which combines linguistically-motivated and information retrieval methods. Our argumentative categorizer, which uses representation levels inherited from the field of discourse analysis, is able to classify sentences of an abstract in four classes: PURPOSE, METHODS, RESULTS and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the result of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer. RESULTS: The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately the RESULTS and CONCLUSION sections. CONCLUSION: Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.

Subject(s)

Abstracting and Indexing/methods , MEDLINE , Natural Language Processing , Information Storage and Retrieval , Libraries, Digital , Medical Subject Headings

15.

MetaMap Lite: an evaluation of a new Java implementation of MetaMap.

Demner-Fushman, Dina; Rogers, Willie J; Aronson, Alan R.

J Am Med Inform Assoc ; 24(4): 841-844, 2017 Jul 01.

Article in English | MEDLINE | ID: mdl-28130331

ABSTRACT

MetaMap is a widely used named entity recognition tool that identifies concepts from the Unified Medical Language System Metathesaurus in text. This study presents MetaMap Lite, an implementation of some of the basic MetaMap functions in Java. On several collections of biomedical literature and clinical text, MetaMap Lite demonstrated real-time speed and precision, recall, and F1 scores comparable to or exceeding those of MetaMap and other popular biomedical text processing tools, clinical Text Analysis and Knowledge Extraction System (cTAKES) and DNorm.

Subject(s)

Information Storage and Retrieval/methods , Natural Language Processing , Software , Unified Medical Language System , Algorithms

16.

Comment on 'MeSH-up: effective MeSH text classification for improved document retrieval'.

Névéol, Aurélie; Mork, James G; Aronson, Alan R.

Bioinformatics ; 25(20): 2770-1; author reply 2772, 2009 Oct 15.

Article in English | MEDLINE | ID: mdl-19671694

Subject(s)

Abstracting and Indexing/methods , Computational Biology/methods , Information Storage and Retrieval/methods , Medical Subject Headings , Vocabulary, Controlled

17.

Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide.

Weeber, Marc; Vos, Rein; Klein, Henny; De Jong-Van Den Berg, Lolkje T W; Aronson, Alan R; Molema, Grietje.

J Am Med Inform Assoc ; 10(3): 252-9, 2003.

Article in English | MEDLINE | ID: mdl-12626374

ABSTRACT

The availability of scientific bibliographies through online databases provides a rich source of information for scientists to support their research. However, the risk of this pervasive availability is that an individual researcher may fail to find relevant information that is outside the direct scope of interest. Following Swanson's ABC model of disjoint but complementary structures in the biomedical literature, we have developed a discovery support tool to systematically analyze the scientific literature in order to generate novel and plausible hypotheses. In this case report, we employ the system to find potentially new target diseases for the drug thalidomide. We find solid bibliographic evidence suggesting that thalidomide might be useful for treating acute pancreatitis, chronic hepatitis C, Helicobacter pylori-induced gastritis, and myasthenia gravis. However, experimental and clinical evaluation is needed to validate these hypotheses and to assess the trade-off between therapeutic benefits and toxicities.

Subject(s)

Information Storage and Retrieval/methods , Thalidomide/therapeutic use , Databases, Bibliographic , Humans , Interleukin-12/antagonists & inhibitors , Natural Language Processing , PubMed , Thalidomide/immunology , Unified Medical Language System
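
Swanson's ABC model, mentioned in the abstract above, can be sketched in a few lines of Python. The idea is that a starting concept A and a target concept C never co-occur directly in the literature but are both linked to one or more intermediate concepts B. The sketch below is a generic illustration of that idea, not the authors' discovery support tool; the function name abc_candidates and the placeholder concept names are hypothetical, and the co-occurrence data is invented.

```python
def abc_candidates(cooccur, a):
    """Find C concepts linked to `a` only through intermediate B concepts.

    cooccur: symmetric mapping from a concept to the set of concepts it
    co-occurs with in the literature.
    Returns {c: set of B concepts connecting a and c}.
    """
    b_terms = cooccur.get(a, set())
    candidates = {}
    for b in b_terms:
        for c in cooccur.get(b, set()):
            # Skip A itself and anything already directly linked to A.
            if c != a and c not in b_terms:
                candidates.setdefault(c, set()).add(b)
    return candidates


# Invented toy co-occurrence data with placeholder names.
cooccur = {
    "drug_A": {"b1", "b2", "disease_C3"},
    "b1": {"drug_A", "disease_C1", "disease_C2", "disease_C3"},
    "b2": {"drug_A", "disease_C1"},
    "disease_C1": {"b1", "b2"},
    "disease_C2": {"b1"},
    "disease_C3": {"drug_A", "b1"},
}
hypotheses = abc_candidates(cooccur, "drug_A")
```

Here disease_C3 is excluded because it already co-occurs with drug_A, while disease_C1 is supported by two intermediate concepts. A real system would also need to rank and filter the candidates, which this sketch leaves out.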

18.

The NLM Indexing Initiative's Medical Text Indexer.

Aronson, Alan R; Mork, James G; Gay, Clifford W; Humphrey, Susanne M; Rogers, Willie J.

Stud Health Technol Inform ; 107(Pt 1): 268-72, 2004.

Article in English | MEDLINE | ID: mdl-15360816

ABSTRACT

The Medical Text Indexer (MTI) is a program for producing MeSH indexing recommendations. It is the major product of NLM's Indexing Initiative and has been used in both semi-automated and fully automated indexing environments at the Library since mid 2002. We report here on an experiment conducted with MEDLINE indexers to evaluate MTI's performance and to generate ideas for its improvement as a tool for user-assisted indexing. We also discuss some filtering techniques developed to improve MTI's accuracy for use primarily in automatically producing the indexing for several abstracts collections.

Subject(s)

Abstracting and Indexing/methods , Medical Subject Headings , Natural Language Processing , MEDLINE , National Library of Medicine (U.S.) , Unified Medical Language System , United States

19.

Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis.

Hanauer, David A; Saeed, Mohammed; Zheng, Kai; Mei, Qiaozhu; Shedden, Kerby; Aronson, Alan R; Ramakrishnan, Naren.

J Am Med Inform Assoc ; 21(5): 925-37, 2014.

Article in English | MEDLINE | ID: mdl-24928177

ABSTRACT

OBJECTIVE: We describe experiments designed to determine the feasibility of distinguishing known from novel associations based on a clinical dataset comprised of International Classification of Disease, V.9 (ICD-9) codes from 1.6 million patients by comparing them to associations of ICD-9 codes derived from 20.5 million Medline citations processed using MetaMap. Associations appearing only in the clinical dataset, but not in Medline citations, are potentially novel. METHODS: Pairwise associations of ICD-9 codes were independently identified in both the clinical and Medline datasets, which were then compared to quantify their degree of overlap. We also performed a manual review of a subset of the associations to validate how well MetaMap performed in identifying diagnoses mentioned in Medline citations that formed the basis of the Medline associations. RESULTS: The overlap of associations based on ICD-9 codes in the clinical and Medline datasets was low: only 6.6% of the 3.1 million associations found in the clinical dataset were also present in the Medline dataset. Further, a manual review of a subset of the associations that appeared in both datasets revealed that co-occurring diagnoses from Medline citations do not always represent clinically meaningful associations. DISCUSSION: Identifying novel associations derived from large clinical datasets remains challenging. Medline as a sole data source for existing knowledge may not be adequate to filter out widely known associations. CONCLUSIONS: In this study, novel associations were not readily identified. Further improvements in accuracy and relevance for tools such as MetaMap are needed to realize their expected utility.

Subject(s)

Data Mining , International Classification of Diseases , MEDLINE , Unified Medical Language System , Feasibility Studies , Humans , Natural Language Processing
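
The overlap measurement reported in the abstract above (the share of clinical associations also present in Medline) can be illustrated with a short Python sketch. It assumes, purely for illustration, that an association is any unordered pair of codes appearing together in at least min_count records; the abstract does not say how associations were actually derived. The function names and the placeholder codes X1 to X4 are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_associations(records, min_count=1):
    """Collect unordered code pairs that co-occur in at least min_count records.

    Assumption: plain co-occurrence counting stands in for whatever
    association measure the study used.
    """
    counts = Counter()
    for codes in records:
        # sorted(set(...)) makes each pair canonical and counted once per record.
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


def overlap_fraction(clinical_pairs, medline_pairs):
    """Fraction of clinical associations that also appear in Medline."""
    if not clinical_pairs:
        return 0.0
    return len(clinical_pairs & medline_pairs) / len(clinical_pairs)


# Toy records with placeholder codes, not real data.
clinical = [["X1", "X2", "X3"], ["X2", "X3"]]
medline = [["X1", "X2"], ["X2", "X4"]]

clinical_pairs = pairwise_associations(clinical)
medline_pairs = pairwise_associations(medline)
fraction = overlap_fraction(clinical_pairs, medline_pairs)
```

In this toy case one of the three clinical pairs also occurs in the Medline records. Pairs found only in the clinical set are the ones the abstract calls potentially novel.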

20.

Mining MEDLINE for problems associated with vitamin D.

Demner-Fushman, Dina; Mork, James G; Aronson, Alan R.

AMIA Annu Symp Proc ; 2013: 300-8, 2013.

Article in English | MEDLINE | ID: mdl-24551339

ABSTRACT

This paper presents a two-step approach to generating comprehensive abstractive overviews for biomedical topics. It starts with a sensitivity-maximizing search of MEDLINE/PubMed and MeSH-based filtering of the results that are then processed using NLP methods to extract relations between entities of interest. We evaluate this approach in a case study based on the IOM report on the role of vitamin D in human health. The report defines disorders that serve as health indicators for the role of vitamin D. We evaluate the abstractive overviews generated using MeSH indexing and the extracted relations using the disorders listed in the IOM report as reference standard. We conclude that MeSH-based aggregation and filtering of the results is a useful and easy step in the generation of abstractive overviews. Although our relation extraction achieved 83.6% recall and 92.8% precision, only half of the disorders of interest participated in these relations.

Subject(s)

Abstracting and Indexing , Data Mining , MEDLINE , Medical Subject Headings , Natural Language Processing , Vitamin D , Calcium/administration & dosage , Dietary Supplements/adverse effects , Humans , Vitamin D/administration & dosage , Vitamin D/adverse effects , Vitamin D/blood , Vitamin D Deficiency
