Search | BVS CLAP/SMR-OPS/OMS

1.

Open Agile text mining for bioinformatics: the PubAnnotation ecosystem.

Kim, Jin-Dong; Wang, Yue; Fujiwara, Toyofumi; Okuda, Shujiro; Callahan, Tiffany J; Cohen, K Bretonnel.

Bioinformatics ; 35(21): 4372-4380, 2019 11 01.

Article in English | MEDLINE | ID: mdl-30937439

ABSTRACT

MOTIVATION: Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users. RESULTS: This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes. AVAILABILITY AND IMPLEMENTATION: The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.

Subject(s)

Computational Biology , Ecosystem , Data Mining , Female , Humans , Natural Language Processing , Pregnancy , PubMed

2.

A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support.

Connolly, Brian; Cohen, K Bretonnel; Santel, Daniel; Bayram, Ulya; Pestian, John.

BMC Bioinformatics ; 18(1): 361, 2017 Aug 07.

Article in English | MEDLINE | ID: mdl-28784111

ABSTRACT

BACKGROUND: Probabilistic assessments of clinical care are essential for quality care. Yet, machine learning, which supports this care process, has been limited to categorical results. To maximize its usefulness, it is important to find novel approaches that calibrate the ML output with a likelihood scale. Current state-of-the-art calibration methods are generally accurate and applicable to many ML models, but improved granularity and accuracy of such methods would increase the information available for clinical decision making. This novel non-parametric Bayesian approach is demonstrated on a variety of data sets, including simulated classifier outputs, biomedical data sets from the University of California, Irvine (UCI) Machine Learning Repository, and a clinical data set built to determine suicide risk from the language of emergency department patients. RESULTS: The method is first demonstrated on support-vector machine (SVM) models, which generally produce well-behaved, well-understood scores. The method produces calibrations that are comparable to the state-of-the-art Bayesian Binning in Quantiles (BBQ) method when the SVM models are able to effectively separate cases and controls. However, as the SVM models' ability to discriminate classes decreases, our approach yields more granular and dynamic calibrated probabilities compared to the BBQ method. Improvements in granularity and range are even more dramatic when the discrimination between the classes is artificially degraded by replacing the SVM model with an ad hoc k-means classifier. CONCLUSIONS: The method allows both clinicians and patients to have a more nuanced view of the output of an ML model, allowing better decision making. The method is demonstrated on simulated data, various biomedical data sets and a clinical data set, to which diverse ML methods are applied. Trivially extending the method to (non-ML) clinical scores is also discussed.

Subject(s)

Clinical Decision Support Systems , Machine Learning , Adolescent , Bayes Theorem , Calibration , Clinical Decision Support Systems/standards , Humans , Nonparametric Statistics , Suicide , Support Vector Machine

3.

Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles.

Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young; Bada, Michael; Baumgartner, William A; Panteleyeva, Natalya; Verspoor, Karin; Palmer, Martha; Hunter, Lawrence E.

BMC Bioinformatics ; 18(1): 372, 2017 Aug 17.

Article in English | MEDLINE | ID: mdl-28818042

ABSTRACT

BACKGROUND: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations. RESULTS: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached F of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus. CONCLUSIONS: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature.
The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain due to their referents to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.

Subject(s)

Data Mining/methods , Periodicals as Topic , Semantics

4.

Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters.

Funk, Christopher; Baumgartner, William; Garcia, Benjamin; Roeder, Christophe; Bada, Michael; Cohen, K Bretonnel; Hunter, Lawrence E; Verspoor, Karin.

BMC Bioinformatics ; 15: 59, 2014 Feb 26.

Article in English | MEDLINE | ID: mdl-24571547

ABSTRACT

BACKGROUND: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. RESULTS: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented. CONCLUSIONS: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Out of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure on seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. In addition to the best-performing parameters, suggestions for choosing the best parameters based on ontology characteristics are presented.

Subject(s)

Biological Ontologies , Data Mining/methods , Factual Databases , Reproducibility of Results

5.

Mining the pharmacogenomics literature--a survey of the state of the art.

Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H.

Brief Bioinform ; 13(4): 460-94, 2012 Jul.

Article in English | MEDLINE | ID: mdl-22833496

ABSTRACT

This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

Subject(s)

Data Mining/methods , Pharmacogenetics , Data Collection , Information Storage and Retrieval , Natural Language Processing , Publications , Semantics

6.

Chapter 16: text mining for translational bioinformatics.

Cohen, K Bretonnel; Hunter, Lawrence E.

PLoS Comput Biol ; 9(4): e1003044, 2013 Apr.

Article in English | MEDLINE | ID: mdl-23633944

ABSTRACT

Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

Subject(s)

Computational Biology/methods , Data Mining/methods , Algorithms , Animals , Artificial Intelligence , Computer Simulation , Humans , Phenotype , Programming Languages , Software , Translational Biomedical Research

7.

Concept annotation in the CRAFT corpus.

Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E.

BMC Bioinformatics ; 13: 161, 2012 Jul 09.

Article in English | MEDLINE | ID: mdl-22776079

ABSTRACT

BACKGROUND: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. RESULTS: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. CONCLUSIONS: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Subject(s)

Data Mining , Natural Language Processing , Controlled Vocabulary , Computational Biology/methods , Factual Databases , Information Storage and Retrieval/methods , Semantics

8.

U-Compare bio-event meta-service: compatible BioNLP event extraction services.

Kano, Yoshinobu; Björne, Jari; Ginter, Filip; Salakoski, Tapio; Buyko, Ekaterina; Hahn, Udo; Cohen, K Bretonnel; Verspoor, Karin; Roeder, Christophe; Hunter, Lawrence E; Kilicoglu, Halil; Bergler, Sabine; Van Landeghem, Sofie; Van Parys, Thomas; Van de Peer, Yves; Miwa, Makoto; Ananiadou, Sophia; Neves, Mariana; Pascual-Montano, Alberto; Özgür, Arzucan; Radev, Dragomir R; Riedel, Sebastian; Sætre, Rune; Chun, Hong-Woo; Kim, Jin-Dong; Pyysalo, Sampo; Ohta, Tomoko; Tsujii, Jun'ichi.

BMC Bioinformatics ; 12: 481, 2011 Dec 18.

Article in English | MEDLINE | ID: mdl-22177292

ABSTRACT

BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.

Subject(s)

Data Mining , Computer Systems , Periodicals as Topic , Software

9.

The structural and content aspects of abstracts versus bodies of full text journal articles are different.

Cohen, K Bretonnel; Johnson, Helen L; Verspoor, Karin; Roeder, Christophe; Hunter, Lawrence E.

BMC Bioinformatics ; 11: 492, 2010 Sep 29.

Article in English | MEDLINE | ID: mdl-20920264

ABSTRACT

BACKGROUND: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. RESULTS: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. CONCLUSIONS: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. 
However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.

Subject(s)

Abstracting and Indexing/methods , Periodicals as Topic , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Terminology as Topic

10.

Ontology quality assurance through analysis of term transformations.

Verspoor, Karin; Dvorkin, Daniel; Cohen, K Bretonnel; Hunter, Lawrence.

Bioinformatics ; 25(12): i77-84, 2009 Jun 15.

Article in English | MEDLINE | ID: mdl-19478020

ABSTRACT

MOTIVATION: It is important for the quality of biological ontologies that similar concepts be expressed consistently, or univocally. Univocality is relevant for the usability of the ontology for humans, as well as for computational tools that rely on regularity in the structure of terms. However, in practice terms are not always expressed consistently, and we must develop methods for identifying terms that are not univocal so that they can be corrected. RESULTS: We developed an automated transformation-based clustering methodology for detecting terms that use different linguistic conventions for expressing similar semantics. These term sets represent occurrences of univocality violations. Our method was able to identify 67 examples of univocality violations in the Gene Ontology. AVAILABILITY: The identified univocality violations are available upon request. We are preparing a release of an open source version of the software to be available at http://bionlp.sourceforge.net.

Subject(s)

Computational Biology/methods , Controlled Vocabulary , Cluster Analysis , Factual Databases , Information Storage and Retrieval/methods , Quality Control , Software

11.

U-Compare: share and compare text mining tools with UIMA.

Kano, Yoshinobu; Baumgartner, William A; McCrohon, Luke; Ananiadou, Sophia; Cohen, K Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi.

Bioinformatics ; 25(15): 1997-8, 2009 Aug 01.

Article in English | MEDLINE | ID: mdl-19414535

ABSTRACT

SUMMARY: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. AVAILABILITY: http://u-compare.org/

Subject(s)

Computational Biology/methods , Software , Factual Databases , Information Storage and Retrieval , User-Computer Interface

12.

The textual characteristics of traditional and Open Access scientific journals are similar.

Verspoor, Karin; Cohen, K Bretonnel; Hunter, Lawrence.

BMC Bioinformatics ; 10: 183, 2009 Jun 15.

Article in English | MEDLINE | ID: mdl-19527520

ABSTRACT

BACKGROUND: Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption. RESULTS: We collected two sets of documents, one consisting only of Open Access publications and the other consisting only of traditional journal publications. We examined them for differences in surface linguistic structures that have obvious consequences for the ease or difficulty of natural language processing and for differences in semantic content as reflected in lexical items. Regarding surface linguistic structures, we examined the incidence of conjunctions, negation, passives, and pronominal anaphora, and found that the two collections did not differ. We also examined the distribution of sentence lengths and found that both collections were characterized by the same mode. Regarding lexical items, we found that the Kullback-Leibler divergence between the two collections was low, and was lower than the divergence between either collection and a reference corpus. Where small differences did exist, log likelihood analysis showed that they were primarily in the area of formatting and in specific named entities. CONCLUSION: We did not find structural or semantic differences between the Open Access and traditional journal collections.
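
The lexical comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy sentences, the add-alpha smoothing, the choice of base-2 logarithms and the function names `word_distribution` and `kl_divergence` are all assumptions made for illustration. It computes the Kullback-Leibler divergence between smoothed unigram distributions and compares two similar collections against an unrelated reference text.

```python
import math
from collections import Counter

def word_distribution(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; p and q must be defined over the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

# Toy stand-ins for the two journal collections and a reference corpus
# (hypothetical data, not taken from the study).
open_access = "the gene is expressed in the cell".split()
traditional = "the protein is expressed in the cell".split()
reference = "the weather was fine and the train was late".split()

vocab = set(open_access) | set(traditional) | set(reference)
p_oa = word_distribution(open_access, vocab)
p_trad = word_distribution(traditional, vocab)
p_ref = word_distribution(reference, vocab)

kl_oa_trad = kl_divergence(p_oa, p_trad)  # divergence between the two collections
kl_oa_ref = kl_divergence(p_oa, p_ref)    # divergence from the reference corpus
print(kl_oa_trad, kl_oa_ref)
```

Smoothing is needed because the divergence is undefined when a word has zero probability in the second distribution. The divergence is also asymmetric, so the direction of comparison should be reported alongside the value.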

Subject(s)

Access to Information , Linguistics , Periodicals as Topic , Biomedical Research , Bibliographic Databases , Natural Language Processing

13.

Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD).

Wiegers, Thomas C; Davis, Allan Peter; Cohen, K Bretonnel; Hirschman, Lynette; Mattingly, Carolyn J.

BMC Bioinformatics ; 10: 326, 2009 Oct 08.

Article in English | MEDLINE | ID: mdl-19814812

ABSTRACT

BACKGROUND: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage. RESULTS: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking). CONCLUSION: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
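
The mean average precision figures reported in this abstract can be made concrete with a short sketch. It is an illustration only: the relevance lists are invented, the function names are hypothetical, and treating a document as "relevant" when it contains curatable interactions is an assumption about how the metric was applied.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list of booleans (True = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item occurs.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-ranking average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Hypothetical rankings of the same four documents before and after re-ranking.
baseline = [False, True, False, True]
reranked = [True, True, False, False]

print(average_precision(baseline))   # 0.5
print(average_precision(reranked))   # 1.0
print(mean_average_precision([baseline, reranked]))  # 0.75
```

Moving relevant documents toward the top of the list raises average precision even though the set of documents is unchanged, which is why the metric suits an evaluation of re-ranking for curation.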

Subject(s)

Computational Biology/methods , Factual Databases , Gene Regulatory Networks , Information Storage and Retrieval/methods , Toxicogenetics

14.

P-Hacking Lexical Richness Through Definitions of "Type" and "Token".

Cohen, K Bretonnel; Hunter, Lawrence E; Pressman, Peter S.

Stud Health Technol Inform ; 264: 1433-1434, 2019 Aug 21.

Article in English | MEDLINE | ID: mdl-31438167

ABSTRACT

"P-hacking" is the repeated analysis of data until a statistically significant result is achieved. We show that p-hacking can also occur during data generation, sometimes unintentionally. We use the type-token ratio to demonstrate that differences in the definitions of "type" and "token" can produce significantly different results. Since these terms are rarely defined in the biomedical literature, the result is an inability to meaningfully interpret the body of literature that makes use of this measure.

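The effect described in this abstract can be sketched in a few lines. This is a hedged illustration, not the authors' procedure: the sample sentence, the regular-expression tokenizer and the `lowercase` and `keep_punctuation` options are assumptions chosen to show how the type-token ratio shifts when "type" and "token" are defined differently.

```python
import re

def type_token_ratio(text, lowercase=True, keep_punctuation=False):
    """Return types/tokens under one explicit definition of each."""
    # Tokens: runs of word characters, optionally plus single punctuation marks.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text)
    # Types: distinct tokens, optionally after case folding.
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "The gene is expressed. The protein is not expressed, the gene is."

ttr_folded = type_token_ratio(sample, lowercase=True)    # 6 types / 12 tokens
ttr_cased = type_token_ratio(sample, lowercase=False)    # 7 types / 12 tokens
ttr_punct = type_token_ratio(sample, lowercase=True,
                             keep_punctuation=True)      # 8 types / 15 tokens
print(ttr_folded, ttr_cased, ttr_punct)
```

The same text yields three different ratios, so a reported type-token ratio is interpretable only when the tokenization and case-handling choices are stated.
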
Subject(s)

Computer Security , Vocabulary

15.

OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression.

Hunter, Lawrence; Lu, Zhiyong; Firby, James; Baumgartner, William A; Johnson, Helen L; Ogren, Philip V; Cohen, K Bretonnel.

BMC Bioinformatics ; 9: 78, 2008 Jan 31.

Article in English | MEDLINE | ID: mdl-18237434

ABSTRACT

BACKGROUND: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering. RESULTS: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26-.72 (precision .39-.85, recall .16-.85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances. CONCLUSION: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. 
The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/

Subject(s)

Gene Expression/physiology , Information Storage and Retrieval/methods , Natural Language Processing , Periodicals as Topic , Protein Interaction Mapping/methods , Protein Transport/physiology , Software , Algorithms , Cell Physiological Phenomena , Health Knowledge, Attitudes, Practice , Biological Models

16.

Habitat-Lite: a GSC case study based on free text terms for environmental metadata.

Hirschman, Lynette; Clark, Cheryl; Cohen, K Bretonnel; Mardis, Scott; Luciano, Joanne; Kottmann, Renzo; Cole, James; Markowitz, Victor; Kyrpides, Nikos; Morrison, Norman; Schriml, Lynn M; Field, Dawn.

OMICS ; 12(2): 129-36, 2008 Jun.

Article in English | MEDLINE | ID: mdl-18416669

ABSTRACT

There is an urgent need to capture metadata on the rapidly growing number of genomic, metagenomic and related sequences, such as 16S ribosomal genes. This need is a major focus within the Genomic Standards Consortium (GSC), and Habitat is a key metadata descriptor in the proposed "Minimum Information about a Genome Sequence" (MIGS) specification. The goal of the work described here is to provide a light-weight, easy-to-use (small) set of terms ("Habitat-Lite") that captures high-level information about habitat while preserving a mapping to the recently launched Environment Ontology (EnvO). Our motivation for building Habitat-Lite is to meet the needs of multiple users, such as annotators curating these data, database providers hosting the data, and biologists and bioinformaticians alike who need to search and employ such data in comparative analyses. Here, we report a case study based on semiautomated identification of terms from GenBank and GOLD. We estimate that the terms in the initial version of Habitat-Lite would provide useful labels for over 60% of the kinds of information found in the GenBank isolation_source field, and around 85% of the terms in the GOLD habitat field. We present a revised version of Habitat-Lite defined within the EnvO Environmental Ontology through a new category, EnvO-Lite-GSC. We invite the community's feedback on its further development to provide a minimum list of terms to capture high-level habitat information and to provide classification bins needed for future studies.

Subject(s)

Genomics , Databases, Genetic , Reference Standards

17.

Manual curation is not sufficient for annotation of genomic databases.

Baumgartner, William A; Cohen, K Bretonnel; Fox, Lynne M; Acquaah-Mensah, George; Hunter, Lawrence.

Bioinformatics ; 23(13): i41-8, 2007 Jul 01.

Article in English | MEDLINE | ID: mdl-17646325

ABSTRACT

MOTIVATION: Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents. RESULTS: Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.

Subject(s)

Chromosome Mapping/methods , Databases, Protein , Documentation/methods , Genomics/methods , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein/methods

18.

MutationFinder: a high-performance system for extracting point mutation mentions from text.

Caporaso, J Gregory; Baumgartner, William A; Randolph, David A; Cohen, K Bretonnel; Hunter, Lawrence.

Bioinformatics ; 23(14): 1862-5, 2007 Jul 15.

Article in English | MEDLINE | ID: mdl-17495998

ABSTRACT

Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline. AVAILABILITY: MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications. PROJECT URL: http://bionlp.sourceforge.net.
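
To give a sense of what rule-based extraction of point mutation mentions can look like, here is a minimal sketch in Python. It is not MutationFinder's rule set or API: the pattern, the name `find_point_mutations`, and the restriction to single-letter mentions such as "A42G" are all assumptions made for illustration.

```python
import re

# Illustrative pattern only; NOT taken from MutationFinder's actual rules.
# Matches wild-type residue, 1-based position, mutant residue (e.g. "A42G").
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples for mentions like 'A42G'."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


# Hypothetical input sentence, written for this example.
mentions = find_point_mutations("The A42G and W1282R substitutions were tested.")
```

A single pattern like this misses three-letter and spelled-out forms (for example "Ala42Gly") and can match strings that are not mutations at all, which is one reason a real system needs a larger rule set and a gold standard for evaluation.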

Subject(s)

Computational Biology/methods , Mutation , Point Mutation , Algorithms , DNA Mutational Analysis , Databases, Bibliographic , Databases, Genetic , Databases, Protein , Genetic Techniques , Humans , Pattern Recognition, Automated , Publications , Reproducibility of Results , Software

19.

Improving precision in concept normalization.

Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A; Hunter, Lawrence E.

Pac Symp Biocomput ; 23: 566-577, 2018.

Article in English | MEDLINE | ID: mdl-29218915

ABSTRACT

Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases for natural language processing, there are reasons to prefer to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision, using a variety of both pre-processing and post-processing methods. They draw on both knowledge-based and frequentist approaches to modeling language. Based on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. Which approaches did and did not improve precision varied widely between the ontologies.

Subject(s)

Natural Language Processing , Biological Ontologies/statistics & numerical data , Computational Biology/methods , Data Mining/methods , Electronic Health Records/statistics & numerical data , False Positive Reactions , Humans , Precision Medicine/statistics & numerical data , PubMed/statistics & numerical data , Reproducibility of Results

20.

A Modular Ontology for Modeling Service Provision in a Communication Network for Coordination of Care.

Cardoso, Sonia; Aimé, Xavier; Meininger, Vincent; Grabli, David; Melo Mora, Luis Felipe; Cohen, K Bretonnel; Charlet, Jean.

Stud Health Technol Inform ; 247: 890-894, 2018.

Article in English | MEDLINE | ID: mdl-29678089

ABSTRACT

This paper presents a modular ontology of health care in the context of Amyotrophic Lateral Sclerosis. Four modules cover socio-environmental, medical, and care coordination aspects of the domain; they are organized by a core module. The goal of the ontology is to understand interruptions in health care provision in the context of a neurodegenerative disease.

Subject(s)

Amyotrophic Lateral Sclerosis/therapy , Communication , Disease Management , Humans
