Text-mining of PubMed abstracts by natural language processing to create a public knowledge base on molecular mechanisms of bacterial enteropathogens.

Zaremba, Sam; Ramos-Santacruz, Mila; Hampton, Thomas; Shetty, Panna; Fedorko, Joel; Whitmore, Jon; Greene, John M; Perna, Nicole T; Glasner, Jeremy D; Plunkett, Guy; Shaker, Matthew; Pot, David

Affiliation

Zaremba S; ERIC-BRC, SRA International Inc, Global Health Sector, Rockville, MD 20852, USA. Sam_Zaremba@sra.com

BMC Bioinformatics ; 10: 177, 2009 Jun 10.

Article in English | MEDLINE | ID: mdl-19515247

ABSTRACT

BACKGROUND:

The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) aims to provide bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process.

DESCRIPTION:

We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an average F-measure of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application.
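For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how such a score is conventionally computed from counts of correct, spurious, and missed extractions. The abstract does not state which variant was used, so the balanced form (beta = 1) is an assumption here, and the function name and the counts in the usage line are purely illustrative, not figures from the study.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return the F-measure for one extraction category.

    Assumption: the standard precision/recall definitions apply;
    beta=1.0 gives the balanced F1 score.
    """
    predicted = true_positives + false_positives   # everything the system extracted
    actual = true_positives + false_negatives      # everything in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only (not from the paper):
# 90 correct entities, 10 spurious, 10 missed.
score = f_measure(90, 10, 10)
print(f"F-measure: {score:.2f}")
```

Because the score combines precision and recall, a system can only exceed 90% when both spurious and missed extractions are rare.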

CONCLUSION:

Our Text Mining application is available online on the ERIC website (http://www.ericbrc.org/portal/eric/articles). The information retrieval interface displays a list of recently published enteropathogen literature abstracts and also provides a search interface for custom queries by keyword, date range, etc. When an abstract is selected, the processed text and the entities and relations extracted from it are retrieved from a relational database and marked up to highlight those entities and relations. The displayed abstract also links extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations systems.
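The mark-up step described above can be pictured with a small sketch. It assumes that each extracted entity is stored as a character span with a label; the abstract does not describe the actual storage schema or display format, so the `mark_up` function, the tuple layout, the bracket notation, and the example sentence are all hypothetical.

```python
def mark_up(text, entities):
    """Wrap each entity span in a labelled bracket.

    Assumption: `entities` is a list of (start, end, label) tuples
    giving non-overlapping character offsets into `text`.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])                 # untouched text before the entity
        pieces.append(f"[{label}: {text[start:end]}]")    # highlighted entity
        cursor = end
    pieces.append(text[cursor:])                          # remainder after the last entity
    return "".join(pieces)


# Illustrative sentence and spans (not taken from the ERIC database).
sentence = "The hilA gene regulates invasion in Salmonella."
gene_at = sentence.index("hilA")
organism_at = sentence.index("Salmonella")
spans = [
    (organism_at, organism_at + len("Salmonella"), "ORGANISM"),
    (gene_at, gene_at + len("hilA"), "GENE"),
]
print(mark_up(sentence, spans))
```

Sorting the spans first lets the function accept entities in any order, which matters if they are retrieved from a database without a guaranteed ordering.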

Subject(s)

Abstracting and Indexing; Computational Biology/methods; Enterobacteriaceae; Information Storage and Retrieval; Natural Language Processing; PubMed; Bacterial Physiological Phenomena; Database Management Systems; Databases, Factual; Enterobacteriaceae/genetics; Enterobacteriaceae/pathogenicity; Enterobacteriaceae/physiology; Escherichia coli/genetics; Escherichia coli/pathogenicity; Escherichia coli/physiology; Internet; Salmonella/genetics; Salmonella/pathogenicity; Salmonella/physiology; User-Computer Interface

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Natural Language Processing / Information Storage and Retrieval / Computational Biology / PubMed / Abstracting and Indexing / Enterobacteriaceae | Language: En | Journal: BMC Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year: 2009 | Document type: Article | Country of affiliation: United States
