Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.

Comeau, Donald C; Liu, Haibin; Islamaj Dogan, Rezarta; Wilbur, W John

Affiliation

Comeau DC; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA comeau@ncbi.nlm.nih.gov.
Liu H; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Islamaj Dogan R; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Wilbur WJ; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Database (Oxford); 2014.

Article in English | MEDLINE | ID: mdl-24935050

ABSTRACT

BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages, C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated, along with other BioC programs, into any BioC-compliant text mining system. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines ran successfully on this corpus, demonstrating their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net.
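
To illustrate the first two pipeline stages named in the abstract (sentence segmentation and tokenization), here is a minimal Python sketch. It is not the authors' C++ or Java code and does not use the BioC libraries, MedPost, or the Stanford tools: the `Sentence` and `Annotation` classes, the function names, and the naive regular-expression splitter are all hypothetical stand-ins. The one idea it borrows from BioC is that each sentence and annotation records the character offset of its text in the original passage.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical, simplified stand-in for a BioC annotation."""
    offset: int                      # character offset in the original passage
    text: str                        # the annotated span
    infons: dict = field(default_factory=dict)  # key/value metadata


@dataclass
class Sentence:
    """Hypothetical, simplified stand-in for a BioC sentence."""
    offset: int
    text: str
    annotations: list = field(default_factory=list)


def split_sentences(passage_text, passage_offset=0):
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?'."""
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", passage_text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the offset points at the first character.
        lead = len(raw) - len(raw.lstrip())
        sentences.append(
            Sentence(passage_offset + match.start() + lead, stripped)
        )
    return sentences


def tokenize(sentence):
    """Naive tokenizer (assumption): word runs or single punctuation marks."""
    for match in re.finditer(r"\w+|[^\w\s]", sentence.text):
        sentence.annotations.append(
            Annotation(
                offset=sentence.offset + match.start(),
                text=match.group(),
                infons={"type": "token"},
            )
        )
    return sentence


def run_pipeline(passage_text, passage_offset=0):
    """Segment a passage into sentences, then tokenize each sentence."""
    return [tokenize(s) for s in split_sentences(passage_text, passage_offset)]


if __name__ == "__main__":
    passage = "BioC is simple. It shares text."
    for sent in run_pipeline(passage):
        print(sent.offset, repr(sent.text))
        for ann in sent.annotations:
            print("   ", ann.offset, repr(ann.text))
```

Because every token keeps an offset into the original passage, later stages such as part-of-speech tagging or lemmatization could attach their results to the same spans without altering the text. A real pipeline would replace the regular expressions with a proper segmenter and tokenizer.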

Subject(s)

Biological Ontologies; Computational Biology; Data Mining/methods; Databases, Factual; Disease; Natural Language Processing; Humans; United States

Full text: 1
Collection: 01-international
Database: MEDLINE
Main subject: Natural Language Processing / Disease / Databases, Factual / Computational Biology / Data Mining / Biological Ontologies
Limits: Humans
Country/Region as subject: North America
Language: English
Journal: Database (Oxford)
Year: 2014
Document type: Article
Affiliation country: United States
