Search | BVS Regional Portal

BioInfer: a corpus for information extraction in the biomedical domain.

Pyysalo, Sampo; Ginter, Filip; Heimonen, Juho; Björne, Jari; Boberg, Jorma; Järvinen, Jouni; Salakoski, Tapio.

BMC Bioinformatics ; 8: 50, 2007 Feb 09.

Article in English | MEDLINE | ID: mdl-17291334

ABSTRACT

BACKGROUND: Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. RESULTS: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. CONCLUSION: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at http://www.it.utu.fi/BioInfer.

Subject(s)

Database Management Systems , Databases, Factual , Documentation/methods , Genes , Information Storage and Retrieval/methods , Natural Language Processing , Periodicals as Topic , Proteins/classification , RNA/classification , Terminology as Topic
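
The abstract above describes three annotation layers attached to the same sentence: named entities, relationships between them, and syntactic dependencies. A minimal sketch of how one such sentence could be held in memory follows. All class names, field names, the relationship type "BIND", and the dependency labels are hypothetical illustrations; they are not taken from the BioInfer distribution or its ontologies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    id: str                # hypothetical identifier, e.g. "e1"
    type: str              # hypothetical entity type, e.g. "Protein"
    token_indices: tuple   # positions of the entity's tokens in the sentence


@dataclass(frozen=True)
class Relationship:
    type: str              # hypothetical relationship type, e.g. "BIND"
    arguments: tuple       # ids of the participating entities


@dataclass(frozen=True)
class Dependency:
    head: int              # token index of the head word
    dependent: int         # token index of the dependent word
    label: str             # hypothetical dependency label


@dataclass
class Sentence:
    tokens: list
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)

    def entity_text(self, entity_id):
        """Return the surface text of the entity with the given id."""
        entity = next(e for e in self.entities if e.id == entity_id)
        return " ".join(self.tokens[i] for i in entity.token_indices)


# One illustrative sentence carrying all three annotation layers.
sentence = Sentence(
    tokens=["Profilin", "binds", "actin", "."],
    entities=[
        Entity("e1", "Protein", (0,)),
        Entity("e2", "Protein", (2,)),
    ],
    relationships=[Relationship("BIND", ("e1", "e2"))],
    dependencies=[
        Dependency(head=1, dependent=0, label="subj"),
        Dependency(head=1, dependent=2, label="obj"),
    ],
)

for rel in sentence.relationships:
    print(rel.type, [sentence.entity_text(a) for a in rel.arguments])
```

Keeping the three layers on one object mirrors the point made in the abstract: a relationship can be traced back to both the entities it involves and the syntax that expresses it.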

Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions.

Pyysalo, Sampo; Ginter, Filip; Pahikkala, Tapio; Boberg, Jorma; Järvinen, Jouni; Salakoski, Tapio.

Int J Med Inform ; 75(6): 430-42, 2006 Jun.

Article in English | MEDLINE | ID: mdl-16099201

ABSTRACT

We present an evaluation of Link Grammar and Connexor Machinese Syntax, two major broad-coverage dependency parsers, on a custom hand-annotated corpus consisting of sentences regarding protein-protein interactions. In the evaluation, we apply the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parsers for recovery of individual dependencies, fully correct parses, and interaction subgraphs. For Link Grammar, an open system that can be inspected in detail, we further perform a comprehensive failure analysis, report specific causes of error, and suggest potential modifications to the grammar. We find that both parsers perform worse on biomedical English than previously reported on general English. While Connexor Machinese Syntax significantly outperforms Link Grammar, the failure analysis suggests specific ways in which the latter could be modified for better performance in the domain.

Subject(s)

Abstracting and Indexing/methods , Databases, Bibliographic , Natural Language Processing , Periodicals as Topic , Protein Interaction Mapping , Terminology as Topic , Vocabulary, Controlled , Artificial Intelligence , Database Management Systems , Information Storage and Retrieval/methods
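
The abstract above names three levels at which a parse is scored: individual dependencies, fully correct parses, and interaction subgraphs. The sketch below illustrates those three measures under simplifying assumptions: a dependency is an unlabeled (head, dependent) pair of token indices, and the interaction subgraph is supplied by hand as a subset of the gold dependencies. The function names and the example parse are hypothetical, and the paper's exact definitions may differ.

```python
def dependency_recall(gold, predicted):
    """Fraction of gold dependencies that the parser recovered."""
    gold, predicted = set(gold), set(predicted)
    if not gold:
        return 1.0
    return len(gold & predicted) / len(gold)


def is_fully_correct(gold, predicted):
    """True only if the parse matches the gold dependencies exactly."""
    return set(gold) == set(predicted)


def subgraph_recovered(subgraph, predicted):
    """True if every dependency of the interaction subgraph was recovered."""
    return set(subgraph) <= set(predicted)


# Tokens: 0 "Profilin", 1 "binds", 2 "actin", 3 "strongly"
gold = [(1, 0), (1, 2), (1, 3)]
# Hypothetical parser output that misattaches the adverb.
predicted = [(1, 0), (1, 2), (2, 3)]
# Dependencies that express the protein-protein interaction itself.
interaction_subgraph = [(1, 0), (1, 2)]

recall = dependency_recall(gold, predicted)
fully_correct = is_fully_correct(gold, predicted)
interaction_ok = subgraph_recovered(interaction_subgraph, predicted)
print(recall, fully_correct, interaction_ok)
```

In this example the parse is not fully correct, yet the interaction subgraph is intact, which is why the subgraph measure can be more informative than whole-parse accuracy for information extraction.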

Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation.

Pahikkala, Tapio; Ginter, Filip; Boberg, Jorma; Järvinen, Jouni; Salakoski, Tapio.

BMC Bioinformatics ; 6: 157, 2005 Jun 22.

Article in English | MEDLINE | ID: mdl-15972097

ABSTRACT

BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSION: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.

Subject(s)

Computational Biology/methods , Databases, Bibliographic , Proteins/chemistry , Algorithms , Area Under Curve , Artificial Intelligence , Bayes Theorem , Cluster Analysis , Computing Methodologies , Data Interpretation, Statistical , Genome, Human , Humans , Models, Statistical , Names , Neural Networks, Computer , Normal Distribution , Pattern Recognition, Automated , ROC Curve , Sequence Analysis, Protein , Software
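
The abstract above describes weighting context words by their distance from the word to be disambiguated, but it does not state the weighting function. The sketch below assumes a simple inverse-distance weight (1/d) and a fixed context window, purely for illustration; the function name, the window size, and the weight function are hypothetical and are not the authors' scheme. The resulting feature dictionary is the kind of input that could then be vectorized and passed to an SVM.

```python
from collections import defaultdict


def weighted_context_features(tokens, target_index, window=3,
                              weight=lambda distance: 1.0 / distance):
    """Build a bag-of-words feature map for the word at target_index.

    Each context word contributes weight(distance) instead of a flat
    count, so nearer words influence the classifier more than distant ones.
    The default 1/distance weight is an assumption made for this sketch.
    """
    features = defaultdict(float)
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    for i in range(start, end):
        if i == target_index:
            continue  # the ambiguous name itself is not a context feature
        features[tokens[i].lower()] += weight(abs(i - target_index))
    return dict(features)


tokens = ["expression", "of", "the", "SRF", "gene", "was", "induced"]
features = weighted_context_features(tokens, target_index=3, window=2)
print(features)
```

With a window of 2, the adjacent words "the" and "gene" receive weight 1.0, the words two positions away receive 0.5, and "expression" falls outside the window.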
