Search | BVS España Search Portal

The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011.

Kim, Jin-Dong; Nguyen, Ngan; Wang, Yue; Tsujii, Jun'ichi; Takagi, Toshihisa; Yonezawa, Akinori.

BMC Bioinformatics ; 13 Suppl 11: S1, 2012 Jun 26.

Article in English | MEDLINE | ID: mdl-22759455

ABSTRACT

BACKGROUND: The Genia task, when it was introduced in 2009, was the first community-wide effort to address fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of the BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009 and to evaluate how well the technology generalizes to full-text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by one of the lessons of the 2009 task: the abundance of coreference structures in natural language text hinders further improvement on the Genia task. RESULTS: The Genia task received final submissions from 15 teams. The results show that the community has made significant progress: the best F-score reached 74% in extracting bio-molecular events of simple structure (e.g., gene expression) and 45% to 48% in extracting those of complex structure (e.g., regulation). The Protein Coreference task received 6 final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain (50% vs. 66% in MUC score). For protein coreference resolution in particular, the best system achieved an F-score of 34%. CONCLUSIONS: Detailed analysis of the results improves our insight into the problem and suggests directions for further improvement.

Subject(s)

Information Systems; Natural Language Processing; Proteins/chemistry; Congresses as Topic; Gene Expression; Proteins/genetics; Proteins/metabolism

Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering.

Yamaguchi, Atsuko; Yamamoto, Yasunori; Kim, Jin-Dong; Takagi, Toshihisa; Yonezawa, Akinori.

BMC Genomics ; 13 Suppl 3: S8, 2012 Jun 11.

Article in English | MEDLINE | ID: mdl-22759617

ABSTRACT

BACKGROUND: Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. To demonstrate this difficulty clearly, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names. RESULTS: Our experimental results revealed the following: (1) The edit distance performed best in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance was the best measure for matching pairs of terms in their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs between chemical names and non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, the Monge-Elkan score, or the bigram Dice coefficient are better than those obtained for non-chemical names, the opposite holds when using SoftTFIDF. (4) A suitable weight for chemical names differs substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distance improve further when a set of full forms is divided into two subsets according to whether a full form is a chemical name or not. These results support our hypothesis and show that we can significantly improve the performance of abbreviation-full form clustering by processing chemical names and non-chemical names separately.
CONCLUSIONS: The discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
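
As a rough illustration of the idea in this abstract, the following Python sketch computes two of the measures named above (a normalised edit distance and the bigram Dice coefficient) and applies a different matching threshold depending on whether a full form is a chemical name. The function names, the normalisation by the longer string, and the threshold values are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; dividing by the longer length is an assumption."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def bigrams(s: str) -> list:
    """All overlapping character bigrams of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_bigram(a: str, b: str) -> float:
    """Bigram Dice coefficient: 2 * shared bigrams / total bigrams (multiset counts)."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((ba & bb).values()) / total


# Hypothetical thresholds: placeholders, not values reported in the paper.
THRESHOLDS = {"chemical": 0.95, "non_chemical": 0.8}


def same_cluster(a: str, b: str, is_chemical: bool, thresholds=THRESHOLDS) -> bool:
    """Decide whether two full forms match, using a class-specific threshold."""
    key = "chemical" if is_chemical else "non_chemical"
    return edit_similarity(a.lower(), b.lower()) >= thresholds[key]
```

With these placeholder thresholds, "1-propanol" and "2-propanol" (one character apart, yet different compounds) are kept apart when treated as chemical names, but would be merged under the looser non-chemical threshold. How a full form is classified as chemical is left outside this sketch.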

Subject(s)

Algorithms; Cluster Analysis; Information Storage and Retrieval/methods; Pattern Recognition, Automated/methods; Chemistry; Computational Biology/methods; Reproducibility of Results; Terminology as Topic

Building Linked Open Data towards integration of biomedical scientific literature with DBpedia.

Yamamoto, Yasunori; Yamaguchi, Atsuko; Yonezawa, Akinori.

J Biomed Semantics ; 4(1): 8, 2013 Mar 13.

Article in English | MEDLINE | ID: mdl-23497538

ABSTRACT

BACKGROUND: There is a growing need for efficient and integrated access to databases provided by diverse institutions. Using a linked data design pattern allows the diverse data on the Internet to be linked effectively and accessed efficiently by computers. Previously, we developed the Allie database, which stores pairs of abbreviations and long forms (LFs, or expanded forms) used in the life sciences. LFs define the semantics of abbreviations, and Allie provides a Web-based search service for researchers to look up the LF of an unfamiliar abbreviation. This service encounters two problems. First, it does not display each LF's definition, which could help the user to disambiguate and learn the abbreviations more easily. Second, there are too many LFs for us to prepare a full dictionary from scratch. On the other hand, DBpedia has made the contents of Wikipedia available in the Resource Description Framework (RDF), and it is expected to contain a significant number of entries corresponding to LFs. Therefore, linking the Allie LFs to DBpedia entries may present a solution to Allie's problems. This requires a method that is capable of matching large numbers of string pairs within a reasonable period of time because Allie and DBpedia are frequently updated. RESULTS: We built a Linked Open Data set that links LFs to DBpedia titles by applying key collision methods (i.e., fingerprint and n-gram fingerprint), which are simple approximate string-matching methods, to their literals. In addition, we used UMLS resources to normalise the life science terms. Combining the key collision methods with the domain-specific resources performed best, and 44,027 LFs have links to DBpedia titles. We manually evaluated the accuracy of the string matching by randomly sampling 1200 LFs, and our approach achieved an F-measure of 0.98. In addition, our experiments revealed the following: (1) Performance was similar regardless of the frequency of the LFs in MEDLINE. (2) There is a relationship (r² = 0.96, P < 0.01) between the occurrence frequencies of LFs in MEDLINE and their presence probabilities in DBpedia titles.
CONCLUSIONS: The obtained results help Allie users locate the correct LFs. Because the methods are computationally simple and perform well, and because the most frequently used LFs in MEDLINE appear more often in DBpedia titles, we can continually and reasonably update the linked dataset to reflect the latest publications and additions to DBpedia. Joining LFs between scientific literature and DBpedia enables cross-resource exploration for mutual benefit.
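
The key collision methods mentioned in this abstract can be sketched as follows. This is a minimal Python illustration that assumes the commonly described fingerprint recipe (lowercase, strip punctuation, fold to ASCII, then sort and de-duplicate tokens or character n-grams); the authors' exact normalisation steps and their UMLS-based term normalisation are not reproduced here, and the helper names are hypothetical.

```python
import re
import string
import unicodedata
from collections import defaultdict

_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")


def _ascii_fold(text: str) -> str:
    """Drop diacritics by decomposing characters and discarding non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint(text: str) -> str:
    """Token fingerprint: normalise, then sort and de-duplicate whitespace-separated tokens."""
    cleaned = _PUNCT.sub("", _ascii_fold(text.strip().lower()))
    return " ".join(sorted(set(cleaned.split())))


def ngram_fingerprint(text: str, n: int = 2) -> str:
    """N-gram fingerprint: normalise, remove all whitespace, sort and de-duplicate n-grams."""
    cleaned = _PUNCT.sub("", _ascii_fold(text.lower()))
    cleaned = "".join(cleaned.split())
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def link_by_key(long_forms, titles, key=fingerprint):
    """Link each long form to the titles that share its key (a 'key collision')."""
    index = defaultdict(list)
    for title in titles:
        index[key(title)].append(title)
    return {lf: index[key(lf)] for lf in long_forms if key(lf) in index}
```

Because each string is reduced to a key once and matches are found by dictionary lookup, the cost grows roughly linearly with the number of strings rather than with the number of string pairs, which fits the abstract's requirement of matching large numbers of strings within a reasonable time.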
