Similarity corpus on microbial transcriptional regulation.
Lithgow-Serrano, Oscar; Gama-Castro, Socorro; Ishida-Gutiérrez, Cecilia; Mejía-Almonte, Citlalli; Tierrafría, Víctor H; Martínez-Luna, Sara; Santos-Zavaleta, Alberto; Velázquez-Ramírez, David; Collado-Vides, Julio.
Affiliation
  • Lithgow-Serrano O; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México. olithgow@ccg.unam.mx.
  • Gama-Castro S; Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM), Mexico City, México. olithgow@ccg.unam.mx.
  • Ishida-Gutiérrez C; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Mejía-Almonte C; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Tierrafría VH; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Martínez-Luna S; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Santos-Zavaleta A; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Velázquez-Ramírez D; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
  • Collado-Vides J; Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México.
J Biomed Semantics; 10(1): 8, 2019 05 22.
Article in En | MEDLINE | ID: mdl-31118102
ABSTRACT

BACKGROUND:

The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource.

RESULTS:

Because we aim to apply such approaches to the curation of the biomedical literature, specifically the literature on gene regulation in microbial organisms, we built a corpus with graded textual similarity that was evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied: each pair of sentences was evaluated by 3 annotators from a total of 7. The scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions, with clear improvements in the agreed guidelines and interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed.
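
As an illustration of what a non-fully crossed design looks like in practice, the following minimal sketch assigns 3 of 7 annotators to each sentence pair by cycling through every possible trio. The rotation scheme, the annotator labels, the pair identifiers, and the function name assign_annotators are all assumptions made for illustration; the abstract does not state how the authors allocated pairs to annotators.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_annotators(pair_ids, annotators, k=3):
    """Give each sentence pair k annotators by cycling through all
    k-subsets of the annotator pool (hypothetical allocation scheme)."""
    subsets = cycle(combinations(annotators, k))
    return {pair_id: next(subsets) for pair_id in pair_ids}


# Hypothetical labels: 7 annotators and 35 sentence pairs.
annotators = [f"A{i}" for i in range(1, 8)]
pairs = [f"pair_{i}" for i in range(35)]

assignment = assign_annotators(pairs, annotators)

# Count how many pairs each annotator rates under this scheme.
workload = Counter(a for trio in assignment.values() for a in trio)
```

With 7 annotators there are 35 distinct trios, so with 35 pairs every trio is used once and the workload is the same for each annotator. No annotator rates every pair, which is what makes the design non-fully crossed.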

CONCLUSIONS:

To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Genetic Transcription / Natural Language Processing / Gene Expression Regulation / Microbiology Language: En Journal: J Biomed Semantics Year of publication: 2019 Document type: Article
