ABSTRACT
Corpora are one of the most valuable resources at present for building machine learning systems. However, building new corpora is an expensive task, which makes the automatic extension of corpora a highly attractive task to develop. Hence, finding new strategies that reduce the cost and effort involved in this task, while at the same time guaranteeing quality, remains an open and important challenge for the research community. In this paper, we present a set of ensembling strategies oriented toward entity and relation extraction tasks. The main goal is to combine several automatically annotated versions of corpora to produce a single version with improved quality. An ensembler is built by exploring a configuration space in search of the combination that maximizes the fitness of the ensembled collection according to a reference collection. The eHealth-KD 2019 challenge was chosen for the case study. The submitted systems' outputs were ensembled, resulting in the construction of an automatically annotated collection of 8000 sentences. We show that using this collection as additional training input for a baseline algorithm has a positive impact on its performance. Additionally, the ensembling pipeline was used as a participant system in the 2020 edition of the challenge. The ensembled run achieved a slightly better performance than the individual runs.
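The abstract does not spell out the configuration space or the fitness function, so the following is only a minimal sketch of the general idea, under stated assumptions: annotations are modeled as hashable tuples, the only configuration parameter is a vote threshold, and fitness is F1 against the reference collection. The names `f1_score`, `ensemble`, and `search_best_config`, and the toy data, are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1_score(predicted, gold):
    """F1 between two sets of annotations (0.0 when there is no overlap)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def ensemble(submissions, min_votes):
    """Keep every annotation proposed by at least `min_votes` submissions."""
    votes = Counter(ann for sub in submissions for ann in set(sub))
    return {ann for ann, count in votes.items() if count >= min_votes}


def search_best_config(submissions, reference):
    """Try every vote threshold and return the one with the highest F1.

    Assumption: the configuration space is just the threshold 1..N.
    A richer space (per-system weights, per-label rules) would need a
    different search procedure.
    """
    best_k = max(
        range(1, len(submissions) + 1),
        key=lambda k: f1_score(ensemble(submissions, k), reference),
    )
    return best_k, f1_score(ensemble(submissions, best_k), reference)


# Toy data: (sentence_id, start, end, label) tuples from three hypothetical systems.
system_a = {(0, 0, 4, "Concept"), (0, 10, 15, "Action"), (0, 20, 25, "Concept")}
system_b = {(0, 0, 4, "Concept"), (0, 10, 15, "Action"), (0, 30, 35, "Concept")}
system_c = {(0, 0, 4, "Concept"), (0, 40, 45, "Action")}
reference = {(0, 0, 4, "Concept"), (0, 10, 15, "Action")}

best_k, best_f1 = search_best_config([system_a, system_b, system_c], reference)
print(best_k, best_f1)
```

Once a threshold has been selected on the reference collection, the same `ensemble` call could be applied to unlabeled sentences annotated by the same systems, which is the kind of step that would yield an automatically annotated collection.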
Subjects
Knowledge Discovery, Telemedicine, Algorithms, Humans, Language, Machine Learning, Natural Language Processing
ABSTRACT
The massive amount of biomedical information published online requires the development of automatic knowledge discovery technologies to make effective use of this content. To foster and support this, the research community creates linguistic resources, such as annotated corpora, and designs shared evaluation campaigns and academic competitive challenges. This work describes an ecosystem that facilitates research and development in knowledge discovery in the biomedical domain, specifically in the Spanish language. To this end, several resources are developed and shared with the research community, including a novel semantic annotation model, an annotated corpus of 1045 sentences, and computational resources to build and evaluate automatic knowledge discovery techniques. Furthermore, a research task is defined with objective evaluation criteria, and an online evaluation environment is set up and maintained, enabling researchers interested in this task to obtain immediate feedback and compare their results with the state of the art. As a case study, we analyze the results of a competitive challenge based on these resources and provide guidelines for future research. The constructed ecosystem provides an effective learning and evaluation environment to encourage research in knowledge discovery in Spanish biomedical documents.
Subjects
Knowledge Discovery, Telemedicine, Ecosystem, Language, Natural Language Processing, Semantics
ABSTRACT
This paper presents and describes the eHealth-KD corpus. The corpus is a collection of 1173 Spanish health-related sentences manually annotated with a general semantic structure that captures most of the content, without resorting to domain-specific labels. The semantic representation is first defined and illustrated with example sentences from the corpus. Next, the paper summarizes the process of annotation and provides key metrics of the corpus. Finally, three baseline implementations, which are supported by machine learning models, were designed to consider the complexity of learning the corpus semantics. The resulting corpus was used as an evaluation scenario in TASS 2018 (Martínez-Cámara et al., 2018), and the findings obtained by participants are discussed. The eHealth-KD corpus provides the first step in the design of a general-purpose semantic framework that can be used to extract knowledge from a variety of domains.