Search | VHL Regional Portal

Linguistic scope-based and biological event-based speculation and negation annotations in the BioScope and Genia Event corpora.

Vincze, Veronika; Szarvas, György; Móra, György; Ohta, Tomoko; Farkas, Richárd.

J Biomed Semantics ; 2 Suppl 5: S8, 2011 Oct 06.

Article in English | MEDLINE | ID: mdl-22166355

ABSTRACT

BACKGROUND: The treatment of negation and hedging in natural language processing has received much interest recently, especially in the biomedical domain. However, few open-access corpora annotated for negation and/or speculation are available for training and testing applications, and those that exist sometimes follow different design principles. In this paper, the annotation principles of the two largest corpora containing annotation for negation and speculation - BioScope and Genia Event - are compared. BioScope marks linguistic cues and their scopes for negation and hedging, while in Genia biological events are marked for uncertainty and/or negation. RESULTS: Differences among the annotations of the two corpora are thematically categorized and the frequency of each category is estimated. We found that most differences arise because scopes, which cover text spans, attach to the key events, so every argument of these events (including events within events) falls under the scope as well. In contrast, Genia deals with the modality of events within events independently. CONCLUSIONS: The analysis of multiple layers of annotation (linguistic scopes and biological events) showed that the detection of negation/hedge keywords and their scopes can contribute to determining the modality of key events (denoted by the main predicate). On the other hand, detecting the negation and speculation status of events within events requires additional syntax-based rules that investigate the dependency path between the modality cue and the event cue.
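
The contrast between scope-based and event-based annotation described in this abstract can be illustrated with a minimal sketch. The sentence, the `Span` class, and the helper names below are hypothetical and are not taken from BioScope, Genia Event, or the authors' tooling; the sketch only assumes that a scope is a character span and that event triggers are character spans inside the same sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset

    def contains(self, other: "Span") -> bool:
        """True if `other` lies entirely inside this span."""
        return self.start <= other.start and other.end <= self.end


def find(text: str, phrase: str) -> Span:
    """Locate the first occurrence of `phrase` in `text` as a Span."""
    start = text.index(phrase)
    return Span(start, start + len(phrase))


def events_in_scope(scope: Span, triggers: dict[str, Span]) -> list[str]:
    """Names of event triggers whose spans fall inside the scope."""
    return [name for name, span in triggers.items() if scope.contains(span)]


# Invented example sentence (an assumption, not corpus data).
sentence = "IL-2 does not induce expression of CD25."
cue = find(sentence, "not")
scope = find(sentence, "not induce expression of CD25")

# Two event triggers: a key event and an event nested as its argument.
triggers = {
    "induce": find(sentence, "induce"),
    "expression": find(sentence, "expression"),
}

in_scope = events_in_scope(scope, triggers)
print(in_scope)
```

Under a purely span-based reading, both triggers fall inside the negation scope. An event-based annotation could instead mark only the key event as negated and judge the nested event separately, which is the kind of mismatch the abstract reports.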

Assessment of NER solutions against the first and second CALBC Silver Standard Corpus.

Rebholz-Schuhmann, Dietrich; Jimeno Yepes, Antonio; Li, Chen; Kafkas, Senay; Lewin, Ian; Kang, Ning; Corbett, Peter; Milward, David; Buyko, Ekaterina; Beisswanger, Elena; Hornbostel, Kerstin; Kouznetsov, Alexandre; Witte, René; Laurila, Jonas B; Baker, Christopher Jo; Kuo, Cheng-Ju; Clematide, Simone; Rinaldi, Fabio; Farkas, Richárd; Móra, György; Hara, Kazuo; Furlong, Laura I; Rautschka, Michael; Neves, Mariana Lara; Pascual-Montano, Alberto; Wei, Qi; Collier, Nigel; Chowdhury, Md Faisal Mahbub; Lavelli, Alberto; Berlanga, Rafael; Morante, Roser; Van Asch, Vincent; Daelemans, Walter; Marina, José Luís; van Mulligen, Erik; Kors, Jan; Hahn, Udo.

J Biomed Semantics ; 2 Suppl 5: S11, 2011 Oct 06.

Article in English | MEDLINE | ID: mdl-22166494

ABSTRACT

BACKGROUND: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions. RESULTS: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE in comparison to the identification of entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants who had not used the annotated data set from the SSC-I for training purposes were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performances of the participants' solutions were again measured against the SSC-II. The performances of the annotation solutions again showed better results for DISO and SPE in comparison to CHED and PRGE. CONCLUSIONS: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' annotation solutions in comparison to the SSC-I.
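
As a rough illustration of how an F-measure against a reference corpus can be computed, the sketch below scores a set of predicted annotations against a set of reference annotations. The abstract does not state the matching criteria used in the challenge, so exact matching of `(start, end, semantic_group)` tuples is an assumption here, and the sample annotations are invented for illustration.

```python
def f_measure(predicted: set, reference: set) -> float:
    """Balanced F-measure (harmonic mean of precision and recall).

    Assumes exact matching: an annotation counts as correct only if
    the identical tuple appears in the reference set.
    """
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented annotations: (start offset, end offset, semantic group).
reference = {(0, 7, "CHED"), (12, 17, "PRGE"), (30, 38, "DISO"), (45, 50, "SPE")}
predicted = {(0, 7, "CHED"), (12, 17, "PRGE"), (30, 38, "DISO"), (60, 64, "SPE")}

score = f_measure(predicted, reference)
print(f"F-measure: {score:.2f}")
```

Here three of four predictions match three of four reference annotations, so precision and recall are both 0.75 and the F-measure is 0.75. A more lenient scheme, such as partial boundary overlap, would give different scores.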

The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes.

Vincze, Veronika; Szarvas, György; Farkas, Richárd; Móra, György; Csirik, János.

BMC Bioinformatics ; 9 Suppl 11: S9, 2008 Nov 19.

Article in English | MEDLINE | ID: mdl-19025695

ABSTRACT

BACKGROUND: Detecting uncertain and negative assertions is essential in most BioMedical Text Mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). RESULTS: The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist (also responsible for setting up the annotation guidelines) who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation, and over 10% of them actually contain one or more linguistic annotations suggesting negation or uncertainty. CONCLUSION: Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.

Subject(s)

Abstracting and Indexing/methods, Bibliographic Databases, Information Storage and Retrieval/methods, Artificial Intelligence, Database Management Systems, Natural Language Processing, Controlled Vocabulary
