Annotation and initial evaluation of a large annotated German oncological corpus.
Kittner, Madeleine; Lamping, Mario; Rieke, Damian T; Götze, Julian; Bajwa, Bariya; Jelas, Ivan; Rüter, Gina; Hautow, Hanjo; Sänger, Mario; Habibi, Maryam; Zettwitz, Marit; de Bortoli, Till; Ostermann, Leonie; Seva, Jurica; Starlinger, Johannes; Kohlbacher, Oliver; Malek, Nisar P; Keilholz, Ulrich; Leser, Ulf.
Affiliation
  • Kittner M; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • Lamping M; Department of Hematology, Oncology and Cancer Immunology, Campus Benjamin Franklin, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Rieke DT; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Götze J; Department of Hematology, Oncology and Cancer Immunology, Campus Benjamin Franklin, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Bajwa B; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Jelas I; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
  • Rüter G; Innere Medizin I, Universitätsklinikum Tübingen, Tübingen, Germany.
  • Hautow H; Innere Medizin I, Universitätsklinikum Tübingen, Tübingen, Germany.
  • Sänger M; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Habibi M; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Zettwitz M; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • de Bortoli T; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • Ostermann L; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • Seva J; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Starlinger J; Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
  • Kohlbacher O; Innere Medizin I, Universitätsklinikum Tübingen, Tübingen, Germany.
  • Malek NP; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • Keilholz U; Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
  • Leser U; Institut für Translationale Bioinformatik, Universitätsklinikum Tübingen, Tübingen, Germany.
JAMIA Open ; 4(2): ooab025, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33898938
ABSTRACT

OBJECTIVE:

We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction (IE) from German medical texts.

MATERIALS AND METHODS:

BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research.

RESULTS:

The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences and keep 25% as a held-out dataset for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach, depending on the entity type, F1-scores of 0.72-0.90 for named entity recognition, 0.10-0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection.
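As a rough illustration of the release scheme and the metric reported above, the following Python sketch shuffles a list of sentences, splits it 75%/25%, and computes an F1-score from raw counts. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `shuffle_and_split` and `f1_score`, the fixed random seed, and the sentence-level split are all hypothetical choices made for this example.

```python
import random


def shuffle_and_split(sentences, release_fraction=0.75, seed=0):
    """Shuffle sentences and split them into a released and a held-out part.

    Assumption: the split is done at sentence level with a fixed seed;
    the actual BRONCO split procedure is not described in this abstract.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * release_fraction)
    return shuffled[:cut], shuffled[cut:]


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0  # avoids division by zero when nothing was found
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Toy usage with placeholder sentences (not corpus data).
toy_sentences = [f"sentence {i}" for i in range(100)]
released, held_out = shuffle_and_split(toy_sentences)
toy_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Shuffling at sentence level removes document context, which is what allows a corpus of this kind to be shared; evaluating only on the held-out part keeps reported scores comparable across tools.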

DISCUSSION:

Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important.

CONCLUSION:

To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.

Full text: 1 Collection: 01-internacional Database: MEDLINE Study type: Qualitative_research Language: En Journal: JAMIA Open Year: 2021 Document type: Article Affiliation country: Germany