Accurate pattern-based extraction of complex Gleason score expressions from pathology reports.

Miettinen, Joonas; Tanskanen, Tomas; Degerlund, Henna; Nevala, Aapeli; Malila, Nea; Pitkäniemi, Janne

Affiliation

Miettinen J; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland. Electronic address: joonas.miettinen@cancer.fi.
Tanskanen T; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland.
Degerlund H; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland.
Nevala A; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland.
Malila N; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland.
Pitkäniemi J; Finnish Cancer Registry, Institute for Statistical and Epidemiological Cancer Research, Helsinki, Finland; Department of Public Health, University of Helsinki, Finland; School of Health Sciences, University of Tampere, Finland.

J Biomed Inform; 120: 103850, 2021 Aug.

Article in English | MEDLINE | ID: mdl-34182148

ABSTRACT

PURPOSE:

The Gleason score is an important grading factor in prostate cancer. Gleason scores can be extracted from pathology report texts using regular expressions, but previously developed programmes have targeted only relatively simple Gleason score expressions. We developed a programme that can also extract complex expressions. The programme is relatively easy to adapt to other languages and datasets.

METHODS:

We developed and evaluated our regular expression-based programme using manually processed pathology reports of prostate cancer cases diagnosed in Finland in 2016-2017. Both simple and complex Gleason score expressions were targeted. We measured the performance of our programme using recall, precision, and the F1 score. The proportion of complex Gleason score expressions was estimated as the complement of the recall obtained when only addition expressions (e.g. "Gleason 3 + 4") were targeted.
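As an illustration of the measures named above, the following minimal Python sketch computes precision, recall, and F1 from counts of true positives, false positives, and false negatives, and takes the complement of an addition-only recall. The function names and the counts are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def complex_proportion(addition_only_recall):
    """Complement of the addition-only recall: the estimated share of
    expressions that are not plain addition expressions."""
    return 1.0 - addition_only_recall


# Hypothetical counts, for illustration only (not the study's data).
p, r, f1 = precision_recall_f1(tp=90, fp=2, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

The complement estimate rests on the assumption that every expression missed by the addition-only programme is a complex one, which is why it is described as an estimate.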

RESULTS:

The detection of values (scores and score components) is based on mandatory keywords before or after the value. The programme favours precision over recall by primarily allowing for lists of optional expressions between keyword-value pairs and only secondarily allowing for arbitrary expressions. The programme is straightforward to adapt to new datasets by modifying the lists of mandatory and optional expressions. The full and addition-only programmes had 92% (95% CI [90%, 95%]) and 65% ([61%, 70%]) recall and high precision (98% [97%, 99%] and 100% [99%, 100%]), respectively. The estimated proportion of complex Gleason score expressions was 100% - 65% = 35%.
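The keyword-and-value design described above can be sketched roughly as follows. This is a simplified Python illustration, not the published programme: the word lists `KEYWORDS` and `OPTIONAL`, the pattern names, and both helper functions are assumptions made for the example, and the sketch only handles a keyword placed before the value.

```python
import re

# Hypothetical word lists; the abstract does not give the programme's real lists.
KEYWORDS = ["gleason"]                      # mandatory keyword
OPTIONAL = ["score", "grade", "sum", ":"]   # optional filler between keyword and value

keyword = "(?:" + "|".join(map(re.escape, KEYWORDS)) + ")"
optional = r"(?:\s*(?:" + "|".join(map(re.escape, OPTIONAL)) + r"))*"

# Addition expression: keyword, optional filler, "A + B", optionally "= C".
ADDITION = re.compile(
    keyword + optional
    + r"\s*(?P<a>[1-5])\s*\+\s*(?P<b>[1-5])"
    + r"(?:\s*=\s*(?P<c>10|[2-9]))?",
    re.IGNORECASE,
)

# Bare total score: keyword, optional filler, one score that is not
# the first component of an addition.
TOTAL = re.compile(
    keyword + optional + r"\s*(?P<c>10|[2-9])\b(?!\s*\+)",
    re.IGNORECASE,
)


def extract_addition(text):
    """Return (a, b, c) tuples for addition expressions such as 'Gleason 3 + 4'."""
    results = []
    for m in ADDITION.finditer(text):
        a, b = int(m.group("a")), int(m.group("b"))
        # Use the stated sum when present, otherwise compute it.
        c = int(m.group("c")) if m.group("c") else a + b
        results.append((a, b, c))
    return results


def extract_total(text):
    """Return total scores stated without components, e.g. 'Gleason score 7'."""
    return [int(m.group("c")) for m in TOTAL.finditer(text)]


print(extract_addition("Biopsy: Gleason score 4+3=7 in two cores."))
print(extract_total("Adenocarcinoma, Gleason 8."))
```

Because only listed filler words may appear between the keyword and the value, unrelated numbers elsewhere in a report are not picked up, which reflects the precision-over-recall trade-off. Adapting such a sketch to another language or dataset would mean editing the two word lists rather than the patterns themselves.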

CONCLUSIONS:

Even complex Gleason score expressions can be extracted with high recall and precision using regular expressions. We recommend implementing automated Gleason score extraction where possible by adapting our validated programme.

Subject(s)

Prostatic Neoplasms; Finland; Humans; Male; Neoplasm Grading; Research Report

Keywords

Free-form text; Gleason score; Information extraction; Natural language processing; Pathology report; Regular expression

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Prostatic Neoplasms | Limits: Humans / Male | Country/Region as subject: Europe | Language: En | Journal: J Biomed Inform | Journal subject: MEDICAL INFORMATICS | Year: 2021 | Document type: Article