Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports.

Pironet, Antoine; Poirel, Hélène A; Tambuyzer, Tim; De Schutter, Harlinde; van Walle, Lien; Mattheijssens, Joris; Henau, Kris; Van Eycken, Liesbet; Van Damme, Nancy

Affiliation

Pironet A; Belgian Cancer Registry, Brussels, Belgium.
Poirel HA; Belgian Cancer Registry, Brussels, Belgium.
Tambuyzer T; Belgian Cancer Registry, Brussels, Belgium.
De Schutter H; Belgian Cancer Registry, Brussels, Belgium.
van Walle L; Belgian Cancer Registry, Brussels, Belgium.
Mattheijssens J; Belgian Cancer Registry, Brussels, Belgium.
Henau K; Belgian Cancer Registry, Brussels, Belgium.
Van Eycken L; Belgian Cancer Registry, Brussels, Belgium.
Van Damme N; Belgian Cancer Registry, Brussels, Belgium.

Front Digit Health ; 3: 692077, 2021.

Article in En | MEDLINE | ID: mdl-34713168

ABSTRACT

As part of its core business of gathering population-based information on new cancer diagnoses, the Belgian Cancer Registry receives free-text pathology reports describing the results for (pre-)malignant specimens. These reports are provided by 82 laboratories and are written in one of two national languages, Dutch or French. For breast cancer, the reports characterize the status of the estrogen receptor, the progesterone receptor, and Erb-b2 receptor tyrosine kinase 2. These biomarkers are related to tumor growth and prognosis and are essential for defining therapeutic management. The availability of population-scale information about their status in breast cancer patients can therefore be considered crucial to enrich real-world scientific studies and to guide public health policies regarding personalized medicine. The main objective of this study is to expand the data available at the Belgian Cancer Registry by automatically extracting the status of these biomarkers from the pathology reports. Various types of numeric features are computed from over 1,300 manually annotated reports linked to breast tumors diagnosed in 2014. A range of popular machine learning classifiers, such as support vector machines, random forests, and logistic regressions, are trained on these data and compared using their F1 scores on a separate validation set. On a held-out test set, the best-performing classifiers achieve F1 scores ranging from 0.89 to 0.92 for the four classification tasks. The extraction is thus reliable and makes it possible to significantly increase the availability of this valuable information on breast cancer receptor status at a population level.
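
The comparison of classifiers by F1 score on a validation set can be illustrated with a minimal sketch. It is not the authors' implementation: the labels, the predictions, and the helper names `f1_score` and `best_by_f1` are hypothetical, and the abstract does not state how F1 was averaged across classes, so the sketch assumes a single positive class.

```python
def f1_score(y_true, y_pred, positive):
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_by_f1(y_true, predictions, positive):
    """Score every classifier's predictions and return the best name plus all scores."""
    scores = {
        name: f1_score(y_true, y_pred, positive)
        for name, y_pred in predictions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical validation labels for one receptor-status task (assumption).
y_val = ["positive", "positive", "negative", "positive", "negative", "negative"]

# Hypothetical predictions from two already-trained classifiers (assumption).
val_predictions = {
    "svm": ["positive", "positive", "negative", "negative", "negative", "negative"],
    "random_forest": ["positive", "positive", "positive", "positive", "negative", "negative"],
}

best_name, val_scores = best_by_f1(y_val, val_predictions, "positive")
print(best_name, val_scores)
```

Only the classifier selected on the validation set would then be scored once on the held-out test set, which keeps the reported test score independent of the model selection step.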

Keywords

breast cancer; machine learning; natural language processing; pathology; receptor status

Full text: 1 | Database: MEDLINE | Study type: Prognostic_studies | Language: En | Publication year: 2021 | Document type: Article
