exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies.
Yang, Tiancheng; Sucholutsky, Ilia; Jen, Kuang-Yu; Schonlau, Matthias.
Affiliations
  • Yang T; Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
  • Sucholutsky I; Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
  • Jen KY; Department of Pathology and Laboratory Medicine, University of California, Davis, Sacramento, CA, United States of America.
  • Schonlau M; Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
PeerJ Comput Sci; 10: e1888, 2024.
Article in English | MEDLINE | ID: mdl-38435545
ABSTRACT

Background:

Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts are often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they generally require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.

Objective:

To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) "What kind of rejection does the patient show?" and (2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?"
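
As a concrete illustration, here is a minimal sketch of how such questions could be posed to an extractive question-answering model via the HuggingFace pipeline API. The checkpoint name and the sample report text are hypothetical; the paper's fine-tuned models are not identified in this abstract.

    from transformers import pipeline

    # Hypothetical checkpoint identifier -- the paper's fine-tuned models are
    # not published under this name.
    qa = pipeline("question-answering", model="exKidneyBERT-qa")

    # Invented sample report text, for illustration only.
    report = ("Renal allograft biopsy shows acute antibody-mediated rejection. "
              "Interstitial fibrosis and tubular atrophy involve about 15% of "
              "the cortex (grade I).")

    questions = [
        "What kind of rejection does the patient show?",
        "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?",
    ]

    # An extractive QA head returns a span of the context as the answer.
    for question in questions:
        answer = qa(question=question, context=report)
        print(question, "->", answer["answer"])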

Methods:

Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports (1.5M words). exKidneyBERT was then developed by extending Clinical BERT's tokenizer with six technical keywords, thereby extending the model's vocabulary, and repeating the pre-training procedure. All three models were fine-tuned with information retrieval heads.
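
A minimal sketch of the vocabulary-extension step, assuming the HuggingFace transformers API and the public Bio_ClinicalBERT checkpoint as a stand-in base model. The six keywords shown are illustrative terms drawn from this abstract; the actual keywords added in the paper are not listed here.

    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Public Clinical BERT checkpoint (assumed stand-in for the base model).
    base = "emilyalsentzer/Bio_ClinicalBERT"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Illustrative renal-pathology keywords; the paper adds six technical
    # keywords but this abstract does not name them.
    new_tokens = ["ABMR", "TCMR", "IFTA", "glomerulitis", "tubulitis", "arteritis"]

    # Register the new tokens and grow the embedding matrix so each one gets
    # its own (randomly initialized) vector, to be learned when pre-training
    # is repeated on the report corpus.
    num_added = tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))

    # Previously these terms were split into subword pieces; each is now a
    # single token, e.g.:
    print(tokenizer.tokenize("Moderate IFTA with features of TCMR."))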

Results:

The model with the extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell-mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.
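
For reference, a sketch of the two reported metrics under a common reading. The abstract does not define them, so the token-level overlap ratio below is an assumption.

    def exact_match(pred: str, gold: str) -> bool:
        # Exact string match after case/whitespace normalization (assumed).
        return pred.strip().lower() == gold.strip().lower()

    def overlap_ratio(pred: str, gold: str) -> float:
        # Assumed definition: fraction of gold-answer tokens that also appear
        # in the predicted answer span.
        pred_tokens = pred.lower().split()
        gold_tokens = gold.lower().split()
        if not gold_tokens:
            return 0.0
        shared = sum(min(pred_tokens.count(t), gold_tokens.count(t))
                     for t in set(gold_tokens))
        return shared / len(gold_tokens)

    print(overlap_ratio("mild antibody-mediated rejection",
                        "antibody-mediated rejection"))  # 1.0
    print(exact_match("grade I", "Grade I"))             # True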

Conclusion:

ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on small, specialized domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary is essential for improving performance in specialized domains, especially when pre-training on small corpora.
Full text: 1 Collections: 01-international Database: MEDLINE Language: English Journal: PeerJ Comput Sci Publication year: 2024 Document type: Article
