Toward structuring real-world data: Deep learning for extracting oncology information from clinical text with patient-level supervision.

Preston, Sam; Wei, Mu; Rao, Rajesh; Tinn, Robert; Usuyama, Naoto; Lucas, Michael; Gu, Yu; Weerasinghe, Roshanthi; Lee, Soohee; Piening, Brian; Tittel, Paul; Valluri, Naveen; Naumann, Tristan; Bifulco, Carlo; Poon, Hoifung

Affiliations

Preston S; Microsoft Research, Redmond, WA, USA.
Wei M; Microsoft Research, Redmond, WA, USA.
Rao R; Microsoft Research, Redmond, WA, USA.
Tinn R; Microsoft Research, Redmond, WA, USA.
Usuyama N; Microsoft Research, Redmond, WA, USA.
Lucas M; Microsoft Research, Redmond, WA, USA.
Gu Y; Microsoft Research, Redmond, WA, USA.
Weerasinghe R; Providence St Joseph's Health, Portland, OR, USA.
Lee S; Providence St Joseph's Health, Portland, OR, USA.
Piening B; Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA.
Tittel P; Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA.
Valluri N; Microsoft Research, Redmond, WA, USA.
Naumann T; Microsoft Research, Redmond, WA, USA.
Bifulco C; Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA.
Poon H; Microsoft Research, Redmond, WA, USA.

Patterns (N Y) ; 4(4): 100726, 2023 Apr 14.

Article in English | MEDLINE | ID: mdl-37123439

ABSTRACT

Most detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time-consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep-learning methods attain test area under the receiver operating characteristic curve (AUROC) values of 94%-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Ablation results demonstrate the superiority of these advanced deep-learning methods. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels.

Keywords

E01.789.625; H02.403.429.515; L01.224.050.375.580; L01.313.500.750.280.199; data mining; medical oncology; natural language processing; neoplasm staging

Full text: 1 | Collections: 01-internacional | Topics: General | Database: MEDLINE | Study type: Guideline | Language: En | Journal: Patterns (N Y) | Publication year: 2023 | Document type: Article | Country of affiliation: United States
