Machine learning to parse breast pathology reports in Chinese.

Tang, Rong; Ouyang, Lizhi; Li, Clara; He, Yue; Griffin, Molly; Taghian, Alphonse; Smith, Barbara; Yala, Adam; Barzilay, Regina; Hughes, Kevin

Affiliation

Tang R; Division of Surgical Oncology, MGH, Boston, USA.
Ouyang L; Department of Breast Surgery, Hunan Cancer Hospital, Changsha, Hunan, China.
Li C; Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA.
He Y; Department of Breast Surgery, Hunan Cancer Hospital, Changsha, Hunan, China.
Griffin M; Division of Surgical Oncology, MGH, Boston, USA.
Taghian A; Department of Radiation Oncology, MGH, Boston, USA.
Smith B; Division of Surgical Oncology, MGH, Boston, USA.
Yala A; Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA.
Barzilay R; Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA.
Hughes K; Division of Surgical Oncology, MGH, Boston, USA.

Breast Cancer Res Treat ; 169(2): 243-250, 2018 Jun.

Article in English | MEDLINE | ID: mdl-29380208

ABSTRACT

INTRODUCTION:

Large structured databases of pathology findings are valuable for deriving new clinical insights. However, they are labor-intensive to create and generally require manual annotation. The bioinformatics community has done some work on automating this annotation with machine learning for English-language reports. Our contribution is an automated approach to constructing such structured databases from Chinese-language reports, which also sets the stage for extraction from other languages.

METHODS:

We collected 2104 de-identified Chinese benign and malignant breast pathology reports from Hunan Cancer Hospital. Physicians with native Chinese proficiency reviewed the reports and annotated a variety of binary and numerical pathologic entities. After excluding 78 cases in which a single report described bilateral lesions, 1216 cases were used as a training set for the algorithm, which was then refined on 405 development cases. The natural language processing algorithm was tested on the remaining 405 cases to evaluate the machine learning outcome. The model was used to extract 13 binary entities and 8 numerical entities.
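The partition described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (2104 collected, 78 excluded, 1216 + 405 + 405 used); the abstract does not say how cases were assigned to each set, so the seeded random shuffle, the function name `split_reports`, and the integer report IDs are all assumptions made for the example.

```python
import random

TOTAL_COLLECTED = 2104      # reports collected (from the abstract)
EXCLUDED_BILATERAL = 78     # reports excluded for bilateral lesions (from the abstract)


def split_reports(report_ids, n_train=1216, n_dev=405, n_test=405, seed=0):
    """Partition report IDs into train/dev/test sets of the stated sizes.

    Assumption: assignment is a seeded random shuffle. The abstract does not
    describe the actual assignment procedure.
    """
    ids = list(report_ids)
    if len(ids) != n_train + n_dev + n_test:
        raise ValueError("split sizes must sum to the number of eligible reports")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test


# Hypothetical integer IDs standing in for the eligible reports.
eligible_ids = range(TOTAL_COLLECTED - EXCLUDED_BILATERAL)
train_ids, dev_ids, test_ids = split_reports(eligible_ids)
print(len(train_ids), len(dev_ids), len(test_ids))
```

Keeping the development set separate from the test set means that any tuning done on the 405 development cases does not leak into the final evaluation.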

RESULTS:

Compared with annotations by physicians with native Chinese proficiency, the model achieved a per-entity accuracy of 91% to 100% for all common diagnoses on the test set. The overall accuracy was 98% for binary entities and 95% for numerical entities. In a per-report evaluation for binary entities with more than 100 training cases, 85% of all the testing reports were completely correct and 11% had an error in 1 out of 22 entities.
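The two evaluation views reported above, per-entity accuracy and per-report correctness, measure different things and can be illustrated with a minimal Python sketch. The entity names (`carcinoma`, `atypia`, `tumor_size_cm`), the toy data, and both helper functions are hypothetical and are not taken from the study; exact-match comparison of numerical values is also an assumption.

```python
from typing import Dict, List

Report = Dict[str, object]  # entity name -> annotated or predicted value


def per_entity_accuracy(gold: List[Report], pred: List[Report]) -> Dict[str, float]:
    """Fraction of reports on which each entity matches the physician label."""
    entities = gold[0].keys()
    return {
        e: sum(g[e] == p[e] for g, p in zip(gold, pred)) / len(gold)
        for e in entities
    }


def per_report_errors(gold: List[Report], pred: List[Report]) -> List[int]:
    """Number of mismatched entities in each report (0 = completely correct)."""
    return [sum(g[e] != p[e] for e in g) for g, p in zip(gold, pred)]


# Toy, made-up annotations for three reports.
gold = [
    {"carcinoma": True, "atypia": False, "tumor_size_cm": 2.5},
    {"carcinoma": False, "atypia": False, "tumor_size_cm": None},
    {"carcinoma": True, "atypia": True, "tumor_size_cm": 1.2},
]
pred = [
    {"carcinoma": True, "atypia": False, "tumor_size_cm": 2.5},
    {"carcinoma": False, "atypia": True, "tumor_size_cm": None},
    {"carcinoma": True, "atypia": True, "tumor_size_cm": 1.2},
]

accuracy = per_entity_accuracy(gold, pred)
errors = per_report_errors(gold, pred)
fully_correct = sum(n == 0 for n in errors) / len(errors)
print(accuracy, errors, fully_correct)
```

A report counts as completely correct only if every entity matches, so the per-report figure is necessarily no higher than the lowest per-entity accuracy.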

CONCLUSION:

We have demonstrated that Chinese breast pathology reports can be automatically parsed into structured data using standard machine learning approaches. Our results show that techniques effective in parsing English reports can be scaled to other languages.

Subjects

Breast Neoplasms/epidemiology; Electronic Health Records; Machine Learning; Natural Language Processing; Algorithms; Breast/pathology; Breast Neoplasms/pathology; Data Mining; Databases, Factual; Female; Humans

Keywords

Chinese; Electronic health record (EHR); Machine learning; Natural language processing (NLP); Pathology reports

Full text: 1 | Database: MEDLINE | Main subject: Natural Language Processing / Breast Neoplasms / Electronic Health Records / Machine Learning | Language: English | Year of publication: 2018 | Document type: Article
