Improving rare disease classification using imperfect knowledge graph.

Li, Xuedong; Wang, Yue; Wang, Dongwu; Yuan, Walter; Peng, Dezhong; Mei, Qiaozhu

Affiliation

Li X; College of Computer Science, Sichuan University, Chengdu, China.
Wang Y; School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States.
Wang D; MobLab Inc., Pasadena, CA, United States.
Yuan W; MobLab Inc., Pasadena, CA, United States.
Peng D; College of Computer Science, Sichuan University, Chengdu, China.
Mei Q; School of Information, University of Michigan, Ann Arbor, MI, United States. qmei@umich.edu.

BMC Med Inform Decis Mak; 19(Suppl 5): 238, 2019-12-05.

Article in English | MEDLINE | ID: mdl-31801534

ABSTRACT

BACKGROUND: Accurately recognizing rare diseases from symptom descriptions is an important task in patient triage, early risk stratification, and targeted therapies. However, by the very nature of rare diseases, the lack of historical data poses a great challenge to machine learning-based approaches. On the other hand, medical knowledge in automatically constructed knowledge graphs (KGs) has the potential to compensate for the lack of labeled training examples. This work aims to develop a rare disease classification algorithm that makes effective use of a knowledge graph, even when the graph is imperfect. METHOD: We develop a text classification algorithm that represents a document as a combination of a "bag of words" and a "bag of knowledge terms," where a "knowledge term" is a term shared between the document and the subgraph of the KG relevant to the disease classification task. We use two Chinese disease diagnosis corpora to evaluate the algorithm. The first, HaoDaiFu, contains 51,374 chief complaints categorized into 805 diseases. The second, ChinaRe, contains 86,663 patient descriptions categorized into 44 disease categories. RESULTS: On the two evaluation data sets, the proposed algorithm delivers robust performance and outperforms a wide range of baselines, including resampling, deep learning, and feature selection approaches. Both a classification-based metric (macro-averaged F1 score) and a ranking-based metric (mean reciprocal rank) are used in the evaluation. CONCLUSION: Medical knowledge in large-scale knowledge graphs can be effectively leveraged to improve rare disease classification models, even when the knowledge graph is incomplete.
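
The representation and the two metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two bags are combined, so the prefixed concatenation below is an assumption, as are all function names, the English toy tokens, and the toy set of knowledge-graph terms. Macro-averaged F1 is computed here over the union of true and predicted labels, which is one common convention.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count every token in the document."""
    return Counter(tokens)


def bag_of_knowledge_terms(tokens, kg_terms):
    """Count only tokens that also occur in the task-relevant KG subgraph."""
    return Counter(t for t in tokens if t in kg_terms)


def combined_features(tokens, kg_terms):
    """Merge both bags into one sparse feature dict.

    Assumption: the bags are concatenated, with prefixes keeping
    word features ("w:") and knowledge-term features ("k:") apart.
    """
    features = {f"w:{t}": c for t, c in bag_of_words(tokens).items()}
    for term, count in bag_of_knowledge_terms(tokens, kg_terms).items():
        features[f"k:{term}"] = count
    return features


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_reciprocal_rank(y_true, rankings):
    """Average of 1/rank of the true label; 0 if the label is not ranked."""
    total = 0.0
    for truth, ranked in zip(y_true, rankings):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(y_true)


# Toy example (hypothetical data, not from either corpus).
doc_tokens = ["fever", "joint", "pain", "fever"]
kg_subgraph_terms = {"fever", "rash"}
features = combined_features(doc_tokens, kg_subgraph_terms)
```

In this toy document only "fever" is shared with the knowledge-graph subgraph, so it is counted once as a word feature and once as a knowledge-term feature, while "joint" and "pain" appear only as word features. A term missing from an incomplete graph simply contributes no knowledge-term feature; the word features remain available to the classifier.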

Subjects

Machine Learning; Rare Diseases/classification; Algorithms; Humans; Pattern Recognition, Automated; Triage

Keywords

Extremely imbalanced data; Knowledge graph; Machine learning; Rare disease diagnosis; Text classification

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Rare Diseases / Machine Learning Study type: Prognostic_studies Limit: Humans Language: En Publication year: 2019 Document type: Article
