Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit.
Artif Intell Med
; 117: 102083, 2021 07.
Article | En | MEDLINE | ID: mdl-34127232
ABSTRACT
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
Keywords
Full text:
1
Database:
MEDLINE
Main subject:
Natural Language Processing
/
Systematized Nomenclature of Medicine
Type of study:
Prognostic_studies
Language:
En
Journal:
Artif Intell Med
Journal subject:
MEDICAL INFORMATICS
Year:
2021
Document type:
Article
Affiliation country:
United Kingdom