ABSTRACT
Psychiatric electronic health records (EHRs) present a distinctive challenge for machine learning (ML) because they are unstructured, complex, and highly variable. This study aimed to identify a cohort of patients diagnosed with a psychotic disorder and posttraumatic stress disorder (PTSD), develop clinically informed guidelines for annotating these health records for instances of traumatic events in order to create a publicly available gold-standard dataset, and demonstrate that data gathered using this annotation scheme are suitable for training an ML model to identify these indicators of trauma in unseen health records. We created a representative corpus of 101 EHRs (222,033 tokens) from a centralized database, together with a detailed annotation scheme for marking information relevant to traumatic events in the clinical narratives. A team of clinical experts annotated the dataset and updated the annotation guidelines in collaboration with computational linguistics specialists. Inter-annotator agreement was high (0.688 for span tags, 0.589 for relations, and 0.874 for tag attributes). We characterize the major points relating to the annotation process for psychiatric EHRs. Additionally, we developed high-performing baseline ML models for span labeling and relation extraction to demonstrate the practical viability of the gold-standard corpus for ML applications.
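The abstract reports agreement scores without naming the metric. As an illustration only, the sketch below assumes exact-match pairwise F1 over span tags, one common choice for span annotation; the `Span` type, the function `pairwise_f1`, and the tag labels are hypothetical and do not come from the study.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, tag label). The labels used below
# are hypothetical placeholders, not the study's annotation scheme.
Span = Tuple[int, int, str]


def pairwise_f1(ann_a: Set[Span], ann_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators, treating annotator A as reference.

    Assumption: a span counts as agreed only if offsets and label all match.
    """
    if not ann_a and not ann_b:
        return 1.0  # both annotators marked nothing: trivially in agreement
    matched = len(ann_a & ann_b)
    if matched == 0:
        return 0.0
    precision = matched / len(ann_b)
    recall = matched / len(ann_a)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three spans agree for each annotator.
annotator_a = {(0, 12, "Event"), (40, 55, "Perpetrator"), (80, 95, "Event")}
annotator_b = {(0, 12, "Event"), (40, 55, "Perpetrator"), (100, 110, "Event")}
score = pairwise_f1(annotator_a, annotator_b)
```

Exact matching is strict; a relaxed variant that credits overlapping spans would give higher scores on the same annotations.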
ABSTRACT
BACKGROUND: Readmission after discharge from a hospital is disruptive and costly, regardless of the reason. It can be particularly problematic for psychiatric patients, so predicting which patients may be readmitted is critically important but also very difficult. Clinical narratives in psychiatric electronic health records (EHRs) span a wide range of topics and vocabulary; a psychiatric readmission prediction model must therefore begin with a robust and interpretable topic extraction component. RESULTS: We designed and evaluated multiple multilayer perceptron and radial basis function neural networks to predict which sentences in a patient's EHR are associated with one or more of the seven readmission risk factor domains that we identified. Compared with our baseline cosine similarity model, which is based on the methodologies of prior work, our deep learning approaches achieved considerably better F1 scores (0.83 vs 0.66) while also being more scalable and computationally efficient on large volumes of data. Additionally, we found that integrating clinically relevant multiword expressions during preprocessing improves the accuracy of our models and allows a wider scope of training data to be identified in a semi-supervised setting. CONCLUSION: We created a data pipeline that uses document vector similarity metrics to perform topic extraction on psychiatric EHR data, in service of our long-term goal of creating a readmission risk classifier. We show results for our topic extraction model and identify additional features that we will incorporate in the future.
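To make the idea of a cosine similarity baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes bag-of-words vectors, whitespace tokenization, and a fixed threshold of 0.3; the domain names `domain_a` and `domain_b` and their keyword lists are hypothetical placeholders, since the abstract does not name the seven domains.

```python
import math
from collections import Counter
from typing import Dict, List


def bow(text: str) -> Counter:
    """Bag-of-words vector from lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_domains(sentence: str,
                   domain_vectors: Dict[str, Counter],
                   threshold: float = 0.3) -> List[str]:
    """Return every domain whose vector is similar enough to the sentence.

    A sentence may match zero, one, or several domains, mirroring the
    "one or more" multi-label setting described in the abstract.
    """
    vec = bow(sentence)
    return [name for name, dvec in domain_vectors.items()
            if cosine(vec, dvec) >= threshold]


# Hypothetical domains and keywords, for illustration only.
domains = {
    "domain_a": bow("sleep appetite mood energy"),
    "domain_b": bow("housing employment income family"),
}
labels = assign_domains("patient reports poor sleep and low mood", domains)
```

A baseline of this kind compares each sentence against every domain vector, which is one reason a trained classifier can scale better as the number of sentences and domains grows.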