ABSTRACT
BACKGROUND: Analyzing the unstructured textual data contained in electronic health records (EHRs) has always been a challenging task. Word embedding methods have become an essential foundation for neural network-based approaches in natural language processing (NLP): they learn dense, low-dimensional word representations from large unlabeled corpora that capture the implicit semantics of words. Models such as Word2Vec, GloVe, and FastText have been broadly applied and reviewed in the bioinformatics and healthcare fields, most often to embed clinical notes or activity and diagnostic codes. Visualization of the learned embeddings has been used in a subset of these works, whether for exploratory or evaluation purposes. However, visualization practices tend to be heterogeneous and lack overall guidelines. OBJECTIVE: This scoping review aims to describe the methods and strategies used to visualize medical concepts represented with word embedding methods. We aim to understand the objectives of the visualizations and their limitations. METHODS: This scoping review summarizes the different methods used to visualize word embeddings in healthcare. We followed the methodology proposed by Arksey and O'Malley (Int J Soc Res Methodol 8:19-32, 2005) and by Levac et al. (Implement Sci 5:69, 2010) to analyze the data and provide a synthesis of the literature on the matter. RESULTS: We first obtained 471 unique articles from a search conducted in the PubMed, medRxiv, and arXiv databases. Of these, 30 were selected for full-text review based on our inclusion and exclusion criteria; 23 were excluded at the full review stage, leaving 7 papers that fully met our inclusion criteria. The included papers pursued a variety of objectives and used distinct methods to evaluate their embeddings and to visualize them.
Visualization also served heterogeneous purposes, being variously used to explore the embeddings, to evaluate them, or merely to illustrate properties that were otherwise formally assessed. CONCLUSIONS: Visualization helps to explore embedding results (through further dimensionality reduction and summary representations). However, it neither exhausts the information conveyed by the embeddings nor constitutes a standalone method for evaluating their pertinence.
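To make the visualization pipeline discussed above concrete, here is a minimal sketch of projecting word embeddings into two dimensions with PCA via a plain NumPy SVD. The concept names and vectors are invented for the example, and published works often use t-SNE or UMAP instead of PCA; this is an illustration, not any reviewed paper's method.

```python
import numpy as np

def pca_2d(embeddings):
    """Project word vectors onto their first two principal components."""
    words = list(embeddings)
    X = np.array([embeddings[w] for w in words], dtype=float)
    X -= X.mean(axis=0)                      # center each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                    # 2-D coordinates, ready to plot
    return dict(zip(words, coords))

# Toy 4-dimensional "embeddings" for a few clinical concepts (hypothetical values)
vecs = {
    "insulin":   [0.9, 0.1, 0.2, 0.0],
    "metformin": [0.8, 0.2, 0.1, 0.1],
    "aspirin":   [0.1, 0.9, 0.0, 0.2],
}
coords = pca_2d(vecs)
```

In a scatter plot of `coords`, semantically related concepts (here the two antidiabetic drugs) land closer together than unrelated ones, which is exactly the kind of qualitative pattern the reviewed papers inspect visually.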
Subject(s)
Natural Language Processing, Semantics, Databases, Factual, Electronic Health Records, Humans, PubMed
ABSTRACT
BACKGROUND: Often missing from or uncertain in a biomedical data warehouse (BDW), vital status after discharge is central to the value of a BDW in medical research. The French National Mortality Database (FNMD) offers open-source nominative records of every death. Matching large-scale BDW records with the FNMD poses multiple challenges: the absence of unique common identifiers between the 2 databases, names changing over life, clerical errors, and the exponential growth of the number of comparisons to compute. OBJECTIVE: We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance. METHODS: We developed a deterministic algorithm based on advanced data cleaning, knowledge of the naming system, and the Damerau-Levenshtein distance (DLD). The algorithm's performance was independently assessed using BDW data from 3 university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated with patients alive on January 1, 2016 (ie, patients with at least 1 hospital encounter before and after this date). Sensitivity was evaluated with patients recorded as deceased between January 1, 2001, and December 31, 2020. The DLD-based algorithm was compared to a direct matching algorithm with minimal data cleaning as a reference. RESULTS: All centers combined, sensitivity was 11% higher for the DLD-based algorithm (93.3%, 95% CI 92.8-93.9) than for the direct algorithm (82.7%, 95% CI 81.8-83.6; P<.001). Sensitivity was superior for men at 2 centers (Nantes: 87%, 95% CI 85.1-89 vs 83.6%, 95% CI 81.4-85.8; P=.006; Rennes: 98.6%, 95% CI 98.1-99.2 vs 96%, 95% CI 94.9-97.1; P<.001) and for patients born in France at all centers (Nantes: 85.8%, 95% CI 84.3-87.3 vs 74.9%, 95% CI 72.8-77.0; P<.001). The DLD-based algorithm revealed significant differences in sensitivity among centers (Nantes, 85.3% vs Lille and Rennes, 97.3%; P<.001). Specificity was >98% in all subgroups.
Our algorithm matched tens of millions of death records with BDW records, with parallel computing capabilities and low RAM requirements; matching was performed using the open-source Inseehop R script. CONCLUSIONS: Overall, sensitivity/recall was 11% higher with the DLD-based algorithm than with the direct algorithm. This demonstrates the importance of advanced data cleaning and of exploiting knowledge of the naming system through the DLD. Statistically significant differences in sensitivity between groups were found and must be considered when performing an analysis to avoid differential biases. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used to match any large-scale databases. While matching operations using names are considered sensitive computational operations, the Inseehop package released here is easy to run on premises, thereby facilitating compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an insightful example of combining open-source external data to improve the usage value of BDWs.
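The Damerau-Levenshtein distance at the core of the algorithm can be sketched as follows (the optimal-string-alignment variant, written in Python purely for illustration; the actual Inseehop implementation is an R script and also relies on the extensive data cleaning described above, which is not shown here):

```python
def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant):
    counts insertions, deletions, substitutions, and adjacent transpositions."""
    la, lb = len(a), len(b)
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i                          # deleting i characters from a
    for j in range(lb + 1):
        d[0][j] = j                          # inserting j characters of b
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[la][lb]
```

A transposition such as "martni" for "martin" costs 1 under the DLD, whereas plain Levenshtein distance would count 2 edits, which is why the DLD is well suited to clerical errors in names.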
ABSTRACT
OBJECTIVE: Current data regarding the impact of diabetes mellitus (DM) on cardiovascular mortality in patients with aortic stenosis (AS) are restricted to severe AS or aortic valve replacement (AVR) trials. We aimed to investigate cardiovascular mortality according to DM across the entire spectrum of outpatients with AS. METHODS: Between May 2016 and December 2017, patients with mild (peak aortic velocity 2.5-2.9 m/s), moderate (3-3.9 m/s), and severe (≥4 m/s) AS graded by echocardiography were included during outpatient cardiology visits in the Nord-Pas-de-Calais region in France and followed up for modes of death between May 2018 and August 2020. RESULTS: Among 2703 patients, 820 (30.3%) had DM; mean age was 76±10.8 years, 46.6% were women, and the prevalence of underlying cardiovascular diseases was relatively high. There were 200 cardiovascular deaths prior to AVR during the 2.1-year (IQR 1.4-2.7) follow-up period. In adjusted analyses, DM was significantly associated with cardiovascular mortality (HR=1.40, 95% CI 1.04 to 1.89; p=0.029). In mild or moderate AS, the cardiovascular mortality of patients with diabetes was similar to that of patients without diabetes. In severe AS, DM was associated with higher cardiovascular mortality (HR=2.65, 95% CI 1.50 to 4.68; p=0.001). This was almost exclusively related to a higher risk of death from heart failure (HR=2.61, 95% CI 1.15 to 5.92; p=0.022) and sudden death (HR=3.33, 95% CI 1.28 to 8.67; p=0.014). CONCLUSION: The effect of DM on cardiovascular mortality varied across AS severity. Although DM was not associated with outcomes in patients with mild/moderate AS, it was strongly associated with death from heart failure and sudden death in patients with severe AS.
Subject(s)
Aortic Valve Stenosis, Diabetes Mellitus, Heart Failure, Heart Valve Prosthesis Implantation, Humans, Female, Aged, Aged, 80 and over, Aortic Valve Stenosis/complications, Aortic Valve Stenosis/diagnostic imaging, Aortic Valve Stenosis/surgery, Aortic Valve/surgery, Diabetes Mellitus/epidemiology, Heart Failure/surgery, Death, Sudden, Severity of Illness Index, Treatment Outcome
ABSTRACT
Autoantibodies (Aabs) are frequent in systemic sclerosis (SSc). Although recognized as potent biomarkers, their pathogenic role is debated. This study explored the effect of purified immunoglobulin G (IgG) from SSc patients on the protein and mRNA expression of dermal fibroblasts (FBs) using an innovative multi-omics approach. Dermal FBs were cultured in the presence of sera or purified IgG from patients with diffuse cutaneous SSc (dcSSc), limited cutaneous SSc, or healthy controls (HCs). The FB proteome and transcriptome were explored using liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) and microarray assays, respectively. Proteomic analysis identified 3,310 proteins. SSc sera and purified IgG induced distinct protein profile patterns. These FB proteome changes depended on the Aab serotype, with a singular effect observed with purified IgG from anti-topoisomerase-I autoantibody (ATA)-positive patients compared to HCs or other SSc serotypes. IgG from ATA-positive SSc patients induced an enrichment in proteins involved in focal adhesion, cadherin binding, the cytosolic part, or the lytic vacuole. Multi-omics analysis was performed in two ways: first, by restricting the analysis of the transcriptomic data to differentially expressed proteins; and second, by performing a global statistical analysis integrating proteomics and transcriptomics. Transcriptomic analysis identified 764 differentially expressed genes and revealed that IgG from dcSSc patients can induce extracellular matrix (ECM) remodeling changes in FB gene expression profiles. The global statistical analysis integrating proteomics and transcriptomics confirmed that IgG from SSc patients can induce ECM remodeling and activate FB profiles. This effect depended on the serotype of the patient, suggesting that SSc Aabs might play a pathogenic role in some SSc subsets.
Subject(s)
Immunoglobulin G, Scleroderma, Systemic, Autoantibodies, Chromatography, Liquid, Fibroblasts/metabolism, Humans, Proteome/metabolism, Proteomics, Tandem Mass Spectrometry
ABSTRACT
The use of international laboratory terminologies inside hospital information systems is required to conduct data reuse analyses across inter-hospital databases. While most terminology matching techniques supporting semantic interoperability are language-based, an alternative strategy is distribution matching, which matches terms based on the statistical similarity of their value distributions. In this work, our objective is to design and assess a structured framework to perform distribution matching on concepts described by continuous variables. We propose a framework that combines distribution matching and machine learning techniques. Using a training sample consisting of correct and incorrect correspondences between different terminologies, a match probability score was built. For each term, the best candidates are returned, sorted in decreasing order of the probability given by the model. Searching for 101 terms from Lille University Hospital among the same list of concepts in MIMIC-III, the model returned the correct match in the top 5 candidates for 96 of them (95%). Using this open-source framework with a top-k suggestion system could make the expert validation of terminology alignment easier.
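The distribution-matching idea can be sketched as follows, assuming a two-sample Kolmogorov-Smirnov statistic as the similarity measure (the framework described above actually trains a machine-learning model on labeled correspondences to produce a match probability; the concept names and data below are synthetic):

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def top_k_candidates(query_values, candidates, k=5):
    """Rank candidate concepts by distributional similarity (smaller KS = better)."""
    scored = [(name, ks_stat(query_values, vals)) for name, vals in candidates.items()]
    return sorted(scored, key=lambda t: t[1])[:k]

# Synthetic lab-value distributions for two hypothetical target concepts
rng = np.random.default_rng(42)
query = rng.normal(0, 1, 500)
candidates = {
    "concept_A": rng.normal(0, 1, 500),    # same distribution as the query
    "concept_B": rng.normal(10, 2, 500),   # clearly different distribution
}
ranked = top_k_candidates(query, candidates, k=2)
```

Returning a sorted top-k list like `ranked`, rather than a single answer, is what enables the expert-validation workflow mentioned above: a reviewer only has to confirm or reject a handful of suggestions per term.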
Subject(s)
Hospital Information Systems, Laboratories, Databases, Factual, Humans, Machine Learning, Semantics
ABSTRACT
We collected user needs to define a process for setting up Federated Learning in a network of hospitals. We identified seven steps: consortium definition, architecture implementation, clinical study definition, data collection, initialization, model training, and results sharing. This process adapts certain steps from the classical centralized multicenter framework and creates new opportunities for interaction thanks to the architecture of Federated Learning algorithms. The process remains open to extension to cover a variety of scenarios.
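The model-training step of this process can be sketched as a federated-averaging round, in which each hospital trains on its own data and only model weights leave the site. This is a hypothetical NumPy illustration with a simple least-squares objective; the actual model, aggregation rule, and architecture are defined per clinical study.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One hospital's local training: gradient steps on its private data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w, len(y)

def federated_round(weights, hospitals):
    """Model-training step: aggregate locally trained weights, weighted by
    sample count. Raw patient data never leaves each hospital."""
    updates = [local_update(weights, X, y) for X, y in hospitals]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Synthetic data split across three hypothetical hospitals
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
hospitals = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    hospitals.append((X, X @ true_w))

w = np.zeros(3)
for _ in range(10):                          # repeated training rounds
    w = federated_round(w, hospitals)
```

The initialization and results-sharing steps of the process bracket this loop: the consortium agrees on the starting weights beforehand, and only the final aggregated model is shared afterwards.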
Subject(s)
Algorithms, Hospitals
ABSTRACT
Biostatistics and machine learning have been the cornerstone of a variety of recent developments in medicine. To gather large enough datasets, it is often necessary to set up multicentric studies; yet centralizing measurements can be difficult for practical, legal, or ethical reasons. As an alternative, federated learning makes it possible to leverage multiple centers' data without actually collating them. While existing works generally require a center to act as a leader and coordinate computations, we propose a fully decentralized framework in which every center plays the same role. In this paper, we apply this framework to logistic regression, including the computation of confidence intervals. We test our algorithm on two distinct clinical datasets split among different centers and show that it matches the results of the centralized framework. In addition, we discuss possible privacy leaks and potential protection mechanisms, paving the way towards further research.
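A minimal sketch of the fully decentralized idea for logistic regression, assuming gradients are exchanged all-to-all and averaged so that every center applies the identical update and none acts as leader (a NumPy-only illustration on synthetic data; the paper's actual protocol, including confidence-interval computation and privacy protections, is not reproduced here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X, y):
    """Gradient of the logistic log-loss on one center's private data."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def decentralized_fit(centers, dim, lr=0.5, n_iter=200):
    """Each center computes a local gradient; gradients are exchanged
    all-to-all and averaged (weighted by sample size), so every center
    applies the same update and no coordinator is needed."""
    w = np.zeros(dim)
    sizes = np.array([len(y) for _, y in centers])
    for _ in range(n_iter):
        grads = [local_gradient(w, X, y) for X, y in centers]
        w -= lr * np.average(grads, axis=0, weights=sizes)
    return w

# Synthetic binary outcomes split across three hypothetical centers
rng = np.random.default_rng(1)
true_w = np.array([2.0, -3.0])
centers = []
for _ in range(3):
    X = rng.normal(size=(300, 2))
    y = (rng.random(300) < sigmoid(X @ true_w)).astype(float)
    centers.append((X, y))

w_dec = decentralized_fit(centers, dim=2)
```

Because the size-weighted average of the per-center gradients equals the gradient on the pooled data, this decentralized run reproduces centralized gradient descent step for step, mirroring the claim above that the decentralized framework matches centralized results.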