Results 1 - 7 of 7
1.
medRxiv ; 2023 May 21.
Article in English | MEDLINE | ID: mdl-37293026

ABSTRACT

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.

Methods: The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities, along with associated p-values, to measure the strength of relatedness between clinical features with statistical uncertainty quantification. In the final step, ARCH performs a sparse embedding regression to remove indirect linkage between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.

Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 EHR concepts, as visualized in the R-shiny powered web-API (https://celehs.hms.harvard.edu/ARCH/). The ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 and 0.861 for detecting pairs of similar EHR concepts when the concepts are mapped to codified data and to NLP data, respectively, and 0.810 (codified) and 0.843 (NLP) for detecting related pairs. Based on the p-values computed by ARCH, the sensitivities of detecting similar and related entity pairs are 0.906 and 0.888 under false discovery rate (FDR) control of 5%. For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723, which improved to 0.826 after few-shot training via minimizing the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power of detecting drug-side effect pairs was 0.15 when using codified data only, much lower than the power of 0.51 when using both codified and NLP concepts. Compared to existing large-scale representation learning methods including PubmedBERT, BioBERT, and SAPBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH-selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH-selected features but only 0.857 when using codified features selected via the KESER network [1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster AD patients into two subgroups, where the fast-progression subgroup had a much higher mortality rate.
Conclusions: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
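A minimal sketch of the embedding step the ARCH abstract describes, under illustrative assumptions: factor a concept co-occurrence matrix into low-dimensional embeddings (here via a shifted positive PMI transform and truncated SVD, common choices but not confirmed as the published recipe), then score concept pairs by cosine similarity. The p-value calibration and sparse embedding regression steps are omitted.

import numpy as np

def embed_concepts(cooc: np.ndarray, dim: int = 300, shift: float = 1.0) -> np.ndarray:
    """Derive one embedding row per EHR concept from a co-occurrence matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    # Shifted positive pointwise mutual information (illustrative assumption).
    pmi = np.log((cooc * total + 1e-12) / (row @ col + 1e-12)) - np.log(shift)
    sppmi = np.maximum(pmi, 0.0)
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    dim = min(dim, len(s))
    return u[:, :dim] * np.sqrt(s[:dim])

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, the relatedness score used throughout the abstract."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))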

2.
J Phys Condens Matter ; 34(13), 2022 Jan 26.
Article in English | MEDLINE | ID: mdl-35008070

ABSTRACT

This work describes two methods to fit the inelastic neutron-scattering spectrum S(q,ω) with wavevector q and frequency ω. The common and well-established method extracts the experimental spin-wave branches ω_n(q) from the measured spectra S(q,ω) and then minimizes the difference between the observed and predicted frequencies. When n branches of frequencies are predicted but the measured frequencies overlap to produce only m < n branches, the weighted average of the predicted frequencies must be compared to the observed frequencies. A penalty is then exacted when the width of the predicted frequencies exceeds the width of the observed frequencies. The second method directly compares the measured and predicted intensities S(q,ω) over a grid {q_i, ω_j} in wavevector and frequency space. After subtracting background noise from the observed intensities, the theoretical intensities are scaled by a simple wavevector-dependent function that reflects the instrumental resolution. The advantages and disadvantages of each approach are demonstrated by studying the open honeycomb material Tb2Ir3Ga9.
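A hedged Python sketch of the first fitting strategy described above: when the model predicts n spin-wave branches but the data resolve only m < n, each observed frequency is compared to a weighted average of the predicted frequencies assigned to it, with a penalty when the spread of the predictions exceeds the observed linewidth. The weighting, penalty form, and names are assumptions for illustration, not the paper's exact objective.

import numpy as np

def branch_residual(obs_freq, obs_width, pred_freqs, pred_weights, penalty=1.0):
    """Squared residual for one observed branch built from overlapping predicted branches."""
    f = np.asarray(pred_freqs, dtype=float)
    w = np.asarray(pred_weights, dtype=float)
    mean = np.sum(w * f) / np.sum(w)                        # weighted average of predictions
    spread = np.sqrt(np.sum(w * (f - mean) ** 2) / np.sum(w))
    r = (obs_freq - mean) ** 2
    if spread > obs_width:                                  # penalize predictions broader than observed
        r += penalty * (spread - obs_width) ** 2
    return r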

3.
NPJ Digit Med ; 4(1): 151, 2021 Oct 27.
Article in English | MEDLINE | ID: mdl-34707226

ABSTRACT

The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all of the codes relevant to a phenotype because of the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions. In this project, we demonstrate that multi-center, large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from the EHRs of two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. In addition, we developed an integrated clinical knowledge map combining embedding data from both institutions. The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Features identified via KESER resulted in performance comparable to that of features selected manually or with patient-level data. The knowledge map created using the integrative analysis identified disease-disease and disease-drug pairs more accurately than those identified using single-institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data.
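A minimal sketch of the sparse-embedding-regression idea named in the abstract: regress the target concept's embedding on the embeddings of all other concepts with an L1 penalty and keep the concepts with nonzero coefficients as selected features. The use of scikit-learn's Lasso and the alpha value are illustrative assumptions, not the published implementation.

import numpy as np
from sklearn.linear_model import Lasso

def select_features(embeddings: np.ndarray, target_idx: int, alpha: float = 0.01) -> np.ndarray:
    """embeddings: (n_concepts, dim). Returns indices of concepts selected for the target concept."""
    y = embeddings[target_idx]                       # dim-vector for the target concept
    X = np.delete(embeddings, target_idx, axis=0).T  # (dim, n_concepts - 1) design matrix
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    others = np.delete(np.arange(len(embeddings)), target_idx)
    return others[np.abs(coef) > 0]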

4.
AMIA Jt Summits Transl Sci Proc ; 2020: 533-541, 2020.
Article in English | MEDLINE | ID: mdl-32477675

ABSTRACT

The Department of Veterans Affairs (VA) archives one of the largest corpora of clinical notes in its corporate data warehouse as unstructured text data. Unstructured text easily supports keyword searches and regular expressions, but these simple searches often do not adequately support the complex queries that need to be performed on notes. For example, a researcher may want all notes with a Duke Treadmill Score of less than five, or all notes describing patients who smoke more than one pack per day. Range queries like these can be supported by modeling text as semi-structured documents. In this paper, we implement a scalable machine learning pipeline that models plain medical text as useful semi-structured documents. We improve on existing models, achieve an F1-score of 0.912, and scale our methods to the entire VA corpus.
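An illustrative sketch of the kind of transformation the paragraph describes: lift a numeric mention out of free text into a semi-structured record so that a range query (e.g., Duke Treadmill Score < 5) becomes possible. The regular expression, field name, and threshold are hypothetical; the paper's pipeline is a learned model, not this hand-written rule.

import re

SCORE_RE = re.compile(r"duke\s+treadmill\s+score[^\d-]*(-?\d+(?:\.\d+)?)", re.IGNORECASE)

def to_semi_structured(note_text: str) -> dict:
    """Return a document holding the raw text plus any extracted numeric fields."""
    doc = {"text": note_text}
    match = SCORE_RE.search(note_text)
    if match:
        doc["duke_treadmill_score"] = float(match.group(1))
    return doc

def notes_with_low_score(docs, threshold=5.0):
    """Range query over a collection of converted notes."""
    return [d for d in docs if d.get("duke_treadmill_score", float("inf")) < threshold]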

5.
Phys Rev E Stat Nonlin Soft Matter Phys ; 76(2 Pt 2): 026209, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17930123

ABSTRACT

Commonly used dependence measures, such as linear correlation, the cross-correlogram, or Kendall's tau, cannot capture the complete dependence structure in data unless the structure is restricted to linear, periodic, or monotonic. Mutual information (MI) has been frequently utilized for capturing the complete dependence structure, including nonlinear dependence. Recently, several methods have been proposed for MI estimation, such as kernel density estimators (KDEs), k-nearest neighbors (KNNs), Edgeworth approximation of differential entropy, and adaptive partitioning of the XY plane. However, outstanding gaps in the current literature have precluded the ability to effectively automate these methods, which, in turn, has limited their adoption by application communities. This study attempts to address a key gap in the literature: the evaluation of the above methods to choose the best one, particularly in terms of robustness for short and noisy data, based on comparisons with theoretical MI values that can be computed analytically, as well as with linear correlation and Kendall's tau. Here we consider smaller data sizes, such as 50, 100, and 1000 points; within this study we characterize 50 and 100 data points as very short and 1000 as short. We consider a broad class of functions, specifically linear, quadratic, periodic, and chaotic, contaminated with artificial noise at varying noise-to-signal ratios. Our results indicate that KDEs are the best choice for very short data at relatively high noise-to-signal levels, whereas KNNs perform best for very short data at relatively low noise levels, as well as for short data consistently across noise levels. In addition, the optimal smoothing parameter of a Gaussian kernel appears to be the best choice for KDEs, while three nearest neighbors appear optimal for KNNs. Thus, in situations where the approximate data sizes are known in advance and exploratory data analysis and/or domain knowledge can provide a priori insight into the noise-to-signal ratios, the results in this paper point to a way forward for automating the process of MI estimation.
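A quick sketch of the comparison the abstract carries out: estimate mutual information with a k-nearest-neighbor estimator (k = 3, the value the study found to work well) on a short, noisy, non-monotonic sample and contrast it with Kendall's tau, which misses the quadratic dependence. The library choices (scikit-learn and SciPy) are assumptions for illustration.

import numpy as np
from scipy.stats import kendalltau
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)             # "very short" data in the abstract's terminology
y = x ** 2 + 0.1 * rng.normal(size=100)      # quadratic signal plus noise, hence non-monotonic

mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0]
tau, _ = kendalltau(x, y)
print(f"KNN MI estimate: {mi:.3f}, Kendall's tau: {tau:.3f}")   # MI is clearly positive, tau is near zero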

6.
IEEE Trans Pattern Anal Mach Intell ; 27(8): 1340-3, 2005 Aug.
Article in English | MEDLINE | ID: mdl-16119272

ABSTRACT

FastMap is a dimension reduction technique that operates on distances between objects. Although only distances are used, implicitly the technique assumes that the objects are points in a p-dimensional Euclidean space. It selects a sequence of k ≤ p orthogonal axes defined by distant pairs of points (called pivots) and computes the projection of the points onto the orthogonal axes. We show that FastMap uses only the outer envelope of a data set. Pivots are taken from the faces, usually vertices, of the convex hull of the data points in the original implicit Euclidean space. This provides a bridge to results in robust statistics, where the convex hull is used as a tool in multivariate outlier detection and in robust estimation methods. The connection sheds new light on the properties of FastMap, particularly its sensitivity to outliers, and provides an opportunity for a new class of dimension reduction algorithms, RobustMaps, that retain the speed of FastMap and exploit ideas in robust statistics.
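A sketch of a single FastMap projection step, to make the paragraph concrete: an axis is defined by a distant pivot pair (a, b), and every object is projected onto the line through the pivots using only pairwise distances. Because pivots are extreme points, they lie on the convex hull of the data, which is the property the paper analyzes. The pivot heuristic is simplified here; FastMap then subtracts the projected component and repeats for the next axis.

import numpy as np

def fastmap_axis(D: np.ndarray) -> np.ndarray:
    """D: (n, n) symmetric distance matrix. Returns one coordinate per object for a single axis."""
    # Simplified pivot choice: the globally farthest pair (FastMap uses a cheaper iterative heuristic).
    a, b = np.unravel_index(np.argmax(D), D.shape)
    d_ab = D[a, b]
    if d_ab == 0:
        return np.zeros(len(D))
    # Cosine-law projection of every object onto the line through pivots a and b.
    return (D[a, :] ** 2 + d_ab ** 2 - D[b, :] ** 2) / (2.0 * d_ab)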


Subject(s)
Algorithms, Artificial Intelligence, Image Enhancement/methods, Computer-Assisted Image Interpretation/methods, Information Storage and Retrieval/methods, Automated Pattern Recognition/methods, Cluster Analysis, Computer Simulation, Statistical Models, Multivariate Analysis, Computer-Assisted Signal Processing, Subtraction Technique
7.
Article in English | MEDLINE | ID: mdl-16452798

ABSTRACT

This paper presents a novel algorithm for the identification and functional characterization of "key" genome features responsible for a particular biochemical process of interest. The central idea is that an individual genome feature is identified as "key" if the discrimination accuracy between two classes of genomes with respect to a given biochemical process is sufficiently affected by the inclusion or exclusion of that feature. In this paper, genome features are defined by high-resolution gene functions, and the discrimination procedure uses the Support Vector Machine classification technique. Application to the oxygenic photosynthetic process yielded 126 high-confidence candidate genome features. While many of these features are well-known components of the oxygenic photosynthetic process, others are completely unknown, including some hypothetical proteins. These results indicate that our algorithm is capable of discovering features related to a targeted biochemical process.
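A hedged sketch of the selection criterion the abstract describes: a genome feature is flagged as "key" if removing it changes the cross-validated accuracy of an SVM separating the two genome classes by more than a chosen margin. The linear kernel, 5-fold cross-validation, and threshold are assumptions for illustration.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def key_features(X: np.ndarray, y: np.ndarray, threshold: float = 0.02) -> list:
    """X: (genomes, gene functions) presence matrix; y: class labels (e.g., oxygenic photosynthetic or not)."""
    base = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
    keys = []
    for j in range(X.shape[1]):
        reduced = np.delete(X, j, axis=1)               # exclude feature j
        acc = cross_val_score(SVC(kernel="linear"), reduced, y, cv=5).mean()
        if base - acc >= threshold:                     # accuracy drops noticeably without feature j
            keys.append(j)
    return keys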


Subject(s)
Algorithms, Artificial Intelligence, Chromosome Mapping/methods, Genome/genetics, Automated Pattern Recognition/methods, Photosynthesis/genetics, Proteome/genetics, Genetic Databases