ABSTRACT
MOTIVATION: Large language models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4, to generate meaningful biomedical text rooted in established knowledge. RESULTS: Compared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes a minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in a more than 50% reduction in token consumption without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit and implicit knowledge of KG and LLM in a token-optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion.
AVAILABILITY AND IMPLEMENTATION: SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository.
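The context-pruning idea mentioned in this abstract (keeping only the graph-derived statements most similar to the prompt so that fewer tokens reach the LLM) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the bag-of-words `embed` function stands in for a real sentence-embedding model, the names `prune_context`, `min_similarity`, and `top_k` are hypothetical, and the placeholder statements are not SPOKE content or code from the KG_RAG repository.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_context(question, statements, min_similarity=0.2, top_k=3):
    """Keep at most top_k statements whose similarity to the question passes the threshold."""
    q = embed(question)
    scored = [(cosine(q, embed(s)), s) for s in statements]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    # Most similar statements first; Python's sort is stable, so ties keep input order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:top_k]]


# Hypothetical placeholder statements, written as "subject predicate object".
statements = [
    "disease_x associates gene_a",
    "disease_x treated_by compound_b",
    "disease_y associates gene_c",
    "compound_d binds protein_e",
]
pruned = prune_context("which gene associates with disease_x", statements)
print(pruned)
```

In this toy run the statement that shares no terms with the question is dropped, so it would not be sent to the model; a real pipeline would make the same decision with learned embeddings instead of word counts.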
Subjects
Natural Language Processing, Computational Biology/methods, Algorithms, Humans
ABSTRACT
Identification of Alzheimer's disease (AD) onset risk can facilitate interventions before irreversible disease progression. We demonstrate that electronic health records from the University of California, San Francisco, followed by knowledge networks (for example, SPOKE), allow for (1) prediction of AD onset, (2) prioritization of biological hypotheses, and (3) contextualization of sex dimorphism. We trained random forest models and predicted AD onset on a cohort of 749 individuals with AD and 250,545 controls with a mean area under the receiver operating characteristic curve of 0.72 (7 years prior) to 0.81 (1 day prior). We further harnessed matched cohort models to identify conditions with predictive power before AD onset. Knowledge networks highlight shared genes between multiple top predictors and AD (for example, APOE, ACTB, IL6 and INS). Genetic colocalization analysis supports AD association with hyperlipidemia at the APOE locus, as well as a stronger female AD association with osteoporosis at a locus near MS4A6A. We therefore show how clinical data can be utilized for early AD prediction and identification of personalized biological hypotheses.
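The performance figures in this abstract are areas under the receiver operating characteristic curve. As a reminder of what that number measures, the sketch below computes it as the fraction of (case, control) pairs in which the case receives the higher score, counting ties as half. The function name `auroc` and the toy labels and scores are assumptions for illustration and are unrelated to the study's data or models.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the case is ranked above the control
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: 2 cases (label 1) and 3 controls (label 0).
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(auroc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.72 and 0.81 should be read.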
Subjects
Alzheimer Disease, Male, Humans, Female, Alzheimer Disease/diagnosis, Electronic Health Records, Apolipoproteins E/genetics, San Francisco
ABSTRACT
Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly "Question-of-the-Month (QotM) Challenge" series. Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.
ABSTRACT
Introduction: Early diagnosis of Parkinson's disease (PD) is important to identify treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests, and these symptoms may be coded as diagnoses in the electronic health record (EHR). Methods: To predict PD diagnosis, we embedded EHR data of patients onto a biomedical knowledge graph called Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and from a non-PD group of 457,197 individuals. Results: The classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification. Discussion: The proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence.
ABSTRACT
MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KGs present a challenge due to the complexity, size and heterogeneity of the underlying information. RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Automated Pattern Recognition, Precision Medicine, Factual Databases
ABSTRACT
Meaningful representation of clinical data using embedding vectors is a pivotal step to invoke any machine learning (ML) algorithm for data inference. In this article, we propose a time-aware embedding approach of electronic health records onto a biomedical knowledge graph for creating machine-readable patient representations. This approach not only captures the temporal dynamics of patient clinical trajectories, but also enriches them with additional biological information from the knowledge graph. To gauge the predictivity of this approach, we propose an ML pipeline called TANDEM (Temporal and Non-temporal Dynamics Embedded Model) and apply it to the early detection of Parkinson's disease. TANDEM results in a classification AUC score of 0.85 on an unseen test dataset. These predictions are further explained by providing biological insight using the knowledge graph. Taken together, we show that temporal embeddings of clinical data could be a meaningful predictive representation for downstream ML pipelines in clinical decision-making.
Subjects
Computational Biology, Automated Pattern Recognition, Humans, Computational Biology/methods, Algorithms, Machine Learning, Electronic Health Records
ABSTRACT
Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to enable computers to solve complex problems with AI methods. However, its application to biomedicine has been lagging in part due to the daunting complexity of molecular and cellular pathways that govern human physiology and pathology. In this article we describe concrete uses of SPOKE, an open knowledge network that connects curated information from 37 specialized and human-curated databases into a single property graph, with 3 million nodes and 15 million edges to date. Applications discussed in this article include drug discovery, COVID-19 research and chronic disease diagnosis and management.
ABSTRACT
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs have left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
Subjects
Automated Pattern Recognition, Biomedical Translational Science, Knowledge
ABSTRACT
Neurons in the dorsal pathway of the visual cortex are thought to be involved in motion processing. The first site of motion processing is the primary visual cortex (V1), which encodes the direction of motion in local receptive fields, with higher-order motion processing happening in the middle temporal area (MT). Complex motion properties like optic flow are processed in higher cortical areas such as the Medial Superior Temporal area (MST). In this study, a hierarchical neural field network model of motion processing is presented. The model architecture has an input layer followed by either one or a cascade of two neural fields (NF): the first of these, NF1, represents V1, while the second, NF2, represents MT. A special feature of the model is that lateral connections used in the neural fields are trained by asymmetric Hebbian learning, imparting to the neural field the ability to process sequential information in motion stimuli. The model was trained using various traditional moving patterns such as bars, squares, gratings, plaids, and random dot stimuli. In the case of bar stimuli, the model had only a single NF, the neurons of which developed a direction map of the moving bar stimuli. Training a network with two NFs on moving square and moving plaid stimuli, we show that, while the neurons in NF1 respond to the direction of the component (such as gratings and edges) motion, the neurons in NF2 (analogous to MT) respond to the direction of the pattern (plaids, square object) motion. In the third study, a network with two NFs was simulated using random dot stimuli (RDS) with translational motion, and we show that the NF2 neurons can encode the direction of the concurrent dot motion (also called translational flow motion), independent of the dot configuration. This translational RDS flow motion is decoded by a simple perceptron network (a layer above NF2) with an accuracy of 100% on the training set and 90% on the test set, thereby demonstrating that the proposed network can generalize to new dot configurations. Also, the response properties of the model on different input stimuli closely resembled many of the known features of the neurons found in electrophysiological studies.
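The abstract attributes the model's sensitivity to motion sequences to asymmetric Hebbian learning on lateral connections but does not give the update rule. A generic, temporally asymmetric Hebbian update, assumed here purely for illustration, strengthens the connection from unit j to unit i when j was active one step before i. The function name `train_asymmetric_hebbian`, the learning rate `eta`, and the one-hot "moving bar" input are all hypothetical.

```python
def one_hot(index, n_units):
    """Activity pattern with a single active unit (a bar at one position)."""
    return [1.0 if k == index else 0.0 for k in range(n_units)]


def train_asymmetric_hebbian(sequence, n_units, eta=0.1):
    """Lateral weights w[i][j] grow when unit j fired at t-1 and unit i fires at t."""
    w = [[0.0] * n_units for _ in range(n_units)]
    for previous, current in zip(sequence, sequence[1:]):
        for i in range(n_units):
            for j in range(n_units):
                # Post-synaptic activity now times pre-synaptic activity one step earlier.
                w[i][j] += eta * current[i] * previous[j]
    return w


n_units = 4
# A bar sweeping from position 0 to position 3 (rightward motion).
rightward = [one_hot(k, n_units) for k in range(n_units)]
weights = train_asymmetric_hebbian(rightward, n_units)
for row in weights:
    print(row)
```

Because the rule pairs current activity with past activity, the learned matrix is not symmetric: connections in the direction of motion are strengthened while those in the opposite direction stay at zero, which is one simple way a network can become direction selective.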
ABSTRACT
Three-dimensional (3D) spatial cells in the mammalian hippocampal formation are believed to support the existence of 3D cognitive maps. Modeling studies are crucial to comprehend the neural principles governing the formation of these maps, yet to date very few have addressed this topic in 3D space. Here we present a hierarchical network model for the formation of 3D spatial cells using an anti-Hebbian network. Built on empirical data, the model accounts for the natural emergence of 3D place, border, and grid cells, as well as a previously undescribed type of spatial cell which we call plane cells. It further explains the plausible reason behind the place and grid-cell anisotropic coding that has been observed in rodents and the potential discrepancy with the predicted periodic coding during 3D volumetric navigation. Lastly, it provides evidence for the importance of unsupervised learning rules in guiding the formation of higher-dimensional cognitive maps.
ABSTRACT
Oscillatory phenomena are ubiquitous in the brain. Although there are oscillator-based models of brain dynamics, their universal computational properties have not been explored much, unlike in the case of rate-coded and spiking neuron network models. Use of oscillator-based models is often limited to special phenomena like locomotor rhythms and oscillatory attractor-based memories. If neuronal ensembles are taken to be the basic functional units of brain dynamics, it is desirable to develop oscillator-based models that can explain a wide variety of neural phenomena. Autoencoders are a special type of feedforward network that has been used for the construction of large-scale deep networks. Although autoencoders based on rate-coded and spiking neuron networks have been proposed, there are no autoencoders based on oscillators. We propose here an oscillatory neural network model that performs the function of an autoencoder. The model is a hybrid of rate-coded neurons and neural oscillators. Input signals modulate the frequency of the neural encoder oscillators. These signals are then multiplexed using a network of rate-coded neurons that has afferent Hebbian and lateral anti-Hebbian connectivity, termed the Lateral Anti-Hebbian Network (LAHN). Finally, the LAHN output is de-multiplexed using an output neural layer, which is a combination of adaptive Hopf and Kuramoto oscillators, for signal reconstruction. The Kuramoto-Hopf combination performing demodulation is a novel way of describing a neural phase-locked loop. The proposed model is tested using both synthetic signals and real-world EEG signals. The proposed model arises out of the general motivation to construct biologically inspired, oscillatory versions of some of the standard neural network models, and presents itself as an autoencoder network based on oscillatory neurons applicable to time series signals. As a demonstration, the model is applied to compression of EEG signals.
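For readers unfamiliar with the Kuramoto oscillators named in this abstract, the sketch below simulates the standard Kuramoto phase model, in which each phase is pulled toward the others through a sine coupling term. It illustrates only that textbook model, not the paper's combined Hopf-Kuramoto output layer; the coupling strength, step size, and initial phases are arbitrary values chosen for illustration.

```python
import cmath
import math


def kuramoto_step(phases, omegas, coupling, dt):
    """One Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(phases)
    new_phases = []
    for i in range(n):
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n))
        dtheta = omegas[i] + (coupling / n) * pull
        new_phases.append(phases[i] + dt * dtheta)
    return new_phases


def order_parameter(phases):
    """Magnitude of the mean phase vector: 1.0 means all phases coincide."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


def simulate(phases, omegas, coupling=2.0, dt=0.01, steps=2000):
    """Integrate the coupled phases for a fixed number of Euler steps."""
    for _ in range(steps):
        phases = kuramoto_step(phases, omegas, coupling, dt)
    return phases


initial_phases = [0.0, 0.5, 1.0, 1.5]
natural_freqs = [2.0 * math.pi] * 4   # identical 1 Hz oscillators
final_phases = simulate(initial_phases, natural_freqs)
print(order_parameter(initial_phases), order_parameter(final_phases))
```

With identical natural frequencies and positive coupling, the phases lock together, which is the phase-locking behaviour that makes such oscillators usable as a demodulating stage.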
ABSTRACT
Spatial cells in the hippocampal complex play a pivotal role in the navigation of an animal. The exact neural principles behind these spatial cell responses have not been completely unraveled yet. Here we present two models for spatial cells, namely the Velocity Driven Oscillatory Network (VDON) and the Locomotor Driven Oscillatory Network. Both models share three stages: a direction encoding stage, a path integration (PI) stage, and a stage of unsupervised learning of PI values. In the first model, the three stages are implemented as a head direction layer, frequency modulation by a layer of oscillatory neurons, and an unsupervised stage that extracts the principal components from the oscillator outputs. In the second model, a refined version of the first model, the stages are extraction of a velocity representation from the locomotor input, frequency modulation by a layer of oscillators, and two cascaded unsupervised stages consisting of the Lateral Anti-Hebbian Network. The principal component stage of VDON exhibits grid cell-like spatially periodic responses including hexagonal firing fields. The Locomotor Driven Oscillatory Network shows the emergence of spatially periodic grid cells and periodically active border-like cells in its lower layer; place cell responses are found in its higher layer. This model shows the inheritance of phase precession from grid cell to place cell in both one- and two-dimensional spaces. It also shows a novel result on the influence of locomotion rhythms on grid cell activity. The study thus presents a comprehensive, unifying hierarchical model for hippocampal spatial cells.
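The first two stages described here (direction encoding followed by frequency modulation of oscillators) can be read as a form of path integration: if an oscillator's frequency shift is proportional to the velocity component along its preferred direction, its accumulated phase offset tracks displacement along that direction. The sketch below assumes this common velocity-controlled-oscillator formulation; the abstract does not give the equations, and the names `integrate_path`, `beta`, and `preferred_dirs` are hypothetical.

```python
import math


def integrate_path(velocities, preferred_dirs, beta=1.0, dt=0.1):
    """Accumulate each oscillator's phase offset relative to an unmodulated baseline."""
    phase_offsets = [0.0] * len(preferred_dirs)
    for vx, vy in velocities:
        for i, phi in enumerate(preferred_dirs):
            # Direction encoding: project the velocity onto the preferred direction.
            projection = vx * math.cos(phi) + vy * math.sin(phi)
            # Frequency modulation: the frequency shift is beta * projection,
            # so integrating it over time yields a displacement-dependent phase.
            phase_offsets[i] += beta * projection * dt
    return phase_offsets


# Move 1 unit along x (10 steps at speed 1), then 1 unit along y (5 steps at speed 2).
trajectory = [(1.0, 0.0)] * 10 + [(0.0, 2.0)] * 5
directions = [0.0, math.pi / 2, math.pi]
offsets = integrate_path(trajectory, directions)
print(offsets)
```

Each phase offset equals `beta` times the net displacement projected onto the oscillator's preferred direction, which is the path-integration signal a later unsupervised stage could learn from.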
Subjects
Grid Cells/physiology, Hippocampus/cytology, Hippocampus/physiology, Locomotion/physiology, Neurological Models, Nerve Net/physiology, Computer Neural Networks, Place Cells/physiology, Spatial Navigation/physiology, Unsupervised Machine Learning, Animals
ABSTRACT
Grid cells and place cells are believed to be cellular substrates for the spatial navigation functions of the hippocampus as experimental animals physically navigate in 2D and 3D spaces. However, a recent saccade study on a head-fixed monkey has also reported grid-like representations on the saccadic trajectory while the animal scanned images on a computer screen. We present two computational models that explain the formation of grid patterns on saccadic trajectories formed on novel images. The first model, named Saccade Velocity Driven Oscillatory Network-Direct PCA (SVDON-DPCA), explains how grid patterns can be generated in saccadic space using a Principal Component Analysis (PCA)-like learning rule. The model adopts a hierarchical architecture. We extend this to a network model, viz. Saccade Velocity Driven Oscillatory Network-Network PCA (SVDON-NPCA), where the direct PCA stage is replaced by a neural network that can implement PCA using a neurally plausible algorithm. This makes it possible to study the formation of grid cells at a network level. The saccade trajectory for both models is generated by an attention model that attends to salient locations by computing the saliency maps of the images. Both models capture the spatial characteristics of grid cells, such as grid scale variation along the dorso-ventral axis of the medial entorhinal cortex. Adding one more layer of Lateral Anti-Hebbian Network (LAHN) over the SVDON-NPCA model predicts place cells in saccadic space, which are yet to be discovered experimentally. To the best of our knowledge, this is the first attempt to model grid cells and place cells from saccade trajectories.
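The abstract says the network variant implements PCA with a neurally plausible algorithm but does not name it. One well-known rule of that kind is Oja's rule, used in the sketch below as an assumption rather than as the paper's method: a single linear neuron updates its weights with a Hebbian term plus a decay that keeps the weight vector normalized, and converges to the first principal component of zero-mean input. The toy 2D data, learning rate, and epoch count are illustrative choices.

```python
import math


def oja_train(samples, w, eta=0.01, epochs=500):
    """Oja's rule: w <- w + eta * y * (x - y * w), with y = w . x."""
    w = list(w)
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Hebbian growth (y * x) balanced by a decay (y^2 * w) that bounds the norm.
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


# Zero-mean toy data: large variance along (1, 1), small variance along (1, -1).
toy_samples = [(2.0, 2.0), (-2.0, -2.0), (0.5, -0.5), (-0.5, 0.5)]
learned = oja_train(toy_samples, w=[1.0, 0.0])
print(learned, math.hypot(learned[0], learned[1]))
```

The learned vector aligns with the high-variance diagonal and has approximately unit length, so the neuron's output is the projection of the input onto the first principal component, using only locally available quantities.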
ABSTRACT
Grid cells are a special class of spatial cells found in the medial entorhinal cortex (MEC) characterized by their strikingly regular hexagonal firing fields. This spatially periodic firing pattern was originally considered to be independent of the geometric properties of the environment. However, this notion was contested by examining the grid cell periodicity in environments with different polarity (Krupic et al., 2015) and in connected environments (Carpenter et al., 2015). These experimental results demonstrated the dependence of grid cell activity on environmental geometry. Because environmental geometry can vary in practically infinite ways, analyzing grid cell periodicity across such variations is impractical experimentally. Hence we analyze the dependence of grid cell periodicity on the environmental geometry purely from a computational point of view. We use a hierarchical oscillatory network model where velocity inputs are presented to a layer of Head Direction cells, outputs of which are projected to a Path Integration layer. The Lateral Anti-Hebbian Network (LAHN) is used to perform feature extraction from the Path Integration neurons, thereby producing a spectrum of spatial cell responses. We simulated the model in five types of environmental geometry: (1) connected environments, (2) convex shapes, (3) concave shapes, (4) regular polygons with varying numbers of sides, and (5) transforming environments. Simulation results point to a greater function for grid cells than what was believed hitherto. Grid cells in the model encode not just the local position but also more global information like the shape of the environment. Furthermore, the model is able to capture the invariant attributes of the physical space ingrained in its LAHN layer, thereby revealing its ability to classify an environment using this information. The proposed model is interesting not only because it is able to capture the experimental results but, more importantly, because it makes many important predictions on the effect of the environmental geometry on grid cell periodicity and suggests the possibility of grid cells encoding the invariant properties of an environment.