ABSTRACT
MOTIVATION: Large language models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead, requiring further domain-expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4, to generate meaningful biomedical text rooted in established knowledge. RESULTS: Compared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in more than 50% reduction in token consumption without compromising the accuracy, making a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM in a token optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion. 
AVAILABILITY AND IMPLEMENTATION: SPOKE KG can be accessed at https://spoke.rbvi.ucsf.edu/neighborhood.html. It can also be accessed using REST-API (https://spoke.rbvi.ucsf.edu/swagger/). KG-RAG code is made available at https://github.com/BaranziniLab/KG_RAG. Biomedical benchmark datasets used in this study are made available to the research community in the same GitHub repository.
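The context-pruning step described in the abstract above can be illustrated with a minimal sketch. It is not the KG-RAG implementation: the function names, the thresholds, and the example statements are hypothetical, and a toy bag-of-words vector stands in for the embedding model that a real pipeline would presumably use. The idea shown is only that graph-derived statements are scored against the question and low-similarity ones are dropped before the prompt is built, which is what reduces token consumption.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_context(question, statements, min_similarity=0.2, max_statements=2):
    """Keep only the statements most similar to the question.

    min_similarity and max_statements are hypothetical tuning knobs.
    """
    q = embed(question)
    scored = [(cosine(q, embed(s)), s) for s in statements]
    kept = [(score, s) for score, s in scored if score >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:max_statements]]


# Illustrative statements only; they are not taken from SPOKE.
statements = [
    "Disease multiple sclerosis associates Gene HLA-DRB1",
    "Disease multiple sclerosis presents Symptom fatigue",
    "Compound aspirin treats Disease headache",
]
pruned = prune_context("Which gene associates with multiple sclerosis", statements)
print(pruned)
```

With these toy inputs the unrelated third statement is dropped, so only the two statements about the queried disease would be passed on to the LLM prompt.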
Subject(s)
Natural Language Processing; Computational Biology/methods; Algorithms; Humans
ABSTRACT
MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KG presents a challenge due to the complexity, size and heterogeneity of the underlying information. RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by python scripts which download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. Conclusions/Significance: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Pattern Recognition, Automated; Precision Medicine; Databases, Factual
ABSTRACT
Pfizer's Crystal Structure Database (CSDB) is a key enabling technology that allows scientists on structure-based projects rapid access to Pfizer's vast library of in-house crystal structures, as well as a significant number of structures imported from the Protein Data Bank. In addition to capturing basic information such as the asymmetric unit coordinates, reflection data, and the like, CSDB employs a variety of automated methods to first ensure a standard level of annotations and error checking, and then to add significant value for design teams by processing the structures through a sequence of algorithms that prepares the structures for use in modeling. The structures are made available, both as the original asymmetric unit as submitted, as well as the final prepared structures, through REST-based web services that are consumed by several client desktop applications. The structures can be searched by keyword, sequence, submission date, ligand substructure and similarity search, and other common queries.
Subject(s)
Algorithms; Databases, Protein; Humans; Ligands
ABSTRACT
Knowledge representation and reasoning (KR&R) has been successfully implemented in many fields to enable computers to solve complex problems with AI methods. However, its application to biomedicine has been lagging in part due to the daunting complexity of molecular and cellular pathways that govern human physiology and pathology. In this article we describe concrete uses of SPOKE, an open knowledge network that connects curated information from 37 specialized and human-curated databases into a single property graph, with 3 million nodes and 15 million edges to date. Applications discussed in this article include drug discovery, COVID-19 research and chronic disease diagnosis and management.
ABSTRACT
Many proteins fold into highly regular and repetitive three dimensional structures. The analysis of structural patterns and repeated elements is fundamental to understand protein function and evolution. We present recent improvements to the CE-Symm tool for systematically detecting and analyzing the internal symmetry and structural repeats in proteins. In addition to the accurate detection of internal symmetry, the tool is now capable of i) reporting the type of symmetry, ii) identifying the smallest repeating unit, iii) describing the arrangement of repeats with transformation operations and symmetry axes, and iv) comparing the similarity of all the internal repeats at the residue level. CE-Symm 2.0 helps the user investigate proteins with a robust and intuitive sequence-to-structure analysis, with many applications in protein classification, functional annotation and evolutionary studies. We describe the algorithmic extensions of the method and demonstrate its applications to the study of interesting cases of protein evolution.
Subject(s)
Algorithms; Computational Biology/methods; Proteins/chemistry; Software; Amino Acid Sequence; Databases, Protein; Models, Molecular; Sequence Analysis, Protein
ABSTRACT
BioJava is an open-source project that provides a Java library for processing biological data. The project aims to simplify bioinformatic analyses by implementing parsers, data structures, and algorithms for common tasks in genomics, structural biology, ontologies, phylogenetics, and more. Since 2012, we have released two major versions of the library (4 and 5) that include many new features to tackle challenges with increasingly complex macromolecular structure data. BioJava requires Java 8 or higher and is freely available under the LGPL 2.1 license. The project is hosted on GitHub at https://github.com/biojava/biojava. More information and documentation can be found online on the BioJava website (http://www.biojava.org) and tutorial (https://github.com/biojava/biojava-tutorial). All inquiries should be directed to the GitHub page or the BioJava mailing list (http://lists.open-bio.org/mailman/listinfo/biojava-l).
Subject(s)
Computational Biology/methods; Access to Information; Algorithms; Gene Library; Genome/genetics; Genomics; Information Storage and Retrieval; Internet; Software
ABSTRACT
Motivation: The interactive visualization of very large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate and deliver structures of increasing size. Results: We have tackled this problem by developing highly memory-efficient and scalable extensions for the NGL WebGL-based molecular viewer and by using the Macromolecular Transmission Format (MMTF), a binary and compressed data format. These enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education. Availability and implementation: The source code is freely available under the MIT license at github.com/arose/ngl and distributed on NPM (npmjs.com/package/ngl). MMTF-JavaScript encoders and decoders are available at github.com/rcsb/mmtf-javascript.
Subject(s)
Computer Graphics; Internet; Macromolecular Substances; Software
ABSTRACT
The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB, http://rcsb.org), the US data center for the global PDB archive, makes PDB data freely available to all users, from structural biologists to computational biologists and beyond. New tools and resources have been added to the RCSB PDB web portal in support of a 'Structural View of Biology.' Recent developments have improved the user experience, including the high-speed NGL Viewer that provides 3D molecular visualization in any web browser, improved support for data file download and enhanced organization of website pages for query, reporting and individual structure exploration. Structure validation information is now visible for all archival entries. PDB data have been integrated with external biological resources, including chromosomal position within the human genome; protein modifications; and metabolic pathways. PDB-101 educational materials have been reorganized into a searchable website and expanded to include new features such as the Geis Digital Archive.
Subject(s)
Computational Biology/methods; Databases, Genetic; Proteins/chemistry; Proteins/genetics; Datasets as Topic; Metabolic Networks and Pathways; Models, Molecular; Protein Conformation; Proteins/metabolism; Software; Structure-Activity Relationship; User-Computer Interface; Web Browser
ABSTRACT
SUMMARY: We developed a new software tool, BioJava-ModFinder, for identifying protein modifications observed in 3D structures archived in the Protein Data Bank (PDB). Information on more than 400 types of protein modifications was collected and curated from annotations in PDB, RESID, and PSI-MOD. We divided these modifications into three categories: modified residues, attachment modifications, and cross-links. We have developed a systematic method to identify these modifications in 3D protein structures. We have integrated this package with the RCSB PDB web application and added protein modification annotations to the sequence diagram and structure display. By scanning all 3D structures in the PDB using BioJava-ModFinder, we identified more than 30 000 structures with protein modifications, which can be searched, browsed, and visualized on the RCSB PDB website. AVAILABILITY AND IMPLEMENTATION: BioJava-ModFinder is available as open source (LGPL license) at https://github.com/biojava/biojava/tree/master/biojava-modfinder. The RCSB PDB can be accessed at http://www.rcsb.org. CONTACT: pwrose@ucsd.edu.
Subject(s)
Computational Biology/methods; Databases, Protein; Protein Conformation; Software; Internet
ABSTRACT
Recent advances in experimental techniques have led to a rapid growth in complexity, size, and number of macromolecular structures that are made available through the Protein Data Bank. This creates a challenge for macromolecular visualization and analysis. Macromolecular structure files, such as PDB or PDBx/mmCIF files, can be slow to transfer and parse, and hard to incorporate into third-party software tools. Here, we present a new binary and compressed data representation, the MacroMolecular Transmission Format (MMTF), as well as software implementations in several languages that have been developed around it, which address these issues. We describe the new format and its APIs and demonstrate that it is several times faster to parse, and about a quarter of the file size of the current standard format, PDBx/mmCIF. As a consequence of the new data representation, it is now possible to visualize structures with millions of atoms in a web browser, and to keep the whole PDB archive in memory or parse it within a few minutes on average computers, which opens up a new way of thinking about how to design and implement efficient algorithms in structural bioinformatics. The PDB archive is available in MMTF file format through web services and data that are updated on a weekly basis.
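The abstract above does not spell out how the compact representation is achieved, so the following is only a generic sketch of two column-compression techniques, delta encoding followed by run-length encoding, that suit regular integer columns such as atom serial numbers. The function names and the sample column are hypothetical, and nothing here should be read as the MMTF library's API.

```python
def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Flatten runs into [value, count, value, count, ...]."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1  # extend the current run
        else:
            out.extend([v, 1])  # start a new run
    return out


def run_length_decode(pairs):
    """Invert run_length_encode."""
    out = []
    for i in range(0, len(pairs), 2):
        out.extend([pairs[i]] * pairs[i + 1])
    return out


# Hypothetical column: consecutive atom serial numbers.
serials = [1, 2, 3, 4, 5, 6, 7, 8]
encoded = run_length_encode(delta_encode(serials))
decoded = delta_decode(run_length_decode(encoded))
print(encoded, decoded == serials)
```

For a strictly consecutive column, every delta is 1, so the whole column collapses to a single value and count pair, which hints at why such encodings can shrink highly regular structure data.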
Subject(s)
Computational Biology/methods; Databases, Chemical; Macromolecular Substances; Software; Internet; Macromolecular Substances/analysis; Macromolecular Substances/chemistry; Macromolecular Substances/classification; Molecular Structure
ABSTRACT
The Protein Data Bank (PDB) now contains more than 120,000 three-dimensional (3D) structures of biological macromolecules. To allow an interpretation of how PDB data relates to other publicly available annotations, we developed a novel data integration platform that maps 3D structural information across various datasets. This integration bridges from the human genome across protein sequence to 3D structure space. We developed novel software solutions for data management and visualization, while incorporating new libraries for web-based visualization using SVG graphics. AVAILABILITY AND IMPLEMENTATION: The new views are available from http://www.rcsb.org and software is available from https://github.com/rcsb/. CONTACT: andreas.prlic@rcsb.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Computational Biology/methods; Databases, Protein; Protein Conformation; Software; Amino Acid Sequence; Computer Graphics; Genomics; Humans; User-Computer Interface
Subject(s)
Cooperative Behavior; Data Science; Interdisciplinary Communication; Interprofessional Relations; Computational Biology; Data Science/ethics; Data Science/organization & administration; Data Science/trends; Humans; Intersectoral Collaboration
ABSTRACT
Scientific software engineering is a distinct discipline from both computational chemistry project support and research informatics. A scientific software engineer not only has a deep understanding of the science of drug discovery but also the desire, skills and time to apply good software engineering practices. A good team of scientific software engineers can create a software foundation that is maintainable, validated and robust. If done correctly, this foundation enables the organization to investigate novel computational ideas with a very high level of efficiency.
Subject(s)
Computer-Aided Design; Drug Discovery/methods; Drug Industry/methods; Software; Chemistry, Pharmaceutical; Computational Biology; Drug Design; Models, Molecular
ABSTRACT
The RCSB Protein Data Bank (RCSB PDB, http://www.rcsb.org) provides access to 3D structures of biological macromolecules and is one of the leading resources in biology and biomedicine worldwide. Our efforts over the past 2 years focused on enabling a deeper understanding of structural biology and providing new structural views of biology that support both basic and applied research and education. Herein, we describe recently introduced data annotations including integration with external biological resources, such as gene and drug databases, new visualization tools and improved support for the mobile web. We also describe access to data files, web services and open access software components to enable software developers to more effectively mine the PDB archive and related annotations. Our efforts are aimed at expanding the role of 3D structure in understanding biology and medicine.
Subject(s)
Databases, Protein; Protein Conformation; Binding Sites; Internet; Membrane Proteins/chemistry; Molecular Biology/education; Molecular Sequence Annotation; Multiprotein Complexes/chemistry; Peptides/chemistry; Pharmaceutical Preparations/chemistry; Research; Software
ABSTRACT
SUMMARY: The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) resource provides tools for query, analysis and visualization of the 3D structures in the PDB archive. As the mobile Web is starting to surpass desktop and laptop usage, scientists and educators are beginning to integrate mobile devices into their research and teaching. In response, we have developed the RCSB PDB Mobile app for the iOS and Android mobile platforms to enable fast and convenient access to RCSB PDB data and services. Using the app, users from the general public to expert researchers can quickly search and visualize biomolecules, and add personal annotations via the RCSB PDB's integrated MyPDB service. AVAILABILITY AND IMPLEMENTATION: RCSB PDB Mobile is freely available from the Apple App Store and Google Play (http://www.rcsb.org).
Subject(s)
Computational Biology/methods; Computer Graphics; Databases, Protein; Mobile Applications; Software; Biomedical Research; Humans; User-Computer Interface; Workflow
Subject(s)
Computational Biology; Guidelines as Topic; Writing; Authorship; Documentation; Humans
ABSTRACT
The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) develops tools and resources that provide a structural view of biology for research and education. The RCSB PDB web site (http://www.rcsb.org) uses the curated 3D macromolecular data contained in the PDB archive to offer unique methods to access, report and visualize data. Recent activities have focused on improving methods for simple and complex searches of PDB data, creating specialized access to chemical component data and providing domain-based structural alignments. New educational resources are offered at the PDB-101 educational view of the main web site such as Author Profiles that display a researcher's PDB entries in a timeline. To promote different kinds of access to the RCSB PDB, Web Services have been expanded, and an RCSB PDB Mobile application for the iPhone/iPad has been released. These improvements enable new opportunities for analyzing and understanding structure data.
Subject(s)
Databases, Protein; Protein Conformation; Biochemistry/education; Computer Graphics; Internet; Ligands; Protein Structure, Tertiary; Research; Structural Homology, Protein
ABSTRACT
A new class of stabilized pentacene derivatives with externally fused five-membered rings are prepared by means of a key palladium-catalyzed cyclopentannulation step. The target compounds are synthesized by chemical manipulation of a partially saturated 6,13-dibromopentacene precursor that can be fully aromatized in a final step through a DDQ-mediated dehydrogenation reaction (DDQ=2,3-dichloro-5,6-dicyano-1,4-benzoquinone). The new 1,2,8,9-tetraaryldicyclopenta[fg,qr]pentacene derivatives have narrow energy gaps of circa 1.2 eV and behave as strong electron acceptors with lowest unoccupied molecular orbital energies between -3.81 and -3.90 eV. Photodegradation studies reveal the new compounds are more photostable than 6,13-bis(triisopropylsilylethynyl)pentacene (TIPS-pentacene).
ABSTRACT
BACKGROUND: Survival rates following a diagnosis of cancer vary between countries. The International Cancer Benchmarking Partnership (ICBP), a collaboration between six countries with primary care led health services, was set up in 2009 to investigate the causes of these differences. Module 3 of this collaboration hypothesised that an association exists between the readiness of primary care physicians (PCP) to investigate for cancer - the 'threshold' risk level at which they investigate or refer to a specialist for consideration of possible cancer - and survival for that cancer (lung, colorectal and ovarian). We describe the development of an international survey instrument to test this hypothesis. METHODS: The work was led by an academic steering group in England. They agreed that an online survey was the most pragmatic way of identifying differences between the jurisdictions. Research questions were identified through clinical experience and expert knowledge of the relevant literature. A survey comprising a set of direct questions and five clinical scenarios was developed to investigate the hypothesis. The survey content was discussed and refined concurrently and repeatedly with international partners. The survey was validated using an iterative process in England. Following validation, the survey was adapted to be relevant to the health systems operating in other jurisdictions and translated into Danish, Norwegian and Swedish, and into Canadian and Australian English. RESULTS: This work has produced a survey with face, content and cross-cultural validity that will be circulated in all six countries. It could also form a benchmark for similar surveys in countries with similar health care systems. CONCLUSIONS: The vignettes could also be used as educational resources. This study is likely to impact on healthcare policy and practice in participating countries.
Subject(s)
Neoplasms/diagnosis; Practice Patterns, Physicians'/statistics & numerical data; Primary Health Care/standards; Surveys and Questionnaires; Australia; Canada; Denmark; England; Humans; Norway; Sweden; Translating
ABSTRACT
SUMMARY: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality. RESULTS: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model. AVAILABILITY: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.