ABSTRACT
Inflammatory pain results from the heightened sensitivity and reduced threshold of nociceptor sensory neurons due to exposure to inflammatory mediators. However, the cellular and transcriptional diversity of immune cell and sensory neuron types makes it challenging to decipher the immune mechanisms underlying pain. Here we used single-cell transcriptomics to determine the immune gene signatures associated with pain development in three skin inflammatory pain models in mice: zymosan injection, skin incision and ultraviolet burn. We found that macrophage and neutrophil recruitment closely mirrored the kinetics of pain development and identified cell-type-specific transcriptional programs associated with pain and its resolution. Using a comprehensive list of potential interactions mediated by receptors, ligands, ion channels and metabolites to generate injury-specific neuroimmune interactomes, we also uncovered that thrombospondin-1, upregulated by immune cells upon injury, inhibited nociceptor sensitization. This study lays the groundwork for identifying the neuroimmune axes that modulate pain in diverse disease contexts.
Subject(s)
Nociceptors , Pain , Animals , Mice , Pain/immunology , Pain/metabolism , Nociceptors/metabolism , Transcriptome , Mice, Inbred C57BL , Inflammation/immunology , Male , Macrophages/immunology , Macrophages/metabolism , Disease Models, Animal , Thrombospondin 1/metabolism , Thrombospondin 1/genetics , Skin/immunology , Skin/metabolism , Skin/pathology , Zymosan , Single-Cell Analysis , Neuroimmunomodulation , Gene Expression Profiling , Neutrophils/immunology , Neutrophils/metabolism
ABSTRACT
SUMMARY: We introduce Eliater, a Python package for estimating the effect of perturbing an upstream molecule on a downstream molecule in a biomolecular network. The estimation takes as input a biomolecular network, observational biomolecular data, and a perturbation of interest, and outputs an estimated quantitative effect of the perturbation. We showcase the functionalities of Eliater in a case study of the Escherichia coli transcriptional regulatory network. AVAILABILITY AND IMPLEMENTATION: The code, the documentation, and several case studies are available open source at https://github.com/y0-causal-inference/eliater.
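The kind of estimate described here rests on adjustment-based causal inference. A minimal sketch of the backdoor adjustment formula on toy discrete observational data follows (the samples and function are hypothetical and do not use Eliater's actual API):

```python
def backdoor_effect(samples, x_val):
    """P(Y=1 | do(X=x_val)) = sum_z P(Y=1 | X=x_val, Z=z) * P(Z=z),
    where Z is an observed confounder of X and Y."""
    n = len(samples)
    total = 0.0
    for z in {s[0] for s in samples}:
        in_z = [s for s in samples if s[0] == z]
        p_z = len(in_z) / n
        matching = [s for s in in_z if s[1] == x_val]
        if not matching:
            continue  # stratum lacks data for this intervention level
        p_y1 = sum(s[2] for s in matching) / len(matching)
        total += p_y1 * p_z
    return total

# Synthetic observational samples (z, x, y): z confounds x and y.
samples = [
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
    (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 0, 0),
]

# Average causal effect of the perturbation X: 0 -> 1.
ace = backdoor_effect(samples, 1) - backdoor_effect(samples, 0)
```

Stratifying on the confounder before averaging is what distinguishes the causal estimate from the naive conditional probability difference.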
Subject(s)
Escherichia coli , Gene Regulatory Networks , Software , Escherichia coli/genetics , Escherichia coli/metabolism , Computational Biology/methods
ABSTRACT
MOTIVATION: Biomedical identifier resources (such as ontologies, taxonomies, and controlled vocabularies) commonly overlap in scope and contain equivalent entries under different identifiers. Maintaining mappings between these entries is crucial for interoperability and the integration of data and knowledge. However, there are substantial gaps in available mappings motivating their semi-automated curation. RESULTS: Biomappings implements a curation workflow for missing mappings that combines automated prediction with human-in-the-loop curation. It supports multiple prediction approaches and provides a web-based user interface for reviewing predicted mappings for correctness, combined with automated consistency checking. Predicted and curated mappings are made available in public, version-controlled resource files on GitHub. Biomappings currently makes available 9,274 curated mappings and 40,691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts. We demonstrate the value of Biomappings on case studies involving predicting and curating missing mappings among cancer cell lines as well as small molecules tested in clinical trials. We also present how previously missing mappings curated using Biomappings were contributed back to multiple widely used community ontologies. AVAILABILITY AND IMPLEMENTATION: The data and code are available under the CC0 and MIT licenses at https://github.com/biopragmatics/biomappings.
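The prediction side of such a workflow can be sketched as lexical matching between two identifier resources, with predictions held for human review (a minimal illustration; the resource contents and identifiers are hypothetical):

```python
def normalize(name):
    """Lexical normalization: lowercase and strip non-alphanumerics."""
    return "".join(c for c in name.lower() if c.isalnum())

def predict_mappings(resource_a, resource_b):
    """Predict equivalences between two identifier resources by
    matching normalized names; each prediction awaits human curation."""
    index = {}
    for identifier, name in resource_b.items():
        index.setdefault(normalize(name), []).append(identifier)
    predictions = []
    for identifier, name in resource_a.items():
        for match in index.get(normalize(name), []):
            predictions.append((identifier, match, "skos:exactMatch"))
    return predictions

# Two toy resources naming the same chemical under different identifiers.
resource_a = {"A:1": "Ofloxacin"}
resource_b = {"B:9": "ofloxacin", "B:2": "Water"}
preds = predict_mappings(resource_a, resource_b)
```

A curator would then confirm or reject each predicted triple before it enters the version-controlled curated set.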
Subject(s)
Data Curation , Vocabulary, Controlled , Humans , Data Curation/methods , Software , User-Computer Interface
ABSTRACT
MOTIVATION: The investigation of sets of genes using biological pathways is a common task for researchers and is supported by a wide variety of software tools. This type of analysis generates hypotheses about the biological processes that are active or modulated in a specific experimental context. RESULTS: The Network Data Exchange Integrated Query (NDEx IQuery) is a new tool for network and pathway-based gene set interpretation that complements or extends existing resources. It combines novel sources of pathways, integration with Cytoscape, and the ability to store and share analysis results. The NDEx IQuery web application performs multiple gene set analyses based on diverse pathways and networks stored in NDEx. These include curated pathways from WikiPathways and SIGNOR, published pathway figures from the last 27 years, machine-assembled networks using the INDRA system, and the new NCI-PID v2.0, an updated version of the popular NCI Pathway Interaction Database. NDEx IQuery's integration with MSigDB and cBioPortal now provides pathway analysis in the context of these two resources. AVAILABILITY AND IMPLEMENTATION: NDEx IQuery is available at https://www.ndexbio.org/iquery and is implemented in JavaScript and Java.
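Gene set interpretation of this kind typically rests on an overlap statistic between the query genes and each stored pathway. A minimal sketch of a hypergeometric enrichment test follows (gene names and universe size illustrative; this is not IQuery's implementation):

```python
from math import comb

def hypergeom_pvalue(overlap, query_size, pathway_size, universe):
    """P(X >= overlap) under the hypergeometric null: how surprising
    the observed overlap between a query gene set and a pathway is."""
    return sum(
        comb(pathway_size, k) * comb(universe - pathway_size, query_size - k)
        for k in range(overlap, min(query_size, pathway_size) + 1)
    ) / comb(universe, query_size)

# Toy query and pathway gene sets over a ~20,000-gene universe.
query = {"TP53", "MDM2", "CDKN1A"}
pathway = {"TP53", "MDM2", "CDKN1A", "ATM"}
p = hypergeom_pvalue(len(query & pathway), len(query), len(pathway), 20000)
```

Ranking pathways by this p-value (with multiple-testing correction in practice) yields the candidate processes active in the experiment.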
Subject(s)
Computational Biology , Software , Computational Biology/methods , Protein Interaction Maps , Publications , Databases, Factual , Internet
ABSTRACT
The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge, we show that it is possible to extend protein-protein interaction databases and explain co-dependencies in the Cancer Dependency Map.
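The overlap identification step can be sketched as grouping extracted statements by a matches-key, so that duplicate extractions of the same mechanism collapse into one record whose accumulated evidence can feed a reliability model (a simplified stand-in for INDRA's actual statement model):

```python
from collections import defaultdict

def assemble(statements):
    """Collapse duplicate extractions: statements with the same
    (subject, relation, object) key merge, pooling their evidence
    sources so downstream models can weigh support per mechanism."""
    grouped = defaultdict(list)
    for subj, rel, obj, source in statements:
        grouped[(subj, rel, obj)].append(source)
    return dict(grouped)

# Toy extractions from two readers and one pathway database.
raw = [
    ("BRAF", "phosphorylates", "MAP2K1", "reader1:paperA"),
    ("BRAF", "phosphorylates", "MAP2K1", "reader2:paperB"),
    ("MAP2K1", "phosphorylates", "MAPK1", "pathway_db"),
]
assembled = assemble(raw)
```

Real assembly additionally detects partial overlaps (e.g., a statement lacking a phosphorylation site refines into a more specific one), which this sketch omits.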
Subject(s)
Data Mining , Natural Language Processing , Humans , Reproducibility of Results , Databases, Factual
ABSTRACT
MOTIVATION: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited. RESULTS: To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. AVAILABILITY AND IMPLEMENTATION: We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). 
The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Pattern Recognition, Automated , Software , Machine Learning , Natural Language Processing , Publications
ABSTRACT
Biological systems are acknowledged to be robust to perturbations but a rigorous understanding of this has been elusive. In a mathematical model, perturbations often exert their effect through parameters, so sizes and shapes of parametric regions offer an integrated global estimate of robustness. Here, we explore this "parameter geography" for bistability in post-translational modification (PTM) systems. We use the previously developed "linear framework" for timescale separation to describe the steady-states of a two-site PTM system as the solutions of two polynomial equations in two variables, with eight non-dimensional parameters. Importantly, this approach allows us to accommodate enzyme mechanisms of arbitrary complexity beyond the conventional Michaelis-Menten scheme, which unrealistically forbids product rebinding. We further use the numerical algebraic geometry tools Bertini, Paramotopy, and alphaCertified to statistically assess the solutions to these equations at ∼10⁹ parameter points in total. Subject to sampling limitations, we find no bistability when substrate amount is below a threshold relative to enzyme amounts. As substrate increases, the bistable region acquires 8-dimensional volume which increases in an apparently monotonic and sigmoidal manner towards saturation. The region remains connected but not convex, albeit with a high visibility ratio. Surprisingly, the saturating bistable region occupies a much smaller proportion of the sampling domain under mechanistic assumptions more realistic than the Michaelis-Menten scheme. We find that bistability is compromised by product rebinding and that unrealistic assumptions on enzyme mechanisms have obscured its parametric rarity. The apparent monotonic increase in volume of the bistable region remains perplexing because the region itself does not grow monotonically: parameter points can move back and forth between monostability and bistability.
We suggest mathematical conjectures and questions arising from these findings. Advances in theory and software now permit insights into parameter geography to be uncovered by high-dimensional, data-centric analysis.
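The volume estimates underlying such parameter geography can be illustrated with a Monte Carlo sketch: sample parameter points in a bounded domain and test each against a membership predicate. The predicate below is a stand-in (a ball in an 8-dimensional unit cube), not the actual two-polynomial bistability test:

```python
import random

def region_volume_fraction(predicate, dim, n_samples, seed=0):
    """Estimate the fraction of a unit hypercube of (non-dimensional)
    parameters where `predicate` holds, e.g. where steady-state
    equations admit multiple stable solutions (bistability)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = sum(
        predicate([rng.random() for _ in range(dim)])
        for _ in range(n_samples)
    )
    return hits / n_samples

# Stand-in region: a ball of radius 0.5 centred in the 8-dim cube,
# whose true volume fraction is pi^4/24 * 0.5^8 (about 1.6%).
inside = lambda p: sum((x - 0.5) ** 2 for x in p) < 0.25
frac = region_volume_fraction(inside, dim=8, n_samples=20000)
```

In the study itself, each sampled point requires certified polynomial solving (Bertini/alphaCertified) rather than a closed-form predicate, which is what makes the ∼10⁹-point assessment computationally substantial.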
Subject(s)
Computational Biology/methods , Protein Processing, Post-Translational/physiology , Algorithms , Gene Expression/genetics , Gene Expression/physiology , Gene Regulatory Networks/genetics , Gene Regulatory Networks/physiology , Models, Biological , Models, Theoretical , Protein Processing, Post-Translational/genetics
ABSTRACT
SUMMARY: INDRA-IPM (Interactive Pathway Map) is a web-based pathway map modeling tool that combines natural language processing with automated model assembly and visualization. INDRA-IPM contextualizes models with expression data and exports them to standard formats. AVAILABILITY AND IMPLEMENTATION: INDRA-IPM is available at: http://pathwaymap.indra.bio. Source code is available at http://github.com/sorgerlab/indra_pathway_map. The underlying web service API is available at http://api.indra.bio:8000. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Computers , Software , Natural Language Processing
ABSTRACT
BACKGROUND: For automated reading of scientific publications to extract useful information about molecular mechanisms it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or "grounding." Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of machine-readable resources associating synonyms and abbreviations commonly found in biomedical literature with uniform identifiers. RESULTS: In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multi-protein families (e.g., "AKT") and complexes with multiple subunits (e.g., "NF-κB"). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15% to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins.
Applications of FamPlex to the TRIPS/DRUM reading system and the Biocreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings. CONCLUSION: FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.
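The hierarchical grounding FamPlex enables can be sketched with a miniature synonym table and family membership map (entries and identifiers illustrative, not FamPlex's actual content):

```python
# Toy FamPlex-style resource: text synonyms map to family-level
# identifiers, and families list their gene-level constituents.
synonyms = {"AKT": "FPLX:AKT", "NF-kB": "FPLX:NFkB"}
members = {
    "FPLX:AKT": ["AKT1", "AKT2", "AKT3"],
    "FPLX:NFkB": ["RELA", "RELB", "REL", "NFKB1", "NFKB2"],
}

def ground(text):
    """Ground a text string to a family identifier, or None."""
    return synonyms.get(text)

def expand(identifier):
    """Expand a family identifier to its gene-level constituents,
    letting family-level statements connect to gene-level data."""
    return members.get(identifier, [identifier])

fam = ground("AKT")
genes = expand(fam)
```

Grounding "AKT" to a family rather than (incorrectly) to a single gene is what lets mechanistic information flow between family-level text statements and gene-level datasets.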
Subject(s)
Data Mining/methods , Proteins/metabolism , Humans
ABSTRACT
Word models (natural language descriptions of molecular mechanisms) are a common currency in spoken and written communication in biomedicine but are of limited use in predicting the behavior of complex biological networks. We present an approach to building computational models directly from natural language using automated assembly. Molecular mechanisms described in simple English are read by natural language processing algorithms, converted into an intermediate representation, and assembled into executable or network models. We have implemented this approach in the Integrated Network and Dynamical Reasoning Assembler (INDRA), which draws on existing natural language processing systems as well as pathway information in Pathway Commons and other online resources. We demonstrate the use of INDRA and natural language to model three biological processes of increasing scope: (i) p53 dynamics in response to DNA damage, (ii) adaptive drug resistance in BRAF-V600E-mutant melanomas, and (iii) the RAS signaling pathway. The use of natural language makes the task of developing a model more efficient and it increases model transparency, thereby promoting collaboration with the broader biology community.
Subject(s)
Gene Expression Regulation, Neoplastic , Melanoma/genetics , Models, Genetic , Natural Language Processing , Neural Networks, Computer , Skin Neoplasms/genetics , Antineoplastic Agents/therapeutic use , Cell Line, Tumor , Computer Simulation , DNA Damage , Drug Resistance, Neoplasm/genetics , Enzyme Inhibitors/therapeutic use , Humans , Indoles/therapeutic use , Language , Melanoma/drug therapy , Melanoma/metabolism , Melanoma/pathology , Proto-Oncogene Proteins B-raf/genetics , Proto-Oncogene Proteins B-raf/metabolism , Proto-Oncogene Proteins p21(ras)/genetics , Proto-Oncogene Proteins p21(ras)/metabolism , Signal Transduction , Skin Neoplasms/drug therapy , Skin Neoplasms/metabolism , Skin Neoplasms/pathology , Sulfonamides/therapeutic use , Tumor Suppressor Protein p53/genetics , Tumor Suppressor Protein p53/metabolism , Vemurafenib
ABSTRACT
Inflammatory pain, associated with tissue injury and infections, results from the heightened sensitivity of the peripheral terminals of nociceptor sensory neurons upon exposure to inflammatory mediators. Targeting immune-derived inflammatory ligands, like prostaglandin E2, has been effective in alleviating inflammatory pain. However, the diversity of immune cells and the vast array of ligands they produce make it challenging to systematically map all neuroimmune pathways that contribute to inflammatory pain. Here, we constructed a comprehensive and updatable database of receptor-ligand pairs and complemented it with single-cell transcriptomics of immune cells and sensory neurons in three distinct inflammatory pain conditions, to generate injury-specific neuroimmune interactomes. We identified cell-type-specific neuroimmune axes that are common, as well as unique, to different injury types. This approach successfully predicts neuroimmune pathways with established roles in inflammatory pain as well as ones not previously described. We found that thrombospondin-1, produced by myeloid cells in all three conditions, is a negative regulator of nociceptor sensitization, revealing a non-canonical role of immune ligands as endogenous reducers of peripheral sensitization. This computational platform lays the groundwork to identify novel mechanisms of immune-mediated peripheral sensitization and the specific disease contexts in which they act.
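The interactome construction can be sketched as joining a receptor-ligand pair list against per-cell-type expression sets (gene symbols, cell types, and expression assignments here are illustrative toy data):

```python
# Toy receptor-ligand pairs and per-cell-type expressed gene sets.
pairs = [("Thbs1", "Cd36"), ("Ngf", "Ntrk1"), ("Il6", "Il6r")]
immune_ligands = {"macrophage": {"Thbs1", "Ngf"}, "neutrophil": {"Thbs1"}}
neuron_receptors = {"nociceptor": {"Cd36", "Ntrk1"}}

def interactome(pairs, ligands_by_cell, receptors_by_neuron):
    """List (immune cell, ligand, receptor, neuron) axes where a
    ligand expressed by an immune cell has its receptor expressed
    by a sensory neuron population."""
    axes = []
    for ligand, receptor in pairs:
        for cell, ligands in ligands_by_cell.items():
            for neuron, receptors in receptors_by_neuron.items():
                if ligand in ligands and receptor in receptors:
                    axes.append((cell, ligand, receptor, neuron))
    return axes

axes = interactome(pairs, immune_ligands, neuron_receptors)
```

Repeating this join per injury condition, with expression sets derived from single-cell data, yields the injury-specific interactomes the abstract describes.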
ABSTRACT
Introduction: The COVID-19 Disease Map project is a large-scale community effort uniting 277 scientists from 130 institutions around the globe. We use high-quality, mechanistic content describing SARS-CoV-2-host interactions and develop interoperable bioinformatic pipelines for novel target identification and drug repurposing. Methods: Extensive community work allowed an impressive step forward in building interfaces between Systems Biology tools and platforms. Our framework can link biomolecules from omics data analysis and computational modelling to dysregulated pathways in a cell-, tissue- or patient-specific manner. Drug repurposing using text mining and AI-assisted analysis identified potential drugs, chemicals and microRNAs that could target the identified key factors. Results: The analysis revealed drugs already tested for anti-COVID-19 efficacy, providing a mechanistic context for their mode of action, and drugs already in clinical trials for other diseases that had never been tested against COVID-19. Discussion: The key advance is that the proposed framework is versatile and expandable, offering a significant upgrade in the arsenal for studying virus-host interactions and other complex pathologies.
Subject(s)
COVID-19 , Humans , SARS-CoV-2 , Drug Repositioning , Systems Biology , Computer Simulation
ABSTRACT
Summary: Gilda is a software tool and web service that implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species prioritization in case of ambiguity. Availability and implementation: The Gilda web service is available at http://grounding.indra.bio with source code, documentation and tutorials available via https://github.com/indralab/gilda. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
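Scored string matching of this kind can be sketched as ranking synonym matches by match quality (the toy synonym table and scoring scheme below are illustrative, not Gilda's actual algorithm or data):

```python
# Toy synonym table: (identifier, standard name, synonym).
terms = [
    ("HGNC:6407", "KRAS", "KRAS"),
    ("HGNC:6407", "KRAS", "K-Ras"),
    ("CHEBI:16240", "hydrogen peroxide", "H2O2"),
]

def ground(text):
    """Return (identifier, score) candidates for a text string:
    exact matches score higher than case-insensitive ones."""
    matches = []
    for identifier, name, synonym in terms:
        if synonym == text:
            matches.append((identifier, 1.0))
        elif synonym.lower() == text.lower():
            matches.append((identifier, 0.8))
    return sorted(matches, key=lambda m: -m[1])

best = ground("k-ras")
```

A production grounder layers further signals on top of this ranking, such as disambiguation models that use surrounding text and species priorities, as the abstract describes.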
ABSTRACT
BACKGROUND: Mechanistic data is increasingly used in hazard identification of chemicals. However, the volume of data is large, challenging the efficient identification and clustering of relevant data. OBJECTIVES: We investigated whether evidence identification for hazard assessment can become more efficient and informed through an automated approach that combines machine reading of publications with network visualization tools. METHODS: We chose 13 chemicals that were evaluated by the International Agency for Research on Cancer (IARC) Monographs program incorporating the key characteristics of carcinogens (KCCs) approach. Using established literature search terms for KCCs, we retrieved and analyzed literature using the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA combines large-scale literature processing with pathway databases and extracts relationships between biomolecules, bioprocesses, and chemicals into statements (e.g., "benzene activates DNA damage"). These statements were subsequently assembled into networks and compared with the KCC evaluation by the IARC, to evaluate the informativeness of our approach. RESULTS: We found, in general, larger networks for those chemicals for which the IARC evaluated the evidence for KCC induction as strong. Larger networks were not directly linked to publication count, given that we retrieved small networks for several chemicals with little support for KCC activation according to the IARC, despite the significant volume of literature for these specific chemicals. In addition, interpreting networks for genotoxicity and DNA repair showed concordance with the IARC KCC evaluation. DISCUSSION: Our method is an automated approach to condense mechanistic literature into searchable and interpretable networks based on an a priori ontology.
The approach is no replacement for expert evaluation but, instead, provides an informed structure for experts to quickly identify which statements are made in which papers and how these could connect. We focused on the KCCs because these are supported by well-described search terms. The method needs to be tested in other frameworks as well to demonstrate its generalizability. https://doi.org/10.1289/EHP9112.
Subject(s)
Carcinogens , Neoplasms , Benzene , Carcinogens/toxicity , Databases, Factual , Humans , Neoplasms/chemically induced , Neoplasms/epidemiology , Risk Assessment
ABSTRACT
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io, and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.
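One central use of such a metaregistry is validating compact identifiers (CURIEs) against per-prefix patterns. A minimal sketch follows (the prefix-to-pattern slice is abbreviated and illustrative, not the Bioregistry's actual records):

```python
import re

# Toy slice of a metaregistry: prefix -> regular expression that
# local identifiers under that prefix must match (patterns illustrative).
registry = {
    "hgnc": r"^\d{1,5}$",
    "chebi": r"^\d+$",
}

def validate_curie(curie):
    """Check a compact identifier (CURIE) of the form prefix:local
    against the registered pattern for its prefix."""
    prefix, _, local = curie.partition(":")
    pattern = registry.get(prefix.lower())
    return pattern is not None and re.match(pattern, local) is not None

ok = validate_curie("hgnc:6407")
bad = validate_curie("hgnc:abc")
```

Standardized prefixes plus machine-checkable patterns are what let annotations from different sources be merged without identifier collisions.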
ABSTRACT
Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half of the proteins predicted from these genomes carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Subject(s)
Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation
ABSTRACT
Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Or are they associated in some other way? Such relationships between the mapped terms are often not documented, which leads to incorrect assumptions and makes them hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Furthermore, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. We have developed the Simple Standard for Sharing Ontological Mappings (SSSOM), which addresses these problems by: (i) Introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit. (ii) Defining an easy-to-use simple table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data principles. (iii) Implementing open and community-driven collaborative workflows that are designed to evolve the standard continuously to address changing requirements and mapping practices. (iv) Providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases in detail and survey some of the existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable and Reusable (FAIR). The SSSOM specification can be found at http://w3id.org/sssom/spec.
Database URL: http://w3id.org/sssom/spec.
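Because SSSOM is a simple table-based format, mappings can be consumed with ordinary data tooling rather than ontology parsers. A sketch that filters a mapping table to exact matches follows (the identifiers and justification values in the toy table are illustrative):

```python
import csv
import io

# A minimal SSSOM-style mapping table (tab-separated; core columns
# shown, identifiers and justifications are toy values).
tsv = """subject_id\tpredicate_id\tobject_id\tmapping_justification
mesh:D000001\tskos:exactMatch\tchebi:0000001\tsemapv:ManualMappingCuration
doid:0000002\tskos:broadMatch\tmesh:D000003\tsemapv:LexicalMatching
"""

def exact_mappings(text):
    """Keep only mappings safe for merging records across databases:
    skos:exactMatch, with the curation justification retained."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["subject_id"], row["object_id"], row["mapping_justification"])
        for row in reader
        if row["predicate_id"] == "skos:exactMatch"
    ]

exact = exact_mappings(tsv)
```

Filtering on the mapping predicate before merging is exactly the precision-critical step the abstract motivates: a broadMatch used as if it were an exactMatch silently corrupts integrated data.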
Subject(s)
Metadata , Semantic Web , Data Management , Databases, Factual , Workflow
ABSTRACT
A bottleneck in high-throughput functional genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Gene Ontology (GO) enrichment methods provide insight at the gene set level. Here, we introduce GeneWalk (github.com/churchmanlab/genewalk), which identifies individual genes and their relevant functions critical for the experimental setting under examination. After the automatic assembly of an experiment-specific gene regulatory network, GeneWalk uses representation learning to quantify the similarity between vector representations of each gene and its GO annotations, yielding annotation significance scores that reflect the experimental context. By performing gene- and condition-specific functional analysis, GeneWalk converts a list of genes into data-driven hypotheses.
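The similarity quantification step can be sketched as cosine similarity between embedding vectors of a gene and its GO annotations (the vectors below are made-up toy values, not GeneWalk's trained representations):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy network-embedding vectors for one gene and two of its GO
# annotations; the higher-similarity term is more relevant in context.
gene = [0.9, 0.1, 0.2]
go_terms = {
    "GO:0006281": [0.8, 0.2, 0.1],  # DNA repair
    "GO:0007049": [0.1, 0.9, 0.3],  # cell cycle
}
ranked = sorted(go_terms, key=lambda t: -cosine(gene, go_terms[t]))
```

GeneWalk additionally converts such similarities into significance scores against a null model, which this sketch omits.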
Subject(s)
Databases, Genetic , Gene Regulatory Networks , Animals , Biflavonoids , Brain , Gene Ontology , High-Throughput Nucleotide Sequencing , Humans , Mice , RNA-Seq , Transcriptome
ABSTRACT
Making the knowledge contained in scientific papers machine-readable and formally computable would allow researchers to take full advantage of this information by enabling integration with other knowledge sources to support data analysis and interpretation. Here we describe Biofactoid, a web-based platform that allows scientists to specify networks of interactions between genes, their products, and chemical compounds, and then translates this information into a representation suitable for computational analysis, search and discovery. We also report the results of a pilot study to encourage the wide adoption of Biofactoid by the scientific community.