Search | VHL Search Portal

1.

Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning.

Caufield, J Harry; Hegde, Harshad; Emonet, Vincent; Harris, Nomi L; Joachimiak, Marcin P; Matentzoglu, Nicolas; Kim, HyeongSik; Moxon, Sierra; Reese, Justin T; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J.

Bioinformatics ; 40(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.

Subject(s)

Knowledge Bases , Semantics , Databases, Factual

2.

KG-Hub: building and exchanging biological knowledge graphs.

Caufield, J Harry; Putman, Tim; Schaper, Kevin; Unni, Deepak R; Hegde, Harshad; Callahan, Tiffany J; Cappelletti, Luca; Moxon, Sierra A T; Ravanmehr, Vida; Carbon, Seth; Chan, Lauren E; Cortes, Katherina; Shefchek, Kent A; Elsarboukh, Glass; Balhoff, Jim; Fontana, Tommaso; Matentzoglu, Nicolas; Bruskiewich, Richard M; Thessen, Anne E; Harris, Nomi L; Munoz-Torres, Monica C; Haendel, Melissa A; Robinson, Peter N; Joachimiak, Marcin P; Mungall, Christopher J; Reese, Justin T.

Bioinformatics ; 39(7), 2023 07 01.

Article in English | MEDLINE | ID: mdl-37389415

ABSTRACT

MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.

Subject(s)

Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning

3.

Developing a Knowledge Graph for Pharmacokinetic Natural Product-Drug Interactions.

Taneja, Sanya B; Callahan, Tiffany J; Paine, Mary F; Kane-Gill, Sandra L; Kilicoglu, Halil; Joachimiak, Marcin P; Boyce, Richard D.

J Biomed Inform ; 140: 104341, 2023 04.

Article in English | MEDLINE | ID: mdl-36933632

ABSTRACT

BACKGROUND: Pharmacokinetic natural product-drug interactions (NPDIs) occur when botanical or other natural products are co-consumed with pharmaceutical drugs. With the growing use of natural products, the risk for potential NPDIs and consequent adverse events has increased. Understanding mechanisms of NPDIs is key to preventing or minimizing adverse events. Although biomedical knowledge graphs (KGs) have been widely used for drug-drug interaction applications, computational investigation of NPDIs is novel. We constructed NP-KG as a first step toward computational discovery of plausible mechanistic explanations for pharmacokinetic NPDIs that can be used to guide scientific research. METHODS: We developed a large-scale, heterogeneous KG with biomedical ontologies, linked data, and full texts of the scientific literature. To construct the KG, biomedical ontologies and drug databases were integrated with the Phenotype Knowledge Translator framework. The semantic relation extraction systems, SemRep and Integrated Network and Dynamic Reasoning Assembler, were used to extract semantic predications (subject-relation-object triples) from full texts of the scientific literature related to the exemplar natural products green tea and kratom. A literature-based graph constructed from the predications was integrated into the ontology-grounded KG to create NP-KG. NP-KG was evaluated with case studies of pharmacokinetic green tea- and kratom-drug interactions through KG path searches and meta-path discovery to determine congruent and contradictory information in NP-KG compared to ground truth data. We also conducted an error analysis to identify knowledge gaps and incorrect predications in the KG. RESULTS: The fully integrated NP-KG consisted of 745,512 nodes and 7,249,576 edges. 
Evaluation of NP-KG resulted in congruent (38.98% for green tea, 50% for kratom), contradictory (15.25% for green tea, 21.43% for kratom), and both congruent and contradictory (15.25% for green tea, 21.43% for kratom) information compared to ground truth data. Potential pharmacokinetic mechanisms for several purported NPDIs, including the green tea-raloxifene, green tea-nadolol, kratom-midazolam, kratom-quetiapine, and kratom-venlafaxine interactions were congruent with the published literature. CONCLUSION: NP-KG is the first KG to integrate biomedical ontologies with full texts of the scientific literature focused on natural products. We demonstrate the application of NP-KG to identify known pharmacokinetic interactions between natural products and pharmaceutical drugs mediated by drug metabolizing enzymes and transporters. Future work will incorporate context, contradiction analysis, and embedding-based methods to enrich NP-KG. NP-KG is publicly available at https://doi.org/10.5281/zenodo.6814507. The code for relation extraction, KG construction, and hypothesis generation is available at https://github.com/sanyabt/np-kg.

Subject(s)

Biological Ontologies , Biological Products , Pattern Recognition, Automated , Drug Interactions , Semantics , Pharmaceutical Preparations

4.

The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species.

Shefchek, Kent A; Harris, Nomi L; Gargano, Michael; Matentzoglu, Nicolas; Unni, Deepak; Brush, Matthew; Keith, Daniel; Conlin, Tom; Vasilevsky, Nicole; Zhang, Xingmin Aaron; Balhoff, James P; Babb, Larry; Bello, Susan M; Blau, Hannah; Bradford, Yvonne; Carbon, Seth; Carmody, Leigh; Chan, Lauren E; Cipriani, Valentina; Cuzick, Alayne; Della Rocca, Maria; Dunn, Nathan; Essaid, Shahim; Fey, Petra; Grove, Chris; Gourdine, Jean-Phillipe; Hamosh, Ada; Harris, Midori; Helbig, Ingo; Hoatlin, Maureen; Joachimiak, Marcin; Jupp, Simon; Lett, Kenneth B; Lewis, Suzanna E; McNamara, Craig; Pendlington, Zoë M; Pilgrim, Clare; Putman, Tim; Ravanmehr, Vida; Reese, Justin; Riggs, Erin; Robb, Sofia; Roncaglia, Paola; Seager, James; Segerdell, Erik; Similuk, Morgan; Storm, Andrea L; Thaxon, Courtney; Thessen, Anne; Jacobsen, Julius O B.

Nucleic Acids Res ; 48(D1): D704-D715, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31701156

ABSTRACT

In biology and biomedicine, relating phenotypic outcomes with genetic variation and environmental factors remains a challenge: patient phenotypes may not match known diseases, candidate variants may be in genes that haven't been characterized, research organisms may not recapitulate human or veterinary diseases, environmental factors affecting disease outcomes are unknown or undocumented, and many resources must be queried to find potentially significant phenotypic associations. The Monarch Initiative (https://monarchinitiative.org) integrates information on genes, variants, genotypes, phenotypes and diseases in a variety of species, and allows powerful ontology-based search. We develop many widely adopted ontologies that together enable sophisticated computational analysis, mechanistic discovery and diagnostics of Mendelian diseases. Our algorithms and tools are widely used to identify animal models of human disease through phenotypic similarity, for differential diagnostics and to facilitate translational research. Launched in 2015, Monarch has grown with regard to data (new organisms, more sources, better modeling); new API and standards; ontologies (new Mondo unified disease ontology, improvements to ontologies such as HPO and uPheno); user interface (a redesigned website); and community development. Monarch data, algorithms and tools are being used and extended by resources such as GA4GH and NCATS Translator, among others, to aid mechanistic discovery and diagnostics.

Subject(s)

Computational Biology/methods , Genotype , Phenotype , Algorithms , Animals , Biological Ontologies , Databases, Genetic , Exome , Genetic Association Studies , Genetic Variation , Genomics , Humans , Internet , Software , Translational Research, Biomedical , User-Computer Interface

5.

A bacterial sensor taxonomy across earth ecosystems for machine learning applications.

Park, Helen; Joachimiak, Marcin P; Jungbluth, Sean P; Yang, Ziming; Riehl, William J; Canon, R Shane; Arkin, Adam P; Dehal, Paramvir S.

mSystems ; 9(1): e0002623, 2024 Jan 23.

Article in English | MEDLINE | ID: mdl-38078749

ABSTRACT

Microbial communities have evolved to colonize all ecosystems of the planet, from the deep sea to the human gut. Microbes survive by sensing, responding, and adapting to immediate environmental cues. This process is driven by signal transduction proteins such as histidine kinases, which use their sensing domains to bind or otherwise detect environmental cues and "transduce" signals to adjust internal processes. We hypothesized that an ecosystem's unique stimuli leave a sensor "fingerprint," able to identify and shed insight on ecosystem conditions. To test this, we collected 20,712 publicly available metagenomes from Host-associated, Environmental, and Engineered ecosystems across the globe. We extracted and clustered the collection's nearly 18M unique sensory domains into 113,712 similar groupings with MMseqs2. We built gradient-boosted decision tree machine learning models and found we could classify the ecosystem type (accuracy: 87%) and predict the levels of different physical parameters (R² score: 83%) using the sensor cluster abundance as features. Feature importance enables identification of the most predictive sensors to differentiate between ecosystems which can lead to mechanistic interpretations if the sensor domains are well annotated. To demonstrate this, a machine learning model was trained to predict a patient's disease state and used to identify domains related to oxygen sensing present in a healthy gut but missing in patients with abnormal conditions. Moreover, since 98.7% of identified sensor domains are uncharacterized, importance ranking can be used to prioritize sensors to determine what ecosystem function they may be sensing. Furthermore, these new predictive sensors can function as targets for novel sensor engineering with applications in biotechnology, ecosystem maintenance, and medicine. IMPORTANCE: Microbes infect, colonize, and proliferate due to their ability to sense and respond quickly to their surroundings.
In this research, we extract the sensory proteins from a diverse range of environmental, engineered, and host-associated metagenomes. We trained machine learning classifiers using sensors as features such that it is possible to predict the ecosystem for a metagenome from its sensor profile. We use the optimized model's feature importance to identify the most impactful and predictive sensors in different environments. We next use the sensor profile from human gut metagenomes to classify their disease states and explore which sensors can explain differences between diseases. The sensors most predictive of environmental labels here, most of which correspond to uncharacterized proteins, are a useful starting point for the discovery of important environment signals and the development of possible diagnostic interventions.

Subject(s)

Metagenomics , Microbiota , Humans , Metagenome , Machine Learning , Earth, Planet

6.

Integrating biological knowledge for mechanistic inference in the host-associated microbiome.

Santangelo, Brook E; Apgar, Madison; Colorado, Angela Sofia Burkhart; Martin, Casey G; Sterrett, John; Wall, Elena; Joachimiak, Marcin P; Hunter, Lawrence E; Lozupone, Catherine A.

Front Microbiol ; 15: 1351678, 2024.

Article in English | MEDLINE | ID: mdl-38638909

ABSTRACT

Advances in high-throughput technologies have enhanced our ability to describe microbial communities as they relate to human health and disease. Alongside the growth in sequencing data has come an influx of resources that synthesize knowledge surrounding microbial traits, functions, and metabolic potential with knowledge of how they may impact host pathways to influence disease phenotypes. These knowledge bases can enable the development of mechanistic explanations that may underlie correlations detected between microbial communities and disease. In this review, we survey existing resources and methodologies for the computational integration of broad classes of microbial and host knowledge. We evaluate these knowledge bases in their access methods, content, and source characteristics. We discuss challenges of the creation and utilization of knowledge bases including inconsistency of nomenclature assignment of taxa and metabolites across sources, whether the biological entities represented are rooted in ontologies or taxonomies, and how the structure and accessibility limit the diversity of applications and user types. We make this information available in a code and data repository at: https://github.com/lozuponelab/knowledge-source-mappings. Addressing these challenges will allow for the development of more effective tools for drawing from abundant knowledge to find new insights into microbial mechanisms in disease by fostering a systematic and unbiased exploration of existing information.

7.

Estimating geographic variation of infection fatality ratios during epidemics.

Ladau, Joshua; Brodie, Eoin L; Falco, Nicola; Bansal, Ishan; Hoffman, Elijah B; Joachimiak, Marcin P; Mora, Ana M; Walker, Angelica M; Wainwright, Haruko M; Wu, Yulun; Pavicic, Mirko; Jacobson, Daniel; Hess, Matthias; Brown, James B; Abuabara, Katrina.

Infect Dis Model ; 9(2): 634-643, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38572058

ABSTRACT

Objectives: We aim to estimate geographic variability in total numbers of infections and infection fatality ratios (IFR; the number of deaths caused by an infection per 1,000 infected people) when the availability and quality of data on disease burden are limited during an epidemic. Methods: We develop a noncentral hypergeometric framework that accounts for differential probabilities of positive tests and reflects the fact that symptomatic people are more likely to seek testing. We demonstrate the robustness, accuracy, and precision of this framework, and apply it to the United States (U.S.) COVID-19 pandemic to estimate county-level SARS-CoV-2 IFRs. Results: The estimators for the numbers of infections and IFRs showed high accuracy and precision; for instance, when applied to simulated validation data sets, across counties, Pearson correlation coefficients between estimator means and true values were 0.996 and 0.928, respectively, and they showed strong robustness to model misspecification. Applying the county-level estimators to the real, unsimulated COVID-19 data spanning April 1, 2020 to September 30, 2020 from across the U.S., we found that IFRs varied from 0 to 44.69, with a standard deviation of 3.55 and a median of 2.14. Conclusions: The proposed estimation framework can be used to identify geographic variation in IFRs across settings.
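
The IFR definition used in the abstract above (deaths per 1,000 infected people) can be illustrated with a minimal sketch. It shows only the ratio itself, not the noncentral hypergeometric estimator the authors develop; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def infection_fatality_ratio(deaths: int, infections: int, per: int = 1000) -> float:
    """Return deaths per `per` infected people (the abstract uses per 1,000).

    Hypothetical helper for illustration only: the paper estimates the number
    of infections with a noncentral hypergeometric framework, which is not
    reproduced here. This function simply applies the ratio definition.
    """
    if infections <= 0:
        raise ValueError("infections must be a positive count")
    if deaths < 0 or deaths > infections:
        raise ValueError("deaths must lie between 0 and infections")
    return per * deaths / infections


# Invented counts for one county: 214 deaths among 100,000 estimated infections.
example_ifr = infection_fatality_ratio(deaths=214, infections=100_000)
print(f"IFR per 1,000 infected: {example_ifr:.2f}")
```

In practice the hard part is the denominator, since the number of infections is not observed directly; that estimation step is what the abstract's framework addresses.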

8.

An open source knowledge graph ecosystem for the life sciences.

Callahan, Tiffany J; Tripodi, Ignacio J; Stefanski, Adrianne L; Cappelletti, Luca; Taneja, Sanya B; Wyrwa, Jordan M; Casiraghi, Elena; Matentzoglu, Nicolas A; Reese, Justin; Silverstein, Jonathan C; Hoyt, Charles Tapley; Boyce, Richard D; Malec, Scott A; Unni, Deepak R; Joachimiak, Marcin P; Robinson, Peter N; Mungall, Christopher J; Cavalleri, Emanuele; Fontana, Tommaso; Valentini, Giorgio; Mesiti, Marco; Gillenwater, Lucas A; Santangelo, Brook; Vasilevsky, Nicole A; Hoehndorf, Robert; Bennett, Tellen D; Ryan, Patrick B; Hripcsak, George; Kahn, Michael G; Bada, Michael; Baumgartner, William A; Hunter, Lawrence E.

Sci Data ; 11(1): 363, 2024 Apr 11.

Article in English | MEDLINE | ID: mdl-38605048

ABSTRACT

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Subject(s)

Biological Science Disciplines , Knowledge Bases , Pattern Recognition, Automated , Algorithms , Translational Research, Biomedical

9.

Mutual information analysis reveals coevolving residues in Tat that compensate for two distinct functions in HIV-1 gene expression.

Dey, Siddharth S; Xue, Yuhua; Joachimiak, Marcin P; Friedland, Gregory D; Burnett, John C; Zhou, Qiang; Arkin, Adam P; Schaffer, David V.

J Biol Chem ; 287(11): 7945-55, 2012 Mar 09.

Article in English | MEDLINE | ID: mdl-22253435

ABSTRACT

Viral genomes are continually subjected to mutations, and functionally deleterious ones can be rescued by reversion or additional mutations that restore fitness. The error-prone nature of HIV-1 replication has resulted in highly diverse viral sequences, and it is not clear how viral proteins such as Tat, which plays a critical role in viral gene expression and replication, retain their complex functions. Although several important amino acid positions in Tat are conserved, we hypothesized that it may also harbor functionally important residues that may not be individually conserved yet appear as correlated pairs, whose analysis could yield new mechanistic insights into Tat function and evolution. To identify such sites, we combined mutual information analysis and experimentation to identify coevolving positions and found that residues 35 and 39 are strongly correlated. Mutation of either residue of this pair into amino acids that appear in numerous viral isolates yields a defective virus; however, simultaneous introduction of both mutations into the heterologous Tat sequence restores gene expression close to wild-type Tat. Furthermore, in contrast to most coevolving protein residues that contribute to the same function, structural modeling and biochemical studies showed that these two residues contribute to two mechanistically distinct steps in gene expression: binding P-TEFb and promoting P-TEFb phosphorylation of the C-terminal domain in RNAPII. Moreover, Tat variants that mimic HIV-1 subtypes B or C at sites 35 and 39 have evolved orthogonal strengths of P-TEFb binding versus RNAPII phosphorylation, suggesting that subtypes have evolved alternate transcriptional strategies to achieve similar gene expression levels.

Subject(s)

Evolution, Molecular , Gene Expression Regulation, Viral/physiology , HIV-1/physiology , Mutation/physiology , Virus Replication/physiology , tat Gene Products, Human Immunodeficiency Virus/metabolism , Genome, Viral/physiology , HEK293 Cells , HeLa Cells , Humans , Phosphorylation/physiology , Positive Transcriptional Elongation Factor B/genetics , Positive Transcriptional Elongation Factor B/metabolism , Protein Structure, Tertiary , RNA Polymerase II/genetics , RNA Polymerase II/metabolism , Species Specificity , tat Gene Products, Human Immunodeficiency Virus/genetics
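
The mutual information analysis mentioned in the abstract of entry 9 can be sketched as follows. This is a minimal illustration of the standard plug-in mutual information between two alignment columns, assuming ungapped columns of equal length; the abstract does not state which estimator, corrections, or alignment the authors used, so the function name and the toy columns below are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a: str, col_b: str) -> float:
    """Plug-in mutual information (in bits) between two alignment columns.

    Each string holds one residue per sequence, in the same sequence order.
    Illustrative only: no gap handling, pseudocounts, or background
    correction, all of which a real coevolution analysis would likely need.
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    counts_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), joint in counts_ab.items():
        p_ab = joint / n
        p_a = counts_a[a] / n
        p_b = counts_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy columns (invented): residues that always change together share 1 bit,
# while residues that vary independently share none.
print(column_mutual_information("AAGG", "TTCC"))
print(column_mutual_information("AAGG", "TCTC"))
```

Pairs of positions with high mutual information are candidates for coevolution, which is the kind of signal the abstract reports for residues 35 and 39.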

10.

An SF1 affinity model to identify branch point sequences in human introns.

Pastuszak, Alexander W; Joachimiak, Marcin P; Blanchette, Marco; Rio, Donald C; Brenner, Steven E; Frankel, Alan D.

Nucleic Acids Res ; 39(6): 2344-56, 2011 Mar.

Article in English | MEDLINE | ID: mdl-21071404

ABSTRACT

Splicing factor 1 (SF1) binds to the branch point sequence (BPS) of mammalian introns and is believed to be important for the splicing of some, but not all, introns. To help identify BPSs, particularly those that depend on SF1, we generated a BPS profile model in which SF1 binding affinity data, validated by branch point mapping, were iteratively incorporated into computational models. We searched a data set of 117,499 human introns for best matches to the SF1 Affinity Model above a threshold, and counted the number of matches at each intronic position. After subtracting a background value, we found that 87.9% of remaining high-scoring matches identified were located in a region upstream of 3'-splice sites where BPSs are typically found. Since U2AF65 recognizes the polypyrimidine tract (PPT) and forms a cooperative RNA complex with SF1, we combined the SF1 model with a PPT model computed from high affinity binding sequences for U2AF65. The combined model, together with binding site location constraints, accurately identified introns bound by SF1 that are candidates for SF1-dependent splicing.

Subject(s)

DNA-Binding Proteins/metabolism , Introns , Models, Genetic , Transcription Factors/metabolism , Base Sequence , Binding Sites , Humans , RNA Splicing Factors , RNA, Messenger/chemistry , Sequence Analysis, RNA

11.

Gene Set Summarization using Large Language Models.

Joachimiak, Marcin P; Caufield, J Harry; Harris, Nomi L; Kim, Hyeongsik; Mungall, Christopher J.

ArXiv ; 2023 May 25.

Article in English | MEDLINE | ID: mdl-37292480

ABSTRACT

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB. We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.

12.

GRAPE for fast and scalable graph processing and random-walk-based embedding.

Cappelletti, Luca; Fontana, Tommaso; Casiraghi, Elena; Ravanmehr, Vida; Callahan, Tiffany J; Cano, Carlos; Joachimiak, Marcin P; Mungall, Christopher J; Robinson, Peter N; Reese, Justin; Valentini, Giorgio.

Nat Comput Sci ; 3(6): 552-568, 2023 Jun.

Article in English | MEDLINE | ID: mdl-38177435

ABSTRACT

Graph representation learning methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge- and node-label prediction performance. GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources. Standardized interfaces allow a seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of graph-representation-learning methods, therefore also positioning GRAPE as a software resource that performs a fair comparison between methods and libraries for graph processing and embedding.

Subject(s)

Libraries , Vitis , Algorithms , Software , Learning

13.

Deletion of the Desulfovibrio vulgaris carbon monoxide sensor invokes global changes in transcription.

Rajeev, Lara; Hillesland, Kristina L; Zane, Grant M; Zhou, Aifen; Joachimiak, Marcin P; He, Zhili; Zhou, Jizhong; Arkin, Adam P; Wall, Judy D; Stahl, David A.

J Bacteriol ; 194(21): 5783-93, 2012 Nov.

Article in English | MEDLINE | ID: mdl-22904289

ABSTRACT

The carbon monoxide-sensing transcriptional factor CooA has been studied only in hydrogenogenic organisms that can grow using CO as the sole source of energy. Homologs for the canonical CO oxidation system, including CooA, CO dehydrogenase (CODH), and a CO-dependent Coo hydrogenase, are present in the sulfate-reducing bacterium Desulfovibrio vulgaris, although it grows only poorly on CO. We show that D. vulgaris Hildenborough has an active CO dehydrogenase capable of consuming exogenous CO and that the expression of the CO dehydrogenase, but not that of a gene annotated as encoding a Coo hydrogenase, is dependent on both CO and CooA. Carbon monoxide did not act as a general metabolic inhibitor, since growth of a strain deleted for cooA was inhibited by CO on lactate-sulfate but not pyruvate-sulfate. While the deletion strain did not accumulate CO in excess, as would have been expected if CooA were important in the cycling of CO as a metabolic intermediate, global transcriptional analyses suggested that CooA and CODH are used during normal metabolism.

Subject(s)

Bacterial Proteins/genetics , Carbon Monoxide/metabolism , Desulfovibrio vulgaris/genetics , Gene Deletion , Gene Expression Profiling , Gene Expression Regulation, Bacterial , Transcription Factors/genetics , Aldehyde Oxidoreductases/metabolism , Desulfovibrio vulgaris/growth & development , Desulfovibrio vulgaris/metabolism , Lactates/metabolism , Multienzyme Complexes/metabolism , Pyruvic Acid/metabolism , Sulfates/metabolism

14.

Transcriptomic and proteomic analyses of Desulfovibrio vulgaris biofilms: carbon and energy flow contribute to the distinct biofilm growth state.

Clark, Melinda E; He, Zhili; Redding, Alyssa M; Joachimiak, Marcin P; Keasling, Jay D; Zhou, Jizhong Z; Arkin, Adam P; Mukhopadhyay, Aindrila; Fields, Matthew W.

BMC Genomics ; 13: 138, 2012 Apr 16.

Article in English | MEDLINE | ID: mdl-22507456

ABSTRACT

BACKGROUND: Desulfovibrio vulgaris Hildenborough is a sulfate-reducing bacterium (SRB) that is intensively studied in the context of metal corrosion and heavy-metal bioremediation, and SRB populations are commonly observed in pipe and subsurface environments as surface-associated populations. In order to elucidate physiological changes associated with biofilm growth at both the transcript and protein level, transcriptomic and proteomic analyses were done on mature biofilm cells and compared to both batch and reactor planktonic populations. The biofilms were cultivated with lactate and sulfate in a continuously fed biofilm reactor. RESULTS: The functional genomic analysis demonstrated that biofilm cells were different compared to planktonic cells, and the majority of altered abundances for genes and proteins were annotated as hypothetical (unknown function), energy conservation, amino acid metabolism, and signal transduction. Genes and proteins that showed similar trends in detected levels were particularly involved in energy conservation such as increases in an annotated ech hydrogenase, formate dehydrogenase, pyruvate:ferredoxin oxidoreductase, and rnf oxidoreductase, and the biofilm cells had elevated formate dehydrogenase activity. Several other hydrogenases and formate dehydrogenases also showed an increased protein level, while decreased transcript and protein levels were observed for putative coo hydrogenase as well as a lactate permease and hyp hydrogenases for biofilm cells. Genes annotated for amino acid synthesis and nitrogen utilization were also predominant changers within the biofilm state. Ribosomal transcripts and proteins were notably decreased within the biofilm cells compared to exponential-phase cells but were not as low as levels observed in planktonic, stationary-phase cells.
Several putative, extracellular proteins (DVU1012, 1545) were also detected in the extracellular fraction from biofilm cells. CONCLUSIONS: Even though both the planktonic and biofilm cells were oxidizing lactate and reducing sulfate, the biofilm cells were physiologically distinct compared to planktonic growth states due to altered abundances of genes/proteins involved in carbon/energy flow and extracellular structures. In addition, average expression values for multiple rRNA transcripts and respiratory activity measurements indicated that biofilm cells were metabolically more similar to exponential-phase cells although biofilm cells are structured differently. The characterization of physiological advantages and constraints of the biofilm growth state for sulfate-reducing bacteria will provide insight into bioremediation applications as well as microbially-induced metal corrosion.

Subject(s)

Biofilms/growth & development , Carbon/metabolism , Desulfovibrio vulgaris/growth & development , Desulfovibrio vulgaris/genetics , Energy Metabolism/genetics , Gene Expression Profiling/methods , Proteomics/methods , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Biofilms/drug effects , Bioreactors/microbiology , Carbohydrate Metabolism/drug effects , Carbohydrate Metabolism/genetics , Cluster Analysis , Desulfovibrio vulgaris/drug effects , Desulfovibrio vulgaris/physiology , Energy Metabolism/drug effects , Gene Expression Regulation, Bacterial/drug effects , Lactic Acid/pharmacology , Microscopy, Confocal , Models, Biological , Plankton/cytology , Plankton/drug effects , Plankton/microbiology , Principal Component Analysis , RNA, Messenger/drug effects , RNA, Messenger/genetics , RNA, Messenger/metabolism , Ribosomal Proteins/genetics , Ribosomal Proteins/metabolism , Sulfates/pharmacology

15.

Functional characterization of Crp/Fnr-type global transcriptional regulators in Desulfovibrio vulgaris Hildenborough.

Zhou, Aifen; Chen, Yunyu I; Zane, Grant M; He, Zhili; Hemme, Christopher L; Joachimiak, Marcin P; Baumohl, Jason K; He, Qiang; Fields, Matthew W; Arkin, Adam P; Wall, Judy D; Hazen, Terry C; Zhou, Jizhong.

Appl Environ Microbiol ; 78(4): 1168-77, 2012 Feb.

Article in English | MEDLINE | ID: mdl-22156435

ABSTRACT

Crp/Fnr-type global transcriptional regulators regulate various metabolic pathways in bacteria and typically function in response to environmental changes. However, little is known about the function of four annotated Crp/Fnr homologs (DVU0379, DVU2097, DVU2547, and DVU3111) in Desulfovibrio vulgaris Hildenborough. A systematic study using bioinformatic, transcriptomic, genetic, and physiological approaches was conducted to characterize their roles in stress responses. Similar growth phenotypes were observed for the crp/fnr deletion mutants under multiple stress conditions. Nevertheless, the idea of distinct functions of Crp/Fnr-type regulators in stress responses was supported by phylogeny, gene transcription changes, fitness changes, and physiological differences. The four D. vulgaris Crp/Fnr homologs are localized in three subfamilies (HcpR, CooA, and cc). The crp/fnr knockout mutants were well separated by transcriptional profiling using detrended correspondence analysis (DCA), and more genes significantly changed in expression in a ΔDVU3111 mutant (JW9013) than in the other three paralogs. In fitness studies, strain JW9013 showed the lowest fitness under standard growth conditions (i.e., sulfate reduction) and the highest fitness under NaCl or chromate stress conditions; better fitness was observed for a ΔDVU2547 mutant (JW9011) under nitrite stress conditions and a ΔDVU2097 mutant (JW9009) under air stress conditions. A higher Cr(VI) reduction rate was observed for strain JW9013 in experiments with washed cells. These results suggested that the four Crp/Fnr-type global regulators play distinct roles in stress responses of D. vulgaris. DVU3111 is implicated in responses to NaCl and chromate stresses, DVU2547 in nitrite stress responses, and DVU2097 in air stress responses.

Subject(s)

Cyclic AMP Receptor Protein/metabolism , Desulfovibrio vulgaris/physiology , Gene Expression Regulation, Bacterial , Stress, Physiological , Transcription Factors/metabolism , Transcription, Genetic , Air , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Chromates/metabolism , Chromates/toxicity , Computational Biology , Cyclic AMP Receptor Protein/genetics , DNA, Bacterial/chemistry , DNA, Bacterial/genetics , Desulfovibrio vulgaris/genetics , Desulfovibrio vulgaris/growth & development , Desulfovibrio vulgaris/metabolism , Gene Deletion , Molecular Sequence Data , Nitrites/metabolism , Nitrites/toxicity , Sequence Analysis, DNA , Sodium Chloride/metabolism , Sodium Chloride/toxicity , Transcription Factors/genetics , Transcriptome

16.

MicrobesOnline: an integrated portal for comparative and functional genomics.

Dehal, Paramvir S; Joachimiak, Marcin P; Price, Morgan N; Bates, John T; Baumohl, Jason K; Chivian, Dylan; Friedland, Greg D; Huang, Katherine H; Keller, Keith; Novichkov, Pavel S; Dubchak, Inna L; Alm, Eric J; Arkin, Adam P.

Nucleic Acids Res ; 38(Database issue): D396-400, 2010 Jan.

Article in English | MEDLINE | ID: mdl-19906701

ABSTRACT

Since 2003, MicrobesOnline (http://www.microbesonline.org) has been providing a community resource for comparative and functional genome analysis. The portal includes over 1000 complete genomes of bacteria, archaea and fungi and thousands of expression microarrays from diverse organisms ranging from model organisms such as Escherichia coli and Saccharomyces cerevisiae to environmental microbes such as Desulfovibrio vulgaris and Shewanella oneidensis. To assist in annotating genes and in reconstructing their evolutionary history, MicrobesOnline includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree. To identify co-regulated genes, MicrobesOnline can search for genes based on their expression profile, and provides tools for identifying regulatory motifs and seeing if they are conserved. MicrobesOnline also includes fast phylogenetic profile searches, comparative views of metabolic pathways, operon predictions, a workbench for sequence analysis and integration with RegTransBase and other microbial genome resources. The next update of MicrobesOnline will contain significant new functionality, including comparative analysis of metagenomic sequence data. Programmatic access to the database, along with source code and documentation, is available at http://microbesonline.org/programmers.html.

Subject(s)

Bacteria/genetics , Computational Biology/methods , Databases, Genetic , Databases, Nucleic Acid , Algorithms , Computational Biology/trends , Databases, Protein , Gene Expression Profiling , Genome, Bacterial , Information Storage and Retrieval/methods , Internet , Oligonucleotide Array Sequence Analysis , Protein Structure, Tertiary , Software

17.

Why was this cited? Explainable machine learning applied to COVID-19 research literature.

Beranová, Lucie; Joachimiak, Marcin P; Kliegr, Tomás; Rabby, Gollam; Sklenák, Vilém.

Scientometrics ; 127(5): 2313-2349, 2022.

Article in English | MEDLINE | ID: mdl-35431364

ABSTRACT

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus containing research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, Pubtator and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered, which referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several rules pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses which are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine coronaviruses (CCOV) are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of apparent citation bias favouring authors with Western-sounding names. Equal performance of TF-IDF weights and binary word incidence matrix was observed, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.

18.

Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation.

Elias, Dwayne A; Mukhopadhyay, Aindrila; Joachimiak, Marcin P; Drury, Elliott C; Redding, Alyssa M; Yen, Huei-Che B; Fields, Matthew W; Hazen, Terry C; Arkin, Adam P; Keasling, Jay D; Wall, Judy D.

Nucleic Acids Res ; 37(9): 2926-39, 2009 May.

Article in English | MEDLINE | ID: mdl-19293273

ABSTRACT

Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

Subject(s)

Desulfovibrio vulgaris/genetics , Gene Expression Profiling , Genes, Bacterial , Bacterial Proteins/metabolism , Desulfovibrio vulgaris/metabolism , Gene Expression Regulation, Bacterial , Repressor Proteins/metabolism , Sequence Deletion , Stress, Physiological

19.

Zinc against COVID-19? Symptom surveillance and deficiency risk groups.

Joachimiak, Marcin P.

PLoS Negl Trop Dis ; 15(1): e0008895, 2021 Jan.

Article in English | MEDLINE | ID: mdl-33395417

ABSTRACT

A wide variety of symptoms is associated with Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection, and these symptoms can overlap with other conditions and diseases. Knowing the distribution of symptoms across diseases and individuals can support clinical actions on timelines shorter than those for drug and vaccine development. Here, we focus on zinc deficiency symptoms, symptom overlap with other conditions, as well as zinc effects on immune health and mechanistic zinc deficiency risk groups. There are well-studied beneficial effects of zinc on the immune system including a decreased susceptibility to and improved clinical outcomes for infectious pathogens including multiple viruses. Zinc is also an anti-inflammatory and anti-oxidative stress agent, relevant to some severe Coronavirus Disease 2019 (COVID-19) symptoms. Unfortunately, zinc deficiency is common worldwide and not exclusive to the developing world. Lifestyle choices and preexisting conditions alone can result in zinc deficiency, and we compile zinc risk groups based on a review of the literature. It is also important to distinguish chronic zinc deficiency from deficiency acquired upon viral infection and immune response and their different supplementation strategies. Zinc is being considered as prophylactic or adjunct therapy for COVID-19, with 12 clinical trials underway, highlighting the relevance of this trace element for global pandemics. Using the example of zinc, we show that there is a critical need for a deeper understanding of essential trace elements in human health, and the resulting deficiency symptoms and their overlap with other conditions. This knowledge will directly support human immune health for decreasing susceptibility, shortening illness duration, and preventing progression to severe cases in the current and future pandemics.

Subject(s)

COVID-19 Drug Treatment , COVID-19/prevention & control , Zinc/administration & dosage , Zinc/deficiency , Anti-Inflammatory Agents/pharmacology , COVID-19/immunology , COVID-19/virology , Humans , Immune System/drug effects , Oxidative Stress/drug effects , Oxidative Stress/immunology , Pandemics , Risk Factors , SARS-CoV-2/isolation & purification

20.

Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer.

Ravanmehr, Vida; Blau, Hannah; Cappelletti, Luca; Fontana, Tommaso; Carmody, Leigh; Coleman, Ben; George, Joshy; Reese, Justin; Joachimiak, Marcin; Bocci, Giovanni; Hansen, Peter; Bult, Carol; Rueter, Jens; Casiraghi, Elena; Valentini, Giorgio; Mungall, Christopher; Oprea, Tudor I; Robinson, Peter N.

NAR Genom Bioinform ; 3(4): lqab113, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34888523

ABSTRACT

Inhibiting protein kinases (PKs) that cause cancers has been an important topic in cancer therapy for years. So far, almost 8% of >530 PKs have been targeted by FDA-approved medications, and around 150 protein kinase inhibitors (PKIs) have been tested in clinical trials. We present an approach based on natural language processing and machine learning to investigate the relations between PKs and cancers, predicting PKs whose inhibition would be efficacious to treat a certain cancer. Our approach represents PKs and cancers as semantically meaningful 100-dimensional vectors based on word and concept neighborhoods in PubMed abstracts. We use information about phase I-IV trials in ClinicalTrials.gov to construct a training set for random forest classification. Our results with historical data show that associations between PKs and specific cancers can be predicted years in advance with good accuracy. Our tool can be used to predict the relevance of inhibiting PKs for specific cancers and to support the design of well-focused clinical trials to discover novel PKIs for cancer therapy.
