Search | VHL Regional Portal

1.

An open source knowledge graph ecosystem for the life sciences.

Callahan, Tiffany J; Tripodi, Ignacio J; Stefanski, Adrianne L; Cappelletti, Luca; Taneja, Sanya B; Wyrwa, Jordan M; Casiraghi, Elena; Matentzoglu, Nicolas A; Reese, Justin; Silverstein, Jonathan C; Hoyt, Charles Tapley; Boyce, Richard D; Malec, Scott A; Unni, Deepak R; Joachimiak, Marcin P; Robinson, Peter N; Mungall, Christopher J; Cavalleri, Emanuele; Fontana, Tommaso; Valentini, Giorgio; Mesiti, Marco; Gillenwater, Lucas A; Santangelo, Brook; Vasilevsky, Nicole A; Hoehndorf, Robert; Bennett, Tellen D; Ryan, Patrick B; Hripcsak, George; Kahn, Michael G; Bada, Michael; Baumgartner, William A; Hunter, Lawrence E.

Sci Data ; 11(1): 363, 2024 Apr 11.

Article in English | MEDLINE | ID: mdl-38605048

ABSTRACT

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Subject(s)

Biological Science Disciplines , Knowledge Bases , Pattern Recognition, Automated , Algorithms , Translational Research, Biomedical

2.

Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.

Cappelletti, Luca; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, Elena; Ravanmehr, Vida; Mungall, Christopher J; Yang, Jeremy J; Spranger, Leonard; Karlebach, Guy; Caufield, J Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor I; Reese, Justin; Valentini, Giorgio; Robinson, Peter N.

Bioinform Adv ; 4(1): vbae036, 2024.

Article in English | MEDLINE | ID: mdl-38577542

ABSTRACT

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
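The degree-aware idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in the linked repository); the function name `sample_negative_edges` and the toy graph are hypothetical. It assumes an undirected graph given as a list of node pairs, and it draws each endpoint of a candidate negative edge with probability proportional to that node's degree among the positive edges, instead of uniformly over nodes.

```python
import random


def sample_negative_edges(positive_edges, n_samples, seed=0):
    """Sample non-edges whose endpoints are drawn proportionally to node degree.

    Illustrative sketch only: endpoints are picked from the multiset of
    positive-edge endpoints, so a node of degree d is d times as likely
    to be chosen as a node of degree 1.
    """
    rng = random.Random(seed)
    positives = {frozenset(edge) for edge in positive_edges}
    # Each node appears once per incident positive edge.
    endpoints = [node for edge in positive_edges for node in edge]
    negatives = set()
    attempts, max_attempts = 0, 100 * n_samples  # guard against dense graphs
    while len(negatives) < n_samples and attempts < max_attempts:
        attempts += 1
        u, v = rng.choice(endpoints), rng.choice(endpoints)
        pair = frozenset((u, v))
        # Reject self-loops and pairs that are known positive edges.
        if u != v and pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy graph: "hub" is connected to every other node, so it can never
# appear in a negative edge even though it is sampled most often.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
print(sample_negative_edges(edges, 3))
```

Uniform sampling would instead call `rng.choice` on the list of distinct nodes, which is the strategy the abstract reports as producing mismatched degree distributions between positive and negative examples.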

3.

An integrated metagenomic, metabolomic and transcriptomic survey of Populus across genotypes and environments.

Schadt, Christopher; Martin, Stanton; Carrell, Alyssa; Fortner, Allison; Hopp, Dan; Jacobson, Dan; Klingeman, Dawn; Kristy, Brandon; Phillips, Jana; Piatkowski, Bryan; Miller, Mark A; Smith, Montana; Patil, Sujay; Flynn, Mark; Canon, Shane; Clum, Alicia; Mungall, Christopher J; Pennacchio, Christa; Bowen, Benjamin; Louie, Katherine; Northen, Trent; Eloe-Fadrosh, Emiley A; Mayes, Melanie A; Muchero, Wellington; Weston, David J; Mitchell, Julie; Doktycz, Mitchel.

Sci Data ; 11(1): 339, 2024 Apr 05.

Article in English | MEDLINE | ID: mdl-38580669

ABSTRACT

Bridging molecular information to ecosystem-level processes would provide the capacity to understand system vulnerability and, potentially, a means for assessing ecosystem health. Here, we present an integrated dataset containing environmental and metagenomic information from plant-associated microbial communities, plant transcriptomics, plant and soil metabolomics, and soil chemistry and activity characterization measurements derived from the model tree species Populus trichocarpa. Soil, rhizosphere, root endosphere, and leaf samples were collected from 27 different P. trichocarpa genotypes grown in two different environments leading to an integrated dataset of 318 metagenomes, 98 plant transcriptomes, and 314 metabolomic profiles that are supported by diverse soil measurements. This expansive dataset will provide insights into causal linkages that relate genomic features and molecular level events to system-level properties and their environmental influences.

Subject(s)

Metagenome , Microbiota , Populus , Transcriptome , Fungi/genetics , Gene Expression Profiling , Genotype , Populus/genetics , Soil

4.

Predicting nutrition and environmental factors associated with female reproductive disorders using a knowledge graph and random forests.

Chan, Lauren E; Casiraghi, Elena; Reese, Justin; Harmon, Quaker E; Schaper, Kevin; Hegde, Harshad; Valentini, Giorgio; Schmitt, Charles; Motsinger-Reif, Alison; Hall, Janet E; Mungall, Christopher J; Robinson, Peter N; Haendel, Melissa A.

Int J Med Inform ; 187: 105461, 2024 Apr 17.

Article in English | MEDLINE | ID: mdl-38643701

ABSTRACT

OBJECTIVE: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (for example, endometriosis, ovarian cyst, and uterine fibroids). MATERIALS AND METHODS: We harmonized survey data from the Personalized Environment and Genes Study (PEGS) on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. RESULTS: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant or suggestive predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. DISCUSSION: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal but can support hypothesis generation. CONCLUSION: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.

5.

Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning.

Caufield, J Harry; Hegde, Harshad; Emonet, Vincent; Harris, Nomi L; Joachimiak, Marcin P; Matentzoglu, Nicolas; Kim, HyeongSik; Moxon, Sierra; Reese, Justin T; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J.

Bioinformatics ; 40(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.

Subject(s)

Knowledge Bases , Semantics , Databases, Factual
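The recursive schema-driven extraction described in the SPIRES abstract can be sketched roughly as follows. This is a simplified illustration, not the OntoGPT code or API: the schema format (a nested dict), the prompt wording, the `fake_llm` stand-in, and the `EXAMPLE:` identifiers are all hypothetical. A real system would call an actual LLM and ground values against real ontologies.

```python
def extract(text, schema, llm, lexicon):
    """Fill a nested schema by querying `llm` once per field, recursing on sub-schemas."""
    record = {}
    for field, spec in schema.items():
        prompt = f"Extract the value of '{field}' from this text:\n{text}"
        answer = llm(prompt).strip()
        if isinstance(spec, dict):
            # Nested class: interrogate the answer text again with the sub-schema.
            record[field] = extract(answer, spec, llm, lexicon)
        else:
            # Leaf value: ground it to an identifier when the lexicon knows it.
            record[field] = lexicon.get(answer.lower(), answer)
    return record


def fake_llm(prompt):
    """Deterministic stand-in for an LLM call (hypothetical canned answers)."""
    canned = {
        "treatment": "drug: metformin; disease: type 2 diabetes",
        "drug": "metformin",
        "disease": "type 2 diabetes",
    }
    field = prompt.split("'")[1]  # the field name is quoted in the prompt
    return canned[field]


# Hypothetical schema and grounding lexicon; the identifiers are placeholders.
schema = {"treatment": {"drug": "str", "disease": "str"}}
lexicon = {"metformin": "EXAMPLE:0001", "type 2 diabetes": "EXAMPLE:0002"}

text = "Metformin is a first-line therapy for type 2 diabetes."
print(extract(text, schema, fake_llm, lexicon))
```

Keeping grounding as a separate lookup step mirrors the abstract's point that identifiers come from existing ontologies and vocabularies, not from the model itself.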

6.

An evaluation of GPT models for phenotype concept recognition.

Groza, Tudor; Caufield, Harry; Gration, Dylan; Baynam, Gareth; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J; Reese, Justin T.

BMC Med Inform Decis Mak ; 24(1): 30, 2024 Jan 31.

Article in English | MEDLINE | ID: mdl-38297371

ABSTRACT

OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (supported usually by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other clinical observations. RESULTS: The best run, using in-context learning, achieved 0.58 document-level F1 score on publication abstracts and 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses the state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.

Subject(s)

Knowledge , Language , Humans , Machine Learning , Phenotype , Rare Diseases
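For readers unfamiliar with the F1 scores reported in the abstract above, here is a small Python sketch of a document-level F1. It assumes "document-level" means comparing the set of ontology identifiers predicted for a document against the gold-standard set for that document; the study's exact scoring protocol may differ, and the identifier sets below are made up for illustration.

```python
def document_f1(predicted, gold):
    """F1 between predicted and gold sets of concept identifiers for one document."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative identifiers only: two of three predictions are correct,
# and two of three gold concepts are found.
gold = {"HP:0001250", "HP:0001263", "HP:0000252"}
predicted = {"HP:0001250", "HP:0000252", "HP:0004322"}
print(round(document_f1(predicted, gold), 3))  # 0.667
```

Here precision and recall are both 2/3, so their harmonic mean is also 2/3.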

7.

On the limitations of large language models in clinical diagnosis.

Reese, Justin T; Danis, Daniel; Caufield, J Harry; Groza, Tudor; Casiraghi, Elena; Valentini, Giorgio; Mungall, Christopher J; Robinson, Peter N.

medRxiv ; 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-37503093

ABSTRACT

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.

8.

The Monarch Initiative in 2024: an analytic platform integrating phenotypes, genes and diseases across species.

Putman, Tim E; Schaper, Kevin; Matentzoglu, Nicolas; Rubinetti, Vincent P; Alquaddoomi, Faisal S; Cox, Corey; Caufield, J Harry; Elsarboukh, Glass; Gehrke, Sarah; Hegde, Harshad; Reese, Justin T; Braun, Ian; Bruskiewich, Richard M; Cappelletti, Luca; Carbon, Seth; Caron, Anita R; Chan, Lauren E; Chute, Christopher G; Cortes, Katherina G; De Souza, Vinícius; Fontana, Tommaso; Harris, Nomi L; Hartley, Emily L; Hurwitz, Eric; Jacobsen, Julius O B; Krishnamurthy, Madan; Laraway, Bryan J; McLaughlin, James A; McMurry, Julie A; Moxon, Sierra A T; Mullen, Kathleen R; O'Neil, Shawn T; Shefchek, Kent A; Stefancsik, Ray; Toro, Sabrina; Vasilevsky, Nicole A; Walls, Ramona L; Whetzel, Patricia L; Osumi-Sutherland, David; Smedley, Damian; Robinson, Peter N; Mungall, Christopher J; Haendel, Melissa A; Munoz-Torres, Monica C.

Nucleic Acids Res ; 52(D1): D938-D949, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38000386

ABSTRACT

Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.

Subject(s)

Databases, Factual , Disease , Genes , Phenotype , Humans , Internet , Databases, Factual/standards , Software , Genes/genetics , Disease/genetics

9.

The Medical Action Ontology: A tool for annotating and analyzing treatments and clinical management of human disease.

Carmody, Leigh C; Gargano, Michael A; Toro, Sabrina; Vasilevsky, Nicole A; Adam, Margaret P; Blau, Hannah; Chan, Lauren E; Gomez-Andres, David; Horvath, Rita; Kraus, Megan L; Ladewig, Markus S; Lewis-Smith, David; Lochmüller, Hanns; Matentzoglu, Nicolas A; Munoz-Torres, Monica C; Schuetz, Catharina; Seitz, Berthold; Similuk, Morgan N; Sparks, Teresa N; Strauss, Timmy; Swietlik, Emilia M; Thompson, Rachel; Zhang, Xingmin Aaron; Mungall, Christopher J; Haendel, Melissa A; Robinson, Peter N.

Med ; 4(12): 913-927.e3, 2023 Dec 08.

Article in English | MEDLINE | ID: mdl-37963467

ABSTRACT

BACKGROUND: Navigating the clinical literature to determine the optimal clinical management for rare diseases presents significant challenges. We introduce the Medical Action Ontology (MAxO), an ontology specifically designed to organize medical procedures, therapies, and interventions. METHODS: MAxO incorporates logical structures that link MAxO terms to numerous other ontologies within the OBO Foundry. Term development involves a blend of manual and semi-automated processes. Additionally, we have generated annotations detailing diagnostic modalities for specific phenotypic abnormalities defined by the Human Phenotype Ontology (HPO). We introduce a web application, POET, that facilitates MAxO annotations for specific medical actions for diseases using the Mondo Disease Ontology. FINDINGS: MAxO encompasses 1,757 terms spanning a wide range of biomedical domains, from human anatomy and investigations to the chemical and protein entities involved in biological processes. These terms annotate phenotypic features associated with specific diseases (using HPO and Mondo). Presently, there are over 16,000 MAxO diagnostic annotations that target HPO terms. Through POET, we have created 413 MAxO annotations specifying treatments for 189 rare diseases. CONCLUSIONS: MAxO offers a computational representation of treatments and other actions taken for the clinical management of patients. Its development is closely coupled to Mondo and HPO, broadening the scope of our computational modeling of diseases and phenotypic features. We invite the community to contribute disease annotations using POET (https://poet.jax.org/). MAxO is available under the open-source CC-BY 4.0 license (https://github.com/monarch-initiative/MAxO). FUNDING: NHGRI 1U24HG011449-01A1 and NHGRI 5RM1HG010860-04.

Subject(s)

Biological Ontologies , Humans , Rare Diseases , Software , Computer Simulation

10.

An approach for collaborative development of a federated biomedical knowledge graph-based question-answering system: Question-of-the-Month challenges.

Fecho, Karamarie; Bizon, Chris; Issabekova, Tursynay; Moxon, Sierra; Thessen, Anne E; Abdollahi, Shervin; Baranzini, Sergio E; Belhu, Basazin; Byrd, William E; Chung, Lawrence; Crouse, Andrew; Duby, Marc P; Ferguson, Stephen; Foksinska, Aleksandra; Forero, Laura; Friedman, Jennifer; Gardner, Vicki; Glusman, Gwênlyn; Hadlock, Jennifer; Hanspers, Kristina; Hinderer, Eugene; Hobbs, Charlotte; Hyde, Gregory; Huang, Sui; Koslicki, David; Mease, Philip; Muller, Sandrine; Mungall, Christopher J; Ramsey, Stephen A; Roach, Jared; Rubin, Irit; Schurman, Shepherd H; Shalev, Anath; Smith, Brett; Soman, Karthik; Stemann, Sarah; Su, Andrew I; Ta, Casey; Watkins, Paul B; Williams, Mark D; Wu, Chunlei; Xu, Colleen H.

J Clin Transl Sci ; 7(1): e214, 2023.

Article in English | MEDLINE | ID: mdl-37900350

ABSTRACT

Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly "Question-of-the-Month (QotM) Challenge" series. Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.

11.

Predicting nutrition and environmental factors associated with female reproductive disorders using a knowledge graph and random forests.

Chan, Lauren E; Casiraghi, Elena; Putman, Tim; Reese, Justin; Harmon, Quaker E; Schaper, Kevin; Hegde, Harshad; Valentini, Giorgio; Schmitt, Charles; Motsinger-Reif, Alison; Hall, Janet E; Mungall, Christopher J; Robinson, Peter N; Haendel, Melissa A.

medRxiv ; 2023 Jul 16.

Article in English | MEDLINE | ID: mdl-37502882

ABSTRACT

Objective: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (e.g., endometriosis, ovarian cyst, and uterine fibroids). Materials and Methods: We harmonized survey data from the Personalized Environment and Genes Study on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. Results: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. Discussion: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal, but can support hypothesis generation. Conclusion: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.

12.

The Medical Action Ontology: A Tool for Annotating and Analyzing Treatments and Clinical Management of Human Disease.

Carmody, Leigh C; Gargano, Michael A; Toro, Sabrina; Vasilevsky, Nicole A; Adam, Margaret P; Blau, Hannah; Chan, Lauren E; Gomez-Andres, David; Horvath, Rita; Kraus, Megan L; Ladewig, Markus S; Lewis-Smith, David; Lochmüller, Hanns; Matentzoglu, Nicolas A; Munoz-Torres, Monica C; Schuetz, Catharina; Seitz, Berthold; Similuk, Morgan N; Sparks, Teresa N; Strauss, Timmy; Swietlik, Emilia M; Thompson, Rachel; Zhang, Xingmin Aaron; Mungall, Christopher J; Haendel, Melissa A; Robinson, Peter N.

medRxiv ; 2023 Jul 13.

Article in English | MEDLINE | ID: mdl-37503136

ABSTRACT

Navigating the vast landscape of clinical literature to find optimal treatments and management strategies can be a challenging task, especially for rare diseases. To address this task, we introduce the Medical Action Ontology (MAxO), the first ontology specifically designed to organize medical procedures, therapies, and interventions in a structured way. Currently, MAxO contains 1757 medical action terms added through a combination of manual and semi-automated processes. MAxO was developed with logical structures that make it compatible with several other ontologies within the Open Biological and Biomedical Ontologies (OBO) Foundry. These cover a wide range of biomedical domains, from human anatomy and investigations to the chemical and protein entities involved in biological processes. We have created a database of over 16000 annotations that describe diagnostic modalities for specific phenotypic abnormalities as defined by the Human Phenotype Ontology (HPO). Additionally, 413 annotations are provided for medical actions for 189 rare diseases. We have developed a web application called POET (https://poet.jax.org/) for the community to use to contribute MAxO annotations. MAxO provides a computational representation of treatments and other actions taken for the clinical management of patients. The development of MAxO is closely coupled to the Mondo Disease Ontology (Mondo) and the Human Phenotype Ontology (HPO) and expands the scope of our computational modeling of diseases and phenotypic features to include diagnostics and therapeutic actions. MAxO is available under the open-source CC-BY 4.0 license (https://github.com/monarch-initiative/MAxO).

13.

Gene Set Summarization using Large Language Models.

Joachimiak, Marcin P; Caufield, J Harry; Harris, Nomi L; Kim, Hyeongsik; Mungall, Christopher J.

ArXiv ; 2023 May 25.

Article in English | MEDLINE | ID: mdl-37292480

ABSTRACT

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB. We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.
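The "standard enrichment analysis" that the abstract above uses as its baseline is commonly a one-sided hypergeometric (Fisher-style) test for over-representation. The following Python sketch shows that calculation for a single term. It is a generic textbook formulation, not code from the paper, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value for over-representation of a term.

    population: number of genes in the background
    annotated:  background genes annotated with the term
    sample:     size of the gene list being interpreted
    overlap:    genes in the list that carry the term
    Returns P(X >= overlap) when `sample` genes are drawn without replacement.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated, a list of 3 genes, 2 of them annotated.
# P(X >= 2) = (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
print(enrichment_p_value(10, 4, 3, 2))
```

In practice such p-values are computed per term and then corrected for multiple testing, which is the kind of statistical output the abstract notes that GPT-based summaries cannot reliably provide.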

14.

KG-Hub-building and exchanging biological knowledge graphs.

Caufield, J Harry; Putman, Tim; Schaper, Kevin; Unni, Deepak R; Hegde, Harshad; Callahan, Tiffany J; Cappelletti, Luca; Moxon, Sierra A T; Ravanmehr, Vida; Carbon, Seth; Chan, Lauren E; Cortes, Katherina; Shefchek, Kent A; Elsarboukh, Glass; Balhoff, Jim; Fontana, Tommaso; Matentzoglu, Nicolas; Bruskiewich, Richard M; Thessen, Anne E; Harris, Nomi L; Munoz-Torres, Monica C; Haendel, Melissa A; Robinson, Peter N; Joachimiak, Marcin P; Mungall, Christopher J; Reese, Justin T.

Bioinformatics ; 39(7), 2023 Jul 01.

Article in English | MEDLINE | ID: mdl-37389415

ABSTRACT

MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.

Subject(s)

Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning

15.

Phenopacket-tools: Building and validating GA4GH Phenopackets.

Danis, Daniel; Jacobsen, Julius O B; Wagner, Alex H; Groza, Tudor; Beckwith, Martha A; Rekerle, Lauren; Carmody, Leigh C; Reese, Justin; Hegde, Harshad; Ladewig, Markus S; Seitz, Berthold; Munoz-Torres, Monica; Harris, Nomi L; Rambla, Jordi; Baudis, Michael; Mungall, Christopher J; Haendel, Melissa A; Robinson, Peter N.

PLoS One ; 18(5): e0285433, 2023.

Article in English | MEDLINE | ID: mdl-37196000

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, comprehensive user guide and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.

Subject(s)

Neoplasms , Software , Humans , Genomics , Databases, Factual , Gene Library

16.

The Ontology of Biological Attributes (OBA): computational traits for the life sciences.

Stefancsik, Ray; Balhoff, James P; Balk, Meghan A; Ball, Robyn L; Bello, Susan M; Caron, Anita R; Chesler, Elissa J; de Souza, Vinicius; Gehrke, Sarah; Haendel, Melissa; Harris, Laura W; Harris, Nomi L; Ibrahim, Arwa; Koehler, Sebastian; Matentzoglu, Nicolas; McMurry, Julie A; Mungall, Christopher J; Munoz-Torres, Monica C; Putman, Tim; Robinson, Peter; Smedley, Damian; Sollis, Elliot; Thessen, Anne E; Vasilevsky, Nicole; Walton, David O; Osumi-Sutherland, David.

Mamm Genome ; 34(3): 364-378, 2023 09.

Article in English | MEDLINE | ID: mdl-37076585

ABSTRACT

Existing phenotype ontologies were originally developed to represent phenotypes that manifest as a character state in relation to a wild-type or other reference. However, these do not include the phenotypic trait or attribute categories required for the annotation of genome-wide association studies (GWAS), Quantitative Trait Loci (QTL) mappings or any population-focussed measurable trait data. The integration of trait and biological attribute information with an ever increasing body of chemical, environmental and biological data greatly facilitates computational analyses and it is also highly relevant to biomedical and clinical applications. The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role. OBA is a standardised representational framework for observable attributes that are characteristics of biological entities, organisms, or parts of organisms. OBA has a modular design which provides several benefits for users and data integrators, including an automated and meaningful classification of trait terms computed on the basis of logical inferences drawn from domain-specific ontologies for cells, anatomical and other relevant entities. The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. The term components in OBA provide semantic links and enable knowledge and data integration across specialised research community boundaries, thereby breaking silos.

Subject(s)

Biological Ontologies , Biological Science Disciplines , Genome-Wide Association Study , Phenotype

17.

Author Correction: Brain Data Standards - A method for building data-driven cell-type ontologies.

Tan, Shawn Zheng Kai; Kir, Huseyin; Aevermann, Brian D; Gillespie, Tom; Harris, Nomi; Hawrylycz, Michael J; Jorstad, Nikolas L; Lein, Ed S; Matentzoglu, Nicolas; Miller, Jeremy A; Mollenkopf, Tyler S; Mungall, Christopher J; Ray, Patrick L; Sanchez, Raymond E A; Staats, Brian; Vermillion, Jim; Yadav, Ambika; Zhang, Yun; Scheuermann, Richard H; Osumi-Sutherland, David.

Sci Data ; 10(1): 246, 2023 Apr 28.

Article in English | MEDLINE | ID: mdl-37117232

18.

The Gene Ontology knowledgebase in 2023.

Aleksander, Suzi A; Balhoff, James; Carbon, Seth; Cherry, J Michael; Drabkin, Harold J; Ebert, Dustin; Feuermann, Marc; Gaudet, Pascale; Harris, Nomi L; Hill, David P; Lee, Raymond; Mi, Huaiyu; Moxon, Sierra; Mungall, Christopher J; Muruganugan, Anushya; Mushayahama, Tremayne; Sternberg, Paul W; Thomas, Paul D; Van Auken, Kimberly; Ramsey, Jolene; Siegele, Deborah A; Chisholm, Rex L; Fey, Petra; Aspromonte, Maria Cristina; Nugnes, Maria Victoria; Quaglia, Federica; Tosatto, Silvio; Giglio, Michelle; Nadendla, Suvarna; Antonazzo, Giulia; Attrill, Helen; Dos Santos, Gil; Marygold, Steven; Strelets, Victor; Tabone, Christopher J; Thurmond, Jim; Zhou, Pinglei; Ahmed, Saadullah H; Asanitthong, Praoparn; Luna Buitrago, Diana; Erdol, Meltem N; Gage, Matthew C; Ali Kadhum, Mohamed; Li, Kan Yan Chloe; Long, Miao; Michalak, Aleksandra; Pesala, Angeline; Pritazahra, Armalya; Saverimuttu, Shirin C C; Su, Renzhi.

Genetics ; 224(1), 2023 05 04.

Article in English | MEDLINE | ID: mdl-36866529

ABSTRACT

The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO, a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations, evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs), mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.

Subject(s)

Databases, Genetic , Proteins , Gene Ontology , Proteins/genetics , Molecular Sequence Annotation , Computational Biology

19.

GA4GH Phenopackets: A Practical Introduction.

Ladewig, Markus S; Jacobsen, Julius O B; Wagner, Alex H; Danis, Daniel; El Kassaby, Baha; Gargano, Michael; Groza, Tudor; Baudis, Michael; Steinhaus, Robin; Seelow, Dominik; Bechrakis, Nikolaos E; Mungall, Christopher J; Schofield, Paul N; Elemento, Olivier; Smith, Lindsay; McMurry, Julie A; Munoz-Torres, Monica; Haendel, Melissa A; Robinson, Peter N.

Adv Genet (Hoboken) ; 4(1): 2200016, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36910590

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) is developing a suite of coordinated standards for genomics for healthcare. The Phenopacket is a new GA4GH standard for sharing disease and phenotype information that characterizes an individual person, linking that individual to detailed phenotypic descriptions, genetic information, diagnoses, and treatments. A detailed example is presented that illustrates how to use the schema to represent the clinical course of a patient with retinoblastoma, including demographic information, the clinical diagnosis, phenotypic features and clinical measurements, an examination of the extirpated tumor, therapies, and the results of genomic analysis. The Phenopacket Schema, together with other GA4GH data and technical standards, will enable data exchange and provide a foundation for the computational analysis of disease and phenotype information to improve our ability to diagnose and conduct research on all types of disorders, including cancer and rare diseases.

20.

An expectation-maximization framework for comprehensive prediction of isoform-specific functions.

Karlebach, Guy; Carmody, Leigh; Sundaramurthi, Jagadish Chandrabose; Casiraghi, Elena; Hansen, Peter; Reese, Justin; Mungall, Christopher J; Valentini, Giorgio; Robinson, Peter N.

Bioinformatics ; 39(4), 2023 04 03.

Article in English | MEDLINE | ID: mdl-36929917

ABSTRACT

MOTIVATION: Advances in RNA sequencing technologies have achieved an unprecedented accuracy in the quantification of mRNA isoforms, but our knowledge of isoform-specific functions has lagged behind. There is a need to understand the functional consequences of differential splicing, which could be supported by the generation of accurate and comprehensive isoform-specific gene ontology annotations. RESULTS: We present isoform interpretation, a method that uses expectation-maximization to infer isoform-specific functions based on the relationship between sequence and functional isoform similarity. We predicted isoform-specific functional annotations for 85 617 isoforms of 17 900 protein-coding human genes spanning a range of 17 430 distinct gene ontology terms. Comparison with a gold-standard corpus of manually annotated human isoform functions showed that isoform interpretation significantly outperforms state-of-the-art competing methods. We provide experimental evidence that functionally related isoforms predicted by isoform interpretation show a higher degree of domain sharing and expression correlation than functionally related genes. We also show that isoform sequence similarity correlates better with inferred isoform function than with gene-level function. AVAILABILITY AND IMPLEMENTATION: Source code, documentation, and resource files are freely available under a GNU3 license at https://github.com/TheJacksonLaboratory/isopretEM and https://zenodo.org/record/7594321.

Subject(s)

Motivation , Software , Humans , Protein Isoforms/genetics , Alternative Splicing , Sequence Analysis, RNA
