Search | VHL Regional Portal

1.

A de novo variant in PAK2 detected in an individual with Knobloch type 2 syndrome.

Werren, Elizabeth A; Kalsner, Louisa; Ewald, Jessica; Peracchio, Michael; King, Cameron; Vats, Purva; Audano, Peter A; Robinson, Peter N; Adams, Mark D; Kelly, Melissa A; Matson, Adam P.

bioRxiv ; 2024 Apr 22.

Article in English | MEDLINE | ID: mdl-38712026

ABSTRACT

P21-activated kinase 2 (PAK2) is a serine/threonine kinase essential for a variety of cellular processes including signal transduction, cellular survival, proliferation, and migration. A recent report proposed that monoallelic PAK2 variants cause Knobloch syndrome type 2 (KNO2), a developmental disorder primarily characterized by ocular anomalies. Here, we identified a novel de novo heterozygous missense variant in PAK2, NM_002577.4:c.1273G>A, p.(D425N), by whole genome sequencing in an individual with features consistent with KNO2. Notable clinical phenotypes include global developmental delay, congenital retinal detachment, mild cerebral ventriculomegaly, hypotonia, failure to thrive (FTT), pyloric stenosis, feeding intolerance, patent ductus arteriosus, and mild facial dysmorphism. The p.(D425N) variant lies within the protein kinase domain and is predicted to be functionally damaging by in silico analysis. Previous clinical genetic testing did not report this variant due to unknown relevance of PAK2 variants at the time of testing, highlighting the importance of reanalysis. Our findings also substantiate the candidacy of PAK2 variants in KNO2 and expand the KNO2 clinical spectrum.

2.

An open source knowledge graph ecosystem for the life sciences.

Callahan, Tiffany J; Tripodi, Ignacio J; Stefanski, Adrianne L; Cappelletti, Luca; Taneja, Sanya B; Wyrwa, Jordan M; Casiraghi, Elena; Matentzoglu, Nicolas A; Reese, Justin; Silverstein, Jonathan C; Hoyt, Charles Tapley; Boyce, Richard D; Malec, Scott A; Unni, Deepak R; Joachimiak, Marcin P; Robinson, Peter N; Mungall, Christopher J; Cavalleri, Emanuele; Fontana, Tommaso; Valentini, Giorgio; Mesiti, Marco; Gillenwater, Lucas A; Santangelo, Brook; Vasilevsky, Nicole A; Hoehndorf, Robert; Bennett, Tellen D; Ryan, Patrick B; Hripcsak, George; Kahn, Michael G; Bada, Michael; Baumgartner, William A; Hunter, Lawrence E.

Sci Data ; 11(1): 363, 2024 Apr 11.

Article in English | MEDLINE | ID: mdl-38605048

ABSTRACT

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Subject(s)

Biological Science Disciplines , Knowledge Bases , Pattern Recognition, Automated , Algorithms , Translational Research, Biomedical

3.

Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning.

Cappelletti, Luca; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, Elena; Ravanmehr, Vida; Mungall, Christopher J; Yang, Jeremy J; Spranger, Leonard; Karlebach, Guy; Caufield, J Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor I; Reese, Justin; Valentini, Giorgio; Robinson, Peter N.

Bioinform Adv ; 4(1): vbae036, 2024.

Article in English | MEDLINE | ID: mdl-38577542

ABSTRACT

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples are, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
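
The contrast drawn in this abstract, uniform versus degree-aware sampling of negative edges, can be illustrated with a minimal sketch. This is not the code from the repository cited above; the toy graph, the function names, and the choice to draw each endpoint with probability proportional to its degree are assumptions made for illustration only.

```python
import random
from collections import Counter

# Hypothetical toy graph: one hub linked to ten nodes, plus a chain of twenty others.
TOY_EDGES = [("hub", f"n{i:02d}") for i in range(10)] + [
    (f"n{i:02d}", f"n{i + 1:02d}") for i in range(10, 29)
]

def node_degrees(edges):
    """Count how many positive edges touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def sample_negatives(edges, n, degree_aware, seed=0):
    """Sample n distinct node pairs that are not positive edges.

    degree_aware=False: both endpoints are drawn uniformly from all nodes.
    degree_aware=True: endpoints are drawn proportionally to node degree,
    so the negatives resemble the positives in their degree distribution.
    The caller must ensure the graph has at least n non-edges.
    """
    rng = random.Random(seed)
    deg = node_degrees(edges)
    nodes = sorted(deg)
    weights = [deg[x] for x in nodes] if degree_aware else None
    positives = {frozenset(e) for e in edges}
    negatives = set()
    while len(negatives) < n:
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        if u != v and pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```

Calling `sample_negatives(TOY_EDGES, 50, degree_aware=True)` and comparing the endpoint degrees with those from `degree_aware=False` gives a feel for how the two strategies differ on a graph with a high-degree hub.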

4.

Lethal phenotypes in Mendelian disorders.

Cacheiro, Pilar; Lawson, Samantha; Van den Veyver, Ignatia B; Marengo, Gabriel; Zocche, David; Murray, Stephen A; Duyzend, Michael; Robinson, Peter N; Smedley, Damian.

Genet Med ; : 101141, 2024 Apr 13.

Article in English | MEDLINE | ID: mdl-38629401

ABSTRACT

PURPOSE: Existing resources that characterise the essentiality status of genes are based on either proliferation assessment in human cell lines, viability evaluation in mouse knockouts, or constraint metrics derived from human population sequencing studies. Several repositories document phenotypic annotations for rare disorders; however, there is a lack of comprehensive reporting on lethal phenotypes. METHODS: We queried Online Mendelian Inheritance in Man for terms related to lethality and classified all Mendelian genes according to the earliest age of death recorded for the associated disorders, from prenatal death to no reports of premature death. We characterised the genes across these lethality categories, examined the evidence on viability from mouse models and explored how this information could be used for novel gene discovery. RESULTS: We developed the Lethal Phenotypes Portal to showcase this curated catalogue of human essential genes. Differences in the mode of inheritance, physiological systems affected and disease class were found for genes in different lethality categories as well as discrepancies between the lethal phenotypes observed in mouse and human. CONCLUSION: We anticipate that this resource will aid clinicians in the diagnosis of early lethal conditions and assist researchers in investigating the properties that make these genes essential for human development.

5.

Predicting nutrition and environmental factors associated with female reproductive disorders using a knowledge graph and random forests.

Chan, Lauren E; Casiraghi, Elena; Reese, Justin; Harmon, Quaker E; Schaper, Kevin; Hegde, Harshad; Valentini, Giorgio; Schmitt, Charles; Motsinger-Reif, Alison; Hall, Janet E; Mungall, Christopher J; Robinson, Peter N; Haendel, Melissa A.

Int J Med Inform ; 187: 105461, 2024 Apr 17.

Article in English | MEDLINE | ID: mdl-38643701

ABSTRACT

OBJECTIVE: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (for example, endometriosis, ovarian cyst, and uterine fibroids). MATERIALS AND METHODS: We harmonized survey data from the Personalized Environment and Genes Study (PEGS) on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. RESULTS: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant or suggestive predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. DISCUSSION: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal but can support hypothesis generation. CONCLUSION: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.

6.

Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning.

Caufield, J Harry; Hegde, Harshad; Emonet, Vincent; Harris, Nomi L; Joachimiak, Marcin P; Matentzoglu, Nicolas; Kim, HyeongSik; Moxon, Sierra; Reese, Justin T; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J.

Bioinformatics ; 40(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open-source OntoGPT package: https://github.com/monarch-initiative/ontogpt.

Subject(s)

Knowledge Bases , Semantics , Databases, Factual

7.

Lethal phenotypes in Mendelian disorders.

Cacheiro, Pilar; Lawson, Samantha; Van den Veyver, Ignatia B; Marengo, Gabriel; Zocche, David; Murray, Stephen A; Duyzend, Michael; Robinson, Peter N; Smedley, Damian.

medRxiv ; 2024 Jan 13.

Article in English | MEDLINE | ID: mdl-38260283

ABSTRACT

Essential genes are those whose function is required for cell proliferation and/or organism survival. A gene's intolerance to loss-of-function can be allocated within a spectrum, as opposed to being considered a binary feature, since this function might be essential at different stages of development, genetic backgrounds or other contexts. Existing resources that collect and characterise the essentiality status of genes are based on proliferation assessment in human cell lines, embryonic and postnatal viability evaluation in different model organisms, or gene metrics such as intolerance to variation scores derived from human population sequencing studies. There are also several repositories available that document phenotypic annotations for rare disorders in humans such as the Online Mendelian Inheritance in Man (OMIM) and the Human Phenotype Ontology (HPO) knowledgebases. This raises the prospect of being able to use clinical data, including lethality as the most severe phenotypic manifestation, to further our characterisation of gene essentiality. Here we queried OMIM for terms related to lethality and classified all Mendelian genes into categories, according to the earliest age of death recorded for the associated disorders, from prenatal death to no reports of premature death. To showcase this curated catalogue of human essential genes, we developed the Lethal Phenotypes Portal (https://lethalphenotypes.research.its.qmul.ac.uk), where we also explore the relationships between these lethality categories, constraint metrics and viability in cell lines and mouse. Further analysis of the genes in these categories reveals differences in the mode of inheritance of the associated disorders, physiological systems affected and disease class. We highlight how the phenotypic similarity between genes in the same lethality category combined with gene family/group information can be used for novel disease gene discovery. Finally, we explore the overlaps and discrepancies between the lethal phenotypes observed in mouse and human and discuss potential explanations that include differences in transcriptional regulation, functional compensation and molecular disease mechanisms. We anticipate that this resource will aid clinicians in the diagnosis of early lethal conditions and assist researchers in investigating the properties that make these genes essential for human development.

8.

Improving prenatal diagnosis through standards and aggregation.

Duyzend, Michael H; Cacheiro, Pilar; Jacobsen, Julius O B; Giordano, Jessica; Brand, Harrison; Wapner, Ronald J; Talkowski, Michael E; Robinson, Peter N; Smedley, Damian.

Prenat Diagn ; 44(4): 454-464, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38242839

ABSTRACT

Advances in sequencing and imaging technologies enable enhanced assessment in the prenatal space, with a goal to diagnose and predict the natural history of disease, to direct targeted therapies, and to implement clinical management, including transfer of care, election of supportive care, and selection of surgical interventions. The current lack of standardization and aggregation stymies variant interpretation and gene discovery, which hinders the provision of prenatal precision medicine, leaving clinicians and patients without an accurate diagnosis. With large amounts of data generated, it is imperative to establish standards for data collection, processing, and aggregation. Aggregated and homogeneously processed genetic and phenotypic data permits dissection of the genomic architecture of prenatal presentations of disease and provides a dataset on which data analysis algorithms can be tuned to the prenatal space. Here we discuss the importance of generating aggregate data sets and how the prenatal space is driving the development of interoperable standards and phenotype-driven tools.

Subject(s)

Precision Medicine , Prenatal Diagnosis , Pregnancy , Female , Humans , Phenotype , Genomics , Algorithms

9.

An evaluation of GPT models for phenotype concept recognition.

Groza, Tudor; Caufield, Harry; Gration, Dylan; Baynam, Gareth; Haendel, Melissa A; Robinson, Peter N; Mungall, Christopher J; Reese, Justin T.

BMC Med Inform Decis Mak ; 24(1): 30, 2024 Jan 31.

Article in English | MEDLINE | ID: mdl-38297371

ABSTRACT

OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (supported usually by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other clinical observations. RESULTS: The best run, using in-context learning, achieved 0.58 document-level F1 score on publication abstracts and 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses the state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.

Subject(s)

Knowledge , Language , Humans , Machine Learning , Phenotype , Rare Diseases
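
The document-level F1 score reported in this abstract can be made concrete with a small sketch. The abstract does not state the exact evaluation protocol, so the version below assumes micro-averaging over documents and treats each document's annotations as a set of concept identifiers; the function name and the input layout are illustrative assumptions.

```python
def document_level_f1(gold, predicted):
    """Micro-averaged F1 over documents.

    gold, predicted: dicts mapping a document ID to a set of concept IDs.
    A concept counts once per document, however often it is mentioned.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | predicted.keys():
        g = gold.get(doc_id, set())
        p = predicted.get(doc_id, set())
        tp += len(g & p)   # concepts found and expected
        fp += len(p - g)   # concepts found but not expected
        fn += len(g - p)   # concepts expected but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A mention-level score would instead compare individual text spans, so the same system can score differently on the two measures.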

10.

On the limitations of large language models in clinical diagnosis.

Reese, Justin T; Danis, Daniel; Caufield, J Harry; Groza, Tudor; Casiraghi, Elena; Valentini, Giorgio; Mungall, Christopher J; Robinson, Peter N.

medRxiv ; 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-37503093

ABSTRACT

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.
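
A minimal sketch of what generating a prompt programmatically from structured terms might look like is given here. The template wording, the function name, and the parameters are assumptions for illustration and are not the prompts used in the study; the point is only that the prompt is built from controlled-vocabulary labels rather than from the narrative text.

```python
def build_prompt(phenotypes, comorbidities=(), treatments=(), labs=()):
    """Assemble a diagnosis prompt from structured term labels only.

    No free text from the source narrative is copied, so the prompt
    contains only the labels passed in by the caller.
    """
    if not phenotypes:
        raise ValueError("at least one phenotypic abnormality is required")
    lines = ["Phenotypic abnormalities: " + "; ".join(phenotypes) + "."]
    if comorbidities:
        lines.append("Comorbidities: " + "; ".join(comorbidities) + ".")
    if treatments:
        lines.append("Treatments: " + "; ".join(treatments) + ".")
    if labs:
        lines.append("Laboratory tests: " + "; ".join(labs) + ".")
    lines.append("List the most likely diagnoses, ranked from most to least likely.")
    return "\n".join(lines)
```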

11.

Computing Minimal Boolean Models of Gene Regulatory Networks.

Karlebach, Guy; Robinson, Peter N.

J Comput Biol ; 31(2): 117-127, 2024 Feb.

Article in English | MEDLINE | ID: mdl-37889991

ABSTRACT

Models of gene regulatory networks (GRNs) capture the dynamics of the regulatory processes that occur within the cell as a means to understanding the variability observed in gene expression between different conditions. Arguably the simplest mathematical construct used for modeling is the Boolean network, which dictates a set of logical rules for transition between states described as Boolean vectors. Due to the complexity of gene regulation and the limitations of experimental technologies, in most cases knowledge about regulatory interactions and Boolean states is partial. In addition, the logical rules themselves are not known a priori. Our goal in this work is to create an algorithm that finds the network that fits the data optimally, and identify the network states that correspond to the noise-free data. We present a novel methodology for integrating experimental data and performing a search for the optimal consistent structure via optimization of a linear objective function under a set of linear constraints. In addition, we extend our methodology into a heuristic that alleviates the computational complexity of the problem for datasets that are generated by single-cell RNA-Sequencing (scRNA-Seq). We demonstrate the effectiveness of these tools using simulated data, and in addition a publicly available scRNA-Seq dataset and the GRN that is associated with it. Our methodology will enable researchers to obtain a better understanding of the dynamics of GRNs and their biological role.

Subject(s)

Algorithms , Gene Regulatory Networks , Gene Expression Regulation
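
To illustrate the Boolean network construct described in this abstract, the sketch below defines logical rules for a hypothetical three-gene network, applies one synchronous transition between Boolean state vectors, and enumerates the steady states by brute force. It does not implement the paper's optimization over linear constraints; the genes and rules are invented for illustration.

```python
from itertools import product

# Hypothetical three-gene network; the rules are invented for illustration.
RULES = {
    "A": lambda s: s["A"],                  # A sustains itself
    "B": lambda s: s["A"] and not s["C"],   # B needs A and the absence of C
    "C": lambda s: not s["A"],              # C is repressed by A
}

def step(state, rules=RULES):
    """Synchronous update: each gene's next value is computed from the current state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def fixed_points(rules=RULES):
    """Enumerate every Boolean state and keep those that map to themselves."""
    genes = sorted(rules)
    result = []
    for values in product([False, True], repeat=len(genes)):
        state = dict(zip(genes, values))
        if step(state, rules) == state:
            result.append(state)
    return result
```

Exhaustive enumeration grows as 2^n in the number of genes, which is one reason larger networks call for optimization-based or heuristic methods such as those the abstract describes.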

12.

The Monarch Initiative in 2024: an analytic platform integrating phenotypes, genes and diseases across species.

Putman, Tim E; Schaper, Kevin; Matentzoglu, Nicolas; Rubinetti, Vincent P; Alquaddoomi, Faisal S; Cox, Corey; Caufield, J Harry; Elsarboukh, Glass; Gehrke, Sarah; Hegde, Harshad; Reese, Justin T; Braun, Ian; Bruskiewich, Richard M; Cappelletti, Luca; Carbon, Seth; Caron, Anita R; Chan, Lauren E; Chute, Christopher G; Cortes, Katherina G; De Souza, Vinícius; Fontana, Tommaso; Harris, Nomi L; Hartley, Emily L; Hurwitz, Eric; Jacobsen, Julius O B; Krishnamurthy, Madan; Laraway, Bryan J; McLaughlin, James A; McMurry, Julie A; Moxon, Sierra A T; Mullen, Kathleen R; O'Neil, Shawn T; Shefchek, Kent A; Stefancsik, Ray; Toro, Sabrina; Vasilevsky, Nicole A; Walls, Ramona L; Whetzel, Patricia L; Osumi-Sutherland, David; Smedley, Damian; Robinson, Peter N; Mungall, Christopher J; Haendel, Melissa A; Munoz-Torres, Monica C.

Nucleic Acids Res ; 52(D1): D938-D949, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38000386

ABSTRACT

Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.

Subject(s)

Databases, Factual , Disease , Genes , Phenotype , Humans , Internet , Databases, Factual/standards , Software , Genes/genetics , Disease/genetics

13.

Toward robust clinical genome interpretation: Developing a consistent terminology to characterize Mendelian disease-gene relationships-allelic requirement, inheritance modes, and disease mechanisms.

Roberts, Angharad M; DiStefano, Marina T; Riggs, Erin Rooney; Josephs, Katherine S; Alkuraya, Fowzan S; Amberger, Joanna; Amin, Mutaz; Berg, Jonathan S; Cunningham, Fiona; Eilbeck, Karen; Firth, Helen V; Foreman, Julia; Hamosh, Ada; Hay, Eleanor; Leigh, Sarah; Martin, Christa L; McDonagh, Ellen M; Perrett, Daniel; Ramos, Erin M; Robinson, Peter N; Rath, Ana; Sant, David W; Stark, Zornitza; Whiffin, Nicola; Rehm, Heidi L; Ware, James S.

Genet Med ; 26(2): 101029, 2024 Feb.

Article in English | MEDLINE | ID: mdl-37982373

ABSTRACT

PURPOSE: The terminology used for gene-disease curation and variant annotation to describe inheritance, allelic requirement, and both sequence and functional consequences of a variant is currently not standardized. There is considerable discrepancy in the literature and across clinical variant reporting in the derivation and application of terms. Here, we standardize the terminology for the characterization of disease-gene relationships to facilitate harmonized global curation and to support variant classification within the ACMG/AMP framework. METHODS: Terminology for inheritance, allelic requirement, and both structural and functional consequences of a variant used by Gene Curation Coalition members and partner organizations was collated and reviewed. Harmonized terminology with definitions and use examples was created, reviewed, and validated. RESULTS: We present a standardized terminology to describe gene-disease relationships, and to support variant annotation. We demonstrate application of the terminology for classification of variation in the ACMG SF 2.0 genes recommended for reporting of secondary findings. Consensus terms were agreed and formalized in both Sequence Ontology (SO) and Human Phenotype Ontology (HPO) ontologies. Gene Curation Coalition member groups intend to use or map to these terms in their respective resources. CONCLUSION: The terminology standardization presented here will improve harmonization, facilitate the pooling of curation datasets across international curation efforts and, in turn, improve consistency in variant classification and genetic test interpretation.

Subject(s)

Genetic Testing , Genetic Variation , Humans , Alleles , Databases, Genetic

14.

The promises of large language models for protein design and modeling.

Valentini, Giorgio; Malchiodi, Dario; Gliozzo, Jessica; Mesiti, Marco; Soto-Gomez, Mauricio; Cabri, Alberto; Reese, Justin; Casiraghi, Elena; Robinson, Peter N.

Front Bioinform ; 3: 1304099, 2023.

Article in English | MEDLINE | ID: mdl-38076030

ABSTRACT

The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the "language of proteins" invite the application and adaptation of LLMs to protein modeling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have already been trained to accurately predict protein properties and to generate novel, functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.

15.

Term-BLAST-like alignment tool for concept recognition in noisy clinical texts.

Groza, Tudor; Wu, Honghan; Dinger, Marcel E; Danis, Daniel; Hilton, Coleman; Bagley, Anita; Davids, Jon R; Luo, Ling; Lu, Zhiyong; Robinson, Peter N.

Bioinformatics ; 39(12), 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38001031

ABSTRACT

MOTIVATION: Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts. RESULTS: Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT), leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show an increase of 10% in recall on scientific publications and a 20% increase in recall on EHR records (when compared against the next best method), hence supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches. AVAILABILITY AND IMPLEMENTATION: Fenominal is a Java library that implements TBLAT for named CR of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.

Subject(s)

Algorithms , Language , Humans , Sequence Alignment , Electronic Health Records , Publications
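
The k-mer screening step mentioned in this abstract can be sketched as follows. This is not the TBLAT implementation in Fenominal (a Java library); it is a simplified Python illustration with assumed function names, an assumed k of 3, and an assumed threshold, and it omits the error-pattern scoring that the abstract describes.

```python
from collections import Counter

def kmer_counts(text, k=3):
    """Multiset of overlapping character k-mers in a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def shared_kmers(a, b, k=3):
    """Number of k-mers two strings have in common (multiset intersection)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())

def screen(query, labels, k=3, min_fraction=0.5):
    """Return labels sharing at least min_fraction of the query's k-mers, best first."""
    total = sum(kmer_counts(query, k).values())
    if total == 0:
        return []  # query shorter than k: nothing to match on
    scored = [(shared_kmers(query, label, k), label) for label in labels]
    keep = [(n, label) for n, label in scored if n / total >= min_fraction]
    return [label for n, label in sorted(keep, key=lambda x: (-x[0], x[1]))]
```

A misspelled token such as "seizurs" still shares most of its 3-mers with the label "Seizure", so it survives the screen and can be passed on to a more careful scoring stage.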

16.

The Medical Action Ontology: A tool for annotating and analyzing treatments and clinical management of human disease.

Carmody, Leigh C; Gargano, Michael A; Toro, Sabrina; Vasilevsky, Nicole A; Adam, Margaret P; Blau, Hannah; Chan, Lauren E; Gomez-Andres, David; Horvath, Rita; Kraus, Megan L; Ladewig, Markus S; Lewis-Smith, David; Lochmüller, Hanns; Matentzoglu, Nicolas A; Munoz-Torres, Monica C; Schuetz, Catharina; Seitz, Berthold; Similuk, Morgan N; Sparks, Teresa N; Strauss, Timmy; Swietlik, Emilia M; Thompson, Rachel; Zhang, Xingmin Aaron; Mungall, Christopher J; Haendel, Melissa A; Robinson, Peter N.

Med ; 4(12): 913-927.e3, 2023 Dec 08.

Article in English | MEDLINE | ID: mdl-37963467

ABSTRACT

BACKGROUND: Navigating the clinical literature to determine the optimal clinical management for rare diseases presents significant challenges. We introduce the Medical Action Ontology (MAxO), an ontology specifically designed to organize medical procedures, therapies, and interventions. METHODS: MAxO incorporates logical structures that link MAxO terms to numerous other ontologies within the OBO Foundry. Term development involves a blend of manual and semi-automated processes. Additionally, we have generated annotations detailing diagnostic modalities for specific phenotypic abnormalities defined by the Human Phenotype Ontology (HPO). We introduce a web application, POET, that facilitates MAxO annotations for specific medical actions for diseases using the Mondo Disease Ontology. FINDINGS: MAxO encompasses 1,757 terms spanning a wide range of biomedical domains, from human anatomy and investigations to the chemical and protein entities involved in biological processes. These terms annotate phenotypic features associated with specific diseases (using HPO and Mondo). Presently, there are over 16,000 MAxO diagnostic annotations that target HPO terms. Through POET, we have created 413 MAxO annotations specifying treatments for 189 rare diseases. CONCLUSIONS: MAxO offers a computational representation of treatments and other actions taken for the clinical management of patients. Its development is closely coupled to Mondo and HPO, broadening the scope of our computational modeling of diseases and phenotypic features. We invite the community to contribute disease annotations using POET (https://poet.jax.org/). MAxO is available under the open-source CC-BY 4.0 license (https://github.com/monarch-initiative/MAxO). FUNDING: NHGRI 1U24HG011449-01A1 and NHGRI 5RM1HG010860-04.

Subject(s)

Biological Ontologies , Humans , Rare Diseases , Software , Computer Simulation

17.

Predictive models of long COVID.

Antony, Blessy; Blau, Hannah; Casiraghi, Elena; Loomba, Johanna J; Callahan, Tiffany J; Laraway, Bryan J; Wilkins, Kenneth J; Antonescu, Corneliu C; Valentini, Giorgio; Williams, Andrew E; Robinson, Peter N; Reese, Justin T; Murali, T M.

EBioMedicine ; 96: 104777, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37672869

ABSTRACT

BACKGROUND: The cause and symptoms of long COVID are poorly understood. It is challenging to predict whether a given COVID-19 patient will develop long COVID in the future. METHODS: We used electronic health record (EHR) data from the National COVID Cohort Collaborative to predict the incidence of long COVID. We trained two machine learning (ML) models - logistic regression (LR) and random forest (RF). Features used to train predictors included symptoms and drugs ordered during acute infection, measures of COVID-19 treatment, pre-COVID comorbidities, and demographic information. We assigned the 'long COVID' label to patients diagnosed with the U09.9 ICD10-CM code. The cohorts included patients with (a) EHRs reported from data partners using U09.9 ICD10-CM code and (b) at least one EHR in each feature category. We analysed three cohorts: all patients (n = 2,190,579; diagnosed with long COVID = 17,036), inpatients (149,319; 3,295), and outpatients (2,041,260; 13,741). FINDINGS: LR and RF models yielded median AUROC of 0.76 and 0.75, respectively. Ablation study revealed that drugs had the highest influence on the prediction task. The SHAP method identified age, gender, cough, fatigue, albuterol, obesity, diabetes, and chronic lung disease as explanatory features. Models trained on data from one N3C partner and tested on data from the other partners had average AUROC of 0.75. INTERPRETATION: ML-based classification using EHR information from the acute infection period is effective in predicting long COVID. SHAP methods identified important features for prediction. Cross-site analysis demonstrated the generalizability of the proposed methodology. FUNDING: NCATS U24 TR002306, NCATS UL1 TR003015, Axle Informatics Subcontract: NCATS-P00438-B, NIH/NIDDK/OD, PSR2015-1720GVALE_01, G43C22001320007, and Director, Office of Science, Office of Basic Energy Sciences of the U.S. Department of Energy Contract No. DE-AC02-05CH11231.
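For readers unfamiliar with the AUROC values quoted in this abstract, the metric can be sketched in a few lines of Python. This is a generic pairwise definition, not the authors' evaluation code; the function name and the toy labels and scores are invented for illustration.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = long COVID label, 0 = no label.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

Under this reading, an AUROC of about 0.75 means the model ranks a patient who went on to receive the long COVID label above one who did not in roughly three of four such pairs. The quadratic loop is only suitable for small examples; cohort-scale data would call for a rank-based computation.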

Subject(s)

COVID-19 , Post-Acute COVID-19 Syndrome , Humans , COVID-19 Drug Treatment , Machine Learning , Obesity

18.

Integration of EpiSign, facial phenotyping, and likelihood ratio interpretation of clinical abnormalities in the re-classification of an ARID1B missense variant.

Forwood, Caitlin; Ashton, Katie; Zhu, Ying; Zhang, Futao; Dias, Kerith-Rae; Standen, Krystle; Evans, Carey-Anne; Carey, Louise; Cardamone, Michael; Shalhoub, Carolyn; Katf, Hala; Riveros, Carlos; Hsieh, Tzung-Chien; Krawitz, Peter; Robinson, Peter N; Dudding-Byth, Tracy; Sadikovic, Bekim; Pinner, Jason; Buckley, Michael F; Roscioli, Tony.

Am J Med Genet C Semin Med Genet ; 193(3): e32056, 2023 09.

Article in English | MEDLINE | ID: mdl-37654076

ABSTRACT

Heterozygous ARID1B variants result in Coffin-Siris syndrome. Features may include hypoplastic nails, slow growth, characteristic facial features, hypotonia, hypertrichosis, and sparse scalp hair. Most reported cases are due to ARID1B loss of function variants. We report a boy with developmental delay, feeding difficulties, aspiration, recurrent respiratory infections, slow growth, and hypotonia without a clinical diagnosis, where a previously unreported ARID1B missense variant was classified as a variant of uncertain significance. The pathogenicity of this variant was refined through combined methodologies including genome-wide methylation signature analysis (EpiSign), machine learning (ML) facial phenotyping, and LIRICAL. Trio exome sequencing and EpiSign were performed. ML facial phenotyping compared facial images using FaceMatch and GestaltMatcher to syndrome-specific libraries to prioritize the trio exome bioinformatic pipeline gene list output. Phenotype-driven variant prioritization was performed with LIRICAL. A de novo heterozygous missense variant, ARID1B p.(Tyr1268His), was reported as a variant of uncertain significance. The ACMG classification was refined to likely pathogenic by a supportive methylation signature, ML facial phenotyping, and prioritization through LIRICAL. The ARID1B genotype-phenotype has been expanded through an extended analysis of missense variation through genome-wide methylation signatures, ML facial phenotyping, and likelihood-ratio gene prioritization.
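The likelihood-ratio interpretation mentioned in this abstract rests on a standard piece of Bayesian arithmetic: posttest odds equal pretest odds multiplied by the likelihood ratio of each independent finding. The Python sketch below shows only that general calculation. It is not LIRICAL's code, and the pretest probability and likelihood ratios are invented numbers.

```python
import math


def posttest_probability(pretest_probability: float,
                         likelihood_ratios: list[float]) -> float:
    """Combine a pretest probability with likelihood ratios of findings.

    Assumes the findings are conditionally independent, so their
    likelihood ratios can simply be multiplied.
    """
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = pretest_odds * math.prod(likelihood_ratios)
    return posttest_odds / (1.0 + posttest_odds)


# Hypothetical example: two supporting findings (LR 10 and 5) and one
# finding that argues against the diagnosis (LR 0.5).
example = posttest_probability(0.01, [10.0, 5.0, 0.5])
```

A likelihood ratio above 1 raises the probability of the candidate diagnosis and a ratio below 1 lowers it, which is why each clinical abnormality can be inspected for its individual contribution.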

Subject(s)

Abnormalities, Multiple , Hand Deformities, Congenital , Intellectual Disability , Micrognathism , Male , Humans , DNA-Binding Proteins/genetics , Muscle Hypotonia/pathology , Transcription Factors/genetics , Face/pathology , Abnormalities, Multiple/diagnosis , Micrognathism/genetics , Intellectual Disability/pathology , Hand Deformities, Congenital/genetics , Neck/pathology

19.

De novo TRPM3 missense variant associated with neurodevelopmental delay and manifestations of cerebral palsy.

Sundaramurthi, Jagadish Chandrabose; Bagley, Anita M; Blau, Hannah; Carmody, Leigh; Crandall, Amy; Danis, Daniel; Gargano, Michael A; Gustafson, Anxhela Gjyshi; Raney, Ellen M; Shingle, Mallory; Davids, Jon R; Robinson, Peter N.

Cold Spring Harb Mol Case Stud ; 9(4), 2023 Dec.

Article in English | MEDLINE | ID: mdl-37684057

ABSTRACT

We identified a de novo heterozygous transient receptor potential cation channel subfamily M (melastatin) member 3 (TRPM3) missense variant, p.(Asn1126Asp), in a patient with developmental delay and manifestations of cerebral palsy (CP) using phenotype-driven prioritization analysis of whole-genome sequencing data with Exomiser. The variant is localized in the functionally important ion transport domain of the TRPM3 protein and predicted to impact the protein structure. Our report adds TRPM3 to the list of Mendelian disease-associated genes that can be associated with CP and provides further evidence for the pathogenicity of the variant p.(Asn1126Asp).

Subject(s)

Cerebral Palsy , Intellectual Disability , Nervous System Malformations , TRPM Cation Channels , Humans , Cerebral Palsy/genetics , Intellectual Disability/genetics , Mutation, Missense/genetics , Phenotype , TRPM Cation Channels/genetics

20.

The effects of pathogenic and likely pathogenic variants for inherited hemostasis disorders in 140 214 UK Biobank participants.

Stefanucci, Luca; Collins, Janine; Sims, Matthew C; Barrio-Hernandez, Inigo; Sun, Luanluan; Burren, Oliver S; Perfetto, Livia; Bender, Isobel; Callahan, Tiffany J; Fleming, Kathryn; Guerrero, Jose A; Hermjakob, Henning; Martin, Maria J; Stephenson, James; Paneerselvam, Kalpana; Petrovski, Slavé; Porras, Pablo; Robinson, Peter N; Wang, Quanli; Watkins, Xavier; Frontini, Mattia; Laskowski, Roman A; Beltrao, Pedro; Di Angelantonio, Emanuele; Gomez, Keith; Laffan, Mike; Ouwehand, Willem H; Mumford, Andrew D; Freson, Kathleen; Carss, Keren; Downes, Kate; Gleadall, Nick; Megy, Karyn; Bruford, Elspeth; Vuckovic, Dragana.

Blood ; 142(24): 2055-2068, 2023 12 14.

Article in English | MEDLINE | ID: mdl-37647632

ABSTRACT

Rare genetic diseases affect millions, and identifying causal DNA variants is essential for patient care. Therefore, it is imperative to estimate the effect of each independent variant and improve their pathogenicity classification. Our study of 140 214 unrelated UK Biobank (UKB) participants found that each of them carries a median of 7 variants previously reported as pathogenic or likely pathogenic. We focused on 967 diagnostic-grade gene (DGG) variants for rare bleeding, thrombotic, and platelet disorders (BTPDs) observed in 12 367 UKB participants. By association analysis, for a subset of these variants, we estimated effect sizes for platelet count and volume, and odds ratios for bleeding and thrombosis. Variants causal of some autosomal recessive platelet disorders revealed phenotypic consequences in carriers. Loss-of-function variants in MPL, which cause chronic amegakaryocytic thrombocytopenia if biallelic, were unexpectedly associated with increased platelet counts in carriers. We also demonstrated that common variants identified by genome-wide association studies (GWAS) for platelet count or thrombosis risk may influence the penetrance of rare variants in BTPD DGGs on their associated hemostasis disorders. Network-propagation analysis applied to an interactome of 18 410 nodes and 571 917 edges showed that GWAS variants with large effect sizes are enriched in DGGs and their first-order interactors. Finally, we illustrate the modifying effect of polygenic scores for platelet count and thrombosis risk on disease severity in participants carrying rare variants in TUBB1 or PROC and PROS1, respectively. Our findings demonstrate the power of association analyses using large population datasets in improving pathogenicity classifications of rare variants.

Subject(s)

Genome-Wide Association Study , Thrombosis , Humans , Biological Specimen Banks , Hemostasis , Hemorrhage/genetics , Rare Diseases
