1.
Nucleic Acids Res ; 52(D1): D938-D949, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38000386

ABSTRACT

Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.


Subjects
Databases, Factual , Disease , Genes , Phenotype , Humans , Internet , Databases, Factual/standards , Software , Genes/genetics , Disease/genetics
2.
Nucleic Acids Res ; 52(D1): D1333-D1346, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953324

ABSTRACT

The Human Phenotype Ontology (HPO) is a widely used resource that comprehensively organizes and defines the phenotypic features of human disease, enabling computational inference and supporting genomic and phenotypic analyses through semantic similarity and machine learning algorithms. The HPO has widespread applications in clinical diagnostics and translational research, including genomic diagnostics, gene-disease discovery, and cohort analytics. In recent years, groups around the world have developed translations of the HPO from English to other languages, and the HPO browser has been internationalized, allowing users to view HPO term labels and in many cases synonyms and definitions in ten languages in addition to English. Since our last report, a total of 2239 new HPO terms and 49235 new HPO annotations were developed, many in collaboration with external groups in the fields of psychiatry, arthrogryposis, immunology and cardiology. The Medical Action Ontology (MAxO) is a new effort to model treatments and other measures taken for clinical management. Finally, the HPO consortium is contributing to efforts to integrate the HPO and the GA4GH Phenopacket Schema into electronic health records (EHRs) with the goal of more standardized and computable integration of rare disease data in EHRs.


Subjects
Biological Ontologies , Humans , Phenotype , Genomics , Algorithms , Rare Diseases
3.
Bioinformatics ; 40(3)2024 03 04.
Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.


Subjects
Knowledge Bases , Semantics , Databases, Factual
4.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37389415

ABSTRACT

MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.


Subjects
Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning
5.
J Biol Chem ; 296: 100700, 2021.
Article in English | MEDLINE | ID: mdl-33895137

ABSTRACT

YhcB, a poorly understood protein conserved across gamma-proteobacteria, contains a domain of unknown function (DUF1043) and an N-terminal transmembrane domain. Here, we used an integrated approach including X-ray crystallography, genetics, and molecular biology to investigate the function and structure of YhcB. The Escherichia coli yhcB KO strain does not grow at 45 °C and is hypersensitive to cell wall-acting antibiotics, even in the stationary phase. The deletion of yhcB leads to filamentation, abnormal FtsZ ring formation, and aberrant septum development. The Z-ring is essential for the positioning of the septa and the initiation of cell division. We found that YhcB interacts with proteins of the divisome (e.g., FtsI, FtsQ) and elongasome (e.g., RodZ, RodA). Seven of these interactions are also conserved in Yersinia pestis and/or Vibrio cholerae. Furthermore, we mapped the amino acid residues likely involved in the interactions of YhcB with FtsI and RodZ. The 2.8 Å crystal structure of the cytosolic domain of Haemophilus ducreyi YhcB shows a unique tetrameric α-helical coiled-coil structure likely to be involved in linking the Z-ring to the septal peptidoglycan-synthesizing complexes. In summary, YhcB is a conserved and conditionally essential protein that plays a role in cell division and consequently affects envelope biogenesis. Based on these findings, we propose to rename YhcB to ZapG (Z-ring-associated protein G). This study will serve as a starting point for future studies on this protein family and on how cells transit from exponential to stationary survival.


Subjects
Bacterial Proteins/metabolism , Peptidoglycan/biosynthesis , Proteobacteria/cytology , Proteobacteria/metabolism , Bacterial Proteins/chemistry , Cell Division , Crystallography, X-Ray , Models, Molecular , Protein Conformation
6.
J Proteome Res ; 20(5): 2182-2186, 2021 05 07.
Article in English | MEDLINE | ID: mdl-33719446

ABSTRACT

Proteomics is, by definition, comprehensive and large-scale, seeking to unravel ome-level protein features with phenotypic information on an entire system, an organ, cells, or organisms. This scope consistently involves and extends beyond single experiments. Multitudinous resources now exist to assist in making the results of proteomics experiments more findable, accessible, interoperable, and reusable (FAIR), yet many tools await adoption by our community. Here we highlight strategies for expanding the impact of proteomics data beyond single studies. We show how linking specific terminologies, identifiers, and text (words) can unify individual data points across a wide spectrum of studies and, more importantly, how this approach may potentially reveal novel relationships. In this effort, we explain how data sets and methods can be rendered more linkable and how this maximizes their value. We also include a discussion on how data linking strategies benefit stakeholders across the proteomics community and beyond.


Subjects
Proteomics
7.
Mol Cell Proteomics ; 17(5): 961-973, 2018 05.
Article in English | MEDLINE | ID: mdl-29414760

ABSTRACT

Helicobacter pylori is a common pathogen that is estimated to infect half of the human population, causing several diseases such as duodenal ulcer. Despite being one of the first pathogens to be sequenced, its proteome remains poorly characterized as about one-third of its proteins have no functional annotation. Here, we integrate and analyze known protein interactions with proteomic and genomic data from different sources. We find that proteins with similar abundances tend to interact. Such an observation is accompanied by a trend of interactions to appear between proteins of similar functions, although some show marked cross-talk to others. Protein function prediction with protein interactions is significantly improved when interactions from other bacteria are included in our network, allowing us to obtain putative functions of more than 300 poorly or previously uncharacterized proteins. Proteins that are critical for the topological controllability of the underlying network are significantly enriched with genes that are up-regulated in the spiral compared with the coccoid form of H. pylori. Determining their evolutionary conservation, we present evidence that 80 protein complexes are identical in composition with their counterparts in Escherichia coli, while 85 are partially conserved and 120 complexes are completely absent. Furthermore, we determine network clusters that coincide with related functions, gene essentiality, genetic context, cellular localization, and gene expression in different cellular states.


Subjects
Bacterial Proteins/metabolism , Helicobacter pylori/metabolism , Protein Interaction Maps , Proteome/metabolism , Proteomics/methods , Gene Expression Regulation , Genome, Bacterial , Helicobacter pylori/genetics , Models, Molecular , Multiprotein Complexes/metabolism , Operon/genetics , Phenotype
8.
BMC Bioinformatics ; 18(1): 171, 2017 Mar 16.
Article in English | MEDLINE | ID: mdl-28298180

ABSTRACT

BACKGROUND: Protein-protein interactions (PPIs) can offer compelling evidence for protein function, especially when viewed in the context of proteome-wide interactomes. Bacteria have been popular subjects of interactome studies: more than six different bacterial species have been the subjects of comprehensive interactome studies while several more have had substantial segments of their proteomes screened for interactions. The protein interactomes of several bacterial species have been completed, including several from prominent human pathogens. The availability of interactome data has brought challenges, as these large data sets are difficult to compare across species, limiting their usefulness for broad studies of microbial genetics and evolution. RESULTS: In this study, we use more than 52,000 unique protein-protein interactions (PPIs) across 349 different bacterial species and strains to determine their conservation across data sets and taxonomic groups. When proteins are collapsed into orthologous groups (OGs) the resulting meta-interactome still includes more than 43,000 interactions, about 14,000 of which involve proteins of unknown function. While conserved interactions provide support for protein function in their respective species data, we found only 429 PPIs (~1% of the available data) conserved in two or more species, rendering any cross-species interactome comparison immediately useful. The meta-interactome serves as a model for predicting interactions, protein functions, and even full interactome sizes for species with limited to no experimentally observed PPI, including Bacillus subtilis and Salmonella enterica which are predicted to have up to 18,000 and 31,000 PPIs, respectively. 
CONCLUSIONS: In the course of this work, we have assembled cross-species interactome comparisons that will allow interactomics researchers to anticipate the structures of yet-unexplored microbial interactomes and to focus on well-conserved yet uncharacterized interactors for further study. Such conserved interactions should provide evidence for important but yet-uncharacterized aspects of bacterial physiology and may provide targets for anti-microbial therapies.
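
The collapse of protein-level interactions into orthologous groups (OGs) described in this abstract can be illustrated with a minimal sketch. It assumes a protein-to-OG mapping is already available; the function name `conserved_og_interactions`, the protein IDs, and the OG labels are hypothetical, and this is not the authors' pipeline.

```python
from collections import defaultdict


def conserved_og_interactions(ppis_by_species, protein_to_og, min_species=2):
    """Collapse PPIs into OG-level pairs and find pairs seen in several species.

    ppis_by_species: dict mapping species name -> list of (protein_a, protein_b).
    protein_to_og:   dict mapping protein ID -> orthologous group ID (assumed given).
    Returns (meta_interactome, conserved) as sets of sorted OG pairs.
    """
    species_per_pair = defaultdict(set)
    for species, ppis in ppis_by_species.items():
        for prot_a, prot_b in ppis:
            og_a = protein_to_og.get(prot_a)
            og_b = protein_to_og.get(prot_b)
            if og_a is None or og_b is None:
                continue  # proteins without an OG assignment cannot be compared
            # Interactions are treated as undirected, so the pair is sorted.
            species_per_pair[tuple(sorted((og_a, og_b)))].add(species)

    meta_interactome = set(species_per_pair)
    conserved = {
        pair for pair, seen_in in species_per_pair.items()
        if len(seen_in) >= min_species
    }
    return meta_interactome, conserved


# Toy data with hypothetical identifiers.
ppis_by_species = {
    "E. coli": [("ecA", "ecB"), ("ecB", "ecC")],
    "B. subtilis": [("bsA", "bsB")],
    "H. pylori": [("hpC", "hpD")],
}
protein_to_og = {
    "ecA": "OG1", "ecB": "OG2", "ecC": "OG3",
    "bsA": "OG2", "bsB": "OG1",
    "hpC": "OG3", "hpD": "OG4",
}
meta, conserved = conserved_og_interactions(ppis_by_species, protein_to_og)
print(len(meta), sorted(conserved))
```

In the toy data only the OG1-OG2 pair is observed in two species, which mirrors the abstract's point that conserved interactions are a small fraction of the meta-interactome.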


Subjects
Bacteria/metabolism , Bacterial Proteins/metabolism , Protein Interaction Mapping/methods , Bacillus subtilis/metabolism , Bacterial Proteins/chemistry , Evolution, Molecular , Humans , Proteome/metabolism , Salmonella enterica/metabolism
9.
Mol Microbiol ; 95(2): 258-69, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25388641

ABSTRACT

Ribosomal protein L27 is a component of the eubacterial large ribosomal subunit that has been shown to play a critical role in substrate stabilization during protein synthesis. This function is mediated by the L27 N-terminus, which protrudes into the peptidyl transferase center. In this report, we demonstrate that L27 in Staphylococcus aureus and other Firmicutes is encoded with an N-terminal extension that is not present in most Gram-negative organisms and is absent from mature ribosomes. We have identified a cysteine protease, conserved among bacteria containing the L27 N-terminal extension, which performs post-translational cleavage of L27. Ribosomal biology in eubacteria has largely been studied in the Gram-negative bacterium Escherichia coli; our findings indicate that there are aspects of the basic biology of the ribosome in S. aureus and other related bacteria that differ substantially from that of the E. coli ribosome. This research lays the foundation for the development of new therapeutic approaches that target this novel pathway.


Subjects
Cysteine Proteases/metabolism , Protein Processing, Post-Translational , Ribosomal Proteins/chemistry , Ribosomal Proteins/metabolism , Ribosomes/metabolism , Staphylococcus aureus/metabolism , Amino Acid Sequence , Computational Biology , Cysteine Proteases/genetics , Escherichia coli/genetics , Mass Spectrometry , Models, Molecular , Molecular Sequence Data , Phylogeny , Protein Biosynthesis , Ribosomal Proteins/genetics , Sequence Homology, Amino Acid , Staphylococcus aureus/genetics
10.
PLoS Comput Biol ; 11(2): e1004107, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25723151

ABSTRACT

Large-scale analyses of protein complexes have recently become available for Escherichia coli and Mycoplasma pneumoniae, yielding 443 and 116 heteromultimeric soluble protein complexes, respectively. We have coupled the results of these mass spectrometry-characterized protein complexes with the 285 "gold standard" protein complexes identified by EcoCyc. A comparison with databases of gene orthology, conservation, and essentiality identified proteins conserved or lost in complexes of other species. For instance, of 285 "gold standard" protein complexes in E. coli, less than 10% are fully conserved among a set of 7 distantly-related bacterial "model" species. Complex conservation follows one of three models: well-conserved complexes, complexes with a conserved core, and complexes with partial conservation but no conserved core. Expanding the comparison to 894 distinct bacterial genomes illustrates fractional conservation and the limits of co-conservation among components of protein complexes: just 14 out of 285 model protein complexes are perfectly conserved across 95% of the genomes used, yet we predict more than 180 may be partially conserved across at least half of the genomes. No clear relationship between gene essentiality and protein complex conservation is observed, as even poorly conserved complexes contain a significant number of essential proteins. Finally, we identify 183 complexes containing well-conserved components and uncharacterized proteins which will be interesting targets for future experimental studies.


Subjects
Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Multiprotein Complexes/chemistry , Multiprotein Complexes/genetics , Proteomics/methods , Escherichia coli/genetics , Genome, Bacterial/genetics , Mycoplasma pneumoniae/genetics
11.
J Bacteriol ; 197(15): 2508-16, 2015 Aug 01.
Article in English | MEDLINE | ID: mdl-25986902

ABSTRACT

Mycobacteriophages are viruses that infect mycobacterial hosts and are prevalent in the environment. Nearly 700 mycobacteriophage genomes have been completely sequenced, revealing considerable diversity and genetic novelty. Here, we have determined the protein complement of mycobacteriophage Giles by mass spectrometry and mapped its genome-wide protein interactome to help elucidate the roles of its 77 predicted proteins, 50% of which have no known function. About 22,000 individual yeast two-hybrid (Y2H) tests with four different Y2H vectors, followed by filtering and retest screens, resulted in 324 reproducible protein-protein interactions, including 171 (136 nonredundant) high-confidence interactions. The complete set of high-confidence interactions among Giles proteins reveals new mechanistic details and predicts functions for unknown proteins. The Giles interactome is the first for any mycobacteriophage and one of just five known phage interactomes so far. Our results will help in understanding mycobacteriophage biology and aid in development of new genetic and therapeutic tools to understand Mycobacterium tuberculosis. IMPORTANCE: Mycobacterium tuberculosis causes over 9 million new cases of tuberculosis each year. Mycobacteriophages, viruses of mycobacterial hosts, hold considerable potential to understand phage diversity, evolution, and mycobacterial biology, aiding in the development of therapeutic tools to control mycobacterial infections. The mycobacteriophage Giles protein-protein interaction network allows us to predict functions for unknown proteins and shed light on major biological processes in phage biology. For example, Giles gp76, a protein of unknown function, is found to associate with phage packaging and maturation. The functions of mycobacteriophage-derived proteins may suggest novel therapeutic approaches for tuberculosis. Our ORFeome clone set of Giles proteins and the interactome data will be useful resources for phage interactomics.


Subjects
Gene Expression Regulation, Viral/physiology , Mycobacteriophages/metabolism , Mycobacterium smegmatis/virology , Protein Interaction Domains and Motifs/physiology , Viral Proteins/metabolism , Computational Biology , Mass Spectrometry , Mycobacteriophages/genetics , Mycobacterium tuberculosis/virology , Protein Interaction Maps , Two-Hybrid System Techniques , Viral Proteins/genetics
12.
medRxiv ; 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-37503093

ABSTRACT

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.

13.
medRxiv ; 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-39108510

ABSTRACT

Large language models (LLMs) have shown great promise in supporting differential diagnosis, but the 23 available published studies on diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and evaluated LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
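
The ranking metrics reported in this abstract (mean reciprocal rank and top-k placement) can be sketched as follows. This is a minimal illustration rather than the study's software: the function names and the example ranks are hypothetical, and scoring a case whose correct diagnosis is absent from the response as a reciprocal rank of 0 is an assumption. As a check on the reported figures, 1/0.24 is about 4.17, consistent with the stated rank of 4.2.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all cases; ranks are 1-based.

    Assumption: a case where the correct diagnosis was not returned
    (rank is None) contributes 0 to the sum.
    """
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


def top_k_fraction(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose correct diagnosis is within the first k ranks."""
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Hypothetical ranks for eight cases; None means the diagnosis was not listed.
example_ranks = [1, 3, None, 2, None, 10, 1, None]
mrr = mean_reciprocal_rank(example_ranks)
top1 = top_k_fraction(example_ranks, 1)
top3 = top_k_fraction(example_ranks, 3)
top10 = top_k_fraction(example_ranks, 10)
print(f"MRR={mrr:.3f} top1={top1:.3f} top3={top3:.3f} top10={top10:.3f}")
```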

14.
medRxiv ; 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39228707

ABSTRACT

Structured representations of clinical data can support computational analysis of individuals and cohorts, and ontologies representing disease entities and phenotypic abnormalities are now commonly used for translational research. The Medical Action Ontology (MAxO) provides a computational representation of treatments and other actions taken for the clinical management of patients. Currently, manual biocuration is used to assign MAxO terms to rare diseases, enabling clinical management of rare diseases to be described computationally for use in clinical decision support and mechanism discovery. However, it is challenging to scale manual curation to comprehensively capture information about medical actions for the more than 10,000 rare diseases. We present AutoMAxO, a semi-automated workflow that leverages Large Language Models (LLMs) to streamline MAxO biocuration for rare diseases. AutoMAxO first uses LLMs to retrieve candidate curations from abstracts of relevant publications. Next, the candidate curations are matched to ontology terms from MAxO, Human Phenotype Ontology (HPO), and MONDO disease ontology via a combination of LLMs and post-processing techniques. Finally, the matched terms are presented in a structured form to a human curator for approval. We used this approach to process 4,918 unique medical abstracts and identified annotations for 21 rare genetic diseases. We extracted 18,631 candidate disease-treatment curations, 538 of which were confirmed and transferred to the MAxO annotation dataset. The results of this project underscore the potential of generative AI to accelerate precision medicine by enabling a robust and comprehensive curation of the primary literature to represent information about diseases and procedures in a structured fashion. Although we focused on MAxO in this project, similar approaches could be taken for other biomedical curation tasks.

15.
Bioinform Adv ; 4(1): vbae036, 2024.
Article in English | MEDLINE | ID: mdl-38577542

ABSTRACT

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples are, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
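
One simple way to make negative-edge sampling degree-aware is to draw each endpoint from the endpoints of the positive edges, so that high-degree nodes appear among the negatives about as often as they do among the positives. The sketch below illustrates that idea only; it is an assumption and not necessarily the procedure in the authors' repository. The function name `degree_aware_negatives`, the toy graph, and the choice to treat edges as undirected are all hypothetical.

```python
import random


def degree_aware_negatives(positive_edges, n_samples, seed=0, max_tries=100_000):
    """Sample node pairs that are not positive edges, with endpoints drawn
    in proportion to their frequency among positive-edge endpoints.

    Edges are treated as undirected (an assumption of this sketch).
    """
    rng = random.Random(seed)
    known = {frozenset(edge) for edge in positive_edges}
    # Each node appears once per positive edge it takes part in, so a
    # uniform choice from these lists is a degree-weighted choice of node.
    sources = [u for u, _ in positive_edges]
    targets = [v for _, v in positive_edges]

    negatives = set()
    tries = 0
    while len(negatives) < n_samples and tries < max_tries:
        tries += 1
        u, v = rng.choice(sources), rng.choice(targets)
        pair = frozenset((u, v))
        if u == v or pair in known or pair in negatives:
            continue  # skip self-loops, true edges, and duplicates
        negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy graph with one hub node; all names are hypothetical.
positive_edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"), ("c", "e"),
]
negatives = degree_aware_negatives(positive_edges, n_samples=3, seed=42)
print(negatives)
```

Uniform sampling of node pairs would instead pick the hub no more often than any leaf, which is the kind of degree imbalance between positive and negative examples the abstract describes.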

16.
Methods ; 58(4): 392-9, 2012 Dec.
Article in English | MEDLINE | ID: mdl-22841565

ABSTRACT

Protein complexes are typically analyzed by affinity purification and subsequent mass spectrometric analysis. However, in most cases the structure and topology of the complexes remain elusive from such studies. Here we investigate how the yeast two-hybrid system can be used to analyze direct interactions among proteins in a complex. First, we tested all pairwise interactions among the seven proteins of Escherichia coli DNA polymerase III as well as an uncharacterized complex that includes MntR and PerR. Four and seven interactions were identified in these two complexes, respectively. In addition, we review Y2H data for three other complexes of known structure which serve as "gold-standards", namely Varicella Zoster Virus (VZV) ribonucleotide reductase (RNR), the yeast proteasome, and bacteriophage lambda. Finally, we review a Y2H analysis of the human spliceosome which may serve as an example for a dynamic mega-complex.


Subjects
Two-Hybrid System Techniques/standards , Animals , Bacteriophage lambda/metabolism , Caenorhabditis elegans Proteins/metabolism , Crystallization , DNA Polymerase III/metabolism , Escherichia coli/enzymology , Escherichia coli Proteins/metabolism , Herpesvirus 3, Human/enzymology , Humans , Models, Molecular , Proteasome Endopeptidase Complex/metabolism , Protein Interaction Mapping , Protein Interaction Maps , Protein Structure, Quaternary , Protein Subunits/metabolism , Reference Standards , Repressor Proteins/metabolism , Ribonucleotide Reductases/chemistry , Ribonucleotide Reductases/metabolism , Spliceosomes/metabolism , Viral Proteins/chemistry , Viral Proteins/metabolism
17.
ArXiv ; 2023 May 25.
Article in English | MEDLINE | ID: mdl-37292480

ABSTRACT

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB. We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.
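
For context, the statistical enrichment analysis that this abstract uses as its baseline is commonly computed as a one-sided hypergeometric test for over-representation. The sketch below shows that standard calculation with the Python standard library only. The function name and the example counts are hypothetical, and the sketch omits multiple-testing correction and ontology propagation, which real enrichment tools typically apply.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """One-sided p-value P(X >= overlap) for term over-representation.

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    sample:     size of the gene list being interpreted
    overlap:    genes in the list that carry the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 annotated with the term,
# a gene list of 4 genes, 2 of which carry the term.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=4, overlap=2)
print(f"p = {p_value:.4f}")
```

A calculation of this kind yields the scores and p-values that, according to the abstract, the GPT-based approaches could not deliver reliably.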

18.
J Vis Exp ; (200)2023 10 13.
Article in English | MEDLINE | ID: mdl-37902366

ABSTRACT

The rapidly increasing and vast quantities of biomedical reports, each containing numerous entities and rich information, represent a rich resource for biomedical text-mining applications. These tools enable investigators to integrate, conceptualize, and translate these discoveries to uncover new insights into disease pathology and therapeutics. In this protocol, we present CaseOLAP LIFT, a new computational pipeline to investigate cellular components and their disease associations by extracting user-selected information from text datasets (e.g., biomedical literature). The software identifies sub-cellular proteins and their functional partners within disease-relevant documents. Additional disease-relevant documents are identified via the software's label imputation method. To contextualize the resulting protein-disease associations and to integrate information from multiple relevant biomedical resources, a knowledge graph is automatically constructed for further analyses. We present one use case with a corpus of ~34 million text documents downloaded online to provide an example of elucidating the role of mitochondrial proteins in distinct cardiovascular disease phenotypes using this method. Furthermore, a deep learning model was applied to the resulting knowledge graph to predict previously unreported relationships between proteins and disease, resulting in 1,583 associations with predicted probabilities >0.90 and with an area under the receiver operating characteristic curve (AUROC) of 0.91 on the test set. This software features a highly customizable and automated workflow, with a broad scope of raw data available for analysis; therefore, using this method, protein-disease associations can be identified with enhanced reliability within a text corpus.


Subjects
Pattern Recognition, Automated , Software , Reproducibility of Results , Data Mining/methods
19.
EBioMedicine ; 87: 104413, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36563487

ABSTRACT

BACKGROUND: Stratification of patients with post-acute sequelae of SARS-CoV-2 infection (PASC, or long COVID) would allow precision clinical management strategies. However, long COVID is incompletely understood and characterised by a wide range of manifestations that are difficult to analyse computationally. Additionally, the generalisability of machine learning classification of COVID-19 clinical outcomes has rarely been tested. METHODS: We present a method for computationally modelling PASC phenotype data based on electronic healthcare records (EHRs) and for assessing pairwise phenotypic similarity between patients using semantic similarity. Our approach defines a nonlinear similarity function that maps from a feature space of phenotypic abnormalities to a matrix of pairwise patient similarity that can be clustered using unsupervised machine learning. FINDINGS: We found six clusters of PASC patients, each with distinct profiles of phenotypic abnormalities, including clusters with distinct pulmonary, neuropsychiatric, and cardiovascular abnormalities, and a cluster associated with broad, severe manifestations and increased mortality. There was significant association of cluster membership with a range of pre-existing conditions and measures of severity during acute COVID-19. We assigned new patients from other healthcare centres to clusters by maximum semantic similarity to the original patients, and showed that the clusters were generalisable across different hospital systems. The increased mortality rate originally identified in one cluster was consistently observed in patients assigned to that cluster in other hospital systems. INTERPRETATION: Semantic phenotypic clustering provides a foundation for assigning patients to stratified subgroups for natural history or therapy studies on PASC. FUNDING: NIH (TR002306/OT2HL161847-01/OD011883/HG010860), U.S.D.O.E. (DE-AC02-05CH11231), Donald A. Roux Family Fund at Jackson Laboratory, Marsico Family at CU Anschutz.
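The cluster-assignment step described in the findings (placing a new patient in the cluster of the most similar original patient) can be sketched as follows. The abstract does not specify its nonlinear similarity function, so this sketch swaps in a simple Jaccard similarity over ancestor-closed term sets; the toy ontology, term names, and cluster labels are all hypothetical.

```python
# Toy phenotype ontology (term -> parents); not real HPO content.
PARENTS = {
    "dyspnea": ["pulmonary"],
    "cough": ["pulmonary"],
    "anxiety": ["neuropsychiatric"],
    "insomnia": ["neuropsychiatric"],
    "pulmonary": ["abnormality"],
    "neuropsychiatric": ["abnormality"],
    "abnormality": [],
}

def closure(terms):
    """Return the terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, []))
    return seen

def similarity(a, b):
    """Jaccard similarity of ancestor-closed term sets (a stand-in measure)."""
    ca, cb = closure(a), closure(b)
    if not ca and not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)

def assign_cluster(new_patient, labelled_patients):
    """Assign the cluster label of the most similar already-clustered patient."""
    best = max(labelled_patients, key=lambda p: similarity(new_patient, p[0]))
    return best[1]

labelled = [
    ({"dyspnea", "cough"}, "pulmonary-cluster"),
    ({"anxiety", "insomnia"}, "neuro-cluster"),
]
assigned = assign_cluster({"cough"}, labelled)
```

Closing each term set over its ancestors lets two patients with different but related terms still share overlap through a common parent, which is the basic intuition behind ontology-based semantic similarity.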


Subjects
COVID-19 , Post-Acute COVID-19 Syndrome , Humans , Disease Progression , SARS-CoV-2
20.
Clin Transl Sci ; 15(8): 1848-1855, 2022 08.
Article in English | MEDLINE | ID: mdl-36125173

ABSTRACT

Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs have left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
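The idea of hierarchical categories plus predicates that constrain how entities relate can be illustrated with a small sketch. The category and predicate names follow Biolink's CURIE style, but the flattened hierarchy, the subject/object rules, and the `EX:` identifiers are simplifying assumptions for illustration and are not taken from a Biolink Model release.

```python
from dataclasses import dataclass

# Simplified category hierarchy (child -> parent); an assumption, not the real model.
CATEGORY_PARENT = {
    "biolink:Gene": "biolink:NamedThing",
    "biolink:Disease": "biolink:NamedThing",
    "biolink:PhenotypicFeature": "biolink:NamedThing",
    "biolink:NamedThing": None,
}

# Hypothetical rules: predicate -> (required subject category, required object category).
PREDICATE_RULES = {
    "biolink:gene_associated_with_condition": ("biolink:Gene", "biolink:Disease"),
    "biolink:has_phenotype": ("biolink:NamedThing", "biolink:PhenotypicFeature"),
}

@dataclass(frozen=True)
class Node:
    id: str
    category: str

@dataclass(frozen=True)
class Edge:
    subject: Node
    predicate: str
    object: Node

def is_a(category, ancestor):
    """True if category equals ancestor or descends from it in the toy hierarchy."""
    while category is not None:
        if category == ancestor:
            return True
        category = CATEGORY_PARENT.get(category)
    return False

def is_valid(edge):
    """Check an edge against the toy subject/object constraints of its predicate."""
    rule = PREDICATE_RULES.get(edge.predicate)
    if rule is None:
        return False
    return is_a(edge.subject.category, rule[0]) and is_a(edge.object.category, rule[1])

gene = Node("EX:gene1", "biolink:Gene")
disease = Node("EX:disease1", "biolink:Disease")
phenotype = Node("EX:phenotype1", "biolink:PhenotypicFeature")
```

Checking edges against shared category and predicate rules is what lets graphs built by different groups be merged without each consumer reconciling formats by hand.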


Subjects
Pattern Recognition, Automated , Translational Science, Biomedical , Knowledge