Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 36
Filter
1.
Nucleic Acids Res ; 52(D1): D938-D949, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38000386

ABSTRACT

Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.


Subject(s)
Databases, Factual , Disease , Genes , Phenotype , Humans , Internet , Databases, Factual/standards , Software , Genes/genetics , Disease/genetics
2.
Bioinformatics ; 40(3)2024 03 04.
Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.


Subject(s)
Knowledge Bases , Semantics , Databases, Factual
3.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37389415

ABSTRACT

MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.


Subject(s)
Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning
4.
BMC Med Inform Decis Mak ; 24(1): 30, 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38297371

ABSTRACT

OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role in both the diagnosis of patients with rare disorders as well as in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (supported usually by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other clinical observations. RESULTS: The best run, using in-context learning, achieved 0.58 document-level F1 score on publication abstracts and 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best in class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses the state of the art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.


Subject(s)
Knowledge , Language , Humans , Machine Learning , Phenotype , Rare Diseases
5.
Virol J ; 19(1): 84, 2022 05 15.
Article in English | MEDLINE | ID: mdl-35570298

ABSTRACT

BACKGROUND: Non-steroidal anti-inflammatory drugs (NSAIDs) are commonly used to reduce pain, fever, and inflammation but have been associated with complications in community-acquired pneumonia. Observations shortly after the start of the COVID-19 pandemic in 2020 suggested that ibuprofen was associated with an increased risk of adverse events in COVID-19 patients, but subsequent observational studies failed to demonstrate increased risk and in one case showed reduced risk associated with NSAID use. METHODS: A 38-center retrospective cohort study was performed that leveraged the harmonized, high-granularity electronic health record data of the National COVID Cohort Collaborative. A propensity-matched cohort of 19,746 COVID-19 inpatients was constructed by matching cases (treated with NSAIDs at the time of admission) and 19,746 controls (not treated) from 857,061 patients with COVID-19 available for analysis. The primary outcome of interest was COVID-19 severity in hospitalized patients, which was classified as: moderate, severe, or mortality/hospice. Secondary outcomes were acute kidney injury (AKI), extracorporeal membrane oxygenation (ECMO), invasive ventilation, and all-cause mortality at any time following COVID-19 diagnosis. RESULTS: Logistic regression showed that NSAID use was not associated with increased COVID-19 severity (OR: 0.57 95% CI: 0.53-0.61). Analysis of secondary outcomes using logistic regression showed that NSAID use was not associated with increased risk of all-cause mortality (OR 0.51 95% CI: 0.47-0.56), invasive ventilation (OR: 0.59 95% CI: 0.55-0.64), AKI (OR: 0.67 95% CI: 0.63-0.72), or ECMO (OR: 0.51 95% CI: 0.36-0.7). In contrast, the odds ratios indicate reduced risk of these outcomes, but our quantitative bias analysis showed E-values of between 1.9 and 3.3 for these associations, indicating that comparatively weak or moderate confounder associations could explain away the observed associations. CONCLUSIONS: Study interpretation is limited by the observational design. Recording of NSAID use may have been incomplete. Our study demonstrates that NSAID use is not associated with increased COVID-19 severity, all-cause mortality, invasive ventilation, AKI, or ECMO in COVID-19 inpatients. A conservative interpretation in light of the quantitative bias analysis is that there is no evidence that NSAID use is associated with risk of increased severity or the other measured outcomes. Our results confirm and extend analogous findings in previous observational studies using a large cohort of patients drawn from 38 centers in a nationally representative multicenter database.


Subject(s)
Acute Kidney Injury , COVID-19 , Anti-Inflammatory Agents, Non-Steroidal/adverse effects , COVID-19 Testing , Cohort Studies , Humans , Pandemics , Retrospective Studies
6.
Genome Res ; 23(8): 1235-47, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23636946

ABSTRACT

Genomes of eusocial insects code for dramatic examples of phenotypic plasticity and social organization. We compared the genomes of seven ants, the honeybee, and various solitary insects to examine whether eusocial lineages share distinct features of genomic organization. Each ant lineage contains ∼4000 novel genes, but only 64 of these genes are conserved among all seven ants. Many gene families have been expanded in ants, notably those involved in chemical communication (e.g., desaturases and odorant receptors). Alignment of the ant genomes revealed reduced purifying selection compared with Drosophila without significantly reduced synteny. Correspondingly, ant genomes exhibit dramatic divergence of noncoding regulatory elements; however, extant conserved regions are enriched for novel noncoding RNAs and transcription factor-binding sites. Comparison of orthologous gene promoters between eusocial and solitary species revealed significant regulatory evolution in both cis (e.g., Creb) and trans (e.g., fork head) for nearly 2000 genes, many of which exhibit phenotypic plasticity. Our results emphasize that genomic changes can occur remarkably fast in ants, because two recently diverged leaf-cutter ant species exhibit faster accumulation of species-specific genes and greater divergence in regulatory elements compared with other ants or Drosophila. Thus, while the "socio-genomes" of ants and the honeybee are broadly characterized by a pervasive pattern of divergence in gene composition and regulation, they preserve lineage-specific regulatory features linked to eusociality. We propose that changes in gene regulation played a key role in the origins of insect eusociality, whereas changes in gene composition were more relevant for lineage-specific eusocial adaptations.


Subject(s)
Ants/genetics , Genome, Insect , Animals , Behavior, Animal , Binding Sites , Conserved Sequence , DNA Methylation , Evolution, Molecular , Gene Expression Regulation , Hymenoptera/genetics , Insect Proteins/genetics , MicroRNAs/genetics , Models, Genetic , Phylogeny , Regulatory Sequences, Nucleic Acid , Sequence Analysis, DNA , Social Behavior , Species Specificity , Synteny , Transcription Factors/genetics
7.
BMC Genomics ; 15: 86, 2014 Jan 30.
Article in English | MEDLINE | ID: mdl-24479613

ABSTRACT

BACKGROUND: The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. RESULTS: Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. CONCLUSIONS: Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.


Subject(s)
Bees/genetics , Genes, Insect , Animals , Base Composition , Databases, Genetic , Interspersed Repetitive Sequences/genetics , Molecular Sequence Annotation , Open Reading Frames/genetics , Peptides/analysis , Sequence Analysis, RNA , Sequence Homology, Amino Acid
8.
PLoS Genet ; 7(2): e1002007, 2011 Feb 10.
Article in English | MEDLINE | ID: mdl-21347285

ABSTRACT

Leaf-cutter ants are one of the most important herbivorous insects in the Neotropics, harvesting vast quantities of fresh leaf material. The ants use leaves to cultivate a fungus that serves as the colony's primary food source. This obligate ant-fungus mutualism is one of the few occurrences of farming by non-humans and likely facilitated the formation of their massive colonies. Mature leaf-cutter ant colonies contain millions of workers ranging in size from small garden tenders to large soldiers, resulting in one of the most complex polymorphic caste systems within ants. To begin uncovering the genomic underpinnings of this system, we sequenced the genome of Atta cephalotes using 454 pyrosequencing. One prediction from this ant's lifestyle is that it has undergone genetic modifications that reflect its obligate dependence on the fungus for nutrients. Analysis of this genome sequence is consistent with this hypothesis, as we find evidence for reductions in genes related to nutrient acquisition. These include extensive reductions in serine proteases (which are likely unnecessary because proteolysis is not a primary mechanism used to process nutrients obtained from the fungus), a loss of genes involved in arginine biosynthesis (suggesting that this amino acid is obtained from the fungus), and the absence of a hexamerin (which sequesters amino acids during larval development in other insects). Following recent reports of genome sequences from other insects that engage in symbioses with beneficial microbes, the A. cephalotes genome provides new insights into the symbiotic lifestyle of this ant and advances our understanding of host-microbe symbioses.


Subject(s)
Ants/physiology , Genome, Insect/genetics , Plant Leaves/physiology , Symbiosis , Animals , Ants/genetics , Arginine/genetics , Arginine/metabolism , Base Sequence , Fungi/genetics , Insect Proteins/genetics , Insect Proteins/metabolism , Sequence Analysis, DNA , Serine Proteases/genetics , Serine Proteases/metabolism
9.
Proc Natl Acad Sci U S A ; 108(14): 5673-8, 2011 Apr 05.
Article in English | MEDLINE | ID: mdl-21282631

ABSTRACT

Ants are some of the most abundant and familiar animals on Earth, and they play vital roles in most terrestrial ecosystems. Although all ants are eusocial, and display a variety of complex and fascinating behaviors, few genomic resources exist for them. Here, we report the draft genome sequence of a particularly widespread and well-studied species, the invasive Argentine ant (Linepithema humile), which was accomplished using a combination of 454 (Roche) and Illumina sequencing and community-based funding rather than federal grant support. Manual annotation of >1,000 genes from a variety of different gene families and functional classes reveals unique features of the Argentine ant's biology, as well as similarities to Apis mellifera and Nasonia vitripennis. Distinctive features of the Argentine ant genome include remarkable expansions of gustatory (116 genes) and odorant receptors (367 genes), an abundance of cytochrome P450 genes (>110), lineage-specific expansions of yellow/major royal jelly proteins and desaturases, and complete CpG DNA methylation and RNAi toolkits. The Argentine ant genome contains fewer immune genes than Drosophila and Tribolium, which may reflect the prominent role played by behavioral and chemical suppression of pathogens. Analysis of the ratio of observed to expected CpG nucleotides for genes in the reproductive development and apoptosis pathways suggests higher levels of methylation than in the genome overall. The resources provided by this genome sequence will offer an abundance of tools for researchers seeking to illuminate the fascinating biology of this emerging model organism.


Subject(s)
Ants/genetics , Genome, Insect/genetics , Genomics/methods , Phylogeny , Animals , Ants/physiology , Base Sequence , California , DNA Methylation , Gene Library , Genetics, Population , Hierarchy, Social , Molecular Sequence Data , Polymorphism, Single Nucleotide/genetics , Receptors, Odorant/genetics , Sequence Analysis, DNA
10.
Proc Natl Acad Sci U S A ; 108(14): 5667-72, 2011 Apr 05.
Article in English | MEDLINE | ID: mdl-21282651

ABSTRACT

We report the draft genome sequence of the red harvester ant, Pogonomyrmex barbatus. The genome was sequenced using 454 pyrosequencing, and the current assembly and annotation were completed in less than 1 y. Analyses of conserved gene groups (more than 1,200 manually annotated genes to date) suggest a high-quality assembly and annotation comparable to recently sequenced insect genomes using Sanger sequencing. The red harvester ant is a model for studying reproductive division of labor, phenotypic plasticity, and sociogenomics. Although the genome of P. barbatus is similar to other sequenced hymenopterans (Apis mellifera and Nasonia vitripennis) in GC content and compositional organization, and possesses a complete CpG methylation toolkit, its predicted genomic CpG content differs markedly from the other hymenopterans. Gene networks involved in generating key differences between the queen and worker castes (e.g., wings and ovaries) show signatures of increased methylation and suggest that ants and bees may have independently co-opted the same gene regulatory mechanisms for reproductive division of labor. Gene family expansions (e.g., 344 functional odorant receptors) and pseudogene accumulation in chemoreception and P450 genes compared with A. mellifera and N. vitripennis are consistent with major life-history changes during the adaptive radiation of Pogonomyrmex spp., perhaps in parallel with the development of the North American deserts.


Subject(s)
Ants/genetics , Gene Regulatory Networks/genetics , Genome, Insect/genetics , Genomics/methods , Phylogeny , Animals , Ants/physiology , Base Sequence , Desert Climate , Hierarchy, Social , Molecular Sequence Data , North America , Phenotype , Polymorphism, Single Nucleotide/genetics , Receptors, Odorant/genetics , Sequence Analysis, DNA
11.
medRxiv ; 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-37503093

ABSTRACT

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.

12.
bioRxiv ; 2024 Jul 04.
Article in English | MEDLINE | ID: mdl-39005436

ABSTRACT

Objectives: Concept embeddings are low-dimensional vector representations of concepts such as MeSH:D009203 (Myocardial Infarction), whose similarity in the embedded vector space reflects their semantic similarity. Here, we test the hypothesis that non-biomedical concept synonym replacement can improve the quality of biomedical concepts embeddings. Materials and methods: We developed an approach that leverages WordNet to replace sets of synonyms with the most common representative of the synonym set. Results: We tested our approach on 1055 concept sets and found that, on average, the mean intra-cluster distance was reduced by 8% in the vector-space. Assuming that homophily of related concepts in the vector space is desirable, our approach tends to improve the quality of embeddings. Discussion and Conclusion: This pilot study shows that non-biomedical synonym replacement tends to improve the quality of embeddings of biomedical concepts using the Word2Vec algorithm. We have implemented our approach in a freely available Python package available at https://github.com/TheJacksonLaboratory/wn2vec.

13.
medRxiv ; 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-39108510

ABSTRACT

Large language models (LLM) have shown great promise in supporting differential diagnosis, but 23 available published studies on the diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and have evaluated LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of a GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.

14.
medRxiv ; 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39228707

ABSTRACT

Structured representations of clinical data can support computational analysis of individuals and cohorts, and ontologies representing disease entities and phenotypic abnormalities are now commonly used for translational research. The Medical Action Ontology (MAxO) provides a computational representation of treatments and other actions taken for the clinical management of patients. Currently, manual biocuration is used to assign MAxO terms to rare diseases, enabling clinical management of rare diseases to be described computationally for use in clinical decision support and mechanism discovery. However, it is challenging to scale manual curation to comprehensively capture information about medical actions for the more than 10,000 rare diseases. We present AutoMAxO, a semi-automated workflow that leverages Large Language Models (LLMs) to streamline MAxO biocuration for rare diseases. AutoMAxO first uses LLMs to retrieve candidate curations from abstracts of relevant publications. Next, the candidate curations are matched to ontology terms from MAxO, Human Phenotype Ontology (HPO), and MONDO disease ontology via a combination of LLMs and post-processing techniques. Finally, the matched terms are presented in a structured form to a human curator for approval. We used this approach to process 4,918 unique medical abstracts and identified annotations for 21 rare genetic diseases, we extracted 18,631 candidate disease-treatment curations, 538 of which were confirmed and transferred to the MAxO annotation dataset. The results of this project underscore the potential of generative AI to accelerate precision medicine by enabling a robust and comprehensive curation of the primary literature to represent information about diseases and procedures in a structured fashion. Although we focused on MAxO in this project, similar approaches could be taken for other biomedical curation tasks.

15.
Transl Psychiatry ; 14(1): 246, 2024 Jun 08.
Article in English | MEDLINE | ID: mdl-38851761

ABSTRACT

Acute COVID-19 infection can be followed by diverse clinical manifestations referred to as Post Acute Sequelae of SARS-CoV2 Infection (PASC). Studies have shown an increased risk of being diagnosed with new-onset psychiatric disease following a diagnosis of acute COVID-19. However, it was unclear whether non-psychiatric PASC-associated manifestations (PASC-AMs) are associated with an increased risk of new-onset psychiatric disease following COVID-19. A retrospective electronic health record (EHR) cohort study of 2,391,006 individuals with acute COVID-19 was performed to evaluate whether non-psychiatric PASC-AMs are associated with new-onset psychiatric disease. Data were obtained from the National COVID Cohort Collaborative (N3C), which has EHR data from 76 clinical organizations. EHR codes were mapped to 151 non-psychiatric PASC-AMs recorded 28-120 days following SARS-CoV-2 diagnosis and before diagnosis of new-onset psychiatric disease. Association of newly diagnosed psychiatric disease with age, sex, race, pre-existing comorbidities, and PASC-AMs in seven categories was assessed by logistic regression. There were significant associations between a diagnosis of any psychiatric disease and five categories of PASC-AMs with odds ratios highest for neurological, cardiovascular, and constitutional PASC-AMs with odds ratios of 1.31, 1.29, and 1.23 respectively. Secondary analysis revealed that the proportions of 50 individual clinical features significantly differed between patients diagnosed with different psychiatric diseases. Our study provides evidence for association between non-psychiatric PASC-AMs and the incidence of newly diagnosed psychiatric disease. Significant associations were found for features related to multiple organ systems. This information could prove useful in understanding risk stratification for new-onset psychiatric disease following COVID-19. Prospective studies are needed to corroborate these findings.


Subject(s)
COVID-19 , Mental Disorders , SARS-CoV-2 , Humans , COVID-19/psychology , COVID-19/complications , COVID-19/epidemiology , Male , Female , Mental Disorders/epidemiology , Middle Aged , Adult , Retrospective Studies , Aged , Phenotype , Post-Acute COVID-19 Syndrome , Comorbidity , Electronic Health Records , Young Adult , Risk Factors , Adolescent
16.
J Biomed Semantics ; 15(1): 19, 2024 Oct 17.
Article in English | MEDLINE | ID: mdl-39415214

ABSTRACT

BACKGROUND: Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate substantial collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources. RESULTS: We assessed performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of results. Our method has high precision for relationship generation, but has slightly lower precision than from logic-based reasoning. Our method is also able to generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues. CONCLUSIONS: These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.


Subject(s)
Artificial Intelligence , Biological Ontologies , Natural Language Processing , Information Storage and Retrieval/methods
17.
medRxiv ; 2024 May 29.
Article in English | MEDLINE | ID: mdl-38854034

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema was released in 2022 and approved by ISO as a standard for sharing clinical and genomic information about an individual, including phenotypic descriptions, numerical measurements, genetic information, diagnoses, and treatments. A phenopacket can be used as an input file for software that supports phenotype-driven genomic diagnostics and for algorithms that facilitate patient classification and stratification for identifying new diseases and treatments. There has been a great need for a collection of phenopackets to test software pipelines and algorithms. Here, we present phenopacket-store. Version 0.1.12 of phenopacket-store includes 4916 phenopackets representing 277 Mendelian and chromosomal diseases associated with 236 genes, and 2872 unique pathogenic alleles curated from 605 different publications. This represents the first large-scale collection of case-level, standardized phenotypic information derived from case reports in the literature with detailed descriptions of the clinical data and will be useful for many purposes, including the development and testing of software for prioritizing genes and diseases in diagnostic genomics, machine learning analysis of clinical phenotype data, patient stratification, and genotype-phenotype correlations. This corpus also provides best-practice examples for curating literature-derived data using the GA4GH Phenopacket Schema.

18.
HGG Adv ; : 100371, 2024 Oct 10.
Article in English | MEDLINE | ID: mdl-39394689

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema was released in 2022 and approved by ISO as a standard for sharing clinical and genomic information about an individual, including phenotypic descriptions, numerical measurements, genetic information, diagnoses, and treatments. A phenopacket can be used as an input file for software that supports phenotype-driven genomic diagnostics and for algorithms that facilitate patient classification and stratification for identifying new diseases and treatments. There has been a great need for a collection of phenopackets to test software pipelines and algorithms. Here, we present Phenopacket Store. Version 0.1.19 of Phenopacket Store includes 6668 phenopackets representing 475 Mendelian and chromosomal diseases associated with 423 genes and 3834 unique pathogenic alleles curated from 959 different publications. This represents the first large-scale collection of case-level, standardized phenotypic information derived from case reports in the literature with detailed descriptions of the clinical data and will be useful for many purposes, including the development and testing of software for prioritizing genes and diseases in diagnostic genomics, machine learning analysis of clinical phenotype data, patient stratification, and genotype-phenotype correlations. This corpus also provides best-practice examples for curating literature-derived data using the GA4GH Phenopacket Schema.

19.
Nucleic Acids Res ; 39(Database issue): D658-62, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21071397

ABSTRACT

The Hymenoptera Genome Database (HGD) is a comprehensive model organism database that caters to the needs of scientists working on insect species of the order Hymenoptera. This system implements open-source software and relational databases providing access to curated data contributed by an extensive, active research community. HGD contains data from 9 different species across ∼200 million years in the phylogeny of Hymenoptera, allowing researchers to leverage genetic, genome sequence and gene expression data, as well as the biological knowledge of related model organisms. The availability of resources across an order greatly facilitates comparative genomics and enhances our understanding of the biology of agriculturally important Hymenoptera species through genomics. Curated data at HGD includes predicted and annotated gene sets supported with evidence tracks such as ESTs/cDNAs, small RNA sequences and GC composition domains. Data at HGD can be queried using genome browsers and/or BLAST/PSI-BLAST servers, and it may also be downloaded to perform local searches. We encourage the public to access and contribute data to HGD at: http://HymenopteraGenome.org.


Subject(s)
Databases, Genetic , Genome, Insect , Hymenoptera/genetics , Animals , Genomics , Molecular Sequence Annotation , Software , Systems Integration
20.
Nucleic Acids Res ; 39(Database issue): D830-4, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21123190

ABSTRACT

The Bovine Genome Database (BGD; http://BovineGenome.org) strives to improve annotation of the bovine genome and to integrate the genome sequence with other genomics data. BGD includes GBrowse genome browsers, the Apollo Annotation Editor, a quantitative trait loci (QTL) viewer, BLAST databases and gene pages. Genome browsers, available for both scaffold and chromosome coordinate systems, display the bovine Official Gene Set (OGS), RefSeq and Ensembl gene models, non-coding RNA, repeats, pseudogenes, single-nucleotide polymorphism, markers, QTL and alignments to complementary DNAs, ESTs and protein homologs. The Bovine QTL viewer is connected to the BGD Chromosome GBrowse, allowing for the identification of candidate genes underlying QTL. The Apollo Annotation Editor connects directly to the BGD Chado database to provide researchers with remote access to gene evidence in a graphical interface that allows editing and creating new gene models. Researchers may upload their annotations to the BGD server for review and integration into the subsequent release of the OGS. Gene pages display information for individual OGS gene models, including gene structure, transcript variants, functional descriptions, gene symbols, Gene Ontology terms, annotator comments and links to National Center for Biotechnology Information and Ensembl. Each gene page is linked to a wiki page to allow input from the research community.


Subject(s)
Cattle/genetics , Databases, Genetic , Genomics , Molecular Sequence Annotation , Animals , Genome , Models, Genetic , Quantitative Trait Loci , Sequence Alignment , Software , Systems Integration
SELECTION OF CITATIONS
SEARCH DETAIL