1.
Bioinformatics ; 40(7)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38913850

ABSTRACT

MOTIVATION: Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data. RESULTS: We developed a dictionary-based approach that uses a pre-built large collection of clusters of morphologically equivalent tokens to address lexical variability, and a more effective CR step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10 000 publication abstracts in 5 s. AVAILABILITY AND IMPLEMENTATION: FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
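
The dictionary-based idea described in this abstract can be illustrated with a minimal sketch. It is not the FastHPOCR API: the three-entry dictionary, the `normalize` function (a crude stand-in for the pre-built clusters of morphologically equivalent tokens) and the `recognize` function are all hypothetical names introduced here for illustration. The sketch only considers candidate spans made of tokens that occur in some ontology label, then applies a greedy longest match within each span.

```python
import re

# Toy dictionary (illustrative subset): label -> HPO identifier.
# A real system would load all labels and synonyms from an ontology release.
HPO_LABELS = {
    "seizure": "HP:0001250",
    "global developmental delay": "HP:0001263",
    "hypotonia": "HP:0001252",
}


def normalize(token: str) -> str:
    """Assumed stand-in for morphological clustering: lowercase, strip plural 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


# Index normalized label tokens; VOCAB holds every token that appears in a label.
INDEX = {
    tuple(normalize(t) for t in label.split()): hpo_id
    for label, hpo_id in HPO_LABELS.items()
}
VOCAB = {tok for key in INDEX for tok in key}
MAX_LEN = max(len(key) for key in INDEX)


def recognize(text: str) -> list[tuple[str, str]]:
    """Return (matched label, HPO id) pairs found in the text, in order."""
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in VOCAB:
            i += 1
            continue
        # Candidate span: a maximal run of tokens that all belong to ontology labels.
        j = i
        while j < len(tokens) and tokens[j] in VOCAB:
            j += 1
        run = tokens[i:j]
        # Greedy longest match inside the candidate span.
        k = 0
        while k < len(run):
            for n in range(min(MAX_LEN, len(run) - k), 0, -1):
                key = tuple(run[k:k + n])
                if key in INDEX:
                    hits.append((" ".join(key), INDEX[key]))
                    k += n
                    break
            else:
                k += 1
        i = j
    return hits


print(recognize("The patient had seizures, hypotonia and global developmental delay."))
```

Because the dictionary is rebuilt directly from the ontology, refreshing it after a new ontology release requires no retraining, which is the property the abstract emphasises.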


Subject(s)
Biological Ontologies; Phenotype; Humans; Natural Language Processing; Software; Algorithms
2.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383067

ABSTRACT

MOTIVATION: Creating knowledge bases and ontologies is a time consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.


Subject(s)
Knowledge Bases; Semantics; Databases, Factual
3.
Genet Med ; 26(7): 101141, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38629401

ABSTRACT

PURPOSE: Existing resources that characterize the essentiality status of genes are based on either proliferation assessment in human cell lines, viability evaluation in mouse knockouts, or constraint metrics derived from human population sequencing studies. Several repositories document phenotypic annotations for rare disorders; however, there is a lack of comprehensive reporting on lethal phenotypes. METHODS: We queried Online Mendelian Inheritance in Man for terms related to lethality and classified all Mendelian genes according to the earliest age of death recorded for the associated disorders, from prenatal death to no reports of premature death. We characterized the genes across these lethality categories, examined the evidence on viability from mouse models and explored how this information could be used for novel gene discovery. RESULTS: We developed the Lethal Phenotypes Portal to showcase this curated catalog of human essential genes. Differences in the mode of inheritance, physiological systems affected, and disease class were found for genes in different lethality categories, as well as discrepancies between the lethal phenotypes observed in mouse and human. CONCLUSION: We anticipate that this resource will aid clinicians in the diagnosis of early lethal conditions and assist researchers in investigating the properties that make these genes essential for human development.


Subject(s)
Genes, Lethal; Genetic Diseases, Inborn; Phenotype; Humans; Animals; Mice; Genetic Diseases, Inborn/genetics; Databases, Genetic; Disease Models, Animal; Genes, Essential/genetics
4.
Prenat Diagn ; 44(4): 454-464, 2024 04.
Article in English | MEDLINE | ID: mdl-38242839

ABSTRACT

Advances in sequencing and imaging technologies enable enhanced assessment in the prenatal space, with a goal to diagnose and predict the natural history of disease, to direct targeted therapies, and to implement clinical management, including transfer of care, election of supportive care, and selection of surgical interventions. The current lack of standardization and aggregation stymies variant interpretation and gene discovery, which hinders the provision of prenatal precision medicine, leaving clinicians and patients without an accurate diagnosis. With large amounts of data generated, it is imperative to establish standards for data collection, processing, and aggregation. Aggregated and homogeneously processed genetic and phenotypic data permits dissection of the genomic architecture of prenatal presentations of disease and provides a dataset on which data analysis algorithms can be tuned to the prenatal space. Here we discuss the importance of generating aggregate data sets and how the prenatal space is driving the development of interoperable standards and phenotype-driven tools.


Subject(s)
Precision Medicine; Prenatal Diagnosis; Pregnancy; Female; Humans; Phenotype; Genomics; Algorithms
5.
BMC Med Inform Decis Mak ; 24(1): 30, 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38297371

ABSTRACT

OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role in both the diagnosis of patients with rare disorders as well as in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (supported usually by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other clinical observations. RESULTS: The best run, using in-context learning, achieved 0.58 document-level F1 score on publication abstracts and 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best in class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses the state of the art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.


Subject(s)
Knowledge; Language; Humans; Machine Learning; Phenotype; Rare Diseases
6.
Front Cell Dev Biol ; 12: 1240384, 2024.
Article in English | MEDLINE | ID: mdl-38989060

ABSTRACT

Cell level functions underlie tissue and organ physiology. Gene expression patterns offer extensive views of the pathways and processes within and between cells. Single cell transcriptomics provides detailed information on gene expression within cells, cell types, subtypes and their relative proportions in organs. Functional pathways can be scalably connected to physiological functions at the cell and organ levels. Integrating experimentally obtained gene expression patterns with prior knowledge of pathway interactions enables identification of networks underlying whole cell functions such as growth, contractility, and secretion. These pathways can be computationally modeled using differential equations to simulate cell and organ physiological dynamics regulated by gene expression changes. Such computational systems can be thought of as parts of digital twins of organs. Digital twins, at the core, need computational models that represent in detail and simulate how dynamics of pathways and networks give rise to whole cell level physiological functions. Integration of transcriptomic responses and numerical simulations could simulate and predict whole cell functional outputs from transcriptomic data. We developed a computational pipeline that integrates gene expression timelines and systems of coupled differential equations to generate cell-type selective dynamical models. We tested our integrative algorithm on the eicosanoid biosynthesis network in macrophages. Converting transcriptomic changes to a dynamical model allowed us to predict dynamics of prostaglandin and thromboxane synthesis and secretion by macrophages that matched published lipidomics data obtained in the same experiments. Integration of cell-level system biology simulations with genomic and clinical data using a knowledge graph framework will allow us to create explicit predictive models that mechanistically link genomic determinants to organ function. 
Such integration requires a multi-domain ontological framework to connect genomic determinants to gene expression and cell pathways and functions to organ level phenotypes in healthy and diseased states. These integrated scalable models of tissues and organs as accurate digital twins predict health and disease states for precision medicine.

7.
medRxiv ; 2024 Jan 13.
Article in English | MEDLINE | ID: mdl-38260283

ABSTRACT

Essential genes are those whose function is required for cell proliferation and/or organism survival. A gene's intolerance to loss-of-function can be allocated within a spectrum, as opposed to being considered a binary feature, since this function might be essential at different stages of development, genetic backgrounds or other contexts. Existing resources that collect and characterise the essentiality status of genes are based on either proliferation assessment in human cell lines, embryonic and postnatal viability evaluation in different model organisms, and gene metrics such as intolerance to variation scores derived from human population sequencing studies. There are also several repositories available that document phenotypic annotations for rare disorders in humans such as the Online Mendelian Inheritance in Man (OMIM) and the Human Phenotype Ontology (HPO) knowledgebases. This raises the prospect of being able to use clinical data, including lethality as the most severe phenotypic manifestation, to further our characterisation of gene essentiality. Here we queried OMIM for terms related to lethality and classified all Mendelian genes into categories, according to the earliest age of death recorded for the associated disorders, from prenatal death to no reports of premature death. To showcase this curated catalogue of human essential genes, we developed the Lethal Phenotypes Portal (https://lethalphenotypes.research.its.qmul.ac.uk), where we also explore the relationships between these lethality categories, constraint metrics and viability in cell lines and mouse. Further analysis of the genes in these categories reveals differences in the mode of inheritance of the associated disorders, physiological systems affected and disease class. We highlight how the phenotypic similarity between genes in the same lethality category combined with gene family/group information can be used for novel disease gene discovery. 
Finally, we explore the overlaps and discrepancies between the lethal phenotypes observed in mouse and human and discuss potential explanations that include differences in transcriptional regulation, functional compensation and molecular disease mechanisms. We anticipate that this resource will aid clinicians in the diagnosis of early lethal conditions and assist researchers in investigating the properties that make these genes essential for human development.

8.
bioRxiv ; 2024 Jul 04.
Article in English | MEDLINE | ID: mdl-39005436

ABSTRACT

Objectives: Concept embeddings are low-dimensional vector representations of concepts such as MeSH:D009203 (Myocardial Infarction), whose similarity in the embedded vector space reflects their semantic similarity. Here, we test the hypothesis that non-biomedical concept synonym replacement can improve the quality of biomedical concepts embeddings. Materials and methods: We developed an approach that leverages WordNet to replace sets of synonyms with the most common representative of the synonym set. Results: We tested our approach on 1055 concept sets and found that, on average, the mean intra-cluster distance was reduced by 8% in the vector-space. Assuming that homophily of related concepts in the vector space is desirable, our approach tends to improve the quality of embeddings. Discussion and Conclusion: This pilot study shows that non-biomedical synonym replacement tends to improve the quality of embeddings of biomedical concepts using the Word2Vec algorithm. We have implemented our approach in a freely available Python package available at https://github.com/TheJacksonLaboratory/wn2vec.
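
The synonym-replacement step described above can be sketched as follows. This is a simplified illustration, not the wn2vec implementation: the hand-written `SYNSETS` list stands in for the synonym sets that the described approach takes from WordNet, and the function names are hypothetical. Each word is replaced with the member of its synonym set that occurs most often in the corpus, before any embedding model is trained.

```python
from collections import Counter

# Hypothetical synonym sets; the approach in the abstract obtains these from WordNet.
SYNSETS = [
    {"illness", "sickness", "malady"},
    {"big", "large"},
]


def build_replacements(corpus_tokens, synsets):
    """Map every synonym to the most frequent member of its set in the corpus."""
    counts = Counter(corpus_tokens)
    mapping = {}
    for synset in synsets:
        # sorted() makes tie-breaking deterministic (alphabetical order wins ties).
        representative = max(sorted(synset), key=lambda word: counts[word])
        for word in synset:
            mapping[word] = representative
    return mapping


def replace_synonyms(tokens, mapping):
    """Replace each token with its representative; other tokens pass through."""
    return [mapping.get(token, token) for token in tokens]


corpus = "the illness was a large illness and the sickness was big and large".split()
mapping = build_replacements(corpus, SYNSETS)
print(" ".join(replace_synonyms(corpus, mapping)))
```

Collapsing synonyms this way gives the representative word more training contexts, which is one plausible reason the embeddings of related concepts move closer together.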

9.
medRxiv ; 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-39108510

ABSTRACT

Large language models (LLMs) have shown great promise in supporting differential diagnosis, but the 23 available published studies on diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and assessed LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank (MRR) of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
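
The ranking metrics quoted in this abstract can be computed as in the short sketch below. The function names and the example ranks are invented for illustration, and the sketch assumes that a case whose correct diagnosis is absent from the response contributes a reciprocal rank of 0; the abstract does not state how such cases were scored.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct diagnosis per case, or None if absent.

    Assumption: absent diagnoses contribute a reciprocal rank of 0.
    """
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def top_k_fraction(ranks, k):
    """Fraction of cases whose correct diagnosis appears within the first k ranks."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Made-up ranks for five cases; None means the diagnosis was not returned.
example_ranks = [1, 3, None, 10, 2]

mrr = mean_reciprocal_rank(example_ranks)
print(f"MRR = {mrr:.3f}, equivalent rank = {1 / mrr:.1f}")
for k in (1, 3, 10):
    print(f"top-{k}: {top_k_fraction(example_ranks, k):.0%}")
```

Applying the same reciprocal to the reported MRR gives 1 / 0.24 ≈ 4.17, which rounds to the rank of 4.2 stated in the abstract.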

10.
Bioinform Adv ; 4(1): vbae036, 2024.
Article in English | MEDLINE | ID: mdl-38577542

ABSTRACT

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples are, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
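
One way to make negative-edge sampling degree-aware is sketched below. This is an illustrative assumption, not the procedure from the linked repository: endpoints of candidate negative edges are drawn with probability proportional to node degree, so that high-degree nodes appear in negatives about as often as they do in positives. The function name and parameters are hypothetical, and the graph is assumed to be simple and undirected.

```python
import random
from collections import Counter


def sample_negatives(edges, n_samples, seed=0, degree_aware=True):
    """Sample node pairs that are not edges of a simple undirected graph.

    With degree_aware=True, endpoints are drawn proportionally to node degree;
    with degree_aware=False, endpoints are drawn uniformly (the baseline).
    """
    rng = random.Random(seed)
    existing = {frozenset(edge) for edge in edges}
    degree = Counter(node for edge in edges for node in edge)
    nodes = sorted(degree)
    weights = [degree[node] if degree_aware else 1 for node in nodes]

    # Never ask for more negatives than there are non-edges.
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    n_samples = min(n_samples, max_possible)

    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.choices(nodes, weights=weights, k=2)
        pair = frozenset((u, v))
        # Reject self-pairs and pairs that are known positive edges.
        if u != v and pair not in existing:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy graph: a hub "a" plus one extra edge between two of its neighbours.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
print(sample_negatives(toy_edges, n_samples=3))
```

Comparing the endpoint degree distribution of the sampled negatives against that of the positives, with and without `degree_aware`, is a quick check for the imbalance the abstract describes.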

11.
bioRxiv ; 2024 Jun 16.
Article in English | MEDLINE | ID: mdl-38915571

ABSTRACT

Background: Computational approaches to support rare disease diagnosis are challenging to build, requiring the integration of complex data types such as ontologies, gene-to-phenotype associations, and cross-species data into variant and gene prioritisation algorithms (VGPAs). However, the performance of VGPAs has been difficult to measure and is impacted by many factors, for example, ontology structure, annotation completeness or changes to the underlying algorithm. Assertions of the capabilities of VGPAs are often not reproducible, in part because there is no standardised, empirical framework and openly available patient data to assess the efficacy of VGPAs, which ultimately hinders the development of effective prioritisation tools. Results: In this paper, we present our benchmarking tool, PhEval, which aims to provide a standardised and empirical framework to evaluate phenotype-driven VGPAs. The inclusion of standardised test corpora and test corpus generation tools in the PhEval suite of tools allows open benchmarking and comparison of methods on standardised data sets. Conclusions: PhEval and the standardised test corpora solve the issues of patient data availability and experimental tooling configuration when benchmarking and comparing rare disease VGPAs. By providing standardised data on patient cohorts from real-world case reports and controlling the configuration of evaluated VGPAs, PhEval enables transparent, portable, comparable and reproducible benchmarking of VGPAs. As these tools are often a key component of many rare disease diagnostic pipelines, a thorough and standardised method of assessment is essential for improving patient diagnosis and care.

12.
bioRxiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38712026

ABSTRACT

P21-activated kinase 2 (PAK2) is a serine/threonine kinase essential for a variety of cellular processes including signal transduction, cellular survival, proliferation, and migration. A recent report proposed that monoallelic PAK2 variants cause Knobloch syndrome type 2 (KNO2), a developmental disorder primarily characterized by ocular anomalies. Here, we identified a novel de novo heterozygous missense variant in PAK2, NM_002577.4:c.1273G>A, p.(D425N), by whole genome sequencing in an individual with features consistent with KNO2. Notable clinical phenotypes include global developmental delay, congenital retinal detachment, mild cerebral ventriculomegaly, hypotonia, failure to thrive (FTT), pyloric stenosis, feeding intolerance, patent ductus arteriosus, and mild facial dysmorphism. The p.(D425N) variant lies within the protein kinase domain and is predicted to be functionally damaging by in silico analysis. Previous clinical genetic testing did not report this variant because the relevance of PAK2 variants was unknown at the time of testing, highlighting the importance of reanalysis. Our findings also substantiate the candidacy of PAK2 variants in KNO2 and expand the KNO2 clinical spectrum.

13.
Sci Data ; 11(1): 906, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39174566

ABSTRACT

The "RNA world" represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to each patient's biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are constantly produced and available from public repositories, they are scattered across different databases and a centralized, uniform, and semantically consistent representation of the "RNA world" is still lacking. We propose RNA-KG, a knowledge graph (KG) encompassing biological knowledge about RNAs gathered from more than 60 public databases, integrating functional relationships with genes, proteins, and chemicals and ontologically grounded biomedical concepts. To develop RNA-KG, we first identified, pre-processed, and characterized each data source; next, we built a meta-graph that provides an ontological description of the KG by representing all the bio-molecular entities and medical concepts of interest in this domain, as well as the types of interactions connecting them. Finally, we leveraged an instance-based semantically abstracted knowledge model to specify the ontological alignment according to which RNA-KG was generated. RNA-KG can be downloaded in different formats and also queried by a SPARQL endpoint. A thorough topological analysis of the resulting heterogeneous graph provides further insights into the characteristics of the "RNA world". RNA-KG can be both directly explored and visualized, and/or analyzed by applying computational methods to infer bio-medical knowledge from its heterogeneous nodes and edges. The resource can be easily updated with new experimental data, and specific views of the overall KG can be extracted according to the bio-medical problem to be studied.


Subject(s)
RNA; RNA/genetics; Humans; Biological Ontologies
14.
Int J Med Inform ; 187: 105461, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38643701

ABSTRACT

OBJECTIVE: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (for example, endometriosis, ovarian cyst, and uterine fibroids). MATERIALS AND METHODS: We harmonized survey data from the Personalized Environment and Genes Study (PEGS) on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. RESULTS: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant or suggestive predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. DISCUSSION: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal but can support hypothesis generation. CONCLUSION: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.


Subject(s)
Environmental Exposure; Humans; Female; Environmental Exposure/adverse effects; Genital Diseases, Female; Logistic Models; Nutritional Status; Diet; Adult; Random Forest
15.
Transl Psychiatry ; 14(1): 246, 2024 Jun 08.
Article in English | MEDLINE | ID: mdl-38851761

ABSTRACT

Acute COVID-19 infection can be followed by diverse clinical manifestations referred to as Post Acute Sequelae of SARS-CoV2 Infection (PASC). Studies have shown an increased risk of being diagnosed with new-onset psychiatric disease following a diagnosis of acute COVID-19. However, it was unclear whether non-psychiatric PASC-associated manifestations (PASC-AMs) are associated with an increased risk of new-onset psychiatric disease following COVID-19. A retrospective electronic health record (EHR) cohort study of 2,391,006 individuals with acute COVID-19 was performed to evaluate whether non-psychiatric PASC-AMs are associated with new-onset psychiatric disease. Data were obtained from the National COVID Cohort Collaborative (N3C), which has EHR data from 76 clinical organizations. EHR codes were mapped to 151 non-psychiatric PASC-AMs recorded 28-120 days following SARS-CoV-2 diagnosis and before diagnosis of new-onset psychiatric disease. Association of newly diagnosed psychiatric disease with age, sex, race, pre-existing comorbidities, and PASC-AMs in seven categories was assessed by logistic regression. There were significant associations between a diagnosis of any psychiatric disease and five categories of PASC-AMs with odds ratios highest for neurological, cardiovascular, and constitutional PASC-AMs with odds ratios of 1.31, 1.29, and 1.23 respectively. Secondary analysis revealed that the proportions of 50 individual clinical features significantly differed between patients diagnosed with different psychiatric diseases. Our study provides evidence for association between non-psychiatric PASC-AMs and the incidence of newly diagnosed psychiatric disease. Significant associations were found for features related to multiple organ systems. This information could prove useful in understanding risk stratification for new-onset psychiatric disease following COVID-19. Prospective studies are needed to corroborate these findings.


Subject(s)
COVID-19; Mental Disorders; SARS-CoV-2; Humans; COVID-19/psychology; COVID-19/complications; COVID-19/epidemiology; Male; Female; Mental Disorders/epidemiology; Middle Aged; Adult; Retrospective Studies; Aged; Phenotype; Post-Acute COVID-19 Syndrome; Comorbidity; Electronic Health Records; Young Adult; Risk Factors; Adolescent
16.
Sci Data ; 11(1): 363, 2024 Apr 11.
Article in English | MEDLINE | ID: mdl-38605048

ABSTRACT

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.


Subject(s)
Biological Science Disciplines; Knowledge Bases; Pattern Recognition, Automated; Algorithms; Translational Research, Biomedical
17.
medRxiv ; 2024 May 29.
Article in English | MEDLINE | ID: mdl-38854034

ABSTRACT

The Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema was released in 2022 and approved by ISO as a standard for sharing clinical and genomic information about an individual, including phenotypic descriptions, numerical measurements, genetic information, diagnoses, and treatments. A phenopacket can be used as an input file for software that supports phenotype-driven genomic diagnostics and for algorithms that facilitate patient classification and stratification for identifying new diseases and treatments. There has been a great need for a collection of phenopackets to test software pipelines and algorithms. Here, we present phenopacket-store. Version 0.1.12 of phenopacket-store includes 4916 phenopackets representing 277 Mendelian and chromosomal diseases associated with 236 genes, and 2872 unique pathogenic alleles curated from 605 different publications. This represents the first large-scale collection of case-level, standardized phenotypic information derived from case reports in the literature with detailed descriptions of the clinical data and will be useful for many purposes, including the development and testing of software for prioritizing genes and diseases in diagnostic genomics, machine learning analysis of clinical phenotype data, patient stratification, and genotype-phenotype correlations. This corpus also provides best-practice examples for curating literature-derived data using the GA4GH Phenopacket Schema.
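
For orientation, the sketch below builds a small phenopacket-like record as a Python dictionary and extracts its HPO term identifiers. It is a hand-written, simplified illustration: the field names follow the JSON form of the GA4GH Phenopacket Schema as commonly documented, but the record has not been validated against the schema, and all identifiers, the creator name, the timestamp, and the version string are placeholders. The helper `hpo_term_ids` is a hypothetical name.

```python
import json

# Simplified, unvalidated example; every value below is a placeholder.
phenopacket = {
    "id": "example-case-1",
    "subject": {"id": "individual-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    ],
    "metaData": {
        "created": "2024-01-01T00:00:00Z",
        "createdBy": "example-curator",
        "resources": [
            {
                "id": "hp",
                "name": "human phenotype ontology",
                "url": "http://purl.obolibrary.org/obo/hp.owl",
                "version": "PLACEHOLDER",
                "namespacePrefix": "HP",
                "iriPrefix": "http://purl.obolibrary.org/obo/HP_",
            }
        ],
        "phenopacketSchemaVersion": "2.0",
    },
}


def hpo_term_ids(packet):
    """Collect the ontology identifiers of all phenotypic features in a record."""
    return [feature["type"]["id"] for feature in packet.get("phenotypicFeatures", [])]


# Round-trip through JSON, as a file-based pipeline would.
loaded = json.loads(json.dumps(phenopacket))
print(hpo_term_ids(loaded))
```

A list of HPO identifiers like this is the typical input to the phenotype-driven prioritisation software that the abstract mentions as a use case.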

18.
Cancer Res ; 84(13): 2060-2072, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-39082680

ABSTRACT

Patient-derived xenografts (PDX) model human intra- and intertumoral heterogeneity in the context of the intact tissue of immunocompromised mice. Histologic imaging via hematoxylin and eosin (H&E) staining is routinely performed on PDX samples, which could be harnessed for computational analysis. Prior studies of large clinical H&E image repositories have shown that deep learning analysis can identify intercellular and morphologic signals correlated with disease phenotype and therapeutic response. In this study, we developed an extensive, pan-cancer repository of >1,000 PDX and paired parental tumor H&E images. These images, curated from the PDX Development and Trial Centers Research Network Consortium, had a range of associated genomic and transcriptomic data, clinical metadata, pathologic assessments of cell composition, and, in several cases, detailed pathologic annotations of neoplastic, stromal, and necrotic regions. The amenability of these images to deep learning was highlighted through three applications: (i) development of a classifier for neoplastic, stromal, and necrotic regions; (ii) development of a predictor of xenograft-transplant lymphoproliferative disorder; and (iii) application of a published predictor of microsatellite instability. Together, this PDX Development and Trial Centers Research Network image repository provides a valuable resource for controlled digital pathology analysis, both for the evaluation of technical issues and for the development of computational image-based methods that make clinical predictions based on PDX treatment studies. Significance: A pan-cancer repository of >1,000 patient-derived xenograft hematoxylin and eosin-stained images will facilitate cancer biology investigations through histopathologic analysis and contributes important model system data that expand existing human histology repositories.


Subject(s)
Deep Learning , Neoplasms , Humans , Animals , Mice , Neoplasms/genetics , Neoplasms/pathology , Neoplasms/diagnostic imaging , Genomics/methods , Heterografts , Xenograft Model Antitumor Assays , Lymphoproliferative Disorders/genetics , Lymphoproliferative Disorders/pathology , Image Processing, Computer-Assisted/methods
19.
Nat Comput Sci ; 3(6): 552-568, 2023 Jun.
Article in English | MEDLINE | ID: mdl-38177435

ABSTRACT

Graph representation learning methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge- and node-label prediction performance. GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources. Standardized interfaces allow a seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of graph-representation-learning methods, therefore also positioning GRAPE as a software resource that performs a fair comparison between methods and libraries for graph processing and embedding.


Subject(s)
Libraries , Vitis , Algorithms , Software , Learning
20.
medRxiv ; 2023 Dec 21.
Article in English | MEDLINE | ID: mdl-38196618

ABSTRACT

To discover rare disease-gene associations, we developed a gene burden analytical framework and applied it to rare, protein-coding variants from whole genome sequencing of 35,008 cases with rare diseases and their family members recruited to the 100,000 Genomes Project (100KGP). Following in silico triaging of the results, 88 novel associations were identified including 38 with existing experimental evidence. We have published the confirmation of one of these associations, hereditary ataxia with UCHL1, and independent confirmatory evidence has recently been published for four more. We highlight a further seven compelling associations: hypertrophic cardiomyopathy with DYSF and SLC4A3 where both genes show high/specific heart expression and existing associations to skeletal dystrophies or short QT syndrome respectively; monogenic diabetes with UNC13A with a known role in the regulation of β cells and a mouse model with impaired glucose tolerance; epilepsy with KCNQ1 where a mouse model shows seizures and the existing long QT syndrome association may be linked; early onset Parkinson's disease with RYR1 with existing links to tremor pathophysiology and a mouse model with neurological phenotypes; anterior segment ocular abnormalities associated with POMK showing expression in corneal cells and with a zebrafish model with developmental ocular abnormalities; and cystic kidney disease with COL4A3 showing high renal expression and prior evidence for a digenic or modifying role in renal disease. Confirmation of all 88 associations would lead to potential diagnoses in 456 molecularly undiagnosed cases within the 100KGP, as well as other rare disease patients worldwide, highlighting the clinical impact of a large-scale statistical approach to rare disease gene discovery.
