Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 68
Filtrar
Más filtros

Bases de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Am J Hum Genet ; 108(9): 1564-1577, 2021 09 02.
Artículo en Inglés | MEDLINE | ID: mdl-34289339

RESUMEN

A critical challenge in genetic diagnostics is the computational assessment of candidate splice variants, specifically the interpretation of nucleotide changes located outside of the highly conserved dinucleotide sequences at the 5' and 3' ends of introns. To address this gap, we developed the Super Quick Information-content Random-forest Learning of Splice variants (SQUIRLS) algorithm. SQUIRLS generates a small set of interpretable features for machine learning by calculating the information-content of wild-type and variant sequences of canonical and cryptic splice sites, assessing changes in candidate splicing regulatory sequences, and incorporating characteristics of the sequence such as exon length, disruptions of the AG exclusion zone, and conservation. We curated a comprehensive collection of disease-associated splice-altering variants at positions outside of the highly conserved AG/GT dinucleotides at the termini of introns. SQUIRLS trains two random-forest classifiers for the donor and for the acceptor and combines their outputs by logistic regression to yield a final score. We show that SQUIRLS transcends previous state-of-the-art accuracy in classifying splice variants as assessed by rank analysis in simulated exomes, and is significantly faster than competing methods. SQUIRLS provides tabular output files for incorporation into diagnostic pipelines for exome and genome analysis, as well as visualizations that contextualize predicted effects of variants on splicing to make it easier to interpret splice variants in diagnostic settings.


Asunto(s)
Algoritmos , Curaduría de Datos/métodos , Enfermedades Genéticas Congénitas/genética , Sitios de Empalme de ARN , Empalme del ARN , Programas Informáticos , Secuencia de Bases , Biología Computacional/métodos , Exoma , Exones , Enfermedades Genéticas Congénitas/diagnóstico , Enfermedades Genéticas Congénitas/patología , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Intrones , Mutación , Secuenciación del Exoma
2.
Brief Bioinform ; 23(4)2022 07 18.
Artículo en Inglés | MEDLINE | ID: mdl-35679533

RESUMEN

Patient similarity networks (PSNs), where patients are represented as nodes and their similarities as weighted edges, are being increasingly used in clinical research. These networks provide an insightful summary of the relationships among patients and can be exploited by inductive or transductive learning algorithms for the prediction of patient outcome, phenotype and disease risk. PSNs can also be easily visualized, thus offering a natural way to inspect complex heterogeneous patient data and providing some level of explainability of the predictions obtained by machine learning algorithms. The advent of high-throughput technologies, enabling us to acquire high-dimensional views of the same patients (e.g. omics data, laboratory data, imaging data), calls for the development of data fusion techniques for PSNs in order to leverage this rich heterogeneous information. In this article, we review existing methods for integrating multiple biomedical data views to construct PSNs, together with the different patient similarity measures that have been proposed. We also review methods that have appeared in the machine learning literature but have not yet been applied to PSNs, thus providing a resource to navigate the vast machine learning literature existing on this topic. In particular, we focus on methods that could be used to integrate very heterogeneous datasets, including multi-omics data as well as data derived from clinical information and medical imaging.


Asunto(s)
Algoritmos , Aprendizaje Automático
3.
Bioinformatics ; 39(4)2023 04 03.
Artículo en Inglés | MEDLINE | ID: mdl-36929917

RESUMEN

MOTIVATION: Advances in RNA sequencing technologies have achieved an unprecedented accuracy in the quantification of mRNA isoforms, but our knowledge of isoform-specific functions has lagged behind. There is a need to understand the functional consequences of differential splicing, which could be supported by the generation of accurate and comprehensive isoform-specific gene ontology annotations. RESULTS: We present isoform interpretation, a method that uses expectation-maximization to infer isoform-specific functions based on the relationship between sequence and functional isoform similarity. We predicted isoform-specific functional annotations for 85 617 isoforms of 17 900 protein-coding human genes spanning a range of 17 430 distinct gene ontology terms. Comparison with a gold-standard corpus of manually annotated human isoform functions showed that isoform interpretation significantly outperforms state-of-the-art competing methods. We provide experimental evidence that functionally related isoforms predicted by isoform interpretation show a higher degree of domain sharing and expression correlation than functionally related genes. We also show that isoform sequence similarity correlates better with inferred isoform function than with gene-level function. AVAILABILITY AND IMPLEMENTATION: Source code, documentation, and resource files are freely available under a GNU3 license at https://github.com/TheJacksonLaboratory/isopretEM and https://zenodo.org/record/7594321.


Asunto(s)
Motivación , Programas Informáticos , Humanos , Isoformas de Proteínas/genética , Empalme Alternativo , Análisis de Secuencia de ARN
4.
J Biomed Inform ; 139: 104295, 2023 03.
Artículo en Inglés | MEDLINE | ID: mdl-36716983

RESUMEN

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.


Asunto(s)
COVID-19 , Humanos , Algoritmos , Proyectos de Investigación , Sesgo , Probabilidad
5.
BMC Bioinformatics ; 23(Suppl 2): 154, 2022 Dec 12.
Artículo en Inglés | MEDLINE | ID: mdl-36510125

RESUMEN

BACKGROUND: Cis-regulatory regions (CRRs) are non-coding regions of the DNA that fine control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have been already reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences. RESULTS: We present the experiments we performed to compare two Deep Neural Networks, a Feed-Forward Neural Network model working on epigenomic features, and a Convolutional Neural Network model working only on genomic sequence, targeted to the identification of enhancer- and promoter-activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups for reducing negative unbalancing effects. CONCLUSIONS: Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results, and should therefore be cautiously applied; (3) despite working on sequence data, convolutional models obtain performance close to those of feed forward models working on epigenomic information, which suggests that also sequence data carries informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future works.


Asunto(s)
Aprendizaje Profundo , Humanos , Teorema de Bayes , Secuencias Reguladoras de Ácidos Nucleicos , Redes Neurales de la Computación , Aprendizaje Automático
6.
Bioinformatics ; 37(23): 4526-4533, 2021 12 07.
Artículo en Inglés | MEDLINE | ID: mdl-34240108

RESUMEN

MOTIVATION: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). 'Hierarchy-unaware' classifiers, also known as 'flat' methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while 'hierarchy-aware' approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO. RESULTS: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide 'TPR-safe' predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges. AVAILABILITY AND IMPLEMENTATION: Fully tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Biología Computacional , Ontología de Genes , Biología Computacional/métodos , Proteínas/metabolismo
7.
Virol J ; 19(1): 84, 2022 05 15.
Artículo en Inglés | MEDLINE | ID: mdl-35570298

RESUMEN

BACKGROUND: Non-steroidal anti-inflammatory drugs (NSAIDs) are commonly used to reduce pain, fever, and inflammation but have been associated with complications in community-acquired pneumonia. Observations shortly after the start of the COVID-19 pandemic in 2020 suggested that ibuprofen was associated with an increased risk of adverse events in COVID-19 patients, but subsequent observational studies failed to demonstrate increased risk and in one case showed reduced risk associated with NSAID use. METHODS: A 38-center retrospective cohort study was performed that leveraged the harmonized, high-granularity electronic health record data of the National COVID Cohort Collaborative. A propensity-matched cohort of 19,746 COVID-19 inpatients was constructed by matching cases (treated with NSAIDs at the time of admission) and 19,746 controls (not treated) from 857,061 patients with COVID-19 available for analysis. The primary outcome of interest was COVID-19 severity in hospitalized patients, which was classified as: moderate, severe, or mortality/hospice. Secondary outcomes were acute kidney injury (AKI), extracorporeal membrane oxygenation (ECMO), invasive ventilation, and all-cause mortality at any time following COVID-19 diagnosis. RESULTS: Logistic regression showed that NSAID use was not associated with increased COVID-19 severity (OR: 0.57 95% CI: 0.53-0.61). Analysis of secondary outcomes using logistic regression showed that NSAID use was not associated with increased risk of all-cause mortality (OR 0.51 95% CI: 0.47-0.56), invasive ventilation (OR: 0.59 95% CI: 0.55-0.64), AKI (OR: 0.67 95% CI: 0.63-0.72), or ECMO (OR: 0.51 95% CI: 0.36-0.7). In contrast, the odds ratios indicate reduced risk of these outcomes, but our quantitative bias analysis showed E-values of between 1.9 and 3.3 for these associations, indicating that comparatively weak or moderate confounder associations could explain away the observed associations. CONCLUSIONS: Study interpretation is limited by the observational design. Recording of NSAID use may have been incomplete. Our study demonstrates that NSAID use is not associated with increased COVID-19 severity, all-cause mortality, invasive ventilation, AKI, or ECMO in COVID-19 inpatients. A conservative interpretation in light of the quantitative bias analysis is that there is no evidence that NSAID use is associated with risk of increased severity or the other measured outcomes. Our results confirm and extend analogous findings in previous observational studies using a large cohort of patients drawn from 38 centers in a nationally representative multicenter database.


Asunto(s)
Lesión Renal Aguda , COVID-19 , Antiinflamatorios no Esteroideos/efectos adversos , Prueba de COVID-19 , Estudios de Cohortes , Humanos , Pandemias , Estudios Retrospectivos
8.
Hematol Oncol ; 39(3): 364-379, 2021 Aug.
Artículo en Inglés | MEDLINE | ID: mdl-33497493

RESUMEN

Wnt/Fzd signaling has been implicated in hematopoietic stem cell maintenance and in acute leukemia establishment. In our previous work, we described a recurrent rearrangement involving the WNT10B locus (WNT10BR ), characterized by the expression of WNT10BIVS1 transcript variant, in acute myeloid leukemia. To determine the occurrence of WNT10BR in T-cell acute lymphoblastic leukemia (T-ALL), we retrospectively analyzed an Italian cohort of patients (n = 20) and detected a high incidence (13/20) of WNT10BIVS1 expression. To address genes involved in WNT10B molecular response, we have designed a Wnt-targeted RNA sequencing panel. Identifying Wnt agonists and antagonists, it results that the expression of FZD6, LRP5, and PROM1 genes stands out in WNT10BIVS1 positive patients compared to negative ones. Using MOLT4 and MUTZ-2 as leukemic cell models, which are characterized by the expression of WNT10BIVS1 , we have observed that WNT10B drives major Wnt activation to the FZD6 receptor complex through receipt of ligand. Additionally, short hairpin RNAs (shRNAs)-mediated gene silencing and small molecule-mediated inhibition of WNTs secretion have been observed to interfere with the WNT10B/FZD6 interaction. We have therefore identified that WNT10BIVS1 knockdown, or pharmacological interference by the LGK974 porcupine (PORCN) inhibitor, reduces WNT10B/FZD6 protein complex formation and significantly impairs intracellular effectors and leukemic expansion. These results describe the molecular circuit induced by WNT10B and suggest WNT10B/FZD6 as a new target in the T-ALL treatment strategy.


Asunto(s)
Receptores Frizzled/metabolismo , Regulación Leucémica de la Expresión Génica , Leucemia-Linfoma Linfoblástico de Células T Precursoras/metabolismo , Proteínas Proto-Oncogénicas/biosíntesis , Proteínas Wnt/biosíntesis , Vía de Señalización Wnt , Aciltransferasas/antagonistas & inhibidores , Aciltransferasas/genética , Aciltransferasas/metabolismo , Femenino , Receptores Frizzled/genética , Células HeLa , Humanos , Masculino , Proteínas de la Membrana/antagonistas & inhibidores , Proteínas de la Membrana/genética , Proteínas de la Membrana/metabolismo , Leucemia-Linfoma Linfoblástico de Células T Precursoras/tratamiento farmacológico , Leucemia-Linfoma Linfoblástico de Células T Precursoras/genética , Leucemia-Linfoma Linfoblástico de Células T Precursoras/patología , Proteínas Proto-Oncogénicas/genética , Pirazinas/farmacología , Piridinas/farmacología , Proteínas Wnt/genética
9.
BMC Bioinformatics ; 20(1): 422, 2019 Aug 14.
Artículo en Inglés | MEDLINE | ID: mdl-31412768

RESUMEN

BACKGROUND: One of the main issues in the automated protein function prediction (AFP) problem is the integration of multiple networked data sources. The UNIPred algorithm was thereby proposed to efficiently integrate -in a function-specific fashion- the protein networks by taking into account the imbalance that characterizes protein annotations, and to subsequently predict novel hypotheses about unannotated proteins. UNIPred is publicly available as R code, which might result of limited usage for non-expert users. Moreover, its application requires efforts in the acquisition and preparation of the networks to be integrated. Finally, the UNIPred source code does not handle the visualization of the resulting consensus network, whereas suitable views of the network topology are necessary to explore and interpret existing protein relationships. RESULTS: We address the aforementioned issues by proposing UNIPred-Web, a user-friendly Web tool for the application of the UNIPred algorithm to a variety of biomolecular networks, already supplied by the system, and for the visualization and exploration of protein networks. We support different organisms and different types of networks -e.g., co-expression, shared domains and physical interaction networks. Users are supported in the different phases of the process, ranging from the selection of the networks and the protein function to be predicted, to the navigation of the integrated network. The system also supports the upload of user-defined protein networks. The vertex-centric and the highly interactive approach of UNIPred-Web allow a narrow exploration of specific proteins, and an interactive analysis of large sub-networks with only a few mouse clicks. CONCLUSIONS: UNIPred-Web offers a practical and intuitive (visual) guidance to biologists interested in gaining insights into protein biomolecular functions. UNIPred-Web provides facilities for the integration of networks, and supplies a framework for the imbalance-aware protein network integration of nine organisms, the prediction of thousands of GO protein functions, and a easy-to-use graphical interface for the visual analysis, navigation and interpretation of the integrated networks and of the functional predictions.


Asunto(s)
Biología Computacional/métodos , Internet , Mapas de Interacción de Proteínas , Proteínas/metabolismo , Programas Informáticos , Algoritmos , Interfaz Usuario-Computador
11.
Am J Hum Genet ; 99(3): 595-606, 2016 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-27569544

RESUMEN

The interpretation of non-coding variants still constitutes a major challenge in the application of whole-genome sequencing in Mendelian disease, especially for single-nucleotide and other small non-coding variants. Here we present Genomiser, an analysis framework that is able not only to score the relevance of variation in the non-coding genome, but also to associate regulatory variants to specific Mendelian diseases. Genomiser scores variants through either existing methods such as CADD or a bespoke machine learning method and combines these with allele frequency, regulatory sequences, chromosomal topological domains, and phenotypic relevance to discover variants associated to specific Mendelian disorders. Overall, Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes, allowing effective detection and discovery of regulatory variants in Mendelian disease.


Asunto(s)
Algoritmos , Enfermedades Genéticas Congénitas/genética , Genoma Humano/genética , Mutación/genética , Frecuencia de los Genes , Estudio de Asociación del Genoma Completo , Humanos , Aprendizaje Automático , Sistemas de Lectura Abierta/genética , Fenotipo , Mutación Puntual/genética
12.
BMC Bioinformatics ; 19(Suppl 10): 353, 2018 Oct 15.
Artículo en Inglés | MEDLINE | ID: mdl-30367594

RESUMEN

BACKGROUND: Several problems in network biology and medicine can be cast into a framework where entities are represented through partially labeled networks, and the aim is inferring the labels (usually binary) of the unlabeled part. Connections represent functional or genetic similarity between entities, while the labellings often are highly unbalanced, that is one class is largely under-represented: for instance in the automated protein function prediction (AFP) for most Gene Ontology terms only few proteins are annotated, or in the disease-gene prioritization problem only few genes are actually known to be involved in the etiology of a given disease. Imbalance-aware approaches to accurately predict node labels in biological networks are thereby required. Furthermore, such methods must be scalable, since input data can be large-sized as, for instance, in the context of multi-species protein networks. RESULTS: We propose a novel semi-supervised parallel enhancement of COSNET, an imbalance-aware algorithm build on Hopfield neural model recently suggested to solve the AFP problem. By adopting an efficient representation of the graph and assuming a sparse network topology, we empirically show that it can be efficiently applied to networks with millions of nodes. The key strategy to speed up the computations is to partition nodes into independent sets so as to process each set in parallel by exploiting the power of GPU accelerators. This parallel technique ensures the convergence to asymptotically stable attractors, while preserving the asynchronous dynamics of the original model. Detailed experiments on real data and artificial big instances of the problem highlight scalability and efficiency of the proposed method. CONCLUSIONS: By parallelizing COSNET we achieved on average a speed-up of 180x in solving the AFP problem in the S. cerevisiae, Mus musculus and Homo sapiens organisms, while lowering memory requirements. In addition, to show the potential applicability of the method to huge biomolecular networks, we predicted node labels in artificially generated sparse networks involving hundreds of thousands to millions of nodes.


Asunto(s)
Algoritmos , Gráficos por Computador , Redes Reguladoras de Genes , Animales , Ontología de Genes , Humanos , Ratones , Mapas de Interacción de Proteínas/genética , Proteínas/genética , Saccharomyces cerevisiae/genética , Factores de Tiempo
13.
BMC Bioinformatics ; 18(1): 449, 2017 Oct 12.
Artículo en Inglés | MEDLINE | ID: mdl-29025394

RESUMEN

BACKGROUND: The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of the prediction of gene-disease associations has been widely investigated, the related problem of gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even if for most human genes no HPO term associations are known and despite the increasing application of the HPO to relevant medical problems. Moreover most of the methods proposed in literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions. RESULTS: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, that consists in a "flat" learning first step and a hierarchical combination of the predictions in the second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction of the computational complexity. CONCLUSIONS: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve and make consistent the HPO terms predictions starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.


Asunto(s)
Algoritmos , Ontologías Biológicas , Área Bajo la Curva , Estudios de Asociación Genética , Humanos , Anotación de Secuencia Molecular , Fenotipo , Curva ROC
14.
Bioinformatics ; 32(18): 2872-4, 2016 09 15.
Artículo en Inglés | MEDLINE | ID: mdl-27256314

RESUMEN

UNLABELLED: RANKS is a flexible software package that can be easily applied to any bioinformatics task formalizable as ranking of nodes with respect to a property given as a label, such as automated protein function prediction, gene disease prioritization and drug repositioning. To this end RANKS provides an efficient and easy-to-use implementation of kernelized score functions, a semi-supervised algorithmic scheme embedding both local and global learning strategies for the analysis of biomolecular networks. To facilitate comparative assessment, baseline network-based methods, e.g. label propagation and random walk algorithms, have also been implemented. AVAILABILITY AND IMPLEMENTATION: The package is available from CRAN: https://cran.r-project.org/ The package is written in R, except for the most computationally intensive functionalities which are implemented in C. CONTACT: valentini@di.unimi.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Reposicionamiento de Medicamentos , Programas Informáticos , Algoritmos , Biología Computacional/métodos , Bases de Datos Factuales , Genómica , Humanos , Proteínas , Biología de Sistemas
15.
Bioinformatics ; 30(15): 2235-6, 2014 Aug 01.
Artículo en Inglés | MEDLINE | ID: mdl-24659104

RESUMEN

SUMMARY: We present GOssTo, the Gene Ontology semantic similarity Tool, a user-friendly software system for calculating semantic similarities between gene products according to the Gene Ontology. GOssTo is bundled with six semantic similarity measures, including both term- and graph-based measures, and has extension capabilities to allow the user to add new similarities. Importantly, for any measure, GOssTo can also calculate the Random Walk Contribution that has been shown to greatly improve the accuracy of similarity measures. GOssTo is very fast, easy to use, and it allows the calculation of similarities on a genomic scale in a few minutes on a regular desktop machine. CONTACT: alberto@cs.rhul.ac.uk AVAILABILITY: GOssTo is available both as a stand-alone application running on GNU/Linux, Windows and MacOS from www.paccanarolab.org/gossto and as a web application from www.paccanarolab.org/gosstoweb. The stand-alone application features a simple and concise command line interface for easy integration into high-throughput data processing pipelines.


Asunto(s)
Minería de Datos/métodos , Ontología de Genes , Internet , Semántica , Programas Informáticos , Proteínas/genética , Vocabulario Controlado
16.
medRxiv ; 2024 Feb 26.
Artículo en Inglés | MEDLINE | ID: mdl-37503093

RESUMEN

Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.

17.
Bioinform Adv ; 4(1): vbae036, 2024.
Artículo en Inglés | MEDLINE | ID: mdl-38577542

RESUMEN

Motivation: Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. Results: We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples are, then performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. Availability and implementation: Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.

18.
Int J Med Inform ; 187: 105461, 2024 Jul.
Artículo en Inglés | MEDLINE | ID: mdl-38643701

RESUMEN

OBJECTIVE: Female reproductive disorders (FRDs) are common health conditions that may present with significant symptoms. Diet and environment are potential areas for FRD interventions. We utilized a knowledge graph (KG) method to predict factors associated with common FRDs (for example, endometriosis, ovarian cyst, and uterine fibroids). MATERIALS AND METHODS: We harmonized survey data from the Personalized Environment and Genes Study (PEGS) on internal and external environmental exposures and health conditions with biomedical ontology content. We merged the harmonized data and ontologies with supplemental nutrient and agricultural chemical data to create a KG. We analyzed the KG by embedding edges and applying a random forest for edge prediction to identify variables potentially associated with FRDs. We also conducted logistic regression analysis for comparison. RESULTS: Across 9765 PEGS respondents, the KG analysis resulted in 8535 significant or suggestive predicted links between FRDs and chemicals, phenotypes, and diseases. Amongst these links, 32 were exact matches when compared with the logistic regression results, including comorbidities, medications, foods, and occupational exposures. DISCUSSION: Mechanistic underpinnings of predicted links documented in the literature may support some of our findings. Our KG methods are useful for predicting possible associations in large, survey-based datasets with added information on directionality and magnitude of effect from logistic regression. These results should not be construed as causal but can support hypothesis generation. CONCLUSION: This investigation enabled the generation of hypotheses on a variety of potential links between FRDs and exposures. Future investigations should prospectively evaluate the variables hypothesized to impact FRDs.


Asunto(s)
Exposición a Riesgos Ambientales , Humanos , Femenino , Exposición a Riesgos Ambientales/efectos adversos , Enfermedades de los Genitales Femeninos , Modelos Logísticos , Estado Nutricional , Dieta , Adulto , Bosques Aleatorios
19.
Transl Psychiatry ; 14(1): 246, 2024 Jun 08.
Artículo en Inglés | MEDLINE | ID: mdl-38851761

RESUMEN

Acute COVID-19 infection can be followed by diverse clinical manifestations referred to as Post Acute Sequelae of SARS-CoV2 Infection (PASC). Studies have shown an increased risk of being diagnosed with new-onset psychiatric disease following a diagnosis of acute COVID-19. However, it was unclear whether non-psychiatric PASC-associated manifestations (PASC-AMs) are associated with an increased risk of new-onset psychiatric disease following COVID-19. A retrospective electronic health record (EHR) cohort study of 2,391,006 individuals with acute COVID-19 was performed to evaluate whether non-psychiatric PASC-AMs are associated with new-onset psychiatric disease. Data were obtained from the National COVID Cohort Collaborative (N3C), which has EHR data from 76 clinical organizations. EHR codes were mapped to 151 non-psychiatric PASC-AMs recorded 28-120 days following SARS-CoV-2 diagnosis and before diagnosis of new-onset psychiatric disease. Association of newly diagnosed psychiatric disease with age, sex, race, pre-existing comorbidities, and PASC-AMs in seven categories was assessed by logistic regression. There were significant associations between a diagnosis of any psychiatric disease and five categories of PASC-AMs with odds ratios highest for neurological, cardiovascular, and constitutional PASC-AMs with odds ratios of 1.31, 1.29, and 1.23 respectively. Secondary analysis revealed that the proportions of 50 individual clinical features significantly differed between patients diagnosed with different psychiatric diseases. Our study provides evidence for association between non-psychiatric PASC-AMs and the incidence of newly diagnosed psychiatric disease. Significant associations were found for features related to multiple organ systems. This information could prove useful in understanding risk stratification for new-onset psychiatric disease following COVID-19. Prospective studies are needed to corroborate these findings.


Asunto(s)
COVID-19 , Trastornos Mentales , SARS-CoV-2 , Humanos , COVID-19/psicología , COVID-19/complicaciones , COVID-19/epidemiología , Masculino , Femenino , Trastornos Mentales/epidemiología , Persona de Mediana Edad , Adulto , Estudios Retrospectivos , Anciano , Fenotipo , Síndrome Post Agudo de COVID-19 , Comorbilidad , Registros Electrónicos de Salud , Adulto Joven , Factores de Riesgo , Adolescente
20.
Sci Data ; 11(1): 363, 2024 Apr 11.
Artículo en Inglés | MEDLINE | ID: mdl-38605048

RESUMEN

Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.


Asunto(s)
Disciplinas de las Ciencias Biológicas , Bases del Conocimiento , Reconocimiento de Normas Patrones Automatizadas , Algoritmos , Investigación Biomédica Traslacional
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA