1.
Brief Bioinform ; 24(2), 2023 03 19.
Article in English | MEDLINE | ID: mdl-36752352

ABSTRACT

Drug response prediction (DRP) is important for precision medicine because it predicts how a patient will react to a drug before administration. Existing studies take cell line transcriptome data and the chemical structures of drugs as input and predict drug response as IC50 or AUC values. Intuitively, drug target interaction (DTI) information should be useful for DRP. However, using DTI is difficult because existing drug response databases such as CCLE and GDSC do not contain transcriptome data measured after drug treatment. Although the post-treatment transcriptome is not available, if we can compute the perturbation effects caused by pharmacologic modulation of the target genes, we can utilize DTI information with CCLE and GDSC. In this study, we propose a framework that improves existing deep learning-based DRP models by effectively utilizing drug target information. Our framework includes NetGP, a module that computes gene perturbation scores by network propagation on a network. NetGP produces a list of genes ranked by gene perturbation score, and the ranked genes are input to a multi-layer perceptron to generate a fixed-dimension vector for integration with existing DRP models. This integration is done in a model-agnostic way so that any existing DRP tool can be incorporated. As a result, our framework boosts the performance of existing DRP models in 64 of 72 comparisons. The performance gains are especially large for test scenarios with unseen drugs, reaching up to 34% in Pearson's correlation coefficient.
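The abstract does not spell out how NetGP propagates scores, so the following is only a minimal sketch of one common form of network propagation, random walk with restart, seeded at a drug's target genes. The function name, the toy network, the gene labels, and the restart probability are all hypothetical and are not taken from the paper.

```python
def network_propagation(adjacency, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected graph.

    adjacency: {node: set of neighbor nodes}; every node needs >= 1 neighbor.
    seeds: nodes that receive the restart mass (here: hypothetical drug targets).
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform mass on the seed genes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # Each neighbor m passes an equal share of its score to its neighbors.
            spread = sum(p[m] / len(adjacency[m]) for m in adjacency[n])
            nxt[n] = (1 - restart) * spread + restart * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy gene network (hypothetical labels, not real interaction data).
toy_network = {
    "TARGET": {"A", "B"},
    "A": {"TARGET", "C"},
    "B": {"TARGET"},
    "C": {"A"},
}
scores = network_propagation(toy_network, seeds={"TARGET"})
# Genes ranked by propagated perturbation score, highest first.
ranked_genes = sorted(scores, key=scores.get, reverse=True)
```

In a pipeline like the one described, a ranked list of this kind would then be encoded by a multi-layer perceptron; that step is omitted here.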


Subject(s)
Databases, Pharmaceutical , Neural Networks, Computer , Humans , Precision Medicine/methods , Drug Delivery Systems , Transcriptome
2.
Brief Bioinform ; 24(5), 2023 09 20.
Article in English | MEDLINE | ID: mdl-37544660

ABSTRACT

Combination therapies have brought significant advancements to the treatment of various diseases. However, searching for effective drug combinations remains a major challenge due to the vast number of possible combinations. Biomedical knowledge graph (KG)-based methods have shown potential in predicting effective combinations for a wide spectrum of diseases, but the lack of credible negative samples has limited the prediction performance of machine learning models. To address this issue, we propose a novel model-agnostic framework that leverages existing drug-drug interaction (DDI) data as a reliable negative dataset and employs supervised contrastive learning (SCL) to transform drug embedding vectors so that they are more suitable for drug combination prediction. We conducted extensive experiments using various network embedding algorithms, including random walk and graph neural networks, on a biomedical KG. Our framework significantly improved performance metrics compared to the baseline framework. We also provide embedding space visualizations and case studies that demonstrate the effectiveness of our approach. This work highlights the potential of using DDI data and SCL in finding tighter decision boundaries for predicting effective drug combinations.
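As an illustration of the supervised contrastive objective mentioned above, here is a minimal pure-Python sketch of the widely used SupCon-style loss. It assumes each drug pair has already been mapped to an embedding vector and labeled 1 (effective combination) or 0 (DDI-derived negative); the toy vectors, labels, and temperature are hypothetical, and the paper's exact loss may differ.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average SupCon-style loss over anchors that have at least one positive."""
    # L2-normalize so that dot products are cosine similarities.
    unit = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])

    n = len(unit)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Mean negative log-probability of picking a same-label sample.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings (hypothetical): two tight clusters in 2-D.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(vectors, [1, 1, 0, 0])
loss_mixed = supervised_contrastive_loss(vectors, [1, 0, 1, 0])
```

The loss is small when samples with the same label sit close together (`loss_aligned`) and large when labels cut across the clusters (`loss_mixed`), which is the pressure that pulls embeddings toward tighter class boundaries.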


Subject(s)
Algorithms , Pattern Recognition, Automated , Benchmarking , Drug Combinations , Drug Interactions
3.
Proc Natl Acad Sci U S A ; 118(11), 2021 03 16.
Article in English | MEDLINE | ID: mdl-33836591

ABSTRACT

White adipose tissue (WAT) is a key regulator of systemic energy metabolism, and impaired WAT plasticity characterized by enlargement of preexisting adipocytes associates with WAT dysfunction, obesity, and metabolic complications. However, the mechanisms that retain proper adipose tissue plasticity required for metabolic fitness are unclear. Here, we comprehensively showed that adipocyte-specific DNA methylation, manifested in enhancers and CTCF sites, directs distal enhancer-mediated transcriptomic features required to conserve metabolic functions of white adipocytes. Particularly, genetic ablation of adipocyte Dnmt1, the major methylation writer, led to increased adiposity characterized by increased adipocyte hypertrophy along with reduced expansion of adipocyte precursors (APs). These effects of Dnmt1 deficiency provoked systemic hyperlipidemia and impaired energy metabolism both in lean and obese mice. Mechanistically, Dnmt1 deficiency abrogated mitochondrial bioenergetics by inhibiting mitochondrial fission and promoted aberrant lipid metabolism in adipocytes, rendering adipocyte hypertrophy and WAT dysfunction. Dnmt1-dependent DNA methylation prevented aberrant CTCF binding and, in turn, sustained the proper chromosome architecture to permit interactions between enhancer and dynamin-1-like protein gene Dnm1l (Drp1) in adipocytes. Also, adipose DNMT1 expression inversely correlated with adiposity and markers of metabolic health but positively correlated with AP-specific markers in obese human subjects. Thus, these findings support strategies utilizing Dnmt1 action on mitochondrial bioenergetics in adipocytes to combat obesity and related metabolic pathology.


Subject(s)
Adipocytes/metabolism , DNA (Cytosine-5-)-Methyltransferase 1/metabolism , Epigenesis, Genetic , Mitochondrial Dynamics , Adipocytes/pathology , Adipose Tissue/metabolism , Adipose Tissue/pathology , Adiposity , Animals , CCCTC-Binding Factor/metabolism , Chromosome Structures , DNA (Cytosine-5-)-Methyltransferase 1/deficiency , DNA (Cytosine-5-)-Methyltransferase 1/genetics , DNA Methylation , Dynamins/genetics , Dynamins/metabolism , Energy Metabolism , Enhancer Elements, Genetic , Gene Expression Profiling , Lipid Metabolism , Mice , Mitochondria/metabolism , Obesity/metabolism , Obesity/pathology , Promoter Regions, Genetic , Protein Binding
4.
BMC Bioinformatics ; 23(Suppl 3): 149, 2022 Apr 25.
Article in English | MEDLINE | ID: mdl-35468739

ABSTRACT

BACKGROUND: The spread of coronavirus disease (COVID-19) has three major properties: pathogenic mutations, spatial propagation patterns, and temporal propagation patterns. We know how the virus has spread geographically and temporally in terms of statistics, i.e., the number of patients. However, we are yet to understand the spread at the level of individual patients. As of March 2021, COVID-19 is widespread all over the world with new genetic variants. One important question is how to track the early spreading patterns of COVID-19 up to the point when the virus had spread all over the world. RESULTS: In this work, we proposed AutoCoV, a deep learning method with multiple loss objectives that can track the early spread of COVID-19 in terms of spatial and temporal patterns until the disease had fully spread over the world in July 2020. Performance in learning spatial or temporal patterns was measured with two clustering measures and one classification measure. For annotated SARS-CoV-2 sequences from the National Center for Biotechnology Information (NCBI), AutoCoV outperformed seven baseline methods in our experiments for learning either spatial or temporal patterns. For spatial patterns, AutoCoV had at least 1.7-fold higher clustering performance and an F1 score of 88.1%. For temporal patterns, AutoCoV had at least 1.6-fold higher clustering performance and an F1 score of 76.1%. Furthermore, AutoCoV demonstrated the robustness of the embedding space on an independent dataset from the Global Initiative for Sharing All Influenza Data (GISAID). CONCLUSIONS: In summary, AutoCoV successfully learns geographic and temporal spreading patterns in experiments on NCBI and GISAID datasets and, to the best of our knowledge, is the first method of its kind to learn virus spreading patterns from genome sequences. We expect that this type of embedding method will be helpful in characterizing fast-evolving pandemics.


Subject(s)
COVID-19 , Deep Learning , COVID-19/epidemiology , Genome , Humans , Pandemics , SARS-CoV-2
5.
Brief Bioinform ; 21(1): 36-46, 2020 Jan 17.
Article in English | MEDLINE | ID: mdl-30462155

ABSTRACT

MOTIVATION: Biological pathways are extensively used in the analysis of transcriptome data to characterize the biological mechanisms underlying various phenotypes. There are a number of computational tools that summarize transcriptome data at the pathway level. However, there is no comparative study on how well these tools produce useful information at the cohort level, enabling comparison of many samples or patients. RESULTS: In this study, we systematically compared and evaluated 13 different pathway activity inference tools based on five comparison criteria using a pan-cancer data set. This study makes two major contributions. First, our study provides a comprehensive survey of the computational techniques used by existing pathway activity inference tools. The tools use different strategies and place different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations, and the scoring metric. Second, we performed extensive evaluations of the performance of these tools. Because different tools use different methods to map samples to the pathway dimension, the tools are evaluated at the pathway level using five comparison criteria. We first measured how well a tool maintains the characteristics of the original gene expression values, and then investigated robustness by adding noise to the gene expression data. Classification tasks on three clinical variables (tumor versus normal, survival, and cancer subtypes) were performed to evaluate the utility of the tools for clinical applications. In addition, the inferred activity values were compared between the tools to see how similar they are, along with the scoring schemes they use.

6.
Bioinformatics ; 36(12): 3818-3824, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32207514

ABSTRACT

MOTIVATION: Biological pathways are an important form of curated knowledge about biological processes. Thus, cancer subtype classification based on pathways is very useful for understanding differences in biological mechanisms among cancer subtypes. However, pathways include only a fraction of the entire gene set (only one-third of human genes in KEGG), and pathways are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification. RESULTS: We present an explainable deep learning model with an attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention-based ensemble model combines several hundred pathways in an explainable manner. Lastly, network propagation on a pathway-gene network explains why gene expression profiles differ between subtypes. In experiments with five TCGA cancer datasets, our method achieved very good classification accuracies and, additionally, identified subtype-specific pathways and biological functions. AVAILABILITY AND IMPLEMENTATION: The source code is available at http://biohealth.snu.ac.kr/software/GCN_MAE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neoplasms , Software , Attention , Humans , Neoplasms/genetics , Transcriptome
7.
Bioinformatics ; 35(14): i520-i529, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510697

ABSTRACT

MOTIVATION: Characterizing cancer subclones is crucial for the ultimate conquest of cancer. Thus, a number of bioinformatic tools have been developed to infer heterogeneous tumor populations based on genomic signatures such as mutations and copy number variations. Despite accumulating evidence for the significance of global DNA methylation reprogramming in certain cancer types, including myeloid malignancies, none of these bioinformatic tools is designed to exploit subclonally reprogrammed methylation patterns to reveal the constituent populations of a tumor. In accordance with the notion of global methylation reprogramming, our preliminary observations on acute myeloid leukemia (AML) samples implied the existence of subclonally occurring focal methylation aberrance throughout the genome. RESULTS: We present PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing. PRISM adopts DNA methyltransferase 1-like hidden Markov model-based in silico proofreading to correct erroneous methylation patterns. With error-corrected methylation patterns, PRISM focuses on short individual genomic regions harboring dichotomous patterns that can be split into fully methylated and fully unmethylated patterns. The frequencies of these two patterns form a sufficient statistic for subclonal abundance. The set of statistics collected from each genomic region is modeled with a beta-binomial mixture. Fitting the mixture with the expectation-maximization algorithm then provides the inferred composition of subclones. Applying PRISM to two AML samples, we demonstrate that PRISM can infer the evolutionary history of malignant samples from an epigenetic point of view. AVAILABILITY AND IMPLEMENTATION: PRISM is freely available on GitHub (https://github.com/dohlee/prism). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
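To make the beta-binomial mixture step more concrete, the sketch below shows only the E-step of such a mixture: given hypothetical component parameters, it computes how strongly each component explains one region's counts. It is not PRISM's implementation; the function names, the component parameters, and the example counts are assumptions, and the M-step (re-estimating weights and shape parameters) is omitted.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, a, b):
    """Log-probability of k successes in n trials under BetaBinomial(n, a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def responsibilities(k, n, components):
    """E-step for one region.

    k: number of fully methylated patterns among n dichotomous patterns.
    components: list of (weight, alpha, beta) tuples, one per mixture component.
    Returns the posterior probability of each component for this region.
    """
    logs = [log(w) + betabinom_logpmf(k, n, a, b) for w, a, b in components]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two hypothetical components with mean methylated fractions of 0.2 and 0.8.
mixture = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]
# A region where 18 of 20 dichotomous patterns are fully methylated.
posterior = responsibilities(18, 20, mixture)
```

A full EM fit would alternate this step with parameter updates until the likelihood stops improving, and the fitted mixture weights would then be read as subclone abundances.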


Subject(s)
DNA Copy Number Variations , DNA Methylation , Epigenomics , Genome , Genomics
8.
BMC Bioinformatics ; 20(Suppl 23): 667, 2019 Dec 27.
Article in English | MEDLINE | ID: mdl-31881980

ABSTRACT

BACKGROUND: The main research topic of this paper is how to compare multiple biological experiments using transcriptome data, where each experiment is designed to compare control and treated samples. Multiple biological experiments are usually compared in terms of the number of differentially expressed genes (DEGs) in an arbitrary combination of experiments. This process is usually facilitated with a Venn diagram, but there are several issues when a Venn diagram is used to compare and analyze multiple experiments in terms of DEGs. First, current Venn diagram tools do not provide systematic analysis to prioritize genes. Because current tools are generally not designed to prioritize genes, genes located in the segments of the Venn diagram (especially intersections) are usually difficult to rank. Second, elucidating the phenotypic difference with only the lists of DEGs and expression values is challenging when the experimental designs involve combinations of treatments. Experimental designs that aim to find the synergistic effect of a combination of treatments are very difficult to analyze without an informative system. RESULTS: We introduce Venn-diaNet, a Venn diagram-based analysis framework that uses network propagation on a protein-protein interaction network to prioritize genes from experiments that have multiple DEG lists. We suggest that the two issues can be effectively handled by ranking or prioritizing genes within segments of a Venn diagram. The user can easily compare multiple DEG lists with gene rankings, which are easy to understand and can be coupled with additional analyses for the user's purposes. Our system provides a web-based interface to select seed genes in any area of a Venn diagram and then perform network propagation analysis to measure the influence of the selected seed genes in terms of a ranked list of DEGs. CONCLUSIONS: We suggest that our system can logically guide the selection of seed genes without additional prior knowledge, which frees users from the seed selection problem of network propagation. We showed that Venn-diaNet can reproduce the research findings reported in original papers that compare two, three, and eight experiments. Venn-diaNet is freely available at: http://biohealth.snu.ac.kr/software/venndianet.
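The notion of a Venn diagram segment used above can be sketched in a few lines: each segment holds the genes found in exactly one combination of experiments and in no others. This is an illustrative sketch, not Venn-diaNet's code; the function name, experiment names, and gene labels are hypothetical.

```python
from itertools import combinations


def venn_segments(deg_lists):
    """Map each combination of experiments to the genes found in exactly those experiments.

    deg_lists: {experiment name: iterable of DEG identifiers}.
    Empty segments are left out of the result.
    """
    names = sorted(deg_lists)
    segments = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Genes shared by every experiment in the combination...
            inside = set.intersection(*(set(deg_lists[n]) for n in combo))
            # ...minus genes that also appear in any other experiment.
            outside = set().union(*(set(deg_lists[n]) for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                segments[combo] = exclusive
    return segments


# Hypothetical DEG lists from three experiments.
deg_lists = {
    "exp1": ["a", "b", "c"],
    "exp2": ["b", "c", "d"],
    "exp3": ["c", "e"],
}
segments = venn_segments(deg_lists)
```

Genes from a chosen segment, for example `segments[("exp1", "exp2")]`, could then serve as seed genes for network propagation on a protein-protein interaction network.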


Subject(s)
Gene Regulatory Networks , Software , Animals , Gene Expression Profiling , Gene Ontology , Internet , Mice, Knockout , Protein Interaction Maps , Transcriptome , User-Computer Interface
9.
BMC Genomics ; 20(Suppl 11): 949, 2019 Dec 20.
Article in English | MEDLINE | ID: mdl-31856731

ABSTRACT

BACKGROUND: Recently, a number of studies have been conducted to investigate how plants respond to stress at the cellular molecular level by measuring gene expression profiles over time. As a result, sets of time-series gene expression data for the stress response are available in databases. With these data, an integrated analysis of multiple stresses is possible, which identifies stress-responsive genes with higher specificity because considering multiple stresses can capture the effect of interference between stresses. To analyze such data, a machine learning model needs to be built. RESULTS: In this study, we developed StressGenePred, a neural network-based machine learning method, to integrate time-series transcriptome data of multiple stress types. StressGenePred is designed to detect single stress-specific biomarker genes by using a simple feature embedding method, a twin neural network model, and Confident Multiple Choice Learning (CMCL) loss. The twin neural network model consists of a biomarker gene discovery model and a stress type prediction model that share the same logical layer to reduce training complexity. The CMCL loss is used to make the twin model select biomarker genes that respond specifically to a single stress. In experiments using Arabidopsis gene expression data for four major environmental stresses (heat, cold, salt, and drought), StressGenePred classified the types of stress more accurately than the limma feature embedding method and the support vector machine and random forest classification methods. In addition, StressGenePred discovered known stress-related genes with higher specificity than the Fisher method. CONCLUSIONS: StressGenePred is a machine learning method for identifying stress-related genes and predicting stress types in an integrated analysis of time-series transcriptome data from multiple stresses. This method can be applied to other phenotype-gene association studies.


Subject(s)
Arabidopsis/genetics , Genes, Plant/genetics , Models, Biological , Neural Networks, Computer , Stress, Physiological/genetics , Computational Biology , Gene Expression Profiling , Genetic Association Studies , Machine Learning , Phenotype , Transcriptome
10.
Methods ; 145: 10-15, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29758273

ABSTRACT

Determining the functions of a gene requires time-consuming, expensive biological experiments. Scientists can speed up this experimental process if literature information and biological networks are adequately provided. In this paper, we present a web-based information system that can perform in silico experiments that computationally test hypotheses on the function of a gene. A hypothesis specified in English by the user is converted to genes using a literature and knowledge mining system called BEST. Condition-specific TF, miRNA, and PPI (protein-protein interaction) networks are automatically generated by projecting gene and miRNA expression data onto template networks. Then, an in silico experiment tests how well the target genes are connected from the knockout gene through the condition-specific networks. The test result visualizes paths from the knockout gene to the target genes in the three networks. Statistical and information-theoretic scores are provided on the resulting web page to help scientists either accept or reject the hypothesis being tested. Our web-based system was extensively tested using three knockout data sets: E2f1, Lrrk2, and Dicer1. We were able to reproduce the gene functions reported in the original research papers. In addition, we comprehensively tested the system with all disease names in MalaCards as hypotheses to show its effectiveness. Our in silico experiment system can be very useful in suggesting biological mechanisms that can be further tested in vivo or in vitro. AVAILABILITY: http://biohealth.snu.ac.kr/software/insilico/.


Subject(s)
Computational Biology , Computer Simulation , Gene Regulatory Networks , Animals , Mice , MicroRNAs/metabolism , Protein Interaction Maps , Transcription Factors/metabolism
11.
Methods ; 124: 13-24, 2017 07 15.
Article in English | MEDLINE | ID: mdl-28579402

ABSTRACT

Pathway-based analysis of high-throughput transcriptome data is a widely used approach to investigate biological mechanisms. Since a pathway consists of multiple functions, the recent approach is to determine condition-specific sub-pathways, or subpaths. However, there are several challenges. First, few existing methods utilize explicit gene expression information from RNA-seq. More importantly, subpath activity is usually an average of statistical scores, e.g., correlations, of edges in a candidate subpath, which fails to reflect gene expression quantity information. In addition, none of the existing methods can handle multiple phenotypes. To address these technical problems, we designed and implemented an algorithm, MIDAS, that determines condition-specific subpaths, each of which has different activities across multiple phenotypes. MIDAS fully utilizes gene expression quantity information and network centrality information to determine condition-specific subpaths. To test the performance of our tool, we used TCGA breast cancer RNA-seq gene expression profiles with five molecular subtypes. Thirty-six differentially activated subpaths were determined. The utility of our method was demonstrated in four ways. All 36 subpaths are well supported by the literature. Subsequently, we showed that these subpaths had good discriminant power for classifying the five cancer subtypes and also had prognostic power in terms of survival analysis. Finally, in a performance comparison of MIDAS with a recent subpath prediction method, PATHOME, our method identified more subpaths and many more genes that are well supported by the literature. AVAILABILITY: http://biohealth.snu.ac.kr/software/MIDAS/.


Subject(s)
Algorithms , Breast Neoplasms/genetics , Data Mining/statistics & numerical data , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , RNA, Neoplasm/genetics , Breast Neoplasms/classification , Breast Neoplasms/metabolism , Breast Neoplasms/mortality , Data Mining/methods , Female , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Humans , RNA, Neoplasm/metabolism , Sequence Analysis, RNA , Signal Transduction , Software , Survival Analysis , Transcriptome
12.
Methods ; 111: 64-71, 2016 12 01.
Article in English | MEDLINE | ID: mdl-27477210

ABSTRACT

Measuring gene expression, DNA sequence variation, and DNA methylation status is routinely done using high-throughput sequencing technologies. To analyze such multi-omics data and explore relationships, reliable bioinformatics systems are much needed. Existing systems are either for exploring curated data or for processing omics data in the form of a library such as R. Thus, scientists have much difficulty in investigating relationships among gene expression, DNA sequence variation, and DNA methylation using multi-omics data. In this study, we report a system called BioVLAB-mCpG-SNP-EXPRESS for the integrated analysis of DNA methylation, sequence variation (SNPs), and gene expression for distinguishing cellular phenotypes at the pairwise and multiple phenotype levels. The system can be deployed on either the Amazon cloud or a publicly available high-performance computing node, and the data analysis and exploration of the analysis result can be conveniently done using a web-based interface. To alleviate analysis complexity, all processes are fully automated, and a graphical workflow system is integrated to show analysis progress in real time. The BioVLAB-mCpG-SNP-EXPRESS system works in three stages. First, it processes and analyzes multi-omics data provided as raw data, i.e., FastQ files. Second, various integrated analyses such as methylation vs. gene expression and mutation vs. methylation are performed. Finally, the analysis result can be explored in a number of ways through a web interface for multi-level, multi-perspective exploration. Multi-level interpretation can be done at the gene, gene set, pathway, or network level, and multi-perspective exploration can be done from the gene expression, DNA methylation, sequence variation, or relationship perspective. The utility of the system is demonstrated by analyzing a data set of 30 phenotypically distinct breast cancer cell lines. BioVLAB-mCpG-SNP-EXPRESS is available at http://biohealth.snu.ac.kr/software/biovlab_mcpg_snp_express/.


Subject(s)
Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software , DNA Methylation/genetics , Databases, Genetic , Genetic Variation , Humans , Polymorphism, Single Nucleotide/genetics
13.
Article in English | MEDLINE | ID: mdl-38241108

ABSTRACT

Knowledge of the unintended effects of drugs is critical in assessing the risk of treatment and in drug repurposing. Although numerous existing studies predict the presence of drug side effects, only four of them predict the frequency of the side effects. Unfortunately, current prediction methods (1) do not utilize drug targets, (2) do not predict well for unseen drugs, and (3) do not use multiple heterogeneous drug features. We propose a novel deep learning-based drug-side effect frequency prediction model. Our model utilizes heterogeneous features such as target protein information as well as molecular graphs, fingerprints, and chemical similarity to create drug embeddings simultaneously. Furthermore, the model represents drugs and side effects in a common vector space, learning dual representation vectors for drugs and side effects, respectively. We also extended the predictive power of our model to compensate for drugs without clear target proteins using the AdaBoost method. We achieved state-of-the-art performance over the existing methods in predicting side effect frequencies, especially for unseen drugs. Ablation studies show that our model effectively combines and utilizes heterogeneous features of drugs. Moreover, we observed that, when target information is given, drugs with explicit targets yielded better predictions than drugs without explicit targets. The implementation is available at https://github.com/eskendrian/sider.

14.
Comput Struct Biotechnol J ; 23: 1715-1724, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38689720

ABSTRACT

Multi-gene assays have been widely used to predict the recurrence risk for hormone receptor (HR)-positive breast cancer patients. However, these assays lack explanatory power regarding the underlying mechanisms of the recurrence risk. To address this limitation, we proposed a novel multi-layered knowledge graph neural network for the multi-gene assays. Our model elucidated the regulatory pathways of assay genes and utilized an attention-based graph neural network to predict recurrence risk while interpreting transcriptional subpathways relevant to risk prediction. Evaluation on three multi-gene assays (Oncotype DX, Prosigna, and EndoPredict) using the SCAN-B dataset demonstrated the efficacy of our method. Through interpretation of attention weights, we found that all three assays are mainly regulated by signaling pathways driving cancer proliferation, especially RTK-ERK-ETS-mediated cell proliferation, in breast cancer recurrence. In addition, our analysis highlighted that the important regulatory subpathways remain consistent across the different knowledgebases used for constructing the multi-level knowledge graph. Furthermore, through attention analysis, we demonstrated the biological significance and clinical relevance of these subpathways in predicting patient outcomes. The source code is available at http://biohealth.snu.ac.kr/software/ExplainableMLKGNN.

15.
Nat Commun ; 14(1): 3570, 2023 06 15.
Article in English | MEDLINE | ID: mdl-37322032

ABSTRACT

Computational drug repurposing aims to identify new indications for existing drugs by utilizing high-throughput data, often in the form of biomedical knowledge graphs. However, learning on biomedical knowledge graphs can be challenging because genes dominate the graph while drug and disease entities are few, resulting in less effective representations. To overcome this challenge, we propose a "semantic multi-layer guilt-by-association" approach that leverages the principle of guilt-by-association ("similar genes share similar functions") at the drug-gene-disease level. Using this approach, our model, DREAMwalk (Drug Repurposing through Exploring Associations using Multi-layer random walk), uses a semantic information-guided random walk to generate drug- and disease-populated node sequences, allowing for effective mapping of both drugs and diseases in a unified embedding space. Compared to state-of-the-art link prediction models, our approach improves drug-disease association prediction accuracy by up to 16.8%. Moreover, exploration of the embedding space reveals a well-aligned harmony between biological and semantic contexts. We demonstrate the effectiveness of our approach through repurposing case studies for breast carcinoma and Alzheimer's disease, highlighting the potential of a multi-layer guilt-by-association perspective for drug repurposing on biomedical knowledge graphs.


Subject(s)
Drug Repositioning , Pattern Recognition, Automated , Learning
16.
Comput Struct Biotechnol J ; 21: 4187-4195, 2023.
Article in English | MEDLINE | ID: mdl-37680266

ABSTRACT

Motivation: Lead identification is a fundamental step in prioritizing candidate compounds for the downstream drug discovery process. Machine learning (ML) and deep learning (DL) approaches are widely used to identify lead compounds using both chemical property and experimental information. However, ML and DL methods rarely consider compound similarity information directly, since ML and DL models use abstract representations of molecules for model construction. Alternatively, data mining approaches are also used to explore chemical space for drug candidates by screening out undesirable compounds. A major challenge for data mining approaches is to develop efficient methods that search a large chemical space for desirable lead compounds with a low false positive rate. Results: In this work, we developed a network propagation (NP)-based data mining method for lead identification that performs a search on an ensemble of chemical similarity networks. We compiled 14 fingerprint-based similarity networks. Given a target protein of interest, we use a deep learning-based drug target interaction model to narrow down compound candidates, and then we use network propagation to prioritize drug candidates that are highly correlated with a drug activity score such as IC50. In an extensive experiment with BindingDB, we showed that our approach successfully discovered intentionally unlabeled compounds for given targets. To further demonstrate the prediction power of our approach, we identified 24 candidate leads for CLK1. Two out of five synthesizable candidates were experimentally validated in binding assays. In conclusion, our framework can be very useful for lead identification from very large compound databases such as ZINC.
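A fingerprint-based similarity network of the kind described above can be sketched as follows, assuming each compound is represented by the set of its "on" fingerprint bits and that compounds are linked when their Tanimoto coefficient meets a threshold. The abstract does not state which similarity measure or cutoff the authors used, so the measure, the threshold, and the compound names here are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def similarity_network(fingerprints, threshold=0.5):
    """Link every pair of compounds whose Tanimoto coefficient is >= threshold.

    fingerprints: {compound name: set of bit positions}.
    Returns an undirected graph as {compound: set of neighbor compounds}.
    """
    edges = {name: set() for name in fingerprints}
    names = sorted(fingerprints)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
                edges[a].add(b)
                edges[b].add(a)
    return edges


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
fingerprints = {
    "cmpd1": {1, 2, 3, 4},
    "cmpd2": {2, 3, 4, 5},
    "cmpd3": {7, 8},
}
network = similarity_network(fingerprints)
```

Building one such network per fingerprint type gives an ensemble over which activity scores from known compounds can be propagated to rank unlabeled candidates.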

17.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2356-2364, 2022.
Article in English | MEDLINE | ID: mdl-33750713

ABSTRACT

MOTIVATION: Identifying differentially expressed genes (DEGs) in transcriptome data is a very important task. However, the performance of existing DEG methods varies significantly for data sets measured under different conditions, and no single statistical or machine learning model for DEG detection performs consistently well for data sets of different traits. In addition, setting a cutoff value for the significance of differential expression is one of the confounding factors in determining DEGs. RESULTS: We address these problems by developing an ensemble model that refines the heterogeneous and inconsistent results of the existing methods by taking into account network information such as network propagation and network properties. DEG candidates that are predicted with weak evidence by the existing tools are re-classified by our proposed ensemble model for the transcriptome data. Tested on 10 RNA-seq datasets downloaded from the Gene Expression Omnibus (GEO), our method ranked first in detecting ground truth (GT) genes in eight datasets and found almost all GT genes in six datasets. In contrast, the performance of all existing methods varied significantly across the 10 data sets. Because of its design principle, our method can naturally accommodate any new DEG method. AVAILABILITY: The source code of our method is available at https://github.com/jihmoon/MLDEG.


Subject(s)
Gene Expression Profiling , Software , Gene Expression Profiling/methods , Machine Learning , Transcriptome
18.
Comput Struct Biotechnol J ; 20: 4288-4304, 2022.
Article in English | MEDLINE | ID: mdl-36051875

ABSTRACT

A large number of chemical compounds are available in databases such as PubChem and ZINC. However, the currently known compounds, though large in number, represent only a fraction of all possible compounds, known as chemical space. Many of the compounds in these databases are annotated with properties and assay data that can be used in drug discovery efforts. To this end, a number of machine learning algorithms have been developed, and recent deep learning technologies can be used effectively to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting the annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications in drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, so we survey what kinds of information, such as assay and gene expression data, can be used to improve the predictive power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into the computational analysis of chemical information.

19.
Cancers (Basel) ; 14(17)2022 Aug 25.
Article in English | MEDLINE | ID: mdl-36077657

ABSTRACT

Patient stratification is a clinically important task because it allows us to establish and develop efficient treatment strategies for particular groups of patients. Molecular subtypes have been successfully defined using transcriptomic profiles, and they are used effectively in clinical practice, e.g., PAM50 subtypes of breast cancer. Survival prediction has contributed to understanding diseases and to identifying genes related to prognosis. It is desirable to stratify patients considering these two aspects simultaneously. However, there are no methods for patient stratification that consider molecular subtypes and survival outcomes at once. Here, we propose a methodology to address this problem. A genetic algorithm is used to select a gene set from transcriptome data, and the expression levels of these genes are used to assign a risk score to each patient. The patients are ordered and stratified according to the score. A gene set was selected by our method on a breast cancer cohort (TCGA-BRCA), and we examined its clinical utility using an independent cohort (SCAN-B). In this experiment, our method was successful in stratifying patients with respect to both molecular subtype and survival outcome. We demonstrated that the ordering of patients was consistent across repeated experiments, and prognostic genes were successfully nominated. Additionally, it was observed that the risk score can be used to evaluate the molecular aggressiveness of individual patients.
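
The scoring and ordering step described above can be sketched as follows. The abstract does not give the scoring formula, so the mean expression over the selected genes is an assumption made purely for illustration, and the genetic algorithm that selects the gene set is not shown. The names `risk_scores` and `stratify`, and all patient and gene identifiers, are hypothetical.

```python
def risk_scores(expression, gene_set):
    """Assign each patient a score from the selected genes.

    Assumed rule (not from the paper): mean expression over the gene set.
    expression: dict mapping patient id -> {gene: expression value}.
    """
    return {
        pid: sum(profile[g] for g in gene_set) / len(gene_set)
        for pid, profile in expression.items()
    }


def stratify(scores, n_groups):
    """Order patients by score and cut the ordering into equal-sized groups."""
    ordered = sorted(scores, key=scores.get)
    groups = [[] for _ in range(n_groups)]
    for rank, pid in enumerate(ordered):
        groups[rank * n_groups // len(ordered)].append(pid)
    return groups


# Toy expression table (hypothetical values).
expression = {
    "P1": {"G1": 1.0, "G2": 2.0, "G3": 9.0},
    "P2": {"G1": 5.0, "G2": 6.0, "G3": 1.0},
    "P3": {"G1": 3.0, "G2": 3.0, "G3": 4.0},
    "P4": {"G1": 8.0, "G2": 7.0, "G3": 2.0},
    "P5": {"G1": 2.0, "G2": 2.0, "G3": 7.0},
    "P6": {"G1": 6.0, "G2": 7.0, "G3": 3.0},
}
gene_set = ["G1", "G2"]  # stands in for a gene set chosen by the genetic algorithm

scores = risk_scores(expression, gene_set)
groups = stratify(scores, n_groups=3)
print(groups)
```

In a genetic-algorithm setting, a function like `stratify` would typically sit inside the fitness evaluation, with each candidate gene set judged by how well the resulting groups separate subtypes and survival outcomes; that fitness criterion is not described in the abstract.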

20.
Sci Rep ; 11(1): 9543, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33953216

ABSTRACT

GPCR proteins belong to diverse families of proteins that are defined at multiple hierarchical levels. Inspecting relationships between GPCR proteins on the hierarchical structure is important, since characteristics of a protein can be inferred from proteins with similar hierarchical information. However, modeling of GPCR families has been performed separately for each of the family, subfamily, and sub-subfamily levels. Relationships between GPCR proteins are ignored in these approaches, as they process the information in the proteins with several disconnected models. In this study, we propose DeepHier, a deep learning model that simultaneously learns representations of the GPCR family hierarchy from protein sequences with a single unified model. A novel loss term based on metric learning is introduced to incorporate hierarchical relations between proteins. We tested our approach using a public GPCR sequence dataset. Metric distances in the deep feature space corresponded to the hierarchical family relations between GPCR proteins. Furthermore, we demonstrated that downstream tasks, such as phylogenetic reconstruction and motif discovery, are feasible in the constructed embedding space. These results show that hierarchical relations between sequences were successfully captured in both technical and biological aspects.


Subject(s)
Receptors, G-Protein-Coupled/chemistry , Amino Acid Sequence , Animals , Deep Learning , Humans , Models, Molecular , Neural Networks, Computer , Protein Conformation , Sequence Analysis, Protein