1.
Comput Struct Biotechnol J ; 23: 1715-1724, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38689720

ABSTRACT

Multi-gene assays have been widely used to predict the recurrence risk for hormone receptor (HR)-positive breast cancer patients. However, these assays lack explanatory power regarding the underlying mechanisms of the recurrence risk. To address this limitation, we proposed a novel multi-layered knowledge graph neural network for the multi-gene assays. Our model elucidated the regulatory pathways of assay genes and utilized an attention-based graph neural network to predict recurrence risk while interpreting transcriptional subpathways relevant to risk prediction. Evaluation on three multi-gene assays (Oncotype DX, Prosigna, and EndoPredict) using the SCAN-B dataset demonstrated the efficacy of our method. Through interpretation of attention weights, we found that all three assays are mainly regulated by signaling pathways driving cancer proliferation, especially RTK-ERK-ETS-mediated cell proliferation, for breast cancer recurrence. In addition, our analysis highlighted that the important regulatory subpathways remain consistent across different knowledgebases used for constructing the multi-level knowledge graph. Furthermore, through attention analysis, we demonstrated the biological significance and clinical relevance of these subpathways in predicting patient outcomes. The source code is available at http://biohealth.snu.ac.kr/software/ExplainableMLKGNN.

2.
Article in English | MEDLINE | ID: mdl-38241108

ABSTRACT

Knowledge of unintended effects of drugs is critical in assessing the risk of treatment and in drug repurposing. Although numerous existing studies predict drug-side effect presence, only four of them predict the frequency of the side effects. Unfortunately, current prediction methods (1) do not utilize drug targets, (2) do not predict well for unseen drugs, and (3) do not use multiple heterogeneous drug features. We propose a novel deep learning-based drug-side effect frequency prediction model. Our model utilizes heterogeneous features such as target protein information as well as molecular graphs, fingerprints, and chemical similarity to create drug embeddings simultaneously. Furthermore, the model represents drugs and side effects in a common vector space, learning the dual representation vectors of drugs and side effects, respectively. We also extended the predictive power of our model to compensate for the drugs without clear target proteins using the Adaboost method. We achieved state-of-the-art performance over the existing methods in predicting side effect frequencies, especially for unseen drugs. Ablation studies show that our model effectively combines and utilizes heterogeneous features of drugs. Moreover, we observed that, when the target information is given, drugs with explicit targets resulted in better prediction than the drugs without explicit targets. The implementation is available at https://github.com/eskendrian/sider.

3.
Comput Struct Biotechnol J ; 21: 4187-4195, 2023.
Article in English | MEDLINE | ID: mdl-37680266

ABSTRACT

Motivation: Lead identification is a fundamental step to prioritize candidate compounds for the downstream drug discovery process. Machine learning (ML) and deep learning (DL) approaches are widely used to identify lead compounds using both chemical property and experimental information. However, ML or DL methods rarely consider compound similarity information directly since ML and DL models use abstract representations of molecules for model construction. Alternatively, data mining approaches are also used to explore chemical space with drug candidates by screening undesirable compounds. A major challenge for data mining approaches is to develop efficient methods that search a large chemical space for desirable lead compounds with a low false positive rate. Results: In this work, we developed a network propagation (NP) based data mining method for lead identification that performs search on an ensemble of chemical similarity networks. We compiled 14 fingerprint-based similarity networks. Given a target protein of interest, we use a deep learning-based drug target interaction model to narrow down compound candidates and then we use network propagation to prioritize drug candidates that are highly correlated with a drug activity score such as IC50. In an extensive experiment with BindingDB, we showed that our approach successfully discovered intentionally unlabeled compounds for given targets. To further demonstrate the prediction power of our approach, we identified 24 candidate leads for CLK1. Two out of five synthesizable candidates were experimentally validated in binding assays. In conclusion, our framework can be very useful for lead identification from very large compound databases such as ZINC.
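
The network propagation idea in the abstract above can be illustrated with a random walk with restart on a single similarity network. This is a minimal sketch and not the authors' implementation: the function name `network_propagation`, the toy four-compound graph, the edge weights, and the restart probability of 0.5 are all assumptions made for illustration, and the ensemble of 14 networks described in the abstract is not modeled.

```python
def network_propagation(adjacency, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected weighted graph.

    adjacency: dict node -> dict neighbor -> edge weight (must be symmetric).
    seeds: dict node -> initial score (for example, known active compounds).
    Returns a dict node -> propagated score.
    """
    nodes = sorted(adjacency)
    # Total edge weight per node, used to split a node's score among its neighbors.
    out_weight = {u: sum(adjacency[u].values()) for u in nodes}
    total_seed = sum(seeds.values())
    p0 = {u: seeds.get(u, 0.0) / total_seed for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: a fraction of the score always returns to the seeds.
        nxt = {u: restart * p0[u] for u in nodes}
        # Walk term: the remaining score flows along weighted edges.
        for u in nodes:
            if out_weight[u] == 0:
                continue
            for v, w in adjacency[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical chain of compounds A - B - C - D with equal similarity weights.
toy_graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = network_propagation(toy_graph, seeds={"A": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting the seed compound keeps the highest score and the score decays with distance from the seed, which is the behavior a propagation-based prioritization relies on.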

4.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37544660

ABSTRACT

Combination therapies have brought significant advancements to the treatment of various diseases in the medical field. However, searching for effective drug combinations remains a major challenge due to the vast number of possible combinations. Biomedical knowledge graph (KG)-based methods have shown potential in predicting effective combinations for a wide spectrum of diseases, but the lack of credible negative samples has limited the prediction performance of machine learning models. To address this issue, we propose a novel model-agnostic framework that leverages existing drug-drug interaction (DDI) data as a reliable negative dataset and employs supervised contrastive learning (SCL) to transform drug embedding vectors to be more suitable for drug combination prediction. We conducted extensive experiments using various network embedding algorithms, including random walk and graph neural networks, on a biomedical KG. Our framework significantly improved performance metrics compared to the baseline framework. We also provide embedding space visualizations and case studies that demonstrate the effectiveness of our approach. This work highlights the potential of using DDI data and SCL in finding tighter decision boundaries for predicting effective drug combinations.


Subjects
Algorithms; Pattern Recognition, Automated; Benchmarking; Drug Combinations; Drug Interactions
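
The supervised contrastive learning step mentioned in entry 4 above can be sketched with the standard supervised contrastive loss, in which embeddings sharing a label are pulled together and all others are pushed apart. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `supcon_loss`, the temperature of 0.1, the two-dimensional toy vectors, and the use of label 1 for effective combinations and label 0 for DDI-derived negatives are all hypothetical.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking each positive.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors


# Hypothetical embeddings: two tight groups in a 2-D space.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supcon_loss(vectors, labels=[1, 1, 0, 0])  # labels match the groups
loss_mixed = supcon_loss(vectors, labels=[1, 0, 1, 0])    # labels cut across the groups
```

The loss is small when samples with the same label already sit close together and large when they do not, which is why minimizing it tends to tighten class boundaries in the embedding space.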
5.
Nat Commun ; 14(1): 3570, 2023 06 15.
Article in English | MEDLINE | ID: mdl-37322032

ABSTRACT

Computational drug repurposing aims to identify new indications for existing drugs by utilizing high-throughput data, often in the form of biomedical knowledge graphs. However, learning on biomedical knowledge graphs can be challenging due to the dominance of genes and a small number of drug and disease entities, resulting in less effective representations. To overcome this challenge, we propose a "semantic multi-layer guilt-by-association" approach that leverages the principle of guilt-by-association ("similar genes share similar functions") at the drug-gene-disease level. Using this approach, our model, DREAMwalk (Drug Repurposing through Exploring Associations using Multi-layer random walk), uses a semantic information-guided random walk to generate drug and disease-populated node sequences, allowing for effective mapping of both drugs and diseases in a unified embedding space. Compared to state-of-the-art link prediction models, our approach improves drug-disease association prediction accuracy by up to 16.8%. Moreover, exploration of the embedding space reveals a well-aligned harmony between biological and semantic contexts. We demonstrate the effectiveness of our approach through repurposing case studies for breast carcinoma and Alzheimer's disease, highlighting the potential of the multi-layer guilt-by-association perspective for drug repurposing on biomedical knowledge graphs.


Subjects
Drug Repositioning; Pattern Recognition, Automated; Learning
6.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36752352

ABSTRACT

Drug response prediction (DRP) is important for precision medicine to predict how a patient would react to a drug before administration. Existing studies take the cell line transcriptome data and the chemical structure of drugs as input and predict drug response as IC50 or AUC values. Intuitively, use of drug target interaction (DTI) information can be useful for DRP. However, use of DTI is difficult because existing drug response databases such as CCLE and GDSC do not have information about the transcriptome after drug treatment. Although the transcriptome after drug treatment is not available, if we can compute the perturbation effects of the pharmacologic modulation of target genes, we can utilize the DTI information in CCLE and GDSC. In this study, we proposed a framework that can improve existing deep learning-based DRP models by effectively utilizing drug target information. Our framework includes NetGP, a module to compute gene perturbation scores by the network propagation technique on a network. NetGP produces a list of genes ranked by gene perturbation score, and the ranked genes are input to a multi-layer perceptron to generate a fixed-dimension vector for integration with existing DRP models. This integration is done in a model-agnostic way so that any existing DRP tool can be incorporated. As a result, our framework boosts the performance of existing DRP models in 64 of 72 comparisons. The performance gains are especially large for test scenarios with unseen drugs, with margins of up to 34% in Pearson's correlation coefficient.


Subjects
Databases, Pharmaceutical; Neural Networks, Computer; Humans; Precision Medicine/methods; Drug Delivery Systems; Transcriptome
7.
Comput Struct Biotechnol J ; 20: 4288-4304, 2022.
Article in English | MEDLINE | ID: mdl-36051875

ABSTRACT

A large number of chemical compounds are available in databases such as PubChem and ZINC. However, currently known compounds, though numerous, represent only a fraction of possible compounds, which is known as chemical space. Many of the compounds in the databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods are used to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, thus we survey what kinds of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.

8.
Cancers (Basel) ; 14(17)2022 Aug 25.
Article in English | MEDLINE | ID: mdl-36077657

ABSTRACT

Patient stratification is a clinically important task because it allows us to establish and develop efficient treatment strategies for particular groups of patients. Molecular subtypes have been successfully defined using transcriptomic profiles, and they are used effectively in clinical practice, e.g., PAM50 subtypes of breast cancer. Survival prediction contributed to understanding diseases and also identifying genes related to prognosis. It is desirable to stratify patients considering these two aspects simultaneously. However, there are no methods for patient stratification that consider molecular subtypes and survival outcomes at once. Here, we propose a methodology to deal with the problem. A genetic algorithm is used to select a gene set from transcriptome data, and their expression quantities are utilized to assign a risk score to each patient. The patients are ordered and stratified according to the score. A gene set was selected by our method on a breast cancer cohort (TCGA-BRCA), and we examined its clinical utility using an independent cohort (SCAN-B). In this experiment, our method was successful in stratifying patients with respect to both molecular subtype and survival outcome. We demonstrated that the orders of patients were consistent across repeated experiments, and prognostic genes were successfully nominated. Additionally, it was observed that the risk score can be used to evaluate the molecular aggressiveness of individual patients.

9.
BMC Bioinformatics ; 23(Suppl 3): 149, 2022 Apr 25.
Article in English | MEDLINE | ID: mdl-35468739

ABSTRACT

BACKGROUND: The widely spreading coronavirus disease (COVID-19) has three major spreading properties: pathogenic mutations, spatial, and temporal propagation patterns. We know the spread of the virus geographically and temporally in terms of statistics, i.e., the number of patients. However, we are yet to understand the spread at the level of individual patients. As of March 2021, COVID-19 is widespread all over the world with new genetic variants. One important question is how to track the early spreading patterns of COVID-19 until the virus had spread all over the world. RESULTS: In this work, we proposed AutoCoV, a deep learning method with multiple loss objectives that can track the early spread of COVID-19 in terms of spatial and temporal patterns until the disease was fully spread over the world in July 2020. Performance in learning spatial or temporal patterns was measured with two clustering measures and one classification measure. For annotated SARS-CoV-2 sequences from the National Center for Biotechnology Information (NCBI), AutoCoV outperformed seven baseline methods in our experiments for learning either spatial or temporal patterns. For spatial patterns, AutoCoV had at least 1.7-fold higher clustering performance and an F1 score of 88.1%. For temporal patterns, AutoCoV had at least 1.6-fold higher clustering performance and an F1 score of 76.1%. Furthermore, AutoCoV demonstrated the robustness of the embedding space with an independent dataset, Global Initiative for Sharing All Influenza Data (GISAID). CONCLUSIONS: In summary, AutoCoV learns geographic and temporal spreading patterns successfully in experiments on NCBI and GISAID datasets and is the first of its kind that learns virus spreading patterns from the genome sequences, to the best of our knowledge. We expect that this type of embedding method will be helpful in characterizing fast-evolving pandemics.


Subjects
COVID-19; Deep Learning; COVID-19/epidemiology; Genome; Humans; Pandemics; SARS-CoV-2
11.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2356-2364, 2022.
Article in English | MEDLINE | ID: mdl-33750713

ABSTRACT

MOTIVATION: Identifying differentially expressed genes (DEGs) in transcriptome data is a very important task. However, performances of existing DEG methods vary significantly for data sets measured in different conditions, and no single statistical or machine learning model for DEG detection performs consistently well for data sets of different traits. In addition, setting a cutoff value for the significance of differential expression is one of the confounding factors in determining DEGs. RESULTS: We address these problems by developing an ensemble model that refines the heterogeneous and inconsistent results of the existing methods by taking into account network information such as network propagation and network property. DEG candidates that are predicted with weak evidence by the existing tools are re-classified by our proposed ensemble model for the transcriptome data. Tested on 10 RNA-seq datasets downloaded from Gene Expression Omnibus (GEO), our method showed excellent performance, ranking first in detecting ground truth (GT) genes in eight datasets and finding almost all GT genes in six datasets. On the other hand, performances of all existing methods varied significantly for the 10 data sets. Because of the design principle, our method can accommodate any new DEG methods naturally. AVAILABILITY: The source code of our method is available at https://github.com/jihmoon/MLDEG.


Subjects
Gene Expression Profiling; Software; Gene Expression Profiling/methods; Machine Learning; Transcriptome
12.
Sci Rep ; 11(1): 23992, 2021 12 14.
Article in English | MEDLINE | ID: mdl-34907266

ABSTRACT

Cervical lymph node metastasis is the leading cause of poor prognosis in oral tongue squamous cell carcinoma and also occurs in the early stages. The current clinical diagnosis depends on a physical examination that is not enough to determine whether micrometastasis remains. The transcriptome profiling technique has shown great potential for predicting micrometastasis by capturing the dynamic activation state of genes. However, there are several technical challenges in using transcriptome data to model patient conditions: (1) an insufficient number of samples compared to the number of genes, (2) complex dependence between genes that govern the cancer phenotype, and (3) heterogeneity among patients across cohorts that differ geographically and racially. We developed a computational framework to learn the subnetwork representation of the transcriptome to discover network biomarkers and determine the potential of metastasis in early oral tongue squamous cell carcinoma. Our method achieved high accuracy in predicting the potential of metastasis in two geographically and racially different groups of patients. The robustness of the model and the reproducibility of the discovered network biomarkers show great potential as a tool to diagnose lymph node metastasis in early oral cancer.


Subjects
Biomarkers, Tumor/biosynthesis; Carcinoma, Squamous Cell/metabolism; Databases, Nucleic Acid; Gene Expression Regulation, Neoplastic; Models, Biological; Mouth Neoplasms/metabolism; Transcriptome; Adult; Aged; Carcinoma, Squamous Cell/pathology; Female; Humans; Lymphatic Metastasis; Male; Middle Aged; Mouth Neoplasms/pathology
13.
Front Genet ; 12: 652623, 2021.
Article in English | MEDLINE | ID: mdl-34093651

ABSTRACT

A gene expression profile, or transcriptome, can represent cellular states, thus understanding gene regulation mechanisms can help understand how cells respond to external stress. Interaction between transcription factor (TF) and target gene (TG) is one of the representative regulatory mechanisms in cells. In this paper, we present a novel computational method to construct condition-specific transcriptional networks from transcriptome data. Regulatory interaction between TFs and TGs is very complex, specifically involving multiple-to-multiple relations. Experimental data from TF Chromatin Immunoprecipitation sequencing is useful but produces one-to-multiple relations between TF and TGs. On the other hand, co-expression networks of genes can be useful for constructing condition-specific transcriptional networks, but there are many false positive relations in co-expression networks. In this paper, we propose a novel method to construct a condition-specific and combinatorial transcriptional network, applying kernel canonical correlation analysis (kernel CCA) to identify multiple-to-multiple TF-TG relations in a given biological condition. Kernel CCA is a well-established statistical method for computing the correlation of a group of features vs. another group of features. We, therefore, employed kernel CCA to embed TFs and TGs into a new space where the correlation of TFs and TGs is reflected. To demonstrate the usefulness of our network construction method, we used human blood transcriptome data to investigate the response to a high-fat diet and an Arabidopsis data set to investigate the response to cold/heat stress. Our method detected not only important regulatory interactions reported in previous studies but also novel TF-TG relations where a module of TFs regulates a module of TGs upon specific stress.

14.
Sci Rep ; 11(1): 9543, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33953216

ABSTRACT

GPCR proteins belong to diverse families of proteins that are defined at multiple hierarchical levels. Inspecting relationships between GPCR proteins on the hierarchical structure is important, since characteristics of a protein can be inferred from proteins with similar hierarchical information. However, modeling of GPCR families has been performed separately for each of the family, subfamily, and sub-subfamily levels. Relationships between GPCR proteins are ignored in these approaches as they process the information in the proteins with several disconnected models. In this study, we propose DeepHier, a deep learning model to simultaneously learn representations of GPCR family hierarchy from the protein sequences with a unified single model. A novel loss term based on metric learning is introduced to incorporate hierarchical relations between proteins. We tested our approach using a public GPCR sequence dataset. Metric distances in the deep feature space corresponded to the hierarchical family relation between GPCR proteins. Furthermore, we demonstrated that further downstream tasks, like phylogenetic reconstruction and motif discovery, are feasible in the constructed embedding space. These results show that hierarchical relations between sequences were successfully captured in both technical and biological aspects.


Subjects
Receptors, G-Protein-Coupled/chemistry; Amino Acid Sequence; Animals; Deep Learning; Humans; Models, Molecular; Neural Networks, Computer; Protein Conformation; Sequence Analysis, Protein
15.
Proc Natl Acad Sci U S A ; 118(11)2021 03 16.
Article in English | MEDLINE | ID: mdl-33836591

ABSTRACT

White adipose tissue (WAT) is a key regulator of systemic energy metabolism, and impaired WAT plasticity characterized by enlargement of preexisting adipocytes associates with WAT dysfunction, obesity, and metabolic complications. However, the mechanisms that retain proper adipose tissue plasticity required for metabolic fitness are unclear. Here, we comprehensively showed that adipocyte-specific DNA methylation, manifested in enhancers and CTCF sites, directs distal enhancer-mediated transcriptomic features required to conserve metabolic functions of white adipocytes. Particularly, genetic ablation of adipocyte Dnmt1, the major methylation writer, led to increased adiposity characterized by increased adipocyte hypertrophy along with reduced expansion of adipocyte precursors (APs). These effects of Dnmt1 deficiency provoked systemic hyperlipidemia and impaired energy metabolism both in lean and obese mice. Mechanistically, Dnmt1 deficiency abrogated mitochondrial bioenergetics by inhibiting mitochondrial fission and promoted aberrant lipid metabolism in adipocytes, rendering adipocyte hypertrophy and WAT dysfunction. Dnmt1-dependent DNA methylation prevented aberrant CTCF binding and, in turn, sustained the proper chromosome architecture to permit interactions between enhancer and dynamin-1-like protein gene Dnm1l (Drp1) in adipocytes. Also, adipose DNMT1 expression inversely correlated with adiposity and markers of metabolic health but positively correlated with AP-specific markers in obese human subjects. Thus, these findings support strategies utilizing Dnmt1 action on mitochondrial bioenergetics in adipocytes to combat obesity and related metabolic pathology.


Subjects
Adipocytes/metabolism; DNA (Cytosine-5-)-Methyltransferase 1/metabolism; Epigenesis, Genetic; Mitochondrial Dynamics; Adipocytes/pathology; Adipose Tissue/metabolism; Adipose Tissue/pathology; Adiposity; Animals; CCCTC-Binding Factor/metabolism; Chromosome Structures; DNA (Cytosine-5-)-Methyltransferase 1/deficiency; DNA (Cytosine-5-)-Methyltransferase 1/genetics; DNA Methylation; Dynamins/genetics; Dynamins/metabolism; Energy Metabolism; Enhancer Elements, Genetic; Gene Expression Profiling; Lipid Metabolism; Mice; Mitochondria/metabolism; Obesity/metabolism; Obesity/pathology; Promoter Regions, Genetic; Protein Binding
16.
IEEE/ACM Trans Comput Biol Bioinform ; 18(3): 1174-1183, 2021.
Article in English | MEDLINE | ID: mdl-31494555

ABSTRACT

MOTIVATION: Existing k-mer based string kernel methods have been successfully used for sequence comparison. However, existing kernel methods have limitations for comparative and evolutionary comparisons of genomes due to their sensitivity to over-represented k-mers and variable sequence lengths. RESULTS: In this study, we propose a novel ranked k-spectrum string (RKSS) kernel. (1) The RKSS kernel utilizes common k-mer sets across species, named landmarks, that can be used for comparing multiple genomes. (2) Based on the landmarks, we can use ranks of k-mers, rather than frequencies, which can produce more robust distances between genomes. To show the power of the RKSS kernel, we conducted two experiments using 10 mammalian species with exon, intron, and CpG island sequences. The RKSS kernel reconstructed more consistent evolutionary trees than the k-spectrum string kernel. In the subsequent experiment, for each sequence, kernel distance was calculated from 30 landmarks representing exon, intron, and CpG island sequences of 10 genomes. Based on kernel distances, concordance tests were performed and the result suggested that more information is conserved in CpG islands across species than in introns. In conclusion, our analysis suggests the relational order exon > CpG island > intron in terms of evolutionary information content.


Subjects
Algorithms; CpG Islands/genetics; Exons/genetics; Genomics/methods; Introns/genetics; Animals; Evolution, Molecular; Humans; Phylogeny; Sequence Analysis, DNA/methods
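
The idea in entry 16 of comparing sequences by k-mer ranks over a shared set of k-mers, rather than by raw frequencies, can be sketched as follows. This is a simplified illustration and not the RKSS kernel as defined by the authors: the helper names, the choice of k = 2, the use of k-mers shared by just two sequences as stand-in "landmarks", the alphabetical tie-breaking, and the sum of absolute rank differences as a distance are all assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def landmark_ranks(counts, landmarks):
    """Rank landmarks by frequency (1 = most frequent); ties break alphabetically."""
    ordered = sorted(landmarks, key=lambda kmer: (-counts[kmer], kmer))
    return {kmer: rank for rank, kmer in enumerate(ordered, start=1)}


def rank_distance(seq_a, seq_b, k=2):
    """Sum of absolute rank differences over k-mers present in both sequences."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    landmarks = sorted(set(counts_a) & set(counts_b))
    ranks_a = landmark_ranks(counts_a, landmarks)
    ranks_b = landmark_ranks(counts_b, landmarks)
    return sum(abs(ranks_a[m] - ranks_b[m]) for m in landmarks)


# Hypothetical toy sequences.
base = "ACGTACGT"
padded = base + "A" * 10      # same content plus one heavily over-represented k-mer
shuffled = "GTGTGTACAC"       # different k-mer composition

d_padded = rank_distance(base, padded)
d_shuffled = rank_distance(base, shuffled)
```

Because the over-represented "AA" k-mer in `padded` is not shared with `base`, it is excluded from the landmark set and does not inflate the distance, whereas a genuine change in composition does alter the ranks.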
17.
Front Genet ; 11: 564792, 2020.
Article in English | MEDLINE | ID: mdl-33281870

ABSTRACT

Pharmacogenomics is the study of how genes affect a person's response to drugs. Thus, understanding the effect of a drug at the molecular level can be helpful in both drug discovery and personalized medicine. Over the years, transcriptome data upon drug treatment has been collected, and several databases have compiled pre-treatment cancer cell multi-omics data with drug sensitivity (IC50, AUC) or time-series transcriptomic data after drug treatment. However, analyzing transcriptome data upon drug treatment is challenging since more than 20,000 genes interact in complex ways. In addition, due to the difficulty of both time-series analysis and multi-omics integration, current methods can hardly perform analysis of databases with different data characteristics. One effective way is to interpret transcriptome data in terms of well-characterized biological pathways. Another way is to leverage state-of-the-art methods for multi-omics data integration. In this paper, we developed Drug Response analysis Integrating Multi-omics and time-series data (DRIM), an integrative multi-omics and time-series data analysis framework that identifies perturbed sub-pathways and regulation mechanisms upon drug treatment. The system takes a drug name and cell line identification numbers, or the user's control/treatment time-series gene expression data, as input. Then, analysis of multi-omics data upon drug treatment is performed from two perspectives. For the multi-omics perspective analysis, IC50-related multi-omics potential mediator genes are determined by embedding multi-omics data into a gene-centric vector space using a tensor decomposition method and an autoencoder deep learning model. Then, perturbed pathway analysis of potential mediator genes is performed. For the time-series perspective analysis, time-varying perturbed sub-pathways upon drug treatment are constructed. Additionally, a network involving transcription factors (TFs), multi-omics potential mediator genes, and perturbed sub-pathways is constructed, and paths to perturbed pathways from TFs are determined by an influence maximization method. To demonstrate the utility of our system, we provide analysis results of sub-pathway regulatory mechanisms in breast cancer cell lines of different drug sensitivity. DRIM is available at: http://biohealth.snu.ac.kr/software/DRIM/.

18.
Front Genet ; 11: 869, 2020.
Article in English | MEDLINE | ID: mdl-33133123

ABSTRACT

Epigenetic gene regulation is a major control mechanism of gene expression. Most existing methods for modeling control mechanisms of gene expression use only a single epigenetic marker and very few methods are successful in modeling complex mechanisms of gene regulations using multiple epigenetic markers on transcriptional regulation. In this paper, we propose a multi-attention based deep learning model that integrates multiple markers to characterize complex gene regulation mechanisms. In experiments with 18 cell line multi-omics data, our proposed model predicted the gene expression level more accurately than the state-of-the-art model. Moreover, the model successfully revealed cell-type-specific gene expression control mechanisms. Finally, the model was used to identify genes enriched for specific cell types in terms of their functions and epigenetic regulation.

19.
Bioinformatics ; 36(12): 3818-3824, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32207514

ABSTRACT

MOTIVATION: Biological pathways are important curated knowledge of biological processes. Thus, cancer subtype classification based on pathways will be very useful to understand differences in biological mechanisms among cancer subtypes. However, pathways include only a fraction of the entire gene set (only one-third of human genes in KEGG), and pathways are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification. RESULTS: We present an explainable deep-learning model with attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention-based ensemble model combines several hundreds of pathways in an explainable manner. Lastly, network propagation on the pathway-gene network explains why gene expression profiles in subtypes are different. In experiments with five TCGA cancer datasets, our method achieved very good classification accuracies and, additionally, identified subtype-specific pathways and biological functions. AVAILABILITY AND IMPLEMENTATION: The source code is available at http://biohealth.snu.ac.kr/software/GCN_MAE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neoplasms; Software; Attention; Humans; Neoplasms/genetics; Transcriptome
20.
Brief Bioinform ; 21(1): 36-46, 2020 Jan 17.
Article in English | MEDLINE | ID: mdl-30462155

ABSTRACT

MOTIVATION: Biological pathways are extensively used for the analysis of transcriptome data to characterize biological mechanisms underlying various phenotypes. There are a number of computational tools that summarize transcriptome data at the pathway level. However, there is no comparative study on how well these tools produce useful information at the cohort level, enabling comparison of many samples or patients. RESULTS: In this study, we systematically compared and evaluated 13 different pathway activity inference tools based on 5 comparison criteria using a pan-cancer data set. This study has two major contributions. First, our study provides a comprehensive survey on computational techniques used by existing pathway activity inference tools. The tools use different strategies and assume different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations, and scoring metric. Second, we performed extensive evaluations on the performance of these tools. Because different tools use different methods to map samples to the pathway dimension, the tools are evaluated at the pathway level using five comparison criteria. Starting from measuring how well a tool maintains the characteristics of original gene expression values, robustness was also investigated by adding noise into gene expression data. Classification tasks on three clinical variables (tumor versus normal, survival, and cancer subtypes) were performed to evaluate the utility of tools for their clinical applications. In addition, the inferred activity values were compared between the tools to see how similar they are along with the scoring schemes they use.
