Results 1 - 20 of 64
1.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38426338

ABSTRACT

MOTIVATION: Retrosynthesis is a critical task in drug discovery, aimed at finding a viable pathway for synthesizing a given target molecule. Many existing approaches frame this task as a graph-generating problem. Specifically, these methods first identify the reaction center, and break the target molecule accordingly to generate the synthons. Reactants are generated by either adding atoms sequentially to synthon graphs or by directly adding appropriate leaving groups. However, both of these strategies have limitations. Adding atoms results in a long prediction sequence that increases the complexity of generation, while adding leaving groups only considers those in the training set, which leads to poor generalization. RESULTS: In this paper, we propose a novel end-to-end graph generation model for retrosynthesis prediction, which sequentially identifies the reaction center, generates the synthons, and adds motifs to the synthons to generate reactants. Given that chemically meaningful motifs fall between the size of atoms and leaving groups, our model achieves lower prediction complexity than adding atoms and better performance than adding leaving groups. We evaluate our proposed model on a benchmark dataset and show that it significantly outperforms previous state-of-the-art models. Furthermore, we conduct ablation studies to investigate the contribution of each component of our proposed model to the overall performance on benchmark datasets. Experimental results demonstrate the effectiveness of our model in predicting retrosynthesis pathways and suggest its potential as a valuable tool in drug discovery. AVAILABILITY AND IMPLEMENTATION: All code and data are available at https://github.com/szu-ljh2020/MARS.


Subject(s)
Benchmarking , Drug Discovery , Reading Frames
2.
IEEE J Biomed Health Inform ; 28(7): 4336-4347, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38551822

ABSTRACT

Binding affinity prediction of three-dimensional (3D) protein-ligand complexes is critical for drug repositioning and virtual drug screening. Existing approaches usually transform a 3D protein-ligand complex to a two-dimensional (2D) graph, and then use graph neural networks (GNNs) to predict its binding affinity. However, the node and edge features of the 2D graph are extracted based on invariant local coordinate systems of the 3D complex. As a result, these approaches cannot fully learn the global information of the complex, such as the physical symmetry and the topological information of bonds. To address these issues, we propose a novel Equivariant Line Graph Network (ELGN) for binding affinity prediction of 3D protein-ligand complexes. The proposed ELGN first adds a super node to the 3D complex, and then builds a line graph based on the 3D complex. After that, ELGN uses a new E(3)-equivariant network layer to pass the messages between nodes and edges based on the global coordinate system of the 3D complex. Experimental results on two real datasets demonstrate the effectiveness of ELGN over several state-of-the-art baselines.


Subject(s)
Neural Networks, Computer , Proteins , Ligands , Proteins/chemistry , Proteins/metabolism , Protein Binding , Computational Biology/methods , Algorithms
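
The line-graph construction mentioned in the abstract above can be illustrated with a minimal sketch. This is not the ELGN implementation: the function name `line_graph`, the atom labels, and the toy bond list are assumptions made for illustration, and the super node and 3D coordinates are omitted. It only shows the generic definition, in which every bond becomes a node and two such nodes are linked when the bonds share an atom.

```python
from itertools import combinations


def line_graph(edges):
    """Return (nodes, links) of the line graph of an undirected graph.

    Each original edge becomes a node; two such nodes are linked
    when the original edges share an endpoint.
    """
    nodes = [tuple(sorted(edge)) for edge in edges]
    links = [
        (e1, e2)
        for e1, e2 in combinations(nodes, 2)
        if set(e1) & set(e2)  # the two bonds have an atom in common
    ]
    return nodes, links


# Toy three-atom chain C1-C2-O1 (hypothetical labels, not real data).
bonds = [("C1", "C2"), ("C2", "O1")]
nodes, links = line_graph(bonds)
```

Here the two bonds share atom `C2`, so the line graph has two nodes and a single link between them.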
3.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38547401

ABSTRACT

MOTIVATION: Single-cell clustering plays a crucial role in distinguishing between cell types, facilitating the analysis of cell heterogeneity mechanisms. While many existing clustering methods rely solely on gene expression data obtained from single-cell RNA sequencing techniques to identify cell clusters, the information contained in mono-omic data is often limited, leading to suboptimal clustering performance. The emergence of single-cell multi-omics sequencing technologies enables the integration of multiple omics data for identifying cell clusters, but how to integrate different omics data effectively remains challenging. In addition, designing a clustering method that performs well across various types of multi-omics data poses a persistent challenge due to the data's inherent characteristics. RESULTS: In this paper, we propose a graph-regularized multi-view ensemble clustering (GRMEC-SC) model for single-cell clustering. Our proposed approach can adaptively integrate multiple omics data and leverage insights from multiple base clustering results. We extensively evaluate our method on five multi-omics datasets through a series of rigorous experiments. The results of these experiments demonstrate that our GRMEC-SC model achieves competitive performance across diverse multi-omics datasets with varying characteristics. AVAILABILITY AND IMPLEMENTATION: Implementation of GRMEC-SC, along with examples, can be found on the GitHub repository: https://github.com/polarisChen/GRMEC-SC.


Subject(s)
Machine Learning , Multiomics , Cluster Analysis , Single-Cell Analysis , Algorithms
4.
Nat Biotechnol ; 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38263515

ABSTRACT

Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to 19 other methods and reliability by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. MIDAS is available at https://github.com/labomics/midas.

5.
IEEE J Biomed Health Inform ; 27(12): 6121-6132, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37725723

ABSTRACT

Cell type identification is a crucial step towards the study of cellular heterogeneity and biological processes. Advances in single-cell sequencing technology have enabled the development of a variety of clustering methods for cell type identification. However, most existing methods are designed for clustering single-omic data such as single-cell RNA-sequencing (scRNA-seq) data. The accumulation of single-cell multi-omics data provides a great opportunity to integrate different omics data for cell clustering, but also raises new computational challenges for existing methods. How to integrate multi-omics data and leverage their consensus and complementary information to improve the accuracy of cell clustering remains a challenge. In this study, we propose a new deep multi-level information fusion framework, named scMIC, for clustering single-cell multi-omics data. Our model can integrate the attribute information of cells and the potential structural relationship among cells from local and global levels, and reduce redundant information between different omics from cell and feature levels, leading to more discriminative representations. Moreover, the proposed multiple collaborative supervised clustering strategy guides the learning process of the core encoding part by learning the high-confidence target distribution, which facilitates the interaction between the clustering part and the representation learning part, as well as the information exchange between omics, and ultimately yields more robust clustering results. Experiments on seven single-cell multi-omics datasets show the superiority of scMIC over existing state-of-the-art methods.


Subject(s)
Multiomics , Single-Cell Analysis , Humans , Cluster Analysis , Algorithms
6.
Comput Biol Med ; 159: 106936, 2023 06.
Article in English | MEDLINE | ID: mdl-37105110

ABSTRACT

Detecting protein complexes is critical for studying cellular organizations and functions. The accumulation of protein-protein interaction (PPI) data enables the identification of protein complexes computationally. Although a great number of computational methods have been proposed to identify protein complexes from PPI networks, most of them ignore the signs of PPIs that reflect the ways proteins interact (activation or inhibition). As not all PPIs imply co-complex relationships, taking into account the signs of PPIs can benefit the identification of protein complexes. Moreover, PPI networks are not static, but vary with the change of cell states or environments. However, existing methods are primarily designed for single-network clustering, and rarely consider joint clustering of multiple PPI networks. In this study, we propose a novel partially shared signed network clustering (PS-SNC) model for identifying protein complexes from multiple state-specific signed PPI networks jointly. PS-SNC can not only consider the signs of PPIs, but also identify the common and unique protein complexes in different states. Experimental results on synthetic and real datasets show that our PS-SNC model can achieve better performance than other state-of-the-art protein complex detection methods. Extensive analysis on real datasets demonstrates the effectiveness of PS-SNC in revealing novel insights about the underlying patterns of different cell lines.


Subject(s)
Protein Interaction Mapping , Protein Interaction Maps , Protein Interaction Mapping/methods , Proteins , Cluster Analysis , Algorithms , Computational Biology/methods
7.
Comput Struct Biotechnol J ; 21: 974-990, 2023.
Article in English | MEDLINE | ID: mdl-36733706

ABSTRACT

Cancer is a complex disease caused primarily by genetic variants. Reconstructing gene networks within tumors is essential for understanding the functional regulatory mechanisms of carcinogenesis. Advances in high-throughput sequencing technologies have provided tremendous opportunities for inferring gene networks via computational approaches. However, due to the heterogeneity of the same cancer type and the similarities between different cancer types, it remains a challenge to systematically investigate the commonalities and specificities between gene networks of different cancer types, which is a crucial step towards precision cancer diagnosis and treatment. In this study, we propose a new sparse regularized multi-layer decomposition graphical model to jointly estimate the gene networks of multiple cancer types. Our model can handle various types of gene expression data and decomposes each cancer-type-specific network into three components, i.e., globally shared, partially shared and cancer-type-unique components. By identifying the globally and partially shared gene network components, our model can explore the heterogeneous similarities between different cancer types, and our identified cancer-type-unique components can help to reveal the regulatory mechanisms unique to each cancer type. Extensive experiments on synthetic data illustrate the effectiveness of our model in joint estimation of multiple gene networks. We also apply our model to two real data sets to infer the gene networks of multiple cancer subtypes or cell lines. By analyzing our estimated globally shared, partially shared, and cancer-type-unique components, we identified a number of important genes associated with common and specific regulatory mechanisms across different cancer types.

8.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1445-1456, 2023.
Article in English | MEDLINE | ID: mdl-35476574

ABSTRACT

The single-cell RNA sequencing (scRNA-seq) technique has opened a new era by revealing gene expression patterns at single-cell resolution, enabling studies of heterogeneity and transcriptome dynamics of complex tissues. However, the large proportion of dropout events may hinder downstream analyses. Thus, imputation of dropout events is an important step in analyzing scRNA-seq data. We develop scTSSR2, a new imputation method that combines matrix decomposition with the previously developed two-side sparse self-representation, leading to fast two-side sparse self-representation to impute dropout events in scRNA-seq data. Comparisons among different imputation methods show that scTSSR2 has distinct advantages in terms of computational speed and memory usage. Comprehensive downstream experiments show that scTSSR2 outperforms the state-of-the-art imputation methods. A user-friendly R package, scTSSR2, is developed to denoise scRNA-seq data and improve the data quality.


Subject(s)
Gene Expression Profiling , Transcriptome , Transcriptome/genetics , Sequence Analysis, RNA , Single-Cell Analysis
9.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36585783

ABSTRACT

The inference of gene regulatory networks (GRNs) is of great importance for understanding the complex regulatory mechanisms within cells. The emergence of single-cell RNA-sequencing (scRNA-seq) technologies enables the measurement of gene expression levels for individual cells, which promotes the reconstruction of GRNs at single-cell resolution. However, existing network inference methods are mainly designed for data collected from a single data source, which ignores the information provided by multiple related data sources. In this paper, we propose a multi-view contrastive learning (DeepMCL) model to infer GRNs from scRNA-seq data collected from multiple data sources or time points. We first represent each gene pair as a set of histogram images, and then introduce a deep Siamese convolutional neural network with contrastive loss to learn the low-dimensional embedding for each gene pair. Moreover, an attention mechanism is introduced to integrate the embeddings extracted from different data sources and different neighbor gene pairs. Experimental results on synthetic and real-world datasets validate the effectiveness of our contrastive learning and attention mechanisms, demonstrating the effectiveness of our model in integrating multiple data sources for GRN inference.


Subject(s)
Algorithms , Gene Regulatory Networks , Neural Networks, Computer , Exome Sequencing , Gene Expression
10.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36047285

ABSTRACT

Advances in single-cell RNA sequencing (scRNA-seq) technologies have provided an unprecedented opportunity for cell-type identification. As clustering is an effective strategy towards cell-type identification, various computational approaches have been proposed for clustering scRNA-seq data. Recently, with the emergence of cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), the cell surface expression of specific proteins and the RNA expression on the same cell can be captured, which provides more comprehensive information for cell analysis. However, existing single-cell clustering algorithms are mainly designed for single-omic data, and have difficulties in handling multi-omics data with diverse characteristics efficiently. In this study, we propose a novel deep embedded multi-omics clustering with collaborative training (DEMOC) model to perform joint clustering on CITE-seq data. Our model can take into account the characteristics of transcriptomic and proteomic data, and make use of the consistent and complementary information provided by different data sources effectively. Experimental results on two real CITE-seq datasets demonstrate that our DEMOC model not only outperforms state-of-the-art single-omic clustering methods, but also achieves better and more stable performance than existing multi-omics clustering methods. We also apply our model to three scRNA-seq datasets to assess its performance in rare cell-type identification, novel cell-subtype detection and cellular heterogeneity analysis. Experimental results illustrate the effectiveness of our model in discovering the underlying patterns of data.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Algorithms , Cluster Analysis , Epitopes , Gene Expression Profiling/methods , Proteomics , RNA , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
11.
Brief Funct Genomics ; 21(4): 325-338, 2022 07 27.
Article in English | MEDLINE | ID: mdl-35760070

ABSTRACT

Identification of cancer-related genes is helpful for understanding the pathogenesis of cancer, developing targeted drugs and creating new diagnostic and therapeutic methods. Considering the complexity of the biological laboratory methods, many network-based methods have been proposed to identify cancer-related genes from a global perspective with the increasing availability of high-throughput data. Some studies have focused on the tissue-specific cancer networks. However, cancers from different tissues may share common features, and those methods may ignore the differences and similarities across cancers during model construction. In this work, in order to make full use of global information of the network, we first establish the pan-cancer network via a differential network algorithm, which not only contains heterogeneous data across multiple cancer types but also contains heterogeneous data between tumor samples and normal samples. Second, the node representation vectors are learned by network embedding. In contrast to ranking analysis-based methods, with the help of integrative network analysis, we transform the cancer-related gene identification problem into a binary classification problem. The final results are obtained via ensemble classification. We further applied these methods to the most commonly used gene expression data involving six tissue-specific cancer types. As a result, an integrative pan-cancer network and several biologically meaningful results were obtained. As examples, nine genes were ultimately identified as potential pan-cancer-related genes. Most of these genes have been reported in published studies, thus showing our method's potential for application in identifying driver gene candidates for further biological experimental verification.


Subject(s)
Neoplasms , Oncogenes , Algorithms , Gene Regulatory Networks , Humans , Neoplasms/genetics , Neoplasms/pathology
13.
IEEE/ACM Trans Comput Biol Bioinform ; 19(5): 2894-2906, 2022.
Article in English | MEDLINE | ID: mdl-34383650

ABSTRACT

Inferring gene co-expression networks from high-throughput gene expression data is an important task in bioinformatics. Many gene networks often exhibit modular structures. Although several Gaussian graphical model-based methods have been developed to estimate gene co-expression networks by incorporating the modular structural prior, none of them takes into account the modular structures captured by the prior networks (e.g., protein interaction networks). In this study, we propose a novel prior network-dependent gene network inference (pGNI) method to estimate gene co-expression networks by integrating gene expression data and prior protein interaction network data. The underlying modular structure is learned from both sets of data. Through simulation studies, we demonstrate the feasibility and effectiveness of our method. We also apply our method to two real datasets. The modular structures in the networks estimated by our method are biologically significant.


Subject(s)
Gene Regulatory Networks , Protein Interaction Maps , Algorithms , Computational Biology/methods , Computer Simulation , Gene Expression Profiling/methods , Gene Regulatory Networks/genetics , Normal Distribution , Protein Interaction Maps/genetics
14.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34864871

ABSTRACT

Advances in high-throughput experimental technologies promote the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analyses, which can facilitate various downstream studies and provide insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review on such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance for biomedical link prediction and scRNA-seq data imputation.


Subject(s)
Data Analysis , Single-Cell Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Exome Sequencing
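
To make the idea of matrix completion by matrix factorization in the abstract above concrete, here is a toy sketch. It is not one of the reviewed methods: it fits a rank-1 model M ≈ u vᵀ on the observed entries only, using alternating least squares, and then reads the missing entry off the fitted product. The function name `rank1_complete`, the use of `None` to mark missing entries, and the 2×2 example matrix are all assumptions for illustration.

```python
def rank1_complete(matrix, n_iter=200):
    """Fit M ~ u v^T on observed entries by alternating least squares.

    Missing entries are marked with None. Assumes every row and every
    column has at least one observed, non-zero entry.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        # Update each row factor with the column factors held fixed.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor with the row factors held fixed.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# The full matrix is [[1, 2], [2, 4]]; the bottom-right entry is hidden.
observed = [[1.0, 2.0], [2.0, None]]
completed = rank1_complete(observed)  # completed[1][1] approaches 4.0
```

Real biomedical matrices are far larger, noisier and of higher rank, so practical methods add regularization and a chosen rank; the sketch only shows why a low-rank fit on observed entries can predict unobserved ones.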
15.
Article in English | MEDLINE | ID: mdl-32750866

ABSTRACT

It is an important task to learn how gene regulatory networks change under different conditions. Several Gaussian graphical model-based methods have been proposed to deal with this task by inferring differential networks from gene expression data. However, most existing methods define the differential networks as the difference of precision matrices, which may include false differential edges caused by the change of conditional variances. In addition, prior information about the condition-specific networks and the differential networks can be obtained from other domains. It is useful to incorporate prior information into differential network analysis. In this study, we propose a new differential network analysis method to address the above challenges. Instead of using the precision matrices, we define the differential networks as the difference of partial correlations, which can exclude the spurious differential edges due to variations in conditional variances. Furthermore, prior information from multiple hypothesis testing is incorporated using a weighted fused penalty. Simulation studies show that our method outperforms the competing methods. We also apply our method to identify the differential network between luminal A and basal-like subtypes of breast cancers and the differential network between acute myeloid leukemia tumors and normal samples. The hub genes in the differential networks identified by our method carry out important biological functions.


Subject(s)
Breast Neoplasms , Gene Regulatory Networks , Breast Neoplasms/genetics , Computer Simulation , Female , Gene Regulatory Networks/genetics , Humans , Normal Distribution
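
The distinction the abstract above draws between precision-matrix differences and partial-correlation differences can be checked numerically. The sketch below uses the standard identity ρ_ij = −ω_ij / √(ω_ii ω_jj) relating a precision matrix Ω to partial correlations; it is not the authors' estimator, and the function name and the two 2×2 matrices are assumptions chosen for illustration. The second matrix is the first one rescaled as DΩD with D = diag(2, 1), which changes conditional variances only.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix into a matrix of partial correlations.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) off the diagonal.
    """
    size = len(precision)
    return [
        [
            1.0 if i == j
            else -precision[i][j] / math.sqrt(precision[i][i] * precision[j][j])
            for j in range(size)
        ]
        for i in range(size)
    ]


# Condition A, and condition B = D * omega_a * D with D = diag(2, 1).
omega_a = [[2.0, -1.0], [-1.0, 2.0]]
omega_b = [[8.0, -2.0], [-2.0, 2.0]]

precision_diff = omega_b[0][1] - omega_a[0][1]  # non-zero: looks like a changed edge
partial_diff = (
    partial_correlations(omega_b)[0][1] - partial_correlations(omega_a)[0][1]
)  # zero: the dependency itself is unchanged
```

In this toy case the off-diagonal precision entries differ (−2 versus −1) while both partial correlations equal 0.5, which is the kind of spurious differential edge the abstract describes.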
16.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34571530

ABSTRACT

The identification of differentially expressed genes between different cell groups is a crucial step in analyzing single-cell RNA-sequencing (scRNA-seq) data. Although various differential expression analysis methods for scRNA-seq data have recently been proposed based on different model assumptions and strategies, the differentially expressed genes they identify differ considerably, and their performance depends on the underlying data structures. In this paper, we propose a new ensemble learning-based differential expression analysis method, scDEA, to produce a more stable and accurate result. scDEA integrates the P-values obtained from 12 individual differential expression analysis methods for each gene using a P-value combination method. Comprehensive experiments show that scDEA outperforms the state-of-the-art individual methods with different experimental settings and evaluation metrics. We expect that scDEA will serve a wide range of users, including biologists, bioinformaticians and data scientists, who need to detect differentially expressed genes in scRNA-seq data.


Subject(s)
RNA , Single-Cell Analysis , Gene Expression Profiling/methods , Machine Learning , RNA/genetics , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Exome Sequencing
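
The abstract above does not say which P-value combination method scDEA uses. As a generic illustration of the idea, the sketch below applies Fisher's method, swapped in here as an assumption: the statistic X = −2 Σ ln p_i is compared with a chi-square distribution with 2k degrees of freedom, whose survival function has a closed form for even degrees of freedom. The function name and the example P-values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine P-values for one gene with Fisher's method.

    Requires every P-value to lie in (0, 1]. Fisher's method assumes
    independent tests, which methods run on the same data do not satisfy,
    so the result here is illustrative only.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2, X = -2 * sum(ln p)
    # Chi-square survival function with 2k dof:
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Hypothetical P-values for one gene from two individual methods.
combined = fisher_combine([0.05, 0.05])
```

With a single input the function returns that P-value unchanged, which is a quick sanity check on the closed-form tail.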
17.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2781-2787, 2021.
Article in English | MEDLINE | ID: mdl-34495837

ABSTRACT

The advancements of single-cell RNA sequencing (scRNA-seq) technologies have provided us with unprecedented opportunities to characterize cellular states and investigate the mechanisms of complex diseases. Due to technical issues such as dropout events, scRNA-seq data contain an excess of false zero counts, which has a substantial impact on the downstream analyses. Although several computational approaches have been proposed to impute dropout events in scRNA-seq data, there is no strong consensus on which is the best approach. In this study, we propose a novel weighted ensemble learning method, named EnTSSR, to impute dropout events in scRNA-seq data. By using a multi-view two-side sparse self-representation framework, our model can exploit the consensus similarities between genes and between cells based on the imputed results of various imputation methods. Moreover, we introduce a weighted ensemble strategy to leverage the information captured by various imputation methods effectively. Down-sampling experiments, clustering analysis, differential expression analysis and cell trajectory inference are carried out to evaluate the performance of our proposed model. Experimental results demonstrate that our EnTSSR can effectively recover the true expression pattern of scRNA-seq data.


Subject(s)
Machine Learning , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cells, Cultured , Cluster Analysis , Computational Biology , Embryonic Stem Cells , Humans , Software
18.
Bioinformatics ; 37(23): 4414-4423, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34245246

ABSTRACT

MOTIVATION: Differential network analysis is an important tool to investigate the rewiring of gene interactions under different conditions. Several computational methods have been developed to estimate differential networks from gene expression data, but most of them do not consider that gene network rewiring may be driven by the differential expression of individual genes. New differential network analysis methods that simultaneously take account of the changes in gene interactions and changes in expression levels are needed. RESULTS: In this article, we propose a differential network analysis method that considers the differential expression of individual genes when identifying differential edges. First, two hypothesis test statistics are used to quantify changes in partial correlations between gene pairs and changes in expression levels for individual genes. Then, an optimization framework is proposed to combine the two test statistics so that the resulting differential network has a hierarchical property, where a differential edge can be considered only if at least one of the two involved genes is differentially expressed. Simulation results indicate that our method outperforms current state-of-the-art methods. We apply our method to identify the differential networks between the luminal A and basal-like subtypes of breast cancer and those between acute myeloid leukemia and normal samples. Hub nodes in the differential networks estimated by our method, including both differentially and nondifferentially expressed genes, have important biological functions. AVAILABILITY AND IMPLEMENTATION: All the datasets underlying this article are publicly available. Processed data and source code can be accessed through the Github repository at https://github.com/Zhangxf-ccnu/chNet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Breast Neoplasms , Software , Humans , Female , Computer Simulation , Gene Regulatory Networks , Breast Neoplasms/genetics , Gene Expression
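
The hierarchical property described in the abstract above (a differential edge is admissible only if at least one of its two genes is differentially expressed) can be stated as a simple constraint check. The sketch below is not the authors' optimization framework, which combines two test statistics jointly; it is a post-hoc filter written only to show what the constraint means. The function name and the gene labels are hypothetical.

```python
def hierarchical_edges(candidate_edges, de_genes):
    """Keep candidate differential edges that satisfy the hierarchical property.

    An edge (g1, g2) is kept only if g1 or g2 is differentially expressed.
    """
    de = set(de_genes)
    return [(g1, g2) for g1, g2 in candidate_edges if g1 in de or g2 in de]


# Hypothetical candidate edges and differentially expressed genes.
candidates = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")]
differentially_expressed = ["geneA", "geneD"]
kept = hierarchical_edges(candidates, differentially_expressed)
```

The edge between `geneB` and `geneC` is dropped because neither endpoint is differentially expressed, while edges touching `geneA` or `geneD` remain, even when the other endpoint is not differentially expressed.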
19.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-33975339

ABSTRACT

The mechanisms controlling biological processes, such as the development of disease or cell differentiation, can be investigated by examining changes in the networks of gene dependencies between states in the process. High-throughput experimental methods, like microarray and RNA sequencing, have been widely used to gather gene expression data, which paves the way to infer gene dependencies based on computational methods. However, most differential network analysis methods are designed to deal with fully observed data, but missing values, such as the dropout events in single-cell RNA-sequencing data, are frequent. New methods are needed to take account of these missing values. Moreover, since the changes of gene dependencies may be driven by certain perturbed genes, considering the changes in gene expression levels may promote the identification of gene network rewiring. In this study, a novel weighted differential network estimation (WDNE) model is proposed to handle multi-platform gene expression data with missing values and take account of changes in gene expression levels. Simulation studies demonstrate that WDNE outperforms state-of-the-art differential network estimation methods. When WDNE is applied to infer differential gene networks associated with drug resistance in ovarian tumors, cell differentiation and breast tumor heterogeneity, the hub genes in the estimated differential gene networks can provide important insights into the underlying mechanisms. Furthermore, a Matlab toolbox, differential network analysis toolbox, was developed to implement the WDNE model and visualize the estimated differential networks.


Subject(s)
Algorithms , Breast Neoplasms , Drug Resistance, Neoplasm/genetics , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Models, Genetic , Ovarian Neoplasms , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Female , Gene Expression Profiling , Humans , Ovarian Neoplasms/genetics , Ovarian Neoplasms/metabolism
20.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2891-2897, 2021.
Article in English | MEDLINE | ID: mdl-33656995

ABSTRACT

The identification of cancer subtypes is of great importance for understanding the heterogeneity of tumors and providing patients with more accurate diagnoses and treatments. However, it is still a challenge to effectively integrate multiple omics data to establish cancer subtypes. In this paper, we propose an unsupervised integration method, named weighted multi-view low rank representation (WMLRR), to identify cancer subtypes from multiple types of omics data. Given a group of patients described by multiple omics data matrices, we first learn a unified affinity matrix which encodes the similarities among patients by exploring the sparsity-consistent low-rank representations from the joint decompositions of multiple omics data matrices. Unlike existing subtype identification methods that treat each omics data matrix equally, we assign a weight to each omics data matrix and learn these weights automatically through the optimization process. Finally, we apply spectral clustering on the learned affinity matrix to identify cancer subtypes. Experimental results show that the survival times between our identified cancer subtypes are significantly different, and our predicted survivals are more accurate than those of other state-of-the-art methods. In addition, some clinical analyses of the diseases also demonstrate the effectiveness of our method in identifying molecular subtypes with biological significance and clinical relevance.


Subject(s)
Computational Biology/methods , Neoplasms , Unsupervised Machine Learning , Algorithms , Cluster Analysis , DNA Methylation/genetics , Humans , Neoplasms/classification , Neoplasms/genetics , Neoplasms/mortality , Transcriptome/genetics