
1.

TransPTM: a transformer-based model for non-histone acetylation site prediction.

Meng, Lingkuan; Chen, Xingjian; Cheng, Ke; Chen, Nanjun; Zheng, Zetian; Wang, Fuzhou; Sun, Hongyan; Wong, Ka-Chun.

Brief Bioinform ; 25(3)2024 Mar 27.

Article En | MEDLINE | ID: mdl-38725156

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. There are 886 positive samples and 4707 negative samples in total for each sequence length. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM achieves competitive performance for non-histone acetylation site prediction against three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.

Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism
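The fixed-length samples mentioned in the TransPTM abstract can be pictured as sequence windows cut around candidate lysines. The following is a minimal sketch under assumptions that do not come from the abstract: the window length is odd so the lysine sits in the centre, termini are padded with `X`, and the function name `lysine_windows` is hypothetical. It is not the NHAC construction procedure.

```python
def lysine_windows(sequence, window_length, pad="X"):
    """Return (position, window) pairs for every lysine (K) in the sequence.

    Assumptions (not taken from the paper): window_length is odd so the
    lysine sits in the centre, and sequence termini are padded with `pad`.
    """
    if window_length < 1 or window_length % 2 == 0:
        raise ValueError("window_length must be a positive odd number")
    flank = window_length // 2
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + flank in `padded`, so the slice
            # starting at i covers `flank` residues on each side of it.
            windows.append((i, padded[i:i + window_length]))
    return windows


if __name__ == "__main__":
    for position, window in lysine_windows("MKTAK", 5):
        print(position, window)
```

Each window would then be passed to an embedding model; that step is omitted here because it depends on the specific pre-trained model and its tooling.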

2.

DeepSeq2Drug: An expandable ensemble end-to-end anti-viral drug repurposing benchmark framework by multi-modal embeddings and transfer learning.

Xie, Weidun; Yu, Jixiang; Huang, Lei; For, Lek Shyuen; Zheng, Zetian; Chen, Xingjian; Wang, Yuchen; Liu, Zhichao; Peng, Chengbin; Wong, Ka-Chun.

Comput Biol Med ; 175: 108487, 2024 Jun.

Article En | MEDLINE | ID: mdl-38653064

Drug repurposing is promising in multiple scenarios, such as emerging viral outbreak controls and cost reductions of drug discovery. Traditional graph-based drug repurposing methods are of limited use for fast, large-scale virtual screens, as they constrain the counts for drugs and targets and fail to predict novel viruses or drugs. Moreover, though deep learning has been proposed for drug repurposing, only a few methods have been used, including a group of pre-trained deep learning models for embedding generation and transfer learning. Hence, we propose DeepSeq2Drug to tackle the shortcomings of previous methods. We leverage multi-modal embeddings and an ensemble strategy to complement the numbers of drugs and viruses and to guarantee novel predictions. This framework (including the expanded version) involves four modal types: six NLP models, four CV models, four graph models, and two sequence models. In detail, we first build a pipeline and calculate the predictive performance of each pair of viral and drug embeddings. Then, we select the best embedding pairs and apply an ensemble strategy to conduct anti-viral drug repurposing. To validate the effect of the proposed ensemble model, a monkeypox virus (MPV) case study is conducted to reflect the potential predictive capability. This framework could be a benchmark method for further pre-trained deep learning optimization and anti-viral drug repurposing tasks. We also build software to make the proposed model easier to reuse. The code and software are freely available at http://deepseq2drug.cs.cityu.edu.hk.

Antiviral Agents , Deep Learning , Drug Repositioning , Drug Repositioning/methods , Antiviral Agents/pharmacology , Antiviral Agents/therapeutic use , Humans , Software , Benchmarking
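The two-step procedure in the DeepSeq2Drug abstract (score every viral/drug embedding pair, then ensemble the best pairs) can be sketched as follows. The abstract does not say how the ensemble is formed, so plain averaging of per-drug scores is an assumption here, and the function names, model labels and scores are all hypothetical.

```python
from statistics import mean


def select_top_pairs(pair_scores, k):
    """Pick the k embedding pairs with the highest validation score.

    pair_scores maps (viral_model, drug_model) -> score (higher is better).
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]


def ensemble_rank(predictions, selected_pairs):
    """Average each drug's predicted score over the selected pairs.

    predictions maps pair -> {drug: score}. Simple averaging is an
    assumption; the paper's ensemble strategy may differ.
    """
    if not selected_pairs:
        raise ValueError("at least one embedding pair is required")
    # Only drugs scored by every selected pair can be averaged fairly.
    drugs = set.intersection(*(set(predictions[p]) for p in selected_pairs))
    averaged = {d: mean(predictions[p][d] for p in selected_pairs) for d in drugs}
    # Highest average first; ties broken alphabetically for determinism.
    return sorted(averaged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    scores = {("v1", "d1"): 0.9, ("v1", "d2"): 0.7, ("v2", "d1"): 0.8}
    preds = {
        ("v1", "d1"): {"drugA": 0.75, "drugB": 0.25},
        ("v2", "d1"): {"drugA": 0.25, "drugB": 0.5},
        ("v1", "d2"): {"drugA": 0.0, "drugB": 1.0},
    }
    best = select_top_pairs(scores, 2)
    print(ensemble_rank(preds, best))
```

Averaging keeps the sketch short; a weighted vote or a stacked model would fit the same interface.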

3.

Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation.

Chen, Nanjun; Yu, Jixiang; Liu, Zhe; Meng, Lingkuan; Li, Xiangtao; Wong, Ka-Chun.

Nucleic Acids Res ; 52(8): 4137-4150, 2024 May 08.

Article En | MEDLINE | ID: mdl-38572749

DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA-DBP interactions. Therefore, the shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA-DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features together remain insufficient. To address this, we propose a series of methods to discover non-redundant DNA shape motifs with the generalization to multiple motifs in multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing the existing sequence motif discovery works. Additionally, our modelling captures the potential interplays across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.

DNA , Nucleotide Motifs , DNA/chemistry , DNA/genetics , DNA/metabolism , DNA-Binding Proteins/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , Algorithms , Nucleic Acid Conformation , Chromatin Immunoprecipitation Sequencing/methods , Binding Sites , Transcription Factors/metabolism , Transcription Factors/genetics , Transcription Factors/chemistry , Humans , Protein Binding

4.

Exhaustive Exploitation of Nature-Inspired Computation for Cancer Screening in an Ensemble Manner.

Wang, Xubin; Wang, Yunhe; Ma, Zhiqiang; Wong, Ka-Chun; Li, Xiangtao.

IEEE/ACM Trans Comput Biol Bioinform ; PP2024 Apr 05.

Article En | MEDLINE | ID: mdl-38578856

Accurate screening of cancer types is crucial for effective cancer detection and precise treatment selection. However, the association between gene expression profiles and tumors is often limited to a small number of biomarker genes. While computational methods using nature-inspired algorithms have shown promise in selecting predictive genes, existing techniques are limited by inefficient search and poor generalization across diverse datasets. This study presents a framework termed Evolutionary Optimized Diverse Ensemble Learning (EODE) to improve ensemble learning for cancer classification from gene expression data. The EODE methodology combines an intelligent grey wolf optimization algorithm for selective feature space reduction, guided random injection modeling for ensemble diversity enhancement, and subset model optimization for synergistic classifier combinations. Extensive experiments were conducted across 35 gene expression benchmark datasets encompassing varied cancer types. Results demonstrated that EODE obtained significantly improved screening accuracy over individual and conventionally aggregated models. The integrated optimization of advanced feature selection, directed specialized modeling, and cooperative classifier ensembles helps address key challenges in current nature-inspired approaches. This provides an effective framework for robust and generalized ensemble learning with gene expression biomarkers. The EODE source code is openly available on GitHub at https://github.com/wangxb96/EODE.

5.

A dual diffusion model enables 3D molecule generation and lead optimization based on target pockets.

Huang, Lei; Xu, Tingyang; Yu, Yang; Zhao, Peilin; Chen, Xingjian; Han, Jing; Xie, Zhi; Li, Hailong; Zhong, Wenge; Wong, Ka-Chun; Zhang, Hengtong.

Nat Commun ; 15(1): 2657, 2024 Mar 26.

Article En | MEDLINE | ID: mdl-38531837

Structure-based generative chemistry is essential in computer-aided drug discovery by exploring a vast chemical space to design ligands with high binding affinity for targets. However, traditional in silico methods are limited by computational inefficiency, while machine learning approaches face bottlenecks due to auto-regressive sampling. To address these concerns, we have developed a conditional deep generative model, PMDM, for 3D molecule generation fitting specified targets. PMDM consists of a conditional equivariant diffusion model with both local and global molecular dynamics, enabling PMDM to consider the conditioned protein information to generate molecules efficiently. The comprehensive experiments indicate that PMDM outperforms baseline models across multiple evaluation metrics. To evaluate the applications of PMDM under real drug design scenarios, we conduct lead compound optimization for SARS-CoV-2 main protease (Mpro) and Cyclin-dependent Kinase 2 (CDK2), respectively. The selected lead optimization molecules are synthesized and evaluated for their in-vitro activities against CDK2, displaying improved CDK2 activity.

Anti-HIV Agents , Methacrylates , Benchmarking , Benzoates , Chemistry, Physical , Drug Design

6.

scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics.

Wang, Yuchen; Chen, Xingjian; Zheng, Zetian; Huang, Lei; Xie, Weidun; Wang, Fuzhou; Zhang, Zhaolei; Wong, Ka-Chun.

iScience ; 27(4): 109352, 2024 Apr 19.

Article En | MEDLINE | ID: mdl-38510148

Gene regulatory networks (GRNs) involve complex and multi-layer regulatory interactions between regulators and their target genes. Precise knowledge of GRNs is important in understanding cellular processes and molecular functions. Recent breakthroughs in single-cell sequencing technology have made it possible to infer GRNs at the single-cell level. Existing methods, however, are limited by expensive computations, and sometimes simplistic assumptions. To overcome these obstacles, we propose scGREAT, a framework to infer GRNs using gene embeddings and a transformer from single-cell transcriptomics. scGREAT starts by constructing gene expression and gene biotext dictionaries from scRNA-seq data and gene text information. The representation of TF gene pairs is learned through optimizing the embedding space with a transformer-based engine. Results illustrated that scGREAT outperformed other contemporary methods on benchmarks. Besides, gene representations from scGREAT provide valuable gene regulation insights, and external validation on spatial transcriptomics illuminated the mechanism behind scGREAT annotation. Moreover, scGREAT identified several TF target regulations corroborated in studies.

7.

Uncovering the ceRNA Network Related to the Prognosis of Stomach Adenocarcinoma Among 898 Patient Samples.

Liu, Zhe; Liu, Fang; Petinrin, Olutomilayo Olayemi; Wang, Fuzhou; Zhang, Yu; Wong, Ka-Chun.

Biochem Genet ; 2024 Feb 15.

Article En | MEDLINE | ID: mdl-38361095

Stomach adenocarcinoma (STAD) patients are often associated with significantly high mortality rates and poor prognoses worldwide. Among STAD patients, competing endogenous RNAs (ceRNAs) play key roles in regulating one another at the post-transcriptional stage by competing for shared miRNAs. In this study, we aimed to elucidate the roles of lncRNAs in the ceRNA network of STAD, uncovering the molecular biomarkers for target therapy and prognosis. Specifically, a multitude of differentially expressed lncRNAs, miRNAs, and mRNAs (i.e., 898 samples in total) was collected and processed from TCGA. Cytoplasmic lncRNAs were kept for evaluating overall survival (OS) time and constructing the ceRNA network. Differentially expressed mRNAs in the ceRNA network were also investigated for functional and pathological insights. Interestingly, we identified one ceRNA network including 13 lncRNAs, 25 miRNAs, and 9 mRNAs. Among them, 13 RNAs were found related to the patient survival time; their individual risk score can be adopted for prognosis inference. Finally, we constructed a comprehensive ceRNA regulatory network for STAD and developed our own risk-scoring system that can predict the OS time of STAD patients by taking into account the above.

8.

Distribution-Agnostic Deep Learning Enables Accurate Single-Cell Data Recovery and Transcriptional Regulation Interpretation.

Su, Yanchi; Yu, Zhuohan; Yang, Yuning; Wong, Ka-Chun; Li, Xiangtao.

Adv Sci (Weinh) ; 11(16): e2307280, 2024 Apr.

Article En | MEDLINE | ID: mdl-38380499

Single-cell RNA sequencing (scRNA-seq) is a robust method for studying gene expression at the single-cell level, but accurately quantifying genetic material is often hindered by limited mRNA capture, resulting in many missing expression values. Existing imputation methods rely on strict data assumptions, limiting their broader application, and lack reliable supervision, leading to biased signal recovery. To address these challenges, the authors developed Bis, a distribution-agnostic deep learning model for accurately recovering missing single-cell gene expression from multiple platforms. Bis is an optimal transport-based autoencoder model that can capture the intricate distribution of scRNA-seq data while addressing the characteristic sparsity by regularizing the cellular embedding space. Additionally, they propose a module using bulk RNA-seq data to guide reconstruction and ensure expression consistency. Experimental results show Bis outperforms other models across simulated and real datasets, showcasing superiority in various downstream analyses including batch effect removal, clustering, differential expression analysis, and trajectory inference. Moreover, Bis successfully restores gene expression levels in rare cell subsets in a tumor-matched peripheral blood dataset, revealing developmental characteristics of cytokine-induced natural killer cells within a head and neck squamous cell carcinoma microenvironment.

Deep Learning , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods

9.

MotifHub: Detection of trans-acting DNA motif group with probabilistic modeling algorithm.

Liu, Zhe; Wong, Hiu-Man; Chen, Xingjian; Lin, Jiecong; Zhang, Shixiong; Yan, Shankai; Wang, Fuzhou; Li, Xiangtao; Wong, Ka-Chun.

Comput Biol Med ; 168: 107753, 2024 01.

Article En | MEDLINE | ID: mdl-38039889

BACKGROUND: Trans-acting factors, a group of proteins that can directly or indirectly recognize or bind to the 8-12 bp core sequence of cis-acting elements and regulate the transcription efficiency of target genes, are of special importance in transcription regulation. The progressive development in high-throughput chromatin capture technology (e.g., Hi-C) enables the identification of chromatin-interacting sequence groups where trans-acting DNA motif groups can be discovered. The difficulty of the problem lies in the combinatorial nature of DNA sequence pattern matching and its underlying sequence pattern search space. METHOD: Here, we propose to develop MotifHub for trans-acting DNA motif group discovery on grouped sequences. Specifically, the main approach is to develop probabilistic modeling for accommodating the stochastic nature of DNA motif patterns. RESULTS: Based on the modeling, we develop global sampling techniques based on EM and Gibbs sampling to address the global optimization challenge for model fitting with latent variables. The results reflect that our proposed approaches demonstrate promising performance with linear time complexities. CONCLUSION: MotifHub is a novel algorithm considering the identification of both DNA co-binding motif groups and trans-acting TFs. Our study paves the way for identifying hub TFs of stem cell development (OCT4 and SOX2) and determining potential therapeutic targets of prostate cancer (FOXA1 and MYC). To ensure scientific reproducibility and long-term impact, its matrix-algebra-optimized source code is released at http://bioinfo.cs.cityu.edu.hk/MotifHub.

Algorithms , Software , Nucleotide Motifs/genetics , Reproducibility of Results , Chromatin/genetics

10.

Automated exploitation of deep learning for cancer patient stratification across multiple types.

Sun, Pingping; Fan, Shijie; Li, Shaochuan; Zhao, Yingwei; Lu, Chang; Wong, Ka-Chun; Li, Xiangtao.

Bioinformatics ; 39(11)2023 11 01.

Article En | MEDLINE | ID: mdl-37934154

MOTIVATION: Recent frameworks based on deep learning have been developed to identify cancer subtypes from high-throughput gene expression profiles. Unfortunately, the performance of deep learning is highly dependent on its neural network architectures, which are often hand-crafted with expertise in deep neural networks; meanwhile, the optimization and adjustment of the network are usually costly and time consuming. RESULTS: To address such limitations, we proposed a fully automated deep neural architecture search model for diagnosing consensus molecular subtypes from gene expression data (DNAS). The proposed model uses the ant colony algorithm, one of the heuristic swarm intelligence algorithms, to search and optimize neural network architecture, and it can automatically find the optimal deep learning model architecture for cancer diagnosis in its search space. We validated DNAS on eight colorectal cancer datasets, achieving an average accuracy of 95.48%, an average specificity of 98.07%, and an average sensitivity of 96.24%. Without loss of generality, we investigated the general applicability of DNAS further on other cancer types from different platforms including lung cancer and breast cancer, and DNAS achieved an area under the curve of 95% and 96%, respectively. In addition, we conducted gene ontology enrichment and pathological analysis to reveal interesting insights into cancer subtype identification and characterization across multiple cancer types. AVAILABILITY AND IMPLEMENTATION: The source code and data can be downloaded from https://github.com/userd113/DNAS-main. The web server of DNAS is publicly accessible at 119.45.145.120:5001.

Breast Neoplasms , Deep Learning , Humans , Female , Neural Networks, Computer , Algorithms , Software
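The accuracy, specificity and sensitivity figures quoted in the DNAS abstract are standard confusion-matrix quantities. As a minimal sketch, assuming a one-vs-rest treatment in which a single class is regarded as positive (the abstract does not state how the multi-class subtypes were aggregated), they can be computed like this. The function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity and specificity with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # Undefined ratios (no samples of that kind) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 0, 1]
    print(classification_metrics(truth, predicted, positive=1))
```

Averaging these values over datasets, as the abstract reports, would be a separate step on top of this per-dataset calculation.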

11.

LncRNA-Top: Controlled deep learning approaches for lncRNA gene regulatory relationship annotations across different platforms.

Xie, Weidun; Chen, Xingjian; Zheng, Zetian; Wang, Fuzhou; Zhu, Xiaowei; Lin, Qiuzhen; Sun, Yanni; Wong, Ka-Chun.

iScience ; 26(11): 108197, 2023 Nov 17.

Article En | MEDLINE | ID: mdl-37965148

By soaking up microRNAs (miRNAs), long non-coding RNAs (lncRNAs) have the potential to regulate gene expression. Few methods based on this mechanism have been created to predict lncRNA-gene relationships. Hence, we present lncRNA-Top to forecast potential lncRNA-gene regulation relationships. Specifically, we constructed controlled deep-learning methods using 12417 lncRNAs and 16127 genes. We have provided retrospective and innovative views among negative sampling, random seeds, cross-validation, metrics, and independent datasets. The AUC, AUPR, and our defined precision@k were leveraged to evaluate performance. In-depth case studies demonstrate that 47 out of 100 projected top unknown pairings were recorded in publications, supporting the predictive power. Our additional software can annotate the scores with target candidates. lncRNA-Top will be a helpful tool to uncover prospective lncRNA targets and better comprehend the regulatory processes of lncRNAs.
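For readers unfamiliar with precision@k, the conventional definition is the fraction of the top k ranked predictions that are confirmed. The sketch below implements that conventional form only; the abstract refers to the authors' own definition, which may differ, and the function name and toy identifiers are hypothetical.

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k ranked items that appear in `relevant`.

    Conventional definition, used here as an assumption; the paper
    defines its own precision@k, which is not reproduced.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalises rankings shorter than k.
    return hits / k


if __name__ == "__main__":
    ranking = ["pair_a", "pair_b", "pair_c", "pair_d"]
    confirmed = {"pair_a", "pair_c", "pair_e"}
    print(precision_at_k(ranking, confirmed, 2))
```

Under this conventional reading, 47 confirmed pairings among the top 100 would correspond to a value of 47/100 = 0.47.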

12.

A Lightweight Framework For Chromatin Loop Detection at the Single-Cell Level.

Wang, Fuzhou; Alinejad-Rokny, Hamid; Lin, Jiecong; Gao, Tingxiao; Chen, Xingjian; Zheng, Zetian; Meng, Lingkuan; Li, Xiangtao; Wong, Ka-Chun.

Adv Sci (Weinh) ; 10(33): e2303502, 2023 11.

Article En | MEDLINE | ID: mdl-37816141

Single-cell Hi-C (scHi-C) has made it possible to analyze chromatin organization at the single-cell level. However, scHi-C experiments generate inherently sparse data, which poses a challenge for loop calling methods. The existing approach performs significance tests across the imputed dense contact maps, leading to substantial computational overhead and loss of information at the single-cell level. To overcome this limitation, a lightweight framework called scGSLoop is proposed, which sets a new paradigm for scHi-C loop calling by adapting the training and inferencing strategies of graph-based deep learning to leverage the sequence features and 1D positional information of genomic loci. With this framework, sparsity is no longer a challenge, but rather an advantage that the model leverages to achieve unprecedented computational efficiency. Compared to existing methods, scGSLoop makes more accurate predictions and is able to identify more loops that have the potential to play regulatory roles in genome functioning. Moreover, scGSLoop preserves single-cell information by identifying a distinct group of loops for each individual cell, which not only enables an understanding of the variability of chromatin looping states between cells, but also allows scGSLoop to be extended for the investigation of multi-connected hubs and their underlying mechanisms.

Chromatin , Genomics , Chromatin/genetics , Genome

13.

Construction of Immune Infiltration-Related LncRNA Signatures Based on Machine Learning for the Prognosis in Colon Cancer.

Liu, Zhe; Petinrin, Olutomilayo Olayemi; Toseef, Muhammad; Chen, Nanjun; Wong, Ka-Chun.

Biochem Genet ; 2023 Oct 04.

Article En | MEDLINE | ID: mdl-37792224

Colon cancer is one of the malignant tumors with high morbidity, lethality, and prevalence worldwide. Molecular biomarkers play key roles in its prognosis. In particular, immune-related lncRNAs (IRL) have attracted enormous interest in diagnosis and treatment, but less is known about their potential functions. We aimed to investigate dysfunctional IRL and construct a risk model for improving the outcomes of patients. Nineteen immune cell types were collected for identifying house-keeping lncRNAs (HKLncRNA). GSE39582 and TCGA-COAD were treated as the discovery and validation datasets, respectively. Four machine learning algorithms (LASSO, Random Forest, Boruta, and Xgboost) and a Gaussian mixture model were utilized to mine the optimal combination of lncRNAs. Univariate and multivariate Cox regression was utilized to construct the risk score model. We distinguished the functional difference in an immune perspective between low- and high-risk cohorts calculated by this scoring system. Finally, we provided a nomogram. By leveraging the microarray, sequencing, and clinical data for immune cells and colon cancer patients, we identified 221 HKLncRNAs with a low cell type-specificity index. Eighty-seven lncRNAs were up-regulated in immune cells compared to cancer cells. Twelve lncRNAs were beneficial in improving performance. A risk score model with three lncRNAs (CYB561D2, LINC00638, and DANCR) was proposed with robust ROC performance on an independent dataset. According to immune-related analysis, the risk score is strongly associated with the tumor immune microenvironment. Our results emphasized that IRL have the potential to be a powerful and effective therapy for improving the prognosis of colon cancer.
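A Cox-derived risk score of the kind described above is commonly a weighted sum of expression values, with patients then split into low- and high-risk groups. The sketch below assumes that linear form and a median cut-off; neither is stated in the abstract. The three gene names come from the abstract, but the coefficients and expression values are invented placeholders, not results from the study.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values (assumed linear Cox-style score)."""
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


def stratify(scores):
    """Split patients at the median score (cut-off choice is an assumption)."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


if __name__ == "__main__":
    # Placeholder coefficients for illustration only; not fitted values.
    coefficients = {"CYB561D2": 0.5, "LINC00638": -0.25, "DANCR": 1.0}
    patients = {
        "P1": {"CYB561D2": 2.0, "LINC00638": 4.0, "DANCR": 1.0},
        "P2": {"CYB561D2": 4.0, "LINC00638": 0.0, "DANCR": 2.0},
        "P3": {"CYB561D2": 0.0, "LINC00638": 8.0, "DANCR": 0.5},
    }
    scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
    print(scores)
    print(stratify(scores))
```

In practice the coefficients would come from the fitted multivariate Cox model, and the cut-off would be validated on an independent cohort.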

14.

Dynamic characterization and interpretation for protein-RNA interactions across diverse cellular conditions using HDRNet.

Zhu, Haoran; Yang, Yuning; Wang, Yunhe; Wang, Fuzhou; Huang, Yujian; Chang, Yi; Wong, Ka-Chun; Li, Xiangtao.

Nat Commun ; 14(1): 6824, 2023 10 26.

Article En | MEDLINE | ID: mdl-37884495

RNA-binding proteins play crucial roles in the regulation of gene expression, and understanding the interactions between RNAs and RBPs in distinct cellular conditions forms the basis for comprehending the underlying RNA function. However, current computational methods pose challenges to the cross-prediction of RNA-protein binding events across diverse cell lines and tissue contexts. Here, we develop HDRNet, an end-to-end deep learning-based framework to precisely predict dynamic RBP binding events under diverse cellular conditions. Our results demonstrate that HDRNet can accurately and efficiently identify binding sites, particularly for dynamic prediction, outperforming other state-of-the-art models on 261 linear RNA datasets from both eCLIP and CLIP-seq, supplemented with additional tissue data. Moreover, we conduct motif and interpretation analyses to provide fresh insights into the pathological mechanisms underlying RNA-RBP interactions from various perspectives. Our functional genomic analysis further explores the gene-human disease associations, uncovering previously uncharacterized observations for a broad range of genetic disorders.

RNA-Binding Proteins , RNA , Humans , RNA/genetics , RNA/metabolism , RNA-Binding Proteins/metabolism , Binding Sites/genetics , Protein Binding , Chromatin Immunoprecipitation Sequencing

15.

Chromothripsis detection with multiple myeloma patients based on deep graph learning.

Yu, Jixiang; Chen, Nanjun; Zheng, Zetian; Gao, Ming; Liang, Ning; Wong, Ka-Chun.

Bioinformatics ; 39(7)2023 07 01.

Article En | MEDLINE | ID: mdl-37399092

MOTIVATION: Chromothripsis, associated with poor clinical outcomes, is prognostically vital in multiple myeloma. The catastrophic event is reported to be detectable prior to the progression of multiple myeloma. As a result, chromothripsis detection can contribute to risk estimation and early treatment guidelines for multiple myeloma patients. However, manual diagnosis remains the gold standard approach to detect chromothripsis events with the whole-genome sequencing technology to retrieve both copy number variation (CNV) and structural variation data. Meanwhile, CNV data are much easier to obtain than structural variation data. Hence, in order to reduce the reliance on human experts' efforts and structural variation data extraction, it is necessary to establish a reliable and accurate chromothripsis detection method based on CNV data. RESULTS: To address those issues, we propose a method to detect chromothripsis solely based on CNV data. With the help of structure learning, the intrinsic relationship-directed acyclic graph of CNV features is inferred to derive a CNV embedding graph (i.e. CNV-DAG). Subsequently, a neural network based on Graph Transformer, local feature extraction, and non-linear feature interaction, is proposed with the embedding graph as the input to distinguish whether the chromothripsis event occurs. Ablation experiments, clustering, and feature importance analysis are also conducted to enable the proposed model to be explained by capturing mechanistic insights. AVAILABILITY AND IMPLEMENTATION: The source code and data are freely available at https://github.com/luvyfdawnYu/CNV_chromothripsis.

Chromothripsis , Multiple Myeloma , Humans , Multiple Myeloma/diagnosis , Multiple Myeloma/genetics , DNA Copy Number Variations , Software , Neural Networks, Computer

16.

Deep transfer learning for clinical decision-making based on high-throughput data: comprehensive survey with benchmark results.

Toseef, Muhammad; Olayemi Petinrin, Olutomilayo; Wang, Fuzhou; Rahaman, Saifur; Liu, Zhe; Li, Xiangtao; Wong, Ka-Chun.

Brief Bioinform ; 24(4)2023 07 20.

Article En | MEDLINE | ID: mdl-37455245

The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The result shows that those transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.

Benchmarking , Zebrafish , Animals , Mice , Zebrafish/genetics , Machine Learning , Precision Medicine , Clinical Decision-Making

17.

Reliable Identification and Interpretation of Single-Cell Molecular Heterogeneity and Transcriptional Regulation using Dynamic Ensemble Pruning.

Fan, Yi; Wang, Yunhe; Wang, Fuzhou; Huang, Lei; Yang, Yuning; Wong, Ka-Chun; Li, Xiangtao.

Adv Sci (Weinh) ; 10(22): e2205442, 2023 08.

Article En | MEDLINE | ID: mdl-37290050

Unsupervised clustering is an essential step in identifying cell types from single-cell RNA sequencing (scRNA-seq) data. However, a common issue with unsupervised clustering models is that the optimization direction of the objective function and the final generated clustering labels in the absence of supervised information may be inconsistent or even arbitrary. To address this challenge, a dynamic ensemble pruning framework (DEPF) is proposed to identify and interpret single-cell molecular heterogeneity. In particular, a silhouette coefficient-based indicator is developed to determine the optimization direction of the bi-objective function. In addition, a hierarchical autoencoder is employed to project the high-dimensional data onto multiple low-dimensional latent space sets, and then a clustering ensemble is produced in the latent space by the basic clustering algorithm. Following that, a bi-objective fruit fly optimization algorithm is designed to dynamically prune the low-quality basic clusterings in the ensemble. Multiple experiments are conducted on 28 real scRNA-seq datasets and one large real scRNA-seq dataset from diverse platforms and species to validate the effectiveness of the DEPF. In addition, biological interpretability and transcriptional and post-transcriptional regulatory analyses are conducted to explore biological patterns from the cell types identified, which could provide novel insights into characterizing the mechanisms.

Algorithms , Single-Cell Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Cluster Analysis , Gene Expression Regulation
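The DEPF abstract builds its indicator on the silhouette coefficient, which scores a clustering without ground-truth labels. The sketch below computes the plain mean silhouette coefficient with Euclidean distance; it is not the DEPF indicator itself, whose exact form the abstract does not give, and the function name and toy points are hypothetical.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to its own cluster, b = lowest mean
    distance to any other cluster, s = (b - a) / max(a, b).
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is taken as 0
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


if __name__ == "__main__":
    cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(cells, [0, 0, 1, 1]))  # well separated
    print(silhouette_score(cells, [0, 1, 0, 1]))  # poorly separated
```

Values near 1 indicate compact, well-separated clusters, while negative values suggest points sit closer to another cluster than to their own.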

18.

Spatial-Temporal Co-Attention Learning for Diagnosis of Mental Disorders From Resting-State fMRI Data.

Liu, Rui; Huang, Zhi-An; Hu, Yao; Zhu, Zexuan; Wong, Ka-Chun; Tan, Kay Chen.

IEEE Trans Neural Netw Learn Syst ; PP2023 Feb 17.

Article En | MEDLINE | ID: mdl-37027556

Neuroimaging techniques have been widely adopted to detect the neurological brain structures and functions of the nervous system. As an effective noninvasive neuroimaging technique, functional magnetic resonance imaging (fMRI) has been extensively used in computer-aided diagnosis (CAD) of mental disorders, e.g., autism spectrum disorder (ASD) and attention deficit/hyperactivity disorder (ADHD). In this study, we propose a spatial-temporal co-attention learning (STCAL) model for diagnosing ASD and ADHD from fMRI data. In particular, a guided co-attention (GCA) module is developed to model the intermodal interactions of spatial and temporal signal patterns. A novel sliding cluster attention module is designed to address the global feature dependency of the self-attention mechanism in fMRI time series. Comprehensive experimental results demonstrate that our STCAL model can achieve competitive accuracies of 73.0 ± 4.5%, 72.0 ± 3.8%, and 72.5 ± 4.2% on the ABIDE I, ABIDE II, and ADHD-200 datasets, respectively. Moreover, the potential for feature pruning based on the co-attention scores is validated by the simulation experiment. The clinical interpretation analysis of STCAL can allow medical professionals to concentrate on the discriminative regions of interest and key time frames from fMRI data.

19.

Machine learning in metastatic cancer research: Potentials, possibilities, and prospects.

Petinrin, Olutomilayo Olayemi; Saeed, Faisal; Toseef, Muhammad; Liu, Zhe; Basurra, Shadi; Muyide, Ibukun Omotayo; Li, Xiangtao; Lin, Qiuzhen; Wong, Ka-Chun.

Comput Struct Biotechnol J ; 21: 2454-2470, 2023.

Article En | MEDLINE | ID: mdl-37077177

Cancer has received extensive recognition for its high mortality rate, with metastatic cancer being the top cause of cancer-related deaths. Metastatic cancer involves the spread of the primary tumor to other body organs. As much as the early detection of cancer is essential, the timely detection of metastasis, the identification of biomarkers, and treatment choice are valuable for improving the quality of life for metastatic cancer patients. This study reviews the existing studies on classical machine learning (ML) and deep learning (DL) in metastatic cancer research. Since the majority of metastatic cancer research data are collected in the formats of PET/CT and MRI image data, deep learning techniques are heavily involved. However, their black-box nature and expensive computational cost are notable concerns. Furthermore, the generality of existing models could be overestimated due to the non-diverse population in clinical trial datasets. Therefore, research gaps are itemized; follow-up studies should be carried out on metastatic cancer using machine learning and deep learning tools with data in a symmetric manner.

20.

Enabling Single-Cell Drug Response Annotations from Bulk RNA-Seq Using SCAD.

Zheng, Zetian; Chen, Junyi; Chen, Xingjian; Huang, Lei; Xie, Weidun; Lin, Qiuzhen; Li, Xiangtao; Wong, Ka-Chun.

Adv Sci (Weinh) ; 10(11): e2204113, 2023 04.

Article En | MEDLINE | ID: mdl-36762572

Single-cell RNA sequencing (scRNA-seq) quantifies the gene expression of individual cells, while bulk RNA sequencing (bulk RNA-seq) characterizes the mixed transcriptome of cells. The inference of drug sensitivities for individual cells can provide new insights into the mechanism of anti-cancer response heterogeneity and drug resistance at cellular resolution. However, pharmacogenomic information related to the corresponding scRNA-seq data is often limited. Therefore, a transfer learning model is proposed to infer drug sensitivities at the single-cell level. This framework learns bulk transcriptome profiles and pharmacogenomic information from population cell lines in a large public dataset and transfers the knowledge to infer the drug efficacy of individual cells. The results suggest that it is suitable to learn knowledge from pre-clinical cell lines to infer pre-existing cell subpopulations with different drug sensitivities prior to drug exposure. In addition, the model offers a new perspective on drug combinations. It is observed that a drug-resistant subpopulation can be sensitive to other drugs (e.g., a subset of JHU006 is Vorinostat-resistant but Gefitinib-sensitive); such a finding corroborates the previously reported drug combination (Gefitinib + Vorinostat) strategy in several cancer types. The identified drug sensitivity biomarkers reveal insights into tumor heterogeneity and treatment at cellular resolution.

Transcriptome , RNA-Seq/methods , Gefitinib , Vorinostat , Transcriptome/genetics , Sequence Analysis, RNA/methods