Search | BVS Integralidade em Saúde

1.

TP-LMMSG: a peptide prediction graph neural network incorporating flexible amino acid property representation.

Chen, Nanjun; Yu, Jixiang; Zhe, Liu; Wang, Fuzhou; Li, Xiangtao; Wong, Ka-Chun.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-38920345

ABSTRACT

Bioactive peptide therapeutics has been a long-standing research topic. Notably, antimicrobial peptides (AMPs) have been extensively studied for their therapeutic potential. Meanwhile, the demand for annotating other therapeutic peptides, such as antiviral peptides (AVPs) and anticancer peptides (ACPs), has also increased in recent years. However, we conceive that the structure of peptide chains and the intrinsic information between the amino acids are not fully investigated among the existing protocols. Therefore, we develop a new graph deep learning model, namely TP-LMMSG, which offers lightweight and easy-to-deploy advantages while improving the annotation performance in a generalizable manner. The results indicate that our model can accurately predict the properties of different peptides. The model surpasses the other state-of-the-art models on AMP, AVP and ACP prediction across multiple experimentally validated datasets. Moreover, TP-LMMSG also addresses the challenges of time-consuming pre-processing in graph neural network frameworks. With its flexibility in integrating heterogeneous peptide features, our model can provide substantial impacts on the screening and discovery of therapeutic peptides. The source code is available at https://github.com/NanjunChen37/TP_LMMSG.

Subjects

Amino Acids; Neural Networks, Computer; Peptides; Amino Acids/chemistry; Peptides/chemistry; Computational Biology/methods; Deep Learning; Antimicrobial Peptides/chemistry; Algorithms

2.

TransPTM: a transformer-based model for non-histone acetylation site prediction.

Meng, Lingkuan; Chen, Xingjian; Cheng, Ke; Chen, Nanjun; Zheng, Zetian; Wang, Fuzhou; Sun, Hongyan; Wong, Ka-Chun.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. There are in total 886 positive samples and 4707 negative samples for each sequence length. Second, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM has competitive performance for non-histone acetylation site prediction over three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM datasets fill the gap in non-histone acetylation site datasets and are beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.

Subjects

Neural Networks, Computer; Protein Processing, Post-Translational; Acetylation; Computational Biology/methods; Databases, Protein; Software; Algorithms; Humans; Proteins/chemistry; Proteins/metabolism
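
Illustrative note on entry 2: the TransPTM abstract describes subsets built from fixed sequence lengths (11 to 61 amino acids) around candidate acetylation sites. The sketch below shows one way such a window could be cut from a protein sequence. The function name `extract_window`, the padding character `X`, and the choice to place the site at (or just right of) the window centre are assumptions made for illustration; the abstract does not state how windows are built or padded.

```python
def extract_window(sequence: str, site_index: int, length: int, pad: str = "X") -> str:
    """Return `length` residues around `site_index` (0-based), padded at the ends.

    Hypothetical helper: the padding symbol and centring rule are assumptions,
    not details taken from the TransPTM abstract.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index outside the sequence")
    # For odd lengths the site sits exactly in the middle of the window.
    left = site_index - length // 2
    right = left + length
    # Pad wherever the window runs past either end of the protein.
    prefix = pad * max(0, -left)
    suffix = pad * max(0, right - len(sequence))
    return prefix + sequence[max(0, left):min(len(sequence), right)] + suffix


if __name__ == "__main__":
    protein = "ACDEFGHIKLMNPQ"      # toy sequence; the lysine (K) is at index 8
    print(extract_window(protein, 8, 11))
    print(extract_window(protein, 1, 11))
```

A window near a terminus is padded so that every sample has the same length, which is what a fixed-size model input would need.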

3.

Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation.

Chen, Nanjun; Yu, Jixiang; Liu, Zhe; Meng, Lingkuan; Li, Xiangtao; Wong, Ka-Chun.

Nucleic Acids Res ; 52(8): 4137-4150, 2024 May 08.

Article in English | MEDLINE | ID: mdl-38572749

ABSTRACT

DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA-DBP interactions. Therefore, the shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA-DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features altogether remain insufficient. To address it, we propose a series of methods to discover non-redundant DNA shape motifs with the generalization to multiple motifs in multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing the existing sequence motif discovery works. Additionally, our modelling captures the potential interplays across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.

Subjects

DNA; Nucleotide Motifs; DNA/chemistry; DNA/genetics; DNA/metabolism; DNA-Binding Proteins/metabolism; DNA-Binding Proteins/chemistry; DNA-Binding Proteins/genetics; Algorithms; Nucleic Acid Conformation; Chromatin Immunoprecipitation Sequencing/methods; Binding Sites; Transcription Factors/metabolism; Transcription Factors/genetics; Transcription Factors/chemistry; Humans; Protein Binding

4.

Review of single-cell RNA-seq data clustering for cell-type identification and characterization.

Zhang, Shixiong; Li, Xiangtao; Lin, Jiecong; Lin, Qiuzhen; Wong, Ka-Chun.

RNA ; 29(5): 517-530, 2023 May.

Article in English | MEDLINE | ID: mdl-36737104

ABSTRACT

In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.

Subjects

Algorithms; Single-Cell Gene Expression Analysis; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Gene Expression Profiling/methods; Cluster Analysis

5.

Deep transfer learning for clinical decision-making based on high-throughput data: comprehensive survey with benchmark results.

Toseef, Muhammad; Olayemi Petinrin, Olutomilayo; Wang, Fuzhou; Rahaman, Saifur; Liu, Zhe; Li, Xiangtao; Wong, Ka-Chun.

Brief Bioinform ; 24(4)2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37455245

ABSTRACT

The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The result shows that those transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.

Subjects

Benchmarking; Zebrafish; Animals; Mice; Zebrafish/genetics; Machine Learning; Precision Medicine; Clinical Decision-Making

6.

HE2Gene: image-to-RNA translation via multi-task learning for spatial transcriptomics data.

Chen, Xingjian; Lin, Jiecong; Wang, Yuchen; Zhang, Weitong; Xie, Weidun; Zheng, Zetian; Wong, Ka-Chun.

Bioinformatics ; 40(6)2024 Jun 03.

Article in English | MEDLINE | ID: mdl-38837395

ABSTRACT

MOTIVATION: Tissue context and molecular profiling are commonly used measures in understanding normal development and disease pathology. In recent years, the development of spatial molecular profiling technologies (e.g. spatially resolved transcriptomics) has enabled the exploration of quantitative links between tissue morphology and gene expression. However, these technologies remain expensive and time-consuming, with subsequent analyses necessitating high-throughput pathological annotations. On the other hand, existing computational tools are limited to predicting only a few dozen to several hundred genes, and the majority of the methods are designed for bulk RNA-seq. RESULTS: In this context, we propose HE2Gene, the first multi-task learning-based method capable of predicting tens of thousands of spot-level gene expressions along with pathological annotations from H&E-stained images. Experimental results demonstrate that HE2Gene is comparable to state-of-the-art methods and generalizes well on an external dataset without the need for re-training. Moreover, HE2Gene preserves the annotated spatial domains and has the potential to identify biomarkers. This capability facilitates cancer diagnosis and broadens its applicability to investigate gene-disease associations. AVAILABILITY AND IMPLEMENTATION: The source code and data information have been deposited at https://github.com/Microbiods/HE2Gene.

Subjects

Transcriptome; Humans; Gene Expression Profiling/methods; Computational Biology/methods; Machine Learning; RNA/metabolism

7.

Reducing healthcare disparities using multiple multiethnic data distributions with fine-tuning of transfer learning.

Toseef, Muhammad; Li, Xiangtao; Wong, Ka-Chun.

Brief Bioinform ; 23(3)2022 May 13.

Article in English | MEDLINE | ID: mdl-35323862

ABSTRACT

Healthcare disparities in multiethnic medical data are a major challenge; the main reason lies in the unequal data distribution of ethnic groups among data cohorts. Biomedical data collected from different cancer genome research projects may consist of mainly one ethnic group, such as people with European ancestry. In contrast, the data distribution of other ethnic groups such as African, Asian, Hispanic, and Native Americans can be less visible than the counterpart. Data inequality in the biomedical field is an important research problem, resulting in the diverse performance of machine learning models while creating healthcare disparities. Previous research has reduced the healthcare disparities only using limited data distributions. In our study, we work on fine-tuning of deep learning and transfer learning models with different multiethnic data distributions for the prognosis of 33 cancer types. In previous studies, to reduce the healthcare disparities, only a single ethnic cohort was used as the target domain with one major source domain. In contrast, we focused on multiple ethnic cohorts as the target domain in transfer learning using the TCGA and MMRF CoMMpass study datasets. After performance comparison for experiments with new data distributions, our proposed model shows promising performance for transfer learning schemes compared to the baseline approach for old and new data distribution experiments.

Subjects

Healthcare Disparities; Neoplasms; Ethnicity; Hispanic or Latino; Humans; Machine Learning; Neoplasms/genetics

8.

High-throughput single-cell RNA-seq data imputation and characterization with surrogate-assisted automated deep learning.

Li, Xiangtao; Li, Shaochuan; Huang, Lei; Zhang, Shixiong; Wong, Ka-Chun.

Brief Bioinform ; 23(1)2022 Jan 17.

Article in English | MEDLINE | ID: mdl-34553763

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) technologies have been heavily developed to probe gene expression profiles at single-cell resolution. Deep imputation methods have been proposed to address the related computational challenges (e.g. the gene sparsity in single-cell data). In particular, the neural architectures of those deep imputation models have been proven to be critical for performance. However, deep imputation architectures are difficult to design and tune for those without rich knowledge of deep neural networks and scRNA-seq. Therefore, Surrogate-assisted Evolutionary Deep Imputation Model (SEDIM) is proposed to automatically design the architectures of deep neural networks for imputing gene expression levels in scRNA-seq data without any manual tuning. Moreover, the proposed SEDIM constructs an offline surrogate model, which can accelerate the computational efficiency of the architectural search. Comprehensive studies show that SEDIM significantly improves the imputation and clustering performance compared with other benchmark methods. In addition, we also extensively explore the performance of SEDIM in other contexts and platforms including mass cytometry and metabolic profiling in a comprehensive manner. Marker gene detection, gene ontology enrichment and pathological analysis are conducted to provide novel insights into cell-type identification and the underlying mechanisms. The source code is available at https://github.com/li-shaochuan/SEDIM.

Subjects

Deep Learning; Single-Cell Analysis; Gene Expression Profiling/methods; RNA-Seq; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods

9.

DeepMotifSyn: a deep learning approach to synthesize heterodimeric DNA motifs.

Lin, Jiecong; Huang, Lei; Chen, Xingjian; Zhang, Shixiong; Wong, Ka-Chun.

Brief Bioinform ; 23(1)2022 Jan 17.

Article in English | MEDLINE | ID: mdl-34524404

ABSTRACT

The cooperativity of transcription factors (TFs) is a widespread phenomenon in the gene regulation system. However, the interaction patterns between TF binding motifs remain elusive. The recent high-throughput assays, CAP-SELEX, have identified over 600 composite DNA sites (i.e. heterodimeric motifs) bound by cooperative TF pairs. However, there are over 25 000 inferentially effective heterodimeric TFs in human cells. It is not practically feasible to validate all heterodimeric motifs due to cost and labor. We introduce DeepMotifSyn, a deep learning-based tool for synthesizing heterodimeric motifs from monomeric motif pairs. Specifically, DeepMotifSyn is composed of a heterodimeric motif generator and an evaluator. The generator is a U-Net-based neural network that can synthesize heterodimeric motifs from aligned motif pairs. The evaluator is a machine learning-based model that can score the generated heterodimeric motif candidates based on the motif sequence features. Systematic evaluations on CAP-SELEX data illustrate that DeepMotifSyn significantly outperforms the current state-of-the-art predictors. In addition, DeepMotifSyn can synthesize multiple heterodimeric motifs with different orientation and spacing settings. Such a feature can address the shortcomings of previous models. We believe DeepMotifSyn is a more practical and reliable model than current predictors on heterodimeric motif synthesis. Contact: kc.w@cityu.edu.hk.

Subjects

Deep Learning; Binding Sites/genetics; Humans; Nucleotide Motifs; Protein Binding; Transcription Factors/genetics; Transcription Factors/metabolism

10.

EGFI: drug-drug interaction extraction and generation with fusion of enriched entity and sentence information.

Huang, Lei; Lin, Jiecong; Li, Xiangtao; Song, Linqi; Zheng, Zetian; Wong, Ka-Chun.

Brief Bioinform ; 23(1)2022 Jan 17.

Article in English | MEDLINE | ID: mdl-34791012

ABSTRACT

MOTIVATION: The rapid growth in literature accumulates diverse and yet comprehensive biomedical knowledge hidden to be mined such as drug interactions. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT, which has been comprehensively pretrained on biomedical corpora. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model, BioGPT-2, where the generated sentences are selected based on filtering rules. RESULTS: We evaluated the classification part on the 'DDIs 2013' dataset and the 'DTIs' dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified them against the existing ground truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and the DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/EGFI.

Subjects

Language; Natural Language Processing; Drug Interactions; Semantics; Software

11.

HCRNet: high-throughput circRNA-binding event identification from CLIP-seq data using deep temporal convolutional network.

Yang, Yuning; Hou, Zilong; Wang, Yansong; Ma, Hongli; Sun, Pingping; Ma, Zhiqiang; Wong, Ka-Chun; Li, Xiangtao.

Brief Bioinform ; 23(2)2022 Mar 10.

Article in English | MEDLINE | ID: mdl-35189638

ABSTRACT

Identifying genome-wide binding events between circular RNAs (circRNAs) and RNA-binding proteins (RBPs) can greatly facilitate our understanding of functional mechanisms within circRNAs. Thanks to the development of cross-linked immunoprecipitation sequencing technology, large amounts of genome-wide circRNA binding event data have accumulated, providing opportunities for designing high-performance computational models to discriminate RBP interaction sites and thus to interpret the biological significance of circRNAs. Unfortunately, there are still no computational models sufficiently flexible to accommodate circRNAs from different data scales and with various degrees of feature representation. Here, we present HCRNet, a novel end-to-end framework for identification of circRNA-RBP binding events. To capture the hierarchical relationships, the multi-source biological information is fused to represent circRNAs, including various natural language sequence features. Furthermore, a deep temporal convolutional network incorporating global expectation pooling was developed to exploit the latent nucleotide dependencies in an exhaustive manner. We benchmarked HCRNet on 37 circRNA datasets and 31 linear RNA datasets to demonstrate the effectiveness of our proposed method. To further evaluate the model's robustness, we ran HCRNet on a full-length dataset containing 740 circRNAs. Results indicate that HCRNet generally outperforms existing methods. In addition, motif analyses were conducted to exhibit the interpretability of HCRNet on circRNAs. All supporting source code and data can be downloaded from https://github.com/yangyn533/HCRNet and https://doi.org/10.6084/m9.figshare.16943722.v1. The web server of HCRNet is publicly accessible at http://39.104.118.143:5001/.

Subjects

Chromatin Immunoprecipitation Sequencing; RNA, Circular; Binding Sites; RNA/genetics; RNA/metabolism; RNA-Binding Proteins/genetics; RNA-Binding Proteins/metabolism

12.

CoaDTI: multi-modal co-attention based framework for drug-target interaction annotation.

Huang, Lei; Lin, Jiecong; Liu, Rui; Zheng, Zetian; Meng, Lingkuan; Chen, Xingjian; Li, Xiangtao; Wong, Ka-Chun.

Brief Bioinform ; 23(6)2022 Nov 19.

Article in English | MEDLINE | ID: mdl-36274236

ABSTRACT

MOTIVATION: The identification of drug-target interactions (DTIs) plays a vital role in in silico drug discovery, in which the drug is the chemical molecule, and the target is the protein residues in the binding pocket. Manual DTI annotation approaches remain reliable; however, it is notoriously laborious and time-consuming to test each drug-target pair exhaustively. Recently, the rapid growth of labelled DTI data has catalysed interest in high-throughput DTI prediction. Unfortunately, those methods rely heavily on manual features defined by humans, leading to errors. RESULTS: Here, we developed an end-to-end deep learning framework called CoaDTI to significantly improve the efficiency and interpretability of drug target annotation. CoaDTI incorporates the co-attention mechanism to model the interaction information from the drug modality and protein modality. In particular, CoaDTI incorporates a transformer to learn the protein representations from raw amino acid sequences, and GraphSage to extract the molecule graph features from SMILES. Furthermore, we proposed to employ the transfer learning strategy to encode protein features with a pre-trained transformer to address the issue of scarce labelled data. The experimental results demonstrate that CoaDTI achieves competitive performance on three public datasets compared with state-of-the-art models. In addition, the transfer learning strategy further boosts the performance to an unprecedented level. The extended study reveals that CoaDTI can identify novel DTIs such as reactions between candidate drugs and severe acute respiratory syndrome coronavirus 2-associated proteins. The visualization of co-attention scores can illustrate the interpretability of our model for mechanistic insights. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/CoaDTI.

Subjects

COVID-19; Humans; Computer Simulation; Proteins/chemistry; Amino Acid Sequence; Drug Discovery/methods

13.

Chromothripsis detection with multiple myeloma patients based on deep graph learning.

Yu, Jixiang; Chen, Nanjun; Zheng, Zetian; Gao, Ming; Liang, Ning; Wong, Ka-Chun.

Bioinformatics ; 39(7)2023 Jul 01.

Article in English | MEDLINE | ID: mdl-37399092

ABSTRACT

MOTIVATION: Chromothripsis, associated with poor clinical outcomes, is prognostically vital in multiple myeloma. The catastrophic event is reported to be detectable prior to the progression of multiple myeloma. As a result, chromothripsis detection can contribute to risk estimation and early treatment guidelines for multiple myeloma patients. However, manual diagnosis remains the gold standard approach to detect chromothripsis events with the whole-genome sequencing technology to retrieve both copy number variation (CNV) and structural variation data. Meanwhile, CNV data are much easier to obtain than structural variation data. Hence, in order to reduce the reliance on human experts' efforts and structural variation data extraction, it is necessary to establish a reliable and accurate chromothripsis detection method based on CNV data. RESULTS: To address those issues, we propose a method to detect chromothripsis solely based on CNV data. With the help of structure learning, the intrinsic relationship-directed acyclic graph of CNV features is inferred to derive a CNV embedding graph (i.e. CNV-DAG). Subsequently, a neural network based on Graph Transformer, local feature extraction, and non-linear feature interaction, is proposed with the embedding graph as the input to distinguish whether the chromothripsis event occurs. Ablation experiments, clustering, and feature importance analysis are also conducted to enable the proposed model to be explained by capturing mechanistic insights. AVAILABILITY AND IMPLEMENTATION: The source code and data are freely available at https://github.com/luvyfdawnYu/CNV_chromothripsis.

Subjects

Chromothripsis; Multiple Myeloma; Humans; Multiple Myeloma/diagnosis; Multiple Myeloma/genetics; DNA Copy Number Variations; Software; Neural Networks, Computer

14.

scBGEDA: deep single-cell clustering analysis via a dual denoising autoencoder with bipartite graph ensemble clustering.

Wang, Yunhe; Yu, Zhuohan; Li, Shaochuan; Bian, Chuang; Liang, Yanchun; Wong, Ka-Chun; Li, Xiangtao.

Bioinformatics ; 39(2)2023 Feb 14.

Article in English | MEDLINE | ID: mdl-36734596

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) is an increasingly popular technique for transcriptomic analysis of gene expression at the single-cell level. Cell-type clustering is the first crucial task in the analysis of scRNA-seq data that facilitates accurate identification of cell types and the study of the characteristics of their transcripts. Recently, several computational models based on a deep autoencoder and ensemble clustering have been developed to analyze scRNA-seq data. However, current deep autoencoders are not sufficient to learn the latent representations of scRNA-seq data, and obtaining consensus partitions from these feature representations remains under-explored. RESULTS: To address this challenge, we propose a single-cell deep clustering model via a dual denoising autoencoder with bipartite graph ensemble clustering, called scBGEDA, to identify specific cell populations in single-cell transcriptome profiles. First, a single-cell dual denoising autoencoder network is proposed to project the data into a compressed low-dimensional space and to learn feature representations via explicit modeling of synergistic optimization of the zero-inflated negative binomial reconstruction loss and denoising reconstruction loss. Then, a bipartite graph ensemble clustering algorithm is designed to exploit the relationships between cells and the learned latent embedded space by means of a graph-based consensus function. Multiple comparison experiments were conducted on 20 scRNA-seq datasets from different sequencing platforms using a variety of clustering metrics. The experimental results indicated that scBGEDA outperforms other state-of-the-art methods on these datasets, and also demonstrated its scalability to large-scale scRNA-seq datasets. Moreover, scBGEDA was able to identify cell-type specific marker genes and provide functional genomic analysis by quantifying the influence of genes on cell clusters, bringing new insights into identifying cell types and characterizing the scRNA-seq data from different perspectives. AVAILABILITY AND IMPLEMENTATION: The source code of scBGEDA is available at https://github.com/wangyh082/scBGEDA. The software and the supporting data can be downloaded from https://figshare.com/articles/software/scBGEDA/19657911. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Gene Expression Profiling; Sequence Analysis, RNA/methods; Gene Expression Profiling/methods; Software; Single-Cell Analysis/methods; Cluster Analysis
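
Illustrative note on entry 14: the scBGEDA abstract mentions a zero-inflated negative binomial (ZINB) reconstruction loss. As a hedged sketch of what such a loss evaluates, the code below computes the ZINB probability mass and the negative log-likelihood for a list of counts, using the common mean/dispersion parameterisation (mean `mu`, dispersion `theta`, dropout probability `pi`). The function names and this parameterisation are assumptions; the abstract does not specify how scBGEDA parameterises or weights the loss.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial with mean mu, dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x: int, mu: float, theta: float, pi: float) -> float:
    """Mix a point mass at zero (weight pi) with the negative binomial (weight 1 - pi)."""
    if mu <= 0 or theta <= 0 or not 0.0 <= pi < 1.0:
        raise ValueError("require mu > 0, theta > 0 and 0 <= pi < 1")
    nb = math.exp(nb_log_pmf(x, mu, theta))
    zero_mass = pi if x == 0 else 0.0
    return zero_mass + (1.0 - pi) * nb


def zinb_nll(counts, mu: float, theta: float, pi: float) -> float:
    """Negative log-likelihood of observed counts; lower means a better fit."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


if __name__ == "__main__":
    observed = [0, 0, 3, 1, 0, 7]   # toy expression counts for one gene
    print(zinb_nll(observed, mu=2.0, theta=1.5, pi=0.3))
```

The extra zero component is what lets the model account for dropout events, which are frequent in scRNA-seq count matrices.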

15.

Automated exploitation of deep learning for cancer patient stratification across multiple types.

Sun, Pingping; Fan, Shijie; Li, Shaochuan; Zhao, Yingwei; Lu, Chang; Wong, Ka-Chun; Li, Xiangtao.

Bioinformatics ; 39(11)2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37934154

ABSTRACT

MOTIVATION: Recent frameworks based on deep learning have been developed to identify cancer subtypes from high-throughput gene expression profiles. Unfortunately, the performance of deep learning is highly dependent on its neural network architectures, which are often hand-crafted with expertise in deep neural networks; meanwhile, the optimization and adjustment of the network are usually costly and time-consuming. RESULTS: To address such limitations, we proposed a fully automated deep neural architecture search model for diagnosing consensus molecular subtypes from gene expression data (DNAS). The proposed model uses the ant colony algorithm, one of the heuristic swarm intelligence algorithms, to search and optimize the neural network architecture, and it can automatically find the optimal deep learning model architecture for cancer diagnosis in its search space. We validated DNAS on eight colorectal cancer datasets, achieving an average accuracy of 95.48%, an average specificity of 98.07%, and an average sensitivity of 96.24%. Without loss of generality, we further investigated the general applicability of DNAS on other cancer types from different platforms, including lung cancer and breast cancer, and DNAS achieved an area under the curve of 95% and 96%, respectively. In addition, we conducted gene ontology enrichment and pathological analysis to reveal interesting insights into cancer subtype identification and characterization across multiple cancer types. AVAILABILITY AND IMPLEMENTATION: The source code and data can be downloaded from https://github.com/userd113/DNAS-main. The web server of DNAS is publicly accessible at 119.45.145.120:5001.

Subjects

Breast Neoplasms; Deep Learning; Humans; Female; Neural Networks, Computer; Algorithms; Software
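
Illustrative note on entry 15: the DNAS abstract reports accuracy, specificity and sensitivity. For readers comparing such figures, the sketch below shows how the three metrics follow from confusion-matrix counts in the binary case. The function name `binary_metrics` and the toy counts are hypothetical; the abstract does not say how the per-class or per-dataset averages were formed, so this is not a reconstruction of the authors' evaluation.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        # Fraction of all samples that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Fraction of true positives that were detected (recall).
        "sensitivity": tp / (tp + fn),
        # Fraction of true negatives that were correctly rejected.
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Toy counts chosen for illustration only.
    for name, value in binary_metrics(tp=40, fp=5, tn=50, fn=5).items():
        print(f"{name}: {value:.4f}")
```

For a multi-class task such as subtype diagnosis, one common convention is to compute these one-vs-rest for each class and then average, which is an assumption here rather than something the abstract states.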

16.

Uncovering the ceRNA Network Related to the Prognosis of Stomach Adenocarcinoma Among 898 Patient Samples.

Liu, Zhe; Liu, Fang; Petinrin, Olutomilayo Olayemi; Wang, Fuzhou; Zhang, Yu; Wong, Ka-Chun.

Biochem Genet ; 2024 Feb 15.

Article in English | MEDLINE | ID: mdl-38361095

ABSTRACT

Stomach adenocarcinoma (STAD) patients are often associated with significantly high mortality rates and poor prognoses worldwide. Among STAD patients, competing endogenous RNAs (ceRNAs) play key roles in regulating one another at the post-transcriptional stage by competing for shared miRNAs. In this study, we aimed to elucidate the roles of lncRNAs in the ceRNA network of STAD, uncovering the molecular biomarkers for target therapy and prognosis. Specifically, a multitude of differentially expressed lncRNAs, miRNAs, and mRNAs (i.e., 898 samples in total) was collected and processed from TCGA. Cytoplasmic lncRNAs were kept for evaluating overall survival (OS) time and constructing the ceRNA network. Differentially expressed mRNAs in the ceRNA network were also investigated for functional and pathological insights. Interestingly, we identified one ceRNA network including 13 lncRNAs, 25 miRNAs, and 9 mRNAs. Among them, 13 RNAs were found related to the patient survival time; their individual risk score can be adopted for prognosis inference. Finally, we constructed a comprehensive ceRNA regulatory network for STAD and developed our own risk-scoring system that can predict the OS time of STAD patients by taking into account the above.

17.

RNCE: network integration with reciprocal neighbors contextual encoding for multi-modal drug community study on cancer targets.

Chen, Junyi; Wong, Ka-Chun.

Brief Bioinform ; 22(3)2021 05 20.

Article in English | MEDLINE | ID: mdl-32577712

ABSTRACT

Mining drug targets and mechanisms of action (MoA) for novel anticancer drugs from pharmacogenomic data is a path to enhancing drug discovery efficiency. Recent approaches have successfully attempted to discover targets/MoA by characterizing drug similarities and communities with integrative methods on multi-modal or multi-omics drug information. However, the sparse and imbalanced community size structure of the drug network is seldom considered in recent approaches. Consequently, we developed a novel network integration approach that accounts for network structure through reciprocal nearest neighbor and contextual information encoding (RNCE). In addition, we proposed a tailor-made clustering algorithm to perform drug community detection on drug networks. RNCE and spectral clustering are shown to outperform state-of-the-art approaches in a series of tests, including network similarity tests and community detection tests on two drug databases. The observed improvement of RNCE can contribute to the field of drug discovery and to related multi-modal/multi-omics integrative studies. Availability: https://github.com/WINGHARE/RNCE.

Subjects

Antineoplastic Agents/chemistry , Databases, Pharmaceutical , Drug Delivery Systems , Drug Discovery , Neoplasms/drug therapy , Humans

18.

Deep embedded clustering with multiple objectives on scRNA-seq data.

Li, Xiangtao; Zhang, Shixiong; Wong, Ka-Chun.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33822877

ABSTRACT

In recent years, single-cell RNA sequencing (scRNA-seq) technologies have been widely adopted to interrogate the gene expression of individual cells, bringing opportunities to understand the underlying processes in a high-throughput manner. Deep embedded clustering (DEC) has been demonstrated to be successful on high-dimensional sparse scRNA-seq data, performing feature learning and cluster assignment jointly to identify cell types. However, the deep network architecture for embedding clustering is not trivial to optimize. Therefore, we propose an evolutionary multiobjective DEC that uses multiobjective evolutionary optimization to simultaneously evolve the hyperparameters and architectures of DEC in an automatic manner. First, a denoising autoencoder is integrated into the DEC to project the high-dimensional sparse scRNA-seq data into a low-dimensional space. After that, to guide the evolution, three objective functions are formulated to balance the model's generality and clustering performance for robustness. Meanwhile, migration and mutation operators are proposed to optimize the objective functions and select suitable hyperparameters and architectures of DEC in the multiobjective framework. Multiple comparison analyses are conducted on twenty synthetic datasets and eight real datasets from different representative single-cell sequencing platforms to validate the effectiveness. The experimental results reveal that the proposed algorithm outperforms other state-of-the-art clustering methods under different metrics. Meanwhile, marker gene identification, gene ontology enrichment and pathology analysis are conducted to reveal novel insights into the cell type identification and characterization mechanisms.

Subjects

Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Neural Networks, Computer , RNA-Seq/methods , Single-Cell Analysis/methods , Cluster Analysis , Gene Ontology , Humans , Models, Genetic , Mutation , Reproducibility of Results

19.

Identification of haploinsufficient genes from epigenomic data using deep forest.

Yang, Yuning; Li, Shaochuan; Wang, Yunhe; Ma, Zhiqiang; Wong, Ka-Chun; Li, Xiangtao.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33454736

ABSTRACT

Haploinsufficiency, wherein a single allele is not enough to maintain normal functions, can lead to many diseases including cancers and neurodevelopmental disorders. Recently, computational methods for identifying haploinsufficiency have been developed. However, most of those computational methods suffer from study bias, experimental noise and instability, resulting in unsatisfactory identification of haploinsufficient genes. To address those challenges, we propose a deep forest model, called HaForest, to identify haploinsufficient genes. Multiscale scanning is proposed to extract local contextual representations from input features under Linear Discriminant Analysis. After that, the cascade forest structure is applied to obtain the concatenated features directly by integrating decision-tree-based forests. Meanwhile, to exploit the complex dependency structure among haploinsufficient genes, the LightGBM library is embedded into HaForest to reveal the highly expressive features. To validate the effectiveness of our method, we compared it with several computational methods and four deep learning algorithms on five epigenomic datasets. The results reveal that HaForest achieves superior performance over the other algorithms, demonstrating its unique and complementary performance in identifying haploinsufficient genes. The standalone tool is available at https://github.com/yangyn533/HaForest.

Subjects

Deep Learning , Epigenesis, Genetic , Haploinsufficiency , Neoplasms/genetics , Neurodevelopmental Disorders/genetics , Software , Alleles , Benchmarking , Decision Trees , Discriminant Analysis , Enhancer Elements, Genetic , Genome, Human , Histones/genetics , Histones/metabolism , Humans , Internet , Neoplasms/diagnosis , Neoplasms/pathology , Neurodevelopmental Disorders/diagnosis , Neurodevelopmental Disorders/pathology , Promoter Regions, Genetic

20.

Elucidating transcriptomic profiles from single-cell RNA sequencing data using nature-inspired compressed sensing.

Yu, Zhuohan; Bian, Chuang; Liu, Genggeng; Zhang, Shixiong; Wong, Ka-Chun; Li, Xiangtao.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33855366

ABSTRACT

Gene-expression profiling can define the cell state and gene-expression pattern of cells at the genetic level in a high-throughput manner. With the development of transcriptome techniques, processing high-dimensional genetic data has become a major challenge in expression profiling. Thanks to the recent widespread use of matrix decomposition methods in bioinformatics, a computational framework based on compressed sensing was adopted to reduce dimensionality. However, compressed sensing requires an optimization strategy to learn the modular dictionaries and activity levels from the low-dimensional random composite measurements in order to reconstruct the high-dimensional gene-expression data. Considering this, here we introduce and compare four compressed sensing frameworks derived from nature-inspired optimization algorithms (CSCS, ABCCS, BACS and FACS) to improve the quality of the decompression process. Several experiments establish that the three proposed methods outperform benchmark methods on nine different datasets, especially the FACS method. We therefore illustrate the robustness and convergence of FACS in various aspects; notably, time complexity and parameter analyses highlight the properties of the proposed FACS. Furthermore, differential gene-expression analysis, cell-type clustering, gene ontology enrichment and pathology analysis are conducted, which bring novel insights into cell-type identification and characterization mechanisms from different perspectives. All algorithms are written in Python and available at https://github.com/Philyzh8/Nature-inspired-CS.

Subjects

Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome , Animals , Cluster Analysis , Gene Regulatory Networks/genetics , Humans , Molecular Sequence Annotation/methods , Reproducibility of Results , Signal Transduction/genetics , Time Factors
