Results 1 - 20 of 183
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38426327

ABSTRACT

Cluster assignment is vital to analyzing single-cell RNA sequencing (scRNA-seq) data to understand high-level biological processes. Deep learning-based clustering methods have recently been widely used in scRNA-seq data analysis. However, existing deep models often overlook the interconnections and interactions among network layers, leading to the loss of structural information within the network layers. Herein, we develop a new self-supervised clustering method based on an adaptive multi-scale autoencoder, called scAMAC. The self-supervised clustering network utilizes the Multi-Scale Attention mechanism to fuse the feature information from the encoder, hidden and decoder layers of the multi-scale autoencoder, which enables the exploration of cellular correlations within the same scale and captures deep features across different scales. The self-supervised clustering network calculates the membership matrix using the fused latent features and optimizes the clustering network based on the membership matrix. scAMAC employs an adaptive feedback mechanism to supervise the parameter updates of the multi-scale autoencoder, obtaining a more effective representation of cell features. scAMAC not only enables cell clustering but also performs data reconstruction through the decoding layer. Through extensive experiments, we demonstrate that scAMAC is superior to several advanced clustering and imputation methods in both data clustering and reconstruction. In addition, scAMAC is beneficial for downstream analysis, such as cell trajectory inference. Our scAMAC model codes are freely available at https://github.com/yancy2024/scAMAC.


Subject(s)
Data Analysis , Single-Cell Gene Expression Analysis , Cluster Analysis , Sequence Analysis, RNA , Gene Expression Profiling , Algorithms
2.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38935070

ABSTRACT

Inferring gene regulatory networks (GRNs) is one of the important challenges in systems biology, and many outstanding computational methods have been proposed; however, some challenges remain, especially on real datasets. In this study, we propose a directed graph convolutional neural network-based method for GRN inference (DGCGRN). To better understand and process the directed graph structure of GRNs, a directed graph convolutional neural network is constructed that retains the structural information of the directed graph while also making full use of neighbor node features. A local augmentation strategy is adopted in the graph neural network to address the poor prediction accuracy caused by the large number of low-degree nodes in GRNs. In addition, for real data such as E. coli, sequence features are obtained by extracting hidden features using Bi-GRU and calculating the statistical physicochemical characteristics of the gene sequences. At the training stage, a dynamic update strategy converts the obtained edge prediction scores into edge weights to guide the subsequent training of the model. The results on synthetic benchmark datasets and real datasets show that the prediction performance of DGCGRN is significantly better than that of existing models. Furthermore, the case studies on bladder uroepithelial carcinoma and lung cancer cells also illustrate the performance of the proposed model.


Subject(s)
Computational Biology , Gene Regulatory Networks , Neural Networks, Computer , Humans , Computational Biology/methods , Algorithms , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/pathology , Escherichia coli/genetics
3.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38581416

ABSTRACT

The inference of gene regulatory networks (GRNs) from gene expression profiles has been a key issue in systems biology, prompting many researchers to develop diverse computational methods. However, most of these methods do not reconstruct directed GRNs with regulatory types because of the lack of benchmark datasets or defects in the computational methods. Here, we collect benchmark datasets and propose a deep learning-based model, DeepFGRN, for reconstructing fine gene regulatory networks (FGRNs) with both regulation types and directions. In addition, the GRNs of real species are always large graphs with direction and high sparsity, which impede the advancement of GRN inference. Therefore, DeepFGRN builds a node bidirectional representation module to capture the directed graph embedding representation of the GRN. Specifically, the source and target generators are designed to learn the low-dimensional dense embedding of the source and target neighbors of a gene, respectively. An adversarial learning strategy is applied to iteratively learn the real neighbors of each gene. In addition, because the expression profiles of genes with regulatory associations are correlative, a correlation analysis module is designed. Specifically, this module not only fully extracts gene expression features, but also captures the correlation between regulators and target genes. Experimental results show that DeepFGRN has a competitive capability for both GRN and FGRN inference. Potential biomarkers and therapeutic drugs for breast cancer, liver cancer, lung cancer and coronavirus disease 2019 are identified based on the candidate FGRNs, providing a possible opportunity to advance our knowledge of disease treatments.


Subject(s)
Gene Regulatory Networks , Liver Neoplasms , Humans , Systems Biology/methods , Transcriptome , Algorithms , Computational Biology/methods
4.
Brief Bioinform ; 24(2)2023 Mar 19.
Article in English | MEDLINE | ID: mdl-36715275

ABSTRACT

Many studies have used single-cell RNA sequencing (scRNA-seq) to study the diversity and biological functions of cells at the single-cell level. Clustering identifies unknown cell types, which is essential for downstream analysis of scRNA-seq samples. However, the high dimensionality, high noise and pervasive dropout of scRNA-seq samples pose a significant challenge to their cluster analysis. Herein, we propose a new adaptive fuzzy clustering model based on a denoising autoencoder and a self-attention mechanism, called scDASFK. It uses comparative learning to integrate cell similarity information into the clustering method and uses a deep denoising network module to denoise the data. scDASFK includes a self-attention mechanism for further denoising, in which an adaptive clustering optimization function for iterative clustering is implemented. To make the denoised latent features better reflect the cell structure, we introduce a new adaptive feedback mechanism that supervises the denoising process through the clustering results. Experiments on 16 real scRNA-seq datasets show that scDASFK performs well in terms of clustering accuracy, scalability and stability. Overall, scDASFK is an effective clustering model with great potential for scRNA-seq sample analysis. Our scDASFK model codes are freely available at https://github.com/LRX2022/scDASFK.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Cluster Analysis , Algorithms
5.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36592058

ABSTRACT

Progress in single-cell RNA sequencing (scRNA-seq) has produced a large amount of scRNA-seq data, which are widely used in biomedical research. The noise in the raw data and the tens of thousands of genes make it challenging to capture the real structure and effective information of scRNA-seq data. Most existing single-cell analysis methods assume that the low-dimensional embedding of the raw data follows a Gaussian distribution or lies in a low-dimensional nonlinear space without any prior information, which greatly limits the flexibility and controllability of the model. In addition, many existing methods incur a high computational cost, which makes them difficult to apply to large-scale datasets. Here, we design and develop a deep generative model named Gaussian mixture adversarial autoencoders (scGMAAE), which assumes that the low-dimensional embeddings of different types of cells follow different Gaussian distributions and integrates Bayesian variational inference and adversarial training, so as to give an interpretable latent representation of complex data and discover the statistical distribution of different types of cells. scGMAAE offers good controllability, interpretability and scalability. Therefore, it can process large-scale datasets in a short time and give competitive results. scGMAAE outperforms existing methods in several ways, including dimensionality reduction visualization, cell clustering, differential expression analysis and batch effect removal. Importantly, compared with most deep learning methods, scGMAAE requires fewer iterations to generate the best results.


Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Normal Distribution , Bayes Theorem , Single-Cell Analysis/methods , Cluster Analysis
6.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36631401

ABSTRACT

Advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data make it a major challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationships among cells, which seriously affects downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and perform cell clustering. Specifically, to characterize and learn data representations robustly, scDCCA uses a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.


Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Humans , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis
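
The scDCCA abstract above relies on a Zero-Inflated Negative Binomial (ZINB) model for count data. As a minimal sketch of what such a likelihood looks like (this is the textbook ZINB form in the mean/dispersion parameterization, not code taken from the scDCCA repository; the function and parameter names are illustrative assumptions):

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and dispersion (inverse overdispersion) theta > 0."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB model.
    pi (0 <= pi < 1) is the extra probability of a structural zero (dropout)."""
    if x == 0:
        # A zero can come from the dropout component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)
```

In an auto-encoder setting, the decoder would typically output mu, theta and pi per gene, and the negative sum of these log-probabilities would serve as the reconstruction loss.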
7.
Brief Bioinform ; 24(6)2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37861174

ABSTRACT

Antiviral peptides (AVPs) are widely found in animals and plants, with high specificity and strong sensitivity to drug-resistant viruses. However, due to the great heterogeneity of different viruses, most AVPs have specific antiviral activities. Therefore, it is necessary to identify the specific activities of AVPs against virus types. Most existing studies only identify AVPs, and only a few identify subclasses, by training multiple binary classifiers. We develop a two-stage prediction tool named FFMAVP that can simultaneously predict AVPs and their subclasses. In the first stage, we identify whether a peptide is an AVP or not. In the second stage, we predict the six virus families and eight species specifically targeted by AVPs based on two multiclass tasks. Specifically, the feature extraction module in both stages of FFMAVP adopts the same neural network structure, in which one branch extracts features based on amino acid feature descriptors and the other branch extracts sequence features. The two types of features are then fused for the subsequent task. Considering the correlation between the two tasks of the second stage, a multitask learning model is constructed to improve the effectiveness of the two multiclass tasks. In addition, to improve the effectiveness of the second stage, the network parameters trained on the first-stage data are used to initialize the network parameters in the second stage. The cross-validation results, independent test results and visualization results show that FFMAVP achieves great advantages in both stages.


Subject(s)
Algorithms , Peptides , Peptides/chemistry , Neural Networks, Computer , Machine Learning , Antiviral Agents/pharmacology , Antiviral Agents/chemistry
8.
Brief Bioinform ; 24(1)2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36611253

ABSTRACT

Although previous studies have revealed that synonymous mutations contribute to various human diseases, distinguishing deleterious synonymous mutations from benign ones is still a challenge in medical genomics. Recently, computational tools have been introduced to predict the harmfulness of synonymous mutations. However, most of these computational tools rely on balanced training sets without considering abundant negative samples that could result in deficient performance. In this study, we propose a computational model that uses a selective ensemble to predict deleterious synonymous mutations (seDSM). We construct several candidate base classifiers for the ensemble using balanced training subsets randomly sampled from the imbalanced benchmark training sets. The diversity measures of the base classifiers are calculated by the pairwise diversity metrics, and the classifiers with the highest diversities are selected for integration using soft voting for synonymous mutation prediction. We also design two strategies for filling in missing values in the imbalanced dataset and constructing models using different pairwise diversity metrics. The experimental results show that a selective ensemble based on double fault with the ensemble strategy EKNNI for filling in missing values is the most effective scheme. Finally, using 40-dimensional biology features, we propose a novel model based on a selective ensemble for predicting deleterious synonymous mutations (seDSM). seDSM outperformed other state-of-the-art methods on the independent test sets according to multiple evaluation indicators, indicating that it has an outstanding predictive performance for deleterious synonymous mutations. We hope that seDSM will be useful for studying deleterious synonymous mutations and advancing our understanding of synonymous mutations. The source code of seDSM is freely accessible at https://github.com/xialab-ahu/seDSM.git.


Subject(s)
Genomics , Silent Mutation , Humans , Genomics/methods , Software , Algorithms
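
The seDSM abstract above describes selecting base classifiers by pairwise diversity (double fault) and combining them with soft voting. The following is a minimal sketch of those two ideas under stated assumptions: the selection rule (keep the k classifiers with the lowest mean pairwise double fault) and all function names are illustrative, not taken from the seDSM source code.

```python
from itertools import combinations

def double_fault(pred_a, pred_b, truth):
    """Fraction of samples that both classifiers misclassify.
    Lower values indicate a more diverse (complementary) pair."""
    both_wrong = sum(1 for a, b, y in zip(pred_a, pred_b, truth)
                     if a != y and b != y)
    return both_wrong / len(truth)

def select_most_diverse(preds, truth, k):
    """Return indices of the k classifiers with the lowest mean
    pairwise double fault (assumed selection rule)."""
    n = len(preds)
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        df = double_fault(preds[i], preds[j], truth)
        totals[i] += df
        totals[j] += df
    mean_df = [t / (n - 1) for t in totals]
    return sorted(range(n), key=lambda i: mean_df[i])[:k]

def soft_vote(probas):
    """Average the positive-class probabilities of several classifiers.
    probas[c][s] is classifier c's probability for sample s."""
    n = len(probas)
    return [sum(col) / n for col in zip(*probas)]
```

A final label would then be obtained by thresholding the averaged probability, for example at 0.5.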
9.
Brief Bioinform ; 25(1)2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38145949

ABSTRACT

Prediction of drug-target interactions (DTIs) is essential in the field of medicine, since it helps identify molecular structures that potentially interact with drugs and facilitates the discovery and repositioning of drugs. Recently, network representation learning has attracted much attention as a way to learn rich information from heterogeneous data. Although network representation learning algorithms have achieved success in predicting DTIs, reliance on a few manually designed meta-graphs limits their capability to extract complex semantic information. To address this problem, we introduce an adaptive meta-graph-based method, termed AMGDTI, for DTI prediction. In AMGDTI, semantic information is automatically aggregated from a heterogeneous network by training an adaptive meta-graph, thereby achieving efficient information integration without requiring domain knowledge. The effectiveness of AMGDTI is verified on two benchmark datasets. Experimental results demonstrate that AMGDTI overall outperforms eight state-of-the-art methods in predicting DTIs and accurately identifies novel DTIs. It is also verified that the adaptive meta-graph is flexible and effectively captures complex fine-grained semantic information, enabling the learning of intricate heterogeneous network topology and the inference of potential drug-target relationships.


Subject(s)
Algorithms , Medicine , Benchmarking , Drug Delivery Systems , Semantics
10.
Bioinformatics ; 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39041594

ABSTRACT

MOTIVATION: In the drug development process, a significant portion of the budget and research time is dedicated to lead compound optimization in order to identify potential drugs. This procedure focuses on enhancing the pharmacological and bioactive properties of compounds by optimizing their local substructures. However, due to the vast and discrete chemical structure space and the unpredictable element combinations within this space, the optimization process is inherently complex. Various structure enumeration-based combinatorial optimization methods have shown certain advantages, but they still have limitations: they fail to consider the differences between molecules and struggle to explore the unknown outer search space. RESULTS: In this study, we propose an adaptive space search-based molecular evolution optimization algorithm (ASSMOEA). It consists of three key modules: construction of a molecule-specific search space, molecular evolutionary optimization, and adaptive expansion of the molecule-specific search space. Specifically, we design a fragment similarity tree in the molecule-specific search space and apply a dynamic mutation strategy in this space to guide molecular optimization. We then use an encoder-encoder structure to adaptively expand the space. These three modules are applied iteratively to optimize molecules. Our experiments demonstrate that ASSMOEA outperforms existing methods in terms of molecular optimization. It not only enhances the efficiency of the molecular optimization process, but also exhibits a robust ability to search for correct solutions. AVAILABILITY AND IMPLEMENTATION: The code is freely available on the web at https://github.com/bbbbb-b/MEOAFST. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.
PLoS Comput Biol ; 20(8): e1012399, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39173070

ABSTRACT

Circular RNAs (circRNAs) play vital roles in transcription and translation. Identification of circRNA-RBP (RNA-binding protein) interaction sites has become a fundamental step in molecular and cell biology. Deep learning (DL)-based methods have been proposed to predict circRNA-RBP interaction sites and have achieved impressive identification performance. However, those methods cannot effectively capture long-distance dependencies or effectively use the interaction information of multiple features. To overcome these limitations, we propose a DL-based model, iCRBP-LKHA, which uses deep hybrid networks to identify circRNA-RBP interaction sites. iCRBP-LKHA adopts five encoding schemes. Its neural network architecture, which consists of a large kernel convolutional neural network (LKCNN), a convolutional block attention module with one-dimensional convolution (CBAM-1D) and a bidirectional gated recurrent unit (BiGRU), can automatically explore local information, global context information and multi-feature interaction information. To verify the effectiveness of iCRBP-LKHA, we compared its performance with shallow learning algorithms on 37 circRNA datasets and 37 stringent circRNA datasets. We also compared its performance with state-of-the-art DL-based methods on 37 circRNA datasets, 37 stringent circRNA datasets and 31 linear RNA datasets. The experimental results not only show that iCRBP-LKHA outperforms other competing methods, but also demonstrate the potential of this model in identifying other RNA-RBP interaction sites.

12.
Brief Bioinform ; 23(5)2022 Sep 20.
Article in English | MEDLINE | ID: mdl-35988921

ABSTRACT

Neuropeptides (NPs) are a particular class of informative substances in the immune system and physiological regulation. They play a crucial role in regulating physiological functions at various stages of biological growth and development. In addition, NPs are crucial for developing new drugs for the treatment of neurological diseases. With the development of molecular biology techniques, some data-driven tools have emerged to predict NPs. However, the predictive performance of these tools still needs to be improved. In this study, we developed a deep learning model (NeuroPred-CLQ) based on the temporal convolutional network (TCN) and a multi-head attention mechanism to identify NPs effectively, translating the internal relationships of peptide sequences into numerical features with the Word2vec algorithm. The experimental results show that NeuroPred-CLQ learns data information effectively, achieving 93.6% accuracy and 98.8% AUC on the independent test set. The model performs better in identifying NPs than the state-of-the-art predictors. Visualization of features using t-distributed stochastic neighbor embedding shows that NeuroPred-CLQ can clearly distinguish the positive NPs from the negative ones. We believe NeuroPred-CLQ can facilitate drug development and clinical trial studies to treat neurological disorders.


Subject(s)
Algorithms , Neuropeptides , Neuropeptides/genetics , Peptides/chemistry
13.
Brief Bioinform ; 23(6)2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36125190

ABSTRACT

The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data make it seriously challenging for biomedical researchers to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and a tendency toward redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction in biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four public biomedical datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.


Subject(s)
Deep Learning , Data Mining/methods
14.
Brief Bioinform ; 23(2)2022 Mar 10.
Article in English | MEDLINE | ID: mdl-35136924

ABSTRACT

Rapid development of single-cell RNA sequencing (scRNA-seq) technology has allowed researchers to explore biological phenomena at the cellular scale. Clustering is a crucial and helpful step for researchers to study the heterogeneity of cells. Although many clustering methods have been proposed, massive dropout events and the curse of dimensionality still make scRNA-seq data difficult to analyze because they reduce the accuracy of clustering methods, leading to misidentification of cell types. In this work, we propose scHFC, a hybrid fuzzy clustering method optimized by natural computation and based on the Fuzzy C-Means (FCM) and Gath-Geva (GG) algorithms. Specifically, principal component analysis is used to reduce the dimensions of the scRNA-seq data after preprocessing. Then, the FCM algorithm, optimized by simulated annealing and a genetic algorithm, is applied to cluster the data and output a membership matrix, which represents the initial clustering result and is taken as the input for the GG algorithm to obtain the final clustering results. We also develop a cluster number estimation method called multi-index comprehensive estimation, which can estimate the number of clusters well by combining four clustering effectiveness indexes. The performance of scHFC is evaluated on 17 scRNA-seq datasets and compared with six state-of-the-art methods. Experimental results validate the better performance of scHFC in terms of clustering accuracy and algorithm stability. In short, scHFC is an effective method for clustering cells in scRNA-seq data, and it shows great potential for downstream analysis of scRNA-seq data. The source code is available at https://github.com/WJ319/scHFC.


Subject(s)
Single-Cell Analysis , Software , Algorithms , Cluster Analysis , Gene Expression Profiling/methods , RNA-Seq , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
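
The scHFC abstract above builds on Fuzzy C-Means, which outputs a membership matrix rather than hard labels. As a minimal sketch of the standard FCM update equations only (it omits the simulated annealing, genetic algorithm and Gath-Geva stages the abstract mentions; function names and the default fuzzifier m = 2 are assumptions):

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Return U where U[k][i] is the membership of point k in cluster i.
    Standard FCM rule: u_ki is proportional to d_ki ** (-2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    U = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        zeros = [x == 0.0 for x in d]
        if any(zeros):
            # Point coincides with one or more centers: share membership among them.
            U.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
            continue
        inv = [x ** -power for x in d]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U

def fcm_centers(points, U, m=2.0):
    """Recompute each center as the mean of all points weighted by u_ki ** m."""
    dims = len(points[0])
    centers = []
    for i in range(len(U[0])):
        w = [row[i] ** m for row in U]
        total = sum(w)
        centers.append([sum(wk * p[d] for wk, p in zip(w, points)) / total
                        for d in range(dims)])
    return centers
```

Alternating these two functions until the memberships stop changing gives plain FCM; each row of U sums to 1.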
15.
Brief Bioinform ; 23(6)2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36305457

ABSTRACT

With the development of research on the complex aetiology of many diseases, computational drug repositioning has proven to be a shortcut around costly and inefficient traditional methods. Therefore, developing more promising computational methods is indispensable for finding new candidate diseases to treat with existing drugs. In this paper, a model called GLGMPNN, which integrates a new variant of the message passing neural network and a novel gated fusion mechanism, is proposed for drug-disease association prediction. First, a light-gated message passing neural network (LGMPNN), including message passing, aggregation and updating, is proposed to separately extract multiple pieces of information from the similarity networks and the association network. Then, a gated fusion mechanism consisting of a forget gate and an output gate is applied to integrate the multiple pieces of information to a certain extent. The forget gate, calculated from the multiple embeddings, is built to integrate the association information into the similarity information. Furthermore, the final node representations are controlled by the output gate, which fuses the topology information of the networks and the initial similarity information. Finally, a bilinear decoder is adopted to reconstruct an adjacency matrix for drug-disease associations. Evaluated by 10-fold cross-validation, GLGMPNN achieves excellent performance compared with current models. Further studies show that our model can effectively discover novel drug-disease associations.


Subject(s)
Computational Biology , Neural Networks, Computer , Computational Biology/methods , Drug Repositioning/methods , Algorithms
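
The GLGMPNN abstract above ends with a bilinear decoder that reconstructs the drug-disease adjacency matrix. A minimal sketch of a generic bilinear decoder follows; it assumes the common form sigmoid(z_drug^T W z_disease) with a learned weight matrix W, which the abstract does not spell out, and all names are illustrative.

```python
import math

def bilinear_score(z_drug, W, z_disease):
    """Association probability for one pair: sigmoid(z_drug^T W z_disease)."""
    wz = [sum(w * z for w, z in zip(row, z_disease)) for row in W]
    logit = sum(a * b for a, b in zip(z_drug, wz))
    return 1.0 / (1.0 + math.exp(-logit))

def decode(Z_drug, W, Z_disease):
    """Reconstruct the full drug-by-disease score matrix from node embeddings."""
    return [[bilinear_score(zd, W, zs) for zs in Z_disease] for zd in Z_drug]
```

In training, the reconstructed scores would be compared against the known association matrix, for example with a cross-entropy loss.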
16.
PLoS Comput Biol ; 19(8): e1011344, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37651321

ABSTRACT

Accumulating evidence suggests that circRNAs play crucial roles in human diseases. CircRNA-disease association prediction is extremely helpful in understanding pathogenesis, diagnosis, and prevention, as well as in identifying relevant biomarkers. During the past few years, a large number of deep learning (DL)-based methods have been proposed for predicting circRNA-disease associations and have achieved impressive prediction performance. However, these methods have two main drawbacks. First, they underutilize biometric information in the data. Second, the features they extract do not adequately represent the association characteristics between circRNAs and diseases. In this study, we developed a novel deep learning model, named iCircDA-NEAE, to predict circRNA-disease associations. In particular, we use disease semantic similarity, Gaussian interaction profile kernel, circRNA expression profile similarity, and Jaccard similarity simultaneously for the first time, and extract hidden features based on accelerated attribute network embedding (AANE) and a dynamic convolutional autoencoder (DCAE). Experimental results on the circR2Disease dataset show that iCircDA-NEAE significantly outperforms other competing methods. In addition, 16 of the top 20 circRNA-disease pairs with the highest prediction scores were validated by relevant literature. Furthermore, we observe that iCircDA-NEAE can effectively predict new potential circRNA-disease associations.


Subject(s)
Algorithms , RNA, Circular , Humans , RNA, Circular/genetics , Semantics
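
Two of the similarity measures named in the iCircDA-NEAE abstract above, Jaccard similarity and the Gaussian interaction profile (GIP) kernel, can be computed directly from a binary association matrix. The sketch below assumes each row is one circRNA's association profile across diseases and uses the common bandwidth normalization (gamma = gamma_prime divided by the mean squared row norm); that normalization and the function names are assumptions, not details given in the abstract.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two binary association profiles."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel matrix:
    K[i][j] = exp(-gamma * ||p_i - p_j||^2)."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all profiles are empty; bandwidth is undefined")
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    return [[math.exp(-gamma * sq_dist(p, q)) for q in profiles]
            for p in profiles]
```

Transposing the association matrix before calling either function gives the corresponding disease-side similarities.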
17.
J Chem Inf Model ; 64(13): 5161-5174, 2024 Jul 08.
Article in English | MEDLINE | ID: mdl-38870455

ABSTRACT

Optimization techniques play a pivotal role in advancing drug development, serving as the foundation of numerous generative methods tailored to efficiently design optimized molecules derived from existing lead compounds. However, existing methods often encounter difficulties in generating diverse, novel, and high-property molecules that simultaneously optimize multiple drug properties. To overcome this bottleneck, we propose a multiobjective molecule optimization framework (MOMO). MOMO employs a specially designed Pareto-based multiproperty evaluation strategy at the molecular sequence level to guide the evolutionary search in an implicit chemical space. A comparative analysis of MOMO with five state-of-the-art methods across two benchmark multiproperty molecule optimization tasks reveals that MOMO markedly outperforms them in terms of diversity, novelty, and optimized properties. The practical applicability of MOMO in drug discovery has also been validated on four challenging tasks in the real-world discovery problem. These results suggest that MOMO can provide a useful tool to facilitate molecule optimization problems with multiple properties.


Subject(s)
Drug Discovery , Drug Discovery/methods , Drug Design , Algorithms
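
The MOMO abstract above guides its search with a Pareto-based multiproperty evaluation. As a minimal sketch of the underlying idea only (plain non-dominated filtering with every property maximized; MOMO's actual evaluation strategy is not described in enough detail here to reproduce, and the names are illustrative):

```python
def dominates(a, b):
    """True if a is at least as good as b on every property and
    strictly better on at least one (all properties maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(scores):
    """Return indices of candidates that no other candidate dominates."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s)
                       for j, t in enumerate(scores) if j != i)]
```

For example, with two properties per molecule, a candidate scoring (0.4, 0.4) is dropped when another scores (0.5, 0.5), while (0.9, 0.2) and (0.2, 0.9) both survive because neither is better on both properties.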
18.
BMC Bioinformatics ; 24(1): 217, 2023 May 26.
Article in English | MEDLINE | ID: mdl-37237310

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (scRNA-seq) strives to capture cellular diversity with higher resolution than bulk RNA sequencing. Clustering analysis is critical to transcriptome research as it allows for further identification and discovery of new cell types. Unsupervised clustering cannot integrate prior knowledge where relevant information is widely available. Purely unsupervised clustering algorithms may not yield biologically interpretable clusters when confronted with the high dimensionality of scRNA-seq data and frequent dropout events, which makes identification of cell types more challenging. RESULTS: We propose scSemiAAE, a semi-supervised clustering model for scRNA sequence analysis using deep generative neural networks. Specifically, scSemiAAE carefully designs a ZINB adversarial autoencoder-based architecture that inherently integrates adversarial training and semi-supervised modules in the latent space. In a series of experiments on scRNA-seq datasets spanning thousands to tens of thousands of cells, scSemiAAE can significantly improve clustering performance compared to dozens of unsupervised and semi-supervised algorithms, promoting clustering and interpretability of downstream analyses. CONCLUSION: scSemiAAE is a Python-based algorithm implemented on the VSCode platform that provides efficient visualization, clustering, and cell type assignment for scRNA-seq data. The tool is available from https://github.com/WHang98/scSemiAAE.


Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Single-Cell Analysis , Transcriptome , Sequence Analysis, RNA , Algorithms , Cluster Analysis
19.
BMC Genomics ; 24(1): 448, 2023 Aug 09.
Article in English | MEDLINE | ID: mdl-37559017

ABSTRACT

BACKGROUND: Previous studies have identified that chromosome structure plays a very important role in gene control. The transcription factor Yin Yang 1 (YY1), a multifunctional DNA-binding protein, can form a dimer to mediate chromatin loops and active enhancer-promoter interactions. The deletion of YY1 or point mutations at the YY1 binding sites significantly inhibit the enhancer-promoter interactions and affect gene expression. To date, only a few computational methods are available for identifying YY1-mediated chromatin loops. RESULTS: We propose a novel model named CapsNetYY1, based on a capsule network architecture, to identify whether a pair of YY1 motifs can form a chromatin loop. First, the DNA sequence is encoded using one-hot encoding. Second, a multi-scale convolution layer is used to extract local features of the sequence, and a bidirectional gated recurrent unit is used to learn the features across time steps. Finally, capsule networks (a convolution capsule layer and a digital capsule layer) are used to extract higher-level features and recognize YY1-mediated chromatin loops. Compared with DeepYY1, the only existing predictor of YY1-mediated chromatin loops, CapsNetYY1 achieved better performance on the independent datasets (AUC [Formula: see text]). CONCLUSION: The results indicate that CapsNetYY1 is an excellent method for identifying YY1-mediated chromatin loops. We believe that the CapsNetYY1 method will be used for predictive classification of other DNA sequences.
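
The one-hot encoding step described above can be sketched as follows. This is a minimal illustration rather than the CapsNetYY1 implementation: the function name `one_hot_encode`, the A/C/G/T column order, and the choice to map unknown bases such as N to an all-zero row are assumptions.

```python
from typing import List

# Column order of the encoding (an assumption; any fixed order works).
ALPHABET = "ACGT"


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a DNA sequence as a list of 4-element one-hot rows.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    rows = []
    for base in seq.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


encoded = one_hot_encode("ACGTN")
```

Each sequence of length n becomes an n x 4 matrix, which is the kind of input a convolution layer can scan for local sequence patterns.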


Subject(s)
Regulatory Sequences, Nucleic Acid , YY1 Transcription Factor , YY1 Transcription Factor/genetics , YY1 Transcription Factor/metabolism , Chromatin Immunoprecipitation , Promoter Regions, Genetic , Chromatin/genetics
20.
Brief Bioinform ; 22(5)2021 Sep 02.
Article in English | MEDLINE | ID: mdl-33415333

ABSTRACT

Predicting disease-related long non-coding RNAs (lncRNAs) helps identify new biomarkers for the prevention, diagnosis and treatment of complex human diseases. In this paper, we propose GAERF, a machine learning-based classification approach that identifies disease-related lncRNAs using a graph auto-encoder (GAE) and random forest (RF). First, we combined the relationships among lncRNAs, miRNAs and diseases into a heterogeneous network. Then, low-dimensional representation vectors of nodes were learned from the network by the GAE, which reduces the dimension and heterogeneity of the biological data. Taking these feature vectors as input, we trained an RF classifier to predict new lncRNA-disease associations (LDAs). Experimental results show that the proposed representation method characterizes lncRNAs and diseases accurately. GAERF achieves superior performance owing to the ensemble learning method, significantly outperforming other methods. Moreover, case studies further demonstrated that GAERF is an effective method for predicting LDAs.
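
One plausible way to assemble the heterogeneous network described above is to place the three pairwise association matrices into a single symmetric adjacency matrix, with nodes ordered as lncRNAs, then miRNAs, then diseases. The sketch below is an assumption about the construction, not the GAERF code; the function name `build_heterogeneous_adjacency` and the toy matrices are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def build_heterogeneous_adjacency(
    lnc_mi: Matrix, lnc_dis: Matrix, mi_dis: Matrix
) -> Matrix:
    """Combine three binary association matrices into one symmetric
    adjacency matrix over lncRNA, miRNA and disease nodes (in that order)."""
    n_l, n_m, n_d = len(lnc_mi), len(mi_dis), len(lnc_dis[0])
    n = n_l + n_m + n_d
    adj = [[0] * n for _ in range(n)]

    def link(i: int, j: int) -> None:
        # Undirected edge: set both directions.
        adj[i][j] = adj[j][i] = 1

    for i in range(n_l):
        for j in range(n_m):
            if lnc_mi[i][j]:
                link(i, n_l + j)
        for k in range(n_d):
            if lnc_dis[i][k]:
                link(i, n_l + n_m + k)
    for j in range(n_m):
        for k in range(n_d):
            if mi_dis[j][k]:
                link(n_l + j, n_l + n_m + k)
    return adj


# Toy example: 2 lncRNAs, 2 miRNAs, 1 disease (hypothetical associations).
adj = build_heterogeneous_adjacency(
    lnc_mi=[[1, 0], [0, 1]], lnc_dis=[[1], [0]], mi_dis=[[0], [1]]
)
```

A graph auto-encoder would take a matrix like `adj` as input and learn a low-dimensional vector per node; those vectors could then be paired up as features for a downstream classifier.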


Subject(s)
Lung Neoplasms/genetics , Machine Learning , Neural Networks, Computer , Prostatic Neoplasms/genetics , RNA, Long Noncoding/genetics , Stomach Neoplasms/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Computational Biology/methods , Computer Graphics/statistics & numerical data , Decision Trees , Gene Expression Regulation, Neoplastic , Humans , Lung Neoplasms/diagnosis , Lung Neoplasms/metabolism , Lung Neoplasms/pathology , Male , MicroRNAs/classification , MicroRNAs/genetics , MicroRNAs/metabolism , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/metabolism , Prostatic Neoplasms/pathology , RNA, Long Noncoding/classification , RNA, Long Noncoding/metabolism , ROC Curve , Risk Factors , Stomach Neoplasms/diagnosis , Stomach Neoplasms/metabolism , Stomach Neoplasms/pathology