ABSTRACT
Bioactive peptide therapeutics have been a long-standing research topic. Notably, antimicrobial peptides (AMPs) have been extensively studied for their therapeutic potential. Meanwhile, the demand for annotating other therapeutic peptides, such as antiviral peptides (AVPs) and anticancer peptides (ACPs), has also increased in recent years. However, we believe that the structure of peptide chains and the intrinsic information between amino acids have not been fully investigated by existing protocols. Therefore, we develop a new graph deep learning model, namely TP-LMMSG, which offers lightweight and easy-to-deploy advantages while improving the annotation performance in a generalizable manner. The results indicate that our model can accurately predict the properties of different peptides. The model surpasses other state-of-the-art models on AMP, AVP and ACP prediction across multiple experimentally validated datasets. Moreover, TP-LMMSG also addresses the challenge of time-consuming pre-processing in graph neural network frameworks. With its flexibility in integrating heterogeneous peptide features, our model can have a substantial impact on the screening and discovery of therapeutic peptides. The source code is available at https://github.com/NanjunChen37/TP_LMMSG.
Subject(s)
Amino Acids , Neural Networks, Computer , Peptides , Amino Acids/chemistry , Peptides/chemistry , Computational Biology/methods , Deep Learning , Antimicrobial Peptides/chemistry , Algorithms
ABSTRACT
DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA-DBP interactions. Therefore, shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA-DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features together remain insufficient. To address this, we propose a series of methods to discover non-redundant DNA shape motifs that generalize to multiple motifs in multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing the existing sequence motif discovery works. Additionally, our modelling captures the potential interplays across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.
Subject(s)
DNA , Nucleotide Motifs , DNA/chemistry , DNA/genetics , DNA/metabolism , DNA-Binding Proteins/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , Algorithms , Nucleic Acid Conformation , Chromatin Immunoprecipitation Sequencing/methods , Binding Sites , Transcription Factors/metabolism , Transcription Factors/genetics , Transcription Factors/chemistry , Humans , Protein Binding
ABSTRACT
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Subject(s)
Algorithms , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Cluster Analysis
ABSTRACT
The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The results show that these transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.
Subject(s)
Benchmarking , Zebrafish , Animals , Mice , Zebrafish/genetics , Machine Learning , Precision Medicine , Clinical Decision-Making
ABSTRACT
MOTIVATION: Spatial transcriptomics can quantify gene expression and its spatial distribution in tissues, thus revealing molecular mechanisms of cellular interactions underlying tissue heterogeneity, tissue regeneration, and spatially localized disease mechanisms. However, existing spatial clustering methods often fail to exploit the full potential of spatial information, resulting in inaccurate identification of spatial domains. RESULTS: In this paper, we develop a deep graph contrastive clustering framework, stDGCC, that accurately uncovers underlying spatial domains via explicitly modeling spatial information and gene expression profiles from spatial transcriptomics data. The stDGCC framework proposes a spatially informed graph node embedding model to preserve the topological information of spots and to learn the informative and discriminative characterization of spatial transcriptomics data through self-supervised contrastive learning. By simultaneously optimizing the contrastive learning loss, reconstruction loss, and Kullback-Leibler (KL) divergence loss, stDGCC achieves joint optimization of feature learning and topology structure preservation in an end-to-end manner. We validate the effectiveness of stDGCC on various spatial transcriptomics datasets acquired from different platforms, each with varying spatial resolutions. Our extensive experiments demonstrate the superiority of stDGCC over various state-of-the-art clustering methods in accurately identifying cellular-level biological structures. AVAILABILITY: Code and data are available from https://github.com/TimE9527/stDGCC and https://figshare.com/projects/stDGCC/186525. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
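The joint objective described above (contrastive loss, reconstruction loss and KL divergence loss optimized together) can be illustrated with a minimal sketch. This is not the stDGCC implementation: the function names `kl_divergence` and `joint_loss`, the use of discrete distributions, and the weighting scheme with `w_contrastive`, `w_reconstruction` and `w_kl` are all assumptions made here for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (the floor value is an illustrative assumption).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def joint_loss(contrastive, reconstruction, kl,
               w_contrastive=1.0, w_reconstruction=1.0, w_kl=1.0):
    """Weighted sum of three loss terms (hypothetical weights, not from the paper)."""
    return (w_contrastive * contrastive
            + w_reconstruction * reconstruction
            + w_kl * kl)


if __name__ == "__main__":
    # Toy soft cluster assignment versus a sharper target distribution.
    soft_assignment = [0.25, 0.75]
    target = [0.5, 0.5]
    kl = kl_divergence(target, soft_assignment)
    print(joint_loss(contrastive=1.0, reconstruction=2.0, kl=kl))
```

In an end-to-end setting the three terms would be computed from the same network outputs in each training step; the sketch only shows how they combine into one scalar.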
ABSTRACT
MOTIVATION: The annotation of cell types from single-cell transcriptomics is essential for understanding the biological identity and functionality of cellular populations. Although manual annotation remains the gold standard, the advent of automatic pipelines has become crucial for scalable, unbiased, and cost-effective annotations. Nonetheless, the effectiveness of these automatic methods, particularly those employing deep learning, significantly depends on the architecture of the classifier and the quality and diversity of the training datasets. RESULTS: To address these limitations, we present a Pruning-enabled Gene-Cell Net (PredGCN) incorporating a Coupled Gene-Cell Net (CGCN) to enable representation learning and information storage. PredGCN integrates a Gene Splicing Net (GSN) and a Cell Stratification Net (CSN), employing a pruning operation (PrO) to dynamically tackle the complexity of heterogeneous cell identification. Among them, GSN leverages multiple statistical and hypothesis-driven feature extraction methods to selectively assemble genes with specificity for scRNA-seq data while CSN unifies elements based on diverse region demarcation principles, exploiting the representations from GSN and precise identification from different regional homogeneity perspectives. Furthermore, we develop a multi-objective Pareto pruning operation (Pareto PrO) to expand the dynamic capabilities of CGCN, optimizing the sub-network structure for accurate cell type annotation. Multiple comparison experiments on real scRNA-seq datasets from various species have demonstrated that PredGCN surpasses existing state-of-the-art methods, including its scalability to cross-species datasets. Moreover, PredGCN can uncover unknown cell types and provide functional genomic analysis by quantifying the influence of genes on cell clusters, bringing new insights into cell type identification and characterizing scRNA-seq data from different perspectives. 
AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/IrisQi7/PredGCN and test data is available at https://figshare.com/articles/dataset/PredGCN/25251163.
Subject(s)
Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Transcriptome/genetics , Software , Molecular Sequence Annotation/methods , Animals , Humans , Gene Expression Profiling/methods , Computational Biology/methods , Algorithms
ABSTRACT
MicroRNAs (miRNAs) are vital in regulating gene expression through binding to specific target sites on messenger RNAs (mRNAs), a process closely tied to cancer pathogenesis. Identifying miRNA functional targets is essential but challenging, due to incomplete genome annotation and an emphasis on known miRNA-mRNA interactions, restricting predictions of unknown ones. To address those challenges, we have developed a deep learning model based on miRNA functional target identification, named miTDS, to investigate miRNA-mRNA interactions. miTDS first employs a scoring mechanism to eliminate unstable sequence pairs and then utilizes a dynamic word embedding model based on the transformer architecture, enabling a comprehensive analysis of miRNA-mRNA interaction sites by harnessing the global contextual associations of each nucleotide. On this basis, miTDS fuses extended seed alignment representations learned in the multi-scale attention mechanism module with dynamic semantic representations extracted in the RNA-based dual-path module, which can further elucidate and predict miRNA and mRNA functions and interactions. To validate the effectiveness of miTDS, we conducted a thorough comparison with state-of-the-art miRNA-mRNA functional target prediction methods. The evaluation, performed on a dataset cross-referenced with entries from MirTarbase and Diana-TarBase, revealed that miTDS surpasses current methods in accurately predicting functional targets. In addition, our model exhibited proficiency in identifying A-to-I RNA editing sites, which represents an aberrant interaction that yields valuable insights into the suppression of cancerous processes.
Subject(s)
Deep Learning , MicroRNAs , MicroRNAs/genetics , RNA, Messenger/genetics , Nucleotides , RNA Editing
ABSTRACT
Healthcare disparities in multiethnic medical data are a major challenge; the main reason lies in the unequal data distribution of ethnic groups among data cohorts. Biomedical data collected from different cancer genome research projects may consist of mainly one ethnic group, such as people with European ancestry. In contrast, other ethnic groups such as African, Asian, Hispanic, and Native Americans can be far less represented. Data inequality in the biomedical field is an important research problem, resulting in the diverse performance of machine learning models while creating healthcare disparities. Previous studies have reduced healthcare disparities using only limited data distributions. In our study, we work on fine-tuning deep learning and transfer learning models with different multiethnic data distributions for the prognosis of 33 cancer types. In previous studies, to reduce healthcare disparities, only a single ethnic cohort was used as the target domain with one major source domain. In contrast, we focused on multiple ethnic cohorts as the target domain in transfer learning using the TCGA and MMRF CoMMpass study datasets. After performance comparison for experiments with new data distributions, our proposed model shows promising performance for transfer learning schemes compared to the baseline approach for old and new data distribution experiments.
Subject(s)
Healthcare Disparities , Neoplasms , Ethnicity , Hispanic or Latino , Humans , Machine Learning , Neoplasms/genetics
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) technologies have been heavily developed to probe gene expression profiles at single-cell resolution. Deep imputation methods have been proposed to address the related computational challenges (e.g. the gene sparsity in single-cell data). In particular, the neural architectures of those deep imputation models have been proven to be critical for performance. However, deep imputation architectures are difficult to design and tune for those without rich knowledge of deep neural networks and scRNA-seq. Therefore, Surrogate-assisted Evolutionary Deep Imputation Model (SEDIM) is proposed to automatically design the architectures of deep neural networks for imputing gene expression levels in scRNA-seq data without any manual tuning. Moreover, the proposed SEDIM constructs an offline surrogate model, which can accelerate the computational efficiency of the architectural search. Comprehensive studies show that SEDIM significantly improves the imputation and clustering performance compared with other benchmark methods. In addition, we also extensively explore the performance of SEDIM in other contexts and platforms including mass cytometry and metabolic profiling in a comprehensive manner. Marker gene detection, gene ontology enrichment and pathological analysis are conducted to provide novel insights into cell-type identification and the underlying mechanisms. The source code is available at https://github.com/li-shaochuan/SEDIM.
Subject(s)
Deep Learning , Single-Cell Analysis , Gene Expression Profiling/methods , RNA-Seq , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
ABSTRACT
MOTIVATION: The rapidly growing literature accumulates diverse yet comprehensive biomedical knowledge, such as drug interactions, that remains hidden and waiting to be mined. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT, which has been comprehensively pretrained on a biomedical corpus. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model, BioGPT-2, where the generated sentences are selected based on filtering rules. RESULTS: We evaluated the classification part on the 'DDIs 2013' dataset and the 'DTIs' dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified the filtered sentences against the existing ground truth. The generated sentences that are not recorded in DrugBank and the DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/EGFI.
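For readers unfamiliar with the F1 scores reported above, the following is a minimal sketch of how an F1 score is computed from counts of true positives, false positives and false negatives. The function name `f1_score_from_counts` and the example counts are hypothetical; the abstract does not state whether the reported scores are micro- or macro-averaged, so this shows only the single-class formula.

```python
def f1_score_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for one class from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    precision, recall, f1 = f1_score_from_counts(tp=8, fp=2, fn=4)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which is why it is a common summary metric for relation classification with imbalanced classes.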
Subject(s)
Language , Natural Language Processing , Drug Interactions , Semantics , Software
ABSTRACT
MOTIVATION: The identification of drug-target interactions (DTIs) plays a vital role for in silico drug discovery, in which the drug is the chemical molecule, and the target is the protein residues in the binding pocket. Manual DTI annotation approaches remain reliable; however, it is notoriously laborious and time-consuming to test each drug-target pair exhaustively. Recently, the rapid growth of labelled DTI data has catalysed interest in high-throughput DTI prediction. Unfortunately, those methods rely heavily on manual features defined by humans, leading to errors. RESULTS: Here, we developed an end-to-end deep learning framework called CoaDTI to significantly improve the efficiency and interpretability of drug target annotation. CoaDTI incorporates the Co-attention mechanism to model the interaction information from the drug modality and protein modality. In particular, CoaDTI incorporates a transformer to learn the protein representations from raw amino acid sequences, and GraphSage to extract the molecule graph features from SMILES. Furthermore, we proposed to employ the transfer learning strategy to encode protein features with a pre-trained transformer to address the issue of scarce labelled data. The experimental results demonstrate that CoaDTI achieves competitive performance on three public datasets compared with state-of-the-art models. In addition, the transfer learning strategy further boosts the performance to an unprecedented level. The extended study reveals that CoaDTI can identify novel DTIs such as reactions between candidate drugs and severe acute respiratory syndrome coronavirus 2-associated proteins. The visualization of co-attention scores can illustrate the interpretability of our model for mechanistic insights. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/CoaDTI.
Subject(s)
COVID-19 , Humans , Computer Simulation , Proteins/chemistry , Amino Acid Sequence , Drug Discovery/methods
ABSTRACT
Identifying genome-wide binding events between circular RNAs (circRNAs) and RNA-binding proteins (RBPs) can greatly facilitate our understanding of functional mechanisms within circRNAs. Thanks to the development of cross-linked immunoprecipitation sequencing technology, large amounts of genome-wide circRNA binding event data have accumulated, providing opportunities for designing high-performance computational models to discriminate RBP interaction sites and thus to interpret the biological significance of circRNAs. Unfortunately, there are still no computational models sufficiently flexible to accommodate circRNAs from different data scales and with various degrees of feature representation. Here, we present HCRNet, a novel end-to-end framework for identification of circRNA-RBP binding events. To capture the hierarchical relationships, the multi-source biological information is fused to represent circRNAs, including various natural language sequence features. Furthermore, a deep temporal convolutional network incorporating global expectation pooling was developed to exploit the latent nucleotide dependencies in an exhaustive manner. We benchmarked HCRNet on 37 circRNA datasets and 31 linear RNA datasets to demonstrate the effectiveness of our proposed method. To further evaluate the model's robustness, we ran HCRNet on a full-length dataset containing 740 circRNAs. Results indicate that HCRNet generally outperforms existing methods. In addition, motif analyses were conducted to exhibit the interpretability of HCRNet on circRNAs. All supporting source code and data can be downloaded from https://github.com/yangyn533/HCRNet and https://doi.org/10.6084/m9.figshare.16943722.v1. The web server of HCRNet is publicly accessible at http://39.104.118.143:5001/.
Subject(s)
Chromatin Immunoprecipitation Sequencing , RNA, Circular , Binding Sites , RNA/genetics , RNA/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism
ABSTRACT
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) is an increasingly popular technique for transcriptomic analysis of gene expression at the single-cell level. Cell-type clustering is the first crucial task in the analysis of scRNA-seq data that facilitates accurate identification of cell types and the study of the characteristics of their transcripts. Recently, several computational models based on a deep autoencoder and ensemble clustering have been developed to analyze scRNA-seq data. However, current deep autoencoders are not sufficient to learn the latent representations of scRNA-seq data, and obtaining consensus partitions from these feature representations remains under-explored. RESULTS: To address this challenge, we propose a single-cell deep clustering model via a dual denoising autoencoder with bipartite graph ensemble clustering, called scBGEDA, to identify specific cell populations in single-cell transcriptome profiles. First, a single-cell dual denoising autoencoder network is proposed to project the data into a compressed low-dimensional space and to learn feature representations by jointly optimizing the zero-inflated negative binomial reconstruction loss and the denoising reconstruction loss. Then, a bipartite graph ensemble clustering algorithm is designed to exploit the relationships between cells and the learned latent embedded space by means of a graph-based consensus function. Multiple comparison experiments were conducted on 20 scRNA-seq datasets from different sequencing platforms using a variety of clustering metrics. The experimental results indicated that scBGEDA outperforms other state-of-the-art methods on these datasets, and also demonstrated its scalability to large-scale scRNA-seq datasets.
Moreover, scBGEDA was able to identify cell-type specific marker genes and provide functional genomic analysis by quantifying the influence of genes on cell clusters, bringing new insights into identifying cell types and characterizing the scRNA-seq data from different perspectives. AVAILABILITY AND IMPLEMENTATION: The source code of scBGEDA is available at https://github.com/wangyh082/scBGEDA. The software and the supporting data can be downloaded from https://figshare.com/articles/software/scBGEDA/19657911. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Gene Expression Profiling , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Software , Single-Cell Analysis/methods , Cluster Analysis
ABSTRACT
MOTIVATION: Recent frameworks based on deep learning have been developed to identify cancer subtypes from high-throughput gene expression profiles. Unfortunately, the performance of deep learning is highly dependent on its neural network architectures, which are often hand-crafted with expertise in deep neural networks; meanwhile, the optimization and adjustment of the network are usually costly and time-consuming. RESULTS: To address such limitations, we proposed a fully automated deep neural architecture search model for diagnosing consensus molecular subtypes from gene expression data (DNAS). The proposed model uses the ant colony algorithm, one of the heuristic swarm intelligence algorithms, to search and optimize neural network architecture, and it can automatically find the optimal deep learning model architecture for cancer diagnosis in its search space. We validated DNAS on eight colorectal cancer datasets, achieving an average accuracy of 95.48%, an average specificity of 98.07%, and an average sensitivity of 96.24%. Without loss of generality, we further investigated the general applicability of DNAS on other cancer types from different platforms including lung cancer and breast cancer, and DNAS achieved an area under the curve of 95% and 96%, respectively. In addition, we conducted gene ontology enrichment and pathological analysis to reveal interesting insights into cancer subtype identification and characterization across multiple cancer types. AVAILABILITY AND IMPLEMENTATION: The source code and data can be downloaded from https://github.com/userd113/DNAS-main. The web server of DNAS is publicly accessible at 119.45.145.120:5001.
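The accuracy, specificity and sensitivity figures quoted above follow standard confusion-matrix definitions. The sketch below assumes a binary (one-vs-rest) setting with labels 1 and 0; how DNAS averages these metrics across subtypes and datasets is not stated in the abstract, and the function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for 0/1 labels.

    sensitivity = TP / (TP + FN)  (true positive rate)
    specificity = TN / (TN + FP)  (true negative rate)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))
```

For a multi-class problem such as subtype diagnosis, one common approach is to compute these values per class in a one-vs-rest fashion and then average them.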
Subject(s)
Breast Neoplasms , Deep Learning , Humans , Female , Neural Networks, Computer , Algorithms , Software
ABSTRACT
Visible light-driven photocatalytic deracemization is highly esteemed as an ideal tool for organic synthesis due to its exceptional atom economy and synthetic efficiency. Consequently, successful instances of deracemization of allenes have been established, where the activated energy of photosensitizer should surpass that of the substrates, representing an intrinsic requirement. Accordingly, this method is not applicable for axially chiral molecules with significantly high triplet energies. In this study, we present a photoredox catalytic deracemization approach that enables the efficient synthesis of valuable yet challenging-to-access axially chiral 2-azaarene-functionalized quinazolinones. The substrate scope is extensive, allowing for both 3-axis and unmet 1-axis assembly through facile oxidation of diverse central chiral 2,3-dihydroquinazolin-4(1H)-ones that can be easily prepared and achieve enantiomer enrichment via deracemization. Mechanistic studies reveal the importance of photosensitizer selection in attaining excellent chemoselectivity and highlight the indispensability of a chiral Brønsted acid in enabling highly enantioselective protonation to accomplish efficient deracemization.
ABSTRACT
In recent years, single-cell RNA sequencing (scRNA-seq) technologies have been widely adopted to interrogate gene expression of individual cells, bringing opportunities to understand the underlying processes in a high-throughput manner. Deep embedded clustering (DEC) has been shown to be successful on high-dimensional sparse scRNA-seq data by jointly performing feature learning and cluster assignment to identify cell types. However, the deep network architecture for embedding clustering is not trivial to optimize. Therefore, we propose an evolutionary multiobjective DEC that uses multiobjective evolutionary optimization to simultaneously evolve the hyperparameters and architectures of DEC in an automatic manner. Firstly, a denoising autoencoder is integrated into the DEC to project the high-dimensional sparse scRNA-seq data into a low-dimensional space. After that, to guide the evolution, three objective functions are formulated to balance the model's generality and clustering performance for robustness. Meanwhile, migration and mutation operators are proposed to optimize the objective functions to select the suitable hyperparameters and architectures of DEC in the multiobjective framework. Multiple comparison analyses are conducted on twenty synthetic datasets and eight real datasets from different representative single-cell sequencing platforms to validate the effectiveness. The experimental results reveal that the proposed algorithm outperforms other state-of-the-art clustering methods under different metrics. Meanwhile, marker gene identification, gene ontology enrichment and pathology analysis are conducted to reveal novel insights into the cell type identification and characterization mechanisms.
Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , Neural Networks, Computer , RNA-Seq/methods , Single-Cell Analysis/methods , Cluster Analysis , Gene Ontology , Humans , Models, Genetic , Mutation , Reproducibility of Results
ABSTRACT
Gene-expression profiling can define the cell state and gene-expression pattern of cells at the genetic level in a high-throughput manner. With the development of transcriptome techniques, processing high-dimensional genetic data has become a major challenge in expression profiling. Thanks to the recent widespread use of matrix decomposition methods in bioinformatics, a computational framework based on compressed sensing was adopted to reduce dimensionality. However, compressed sensing requires an optimization strategy to learn the modular dictionaries and activity levels from the low-dimensional random composite measurements to reconstruct the high-dimensional gene-expression data. Considering this, here we introduce and compare four compressed sensing frameworks derived from nature-inspired optimization algorithms (CSCS, ABCCS, BACS and FACS) to improve the quality of the decompression process. Several experiments establish that the three proposed methods outperform benchmark methods on nine different datasets, especially the FACS method. We therefore illustrate the robustness and convergence of FACS in various aspects; notably, time complexity and parameter analyses highlight properties of our proposed FACS. Furthermore, differential gene-expression analysis, cell-type clustering, gene ontology enrichment and pathology analysis are conducted, which bring novel insights into cell-type identification and characterization mechanisms from different perspectives. All algorithms are written in Python and available at https://github.com/Philyzh8/Nature-inspired-CS.
Subject(s)
Algorithms , Computational Biology/methods , Gene Expression Profiling/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Transcriptome , Animals , Cluster Analysis , Gene Regulatory Networks/genetics , Humans , Molecular Sequence Annotation/methods , Reproducibility of Results , Signal Transduction/genetics , Time Factors
ABSTRACT
Mitochondria are membrane-bound organelles containing over 1000 different proteins involved in mitochondrial function, gene expression and metabolic processes. Accurate localization of those proteins in the mitochondrial compartments is critical to their operation. A few computational methods have been developed for predicting submitochondrial localization from the protein sequences. Unfortunately, most of these computational methods focus on employing biological features or evolutionary information to extract sequence features, which greatly limits the performance of subsequent identification. Moreover, the efficiency of most computational models is still underexplored, especially for deep learning features, which are promising but require improvement. To address these limitations, we propose a novel computational method called iDeepSubMito to predict the location of mitochondrial proteins within the submitochondrial compartments. First, we adopted a coding scheme using the ProteinELMo to model the probability distribution over the protein sequences and then represent the protein sequences as continuous vectors. Then, we proposed and implemented a convolutional neural network architecture based on the bidirectional LSTM with a self-attention mechanism, to effectively explore the contextual information and protein sequence semantic features. To demonstrate the effectiveness of our proposed iDeepSubMito, we performed cross-validation on two datasets containing 424 proteins and 570 proteins respectively, and consisting of four different mitochondrial compartments (matrix, inner membrane, outer membrane and intermembrane regions). Experimental results revealed that our method outperformed other computational methods. In addition, we tested iDeepSubMito on the M187, M983 and MitoCarta3.0 datasets to further verify the efficiency of our method.
Finally, the motif analysis and the interpretability analysis were conducted to reveal novel insights into subcellular biological functions of mitochondrial proteins. iDeepSubMito source code is available on GitHub at https://github.com/houzl3416/iDeepSubMito.
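As an illustration of the self-attention idea mentioned in the abstract above, the following is a minimal sketch of scaled dot-product self-attention over per-residue vectors. It is not the iDeepSubMito implementation: the function names are hypothetical, there are no learned projection matrices, and the toy input stands in for the BiLSTM outputs.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Scaled dot-product self-attention without learned projections.

    `states` is a list of equal-length vectors, one per sequence position
    (hypothetical stand-ins for per-residue hidden states).
    Returns the context vectors and the attention weight matrix.
    """
    dim = len(states[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in states:
        # Similarity of this position to every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in states]
        weights = softmax(scores)
        # Weighted average of all position vectors.
        context = [sum(w * value[d] for w, value in zip(weights, states))
                   for d in range(dim)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Toy usage: three positions with two-dimensional vectors.
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_outputs, toy_weights = self_attention(toy_states)
```

Each output vector is a mixture of all positions, weighted by similarity, which is how attention lets a position draw on context elsewhere in the sequence.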
Subject(s)
Deep Learning , Mitochondrial Proteins/metabolism , Submitochondrial Particles/metabolism , Algorithms , Datasets as Topic , Neural Networks, Computer , Protein Transport
ABSTRACT
Haploinsufficiency, wherein a single functional allele is not enough to maintain normal function, can lead to many diseases including cancers and neurodevelopmental disorders. Recently, computational methods for identifying haploinsufficiency have been developed. However, most of those methods suffer from study bias, experimental noise and instability, resulting in unsatisfactory identification of haploinsufficient genes. To address those challenges, we propose a deep forest model, called HaForest, to identify haploinsufficient genes. Multiscale scanning is proposed to extract local contextual representations from input features under Linear Discriminant Analysis. After that, a cascade forest structure is applied to obtain the concatenated features directly by integrating decision-tree-based forests. Meanwhile, to exploit the complex dependency structure among haploinsufficient genes, the LightGBM library is embedded into HaForest to reveal the highly expressive features. To validate the effectiveness of our method, we compared it to several computational methods and four deep learning algorithms on five epigenomic data sets. The results reveal that HaForest achieves superior performance over the other algorithms, demonstrating its unique and complementary performance in identifying haploinsufficient genes. The standalone tool is available at https://github.com/yangyn533/HaForest.
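The multiscale scanning step described above can be pictured as sliding windows of several sizes over a feature vector. The sketch below shows only that windowing idea, under assumptions: the function names and window sizes are hypothetical, and the Linear Discriminant Analysis step, the cascade forest and LightGBM are all omitted, so this should not be read as the HaForest code.

```python
def sliding_windows(features, window, stride=1):
    """Return every contiguous sub-vector of length `window`."""
    last_start = len(features) - window
    return [features[i:i + window] for i in range(0, last_start + 1, stride)]


def multiscale_scan(features, window_sizes, stride=1):
    """Scan one feature vector at several window sizes.

    Returns a dict mapping each window size to its list of local
    sub-vectors; sizes longer than the vector are skipped.
    """
    return {
        w: sliding_windows(features, w, stride)
        for w in window_sizes
        if w <= len(features)
    }


# Toy usage: a five-element feature vector scanned at two scales.
toy_scans = multiscale_scan([1, 2, 3, 4, 5], window_sizes=[2, 4])
```

In a deep forest pipeline, each sub-vector would typically be passed to a model that produces a local representation, and the representations from all scales would be concatenated before the cascade.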
Subject(s)
Deep Learning , Epigenesis, Genetic , Haploinsufficiency , Neoplasms/genetics , Neurodevelopmental Disorders/genetics , Software , Alleles , Benchmarking , Decision Trees , Discriminant Analysis , Enhancer Elements, Genetic , Genome, Human , Histones/genetics , Histones/metabolism , Humans , Internet , Neoplasms/diagnosis , Neoplasms/pathology , Neurodevelopmental Disorders/diagnosis , Neurodevelopmental Disorders/pathology , Promoter Regions, Genetic
ABSTRACT
The identification of hidden responders is often an essential challenge in precision oncology. A recent machine learning approach has been proposed for classifying aberrant pathway activity from multiomic cancer data. However, we note several critical limitations there, such as high dimensionality, data sparsity and limited model performance. Given the central importance and broad impact of precision oncology, we propose nature-inspired deep Ras activation pan-cancer (NatDRAP), a deep neural network (DNN) model, to address those restrictions for the identification of hidden responders. In this study, we develop a nature-inspired deep learning model that integrates bulk RNA sequencing, copy number and mutation data from PanCanAtlas to detect pan-cancer Ras pathway activation. In NatDRAP, we propose to synergize the nature-inspired artificial bee colony algorithm with different gradient-based optimizers in one framework for optimizing DNNs in a collaborative manner. Multiple experiments were conducted on 33 different cancer types across PanCanAtlas. The experimental results demonstrate that the proposed NatDRAP can provide superior performance over other benchmark methods with strong robustness towards diagnosing RAS aberrant pathway activity across different cancer types. In addition, gene ontology enrichment and pathological analysis are conducted to reveal novel insights into the RAS aberrant pathway activity identification and characterization. NatDRAP is written in Python and available at https://github.com/lixt314/NatDRAP1.
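To make the idea of pairing an artificial bee colony with gradient-based optimization more concrete, here is a toy sketch that alternates a bee-style perturbation with a plain gradient-descent step on a quadratic objective. Everything in it is an assumption for illustration: the objective stands in for a network loss, the names and hyperparameters are hypothetical, the onlooker phase is omitted, and it does not reproduce how NatDRAP combines the two.

```python
import random


def loss(w):
    """Toy objective standing in for a training loss (minimum at w = 3)."""
    return sum((wi - 3.0) ** 2 for wi in w)


def grad(w):
    """Analytic gradient of the toy objective."""
    return [2.0 * (wi - 3.0) for wi in w]


def hybrid_search(n_sources=6, dim=2, rounds=30, lr=0.1, limit=5, seed=0):
    rng = random.Random(seed)
    sources = [[rng.uniform(-10, 10) for _ in range(dim)] for _ in range(n_sources)]
    trials = [0] * n_sources
    best = min(sources, key=loss)[:]
    history = [loss(best)]
    for _ in range(rounds):
        for i in range(n_sources):
            # Bee step: shift one coordinate relative to a random partner.
            partner = rng.choice([j for j in range(n_sources) if j != i])
            d = rng.randrange(dim)
            candidate = sources[i][:]
            candidate[d] += rng.uniform(-1, 1) * (sources[i][d] - sources[partner][d])
            # Gradient step: refine the perturbed candidate locally.
            candidate = [c - lr * g for c, g in zip(candidate, grad(candidate))]
            # Greedy selection: keep the candidate only if it improves.
            if loss(candidate) < loss(sources[i]):
                sources[i], trials[i] = candidate, 0
            else:
                trials[i] += 1
            # Scout step: abandon a source that has stopped improving.
            if trials[i] > limit:
                sources[i] = [rng.uniform(-10, 10) for _ in range(dim)]
                trials[i] = 0
        round_best = min(sources, key=loss)
        if loss(round_best) < loss(best):
            best = round_best[:]
        history.append(loss(best))
    return best, history


best_weights, loss_history = hybrid_search()
```

The population search supplies diverse starting points, while the gradient step handles local refinement; the recorded best loss never increases because a new best is accepted only on improvement.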