Results 1 - 20 of 558
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38754408

ABSTRACT

MOTIVATION: The technology for analyzing single-cell multi-omics data has advanced rapidly and has provided comprehensive and accurate cellular information by exploring cell heterogeneity in genomics, transcriptomics, epigenomics, metabolomics and proteomics data. However, because of the high-dimensional and sparse characteristics of single-cell multi-omics data, as well as the limitations of various analysis algorithms, the clustering performance is generally poor. Matrix factorization is an unsupervised, dimensionality reduction-based method that can cluster individuals and discover related omics variables from different blocks. Here, we present a novel algorithm that performs joint dimensionality reduction learning and cell clustering analysis on single-cell multi-omics data using non-negative matrix factorization that we named scMNMF. We formulate the objective function of joint learning as a constrained optimization problem and derive the corresponding iterative formulas through alternating iterative algorithms. The major advantage of the scMNMF algorithm is its capability to explore hidden related features among omics data. Additionally, the feature selection for dimensionality reduction and cell clustering mutually influence each other iteratively, leading to a more effective discovery of cell types. We validated the performance of the scMNMF algorithm using two simulated and five real datasets. The results show that scMNMF outperformed seven other state-of-the-art algorithms in various measurements. AVAILABILITY AND IMPLEMENTATION: scMNMF code can be found at https://github.com/yushanqiu/scMNMF.


Subjects
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Humans , Genomics/methods , Computational Biology/methods , Proteomics/methods , Metabolomics/methods , Epigenomics/methods , Multiomics
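
The scMNMF abstract above describes clustering cells with non-negative matrix factorization solved by alternating iterative updates. As a rough illustration only, the sketch below runs plain single-matrix NMF with the classic multiplicative update rules and assigns each cell to its dominant factor. It is not the scMNMF objective, which adds joint multi-omics terms and constraints that the abstract does not spell out; the function names, the features-by-cells orientation and the toy matrix in the usage note are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, n_iter=500, seed=0, eps=1e-9):
    """Plain NMF (X ~ WH) via multiplicative updates.

    x is assumed to be features x cells, so H is rank x cells.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    # Strictly positive random initialisation keeps the updates well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def cluster_labels(h):
    """Assign each cell (column of H) to the factor with the largest weight."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]
```

For a toy matrix such as `[[5, 5, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]` with `rank=2`, `cluster_labels(nmf(x, 2)[1])` should group the first two cells together and the last two together; which label number each group receives depends on the random initialisation.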
2.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39373051

ABSTRACT

Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.


Subjects
Algorithms , RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Cluster Analysis , Humans , Deep Learning , Sequence Analysis, RNA/methods , Transcriptome , Gene Expression Profiling/methods , Computational Biology/methods , Animals , Single-Cell Gene Expression Analysis
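
The scDFN abstract above ranks clustering methods by NMI and ARI. As a reminder of what the second metric measures, here is a minimal sketch of the Adjusted Rand Index computed from the contingency counts of two labelings. It follows the standard textbook definition and is not taken from the scDFN code; the function name is chosen for this example, and NMI is not covered.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Pairs of items that share a cluster in both labelings.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of items that share a cluster in each labeling separately.
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = in_true * in_pred / total_pairs
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (both - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the cluster labels are named, close to 0 for random agreement, and can be negative when the two partitions agree less than chance would predict.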
3.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39434494

ABSTRACT

Liquid-liquid phase separation (LLPS) is one of the mechanisms mediating the compartmentalization of macromolecules (proteins and nucleic acids) in cells, forming biomolecular condensates or membraneless organelles. Consequently, the systematic identification of potential LLPS proteins is crucial for understanding the phase separation process and its biological mechanisms. A two-task predictor, Opt_PredLLPS, was developed to discover potential phase separation proteins and further evaluate their mechanism. The first task model of Opt_PredLLPS combines a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) through a fully connected layer, where the CNN utilizes evolutionary information features as input, and BiLSTM utilizes multimodal features as input. If a protein is predicted to be an LLPS protein, it is input into the second task model to predict whether this protein needs to interact with its partners to undergo LLPS. The second task model employs the XGBoost classification algorithm and 37 physicochemical properties following a three-step feature selection. The effectiveness of the model was validated on multiple benchmark datasets, and in silico saturation mutagenesis was used to identify regions that play a key role in phase separation. These findings may assist future research on the LLPS mechanism and the discovery of potential phase separation proteins.


Subjects
Neural Networks, Computer , Proteins , Proteins/chemistry , Proteins/metabolism , Algorithms , Computational Biology/methods , Biomolecular Condensates/metabolism , Biomolecular Condensates/chemistry , Phase Separation
4.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38446739

ABSTRACT

Antimicrobial peptides (AMPs), short peptides with diverse functions, effectively target and combat various organisms. The widespread misuse of chemical antibiotics has led to increasing microbial resistance. Due to their low drug resistance and toxicity, AMPs are considered promising substitutes for traditional antibiotics. While existing deep learning technology enhances AMP generation, it also presents certain challenges. Firstly, AMP generation overlooks the complex interdependencies among amino acids. Secondly, current models fail to integrate crucial tasks like screening, attribute prediction and iterative optimization. Consequently, we develop an integrated deep learning framework, Diff-AMP, that automates AMP generation, identification, attribute prediction and iterative optimization. We innovatively integrate kinetic diffusion and attention mechanisms into the reinforcement learning framework for efficient AMP generation. Additionally, our prediction module incorporates pre-training and transfer learning strategies for precise AMP identification and screening. We employ a convolutional neural network for multi-attribute prediction and a reinforcement learning-based iterative optimization strategy to produce diverse AMPs. This framework automates molecule generation, screening, attribute prediction and optimization, thereby advancing AMP research. We have also deployed Diff-AMP on a web server, with code, data and server details available in the Data Availability section.


Subjects
Amino Acids , Antimicrobial Peptides , Anti-Bacterial Agents , Diffusion , Kinetics
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38366802

ABSTRACT

Anti-coronavirus peptides (ACVPs) represent a relatively novel approach to inhibiting the adsorption and fusion of the virus with human cells. Several peptide-based inhibitors showed promise as potential therapeutic drug candidates. However, identifying such peptides in laboratory experiments is both costly and time consuming. Therefore, there is growing interest in using computational methods to predict ACVPs. Here, we describe a model for the prediction of ACVPs that is based on the combination of feature engineering (FE) optimization and deep representation learning. FEOpti-ACVP was pre-trained using two feature extraction frameworks. At the next step, several machine learning approaches were tested to construct the final algorithm. The final version of FEOpti-ACVP outperformed existing methods used for ACVP prediction and it has the potential to become a valuable tool in ACVP drug design. A user-friendly webserver of FEOpti-ACVP can be accessed at http://servers.aibiochem.net/soft/FEOpti-ACVP/.


Subjects
Algorithms , Peptides , Humans , Amino Acid Sequence , Peptides/pharmacology , Machine Learning
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38555479

ABSTRACT

MOTIVATION: Accurately predicting molecular metabolic stability is of great significance to drug research and development, ensuring drug safety and effectiveness. Existing deep learning methods, especially graph neural networks, can reveal the molecular structure of drugs and thus efficiently predict the metabolic stability of molecules. However, most of these methods focus on the message passing between adjacent atoms in the molecular graph, ignoring the relationship between bonds. This makes it difficult for these methods to estimate accurate molecular representations, thereby being limited in molecular metabolic stability prediction tasks. RESULTS: We propose the MS-BACL model based on bond graph augmentation technology and contrastive learning strategy, which can efficiently and reliably predict the metabolic stability of molecules. To our knowledge, this is the first time that bond-to-bond relationships in molecular graph structures have been considered in the task of metabolic stability prediction. We build a bond graph based on 'atom-bond-atom', and the model can simultaneously capture the information of atoms and bonds during the message propagation process. This enhances the model's ability to reveal the internal structure of the molecule, thereby improving the structural representation of the molecule. Furthermore, we perform contrastive learning training based on the molecular graph and its bond graph to learn the final molecular representation. Multiple sets of experimental results on public datasets show that the proposed MS-BACL model outperforms the state-of-the-art model. AVAILABILITY AND IMPLEMENTATION: The code and data are publicly available at https://github.com/taowang11/MS.


Subjects
Neural Networks, Computer
7.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279648

ABSTRACT

Virus-encoded circular RNA (circRNA) participates in the immune response to viral infection, affects the human immune system, and can be used as a target for precision therapy and as a tumor biomarker. The coronaviruses SARS-CoV-1 and SARS-CoV-2 (SARS-CoV-1/2) that have emerged in recent years are highly contagious and have high mortality rates. Little is known about the circRNA encoded by SARS-CoV-1/2. Therefore, this study explores whether SARS-CoV-1/2 encodes circRNA and examines the characteristics and functions of that circRNA. Based on RNA-seq data of SARS-CoV-1 and SARS-CoV-2 infections, we used circRNA identification tools (circRNA_finder, find_circ and CIRI2) to identify circRNAs. The number of circRNAs encoded by SARS-CoV-1 and SARS-CoV-2 was identified as 151 and 470, respectively. SARS-CoV-2 thus shows a more prominent circRNA encoding ability than SARS-CoV-1. Expression analysis showed that only a few circRNAs encoded by SARS-CoV-1/2 showed high expression levels, and the positive strand produced more abundant circRNAs. Then, based on the identified SARS-CoV-1/2-encoded circRNAs, we performed circRNA identification and characterization using the previously developed CirRNAPL. Finally, target gene prediction and functional enrichment analysis were performed. It was found that viral circRNA is closely related to cancer and has a potential role in regulating host cell functions. This study examined the characteristics and functions of viral circRNA encoded by coronavirus SARS-CoV-1/2, providing a valuable resource for further research on the function and molecular mechanism of coronavirus circRNA.


Subjects
COVID-19 , MicroRNAs , Neoplasms , Humans , RNA, Circular/genetics , SARS-CoV-2/genetics , COVID-19/genetics , RNA, Viral/genetics , Neoplasms/genetics , MicroRNAs/genetics
8.
Nucleic Acids Res ; 52(D1): D990-D997, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37831073

ABSTRACT

Rare variants contribute significantly to the genetic causes of complex traits, as they can have much larger effects than common variants and account for much of the missing heritability in genome-wide association studies. The emergence of UK Biobank scale datasets and accurate gene-level rare variant-trait association testing methods have dramatically increased the number of rare variant associations that have been detected. However, no systematic collection of these associations has been carried out to date, especially at the gene level. To address the issue, we present the Rare Variant Association Repository (RAVAR), a comprehensive collection of rare variant associations. RAVAR includes 95 047 high-quality rare variant associations (76 186 gene-level and 18 861 variant-level associations) for 4429 reported traits which are manually curated from 245 publications. RAVAR is the first resource to collect and curate published rare variant associations in an interactive web interface with integrated visualization, search, and download features. Detailed gene and SNP information are provided for each association, and users can conveniently search for related studies by exploring the EFO tree structure and interactive Manhattan plots. RAVAR could vastly improve the accessibility of rare variant studies. RAVAR is freely available for all users without login requirement at http://www.ravar.bio.


Subjects
Databases, Genetic , Genetic Variation , Genome-Wide Association Study , Genome-Wide Association Study/methods , Multifactorial Inheritance , Phenotype
9.
PLoS Genet ; 19(9): e1010942, 2023 09.
Article in English | MEDLINE | ID: mdl-37703293

ABSTRACT

The gene regulatory structure of cells involves not only the regulatory relationship between two genes, but also the cooperative associations of multiple genes. However, most gene regulatory network inference methods for single cell only focus on and infer the regulatory relationships of pairs of genes, ignoring the global regulatory structure which is crucial to identify the regulations in the complex biological systems. Here, we proposed a graph-based Deep learning model for Regulatory networks Inference among Genes (DeepRIG) from single-cell RNA-seq data. To learn the global regulatory structure, DeepRIG builds a prior regulatory graph by transforming the gene expression of data into the co-expression mode. Then it utilizes a graph autoencoder model to embed the global regulatory information contained in the graph into gene latent embeddings and to reconstruct the gene regulatory network. Extensive benchmarking results demonstrate that DeepRIG can accurately reconstruct the gene regulatory networks and outperform existing methods on multiple simulated networks and real-cell regulatory networks. Additionally, we applied DeepRIG to the samples of human peripheral blood mononuclear cells and triple-negative breast cancer, and presented that DeepRIG can provide accurate cell-type-specific gene regulatory networks inference and identify novel regulators of progression and inhibition.


Subjects
Gene Regulatory Networks , Triple Negative Breast Neoplasms , Humans , Gene Regulatory Networks/genetics , Leukocytes, Mononuclear , Transcriptome/genetics
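
The DeepRIG abstract above mentions building a prior regulatory graph by transforming expression data into a co-expression form. The abstract does not say which statistic or cutoff is used, so the sketch below is only one plausible reading: it assumes Pearson correlation between gene expression profiles and a hypothetical absolute-correlation threshold, and returns the gene pairs that pass it as weighted edges. The function names and the `threshold` parameter are invented for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Build an undirected co-expression graph as a list of (gene, gene, r) edges.

    expression maps gene name -> list of expression values across cells.
    threshold is an assumed cutoff on |r|, not a value from the paper.
    """
    genes = sorted(expression)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if abs(r) >= threshold:
                edges.append((gene_a, gene_b, r))
    return edges
```

A graph built this way is symmetric and says nothing about regulatory direction; according to the abstract, the graph autoencoder is what turns such a prior into a reconstructed regulatory network.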
10.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37122068

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) technology attracts extensive attention in the biomedical field. It can be used to measure gene expression and analyze the transcriptome at the single-cell level, enabling the identification of cell types based on unsupervised clustering. Data imputation and dimension reduction are conducted before clustering because scRNA-seq has a high 'dropout' rate, noise and linear inseparability. However, independence of dimension reduction, imputation and clustering cannot fully characterize the pattern of the scRNA-seq data, resulting in poor clustering performance. Herein, we propose a novel and accurate algorithm, SSNMDI, that utilizes a joint learning approach to simultaneously perform imputation, dimensionality reduction and cell clustering in a non-negative matrix factorization (NMF) framework. In addition, we integrate the cell annotation as prior information, then transform the joint learning into a semi-supervised NMF model. Through experiments on 14 datasets, we demonstrate that SSNMDI has a faster convergence speed, better dimensionality reduction performance and a more accurate cell clustering performance than previous methods, providing an accurate and robust strategy for analyzing scRNA-seq data. Biological analyses are also conducted to validate the biological significance of our method, including pseudotime analysis, gene ontology and survival analysis. We believe that we are among the first to introduce imputation, partial label information, dimension reduction and clustering to the single-cell field. AVAILABILITY AND IMPLEMENTATION: The source code for SSNMDI is available at https://github.com/yushanqiu/SSNMDI.


Subjects
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis
11.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38171927

ABSTRACT

Exploring microbial stress responses to drugs is crucial for the advancement of new therapeutic methods. While current artificial intelligence methodologies have expedited our understanding of potential microbial responses to drugs, the models are constrained by the imprecise representation of microbes and drugs. To this end, we combine deep autoencoder and subgraph augmentation technology for the first time to propose a model called JDASA-MRD, which can identify the potential indistinguishable responses of microbes to drugs. In the JDASA-MRD model, we begin by feeding the established similarity matrices of microbe and drug into the deep autoencoder, enabling the extraction of robust initial features of both microbes and drugs. Subsequently, we employ the MinHash and HyperLogLog algorithms to account for intersections and cardinality data between microbe and drug subgraphs, thus deeply extracting the multi-hop neighborhood information of nodes. Finally, by integrating the initial node features with subgraph topological information, we leverage graph neural network technology to predict the microbes' responses to drugs, offering a more effective solution to the 'over-smoothing' challenge. Comparative analyses on multiple public datasets confirm that the JDASA-MRD model's performance surpasses that of current state-of-the-art models. This research aims to offer a more profound insight into the adaptability of microbes to drugs and to furnish pivotal guidance for drug treatment strategies. Our data and code are publicly available at: https://github.com/ZZCrazy00/JDASA-MRD.


Subjects
Algorithms , Artificial Intelligence , Neural Networks, Computer
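
The JDASA-MRD abstract above uses MinHash to summarize overlaps between node neighborhoods. To show the general idea, the sketch below builds MinHash signatures for two sets and estimates their Jaccard similarity from the fraction of matching signature slots. This is a generic illustration and not the JDASA-MRD implementation; the seeded BLAKE2 hashing scheme, the function names and the signature length are assumptions made here, and HyperLogLog is not covered.

```python
import hashlib


def _seeded_hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum per seeded hash function; items must be a non-empty collection."""
    items = list(items)
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min(_seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of |A & B| / |A | B|."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for checking the estimate on small sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Because each slot matches with probability equal to the true Jaccard similarity, the estimate tightens as `num_hashes` grows, at the cost of longer signatures. The signatures have fixed size, so neighborhoods of very different sizes can be compared without storing the sets themselves.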
12.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36567252

ABSTRACT

Numerous experimental studies have indicated that alteration and dysregulation in microRNAs (miRNAs) are associated with serious diseases. Identifying disease-related miRNAs is therefore an essential and challenging task in bioinformatics research. Computational methods are an efficient and economical alternative to conventional biomedical studies and can reveal underlying miRNA-disease associations for subsequent experimental confirmation with reasonable confidence. Despite the success of existing computational approaches, most of them only rely on the known miRNA-disease associations to predict associations without adding other data to increase the prediction accuracy, and they are affected by issues of data sparsity. In this paper, we present MRRN, a model that combines matrix reconstruction with node reliability to predict probable miRNA-disease associations. In MRRN, the most reliable neighbors of miRNA and disease are used to update the original miRNA-disease association matrix, which significantly reduces data sparsity. Unknown miRNA-disease associations are reconstructed by aggregating the most reliable first-order neighbors to increase prediction accuracy by representing the local and global structure of the heterogeneous network. Five-fold cross-validation of MRRN produced an area under the curve (AUC) of 0.9355 and area under the precision-recall curve (AUPR) of 0.2646, values that were greater than those produced by comparable models. Two different types of case studies using three diseases were conducted to demonstrate the accuracy of MRRN, and all top 30 predicted miRNAs were verified.


Subjects
MicroRNAs , Humans , MicroRNAs/genetics , Genetic Predisposition to Disease , Reproducibility of Results , Algorithms , Computational Biology/methods
13.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37088976

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a revolutionary breakthrough that determines the precise gene expressions on individual cells and deciphers cell heterogeneity and subpopulations. However, scRNA-seq data are much noisier than traditional high-throughput RNA-seq data because of technical limitations, leading to many scRNA-seq data studies about dimensionality reduction and visualization remaining at the basic data-stacking stage. In this study, we propose an improved variational autoencoder model (termed DREAM) for dimensionality reduction and a visual analysis of scRNA-seq data. Here, DREAM combines the variational autoencoder and Gaussian mixture model for cell type identification, meanwhile explicitly solving 'dropout' events by introducing the zero-inflated layer to obtain the low-dimensional representation that describes the changes in the original scRNA-seq dataset. Benchmarking comparisons across nine scRNA-seq datasets show that DREAM outperforms four state-of-the-art methods on average. Moreover, we prove that DREAM can accurately capture the expression dynamics of human preimplantation embryonic development. DREAM is implemented in Python, freely available via the GitHub website, https://github.com/Crystal-JJ/DREAM.


Subjects
Single-Cell Analysis , Single-Cell Gene Expression Analysis , Humans , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , RNA-Seq , Gene Expression Profiling/methods , Cluster Analysis
14.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37200156

ABSTRACT

Multiple sequence alignment is widely used for sequence analysis, such as identifying important sites and phylogenetic analysis. Traditional methods, such as progressive alignment, are time-consuming. To address this issue, we introduce StarTree, a novel method to quickly construct a guide tree by combining sequence clustering and hierarchical clustering. Furthermore, we develop a new heuristic similar region detection algorithm using the FM-index and apply k-banded dynamic programming to the profile alignment. We also introduce a win-win alignment algorithm that applies the central star strategy within the clusters to speed up the alignment process, then uses the progressive strategy to align the central-aligned profiles, guaranteeing the final alignment's accuracy. We present WMSA 2 based on these improvements and compare the speed and accuracy with other popular methods. The results show that the guide tree made by the StarTree clustering method can lead to better accuracy than that of PartTree while consuming less time and memory than that of UPGMA and mBed methods on datasets with thousands of sequences. During the alignment of simulated data sets, WMSA 2 can consume less time and memory while ranking at the top of Q and TC scores. WMSA 2 also remains better in time and memory efficiency on the real datasets and ranks at the top on the average sum of pairs score. For the alignment of 1 million SARS-CoV-2 genomes, the win-win mode of WMSA 2 significantly decreased the running time compared with the former version. The source code and data are available at https://github.com/malabz/WMSA2.


Subjects
COVID-19 , RNA , Humans , Sequence Alignment , Phylogeny , SARS-CoV-2/genetics , Software , Algorithms , DNA/genetics
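
The WMSA 2 abstract above applies k-banded dynamic programming during profile alignment. The sketch below shows the banding idea in its simplest setting: an edit-distance table between two plain strings in which only cells with |i - j| <= k are filled. It is an assumption-laden illustration and not WMSA 2 code, since the tool aligns profiles with its own scoring scheme; the unit costs and the function name are chosen for this example.

```python
def banded_edit_distance(a, b, k):
    """Edit distance computed only inside the band |i - j| <= k.

    Returns None when the length difference exceeds k, because no alignment
    path can stay inside the band in that case. The result equals the true
    edit distance whenever that distance is at most k; otherwise it is an
    upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0: aligning the empty prefix of a against the first j characters of b.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1   # cell above may lie outside the band
            insert = cur.get(j - 1, inf) + 1  # cell to the left may lie outside
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]
```

Each row touches at most 2k + 1 cells, so the work is proportional to k times the sequence length instead of the product of both lengths, which is the reason banding pays off for long, similar sequences.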
15.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37861173

ABSTRACT

NcRNA-encoded small peptides (ncPEPs) have recently emerged as promising targets and biomarkers for cancer immunotherapy. Therefore, identifying cancer-associated ncPEPs is crucial for cancer research. In this work, we propose CoraL, a novel supervised contrastive meta-learning framework for predicting cancer-associated ncPEPs. Specifically, the proposed meta-learning strategy enables our model to learn meta-knowledge from different types of peptides and train a promising predictive model even with few labeled samples. The results show that our model is capable of making high-confidence predictions on unseen cancer biomarkers with only five samples, potentially accelerating the discovery of novel cancer biomarkers for immunotherapy. Moreover, our approach remarkably outperforms existing deep learning models on 15 cancer-associated ncPEPs datasets, demonstrating its effectiveness and robustness. Interestingly, our model exhibits outstanding performance when extended for the identification of short open reading frames derived from ncPEPs, demonstrating the strong prediction ability of CoraL at the transcriptome level. Importantly, our feature interpretation analysis discovers unique sequential patterns as the fingerprint for each cancer-associated ncPEPs, revealing the relationship among certain cancer biomarkers that are validated by relevant literature and motif comparison. Overall, we expect CoraL to be a useful tool to decipher the pathogenesis of cancer and provide valuable information for cancer research. The dataset and source code of our proposed method can be found at https://github.com/Johnsunnn/CoraL.


Subjects
Anthozoa , Neoplasms , Animals , Anthozoa/genetics , Neoplasms/genetics , Biomarkers, Tumor/genetics , Immunotherapy , Peptides/genetics , RNA, Untranslated
16.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37779250

ABSTRACT

The microbiota-gut-brain axis denotes a two-way system of interactions between the gut and the brain, comprising three key components: (1) gut microbiota, (2) intermediates and (3) mental ailments. These constituents communicate with one another to induce changes in the host's mood, cognition and demeanor. Knowledge concerning the regulation of the host central nervous system by gut microbiota is fragmented and mostly confined to disorganized or semi-structured unrestricted texts. Such a format hinders the exploration and comprehension of unknown territories or the further advancement of artificial intelligence systems. Hence, we collated crucial information by scrutinizing an extensive body of literature, amalgamated the extant knowledge of the microbiota-gut-brain axis and depicted it in the form of a knowledge graph named MMiKG, which can be visualized on the GraphXR platform and the Neo4j database, correspondingly. By merging various associated resources and deducing prospective connections between gut microbiota and the central nervous system through MMiKG, users can acquire a more comprehensive perception of the pathogenesis of mental disorders and generate novel insights for advancing therapeutic measures. As a free and open-source platform, MMiKG can be accessed at http://yangbiolab.cn:8501/ with no login requirement.


Subjects
Mental Disorders , Microbiota , Humans , Artificial Intelligence , Pattern Recognition, Automated , Prospective Studies , Brain
17.
Bioinformatics ; 40(5)2024 05 02.
Article in English | MEDLINE | ID: mdl-38684178

ABSTRACT

MOTIVATION: Continuous advancements in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to further explore the study of cell heterogeneity, trajectory inference, identification of rare cell types, and neurology. Accurate scRNA-seq data clustering is crucial in single-cell sequencing data analysis. However, the high dimensionality, sparsity, and presence of "false" zero values in the data can pose challenges to clustering. Furthermore, current unsupervised clustering algorithms have not effectively leveraged prior biological knowledge, making cell clustering even more challenging. RESULTS: This study investigates a semisupervised clustering model called scTPC, which integrates the triplet constraint, pairwise constraint, and cross-entropy constraint based on deep learning. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partial labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model. A series of experimental results on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering with a well-designed framework. AVAILABILITY AND IMPLEMENTATION: scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.


Subjects
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Humans , Sequence Analysis, RNA/methods , RNA-Seq/methods , Deep Learning , Software , Single-Cell Gene Expression Analysis
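
The scTPC abstract above generates triplet constraints from partially labeled cells and uses them during deep clustering. The sketch below shows one common form of that idea: enumerating (anchor, positive, negative) index triplets from the labeled subset and scoring a triplet with a margin-based hinge loss on Euclidean distances in an embedding space. The abstract does not give the exact loss, so the margin formulation, the function names and the use of `None` for unlabeled cells are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def build_triplets(labels):
    """Enumerate (anchor, positive, negative) index triplets from partial labels.

    labels holds one entry per cell; None marks an unlabeled cell, which is skipped.
    """
    labeled = [(i, y) for i, y in enumerate(labels) if y is not None]
    triplets = []
    for a, label_a in labeled:
        for p, label_p in labeled:
            if p == a or label_p != label_a:
                continue  # positive must be a different cell with the same label
            for n, label_n in labeled:
                if label_n != label_a:
                    triplets.append((a, p, n))
    return triplets
```

Exhaustive enumeration grows cubically with the number of labeled cells, so a practical pipeline would sample triplets instead; this sketch keeps the full list only to make the constraint structure visible.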
18.
Bioinformatics ; 40(5)2024 05 02.
Article in English | MEDLINE | ID: mdl-38710482

ABSTRACT

MOTIVATION: Despite the extensive manufacturing of antiviral drugs and vaccination, viral infections continue to be a major human ailment. Antiviral peptides (AVPs) have emerged as potential candidates in the pursuit of novel antiviral drugs. These peptides show vigorous antiviral activity against a diverse range of viruses by targeting different phases of the viral life cycle. Therefore, the accurate prediction of AVPs is an essential yet challenging task. Lately, many machine learning-based approaches have been developed for this purpose; however, their limited capabilities in terms of feature engineering, accuracy, and generalization make these methods restricted. RESULTS: In the present study, we aim to develop an efficient machine learning-based approach for the identification of AVPs, referred to as DeepAVP-TPPred, to address the aforementioned problems. First, we extract two new transformed feature sets using our designed image-based feature extraction algorithms and integrate them with an evolutionary information-based feature. Next, these feature sets were optimized using a novel feature selection approach called binary tree growth algorithm. Finally, the optimal feature space from the training dataset was fed to the deep neural network to build the final classification model. The proposed model DeepAVP-TPPred was tested using stringent 5-fold cross-validation and two independent dataset testing methods, which achieved the maximum performance and showed enhanced efficiency over existing predictors in terms of both accuracy and generalization capabilities. AVAILABILITY AND IMPLEMENTATION: https://github.com/MateeullahKhan/DeepAVP-TPPred.


Subjects
Algorithms , Antiviral Agents , Machine Learning , Antiviral Agents/pharmacology , Antiviral Agents/chemistry , Peptides/chemistry , Humans , Computational Biology/methods , Neural Networks, Computer
19.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38180876

ABSTRACT

MOTIVATION: In recent years, circular RNAs (circRNAs), a particular form of RNA with a closed-loop structure, have attracted widespread attention due to their physiological significance (they can directly bind proteins), leading to the development of numerous protein binding site identification algorithms. Unfortunately, these methods are supervised and require large numbers of labeled samples in training to perform well, and obtaining sample labels requires many biological experiments and is difficult. RESULTS: To reduce the number of labels needed for training in the circRNA-binding site prediction task, a self-supervised learning binding site identification algorithm named CircSI-SSL is proposed in this article. According to our survey, this is the first such approach in this research field. Specifically, CircSI-SSL first combines multiple feature coding schemes and employs RNA_Transformer for cross-view sequence prediction (the self-supervised task) to learn mutual information from the multi-view data, and is then fine-tuned with only a few sample labels. Comprehensive experiments on six widely used circRNA datasets indicate that CircSI-SSL achieves excellent performance in comparison to previous algorithms, even in the extreme case where the ratio of training data to test data is 1:9. In addition, a transfer experiment on six linRNA datasets, run without network modification or hyperparameter adjustment, shows that CircSI-SSL has good scalability. In summary, the self-supervised prediction algorithm proposed in this article is expected to replace previous supervised algorithms and to have broader application value. AVAILABILITY AND IMPLEMENTATION: The source code and data are available at https://github.com/cc646201081/CircSI-SSL.


Subjects
RNA, Circular , RNA , Binding Sites , Algorithms , Supervised Machine Learning
20.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38200554

ABSTRACT

MOTIVATION: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. RESULTS: FMAlign2 leverages the suffix array to identify maximal exact matches, changing the approach of FMAlign from searching for global chains to searching for partial chains. By using a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate the subsets into the final result. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enables existing MSA methods to handle sequences reaching billions in length within an acceptable time frame. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.
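The vertical division idea described in the abstract can be sketched on a toy scale. The code below is not FMAlign2's implementation: it swaps the suffix array and maximal exact matches for a plain dictionary of fixed-length k-mers that are unique in every sequence, chains them greedily, and cuts all sequences at those shared anchors. The function names and the example sequences are hypothetical.

```python
from collections import Counter

def unique_kmers(seq, k):
    """Map each k-mer occurring exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {m: i for i, m in enumerate(kmers) if counts[m] == 1}

def common_anchors(seqs, k):
    """Greedily chain k-mers that are unique in, and shared by, all sequences."""
    tables = [unique_kmers(s, k) for s in seqs]
    shared = set(tables[0]).intersection(*tables[1:])
    chain, last = [], [0] * len(seqs)
    # Walk anchors in the order they appear in the first sequence and keep
    # only those that also move forward (without overlap) in every other one.
    for kmer in sorted(shared, key=lambda m: tables[0][m]):
        pos = [t[kmer] for t in tables]
        if all(p >= l for p, l in zip(pos, last)):
            chain.append(pos)
            last = [p + k for p in pos]
    return chain

def vertical_split(seqs, k):
    """Cut all sequences at shared anchors into column-wise blocks."""
    blocks, starts = [], [0] * len(seqs)
    for pos in common_anchors(seqs, k):
        # Segment between the previous anchor and this one (may be empty).
        blocks.append([s[a:p] for s, a, p in zip(seqs, starts, pos)])
        # The anchor itself is identical in all sequences.
        blocks.append([s[p:p + k] for s, p in zip(seqs, pos)])
        starts = [p + k for p in pos]
    blocks.append([s[a:] for s, a in zip(seqs, starts)])
    return blocks

seqs = ["AAGATTACACC", "TGATTACAG", "GATTACATTT"]  # assumed toy input
blocks = vertical_split(seqs, k=7)
```

Each non-anchor block is an independent, smaller alignment problem that could be handed to any MSA tool in parallel, and the aligned blocks are then concatenated in order. A real implementation would need variable-length matches and an index that scales to very long inputs, which is where the suffix array comes in.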


Subjects
Algorithms , Software , Sequence Alignment , Base Sequence , Computational Biology