Results 1 - 11 of 11
1.
Genomics ; 116(1): 110749, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38008265

ABSTRACT

MOTIVATION: N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in various biological processes. Accurately identifying ac4C sites is essential for a deeper understanding of their regulatory mechanisms. However, existing experimental techniques for ac4C site identification are limited in cost-effectiveness, and the performance of current computational methods still needs improvement. RESULTS: In this paper, we present MetaAc4C, a deep learning model that leverages pre-trained bidirectional encoder representations from transformers (BERT). The model is based on a bidirectional long short-term memory network (BLSTM) architecture that incorporates an attention mechanism and residual connections. To address data imbalance, we adapt generative adversarial networks to generate synthetic feature samples. On the independent test set, MetaAc4C surpasses the current state-of-the-art ac4C prediction model, improving ACC, MCC, and AUROC by 2.36%, 4.76%, and 3.11%, respectively, on the unbalanced dataset, and by 2.6%, 5.11%, and 1.01%, respectively, on the balanced dataset. Notably, training on RNA samples augmented with WGAN-GP outperforms the SMOTE oversampling method.


Subjects
Deep Learning , Cytidine , RNA
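
The ACC, MCC, and AUROC metrics named in the abstract above can be illustrated with a minimal sketch. It uses the standard textbook definitions; the function names and the example counts are hypothetical and are not taken from MetaAc4C or its data.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical counts, for illustration only.
print(accuracy(tp=80, tn=90, fp=10, fn=20))
print(mcc(tp=80, tn=90, fp=10, fn=20))
print(auroc([0.9, 0.8], [0.1, 0.85]))
```

MCC is often reported alongside ACC on unbalanced data because it uses all four cells of the confusion matrix, so it is harder to inflate by always predicting the majority class.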
2.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36155619

ABSTRACT

Identification of transcription factor binding sites (TFBSs) is essential to understanding gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.


Subjects
Gene Expression Regulation , Transcription Factors , Humans , Binding Sites , Protein Binding , Transcription Factors/metabolism , Neural Networks, Computer
3.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35189635

ABSTRACT

Protein lysine crotonylation (Kcr) is an important type of posttranslational modification that is associated with a wide range of biological processes. The identification of Kcr sites is critical to better understanding their functional mechanisms. However, the existing experimental techniques for detecting Kcr sites are cost-ineffective, leading to a great need for new computational methods to address this problem. Here we describe Adapt-Kcr, an advanced deep learning model that utilizes adaptive embedding and is based on a convolutional neural network together with a bidirectional long short-term memory network and attention architecture. On the independent testing set, Adapt-Kcr outperformed the current state-of-the-art Kcr prediction model, with an improvement of 3.2% in accuracy and 1.9% in the area under the receiver operating characteristic curve. Compared to other Kcr models, Adapt-Kcr additionally had a more robust ability to distinguish between crotonylation and other lysine modifications. Another model (Adapt-ST) was trained to predict phosphorylation sites in SARS-CoV-2, and outperformed the equivalent state-of-the-art phosphorylation site prediction model. These results indicate that self-adaptive embedding features perform better than handcrafted features in capturing discriminative information; when used in an attention architecture, this could be an effective way of identifying protein Kcr sites. Together, our Adapt framework (including learning embedding features and attention architecture) has strong potential for the prediction of other protein posttranslational modification sites.


Subjects
Computational Biology , Deep Learning , Lysine/metabolism , Protein Processing, Post-Translational , Software , Algorithms , Benchmarking , Computational Biology/methods , Computational Biology/standards , Databases, Factual , Neural Networks, Computer , Phosphorylation , ROC Curve , Reproducibility of Results , User-Computer Interface
4.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36184256

ABSTRACT

Fentanyl and its analogues are psychoactive substances, and concern about fentanyl abuse has existed for decades. Because the structure of fentanyl is easy to modify, criminals may synthesize new fentanyl analogues to evade supervision. Drug supervision relies on matching structures against a database, and too few fentanyl analogues are included in that database, so it is necessary to identify more potential fentanyl analogues and expand their sample space. In this study, we used two deep generative models (SeqGAN and MolGPT) to generate potential fentanyl analogues and obtained a total of 11 041 valid molecules. The results showed that the generated molecules not only have a property distribution similar to that of the original data, but also include potential fentanyl analogues that are not closely similar to any of the original data. Ten molecules selected according to the rules for fentanyl analogues were validated by NMR, MS and IR. The results indicated that these molecules are all unreported fentanyl analogues. Furthermore, this study is the first to apply deep learning to the generation of fentanyl analogues; it greatly expands the explorable space of fentanyl analogues and supports the supervision of fentanyl.


Subjects
Deep Learning , Fentanyl , Fentanyl/chemistry , Analgesics, Opioid/chemistry , Magnetic Resonance Spectroscopy , Data Management
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36525367

ABSTRACT

SUMMARY: Non-coding RNAs, in particular miRNAs and lncRNAs, play important roles in transcriptional processes and participate in the regulation of various biological functions. Despite their importance, the existing signaling pathway databases do not include information on miRNAs and lncRNAs. Here, we redesigned a novel pathway database named NcPath by integrating and visualizing a total of 178 308 human experimentally validated miRNA-target interactions (MTIs), 32 282 experimentally verified lncRNA-target interactions (LTIs) and 4837 experimentally validated human ceRNA networks across 222 KEGG pathways (including 27 sub-categories). To expand the application potential of the redesigned NcPath database, we identified 556 798 reliable lncRNA-protein-coding gene (PCG) interaction pairs by integrating co-expression relations, ceRNA relations, co-TF-binding interactions, co-histone-modification interactions, cis-regulation relations and lncPro Tool predictions between lncRNAs and PCGs. In addition, to determine the pathways in which miRNA/lncRNA targets are involved, we performed a KEGG enrichment analysis using a hypergeometric test. The NcPath database also provides information on MTIs/LTIs/ceRNA networks, PubMed IDs, gene annotations and the experimental verification method used. In summary, the NcPath database will serve as an important and continually updated platform that provides annotation and visualization of the pathways in which non-coding RNAs (miRNA and lncRNA) are involved, and provides support for multimodal non-coding RNA enrichment analysis. AVAILABILITY AND IMPLEMENTATION: The NcPath database is freely available at http://ncpath.pianlab.cn/. The code and manual to use NcPath can be found at https://github.com/Marscolono/NcPath/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
MicroRNAs , RNA, Long Noncoding , Humans , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism , Gene Regulatory Networks , MicroRNAs/genetics , MicroRNAs/metabolism , Signal Transduction
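
The hypergeometric enrichment test mentioned in the abstract above can be sketched as a one-sided tail probability. This is a generic over-representation test, not NcPath's actual code; the function name, the parameter names (N, K, n, k), and the example numbers are assumptions made for illustration, and the choice of background gene universe is left to the analyst.

```python
from math import comb

def hypergeom_enrichment_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background universe (assumed, e.g. all annotated genes)
    K: genes annotated to the pathway
    n: target genes of the miRNA/lncRNA being tested
    k: target genes that fall inside the pathway
    """
    upper = min(K, n)          # the overlap cannot exceed either set size
    start = max(k, 0)
    # math.comb returns 0 when the lower index exceeds the upper, so
    # impossible overlaps contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(start, upper + 1))
    return tail / comb(N, n)

# Hypothetical example: 20 000 background genes, a 150-gene pathway,
# 300 targets, 12 of which are in the pathway.
print(hypergeom_enrichment_pvalue(20000, 150, 300, 12))
```

When many pathways are tested at once, the resulting p-values would normally be corrected for multiple testing; the abstract does not say which correction, if any, was applied.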
6.
PLoS Comput Biol ; 17(2): e1008767, 2021 02.
Article in English | MEDLINE | ID: mdl-33600435

ABSTRACT

N6-methyladenine (6mA) is an important DNA modification associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA's biological functions. However, the existing experimental techniques for detecting 6mA sites are cost-ineffective, which implies a great need for new computational methods for this problem. In this paper, we developed a deep learning framework named Deep6mA to identify DNA 6mA sites without requiring any prior knowledge of 6mA or manually crafted sequence features, and its performance is superior to that of other DNA 6mA prediction tools. Specifically, 5-fold cross-validation on a benchmark dataset of rice gives a sensitivity and specificity for Deep6mA of 92.96% and 95.06%, respectively, and an overall prediction accuracy of 94%. Importantly, we find that sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts the 6mA sites of three other species well: Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulation of downstream gene expression.


Subjects
Adenine/analogs & derivatives , DNA Methylation , DNA/genetics , DNA/metabolism , Deep Learning , Adenine/metabolism , Arabidopsis/genetics , Arabidopsis/metabolism , Base Sequence , Binding Sites/genetics , Computational Biology , DNA, Plant/genetics , DNA, Plant/metabolism , Databases, Nucleic Acid , Fragaria/genetics , Fragaria/metabolism , Neural Networks, Computer , Oryza/genetics , Oryza/metabolism , Rosa/genetics , Rosa/metabolism , Species Specificity
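
As a quick consistency check on the figures reported in the abstract above, accuracy can be written as a prevalence-weighted average of sensitivity and specificity. The sketch below assumes an equal number of positive and negative samples; that class balance is an assumption for illustration and is not stated in the abstract. The function name is hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, positive_fraction):
    """Accuracy = P(pos) * sensitivity + P(neg) * specificity."""
    return positive_fraction * sensitivity + (1.0 - positive_fraction) * specificity

# Reported sensitivity and specificity, with an ASSUMED 50/50 class balance.
acc = accuracy_from_rates(0.9296, 0.9506, 0.5)
print(f"{acc:.4f}")  # prints 0.9401
```

Under that assumption the result rounds to the 94% overall accuracy quoted in the abstract.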
7.
BMC Bioinformatics ; 22(1): 27, 2021 Jan 22.
Article in English | MEDLINE | ID: mdl-33482718

ABSTRACT

BACKGROUND: Large-scale gene expression profiling has been successfully applied to the discovery of functional connections among diseases, genetic perturbation, and drug action. To address the cost of ever-expanding gene expression profiling, a new, low-cost, high-throughput reduced representation expression profiling method called L1000 was proposed, with which one million profiles were produced. Although a set of ~1000 carefully chosen landmark genes that can capture ~80% of the information from the whole genome has been identified for use in L1000, the robustness of using these landmark genes to infer target genes is not satisfactory. Therefore, more efficient computational methods are still needed to mine the influential genes in the genome more deeply. RESULTS: Here, we propose a computational framework based on deep learning to mine a subset of genes that can cover more genomic information. Specifically, an AutoEncoder framework is first constructed to learn the non-linear relationship between genes, and then DeepLIFT is applied to calculate gene importance scores. Using this data-driven approach, we have re-obtained a landmark gene set. The results show that our landmark genes can predict target genes more accurately and robustly than those of L1000 based on two metrics [mean absolute error (MAE) and Pearson correlation coefficient (PCC)]. This reveals that the landmark genes detected by our method contain more genomic information. CONCLUSIONS: We believe that our proposed framework is very suitable for the analysis of biological big data to reveal the mysteries of life. Furthermore, the landmark genes inferred from this study can be used for the explosive amplification of gene expression profiles to facilitate research into functional connections.


Subjects
Deep Learning , Gene Expression Profiling , Genomics , Genome , Transcriptome
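
The two evaluation metrics named in the abstract above, mean absolute error (MAE) and the Pearson correlation coefficient (PCC), can be sketched with their standard definitions. The function names and the toy expression values are hypothetical; how the authors aggregate these metrics across target genes is not described in the abstract.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and inferred values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy example: measured vs. inferred expression of one target gene
# across four samples (values invented for illustration).
measured = [1.0, 2.0, 3.0, 4.0]
inferred = [1.5, 2.5, 3.5, 4.5]
print(mean_absolute_error(measured, inferred))
print(pearson_correlation(measured, inferred))
```

The toy example shows why both metrics are useful together: a constant offset leaves the correlation perfect while the MAE still registers the error.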
8.
Genes (Basel) ; 13(7)2022 06 23.
Article in English | MEDLINE | ID: mdl-35885905

ABSTRACT

Small molecular networks within complex pathways are defined as subpathways. The identification of patient-specific subpathways can reveal the etiology of cancer and guide the development of personalized therapeutic strategies. The dysfunction of subpathways has been associated with the occurrence and development of cancer. Here, we propose a strategy to identify aberrant subpathways at the individual level by calculating the edge score and using the Gene Set Enrichment Analysis (GSEA) method. This provides a novel approach to subpathway analysis. We applied this method to the expression data of a lung adenocarcinoma (LUAD) dataset from The Cancer Genome Atlas (TCGA) database. We validated the effectiveness of this method in identifying LUAD-relevant subpathways and demonstrated its reliability using an independent Gene Expression Omnibus dataset (GEO). Additionally, survival analysis was applied to illustrate the clinical application value of the genes and edges in subpathways that were associated with the prognosis of patients and cancer immunity, which could be potential biomarkers. With these analyses, we show that our method could help uncover subpathways underlying lung adenocarcinoma.


Subjects
Adenocarcinoma of Lung , Lung Neoplasms , Adenocarcinoma of Lung/genetics , Humans , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Prognosis , Reproducibility of Results
9.
Genes (Basel) ; 12(2)2021 01 28.
Article in English | MEDLINE | ID: mdl-33525573

ABSTRACT

In genome-wide association studies, detecting high-order epistasis is important for analyzing the occurrence of complex human diseases and explaining missing heritability. However, there are various challenges in the actual high-order epistasis detection process due to the large amount of data, "small sample size problem", diversity of disease models, etc. This paper proposes a multi-objective genetic algorithm (EpiMOGA) for single nucleotide polymorphism (SNP) epistasis detection. The K2 score based on the Bayesian network criterion and the Gini index of the diversity of the binary classification problem were used to guide the search process of the genetic algorithm. Experiments were performed on 26 simulated datasets of different models and a real Alzheimer's disease dataset. The results indicated that EpiMOGA was obviously superior to other related and competitive methods in both detection efficiency and accuracy, especially for small-sample-size datasets, and the performance of EpiMOGA remained stable across datasets of different disease models. At the same time, a number of SNP loci and 2-order epistasis associated with Alzheimer's disease were identified by the EpiMOGA method, indicating that this method is capable of identifying high-order epistasis from genome-wide data and can be applied in the study of complex diseases.


Subjects
Epistasis, Genetic/genetics , Genome-Wide Association Study/statistics & numerical data , Genome/genetics , Algorithms , Bayes Theorem , Humans , Models, Genetic , Polymorphism, Single Nucleotide/genetics
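
One plausible reading of the Gini index objective mentioned in the abstract above is a weighted Gini impurity over genotype-combination cells. The sketch below follows that reading; it is an assumption, not EpiMOGA's published formula, and the function name and toy data are hypothetical. Under this definition a lower score means the SNP combination separates cases from controls more cleanly.

```python
from collections import Counter, defaultdict

def gini_score(genotypes, labels):
    """Weighted Gini impurity of case/control labels within each genotype cell.

    genotypes: one tuple per individual, holding the genotype (0/1/2) at each
               SNP in the candidate combination
    labels:    one 0/1 label per individual (control/case)
    """
    cells = defaultdict(Counter)
    for genotype, label in zip(genotypes, labels):
        cells[tuple(genotype)][label] += 1

    total = len(labels)
    score = 0.0
    for counts in cells.values():
        cell_size = sum(counts.values())
        impurity = 1.0 - sum((c / cell_size) ** 2 for c in counts.values())
        score += (cell_size / total) * impurity  # weight each cell by its share of samples
    return score

# Toy data: a 2-SNP combination that separates cases from controls perfectly,
# and one that carries no information.
print(gini_score([(0, 0), (0, 0), (1, 2), (1, 2)], [0, 0, 1, 1]))
print(gini_score([(0, 0), (0, 0), (0, 0), (0, 0)], [0, 0, 1, 1]))
```

In a multi-objective search, a score like this would be optimized jointly with a second criterion such as the K2 score, so that no single measure dominates the ranking of candidate SNP combinations.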
10.
PeerJ ; 9: e11426, 2021.
Article in English | MEDLINE | ID: mdl-34055486

ABSTRACT

Long non-coding RNA (lncRNA)-microRNA (miRNA) interactions are quickly emerging as important mechanisms underlying the functions of non-coding RNAs. Accordingly, predicting lncRNA-miRNA interactions provides an important basis for understanding the mechanisms of action of ncRNAs. However, the accuracy of the established prediction methods is still limited. In this study, we used structural consistency to measure the predictability of interactive links based on a bilayer network by integrating information for known lncRNA-miRNA interactions, an lncRNA similarity network, and an miRNA similarity network. In particular, by using the structural perturbation method, we proposed a framework called SPMLMI to predict potential lncRNA-miRNA interactions based on the bilayer network. We found that the structural consistency of the bilayer network was higher than that of any single network, supporting the utility of bilayer network construction for the prediction of lncRNA-miRNA interactions. Applying SPMLMI to three real datasets, we obtained areas under the curves of 0.9512 ± 0.0034, 0.8767 ± 0.0033, and 0.8653 ± 0.0021 based on 5-fold cross-validation, suggesting good model performance. In addition, the generalizability of SPMLMI was better than that of the previously established methods. Case studies of two lncRNAs (i.e., SNHG14 and MALAT1) further demonstrated the feasibility and effectiveness of the method. Therefore, SPMLMI is a feasible approach to identify novel lncRNA-miRNA interactions underlying complex biological processes.
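
The 5-fold cross-validation protocol behind the "mean ± spread" AUC figures in the abstract above can be sketched generically. The sketch only shows how known interactions might be split into folds and how per-fold scores might be summarized; the function names are hypothetical, the per-fold AUC values are invented, and whether the authors report a standard deviation or another spread measure is not stated in the abstract.

```python
import random
import statistics

def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Each fold of known lncRNA-miRNA links is held out in turn; the model is
# fitted on the remaining folds and scored on the held-out one.
folds = k_fold_indices(n_items=23, k=5)
for held_out in folds:
    training = [i for fold in folds if fold is not held_out for i in fold]
    # ... fit on `training`, score on `held_out` (model not shown) ...

# Invented per-fold AUC values, for illustration only.
mean_auc, sd_auc = summarize([0.95, 0.96, 0.94, 0.95, 0.95])
print(f"{mean_auc:.4f} ± {sd_auc:.4f}")
```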

11.
Genes (Basel) ; 11(11)2020 10 29.
Article in English | MEDLINE | ID: mdl-33138076

ABSTRACT

Identifying perturbed pathways at an individual level is important for discovering the causes of cancer and developing individualized therapeutic strategies. Though prognostic gene lists have had success in prognosis prediction, using single genes that are related to the relevant system or specific network cannot fully reveal the process of tumorigenesis. We hypothesize that in individual samples, the disruption of transcription homeostasis can influence the occurrence, development, and metastasis of tumors and has implications for patient survival outcomes. Here, we introduce the individual-level pathway score (iPS), which effectively measures the correlation perturbation of pathways in a single sample. We applied this method to the expression data of 16 different cancer types from The Cancer Genome Atlas (TCGA) database. Our results indicate that different cancer types as well as their tumor-adjacent tissues can be clearly distinguished by the individual-level pathway score. Additionally, we found that there was strong heterogeneity among different cancer types, and that the percentage of perturbed pathways as well as the perturbation proportions of tumor samples in each pathway were significantly different. Finally, the prognosis-related pathways of different cancer types were obtained by survival analysis. We demonstrated that the iPS is capable of classifying cancer types and identifying some key prognosis-related pathways.


Subjects
Neoplasms/genetics , Case-Control Studies , Databases, Nucleic Acid , Female , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , Kaplan-Meier Estimate , Male , Neoplasms/classification , Neoplasms/mortality , Prognosis , RNA-Seq