Results 1 - 20 of 152
1.
Front Genet ; 15: 1444459, 2024.
Article in English | MEDLINE | ID: mdl-39184348

ABSTRACT

The detection of enhancer-promoter interactions (EPIs) is crucial for understanding gene expression regulation, disease mechanisms, and more. In this study, we developed TF-EPI, a deep learning model based on Transformer designed to detect these interactions solely from DNA sequences. The performance of TF-EPI surpassed that of other state-of-the-art methods on multiple benchmark datasets. Importantly, by utilizing the attention mechanism of the Transformer, we identified distinct cell type-specific motifs and sequences in enhancers and promoters, which were validated against databases such as JASPAR and UniBind, highlighting the potential of our method in discovering new biological insights. Moreover, our analysis of the transcription factors (TFs) corresponding to these motifs and short sequence pairs revealed the heterogeneity and commonality of gene regulatory mechanisms and demonstrated the ability to identify TFs relevant to the source information of the cell line. Finally, the introduction of transfer learning can mitigate the challenges posed by cell type-specific gene regulation, yielding enhanced accuracy in cross-cell line EPI detection. Overall, our work unveils important sequence information for the investigation of enhancer-promoter pairs based on the attention mechanism of the Transformer, providing an important milestone in the investigation of cis-regulatory grammar.

2.
J Med Virol ; 96(8): e29851, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39132689

ABSTRACT

Here, we performed single-cell RNA sequencing of S1 and receptor binding domain protein-specific B cells from convalescent COVID-19 patients with different clinical manifestations. This study aimed to evaluate the role and developmental pathway of atypical memory B cells (MBCs) in response to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. The results revealed a proinflammatory signature across B cell subsets associated with disease severity, as evidenced by the upregulation of genes such as GADD45B, MAP3K8, and NFKBIA in critical and severe individuals. Furthermore, the analysis of atypical MBCs suggested a developmental pathway similar to that of conventional MBCs through germinal centers, as indicated by the expression of several genes involved in germinal center processes, including CXCR4, CXCR5, BCL2, and MYC. Additionally, the upregulation of genes characteristic of the immune response in COVID-19, such as ZFP36 and DUSP1, suggested that the differentiation and activation of atypical MBCs may be influenced by exposure to SARS-CoV-2 and that these genes may contribute to the immune response for COVID-19 recovery. Our study contributes to a better understanding of atypical MBCs in COVID-19 and the role of other B cell subsets across different clinical manifestations.


Subject(s)
COVID-19 , Memory B Cells , SARS-CoV-2 , Single-Cell Analysis , Humans , COVID-19/immunology , COVID-19/virology , COVID-19/genetics , SARS-CoV-2/immunology , SARS-CoV-2/genetics , Memory B Cells/immunology , Male , Adult , Female , Middle Aged , Gene Expression Profiling , Transcriptome , Germinal Center/immunology , B-Lymphocytes/immunology , Aged
3.
Front Genet ; 15: 1407765, 2024.
Article in English | MEDLINE | ID: mdl-38974382

ABSTRACT

Preventing, diagnosing, and treating diseases requires accurate clinical biomarkers, which remains challenging. Recently, advanced computational approaches have accelerated the discovery of promising biomarkers from high-dimensional multimodal data. Although machine-learning methods have greatly contributed to the research fields, handling data sparseness, which is not unusual in research settings, is still an issue as it leads to limited interpretability and performance in the presence of missing information. Here, we propose a novel pipeline integrating joint non-negative matrix factorization (JNMF), identifying key features within sparse high-dimensional heterogeneous data, and a biological pathway analysis, interpreting the functionality of features by detecting activated signaling pathways. By applying our pipeline to large-scale public cancer datasets, we identified sets of genomic features relevant to specific cancer types as common pattern modules (CPMs) of JNMF. We further detected COPS5 as a potential upstream regulator of pathways associated with diffuse large B-cell lymphoma (DLBCL). COPS5 exhibited co-overexpression with MYC, TP53, and BCL2, known DLBCL marker genes, and its high expression was correlated with a lower survival probability of DLBCL patients. Using the CRISPR-Cas9 system, we confirmed the tumor growth effect of COPS5, which suggests it as a novel prognostic biomarker for DLBCL. Our results highlight that integrating multiple high-dimensional data and effectively decomposing them to interpretable dimensions unravels hidden biological importance, which enhances the discovery of clinical biomarkers.
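
The joint factorization idea in this abstract can be illustrated with a minimal sketch. It assumes the common formulation in which several non-negative data matrices that share the same samples are factorized with one shared basis matrix `W` and one coefficient matrix `H_k` per data type, using standard multiplicative updates. The function names, the update rule, and the defaults are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np


def joint_nmf(matrices, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize each X_k (samples x features_k) as W @ H_k with a shared W.

    Hypothetical helper: plain multiplicative updates for the summed
    squared Frobenius error, not the published JNMF implementation.
    """
    rng = np.random.default_rng(seed)
    n_samples = matrices[0].shape[0]
    W = rng.random((n_samples, rank))
    Hs = [rng.random((rank, X.shape[1])) for X in matrices]
    for _ in range(n_iter):
        # Update each data-type-specific coefficient matrix H_k.
        for k, X in enumerate(matrices):
            Hs[k] *= (W.T @ X) / (W.T @ W @ Hs[k] + eps)
        # Update the shared basis W using all data types at once.
        numer = sum(X @ H.T for X, H in zip(matrices, Hs))
        denom = sum(W @ (H @ H.T) for H in Hs) + eps
        W *= numer / denom
    return W, Hs


def reconstruction_error(matrices, W, Hs):
    """Sum of squared Frobenius errors over all data types."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(matrices, Hs))
```

In a sketch like this, features with large loadings in the same row of several `H_k` matrices would play the role of a common pattern module; how the published method selects module members is not described in the abstract.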

4.
NAR Genom Bioinform ; 6(2): lqae067, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38846348

ABSTRACT

Trans-splicing is a post-transcriptional processing event that joins exons from separate RNAs to produce a chimeric RNA. However, the detailed mechanism of trans-splicing remains poorly understood. Here, we characterize trans-spliced genes and provide insights into the mechanism of trans-splicing in the tunicate Ciona. Tunicates are the closest invertebrates to humans, and their genes frequently undergo trans-splicing. Our analysis revealed that, in genes that give rise to both trans-spliced and non-trans-spliced messenger RNAs, trans-splice acceptor sites were preferentially located at the first functional acceptor site, and their paired donor sites were weak in both Ciona and humans. Additionally, we found that Ciona trans-spliced genes had GU- and AU-rich 5' transcribed regions. Our data and findings are not only useful for the Ciona research community but may also aid in a better understanding of the trans-splicing mechanism, potentially advancing the development of gene therapy based on trans-splicing.

5.
NAR Genom Bioinform ; 6(2): lqae050, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38711859

ABSTRACT

Delineating the intricate interplay between promoter-proximal and -distal regulators is crucial for understanding the function of transcriptional mediator complexes implicated in the regulation of gene expression. The present study aimed to develop a computational method for accurately modeling the spatial proximal and distal regulatory interactions. Our method combined regression-based models to identify key regulators through gene expression prediction and a graph-embedding approach to detect coregulated genes. This approach enabled a detailed investigation of the gene regulatory mechanisms for germinal center B cells, accompanied by dramatic rearrangements of the genome structure. We found that while the promoter-proximal regulatory elements were the principal regulators of gene expression, the distal regulators fine-tuned transcription. Moreover, our approach unveiled the presence of modular regulators, such as cofactors and proximal/distal transcription factors, which were co-expressed with their target genes. Some of these modules exhibited abnormal expression patterns in lymphoma. These findings suggest that the dysregulation of interactions between transcriptional and architectural factors is associated with chromatin reorganization failure, which may increase the risk of malignancy. Therefore, our computational approach helps decipher the transcriptional code of spatially interacting cis-regulatory elements.

6.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38581422

ABSTRACT

Reliable cell type annotations are crucial for investigating cellular heterogeneity in single-cell omics data. Although various computational approaches have been proposed for single-cell RNA sequencing (scRNA-seq) annotation, high-quality cell labels are still lacking in single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) data, because of extreme sparsity and inconsistent chromatin accessibility between datasets. Here, we present a novel automated cell annotation method that transfers cell type information from a well-labeled scRNA-seq reference to an unlabeled scATAC-seq target, via a parallel graph neural network, in a semi-supervised manner. Unlike existing methods that utilize only gene expression or gene activity features, HyGAnno leverages genome-wide accessibility peak features to facilitate the training process. In addition, HyGAnno reconstructs a reference-target cell graph to detect cells with low prediction reliability, according to their specific graph connectivity patterns. HyGAnno was assessed across various datasets, showcasing its strengths in precise cell annotation, generating interpretable cell embeddings, robustness to noisy reference data and adaptability to tumor tissues.


Subject(s)
Chromatin , Neural Networks, Computer , Reproducibility of Results
7.
PeerJ ; 12: e17073, 2024.
Article in English | MEDLINE | ID: mdl-38500529

ABSTRACT

Background: Observational studies have demonstrated that a higher resting heart rate (RHR) is associated with an increased risk of dementia. However, it is not clear whether the association is causal. This study aimed to determine the causal effects of higher genetically predicted RHR on the risk of dementia. Methods: We performed a two-sample Mendelian randomization analysis to investigate the causal effect of higher genetically predicted RHR on Alzheimer's disease (AD) using summary statistics from genome-wide association studies. The generalized summary Mendelian randomization (GSMR) analysis was used to analyze the corresponding effects of RHR on the following outcomes: 1) diagnosis of AD (International Genomics of Alzheimer's Project), 2) family history (maternal and paternal) of AD from UK Biobank, 3) combined meta-analysis including these three GWAS results. Further analyses were conducted to determine the possibility of reverse causal association by adjusting for RHR-modifying medication. Results: The results of GSMR showed no significant causal effect of higher genetically predicted RHR on the risk of AD (βGSMR = 0.12, P = 0.30). GSMR applied to the maternal family history of AD (βGSMR = -0.18, P = 0.13) and to the paternal family history of AD (βGSMR = -0.14, P = 0.39) showed the same results. Furthermore, the results were robust after adjusting for RHR-modifying drugs (βGSMR = -0.03, P = 0.72). Conclusion: Our study did not find any evidence that supports a causal effect of RHR on dementia. Previous observational associations between RHR and dementia are likely attributed to the correlation between RHR and other cardiovascular diseases.
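
To make the two-sample Mendelian randomization step above more concrete, here is a minimal sketch of the fixed-effect inverse-variance-weighted (IVW) estimator computed from per-variant summary statistics. This is a simpler estimator than GSMR, which additionally accounts for linkage disequilibrium among instruments and filters likely pleiotropic variants; it is shown only to illustrate how per-variant effects are combined into one causal estimate. The function name and the input values in the usage note are hypothetical.

```python
import numpy as np


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    Hypothetical helper for illustration; it is not the GSMR method.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    # First-order weights: inverse variance of each per-variant Wald ratio.
    weights = bx**2 / se**2
    wald_ratio = by / bx
    beta = np.sum(weights * wald_ratio) / np.sum(weights)
    se_beta = np.sqrt(1.0 / np.sum(weights))
    return beta, se_beta
```

As a usage example with made-up numbers, `ivw_estimate([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.01, 0.01])` returns an estimate of 0.5, because every variant's outcome effect is exactly half of its exposure effect.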


Subject(s)
Alzheimer Disease , Genome-Wide Association Study , Humans , Alzheimer Disease/epidemiology , Biological Specimen Banks , Heart Rate/genetics , Mendelian Randomization Analysis , UK Biobank , Meta-Analysis as Topic
8.
Nat Genet ; 56(3): 473-482, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38361031

ABSTRACT

Chromatin accessibility is a hallmark of active regulatory regions and is functionally linked to transcriptional networks and cell identity. However, the molecular mechanisms and networks that govern chromatin accessibility have not been thoroughly studied. Here we conducted a genome-wide CRISPR screening combined with an optimized ATAC-see protocol to identify genes that modulate global chromatin accessibility. In addition to known chromatin regulators like CREBBP and EP400, we discovered a number of previously unrecognized proteins that modulate chromatin accessibility, including TFDP1, HNRNPU, EIF3D and THAP11 belonging to diverse biological pathways. ATAC-seq analysis upon their knockouts revealed their distinct and specific effects on chromatin accessibility. Remarkably, we found that TFDP1, a transcription factor, modulates global chromatin accessibility through transcriptional regulation of canonical histones. In addition, our findings highlight the manipulation of chromatin accessibility as an approach to enhance various cell engineering applications, including genome editing and induced pluripotent stem cell reprogramming.


Subject(s)
Chromatin , High-Throughput Nucleotide Sequencing , Chromatin/genetics , High-Throughput Nucleotide Sequencing/methods , Histones/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Regulatory Networks
9.
Nucleic Acids Res ; 52(3): 1107-1119, 2024 Feb 09.
Article in English | MEDLINE | ID: mdl-38084904

ABSTRACT

In this research, we elucidate the presence of around 11,000 housekeeping cis-regulatory elements (HK-CREs) and describe their main characteristics. Besides the trivial promoters of housekeeping genes, most HK-CREs reside in promoter regions and are involved in a broader role beyond housekeeping gene regulation. HK-CREs are conserved regions rich in unmethylated CpG sites. Their distribution highly correlates with that of protein-coding genes, and they interact with many genes over long distances. We observed reduced activity of a subset of HK-CREs in diverse cancer subtypes due to aberrant methylation, particularly those located in chromosome 19 and associated with zinc finger genes. Further analysis of samples from 17 cancer subtypes showed a significantly increased survival probability of patients with higher expression of these genes, suggesting them as housekeeping tumor suppressor genes. Overall, our work unravels the presence of housekeeping CREs indispensable for the maintenance and stability of cells.


Subject(s)
Neoplasms , Regulatory Sequences, Nucleic Acid , Humans , Promoter Regions, Genetic , Gene Expression Regulation , Neoplasms/genetics , Epigenesis, Genetic
10.
Brief Bioinform ; 24(6), 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37861173

ABSTRACT

NcRNA-encoded small peptides (ncPEPs) have recently emerged as promising targets and biomarkers for cancer immunotherapy. Therefore, identifying cancer-associated ncPEPs is crucial for cancer research. In this work, we propose CoraL, a novel supervised contrastive meta-learning framework for predicting cancer-associated ncPEPs. Specifically, the proposed meta-learning strategy enables our model to learn meta-knowledge from different types of peptides and train a promising predictive model even with few labeled samples. The results show that our model is capable of making high-confidence predictions on unseen cancer biomarkers with only five samples, potentially accelerating the discovery of novel cancer biomarkers for immunotherapy. Moreover, our approach remarkably outperforms existing deep learning models on 15 cancer-associated ncPEPs datasets, demonstrating its effectiveness and robustness. Interestingly, our model exhibits outstanding performance when extended for the identification of short open reading frames derived from ncPEPs, demonstrating the strong prediction ability of CoraL at the transcriptome level. Importantly, our feature interpretation analysis discovers unique sequential patterns as the fingerprint for each cancer-associated ncPEPs, revealing the relationship among certain cancer biomarkers that are validated by relevant literature and motif comparison. Overall, we expect CoraL to be a useful tool to decipher the pathogenesis of cancer and provide valuable information for cancer research. The dataset and source code of our proposed method can be found at https://github.com/Johnsunnn/CoraL.


Subject(s)
Anthozoa , Neoplasms , Animals , Anthozoa/genetics , Neoplasms/genetics , Biomarkers, Tumor/genetics , Immunotherapy , Peptides/genetics , RNA, Untranslated
12.
Nat Aging ; 3(8): 1001-1019, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37474791

ABSTRACT

Protein misfolding is a major factor of neurodegenerative diseases. Post-mitotic neurons are highly susceptible to protein aggregates that are not diluted by mitosis. Therefore, post-mitotic cells may have a specific protein quality control system. Here, we show that LONRF2 is a bona fide protein quality control ubiquitin ligase induced in post-mitotic senescent cells. Under unperturbed conditions, LONRF2 is predominantly expressed in neurons. LONRF2 binds and ubiquitylates abnormally structured TDP-43 and hnRNP M1 and artificially misfolded proteins. Lonrf2-/- mice exhibit age-dependent TDP-43-mediated motor neuron (MN) degeneration and cerebellar ataxia. Mouse induced pluripotent stem cell-derived MNs lacking LONRF2 showed reduced survival, shortening of neurites and accumulation of pTDP-43 and G3BP1 after long-term culture. The shortening of neurites in MNs from patients with amyotrophic lateral sclerosis is rescued by ectopic expression of LONRF2. Our findings reveal that LONRF2 is a protein quality control ligase whose loss may contribute to MN degeneration and motor deficits.


Subject(s)
Motor Neurons , Ubiquitin , Mice , Animals , Motor Neurons/metabolism , Ubiquitin/metabolism , Ligases/metabolism , DNA Helicases/metabolism , Poly-ADP-Ribose Binding Proteins/metabolism , RNA Helicases/metabolism , RNA Recognition Motif Proteins/metabolism , DNA-Binding Proteins/genetics
13.
Nucleic Acids Res ; 51(7): 3017-3029, 2023 Apr 24.
Article in English | MEDLINE | ID: mdl-36796796

ABSTRACT

Here, we present DeepBIO, the first-of-its-kind automated and interpretable deep-learning platform for high-throughput biological sequence functional analysis. DeepBIO is a one-stop-shop web service that enables researchers to develop new deep-learning architectures to answer any biological question. Specifically, given any biological sequence data, DeepBIO supports a total of 42 state-of-the-art deep-learning algorithms for model training, comparison, optimization and evaluation in a fully automated pipeline. DeepBIO provides a comprehensive result visualization analysis for predictive models covering several aspects, such as model interpretability, feature analysis and functional sequential region discovery. Additionally, DeepBIO supports nine base-level functional annotation tasks using deep-learning architectures, with comprehensive interpretations and graphical visualizations to validate the reliability of annotated sites. Empowered by high-performance computers, DeepBIO allows ultra-fast prediction with up to million-scale sequence data in a few hours, demonstrating its usability in real application scenarios. Case study results show that DeepBIO provides an accurate, robust and interpretable prediction, demonstrating the power of deep learning in biological sequence functional analysis. Overall, we expect DeepBIO to ensure the reproducibility of deep-learning biological sequence analysis, lessen the programming and hardware burden for biologists and provide meaningful functional insights at both the sequence level and base level from biological sequences alone. DeepBIO is publicly available at https://inner.wei-group.net/DeepBIO.


The development of next-generation sequencing techniques has led to an exponential increase in the amount of biological sequence data accessible. It naturally poses a fundamental challenge: how to build the relationships from such large-scale sequences to their functions. In this work, we present DeepBIO, the first-of-its-kind automated and interpretable deep-learning platform for high-throughput biological sequence functional analysis. It enables researchers to develop new deep-learning architectures to answer any biological question in a fully automated pipeline. We expect DeepBIO to ensure the reproducibility of deep-learning-based biological sequence analysis, lessen the programming and hardware burden for biologists and provide meaningful functional insights at both the sequence level and base level from biological sequences alone.


Subject(s)
Deep Learning , Reproducibility of Results , Algorithms , High-Throughput Nucleotide Sequencing
14.
Front Immunol ; 14: 1304778, 2023.
Article in English | MEDLINE | ID: mdl-38173717

ABSTRACT

Macrophages display extreme plasticity, and the mechanisms and applications of polarization and de-/repolarization of macrophages have been extensively investigated. However, the regulation of macrophage hysteresis after de-/repolarization remains unclear. In this study, by using a large-scale computational analysis of macrophage multi-omics data, we report a list of hysteresis genes that maintain their expression patterns after polarization and de-/repolarization. While the polarization in M1 macrophages leads to a higher level of hysteresis in genes associated with cell cycle progression, cell migration, and enhancement of the immune response, we found weak levels of hysteresis after M2 polarization. During the polarization process from M0 to M1 and back to M0, the factors IRFs/STAT, AP-1, and CTCF regulate hysteresis by altering their binding sites to the chromatin. Overall, our results show that a history of polarization can lead to hysteresis in gene expression and chromatin accessibility over a given period. This study contributes to the understanding of de-/repolarization memory in macrophages.


Subject(s)
Chromatin , Transcription Factor AP-1 , Transcription Factor AP-1/genetics , Transcription Factor AP-1/metabolism , Chromatin/genetics , Chromatin/metabolism , Multiomics , Macrophages
15.
NAR Genom Bioinform ; 4(4): lqac087, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36458020

ABSTRACT

Several factors, including tissue origins and culture conditions, affect the gene expression of undifferentiated stem cells. However, the basic identity shared across different stem cells has not been well studied despite its importance in stem cell biology. Thus, we aimed to rank the relative importance of multiple factors for the gene expression profiles of undifferentiated human stem cells by analyzing publicly available RNA-seq datasets. We first conducted batch effect correction to remove as much undefined variance from the dataset as possible. Then, we highlighted the relative impact of biological and technical factors among undifferentiated stem cell types: a greater influence of tissue origin in induced pluripotent stem cells than in other stem cell types, and a stronger impact of culture condition in embryonic stem cells and somatic stem cell types, including mesenchymal stem cells and hematopoietic stem cells. In addition, we found that a characteristic gene module, enriched in histones, exhibits higher expression across different stem cell types that were annotated by specific culture conditions. This tendency was also observed in mouse stem cell RNA-seq data. Our findings should help to obtain general insights into stem cell quality, such as the balance of differentiation potentials that undifferentiated stem cells possess.

16.
Genome Biol ; 23(1): 219, 2022 Oct 17.
Article in English | MEDLINE | ID: mdl-36253864

ABSTRACT

In this study, we propose iDNA-ABF, a multi-scale deep biological language learning model that enables the interpretable prediction of DNA methylations based on genomic sequences only. Benchmarking comparisons show that our iDNA-ABF outperforms state-of-the-art methods for different methylation predictions. Importantly, we show the power of deep language learning in capturing both sequential and functional semantics information from background genomes. Moreover, by integrating the interpretable analysis mechanism, we well explain what the model learns, helping us build the mapping from the discovery of important sequential determinants to the in-depth analysis of their biological functions.


Subject(s)
DNA Methylation , Language , Genomics , Models, Biological
17.
Front Bioinform ; 2: 910531, 2022.
Article in English | MEDLINE | ID: mdl-36304291

ABSTRACT

Prediction of subcellular localization of proteins from their amino acid sequences has a long history in bioinformatics and is still actively developing, incorporating the latest advances in machine learning and proteomics. Notably, deep learning-based methods for natural language processing have made great contributions. Here, we review recent advances in the field as well as its related fields, such as subcellular proteomics and the prediction/recognition of subcellular localization from image data.

18.
Nat Commun ; 13(1): 4063, 2022 Jul 13.
Article in English | MEDLINE | ID: mdl-35831322

ABSTRACT

Point-mutations of MEK1, a central component of ERK signaling, are present in cancer and RASopathies, but their precise biological effects remain obscure. Here, we report a mutant MEK1 structure that uncovers the mechanisms underlying abnormal activities of cancer- and RASopathy-associated MEK1 mutants. These two classes of MEK1 mutations differentially impact on spatiotemporal dynamics of ERK signaling, cellular transcriptional programs, gene expression profiles, and consequent biological outcomes. By making use of such distinct characteristics of the MEK1 mutants, we identified cancer- and RASopathy-signature genes that may serve as diagnostic markers or therapeutic targets for these diseases. In particular, two AKT-inhibitor molecules, PHLDA1 and 2, are simultaneously upregulated by oncogenic ERK signaling, and mediate cancer-specific ERK-AKT crosstalk. The combined expression of PHLDA1/2 is critical to confer resistance to ERK pathway-targeted therapeutics on cancer cells. Finally, we propose a therapeutic strategy to overcome this drug resistance. Our data provide vital insights into the etiology, diagnosis, and therapeutic strategy of cancers and RASopathies.


Subject(s)
Neoplasms , Proto-Oncogene Proteins c-akt , Humans , MAP Kinase Kinase 1/genetics , MAP Kinase Signaling System/genetics , Mitogen-Activated Protein Kinase Kinases/metabolism , Neoplasms/metabolism , Protein Kinase Inhibitors/pharmacology , Protein Kinase Inhibitors/therapeutic use , Proto-Oncogene Proteins c-akt/genetics , Proto-Oncogene Proteins c-akt/metabolism , Signal Transduction/genetics
19.
Metabolites ; 12(7), 2022 Jul 11.
Article in English | MEDLINE | ID: mdl-35888758

ABSTRACT

Taurine, a sulfur-containing β-amino acid, is present at high concentrations in mammalian tissues and plays an important role in several essential biological processes. However, the genetic mechanisms involved in these physiological processes associated with taurine remain unclear. In this study, we investigated the regulatory mechanism underlying the taurine-induced transcriptional enhancement of the thioredoxin-interacting protein (TXNIP). The results showed that taurine significantly increased the luciferase activity of the human TXNIP promoter. Further, deletion analysis of the TXNIP promoter showed that taurine induced luciferase activity only in the TXNIP promoter region (+200 to +218). Furthermore, by employing a bioinformatic analysis using the TRANSFAC database, we focused on Tst-1 and Ets-1 as candidates involved in taurine-induced transcription and found that the mutation in the Ets-1 sequence did not enhance transcriptional activity by taurine. Additionally, chromatin immunoprecipitation assays indicated that the binding of Ets-1 to the TXNIP promoter region was enhanced by taurine. Taurine also increased the levels of phosphorylated Ets-1, indicating activation of the Ets-1 pathway by taurine. Moreover, an ERK cascade inhibitor significantly suppressed the taurine-induced increase in TXNIP mRNA levels and transcriptional enhancement of TXNIP. These results suggest that taurine enhances TXNIP expression by activating transcription factor Ets-1 via the ERK cascade.

20.
Bioinformatics ; 38(13): 3351-3360, 2022 Jun 27.
Article in English | MEDLINE | ID: mdl-35604077

ABSTRACT

SUMMARY: Identifying the protein-peptide binding residues is fundamentally important to understand the mechanisms of protein functions and explore drug discovery. Although several computational methods have been developed, most of them rely heavily on third-party tools or complex data preprocessing for feature design, easily resulting in low computational efficacy and suffering from low predictive performance. To address these limitations, we propose PepBCL, a novel BERT (Bidirectional Encoder Representation from Transformers)-based contrastive learning framework to predict the protein-peptide binding residues based on protein sequences only. PepBCL is an end-to-end predictive model that is independent of feature engineering. Specifically, we introduce a well pre-trained protein language model that can automatically extract and learn high-latent representations of protein sequences relevant for protein structures and functions. Further, we design a novel contrastive learning module to optimize the feature representations of binding residues underlying the imbalanced dataset. We demonstrate that our proposed method significantly outperforms the state-of-the-art methods under benchmarking comparison, and achieves more robust performance. Moreover, we found that we can further improve the performance via the integration of traditional features and our learnt features. Interestingly, the interpretable analysis of our model highlights the flexibility and adaptability of deep learning-based protein language models to capture both conserved and non-conserved sequential characteristics of peptide-binding residues. Finally, to facilitate the use of our method, we establish an online predictive platform as the implementation of the proposed PepBCL, which is now available at http://server.wei-group.net/PepBCL/. AVAILABILITY AND IMPLEMENTATION: https://github.com/Ruheng-W/PepBCL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
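
The abstract does not specify the loss used in the contrastive learning module, so the following is only a sketch of one widely used option, a supervised contrastive loss over residue embeddings, in which residues with the same label (binding or non-binding) are pulled together and all others are pushed apart. The function name, the temperature value, and the choice of this particular loss are assumptions for illustration and may differ from what PepBCL implements.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical helper, not PepBCL's code)."""
    z = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Work with cosine similarities scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(y)
    not_self = ~np.eye(n, dtype=bool)
    # Numerically stable log-sum-exp over every other sample in the batch.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # Positives are other samples that share the anchor's label.
    positives = (y[:, None] == y[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    # Average over anchors that have at least one positive; the batch must contain one.
    return float(np.mean(-pos_sum[has_pos] / n_pos[has_pos]))
```

With an imbalanced dataset, a loss of this kind is typically computed on batches sampled so that the rare class has at least two members, since an anchor without any positive contributes nothing.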


Subject(s)
Deep Learning , Proteins/chemistry , Peptides , Protein Binding , Amino Acid Sequence