Search | BVS España Search Portal

1.

Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model.

Meng, Qiaozhen; Guo, Fei; Tang, Jijun.

Brief Bioinform ; 24(4)2023 07 20.

Article in English | MEDLINE | ID: mdl-37321965

ABSTRACT

In recent years, protein structure problems have become a hotspot for understanding protein folding and function mechanisms. Most protein structure work relies on and benefits from co-evolutionary information obtained by multiple sequence alignment (MSA). As an example, AlphaFold2 (AF2) is a typical MSA-based protein structure tool which is famous for its high accuracy. As a consequence, these MSA-based methods are limited by the quality of the MSAs. Especially for orphan proteins that have no homologous sequence, AlphaFold2 performs unsatisfactorily as MSA depth decreases, which may pose a barrier to its widespread application in protein mutation and design problems in which there are no rich homologous sequences and rapid prediction is needed. In this paper, we constructed two standard datasets for orphan and de novo proteins which have insufficient or no homology information, called Orphan62 and Design204, respectively, to fairly evaluate the performance of the various methods in this case. Then, depending on whether scarce MSA information is used, we summarized two approaches, MSA-enhanced and MSA-free methods, to effectively solve the issue without sufficient MSAs. The MSA-enhanced model aims to improve poor MSA quality from the data source by knowledge distillation and generation models. The MSA-free model directly learns the relationship between residues on enormous protein sequences from pre-trained models, bypassing the step of extracting the residue pair representation from MSA. Next, we evaluated the performance of four MSA-free methods (trRosettaX-Single, TRFold, ESMFold and ProtT5) and one MSA-enhanced method (Bagging MSA) compared with a traditional MSA-based method, AlphaFold2, in two protein structure-related prediction tasks. Comparison analyses show that trRosettaX-Single and ESMFold, which are MSA-free methods, can achieve fast prediction ($\sim 40$ s) and comparable performance compared with AF2 in tertiary structure prediction, especially for short peptides, $\alpha$-helical segments and targets with few homologous sequences. Bagging MSA, which uses MSA enhancement, improves the accuracy of our trained base model, which is an MSA-based method, in secondary structure prediction when homology information is poor. Our study provides biologists with insight into how to select rapid and appropriate prediction tools for enzyme engineering and peptide drug development. CONTACT: guofei@csu.edu.cn, jj.tang@siat.ac.cn.

Subject(s)

Algorithms , Furylfuramide , Sequence Alignment , Proteins/chemistry , Amino Acid Sequence

2.

IK-DDI: a novel framework based on instance position embedding and key external text for DDI extraction.

Dou, Mingliang; Ding, Jiaqi; Chen, Genlang; Duan, Junwen; Guo, Fei; Tang, Jijun.

Brief Bioinform ; 24(3)2023 05 19.

Article in English | MEDLINE | ID: mdl-36932655

ABSTRACT

Determining drug-drug interactions (DDIs) is an important part of pharmacovigilance and has a vital impact on public health. Compared with drug trials, obtaining DDI information from scientific articles is a faster, lower-cost, yet still highly credible approach. However, current DDI text extraction methods consider the instances generated from articles to be independent and ignore the potential connections between different instances in the same article or sentence. Effective use of external text data could improve prediction accuracy, but existing methods cannot extract key information from external data accurately and reasonably, resulting in low utilization of external data. In this study, we propose a DDI extraction framework, instance position embedding and key external text for DDI (IK-DDI), which adopts instance position embedding and key external text to extract DDI information. The proposed framework integrates the article-level and sentence-level position information of the instances into the model to strengthen the connections between instances generated from the same article or sentence. Moreover, we introduce a comprehensive similarity-matching method that uses string and word sense similarity to improve the matching accuracy between the target drug and external text. Furthermore, the key sentence search method is used to obtain key information from external data. Therefore, IK-DDI can make full use of the connection between instances and the information contained in external text data to improve the efficiency of DDI extraction. Experimental results show that IK-DDI outperforms existing methods on both macro-averaged and micro-averaged metrics, which suggests that our method provides a complete framework that can be used to extract relationships between biomedical entities and process external text data.
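Illustrative note: the abstract describes matching a target drug to external text by combining string similarity with word sense similarity. The sketch below shows one possible way to blend two such scores. It is not the IK-DDI implementation: the synonym-set lookup standing in for word sense similarity, the weight `alpha`, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sense_similarity(a: str, b: str, synonyms: dict) -> float:
    """Hypothetical stand-in for word sense similarity: Jaccard overlap
    of synonym sets. A name with no entry is its own only synonym."""
    sa = synonyms.get(a.lower(), {a.lower()})
    sb = synonyms.get(b.lower(), {b.lower()})
    return len(sa & sb) / len(sa | sb)


def combined_similarity(a: str, b: str, synonyms: dict, alpha: float = 0.5) -> float:
    """Weighted blend of both scores; alpha is an assumed mixing weight."""
    return alpha * string_similarity(a, b) + (1 - alpha) * sense_similarity(a, b, synonyms)


def best_match(target: str, candidates: list, synonyms: dict, alpha: float = 0.5) -> str:
    """Return the external-text entry name that best matches the target drug."""
    return max(candidates, key=lambda c: combined_similarity(target, c, synonyms, alpha))
```

With a synonym table that links "aspirin" and "acetylsalicylic acid", the blended score prefers the synonym over a merely similar-looking string such as "aspartame", which a purely string-based match might not do.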

Subject(s)

Data Mining , Pharmacovigilance , Data Mining/methods , Drug Interactions , Benchmarking , Drug Delivery Systems

3.

MVML-MPI: Multi-View Multi-Label Learning for Metabolic Pathway Inference.

Liu, Xiaoyi; Yang, Hongpeng; Ai, Chengwei; Ding, Yijie; Guo, Fei; Tang, Jijun.

Brief Bioinform ; 24(6)2023 09 22.

Article in English | MEDLINE | ID: mdl-37930024

ABSTRACT

Development of robust and effective strategies for synthesizing new compounds, drug targeting and constructing GEnome-scale Metabolic models (GEMs) requires a deep understanding of the underlying biological processes. A critical step in achieving this goal is accurately identifying the categories of pathways in which a compound participated. However, current machine learning-based methods often overlook the multifaceted nature of compounds, resulting in inaccurate pathway predictions. Therefore, we present a novel framework on Multi-View Multi-Label Learning for Metabolic Pathway Inference, hereby named MVML-MPI. First, MVML-MPI learns the distinct compound representations in parallel with corresponding compound encoders to fully extract features. Subsequently, we propose an attention-based mechanism that offers a fusion module to complement these multi-view representations. As a result, MVML-MPI accurately represents and effectively captures the complex relationship between compounds and metabolic pathways and distinguishes itself from current machine learning-based methods. In experiments conducted on the Kyoto Encyclopedia of Genes and Genomes pathways dataset, MVML-MPI outperformed state-of-the-art methods, demonstrating the superiority of MVML-MPI and its potential to utilize the field of metabolic pathway design, which can aid in optimizing drug-like compounds and facilitating the development of GEMs. The code and data underlying this article are freely available at https://github.com/guofei-tju/MVML-MPI. Contact: jtang@cse.sc.edu, guofei@csu.edu.com or wuxi_dyj@csj.uestc.edu.cn.

Subject(s)

Machine Learning , Metabolic Networks and Pathways

4.

Identification of drug-target interactions via multiple kernel-based triple collaborative matrix factorization.

Ding, Yijie; Tang, Jijun; Guo, Fei; Zou, Quan.

Brief Bioinform ; 23(2)2022 03 10.

Article in English | MEDLINE | ID: mdl-35134117

ABSTRACT

Targeted drugs have been applied to the treatment of cancer on a large scale, and they have shown therapeutic effects in some patients. It is a time-consuming task to detect drug-target interactions (DTIs) through biochemical experiments. At present, machine learning (ML) has been widely applied in large-scale drug screening. However, there are few methods for multiple information fusion. We propose a multiple kernel-based triple collaborative matrix factorization (MK-TCMF) method to predict DTIs. The multiple kernel matrices (containing chemical, biological and clinical information) are integrated via a multi-kernel learning (MKL) algorithm. The original adjacency matrix of DTIs is decomposed into three matrices: the latent feature matrix of the drug space, the latent feature matrix of the target space and the bi-projection matrix (used to join the two feature spaces). To obtain better prediction performance, the MKL algorithm adjusts the weight of each kernel matrix according to the prediction error. The weights of drug side-effects and target sequence are the highest. Compared with other computational methods, our model has better performance on four test data sets.
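Illustrative note: the abstract names two building blocks, a weighted combination of kernel matrices and a three-matrix decomposition in which a bi-projection matrix joins the drug and target latent spaces. The sketch below shows only these two pieces in plain Python. It is not the MK-TCMF optimization itself: the weight normalization, the function names, and the toy shapes are assumptions, and how the weights and factors are learned is left out.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of square kernel matrices (lists of lists).
    Weights are normalized to sum to 1 (an assumed convention)."""
    total = sum(weights)
    n = len(kernels[0])
    return [
        [sum(w / total * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


def interaction_score(drug_vec, theta, target_vec):
    """Score for one drug-target pair under a three-matrix factorization:
    drug_vec^T * theta * target_vec, where theta plays the role of the
    bi-projection matrix joining the two latent feature spaces."""
    return sum(
        drug_vec[p] * theta[p][q] * target_vec[q]
        for p in range(len(drug_vec))
        for q in range(len(target_vec))
    )
```

Because the bi-projection matrix need not be square, the drug and target latent spaces can have different dimensions, for example 2 and 3 in `interaction_score([1, 2], [[1, 0, 2], [0, 1, 0]], [1, 1, 1])`.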

Subject(s)

Algorithms , Drug-Related Side Effects and Adverse Reactions , Drug Interactions , Humans , Machine Learning

5.

A hybrid deep learning framework for gene regulatory network inference from single-cell transcriptomic data.

Zhao, Mengyuan; He, Wenying; Tang, Jijun; Zou, Quan; Guo, Fei.

Brief Bioinform ; 23(2)2022 03 10.

Article in English | MEDLINE | ID: mdl-35062026

ABSTRACT

Inferring gene regulatory networks (GRNs) based on gene expression profiles is able to provide an insight into a number of cellular phenotypes from the genomic level and reveal the essential laws underlying various life phenomena. Different from the bulk expression data, single-cell transcriptomic data embody cell-to-cell variance and diverse biological information, such as tissue characteristics, transformation of cell types, etc. Inferring GRNs based on such data offers unprecedented advantages for making a profound study of cell phenotypes, revealing gene functions and exploring potential interactions. However, the high sparsity, noise and dropout events of single-cell transcriptomic data pose new challenges for regulation identification. We develop a hybrid deep learning framework for GRN inference from single-cell transcriptomic data, DGRNS, which encodes the raw data and fuses recurrent neural network and convolutional neural network (CNN) to train a model capable of distinguishing related gene pairs from unrelated gene pairs. To overcome the limitations of such datasets, it applies sliding windows to extract valuable features while preserving the direction of regulation. DGRNS is constructed as a deep learning model containing gated recurrent unit network for exploring time-dependent information and CNN for learning spatially related information. Our comprehensive and detailed comparative analysis on the dataset of mouse hematopoietic stem cells illustrates that DGRNS outperforms state-of-the-art methods. For the networks inferred by DGRNS, the area under the receiver operating characteristic curve is about 16% higher than that of other unsupervised methods, and the area under the precision-recall curve is 10% higher than that of other supervised methods. Experiments on human datasets show the strong robustness and excellent generalization of DGRNS. By comparing the predictions with the standard network, we discover a series of novel interactions which are proved to be true in some specific cell types. Importantly, DGRNS identifies a series of regulatory relationships with high confidence and functional consistency, which have not yet been experimentally confirmed and merit further research.
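Illustrative note: the abstract mentions sliding windows that extract features while preserving the direction of regulation. One simple way to read this is sketched below: windows are cut from the ordered expression profiles of a candidate regulator and a candidate target, and the regulator window is always placed first so that the pair (A, B) yields different features from (B, A). The window width, stride, and function names are assumptions, not the DGRNS encoding.

```python
def sliding_windows(series, width, stride=1):
    """Cut an ordered expression profile into overlapping windows."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, stride)]


def pair_features(regulator, target, width, stride=1):
    """Build one feature row per window position for an ordered gene pair.
    The regulator window comes first, so swapping the two genes changes
    the features and the direction of the candidate edge is kept."""
    if len(regulator) != len(target):
        raise ValueError("both profiles must cover the same ordered cells")
    reg_windows = sliding_windows(regulator, width, stride)
    tgt_windows = sliding_windows(target, width, stride)
    return [r + t for r, t in zip(reg_windows, tgt_windows)]
```

Each row could then be fed to a sequence model; which model consumes the rows, and how, is outside this sketch.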

Subject(s)

Deep Learning , Gene Regulatory Networks , Algorithms , Animals , Mice , Neural Networks, Computer , Transcriptome

6.

Two-stage-vote ensemble framework based on integration of mutation data and gene interaction network for uncovering driver genes.

Kan, Yingxin; Jiang, Limin; Guo, Yan; Tang, Jijun; Guo, Fei.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34791034

ABSTRACT

Precisely identifying driver genes among the large number of mutated genes promotes accurate diagnosis and treatment of cancer. In recent years, work on uncovering driver genes based on integration of mutation data and gene interaction networks has been gaining more attention. However, it remains unclear whether integrating various types of mutation information (frequency and functional impact) with gene networks is more effective for prioritizing driver genes. Hence, we build a two-stage-vote ensemble framework based on somatic mutations and mutual interactions. Specifically, we first represent and combine various kinds of mutation information, which are propagated through networks by an improved iterative framework. The first vote is conducted on iteration results by voting methods, and the second vote is performed to get ensemble results of the first poll for the final driver gene list. Compared with four excellent previous approaches, our method has better performance in identifying driver genes on $33$ types of cancer from The Cancer Genome Atlas. Meanwhile, we also conduct a comparative analysis about two kinds of mutation information, five gene interaction networks and four voting strategies. Our framework offers a new view for data integration and helps more latent cancer genes to be recognized.
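Illustrative note: the abstract describes a first vote over iteration results and a second vote over the outcomes of the first. The sketch below shows that two-level shape with a Borda count used at both stages. The abstract mentions four voting strategies without naming them, so Borda is an assumption chosen for illustration, as are the grouping of rankings and the toy gene lists.

```python
def borda(rankings):
    """Borda count: in a ranking of n genes, position 0 earns n points,
    position 1 earns n - 1, and so on. Ties are broken alphabetically."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0) + (n - pos)
    return sorted(scores, key=lambda g: (-scores[g], g))


def two_stage_vote(groups):
    """Stage 1: vote within each group of rankings (for example, all
    propagation results obtained on one network). Stage 2: vote across
    the stage-1 winners to obtain the final ordered gene list."""
    first_round = [borda(group) for group in groups]
    return borda(first_round)
```

For example, `two_stage_vote([[r1, r2], [r3]])` first merges `r1` and `r2`, then merges that result with `r3`, so a gene ranked highly in only one group can still be outvoted in the second stage.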

Subject(s)

Gene Regulatory Networks , Neoplasms , Epistasis, Genetic , Humans , Mutation , Neoplasms/genetics , Oncogenes

7.

CoMutDB: the landscape of somatic mutation co-occurrence in cancers.

Jiang, Limin; Yu, Hui; Tang, Jijun; Guo, Yan.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36355452

ABSTRACT

MOTIVATION: Somatic mutation co-occurrence has been proven to have a profound effect on tumorigenesis. While some studies have been conducted on co-mutations, a centralized resource dedicated to co-mutations in cancer is still lacking. RESULTS: Using multi-omics data from over 30 000 subjects and 1747 cancer cell lines, we present the Cancer co-mutation database (CoMutDB), the most comprehensive resource devoted to describing cancer co-mutations and their characteristics. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available in the online database CoMutDB: http://www.innovebioinfo.com/Database/CoMutDB/Home.php.

Subject(s)

Neoplasms , Humans , Mutation , Databases, Factual , Neoplasms/genetics , Carcinogenesis , Cell Transformation, Neoplastic

8.

SBSA: an online service for somatic binding sequence annotation.

Jiang, Limin; Guo, Fei; Tang, Jijun; Yu, Hui; Ness, Scott; Duan, Mingrui; Mao, Peng; Zhao, Ying-Yong; Guo, Yan.

Nucleic Acids Res ; 50(1): e4, 2022 01 11.

Article in English | MEDLINE | ID: mdl-34606615

ABSTRACT

Efficient annotation of alterations in binding sequences of molecular regulators can help identify novel candidates for mechanisms study and offer original therapeutic hypotheses. In this work, we developed Somatic Binding Sequence Annotator (SBSA) as a full-capacity online tool to annotate altered binding motifs/sequences, addressing diverse types of genomic variants and molecular regulators. The genomic variants can be somatic mutation, single nucleotide polymorphism, RNA editing, etc. The binding motifs/sequences involve transcription factors (TFs), RNA-binding proteins, miRNA seeds, miRNA-mRNA 3'-UTR binding target, or can be any custom motifs/sequences. Compared to similar tools, SBSA is the first to support miRNA seeds and miRNA-mRNA 3'-UTR binding target, and it unprecedentedly implements a personalized genome approach that accommodates joint adjacent variants. SBSA is empowered to support an indefinite species, including preloaded reference genomes for SARS-Cov-2 and 25 other common organisms. We demonstrated SBSA by annotating multi-omics data from over 30,890 human subjects. Of the millions of somatic binding sequences identified, many are with known severe biological repercussions, such as the somatic mutation in TERT promoter region which causes a gained binding sequence for E26 transformation-specific factor (ETS1). We further validated the function of this TERT mutation using experimental data in cancer cells. Availability:http://innovebioinfo.com/Annotation/SBSA/SBSA.php.

Subject(s)

COVID-19/virology , Computational Biology/instrumentation , Genomics/instrumentation , Mutation , Proteomics/instrumentation , SARS-CoV-2 , 3' Untranslated Regions , Algorithms , Amino Acid Motifs , COVID-19/metabolism , Computational Biology/methods , Computers , Genetic Techniques , Genome, Human , Genomics/methods , Humans , Internet , MicroRNAs/metabolism , Phenotype , Promoter Regions, Genetic , Protein Binding , Proteomics/methods , Proto-Oncogene Protein c-ets-1/genetics , Proto-Oncogene Protein c-ets-1/metabolism , RNA-Binding Proteins/metabolism , Telomerase/metabolism

9.

Exploring associations of non-coding RNAs in human diseases via three-matrix factorization with hypergraph-regular terms on center kernel alignment.

Wang, Hao; Tang, Jijun; Ding, Yijie; Guo, Fei.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33443536

ABSTRACT

Accurately identifying associations between non-coding RNAs and diseases could be of great help in human biomedical research and treatment. However, the traditional technology is only applied on one type of non-coding RNA or a specific disease, and the experimental method is time-consuming and expensive. More computational tools have been proposed to detect new associations based on known ncRNA and disease information. Because ncRNAs (circRNAs, miRNAs and lncRNAs) have a close relationship with the progression of various human diseases, it is critical to develop effective computational predictors for ncRNA-disease association prediction. In this paper, we propose a new computational method of three-matrix factorization with hypergraph regularization terms (HGRTMF) based on central kernel alignment (CKA), for identifying general ncRNA-disease associations. In the process of constructing the similarity matrix, various types of similarity matrices are applicable to circRNAs, miRNAs and lncRNAs. Our method achieves excellent performance on five datasets, involving three types of ncRNAs. In the test, we obtain best area under the curve scores of $0.9832$, $0.9775$, $0.9023$, $0.8809$ and $0.9185$ via 5-fold cross-validation and $0.9832$, $0.9836$, $0.9198$, $0.9459$ and $0.9275$ via leave-one-out cross-validation on five datasets. Furthermore, our novel method (CKA-HGRTMF) is also able to discover new associations between ncRNAs and diseases accurately. Availability: Codes and data are available: https://github.com/hzwh6910/ncRNA2Disease.git. Contact: fguo@tju.edu.cn.
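Illustrative note: the abstract reports area under the curve (AUC) scores from 5-fold and leave-one-out cross-validation. The sketch below shows how such an evaluation can be set up in principle: a rank-based AUC and a deterministic split of sample indices into folds, with leave-one-out as the special case where the number of folds equals the number of samples. The function names and the interleaved fold assignment are assumptions; the paper's own protocol (for example, random shuffling) may differ.

```python
def auc(scores, labels):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive sample gets the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k interleaved test folds.
    With k == n_samples this is leave-one-out cross-validation."""
    return [list(range(i, n_samples, k)) for i in range(k)]
```

In each round, the associations in one fold are hidden, the model is fitted on the rest, and the hidden entries are scored; the AUC is then computed on the held-out scores.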

Subject(s)

Algorithms , Computational Biology , Disease/genetics , Models, Genetic , RNA, Untranslated , Humans , RNA, Untranslated/genetics , RNA, Untranslated/metabolism

10.

Exploring effectiveness of ab-initio protein-protein docking methods on a novel antibacterial protein complex dataset.

Zhang, Wei; Meng, Qiaozhen; Tang, Jijun; Guo, Fei.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-33959764

ABSTRACT

Diseases caused by bacterial infections have become a critical problem in public health. Antibiotics, the traditional treatment, are gradually losing their effectiveness due to resistance. Meanwhile, antibacterial proteins attract more attention because of their broad spectrum and little harm to host cells. Therefore, exploring new effective antibacterial proteins is urgent and necessary. In this paper, we are committed to evaluating the effectiveness of ab-initio docking methods in antibacterial protein-protein docking. For this purpose, we constructed a three-dimensional (3D) structure dataset of antibacterial protein complexes, called APCset, which contained $19$ protein complexes whose receptors or ligands are homologous to antibacterial peptides from the Antimicrobial Peptide Database. Then we selected five representative ab-initio protein-protein docking tools, including ZDOCK3.0.2, FRODOCK3.0, ATTRACT, PatchDock and Rosetta, to identify the structures of these complexes, and analyzed their performance differences from five aspects: top/best pose, first hit, success rate, average hit count and running time. Finally, according to different requirements, we assessed and recommended relatively efficient protein-protein docking tools. In terms of computational efficiency and performance, ZDOCK was more suitable as the preferred computational tool, with an average running time of $6.144$ minutes, an average Fnat of best pose of $0.953$ and an average rank of best pose of $4.158$. Meanwhile, ZDOCK still yielded better performance on Benchmark 5.0, which proved that ZDOCK was effective in performing docking on a large-scale dataset. Our survey can offer insights into the research on the treatment of bacterial infections by utilizing the appropriate docking methods.
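Illustrative note: the abstract reports Fnat, the fraction of native interface contacts that a docking pose reproduces. The sketch below computes it from residue-level contact sets. It is a simplification under stated assumptions: each residue is represented by a single coordinate, the 5.0 Å cutoff is a commonly used convention rather than a value taken from the abstract, and the function names are hypothetical.

```python
import math


def interface_contacts(receptor, ligand, cutoff=5.0):
    """Return the set of (receptor_residue, ligand_residue) pairs whose
    representative points lie within `cutoff` of each other.
    `receptor` and `ligand` map residue ids to (x, y, z) tuples."""
    return {
        (r, l)
        for r, rc in receptor.items()
        for l, lc in ligand.items()
        if math.dist(rc, lc) <= cutoff
    }


def fnat(native_contacts, predicted_contacts):
    """Fraction of native contacts that also appear in the predicted pose."""
    if not native_contacts:
        raise ValueError("the native complex has no interface contacts")
    return len(native_contacts & predicted_contacts) / len(native_contacts)
```

A pose that recovers 2 of 4 native contacts scores 0.5 regardless of how many extra, non-native contacts it introduces, which is why Fnat is usually read together with rank or RMSD-based criteria.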

Subject(s)

Algorithms , Antimicrobial Peptides/chemistry , Computational Biology , Databases, Protein , Molecular Docking Simulation , Software

11.

MMFGRN: a multi-source multi-model fusion method for gene regulatory network reconstruction.

He, Wenying; Tang, Jijun; Zou, Quan; Guo, Fei.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-33939795

ABSTRACT

Many biological processes are controlled by gene regulatory networks (GRNs), such as the growth and differentiation of cells and the occurrence and development of diseases. Therefore, it is important to persistently concentrate on the research of GRNs. The determination of the gene-gene relationships from gene expression data is a complex issue. Since it is difficult to efficiently obtain the regularity behind the gene-gene relationship by only relying on biochemical experimental methods, various computational methods have been used to construct GRNs, and some achievements have been made. In this paper, we propose a novel method MMFGRN (for "Multi-source Multi-model Fusion for Gene Regulatory Network reconstruction") to reconstruct the GRN. In order to make full use of the limited datasets and explore the potential regulatory relationships contained in different data types, we construct the MMFGRN model from three perspectives: single time series data model, single steady-data model and time series and steady-data joint model. We then utilize the weighted fusion strategy to get the final global regulatory link ranking. Finally, the MMFGRN model yields the best performance on the DREAM4 InSilico_Size10 data, outperforming other popular inference algorithms, with an overall area under receiver operating characteristic score of 0.909 and area under precision-recall (AUPR) curves score of 0.770 on the 10-gene network. Additionally, as the network scale increases, our method also has certain advantages, with an overall AUPR score of 0.335 on the DREAM4 InSilico_Size100 data. These results demonstrate the good robustness of MMFGRN on different scales of networks. At the same time, the integration strategy proposed in this paper provides a new idea for the reconstruction of the biological network model without prior knowledge, which can help researchers to decipher the elusive mechanism of life.
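Illustrative note: the abstract describes three models (time series, steady-state, and joint) whose outputs are merged by a weighted fusion strategy into one global ranking of regulatory links. The sketch below shows one straightforward reading of that step as a weighted sum of per-model edge scores. The weights, the score dictionaries, and the function name are assumptions; the abstract does not state how MMFGRN sets its weights.

```python
def fuse_rankings(model_scores, weights):
    """Combine per-model edge scores into one ranked list.
    `model_scores` is a list of dicts mapping (regulator, target) to a
    confidence score; an edge missing from a model contributes zero.
    Returns [(edge, fused_score), ...] sorted from most to least confident."""
    if len(model_scores) != len(weights):
        raise ValueError("one weight is needed per model")
    fused = {}
    for scores, w in zip(model_scores, weights):
        for edge, s in scores.items():
            fused[edge] = fused.get(edge, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Edges are kept as ordered pairs, so a link from G1 to G2 and a link from G2 to G1 are ranked separately.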

Subject(s)

Computational Biology/methods , Gene Expression Regulation , Gene Regulatory Networks , Software , Algorithms , Reproducibility of Results , Workflow

12.

A comprehensive overview and critical evaluation of gene regulatory network inference technologies.

Zhao, Mengyuan; He, Wenying; Tang, Jijun; Zou, Quan; Guo, Fei.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33539514

ABSTRACT

Gene regulatory network (GRN) is the important mechanism of maintaining life process, controlling biochemical reaction and regulating compound level, which plays an important role in various organisms and systems. Reconstructing GRN can help us to understand the molecular mechanism of organisms and to reveal the essential rules of a large number of biological processes and reactions in organisms. Various outstanding network reconstruction algorithms use specific assumptions that affect prediction accuracy, in order to deal with the uncertainty of processing. In order to study why a certain method is more suitable for specific research problem or experimental data, we conduct research from model-based, information-based and machine learning-based method classifications. There are obviously different types of computational tools that can be generated to distinguish GRNs. Furthermore, we discuss several classical, representative and latest methods in each category to analyze core ideas, general steps, characteristics, etc. We compare the performance of state-of-the-art GRN reconstruction technologies on simulated networks and real networks under different scaling conditions. Through standardized performance metrics and common benchmarks, we quantitatively evaluate the stability of various methods and the sensitivity of the same algorithm applying to different scaling networks. The aim of this study is to explore the most appropriate method for a specific GRN, which helps biologists and medical scientists in discovering potential drug targets and identifying cancer biomarkers.

Subject(s)

Computational Biology/methods , Gene Expression Regulation , Gene Regulatory Networks , Machine Learning , Transcriptome , Bayes Theorem , Biomarkers, Tumor/genetics , Databases, Genetic , Escherichia coli/genetics , Models, Genetic , Neoplasms/genetics , RNA-Seq/methods

13.

DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences.

Li, Jiawei; Pu, Yuqian; Tang, Jijun; Zou, Quan; Guo, Fei.

Brief Bioinform ; 22(3)2021 05 20.

Article in English | MEDLINE | ID: mdl-32778871

ABSTRACT

Quantifying DNA properties is a challenging task in the broad field of human genomics. Since the vast majority of non-coding DNA is still poorly understood in terms of function, this task is particularly important to have enormous benefit for biology research. Various DNA sequences should have a great variety of representations, and specific functions may focus on corresponding features in the front part of learning model. Currently, however, for multi-class prediction of non-coding DNA regulatory functions, most powerful predictive models do not have appropriate feature extraction and selection approaches for specific functional effects, so that it is difficult to gain a better insight into their internal correlations. Hence, we design a category attention layer and category dense layer in order to select efficient features and distinguish different DNA functions. In this study, we propose a hybrid deep neural network method, called DeepATT, for identifying $919$ regulatory functions on nearly $5$ million DNA sequences. Our model has four built-in neural network constructions: convolution layer captures regulatory motifs, recurrent layer captures a regulatory grammar, category attention layer selects corresponding valid features for different functions and category dense layer classifies predictive labels with selected features of regulatory functions. Importantly, we compare our novel method, DeepATT, with existing outstanding prediction tools, DeepSEA and DanQ. DeepATT performs significantly better than other existing tools for identifying DNA functions, at least increasing $1.6\%$ area under precision recall. Furthermore, we can mine the important correlation among different DNA functions according to the category attention module. Moreover, our novel model can greatly reduce the number of parameters by the mechanism of attention and locally connected, on the basis of ensuring accuracy.

Subject(s)

DNA/genetics , Databases, Nucleic Acid , Neural Networks, Computer , Regulatory Sequences, Nucleic Acid , Sequence Analysis, DNA

14.

Predicting MHC class I binder: existing approaches and a novel recurrent neural network solution.

Jiang, Limin; Yu, Hui; Li, Jiawei; Tang, Jijun; Guo, Yan; Guo, Fei.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34131696

ABSTRACT

Major histocompatibility complex (MHC) possesses important research value in the treatment of complex human diseases. A plethora of computational tools has been developed to predict MHC class I binders. Here, we comprehensively reviewed 27 up-to-date MHC I binding prediction tools developed over the last decade, thoroughly evaluating feature representation methods, prediction algorithms and model training strategies on a benchmark dataset from Immune Epitope Database. A common limitation was identified during the review that all existing tools can only handle a fixed peptide sequence length. To overcome this limitation, we developed a bilateral and variable long short-term memory (BVLSTM)-based approach, named BVLSTM-MHC. It is the first variable-length MHC class I binding predictor. In comparison to the 10 mainstream prediction tools on an independent validation dataset, BVLSTM-MHC achieved the best performance in six out of eight evaluated metrics. A web server based on the BVLSTM-MHC model was developed to enable accurate and efficient MHC class I binder prediction in human, mouse, macaque and chimpanzee.

Subject(s)

Binding Sites , Carrier Proteins/chemistry , Computational Biology/methods , Histocompatibility Antigens Class I/chemistry , Neural Networks, Computer , Software , Amino Acid Sequence , Carrier Proteins/metabolism , Databases, Factual , Deep Learning , Epitopes/chemistry , Epitopes/immunology , Epitopes/metabolism , Histocompatibility Antigens Class I/immunology , Histocompatibility Antigens Class I/metabolism , Machine Learning , Protein Binding , ROC Curve , Reproducibility of Results , Web Browser

15.

Inferring gene regulatory network via fusing gene expression image and RNA-seq data.

Li, Xuejian; Ma, Shiqiang; Liu, Jin; Tang, Jijun; Guo, Fei.

Bioinformatics ; 38(6): 1716-1723, 2022 03 04.

Article in English | MEDLINE | ID: mdl-34999771

ABSTRACT

MOTIVATION: Recently, with the development of high-throughput experimental technology, reconstruction of gene regulatory network (GRN) has ushered in new opportunities and challenges. Some previous methods mainly extract gene expression information based on RNA-seq data, but the associated information is very limited. With the establishment of gene expression image database, it is possible to infer GRN from image data with rich spatial information. RESULTS: First, we propose a new convolutional neural network (called SDINet), which can extract gene expression information from images and identify the interaction between genes. SDINet captures both detailed information and high-level semantic information from the images, and it achieves satisfying performance on image data (Acc: 0.7196, F1: 0.7374). Second, we apply the idea of our SDINet to build an RNA-model, which also achieves good results on RNA-seq data (Acc: 0.8962, F1: 0.8950). Finally, we combine image data and RNA-seq data, and design a new fusion network to explore the potential relationship between them. Experiments show that our proposed network fusing two modalities can obtain better performance (Acc: 0.9116, F1: 0.9118) than either single data type. AVAILABILITY AND IMPLEMENTATION: Data and code are available from https://github.com/guofei-tju/Combine-Gene-Expression-images-and-RNA-seq-data-For-infering-GRN.
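Illustrative note: the abstract reports accuracy (Acc) and F1 for each model. For readers comparing these numbers, the sketch below computes both from the four confusion-matrix counts of a binary gene-pair classifier. The function names are hypothetical, and the counts in any example are toy values, not results from the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all gene pairs that were classified correctly."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in the equivalent
    count form 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        raise ValueError("F1 is undefined without any positive prediction or label")
    return 2 * tp / denom
```

Accuracy rewards correct negatives as well as correct positives, while F1 ignores true negatives, so the two can diverge when interacting pairs are rare.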

Subject(s)

Gene Regulatory Networks , Gene Expression , RNA-Seq , Sequence Analysis, RNA/methods

16.

Deep neural network based tissue deconvolution of circulating tumor cell RNA.

Yan, Fengyao; Jiang, Limin; Ye, Fei; Ping, Jie; Bowley, Tetiana Y; Ness, Scott A; Li, Chung-I; Marchetti, Dario; Tang, Jijun; Guo, Yan.

J Transl Med ; 21(1): 783, 2023 11 04.

Article in English | MEDLINE | ID: mdl-37925448

ABSTRACT

Prior research has shown that the deconvolution of cell-free RNA can uncover the tissue origin. The conventional deconvolution approaches rely on constructing a reference tissue-specific gene panel, which cannot capture the inherent variation present in actual data. To address this, we have developed a novel method that utilizes a neural network framework to leverage the entire training dataset. Our approach involved training a model that incorporated 15 distinct tissue types. Through one semi-independent and two complete independent validations, including deconvolution using a semi in silico dataset, deconvolution with a custom normal tissue mixture RNA-seq data, and deconvolution of longitudinal circulating tumor cell RNA-seq (ctcRNA) data from a cancer patient with metastatic tumors, we demonstrate the efficacy and advantages of the deep-learning approach which were exerted by effectively capturing the inherent variability present in the dataset, thus leading to enhanced accuracy. Sensitivity analyses reveal that neural network models are less susceptible to the presence of missing data, making them more suitable for real-world applications. Moreover, by leveraging the concept of organotropism, we applied our approach to trace the migration of circulating tumor cell-derived RNA (ctcRNA) in a cancer patient with metastatic tumors, thereby highlighting the potential clinical significance of early detection of cancer metastasis.

Subject(s)

Neoplastic Cells, Circulating; RNA; Humans; Neural Networks, Computer; RNA-Seq; Sequence Analysis, RNA

17.

Critical evaluation of web-based prediction tools for human protein subcellular localization.

Shen, Yinan; Ding, Yijie; Tang, Jijun; Zou, Quan; Guo, Fei.

Brief Bioinform ; 21(5): 1628-1640, 2020 09 25.

Article in English | MEDLINE | ID: mdl-31697319

ABSTRACT

Human protein subcellular localization has important research value for understanding biological processes, elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize recent progress in research on the subcellular localization of human proteins, including commonly used data sets from earlier work and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in accuracy relative to the other methods. We also build a new data set using the latest version of the UniProt database and construct a new GO-based prediction method, HumLoc-LBCI. We then test all selected prediction tools on the new data set. Finally, we discuss possible directions for the development of human protein subcellular localization prediction. Availability: The codes and data are available from http://www.lbci.cn/syn/.

Subject(s)

Internet; Proteins/metabolism; Subcellular Fractions/metabolism; Benchmarking; Datasets as Topic; Humans

18.

MOF composites derived BiFeO₃@Bi₅O₇I n-n heterojunction for enhanced photocatalytic performance.

Zhu, Yu; Li, Chuwen; Hou, Dongmei; Gao, Guicheng; Luo, Weiqi; Duan, Zhengzhou; Zhang, Tang; Xv, Qinyun; Wang, Yujia; Tang, Jijun.

Nanotechnology ; 33(20)2022 Feb 21.

Article in English | MEDLINE | ID: mdl-34983034

ABSTRACT

BiFeO₃ is a photocatalyst with excellent performance. However, its applications are limited by its wide bandgap. In this paper, a MIL-101(Fe)@BiOI composite is synthesized by a hydrothermal method and then calcined at high temperature to obtain a BiFeO₃@Bi₅O₇I composite with high degradation capacity. An n-n heterojunction is formed in the composite, which improves the efficiency of charge transfer, and the recombination of light-generated electrons and holes promotes improved photocatalytic efficiency and stability. Photocatalytic degradation of tetracycline under visible light irradiation showed that BiFeO₃@Bi₅O₇I (1:2) has the best photodegradation effect, with a degradation rate of 86.4%, which demonstrates its potential as a photocatalyst.

19.

EditPredict: Prediction of RNA editable sites with convolutional neural network.

Wang, Jiandong; Ness, Scott; Brown, Roger; Yu, Hui; Oyebamiji, Olufunmilola; Jiang, Limin; Sheng, Quanhu; Samuels, David C; Zhao, Ying-Yong; Tang, Jijun; Guo, Yan.

Genomics ; 113(6): 3864-3871, 2021 11.

Article in English | MEDLINE | ID: mdl-34562567

ABSTRACT

RNA editing has a critical impact on numerous biological processes. While millions of RNA editing sites have been identified in humans, many more are expected to be discovered. In this work, we constructed Convolutional Neural Network (CNN) models to predict human RNA editing events in both Alu regions and non-Alu regions. On a validation dataset derived from CRISPR/Cas9 knockout of the ADAR1 enzyme, validation accuracy reached 99.5% for Alu regions and 93.6% for non-Alu regions. We deployed our CNN models as a web service named EditPredict. EditPredict not only works on reference genome sequences but can also take into account single nucleotide variants in personal genomes. In addition to the human genome, EditPredict supports other model organisms, including the bumblebee, fruit fly, mouse and squid genomes. EditPredict can be used stand-alone to predict novel RNA editing sites, or to assist in filtering candidate RNA editing sites detected from RNA-Seq data.

Subject(s)

Neural Networks, Computer; RNA Editing; Animals; Genome; RNA; RNA-Seq

20.

Predicting subcellular location of protein with evolution information and sequence-based deep learning.

Liao, Zhijun; Pan, Gaofeng; Sun, Chao; Tang, Jijun.

BMC Bioinformatics ; 22(Suppl 10): 515, 2021 Oct 22.

Article in English | MEDLINE | ID: mdl-34686152

ABSTRACT

BACKGROUND: Protein subcellular localization prediction plays an important role in biological research. Because traditional experimental methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of these methods ignore the evolutionary information of proteins. To improve prediction accuracy, we present a deep learning-based method for predicting protein subcellular locations. RESULTS: Our method uses not only the amino acid sequence but also the evolution matrices of proteins. It combines a bidirectional long short-term memory network, which processes the entire protein sequence, with a convolutional neural network, which extracts features from protein sequences. The position-specific scoring matrix is used as a supplement to the protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that it yields accurate results on the two datasets, with an average precision of 0.7901, a ranking loss of 0.0758 and a coverage of 1.2848. CONCLUSION: The experimental results show that our method outperforms five currently available methods. Based on these experiments, our method is an acceptable alternative for predicting protein subcellular location.

Subject(s)

Deep Learning; Amino Acid Sequence; Computational Biology; Databases, Protein; Position-Specific Scoring Matrices; Proteins/genetics
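
Ranking loss and coverage, two of the metrics reported in the abstract above, are multi-label ranking measures. The sketch below shows one common way to compute them; it is not the authors' evaluation code. It assumes each protein has a 0/1 vector of true locations and a vector of predicted scores, and it counts coverage as the 1-based depth in the ranked list needed to include every true label. Some papers subtract 1 from that depth, and the abstract does not say which convention it uses. The function names and toy data are illustrative.

```python
def ranking_loss(y_true, scores):
    """Average fraction of (true, false) label pairs that are ranked wrongly."""
    total, counted = 0.0, 0
    for labels, s in zip(y_true, scores):
        pos = [s[j] for j, lab in enumerate(labels) if lab == 1]
        neg = [s[j] for j, lab in enumerate(labels) if lab == 0]
        if not pos or not neg:
            continue  # undefined when a sample has no true or no false labels
        # A pair is wrong when a false label scores at least as high as a true one.
        wrong = sum(1 for p in pos for n in neg if n >= p)
        total += wrong / (len(pos) * len(neg))
        counted += 1
    return total / counted if counted else 0.0


def coverage(y_true, scores):
    """Average ranked-list depth (1-based) needed to cover all true labels."""
    total, counted = 0, 0
    for labels, s in zip(y_true, scores):
        pos = [s[j] for j, lab in enumerate(labels) if lab == 1]
        if not pos:
            continue  # nothing to cover for this sample
        lowest = min(pos)
        # Depth = number of labels scored at least as high as the weakest true label.
        total += sum(1 for v in s if v >= lowest)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical example: two proteins, three candidate locations each.
y_true = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]]
loss = ranking_loss(y_true, scores)
depth = coverage(y_true, scores)
```

Lower values are better for both metrics: a ranking loss of 0 means every true location outranks every false one, and the best possible coverage under this convention equals the average number of true locations per protein.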
