1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600667

ABSTRACT

Human leukocyte antigen (HLA) recognizes foreign threats and triggers immune responses by presenting peptides to T cells. Computationally modeling the binding patterns between peptides and HLA is very important for the development of tumor vaccines. However, accurately predicting which peptides bind to HLA molecules remains a major challenge. In this paper, we develop a new model, TripHLApan, for predicting HLA-peptide binding by integrating a triple coding matrix, BiGRU + Attention models, and a transfer learning strategy. We have found the main interaction site regions between HLA molecules and peptides, as well as the correlation between HLA encoding and binding motifs. Based on these findings, we make the preprocessing and coding closer to the natural biological process. In addition, because the input is based on multiple types of features and the attention module focuses on the BiGRU hidden layer, TripHLApan learns more sequence-level binding information. The application of transfer learning strategies ensures the accuracy of prediction results for special lengths (peptides of length 8) and model scalability as data volumes grow. Compared with the current best-performing models, TripHLApan exhibits strong predictive performance in various prediction environments with different positive-to-negative sample ratios. In addition, we validate the superiority and scalability of TripHLApan's predictive performance using additional recent data sets, ablation experiments and binding reconstitution ability in the samples of a melanoma patient. The results show that TripHLApan is a powerful tool for predicting peptide binding to HLA-I and HLA-II molecules for the synthesis of tumor vaccines. TripHLApan is publicly available at https://github.com/CSUBioGroup/TripHLApan.git.


Subjects
Cancer Vaccines , Humans , Protein Binding , Peptides/chemistry , HLA Antigens/chemistry , Histocompatibility Antigens Class II/chemistry , Histocompatibility Antigens Class I/chemistry , Machine Learning
2.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36511222

ABSTRACT

Circular RNAs (circRNAs) are reverse-spliced and covalently closed RNAs. Their interactions with RNA-binding proteins (RBPs) have multiple effects on the progress of many diseases. Some computational methods are proposed to identify RBP binding sites on circRNAs but suffer from insufficient accuracy, robustness and explanation. In this study, we first take the characteristics of both RNA and RBP into consideration. We propose a method for discriminating circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features, called CRMSS. For circRNAs, we use sequence k-mer embedding and the forming probabilities of local secondary structures as features. For RBPs, we combine sequence and structure frequencies of RNA-binding domain regions to generate features. We capture binding patterns with multi-scale residual blocks. With BiLSTM and attention mechanism, we obtain the contextual information of high-level representation for circRNA-RBP binding. To validate the effectiveness of CRMSS, we compare its predictive performance with other methods on 37 RBPs. Taking the properties of both circRNAs and RBPs into account, CRMSS achieves superior performance over state-of-the-art methods. In the case study, our model provides reliable predictions and correctly identifies experimentally verified circRNA-RBP pairs. The code of CRMSS is freely available at https://github.com/BioinformaticsCSU/CRMSS.


Subjects
Circular RNA , RNA , Circular RNA/genetics , Binding Sites , RNA/metabolism , RNA-Binding Proteins/metabolism
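
The sequence k-mer features mentioned in the CRMSS abstract above start from splitting an RNA sequence into short overlapping words. The following is a minimal sketch of that tokenization step only, not the authors' implementation: the function name `kmers`, the stride of 1 and the example sequence are assumptions made for illustration, and the embedding step that would follow is not shown.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping k-mers of `sequence`, moving one base at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 7-nucleotide RNA fragment, split into 3-mers.
tokens = kmers("AUGGCUA", 3)
print(tokens)  # ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

Each token could then be mapped to a learned vector; the choice of k and of the embedding method is specific to the published model and should be taken from its repository.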
3.
Bioinformatics ; 40(Supplement_1): i511-i520, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940121

ABSTRACT

MOTIVATION: Identifying cancer genes remains a significant challenge in cancer genomics research. Annotated gene sets encode functional associations among multiple genes, and cancer genes have been shown to cluster in hallmark signaling pathways and biological processes. The knowledge of annotated gene sets is critical for discovering cancer genes but remains to be fully exploited. RESULTS: Here, we present the DIsease-Specific Hypergraph neural network (DISHyper), a hypergraph-based computational method that integrates the knowledge from multiple types of annotated gene sets to predict cancer genes. First, our benchmark results demonstrate that DISHyper outperforms the existing state-of-the-art methods and highlight the advantages of employing hypergraphs for representing annotated gene sets. Second, we validate the accuracy of DISHyper-predicted cancer genes using functional validation results and multiple independent functional genomics data. Third, our model predicts 44 novel cancer genes, and subsequent analysis shows their significant associations with multiple types of cancers. Overall, our study provides a new perspective for discovering cancer genes and reveals previously undiscovered cancer genes. AVAILABILITY AND IMPLEMENTATION: DISHyper is freely available for download at https://github.com/genemine/DISHyper.


Subjects
Neoplasms , Computer Neural Networks , Humans , Neoplasms/genetics , Computational Biology/methods , Genomics/methods , Neoplasm Genes , Molecular Sequence Annotation/methods , Genetic Databases
4.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37084258

ABSTRACT

MOTIVATION: Hi-C technology has been the most widely used chromosome conformation capture (3C) experiment that measures the frequency of all paired interactions in the entire genome, which is a powerful tool for studying the 3D structure of the genome. The fineness of the constructed genome structure depends on the resolution of Hi-C data. However, because high-resolution Hi-C data require deep sequencing and thus high experimental cost, most available Hi-C data are low-resolution. Hence, it is essential to enhance the quality of Hi-C data by developing effective computational methods. RESULTS: In this work, we propose a novel method, called DFHiC, which generates the high-resolution Hi-C matrix from the low-resolution Hi-C matrix in the framework of the dilated convolutional neural network. The dilated convolution is able to effectively explore the global patterns in the overall Hi-C matrix by exploiting information in the Hi-C matrix across longer genomic distances. Consequently, DFHiC can improve the resolution of the Hi-C matrix reliably and accurately. More importantly, the super-resolution Hi-C data enhanced by DFHiC are more in line with the real high-resolution Hi-C data than those produced by the other existing methods, in terms of both significant chromatin interactions and the identification of topologically associating domains. AVAILABILITY AND IMPLEMENTATION: https://github.com/BinWangCSU/DFHiC.


Subjects
Chromatin , Chromosomes , Chromatin/genetics , Genome , Genomics , Computer Neural Networks
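
The DFHiC abstract above relies on dilated convolution to reach across longer genomic distances. As a rough illustration of why dilation widens the receptive field without adding weights, here is a one-dimensional sketch in plain Python. It is an assumption-laden toy: the published model works on two-dimensional Hi-C matrices inside a neural network, and the function name `dilated_conv1d`, the all-ones kernel and the input signal are hypothetical.

```python
from typing import List


def dilated_conv1d(x: List[float], w: List[float], dilation: int) -> List[float]:
    """Valid (unpadded) 1D convolution whose kernel taps are `dilation` positions apart."""
    span = (len(w) - 1) * dilation  # distance between the first and last tap
    return [
        sum(w[j] * x[i + j * dilation] for j in range(len(w)))
        for i in range(len(x) - span)
    ]


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 1, 1]  # three weights in both cases

# dilation=1: each output sees 3 adjacent positions.
print(dilated_conv1d(signal, kernel, dilation=1))  # [6, 9, 12, 15]
# dilation=2: same three weights, but each output now spans 5 positions.
print(dilated_conv1d(signal, kernel, dilation=2))  # [9, 12]
```

With the same number of weights, a larger dilation lets each output depend on positions that are farther apart, which is the property the abstract attributes to the method.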
5.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37606993

ABSTRACT

MOTIVATION: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g. drug structure), without considering the relationships between drugs and biological entities (e.g. target, diseases, and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines. RESULTS: In this paper, we propose a deep learning framework, named MSDRP for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines, and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion algorithms, outperforming some state-of-the-art models in all performance measures for all experiments. The experimental results of de novo test and independent test demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationality for using feature vectors derived from drug similarity matrices from multisource data to represent drugs and the interpretability of our model. AVAILABILITY AND IMPLEMENTATION: The codes of MSDRP are available at https://github.com/xyzhang-10/MSDRP.


Subjects
Deep Learning , Drug-Related Side Effects and Adverse Reactions , Humans , Algorithms , Cell Line , Machine Learning
6.
Bioinformatics ; 39(39 Suppl 1): i368-i376, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387178

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single cell clustering. RESULTS: We develop CellBRF, a feature selection method that considers genes' relevance to cell types for single-cell clustering. The key idea is to identify genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, it proposes a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the outstanding performance of our selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy. AVAILABILITY AND IMPLEMENTATION: All source codes of CellBRF are freely available at https://github.com/xuyp-csu/CellBRF.


Subjects
Benchmarking , Random Forest , Cell Differentiation , Cluster Analysis
7.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34213525

ABSTRACT

Identifying the frequencies of drug side effects is a very important issue in pharmacological studies and drug risk-benefit assessment. However, designing clinical trials to determine the frequencies is usually time consuming and expensive, and most existing methods can only predict the existence of drug-side effect associations, not their frequencies. Inspired by the recent progress of graph neural networks in recommender systems, we develop a novel prediction model for drug-side effect frequencies, using a graph attention network to integrate three different types of features, including the similarity information, known drug-side effect frequency information and word embeddings. In comparison, the few available studies focusing on frequency prediction use only the known drug-side effect frequency scores. One novel approach used in this work first decomposes the feature types in the drug-side effect graph to extract different view representation vectors based on the three feature types, and then recombines these latent view vectors automatically to obtain unified embeddings for prediction. The proposed method demonstrates high effectiveness in 10-fold cross-validation. The computational results show that the proposed method achieves the best performance on the benchmark dataset, outperforming the state-of-the-art matrix decomposition model. In addition, some ablation experiments and visual analyses are also supplied to illustrate the usefulness of our method for the prediction of drug-side effect frequencies. The codes of MGPred are available at https://github.com/zhc940702/MGPred and https://zenodo.org/record/4449613.


Subjects
Drug-Related Side Effects and Adverse Reactions/diagnosis , Medical Informatics/methods , Software , Algorithms , Benchmarking , Factual Databases , Deep Learning , Drug Interactions , Drug-Related Side Effects and Adverse Reactions/etiology , Humans , Reproducibility of Results
8.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33834190

ABSTRACT

Biomolecular recognition between ligand and protein plays an essential role in drug discovery and development. However, it is extremely time and resource consuming to determine the protein-ligand binding affinity by experiments. At present, many computational methods have been proposed to predict binding affinity, most of which usually require protein 3D structures that are not often available. Therefore, new methods that can fully take advantage of sequence-level features are greatly needed to predict protein-ligand binding affinity and accelerate the drug discovery process. We developed a novel deep learning approach, named DeepDTAF, to predict the protein-ligand binding affinity. DeepDTAF was constructed by integrating local and global contextual features. More specifically, the protein-binding pocket, which possesses some special properties for directly binding the ligand, was for the first time used as the local input feature for protein-ligand binding affinity prediction. Furthermore, dilated convolution was used to capture multiscale long-range interactions. We compared DeepDTAF with recent state-of-the-art methods and analyzed the effectiveness of different parts of our model; the significant accuracy improvement showed that DeepDTAF was a reliable tool for affinity prediction. The source code and data are available at https://github.com/KailiWang1/DeepDTAF.


Subjects
Deep Learning , Molecular Models , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Binding Sites , Data Accuracy , Drug Discovery/methods , Hydrogen Bonding , Ligands , Protein Binding , alpha-Helical Protein Conformation , Reproducibility of Results , Software
9.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33147616

ABSTRACT

With the development of high-throughput technology and the accumulation of biomedical data, the prior information of biological entity can be calculated from different aspects. Specifically, drug-drug similarities can be measured from target profiles, drug-drug interaction and side effects. Similarly, different methods and data sources to calculate disease ontology can result in multiple measures of pairwise disease similarities. Therefore, in computational drug repositioning, developing a dynamic method to optimize the fusion process of multiple similarities is a crucial and challenging task. In this study, we propose a multi-similarities bilinear matrix factorization (MSBMF) method to predict promising drug-associated indications for existing and novel drugs. Instead of fusing multiple similarities into a single similarity matrix, we concatenate these similarity matrices of drug and disease, respectively. Applying matrix factorization methods, we decompose the drug-disease association matrix into a drug-feature matrix and a disease-feature matrix. At the same time, using these feature matrices as basis, we extract effective latent features representing the drug and disease similarity matrices to infer missing drug-disease associations. Moreover, these two factored matrices are constrained by non-negative factorization to ensure that the completed drug-disease association matrix is biologically interpretable. In addition, we numerically solve the MSBMF model by an efficient alternating direction method of multipliers algorithm. The computational experiment results show that MSBMF obtains higher prediction accuracy than the state-of-the-art drug repositioning methods in cross-validation experiments. Case studies also demonstrate the effectiveness of our proposed method in practical applications. Availability: The data and code of MSBMF are freely available at https://github.com/BioinformaticsCSU/MSBMF. 
Corresponding author: Jianxin Wang, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P. R. China. E-mail: jxwang@mail.csu.edu.cn Supplementary Data: Supplementary data are available online at https://academic.oup.com/bib.


Subjects
Algorithms , Computational Biology , Factual Databases , Drug Repositioning , Humans
10.
Brief Bioinform ; 22(2): 1604-1619, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32043521

ABSTRACT

Drug repositioning can drastically decrease the cost and duration taken by traditional drug research and development while avoiding the occurrence of unforeseen adverse events. With the rapid advancement of high-throughput technologies and the explosion of various biological data and medical data, computational drug repositioning methods have been appealing and powerful techniques to systematically identify potential drug-target interactions and drug-disease interactions. In this review, we first summarize the available biomedical data and public databases related to drugs, diseases and targets. Then, we discuss existing drug repositioning approaches and group them based on their underlying computational models consisting of classical machine learning, network propagation, matrix factorization and completion, and deep learning based models. We also comprehensively analyze common standard data sets and evaluation metrics used in drug repositioning, and give a brief comparison of various prediction methods on the gold standard data sets. Finally, we conclude our review with a brief discussion on challenges in computational drug repositioning, which includes the problem of reducing the noise and incompleteness of biomedical data, the ensemble of various computation drug repositioning methods, the importance of designing reliable negative samples selection methods, new techniques dealing with the data sparseness problem, the construction of large-scale and comprehensive benchmark data sets and the analysis and explanation of the underlying mechanisms of predicted interactions.


Subjects
Computer Simulation , Drug Repositioning , Algorithms , Bayes Theorem , Cluster Analysis , Computational Biology/methods , Statistical Data Interpretation , Deep Learning , Reproducibility of Results , Support Vector Machine
11.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33822883

ABSTRACT

The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.


Subjects
Electronic Data Processing/methods , Human Genome , Genomics/methods , Single Nucleotide Polymorphism , Sequence Alignment/methods , Algorithms , Base Sequence/genetics , Chromosome Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Information Storage and Retrieval , Software , Whole Genome Sequencing/methods
12.
Bioinformatics ; 38(17): 4153-4161, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801934

ABSTRACT

MOTIVATION: Identifying drug-target interactions is a crucial step for drug discovery and design. Traditional biochemical experiments can validate drug-target interactions accurately. However, they are also extremely laborious, time-consuming and expensive. With the collection of more validated biomedical data and the advancement of computing technology, computational methods based on chemogenomics have gradually attracted more attention and can guide experimental verification. RESULTS: In this study, we propose an end-to-end deep learning-based method named IIFDTI to predict drug-target interactions (DTIs) based on independent features of drug-target pairs and interactive features of their substructures. First, the interactive features of substructures between drugs and targets are extracted by the bidirectional encoder-decoder architecture. The independent features of drugs and targets are extracted by the graph neural networks and convolutional neural networks, respectively. Then, all extracted features are fused and fed into fully connected dense layers in downstream tasks for predicting DTIs. IIFDTI takes into account the independent features of drugs/targets and simulates the interactive features of the substructures from the biological perspective. Multiple experiments show that IIFDTI outperforms the state-of-the-art methods in terms of the area under the receiver operating characteristics curve (AUC), the area under the precision-recall curve (AUPR), precision, and recall on benchmark datasets. In addition, the mapped visualizations of attention weights indicate that IIFDTI has learned biological knowledge, and two case studies illustrate the capabilities of IIFDTI in practical applications. AVAILABILITY AND IMPLEMENTATION: The data and codes underlying this article are available on GitHub at https://github.com/czjczj/IIFDTI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug Discovery , Computer Neural Networks , Drug Interactions , Area Under Curve , Drug Discovery/methods , ROC Curve
13.
Bioinformatics ; 38(7): 1995-2002, 2022 03 28.
Article in English | MEDLINE | ID: mdl-35043942

ABSTRACT

MOTIVATION: The identification of compound-protein interactions (CPIs) is an essential step in the process of drug discovery. The experimental determination of CPIs is known for the large amount of funds and time it consumes. Computational models have therefore become a promising and efficient alternative for predicting novel interactions between compounds and proteins on a large scale. Most supervised machine learning prediction models treat the task as a binary classification problem, which aims to predict whether or not there is an interaction between the compound and the protein. However, CPI is not a simple binary on-off relationship, but a continuous value that reflects how tightly the compound binds to a particular target protein, also called binding affinity. RESULTS: In this study, we propose an end-to-end neural network model, called BACPI, to predict CPI and binding affinity. We employ graph attention network and convolutional neural network (CNN) to learn the representations of compounds and proteins and develop a bi-directional attention neural network model to integrate the representations. To evaluate the performance of BACPI, we use three CPI datasets and four binding affinity datasets in our experiments. The results show that, when predicting CPIs, BACPI significantly outperforms other available machine learning methods on both balanced and unbalanced datasets. This suggests that the end-to-end neural network model that predicts CPIs directly from low-level representations is more robust than traditional machine learning-based methods. When predicting binding affinities, BACPI achieves higher performance on large datasets compared to other state-of-the-art deep learning methods. This comparison result suggests that the proposed method with bi-directional attention neural network can capture the important regions of compounds and proteins for binding affinity prediction.
AVAILABILITY AND IMPLEMENTATION: Data and source codes are available at https://github.com/CSUBioGroup/BACPI.


Subjects
Computer Neural Networks , Software , Proteins/chemistry , Machine Learning , Drug Discovery/methods
14.
J Biomed Inform ; 143: 104396, 2023 07.
Article in English | MEDLINE | ID: mdl-37211195

ABSTRACT

Automated ICD coding is a multi-label prediction task that aims to assign the most relevant subset of disease codes to a patient's diagnoses. In the deep learning regime, recent works have suffered from large label sets and heavily imbalanced distributions. To mitigate the negative effect in such scenarios, we propose a retrieve-and-rerank framework that introduces Contrastive Learning (CL) for label retrieval, allowing the model to make more accurate predictions from a simplified label space. Given the appealing discriminative power of CL, we adopt it as the training strategy to replace the standard cross-entropy objective and retrieve a small subset by taking the distance between clinical notes and ICD codes into account. After proper training, the retriever can implicitly capture code co-occurrence, which compensates for the deficiency of cross-entropy, which assigns each label independently of the others. Further, we develop a powerful model via a Transformer variant for refining and reranking the candidate set, which can extract semantically meaningful features from long clinical sequences. Applying our method to well-known models, experiments show that our framework provides more accurate results by preselecting a small subset of candidates before fine-level reranking. Relying on the framework, our proposed model achieves 0.590 and 0.990 in terms of Micro-F1 and Micro-AUC on the MIMIC-III benchmark.


Subjects
Electronic Health Records , International Classification of Diseases , Humans
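
The ICD coding abstract above reports Micro-F1, which pools true positives, false positives and false negatives over all documents and labels before computing F1. The sketch below shows that pooling under simple assumptions: labels are represented as Python sets, and the function name `micro_f1` and the example codes are hypothetical. It does not reproduce the paper's evaluation pipeline.

```python
from typing import List, Set


def micro_f1(true_labels: List[Set[str]], predicted_labels: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions, pooling counts over all samples."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # codes correctly predicted
        fp += len(pred - truth)   # predicted codes that are wrong
        fn += len(truth - pred)   # true codes that were missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two hypothetical clinical notes with made-up gold and predicted code sets.
gold = [{"I10", "E11"}, {"J45"}]
pred = [{"I10"}, {"J45", "I10"}]
print(round(micro_f1(gold, pred), 4))  # 0.6667
```

Because counts are pooled, frequent codes dominate the score, which is one reason heavily imbalanced label distributions matter for this task.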
15.
Bioinformatics ; 37(18): 2841-2847, 2021 09 29.
Article in English | MEDLINE | ID: mdl-33769479

ABSTRACT

MOTIVATION: The Anatomical Therapeutic Chemical (ATC) system is an official classification system established by the World Health Organization for medicines. Correctly assigning ATC classes to given compounds is an important research problem in drug discovery, which can not only discover the possible active ingredients of the compounds, but also infer their therapeutic, pharmacological and chemical properties. RESULTS: In this article, we develop an end-to-end multi-label classifier called CGATCPred to predict 14 main ATC classes for given compounds. In order to extract rich features of each compound, we use a deep Convolutional Neural Network and shortcut connections to represent and learn the seven association scores between the given compound and others. Moreover, we construct the correlation graph of ATC classes and then apply a graph convolutional network on the graph for label embedding abstraction. We use all label embeddings to guide the learning process of compound representation. As a result, using the Jackknife test, CGATCPred obtains reliable results, with Aiming of 81.94%, Coverage of 82.88%, Accuracy of 80.81%, Absolute True of 76.58% and Absolute False of 2.75%, yielding significant improvements compared to existing multi-label classifiers. AVAILABILITY AND IMPLEMENTATION: The codes of CGATCPred are available at https://github.com/zhc940702/CGATCPred and https://zenodo.org/record/4552917.


Subjects
Drug Discovery , Computer Neural Networks
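
The CGATCPred abstract above reports five multi-label metrics: Aiming, Coverage, Accuracy, Absolute True and Absolute False. The sketch below uses the set-based definitions commonly given for these metrics in the multi-label literature; whether the paper uses exactly these formulas is an assumption, and the function name `multilabel_metrics`, the label sets and the class count are hypothetical.

```python
from typing import Dict, List, Set


def multilabel_metrics(
    true_labels: List[Set[int]], predicted_labels: List[Set[int]], num_classes: int
) -> Dict[str, float]:
    """Average five set-based multi-label metrics over all samples."""
    totals = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for truth, pred in zip(true_labels, predicted_labels):
        inter, union = len(truth & pred), len(truth | pred)
        # Aiming: fraction of predicted labels that are correct.
        totals["aiming"] += inter / len(pred) if pred else 0.0
        # Coverage: fraction of true labels that were recovered.
        totals["coverage"] += inter / len(truth) if truth else 0.0
        # Accuracy: overlap relative to the union of both label sets.
        totals["accuracy"] += inter / union if union else 0.0
        # Absolute True: 1 only when the two sets match exactly.
        totals["absolute_true"] += 1.0 if truth == pred else 0.0
        # Absolute False: mismatched labels relative to the number of classes.
        totals["absolute_false"] += (union - inter) / num_classes
    n = len(true_labels)
    return {name: value / n for name, value in totals.items()}


# Two hypothetical compounds, class indices 1..4.
scores = multilabel_metrics([{1, 2}, {3}], [{1}, {3, 4}], num_classes=4)
print(scores["aiming"], scores["coverage"], scores["accuracy"])  # 0.75 0.75 0.5
```

Higher is better for the first four metrics, while Absolute False is an error rate, which is why a value as low as 2.75% is reported as a good result.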
16.
Bioinformatics ; 36(Suppl_2): i709-i717, 2020 12 30.
Article in English | MEDLINE | ID: mdl-33381840

ABSTRACT

MOTIVATION: Determining the structures of proteins is a critical step to understand their biological functions. Crystallography-based X-ray diffraction technique is the main method for experimental protein structure determination. However, the underlying crystallization process, which needs multiple time-consuming and costly experimental steps, has a high attrition rate. To overcome this issue, a series of in silico methods have been developed with the primary aim of selecting the protein sequences that are promising to be crystallized. However, the predictive performance of the current methods is modest. RESULTS: We propose a deep learning model, called CLPred, which uses a bidirectional recurrent neural network with long short-term memory (BLSTM) to capture the long-range interaction patterns between amino acid k-mers to predict protein crystallizability. Using sequence-only information, CLPred outperforms the existing deep-learning predictors and a vast majority of sequence-based diffraction-quality crystal predictors on three independent test sets. The results highlight the effectiveness of BLSTM in capturing non-local, long-range inter-peptide interaction patterns to distinguish proteins that can result in diffraction-quality crystals from those that cannot. CLPred improves steadily over the previous window-based neural networks and is able to predict crystallization propensity with high accuracy. CLPred can also be improved significantly if it incorporates additional features from pre-extracted evolutionary, structural and physicochemical characteristics. The correctness of CLPred predictions is further validated by the case studies of Sox transcription factor family member proteins and Zika virus non-structural proteins. AVAILABILITY AND IMPLEMENTATION: https://github.com/xuanwenjing/CLPred.


Subjects
Zika Virus Infection , Zika Virus , Amino Acid Sequence , Computational Biology , Crystallization , Humans , Computer Neural Networks , Proteins/genetics
17.
Bioinformatics ; 36(4): 1114-1120, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31593229

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) play important roles in many biological processes. Conventional biological experiments for identifying PPI sites are costly and time-consuming. Thus, many computational approaches have been proposed to predict PPI sites. Existing computational methods usually use local contextual features to predict PPI sites. Actually, global features of protein sequences are critical for PPI site prediction. RESULTS: A new end-to-end deep learning framework, named DeepPPISP, through combining local contextual and global sequence features, is proposed for PPI site prediction. For local contextual features, we use a sliding window to capture features of neighbors of a target amino acid as in previous studies. For global sequence features, a text convolutional neural network is applied to extract features from the whole protein sequence. Then the local contextual and global sequence features are combined to predict PPI sites. By integrating local contextual and global sequence features, DeepPPISP achieves the state-of-the-art performance, which is better than the other competing methods. In order to investigate if global sequence features are helpful in our deep learning model, we remove or change some components in DeepPPISP. Detailed analyses show that global sequence features play important roles in DeepPPISP. AVAILABILITY AND IMPLEMENTATION: The DeepPPISP web server is available at http://bioinformatics.csu.edu.cn/PPISP/. The source code can be obtained from https://github.com/CSUBioGroup/DeepPPISP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer; Amino Acid Sequence; Protein Interaction Domains and Motifs; Proteins; Software
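
The sliding-window step described in the DeepPPISP abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sliding_windows`, the `half_width` parameter, the padding symbol `X`, and the toy sequence are all assumptions made for illustration, and the abstract does not state the window size actually used.

```python
def sliding_windows(sequence, half_width=3, pad="X"):
    """Return one fixed-length window per residue, centred on that residue.

    Hypothetical helper: the sequence is padded at both ends so that
    residues near the termini still get a full-length window.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


# Toy sequence (illustrative only), two neighbours on each side.
windows = sliding_windows("MKTAYIA", half_width=2)
for residue, window in zip("MKTAYIA", windows):
    print(residue, window)
```

Each window would then be encoded into numeric features for its central residue, while a separate whole-sequence representation would supply the global features that the abstract argues are also needed.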
18.
Methods ; 179: 73-80, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32387314

ABSTRACT

In recent years, accumulating studies have shown that long non-coding RNAs (lncRNAs) not only play an important role in the regulation of various biological processes but are also the foundation for understanding the mechanisms of human diseases. Due to the high cost of traditional biological experiments, the number of experimentally verified lncRNA-disease associations is very limited. Thus, many computational approaches have been proposed to discover the underlying associations between lncRNAs and diseases. However, the associations between lncRNAs and diseases are too complicated to model using only traditional matrix factorization-based methods. In this study, we propose a hybrid computational framework (SDLDA) for lncRNA-disease association prediction. In our computational framework, we use singular value decomposition and deep learning to extract linear and non-linear features of lncRNAs and diseases, respectively. Then we train SDLDA by combining the linear and non-linear features. Compared to previous computational methods, the linear and non-linear features reinforce each other, and their combination performs better than using either matrix factorization or deep learning alone. The computational results show that SDLDA performs better than existing methods in leave-one-out cross-validation. Furthermore, the case studies show that 28 out of 30 cancer-related lncRNAs (10 for gastric cancer, 10 for colon cancer and 8 for renal cancer) are verified by mining recent biomedical literature. Code and data can be accessed at https://github.com/CSUBioGroup/SDLDA.


Subjects
Computational Biology/methods; Deep Learning; Genetic Association Studies/methods; RNA, Long Noncoding/metabolism; Data Mining/methods; Databases, Genetic; Datasets as Topic; Gene Expression Regulation; Genetic Predisposition to Disease; Humans; Neoplasms/genetics; RNA, Long Noncoding/genetics
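
To make the "linear features from singular value decomposition" idea in the SDLDA abstract above concrete, here is a toy sketch that approximates only the leading singular triplet of a small association matrix by power iteration. It is an illustration under stated assumptions, not the SDLDA code: the function name `top_singular_triplet`, the 3x3 matrix, and the choice of power iteration are all hypothetical, and a real pipeline would normally call a numerical library for a full or truncated SVD.

```python
import math


def top_singular_triplet(matrix, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration.

    Assumes a non-zero matrix whose top singular value is well separated.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # deterministic unit start vector
    for _ in range(iterations):
        # u = A v, then w = A^T u; normalising w gives the next estimate of v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical association matrix: rows are lncRNAs, columns are diseases,
# and 1 marks a known association.
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
sigma, lncrna_factor, disease_factor = top_singular_triplet(associations)
print(round(sigma, 6), lncrna_factor, disease_factor)
```

In this sketch `lncrna_factor` and `disease_factor` play the role of one-dimensional linear features; keeping more singular triplets would give longer feature vectors, which could then be combined with features learned by a neural network as the abstract describes.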
19.
Methods ; 173: 75-82, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31301375

ABSTRACT

The wide application of automatic disease inference in many medical fields improves the efficiency of medical treatment. Many efforts have been made to predict patients' future health conditions according to their full clinical texts, clinical measurements or medical codes. Symptoms reflect the onset of diseases and can provide credible information for disease diagnosis. In this study, we propose a new disease inference method by extracting symptoms and integrating two symptom representation approaches. To reduce the uncertainty and irregularity of symptom descriptions in Electronic Medical Records (EMR), a comprehensive clinical knowledge database consisting of a massive amount of data about diseases, symptoms, and their relationships, we extract symptoms with the existing natural language processing tool MetaMap, which is designed for biomedical texts. To take advantage of the complex relationship between symptoms and diseases to enhance the accuracy of disease inference, we present two symptom representation models: a term frequency-inverse document frequency (TF-IDF) model for representing the relationship between symptoms and diseases, and Word2Vec for expressing the semantic relationship between symptoms. Based on these two symptom representations, we employ bidirectional Long Short-Term Memory networks (BiLSTMs) to model symptom sequences in EMR. Our proposed model shows a significant improvement in terms of AUC (0.895) and F1 (0.572) for 50 diseases in the MIMIC-III dataset. The results illustrate that the model combining the two symptom representations performs better than a model using only one of them.


Subjects
Electronic Health Records; Memory, Short-Term/physiology; Neural Networks, Computer; Algorithms; Humans; Natural Language Processing; Semantics
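
The TF-IDF symptom representation mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes the plain, unsmoothed TF-IDF definition and treats each record's symptom list as one document; the abstract does not give the exact weighting used, and the function name `tf_idf` and the toy symptom lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy symptom lists (illustrative only), one list per record.
records = [
    ["fever", "cough", "fever"],
    ["cough", "rash"],
]
weights = tf_idf(records)
print(weights)
```

A symptom that occurs in every record, such as "cough" here, receives weight zero, while symptoms specific to one record are weighted up; this is the sense in which TF-IDF captures how strongly a symptom points to a particular disease.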
20.
BMC Bioinformatics ; 21(Suppl 6): 203, 2020 Nov 18.
Article in English | MEDLINE | ID: mdl-33203392

ABSTRACT

BACKGROUND: One of the most essential problems in structural bioinformatics is protein fold recognition. In this paper, we design a novel deep learning architecture, called DeepFrag-k, which identifies fold-discriminative features at the fragment level to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multi-modal Deep Belief Network (DBN) to predict the potential structural fragments of a given sequence, represented as a fragment vector, and the second stage uses a deep convolutional neural network (CNN) to classify the fragment vector into the corresponding fold. RESULTS: Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition. CONCLUSIONS: There is a set of fragments that can serve as structural "keywords" distinguishing between major protein folds. The deep learning architecture in DeepFrag-k is able to accurately identify these fragments as structural features to improve protein fold recognition.


Subjects
Computational Biology; Deep Learning; Protein Folding; Neural Networks, Computer; Proteins