1.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38622357

ABSTRACT

Pseudouridine is an RNA modification that is widely distributed in both prokaryotes and eukaryotes, and plays a critical role in numerous biological activities. Despite its importance, the precise identification of pseudouridine sites through experimental approaches poses significant challenges, requiring substantial time and resources. Therefore, there is a growing need for computational techniques that can reliably and quickly identify pseudouridine sites from vast amounts of RNA sequencing data. In this study, we propose fuzzy kernel evidence Random Forest (FKeERF) to identify pseudouridine sites. This method is called PseU-FKeERF, which demonstrates high accuracy in identifying pseudouridine sites from RNA sequencing data. The PseU-FKeERF model selected four RNA feature coding schemes with relatively good performance for feature combination, and then input them into the newly proposed FKeERF method for category prediction. FKeERF not only uses fuzzy logic to expand the original feature space, but also combines kernel methods that are easy to interpret in general for category prediction. Both cross-validation tests and independent tests on benchmark datasets have shown that PseU-FKeERF has better predictive performance than several state-of-the-art methods. This new method not only improves the accuracy of pseudouridine site identification, but also provides a reference for disease control and related drug development in the future.


Subjects
Pseudouridine , Random Forest , Pseudouridine/genetics , RNA/genetics , Base Sequence
2.
Nucleic Acids Res ; 52(D1): D990-D997, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37831073

ABSTRACT

Rare variants contribute significantly to the genetic causes of complex traits, as they can have much larger effects than common variants and account for much of the missing heritability in genome-wide association studies. The emergence of UK Biobank scale datasets and accurate gene-level rare variant-trait association testing methods have dramatically increased the number of rare variant associations that have been detected. However, no systematic collection of these associations has been carried out to date, especially at the gene level. To address the issue, we present the Rare Variant Association Repository (RAVAR), a comprehensive collection of rare variant associations. RAVAR includes 95 047 high-quality rare variant associations (76 186 gene-level and 18 861 variant-level associations) for 4429 reported traits which are manually curated from 245 publications. RAVAR is the first resource to collect and curate published rare variant associations in an interactive web interface with integrated visualization, search, and download features. Detailed gene and SNP information are provided for each association, and users can conveniently search for related studies by exploring the EFO tree structure and interactive Manhattan plots. RAVAR could vastly improve the accessibility of rare variant studies. RAVAR is freely available for all users without login requirement at http://www.ravar.bio.


Subjects
Databases, Genetic , Genetic Variation , Genome-Wide Association Study , Genome-Wide Association Study/methods , Multifactorial Inheritance , Phenotype
3.
Brief Bioinform ; 24(6), 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37930024

ABSTRACT

Development of robust and effective strategies for synthesizing new compounds, drug targeting and constructing GEnome-scale Metabolic models (GEMs) requires a deep understanding of the underlying biological processes. A critical step in achieving this goal is accurately identifying the categories of pathways in which a compound participates. However, current machine learning-based methods often overlook the multifaceted nature of compounds, resulting in inaccurate pathway predictions. Therefore, we present a novel framework on Multi-View Multi-Label Learning for Metabolic Pathway Inference, hereby named MVML-MPI. First, MVML-MPI learns the distinct compound representations in parallel with corresponding compound encoders to fully extract features. Subsequently, we propose an attention-based mechanism that offers a fusion module to complement these multi-view representations. As a result, MVML-MPI accurately represents and effectively captures the complex relationship between compounds and metabolic pathways and distinguishes itself from current machine learning-based methods. In experiments conducted on the Kyoto Encyclopedia of Genes and Genomes pathways dataset, MVML-MPI outperformed state-of-the-art methods, demonstrating its superiority and its potential for use in metabolic pathway design, which can aid in optimizing drug-like compounds and facilitating the development of GEMs. The code and data underlying this article are freely available at https://github.com/guofei-tju/MVML-MPI. Contact: jtang@cse.sc.edu, guofei@csu.edu.com or wuxi_dyj@csj.uestc.edu.cn.


Subjects
Machine Learning , Metabolic Networks and Pathways
4.
Brief Bioinform ; 24(1), 2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36502371

ABSTRACT

Deoxyribonucleic acid (DNA) N6-methyladenine (6mA) plays a vital role in various biological processes, and the accurate identification of its sites can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with high costs and low efficiency are gradually being replaced by computational methods. Widely used computational methods can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for identifying 6mA sites, then analyze the general process from sequence input to results in computational methods and review existing model architectures. Finally, the results are summarized and compared to help subsequent researchers choose the most suitable method for their work.


Subjects
DNA Methylation , Machine Learning , Research Design , DNA/genetics
5.
PLoS Comput Biol ; 20(6): e1012229, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38924082

ABSTRACT

De novo drug design is crucial in advancing drug discovery, which aims to generate new drugs with specific pharmacological properties. Recently, deep generative models have achieved inspiring progress in generating drug-like compounds. However, these models prioritize single-target drug generation for pharmacological intervention, neglecting the complicated inherent mechanisms of diseases, which are influenced by multiple factors. Consequently, developing novel multi-target drugs that simultaneously act on several specific targets can enhance anti-tumor efficacy and address issues related to resistance mechanisms. To address this issue and inspired by Generative Pre-trained Transformers (GPT) models, we propose an upgraded GPT model with generative adversarial imitation learning for multi-target molecular generation called MTMol-GPT. The multi-target molecular generator employs a dual discriminator model using the Inverse Reinforcement Learning (IRL) method for concurrent multi-target molecular generation. Extensive results show that MTMol-GPT generates various valid, novel, and effective multi-target molecules for various complex diseases, demonstrating robustness and generalization capability. In addition, molecular docking and pharmacophore mapping experiments demonstrate the drug-likeness and effectiveness of the generated molecules, which could potentially improve neuropsychiatric interventions. Furthermore, our model's generalizability is exemplified by a case study focusing on multi-target drug design for breast cancer. As a broadly applicable solution for multiple targets, MTMol-GPT provides new insight into future directions to enhance potential complex disease therapeutics by generating high-quality multi-target molecules in drug discovery.


Subjects
Computational Biology , Drug Discovery , Molecular Docking Simulation , Humans , Computational Biology/methods , Drug Discovery/methods , Drug Design , Antineoplastic Agents/chemistry , Antineoplastic Agents/pharmacology , Algorithms , Deep Learning , Machine Learning
6.
Methods ; 223: 75-82, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38286333

ABSTRACT

The accurate identification of drug-protein interactions (DPIs) is crucial in drug development, especially concerning G protein-coupled receptors (GPCRs), which are vital targets in drug discovery. However, experimental validation of GPCR-drug pairings is costly, prompting the need for accurate predictive methods. To address this, we propose MFD-GDrug, a multimodal deep learning model. Leveraging the ESM pretrained model, we extract protein features and employ a CNN for protein feature representation. For drugs, we integrated multimodal features of drug molecular structures, including three-dimensional features derived from Mol2vec and the topological information of drug graph structures extracted through Graph Convolutional Neural Networks (GCN). By combining structural characterizations and pretrained embeddings, our model effectively captures GPCR-drug interactions. Our tests on leading GPCR-drug interaction datasets show that MFD-GDrug outperforms other methods, demonstrating superior predictive accuracy.


Subjects
Deep Learning , Drug Interactions , Drug Development , Drug Discovery , Neural Networks, Computer
7.
Brief Bioinform ; 23(2), 2022 Mar 10.
Article in English | MEDLINE | ID: mdl-35134117

ABSTRACT

Targeted drugs have been applied to the treatment of cancer on a large scale, and some patients have shown therapeutic responses. Detecting drug-target interactions (DTIs) through biochemical experiments is time-consuming. At present, machine learning (ML) has been widely applied in large-scale drug screening. However, there are few methods for multiple information fusion. We propose a multiple kernel-based triple collaborative matrix factorization (MK-TCMF) method to predict DTIs. The multiple kernel matrices (containing chemical, biological and clinical information) are integrated via a multi-kernel learning (MKL) algorithm. The original adjacency matrix of DTIs is decomposed into three matrices: the latent feature matrix of the drug space, the latent feature matrix of the target space and the bi-projection matrix (used to join the two feature spaces). To obtain better prediction performance, the MKL algorithm adjusts the weight of each kernel matrix according to the prediction error. The weights of drug side-effects and target sequence are the highest. Compared with other computational methods, our model has better performance on four test data sets.


Subjects
Algorithms , Drug-Related Side Effects and Adverse Reactions , Drug Interactions , Humans , Machine Learning
8.
Brief Bioinform ; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36259601

ABSTRACT

Across the entire life cycle of drug development, side effects are one of the major causes of failure. Severe side effects of drugs that go undetected until the post-marketing stage lead to around two million patient morbidities every year in the United States. Therefore, there is an urgent need for a method to predict side effects of approved drugs and new drugs. Following this need, we present a new predictor for finding side effects of drugs. Firstly, multiple similarity matrices are constructed based on the association profile feature and drug chemical structure information. Secondly, these similarity matrices are integrated by the Centered Kernel Alignment-based Multiple Kernel Learning algorithm. Then, Weighted K nearest known neighbors is utilized to complement the adjacency matrix. Next, we construct Restricted Boltzmann machines (RBMs) in drug space and side effect space, respectively, and apply a penalized maximum likelihood approach to train the model. Finally, the average decision rule is adopted to integrate predictions from the RBMs. Comparison results and case studies on four benchmark datasets demonstrate that our method gives more accurate and reliable predictions.


Subjects
Algorithms , Drug-Related Side Effects and Adverse Reactions , Humans , Likelihood Functions , Cluster Analysis
9.
Methods ; 219: 73-81, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37783242

ABSTRACT

Adverse drug reactions include side effects, allergic reactions, and secondary infections. Severe adverse reactions can cause cancer, deformity, or mutation. The monitoring of drug side effects is an important support for post-marketing safety supervision of drugs, and an important basis for revising drug instructions. Its purpose is to detect and control drug safety risks in a timely manner. Traditional methods are time-consuming. To accelerate the discovery of side effects, we propose a machine learning-based method, called correntropy-loss based matrix factorization with neural tangent kernel (CLMF-NTK), to predict drug side effects. Our method and other computational methods are tested on three benchmark datasets, and the results show that our method achieves the best predictive performance.


Subjects
Drug-Related Side Effects and Adverse Reactions , Neoplasms , Humans , Machine Learning , Neoplasms/genetics , Benchmarking , Algorithms
10.
Brief Bioinform ; 22(5), 2021 Sep 02.
Article in English | MEDLINE | ID: mdl-33443536

ABSTRACT

Accurately identifying associations between non-coding RNAs (ncRNAs) and diseases could greatly help human biomedical research. However, traditional techniques are applied to only one type of non-coding RNA or one specific disease, and experimental methods are time-consuming and expensive. More computational tools have been proposed to detect new associations based on known ncRNA and disease information. Because ncRNAs (circRNAs, miRNAs and lncRNAs) are closely related to the progression of various human diseases, it is critical to develop effective computational predictors for ncRNA-disease association prediction. In this paper, we propose a new computational method of three-matrix factorization with hypergraph regularization terms (HGRTMF) based on centered kernel alignment (CKA), for identifying general ncRNA-disease associations. In the process of constructing the similarity matrix, various types of similarity matrices are applicable to circRNAs, miRNAs and lncRNAs. Our method achieves excellent performance on five datasets, involving three types of ncRNAs. In the test, we obtain best area under the curve scores of 0.9832, 0.9775, 0.9023, 0.8809 and 0.9185 via 5-fold cross-validation and 0.9832, 0.9836, 0.9198, 0.9459 and 0.9275 via leave-one-out cross-validation on the five datasets. Furthermore, our novel method (CKA-HGRTMF) is also able to discover new associations between ncRNAs and diseases accurately. Availability: Code and data are available at https://github.com/hzwh6910/ncRNA2Disease.git. Contact: fguo@tju.edu.cn.


Subjects
Algorithms , Computational Biology , Disease/genetics , Models, Genetic , RNA, Untranslated , Humans , RNA, Untranslated/genetics , RNA, Untranslated/metabolism
11.
Methods ; 208: 1-8, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36220606

ABSTRACT

An enhancer is a short DNA sequence containing many binding sites of transcription factors that plays a crucial role in the gene expression of major eukaryotes. It is difficult to avoid the time consumption and high cost of experimental methods. Therefore, with the continuous development of genomics, it is an urgent task to identify enhancers and their intensities by computational methods. In this paper, we propose a two-layer model called iEnhancer-MRBF, wherein the first layer is used to identify enhancers, and the identified enhancers are divided into strong enhancers and weak enhancers according to their strength in the second layer. In iEnhancer-MRBF, a new classifier, the multiple Laplacian-regularized radial basis function network (MLR-RBFN), is proposed, and three feature representation methods, namely, kmer, nucleotide binary profiles (NBP) and accumulated nucleotide frequency (ANF), as well as feature selection, are used to process DNA sequences. The experimental results show that the model is significantly better than the previous prediction models, and the test accuracy rates of the first and second layers on independent datasets are 79.75% and 83.50%, respectively.
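
The three feature representations named in this abstract (kmer, NBP and ANF) can be illustrated with a minimal Python sketch. The function names, the DNA alphabet order and the default k = 2 are assumptions made for illustration; the abstract does not specify the authors' exact parameters or feature selection step, so this is not their implementation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet order for DNA sequences


def kmer_frequencies(seq, k=2):
    """Normalized frequency of every k-mer over ACGT, in lexicographic order."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] / windows for p in product(ALPHABET, repeat=k)]


def binary_profile(seq):
    """Nucleotide binary profile: four one-hot bits per position."""
    return [1 if base == ref else 0 for base in seq for ref in ALPHABET]


def accumulated_frequency(seq):
    """ANF: for position i, the fraction of the prefix seq[:i+1] equal to seq[i]."""
    seen = Counter()
    densities = []
    for i, base in enumerate(seq, start=1):
        seen[base] += 1
        densities.append(seen[base] / i)
    return densities
```

For a fixed-length input sequence, the three vectors could simply be concatenated into one feature vector before classification; whether the paper combines them this way is not stated in the abstract.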


Subjects
Enhancer Elements, Genetic , Genomics , Genomics/methods , Nucleotides , Transcription Factors/metabolism , Base Sequence
12.
Methods ; 207: 29-37, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36087888

ABSTRACT

DNA-binding proteins actively participate in life activities such as DNA replication, recombination, gene expression and regulation and play a prominent role in these processes. As the number of discovered DNA-binding proteins continues to increase, it is imperative to design an efficient and accurate identification tool. Considering the time-consuming and expensive traditional experimental technology and the insufficient number of samples in the biological computing method based on structural information, we proposed a machine learning algorithm based on sequence information to identify DNA-binding proteins, named multi-view Least Squares Support Vector Machine via Hilbert-Schmidt Independence Criterion (multi-view LSSVM via HSIC). This method took six feature sets as multi-view input and trained each single view through the LSSVM algorithm. Then, we integrated HSIC into LSSVM as a regularization term to reduce the dependence between views and explored the complementary information of multiple views. Subsequently, we trained and coordinated the submodels and finally combined the submodels in the form of weights to obtain the final prediction model. On training set PDB1075, the prediction results of our model were better than those of most existing methods. Independent tests were conducted on the datasets PDB186 and PDB2272. The accuracy of the prediction results was 85.5% and 79.36%, respectively. This result exceeded the current state-of-the-art methods, which showed that the multi-view LSSVM via HSIC can be used as an efficient predictor.
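
As an illustration of the dependence measure this abstract relies on, the following sketch computes the standard biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2, between the kernel matrices of two views. The linear kernel and the helper names are assumptions chosen for brevity; the abstract does not say which kernels or estimator the authors used, nor how the term enters the LSSVM objective.

```python
def linear_kernel(X):
    """Gram matrix of a linear kernel for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def center(K):
    """Double-center a square kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(K, L):
    """Biased empirical HSIC between two n x n kernel matrices."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    # trace(Kc Lc) equals trace(K H L H) because H is idempotent
    trace = sum(Kc[i][j] * Lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2
```

A value near zero suggests the two views carry largely independent information under the chosen kernels, which is the property a penalty of this kind would encourage.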


Subjects
DNA-Binding Proteins , Support Vector Machine , DNA-Binding Proteins/chemistry , Least-Squares Analysis , Machine Learning , Algorithms
13.
Int J Mol Sci ; 24(12), 2023 Jun 12.
Article in English | MEDLINE | ID: mdl-37373163

ABSTRACT

High-fat diet (HFD)-induced insulin resistance (IR) in skeletal muscle is often accompanied by mitochondrial dysfunction and oxidative stress. Boosting nicotinamide adenine dinucleotide (NAD) using nicotinamide riboside (NR) can effectively decrease oxidative stress and increase mitochondrial function. However, whether NR can ameliorate IR in skeletal muscle is still inconclusive. We fed male C57BL/6J mice with an HFD (60% fat) ± 400 mg/kg·bw NR for 24 weeks. C2C12 myotube cells were treated with 0.25 mM palmitic acid (PA) ± 0.5 mM NR for 24 h. Indicators for IR and mitochondrial dysfunction were analyzed. NR treatment alleviated IR in HFD-fed mice with regard to improved glucose tolerance and a remarkable decrease in the levels of fasting blood glucose, fasting insulin and HOMA-IR index. NR-treated HFD-fed mice also showed improved metabolic status regarding a significant reduction in body weight and lipid contents in serum and the liver. NR activated AMPK in the skeletal muscle of HFD-fed mice and PA-treated C2C12 myotube cells and upregulated the expression of mitochondria-related transcriptional factors and coactivators, thereby improving mitochondrial function and alleviating oxidative stress. Upon inhibiting AMPK using Compound C, NR lost its ability in enhancing mitochondrial function and protection against IR induced by PA. In summary, improving mitochondrial function through the activation of AMPK pathway in skeletal muscle may play an important role in the amelioration of IR using NR.


Subjects
Insulin Resistance , Male , Mice , Animals , Insulin Resistance/physiology , AMP-Activated Protein Kinases/metabolism , Mice, Inbred C57BL , Mitochondria , Muscle, Skeletal/metabolism , Insulin/metabolism , Palmitic Acid/pharmacology , Palmitic Acid/metabolism , Diet, High-Fat/adverse effects
14.
Brief Bioinform ; 21(5): 1628-1640, 2020 Sep 25.
Article in English | MEDLINE | ID: mdl-31697319

ABSTRACT

Human protein subcellular localization has an important research value in biological processes, also in elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize the progress of research on the subcellular localization of human proteins in recent years, including commonly used data sets proposed by the predecessors and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in performance, as measured by the value of accuracy, relative to the other methods. Meanwhile, we build a new data set using the latest version of Uniprot database and construct a new GO-based prediction method HumLoc-LBCI in this paper. Then, we test all selected prediction tools on the new data set. Finally, we discuss the possible development directions of human protein subcellular localization. Availability: The codes and data are available from http://www.lbci.cn/syn/.


Subjects
Internet , Proteins/metabolism , Subcellular Fractions/metabolism , Benchmarking , Datasets as Topic , Humans
15.
Int J Mol Sci ; 23(6), 2022 Mar 11.
Article in English | MEDLINE | ID: mdl-35328461

ABSTRACT

Dihydrouridine (D) is an abundant post-transcriptional modification present in transfer RNA from eukaryotes, bacteria, and archaea. D has contributed to treatments for cancerous diseases. Therefore, the precise detection of D modification sites can enable further understanding of its functional roles. Traditional experimental techniques to identify D are laborious and time-consuming. In addition, there are few computational tools for such analysis. In this study, we utilized eleven sequence-derived feature extraction methods and implemented five popular machine learning algorithms to identify an optimal model. During data preprocessing, data were partitioned for training and testing. Oversampling was also adopted to reduce the effect of the imbalance between positive and negative samples. The best-performing model was obtained through a combination of random forest and nucleotide chemical property modeling. The optimized model presented high sensitivity and specificity values of 0.9688 and 0.9706 in independent tests, respectively. Our proposed model surpassed published tools in independent tests. Furthermore, a series of validations across several aspects was conducted in order to demonstrate the robustness and reliability of our model.
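
The two quantities reported in this abstract, and the kind of encoding it mentions, can be sketched as follows. The three-bit nucleotide chemical property table (ring structure, functional group, hydrogen bonding) is a commonly used convention and is an assumption here, since the abstract does not give the paper's exact encoding; the function names are likewise hypothetical.

```python
# Assumed convention: (purine ring, amino group, weak hydrogen bonding)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def ncp_encode(seq):
    """Flatten an RNA sequence into its nucleotide chemical property bits."""
    return [bit for base in seq for bit in NCP[base]]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

Reporting both values matters for imbalanced data such as modification sites, because accuracy alone can look high when the negative class dominates.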


Subjects
Algorithms , Nucleotides , Computational Biology/methods , RNA, Transfer , Reproducibility of Results
16.
BMC Bioinformatics ; 22(Suppl 3): 291, 2021 May 31.
Article in English | MEDLINE | ID: mdl-34058979

ABSTRACT

BACKGROUND: DNA-Binding Proteins (DBPs) play a pivotal role in biological systems. A growing number of researchers are studying their mechanisms and detection methods. The traditional experimental approach to detecting DBPs is time-consuming and resource-intensive. In recent years, Machine Learning methods have been used to detect DBPs. However, it is difficult to adequately describe the information of proteins in predicting DNA-binding proteins. In this study, we extract six features from protein sequence and use Multiple Kernel Learning based on Centered Kernel Alignment to integrate these features. The integrated feature is fed into a Support Vector Machine to build a predictive model and detect new DBPs. RESULTS: In our work, the data sets PDB1075 and PDB186 are employed to test our method. From the results, our model obtains better results (accuracy) than other existing methods on PDB1075 ([Formula: see text]) and PDB186 ([Formula: see text]), respectively. CONCLUSION: Multiple kernel learning could fuse the complementary information between different features. Compared with existing methods, our method achieves comparable or the best results on benchmark data sets.
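
The alignment score at the heart of Centered Kernel Alignment can be sketched in a few lines. The example below computes the alignment between a feature kernel and an ideal kernel built from labels; using this score to weight kernels is one common use, and it is an assumption that the paper's multiple kernel learning step works this way. Function names are hypothetical.

```python
import math


def center(K):
    """Double-center a square kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def centered_alignment(K1, K2):
    """<K1c, K2c>_F / (||K1c||_F ||K2c||_F), or 0.0 if a centered kernel vanishes."""
    A, B = center(K1), center(K2)
    denom = math.sqrt(frobenius(A, A) * frobenius(B, B))
    return frobenius(A, B) / denom if denom else 0.0


def outer(v):
    """Rank-one kernel v v^T, e.g. the ideal kernel for labels in {-1, +1}."""
    return [[a * b for b in v] for a in v]
```

A kernel whose alignment with the ideal label kernel is higher would receive more weight in a combined kernel under this kind of scheme.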


Subjects
DNA-Binding Proteins , Support Vector Machine , Machine Learning
17.
BMC Bioinformatics ; 22(Suppl 3): 431, 2021 Sep 08.
Article in English | MEDLINE | ID: mdl-34496763

ABSTRACT

BACKGROUND: RNA secondary structure prediction is an important research topic in bioinformatics. Predicting RNA secondary structure with pseudoknots has been proved to be an NP-hard problem. Because of constraints in their model design, traditional machine learning methods cannot effectively use information from sequences of different lengths when predicting RNA secondary structure. In addition, there is a large difference between the number of paired bases and the number of unpaired bases in RNA sequences, and this imbalance between positive and negative samples can easily trap the model in a local optimum. To solve these problems, this paper proposes a variable-length dynamic bidirectional Gated Recurrent Unit (VLDB GRU) model. The model can accept sequences of different lengths through the introduction of a flag vector. The model can also make full use of the base information before and after the predicted base and can avoid losing part of the information due to truncation. A weight vector that dynamically adjusts the loss of each base during training addresses the sample imbalance problem. RESULTS: The algorithm proposed in this paper is compared with the existing algorithms on five representative subsets of the data set RNA STRAND. The experimental results show that the accuracy and Matthews correlation coefficient of the method are improved by 4.7% and 11.4%, respectively. CONCLUSIONS: The flag vector allows the model to effectively use the information before and after each position in the sequence; the weight vector solves the sample imbalance problem. Compared with other algorithms, the VLDB GRU algorithm proposed in this paper achieves the best prediction results.


Subjects
Neural Networks, Computer , RNA , Algorithms , Nucleic Acid Conformation , Protein Structure, Secondary , RNA/genetics
18.
BMC Genomics ; 22(1): 605, 2021 Aug 09.
Article in English | MEDLINE | ID: mdl-34372777

ABSTRACT

BACKGROUND: Identifying potential associations between genes and diseases via biomedical experiments is time-consuming and expensive. Computational technologies based on machine learning models have been widely used to explore genetic information related to complex diseases. Importantly, gene-disease association detection can be defined as a link prediction problem in a bipartite network. However, many existing methods do not utilize multiple sources of biological information; additionally, they do not extract higher-order relationships among genes and diseases. RESULTS: In this study, we propose a novel method called Dual Hypergraph Regularized Least Squares (DHRLS) with Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL) to detect potential gene-disease associations. First, we construct multiple kernels based on various biological data sources in the gene and disease spaces, respectively. After that, we use CKA-MKL to obtain the optimal kernels in the two spaces. Specifically, hypergraphs are employed to establish higher-order relationships. Finally, our DHRLS model is solved by the Alternating Least Squares algorithm (ALSA) to predict gene-disease associations. CONCLUSION: Compared with many outstanding prediction tools, DHRLS achieves the best performance on the gene-disease association network under two types of cross-validation. To verify robustness, we also evaluated our approach on six real-world networks, where it shows excellent prediction performance. Our work can effectively discover potential disease-associated genes and provide guidance for follow-up verification studies of complex diseases.


Subjects
Algorithms , Gene Regulatory Networks , Humans , Least-Squares Analysis , Machine Learning
19.
BMC Genomics ; 22(1): 56, 2021 Jan 15.
Article in English | MEDLINE | ID: mdl-33451286

ABSTRACT

BACKGROUND: Biological functions of biomolecules rely on the cellular compartments where they are located in cells. Importantly, RNAs are assigned to specific locations in a cell, enabling the cell to carry out diverse biochemical processes concurrently. However, many existing RNA subcellular localization classifiers only solve the problem of single-label classification. It is of great practical significance to extend RNA subcellular localization to a multi-label classification problem. RESULTS: In this study, we extract multi-label classification datasets about RNA-associated subcellular localizations on various types of RNAs, and then construct subcellular localization datasets on four RNA categories. In order to study Homo sapiens, we further establish human RNA subcellular localization datasets. Furthermore, we utilize different nucleotide property composition models to extract effective features to adequately represent the important information of nucleotide sequences. In the most critical part, we address a major challenge: fusing the multivariate information through multiple kernel learning based on the Hilbert-Schmidt independence criterion. The optimal combined kernel can be put into an integration support vector machine model for identifying multi-label RNA subcellular localizations. Our method obtained excellent average precision values of 0.703, 0.757, 0.787, and 0.800 on the four RNA data sets, respectively. CONCLUSION: Our novel method outperforms other prediction tools on novel benchmark datasets. Moreover, we establish a user-friendly web server with the implementation of our method.


Subjects
Computational Biology , Proteins , Databases, Protein , Humans , RNA/genetics , Support Vector Machine
20.
BMC Bioinformatics ; 20(Suppl 25): 683, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874596

ABSTRACT

BACKGROUND: In ab initio protein-structure predictions, a large set of structural decoys is often generated, with the requirement to select the best five or three candidates from the decoys. The clustered central structures with the largest number of neighbors are frequently regarded as the near-native protein structures with the lowest free energy; however, limitations in clustering methods and three-dimensional structural-distance assessments make identifying the exact order of the best five or three near-native candidate structures difficult. RESULTS: To address this issue, we propose a method that re-ranks the candidate structures via random forest classification using intra- and inter-cluster features from the results of the clustering. Comparative analysis indicated that our method was better able to identify the order of the candidate structures compared with the current methods SPICKER, Calibur, and Durandal. The results confirmed that the first model identified was closer to the native structure in 12 of 43 cases versus four for SPICKER, and the same as the native structure in up to 27 of 43 cases versus 14 for Calibur and up to eight of 43 cases versus two for Durandal. CONCLUSIONS: In this study, we presented an improved method based on random forest classification that transforms the problem of re-ranking the candidate structures into a binary classification problem. Our results indicate that this method is powerful for the problem and performs better than other methods.


Subjects
Algorithms , Proteins/chemistry , Cluster Analysis , Protein Conformation