Results 1 - 20 of 21
1.
Genome Med ; 16(1): 56, 2024 Apr 16.
Article in English | MEDLINE | ID: mdl-38627848

ABSTRACT

Despite the abundance of genotype-phenotype association studies, the resulting association outcomes often lack robustness and interpretations. To address these challenges, we introduce PheSeq, a Bayesian deep learning model that enhances and interprets association studies through the integration and perception of phenotype descriptions. By implementing the PheSeq model in three case studies on Alzheimer's disease, breast cancer, and lung cancer, we identify 1024 priority genes for Alzheimer's disease and 818 and 566 genes for breast cancer and lung cancer, respectively. Benefiting from data fusion, these findings represent moderate positive rates, high recall rates, and interpretation in gene-disease association studies.


Subjects
Alzheimer Disease , Breast Neoplasms , Deep Learning , Lung Neoplasms , Humans , Female , Alzheimer Disease/genetics , Bayes Theorem , Genetic Association Studies , Breast Neoplasms/genetics
2.
Front Plant Sci ; 14: 1175837, 2023.
Article in English | MEDLINE | ID: mdl-37229121

ABSTRACT

Introduction: An emerging approach using promoter tiling deletion via genome editing is beginning to become popular in plants. Identifying the precise positions of core motifs within plant gene promoter is of great demand but they are still largely unknown. We previously developed TSPTFBS of 265 Arabidopsis transcription factor binding sites (TFBSs) prediction models, which now cannot meet the above demand of identifying the core motif. Methods: Here, we additionally introduced 104 maize and 20 rice TFBS datasets and utilized DenseNet for model construction on a large-scale dataset of a total of 389 plant TFs. More importantly, we combined three biological interpretability methods including DeepLIFT, in-silico tiling deletion, and in-silico mutagenesis to identify the potential core motifs of any given genomic region. Results: For the results, DenseNet not only has achieved greater predictability than baseline methods such as LS-GKM and MEME for above 389 TFs from Arabidopsis, maize and rice, but also has greater performance on trans-species prediction of a total of 15 TFs from other six plant species. A motif analysis based on TF-MoDISco and global importance analysis (GIA) further provide the biological implication of the core motif identified by three interpretability methods. Finally, we developed a pipeline of TSPTFBS 2.0, which integrates 389 DenseNet-based models of TF binding and the above three interpretability methods. Discussion: TSPTFBS 2.0 was implemented as a user-friendly web-server (http://www.hzau-hulab.com/TSPTFBS/), which can support important references for editing targets of any given plant promoters and it has great potentials to provide reliable editing target of genetic screen experiments in plants.

3.
Sensors (Basel) ; 23(2)2023 Jan 13.
Article in English | MEDLINE | ID: mdl-36679714

ABSTRACT

Many visual SLAM systems rely on natural landmarks or optical flow. However, due to textureless areas, illumination change or motion blur, they often produce poor camera poses or even fail to track. Additionally, they cannot obtain camera poses with a metric scale in the monocular case. In some cases (such as when calibrating the extrinsic parameters of a camera-IMU system), we prefer to sacrifice the flexibility of such methods to improve accuracy and robustness by using artificial landmarks. This paper proposes enhancements to the traditional SPM-SLAM, a system that builds a map of markers and simultaneously localizes the camera pose. By placing the markers in the surrounding environment, the system can run stably and obtain accurate camera poses. To improve robustness and accuracy in the case of rotational movements, we improve the initialization, keyframe insertion and relocalization. Additionally, we propose a novel method to estimate marker poses from a set of images to solve the problem of planar-marker pose ambiguity. Compared with the state of the art, the experiments show that our system achieves better accuracy in most public sequences and is more robust than SPM-SLAM under rotational movements. Finally, the open-source code is publicly available on GitHub.


Subjects
Algorithms , Imaging, Three-Dimensional , Imaging, Three-Dimensional/methods , Software , Movement , Photic Stimulation
4.
Curr Opin Biotechnol ; 79: 102887, 2023 02.
Article in English | MEDLINE | ID: mdl-36640453

ABSTRACT

Genomics and deep learning are a natural match since both are data-driven fields. Regulatory genomics concerns functional noncoding DNA that regulates gene expression. In recent years, deep learning applications in regulatory genomics have achieved remarkable advances, so much so that they have changed the rules of the game for computational methods in this field. Here, we review two emerging trends: (i) the modeling of very long input sequences (up to 200 kb), which requires self-matched modularization of model architecture; and (ii) the balance between model predictability and model interpretability, because the latter is better able to meet biological demands. Finally, we discuss how to employ these two routes to design synthetic regulatory DNA, as a promising strategy for optimizing crop agronomic properties.


Subjects
Deep Learning , Genomics
6.
Pancreas ; 50(6): 873-878, 2021 07 01.
Article in English | MEDLINE | ID: mdl-34347724

ABSTRACT

OBJECTIVES: The objective of this study was to develop and validate a model, based on the blood biochemical (BBC) indexes, to predict the recurrence of acute pancreatitis patients. METHODS: We retrospectively enrolled 923 acute pancreatitis patients (586 in the primary cohort and 337 in the validation cohort) from January 2014 to December 2016. Aiming for an extreme imbalance between recurrent acute pancreatitis (RAP) and non-RAP patients (about 1:4), we designed BBC index selection using least absolute shrinkage and selection operator regression, along with an ensemble-learning strategy to obtain a BBC signature. Multivariable logistic regression was used to build the RAP predictive model. RESULTS: The BBC signature, consisting of 35 selected BBC indexes, was significantly higher in patients with RAP (P < 0.001). The area under the curve of the receiver operating characteristic curve of BBC signature model was 0.6534 in the primary cohort and 0.7173 in the validation cohort. The RAP predictive nomogram incorporating the BBC signature, age, hypertension, and diabetes showed better discrimination, with an area under the curve of 0.6538 in the primary cohort and 0.7212 in the validation cohort. CONCLUSIONS: Our study developed a RAP predictive nomogram with good performance, which could be conveniently and efficiently used to optimize individualized prediction of RAP.


Subjects
Nomograms , Pancreas/pathology , Pancreatitis/diagnosis , Precision Medicine/methods , Acute Disease , Adult , Aged , Female , Humans , Logistic Models , Male , Middle Aged , Multivariate Analysis , Prognosis , Recurrence , Reproducibility of Results , Sensitivity and Specificity
7.
Nucleic Acids Res ; 49(W1): W523-W529, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34037796

ABSTRACT

Characterizing regulatory effects of genomic variants in plants remains a challenge. Although several tools based on deep-learning models and large-scale chromatin-profiling data have been available to predict regulatory elements and variant effects, no dedicated tools or web services have been reported in plants. Here, we present PlantDeepSEA as a deep learning-based web service to predict regulatory effects of genomic variants in multiple tissues of six plant species (including four crops). PlantDeepSEA provides two main functions. One is called Variant Effector, which aims to predict the effects of sequence variants on chromatin accessibility. Another is Sequence Profiler, a utility that performs 'in silico saturated mutagenesis' analysis to discover high-impact sites (e.g., cis-regulatory elements) within a sequence. When validated on independent test sets, the area under receiver operating characteristic curve of deep learning models in PlantDeepSEA ranges from 0.93 to 0.99. We demonstrate the usability of the web service with two examples. PlantDeepSEA could help to prioritize regulatory causal variants and might improve our understanding of their mechanisms of action in different tissues in plants. PlantDeepSEA is available at http://plantdeepsea.ncpgr.cn/.


Subjects
Genetic Variation , Genome, Plant , Regulatory Sequences, Nucleic Acid , Software , Chromatin , Deep Learning , Genes, Plant , Genomics , Internet , Oryza/genetics , Plants/genetics , Polymorphism, Genetic , Quantitative Trait Loci , Zea mays/genetics
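The 'in silico saturated mutagenesis' analysis mentioned in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PlantDeepSEA's implementation: the names `saturation_mutagenesis` and `gc_fraction` are hypothetical, and the toy GC-fraction scorer only stands in for a trained chromatin-accessibility model.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def saturation_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Score every single-base substitution relative to the reference sequence.

    Returns a mapping (position, alternative base) -> score(mutant) - score(reference).
    """
    seq = seq.upper()
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the unchanged base
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - ref_score
    return effects


def gc_fraction(seq: str) -> float:
    """Hypothetical stand-in scorer: fraction of G/C bases (not a real model)."""
    return sum(base in "GC" for base in seq) / len(seq)


effects = saturation_mutagenesis("ACG", gc_fraction)
```

Positions whose substitutions produce the largest absolute score changes would be the candidate high-impact sites; with a real model, `score_fn` would return the predicted accessibility for a sequence.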
8.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-34020535

ABSTRACT

The multivariate genomic selection (GS) models have not been adequately studied and their potential remains unclear. In this study, we developed a highly efficient bivariate (2D) GS method and demonstrated its significant advantages over the univariate (1D) rival methods using a rice dataset, where four traditional traits (i.e. yield, 1000-grain weight, grain number and tiller number) as well as 1000 metabolomic traits were analyzed. The novelty of the method is the incorporation of the HAT methodology in the 2D BLUP GS model such that the computational efficiency has been dramatically increased by avoiding the conventional cross-validation. The results indicated that (1) the 2D BLUP-HAT GS analysis generally produces higher predictabilities for two traits than those achieved by the analysis of individual traits using 1D GS model, and (2) selected metabolites may be utilized as ancillary traits in the new 2D BLUP-HAT GS method to further boost the predictability of traditional traits, especially for agronomically important traits with low 1D predictabilities.


Subjects
Models, Genetic , Oryza/genetics , Quantitative Trait Loci , Selection, Genetic
9.
Bioinformatics ; 37(2): 260-262, 2021 Apr 19.
Article in English | MEDLINE | ID: mdl-33416862

ABSTRACT

MOTIVATION: Both the lack or limitation of experimental data of transcription factor binding sites (TFBS) in plants and the independent evolutions of plant TFs make computational approaches for identifying plant TFBSs lagging behind the relevant human researches. Observing that TFs are highly conserved among plant species, here we first employ the deep convolutional neural network (DeepCNN) to build 265 Arabidopsis TFBS prediction models based on available DAP-seq (DNA affinity purification sequencing) datasets, and then transfer them into homologous TFs in other plants. RESULTS: DeepCNN not only achieves greater successes on Arabidopsis TFBS predictions when compared with gkm-SVM and MEME but also has learned its known motif for most Arabidopsis TFs as well as cooperative TF motifs with protein-protein interaction evidences as its biological interpretability. Under the idea of transfer learning, trans-species prediction performances on ten TFs of other three plants of Oryza sativa, Zea mays and Glycine max demonstrate the feasibility of current strategy. AVAILABILITY AND IMPLEMENTATION: The trained 265 Arabidopsis TFBS prediction models were packaged in a Docker image named TSPTFBS, which is freely available on DockerHub at https://hub.docker.com/r/vanadiummm/tsptfbs. Source code and documentation are available on GitHub at: https://github.com/liulifenyf/TSPTFBS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

10.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32392580

ABSTRACT

P53 is the 'guardian of the genome' and is responsible for regulating the cell cycle and apoptosis. The genomic p53 binding regions, where activating transcription factors and cofactors like p300 simultaneously bind, are called 'p53-dependent enhancers', which play an important role in tumorigenesis. Current experimental assays generally provide a broad peak for each enhancer element, leaving our knowledge about critical enhancer regions (CERs) limited. Inspired by enhancer dissection with a CRISPR-Cas9 screen library on genome-wide p53 binding sites, here we introduce a statistical framework called 'Computational CRISPR Strategy' (CCS) to predict whether a given DNA fragment will be a p53-dependent CER, using 7-mers as features and a random forest as the regressor. When trained on a p53 CRISPR enhancer dataset, CCS not only accurately fitted the top-ranked enriched single guide RNAs (sgRNAs) but also successfully reproduced two known CERs that were validated by experiments. When applied to an independent testing dataset on a tiling of a 2-kb genomic region of CRISPR-deCDKN1A-Lib, the trained model shows great generalizability by identifying a CER containing five top-ranked sgRNAs. A feature importance analysis further indicates that top-ranked 7-mers map onto informative TF motifs including POU5F1 and SOX5, which are differentially enriched in p53-dependent CERs and are potential factors that turn a general p53 binding site into a p53-dependent CER, providing interpretability for the trained model. Our results demonstrate that CCS is an alternative to CRISPR experiments for screening the genome to map p53-dependent CERs.


Subjects
Enhancer Elements, Genetic , Genes, p53 , CRISPR-Cas Systems , Datasets as Topic , Humans , RNA, Guide, Kinetoplastida/genetics
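The 7-mer feature extraction described in the abstract above can be illustrated with a short Python sketch. It shows only the k-mer counting step; the function name `kmer_counts` is hypothetical, the handling of ambiguous bases is an assumption, and the random forest regressor that the abstract pairs with these features is not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k=7):
    """Count overlapping k-mers in a DNA fragment.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption; the abstract does not say how such bases are treated).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


counts = kmer_counts("ACGTACGTA")
```

For a fixed-length feature vector, the counts would be laid out over a chosen ordering of all 4^7 = 16384 possible 7-mers before being passed to a regressor.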
11.
Int J Mol Sci ; 20(7)2019 Apr 05.
Article in English | MEDLINE | ID: mdl-30959806

ABSTRACT

Abstract: Deciphering the code of cis-regulatory element (CRE) is one of the core issues of current biology. As an important category of CRE, enhancers play crucial roles in gene transcriptional regulations in a distant manner. Further, the disruption of an enhancer can cause abnormal transcription and, thus, trigger human diseases, which means that its accurate identification is currently of broad interest. Here, we introduce an innovative concept, i.e., abelian complexity function (ACF), which is a more complex extension of the classic subword complexity function, for a new coding of DNA sequences. After feature selection by an upper bound estimation and integration with DNA composition features, we developed an enhancer prediction model with hybrid abelian complexity features (HACF). Compared with existing methods, HACF shows consistently superior performance on three sources of enhancer datasets. We tested the generalization ability of HACF by scanning human chromosome 22 to validate previously reported super-enhancers. Meanwhile, we identified novel candidate enhancers which have supports from enhancer-related ENCODE ChIP-seq signals. In summary, HACF improves current enhancer prediction and may be beneficial for further prioritization of functional noncoding variants.


Subjects
Computational Biology/methods , Regulatory Sequences, Nucleic Acid/genetics , Algorithms , Base Sequence , Chromosomes, Human, Pair 22/genetics , Disease/genetics , Enhancer Elements, Genetic , Entropy , Exons/genetics , Humans , Introns/genetics , Promoter Regions, Genetic/genetics
12.
Plant Biotechnol J ; 17(10): 2011-2020, 2019 10.
Article in English | MEDLINE | ID: mdl-30950198

ABSTRACT

Genomic prediction (GP) aims to construct a statistical model for predicting phenotypes using genome-wide markers and is a promising strategy for accelerating molecular plant breeding. However, current progress of phenotype prediction using genomic data alone has reached a bottleneck, and previous studies on transcriptomic and metabolomic predictions ignored genomic information. Here, we designed a novel strategy of GP called multilayered least absolute shrinkage and selection operator (MLLASSO) by integrating multiple omic data into a single model that iteratively learns three layers of genetic features (GFs) supervised by observed transcriptome and metabolome. Significantly, MLLASSO learns higher order information of gene interactions, which enables us to achieve a significant improvement of predictability of yield in rice from 0.1588 (GP alone) to 0.2451 (MLLASSO). In the prediction of the first two layers, some genes were found to be genetically predictable genes (GPGs) as their expressions were accurately predicted with genetic markers. Interestingly, we made three dramatic discoveries for the GPGs: (i) GPGs are good predictors for highly complex traits like yield; (ii) GPGs are mostly eQTL genes (cis or trans); and (iii) trait-related transcriptional factor families are enriched in GPGs. These findings support the notion that learned GFs not only are good predictors for traits but also have specific biological implications regarding regulation of gene expressions. To differentiate the new method from conventional GP models, we called MLLASSO a directed learning strategy supervised by intermediate omic data. This new prediction model appears to be more reliable and more robust than conventional GP models.


Subjects
Genomics/methods , Oryza/genetics , Supervised Machine Learning , Genetic Markers , Metabolome , Models, Genetic , Models, Statistical , Phenotype , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Transcriptome
13.
Front Genet ; 10: 1305, 2019.
Article in English | MEDLINE | ID: mdl-31969903

ABSTRACT

Deciphering the code of cis-regulatory element (CRE) is one of the core issues of today's biology. Enhancers are distal CREs and play significant roles in gene transcriptional regulation. Although identifications of enhancer locations across the whole genome [discriminative enhancer predictions (DEP)] is necessary, it is more important to predict in which specific cell or tissue types, they will be activated and functional [tissue-specific enhancer predictions (TSEP)]. Although existing deep learning models achieved great successes in DEP, they cannot be directly employed in TSEP because a specific cell or tissue type only has a limited number of available enhancer samples for training. Here, we first adopted a reported deep learning architecture and then developed a novel training strategy named "pretraining-retraining strategy" (PRS) for TSEP by decomposing the whole training process into two successive stages: a pretraining stage is designed to train with the whole enhancer data for performing DEP, and a retraining strategy is then designed to train with tissue-specific enhancer samples based on the trained pretraining model for making TSEP. As a result, PRS is found to be valid for DEP with an AUC of 0.922 and a GM (geometric mean) of 0.696, when testing on a larger-scale FANTOM5 enhancer dataset via a five-fold cross-validation. Interestingly, based on the trained pretraining model, a new finding is that only additional twenty epochs are needed to complete the retraining process on testing 23 specific tissues or cell lines. For TSEP tasks, PRS achieved a mean GM of 0.806 which is significantly higher than 0.528 of gkm-SVM, an existing mainstream method for CRE predictions. Notably, PRS is further proven superior to other two state-of-the-art methods: DEEP and BiRen. In summary, PRS has employed useful ideas from the domain of transfer learning and is a reliable method for TSEPs.
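The abstract above reports a GM (geometric mean) without defining it. A common definition for imbalanced classification is the geometric mean of sensitivity and specificity; the sketch below assumes that definition, which may differ from the one used in the paper, and the function name is hypothetical.

```python
import math


def geometric_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity from confusion-matrix counts.

    Assumes the common definition GM = sqrt(sensitivity * specificity).
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present to compute GM")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return math.sqrt(sensitivity * specificity)


gm = geometric_mean(tp=80, tn=90, fp=10, fn=20)
```

Unlike accuracy, this measure drops toward zero when either class is poorly recognized, which is why it is often preferred when tissue-specific positives are scarce.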

14.
Int J Mol Sci ; 18(2)2017 Feb 16.
Article in English | MEDLINE | ID: mdl-28212312

ABSTRACT

DNA methylation plays a significant role in transcriptional regulation by repressing activity. Change of the DNA methylation level is an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can only assay a small proportion of CpG sites in the human genome, it is urgent to develop reliable computational models for predicting genome-wide DNA methylation. Here, we proposed a novel algorithm that accurately extracted sequence complexity features (seven features) and developed a support-vector-machine-based prediction model with integration of the reported DNA composition features (trinucleotide frequency and GC content, 65 features) by utilizing the methylation profiles of embryonic stem cells in human. The prediction results from 22 human chromosomes with size-varied windows showed that the 600-bp window achieved the best average accuracy of 94.7%. Moreover, comparisons with two existing methods further showed the superiority of our model, and cross-species predictions on mouse data also demonstrated that our model has certain generalization ability. Finally, a statistical test of the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which implies reliable prediction of unassayed CpG sites. Accordingly, we believe that our novel model will be useful and reliable in predicting DNA methylation.


Subjects
Base Composition , DNA Methylation , Epigenomics , Genome, Human , Genome-Wide Association Study , Models, Genetic , Animals , Computational Biology/methods , CpG Islands , Datasets as Topic , Epigenomics/methods , Gene Expression Profiling , Humans , ROC Curve , Reproducibility of Results , Species Specificity
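The 65 DNA composition features named in the abstract above (trinucleotide frequency and GC content) can be sketched as 64 trinucleotide frequencies plus one GC fraction. This is an illustrative reading of the abstract, not the authors' code: the function name, the feature ordering, and the treatment of ambiguous bases are assumptions, and the seven sequence complexity features are not covered.

```python
from itertools import product

# All 64 trinucleotides in a fixed (lexicographic) order; the ordering is an assumption.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def composition_features(window):
    """Return 64 overlapping-trinucleotide frequencies followed by GC content."""
    window = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(window) - 2):
        tri = window[i:i + 3]
        if tri in counts:  # skip windows with ambiguous bases such as N
            counts[tri] += 1
            total += 1
    freqs = [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
    gc = sum(base in "GC" for base in window) / len(window) if window else 0.0
    return freqs + [gc]


features = composition_features("GGGCCC")
```

In the setting described by the abstract, `window` would be the sequence surrounding a CpG site (for example a 600-bp window), and the resulting vector would be passed to a support vector machine.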
15.
Front Plant Sci ; 6: 1027, 2015.
Article in English | MEDLINE | ID: mdl-26640468

ABSTRACT

In the post-GWAS (Genome-Wide Association Scan) era, the interpretation of GWAS results is crucial to screen for highly relevant phenotype-genotype association pairs. Based on the single genotype-phenotype association test and a pathway enrichment analysis, we propose a Metabolite-pathway-based Phenome-Wide Association Scan (M-PheWAS) to analyze the key metabolite-SNP pairs in rice and determine the regulatory relationship by assessing similarities in the changes of enzymes and downstream products in a pathway. Two SNPs, sf0315305925 and sf0315308337, were selected using this approach, and their molecular function and regulatory relationship with Enzyme EC:5.5.1.6 and with flavonoids, a significant downstream regulatory metabolite product, were demonstrated. Moreover, a total of 105 crucial SNPs were screened using M-PheWAS, which may be important for metabolite associations.

16.
BMC Bioinformatics ; 16: 402, 2015 Dec 03.
Article in English | MEDLINE | ID: mdl-26630876

ABSTRACT

BACKGROUND: Nuclear receptors (NRs) form a large family of ligand-inducible transcription factors that regulate gene expressions involved in numerous physiological phenomena, such as embryogenesis, homeostasis, cell growth and death. These nuclear receptors-related pathways are important targets of marketed drugs. Therefore, the design of a reliable computational model for predicting NRs from amino acid sequence has now been a significant biomedical problem. RESULTS: Conjoint triad feature (CTF) mainly considers neighbor relationships in protein sequences by encoding each protein sequence using the triad (continuous three amino acids) frequency distribution extracted from a 7-letter reduced alphabet. In addition, chaos game representation (CGR) can investigate the patterns hidden in protein sequences and visually reveal previously unknown structure. In this paper, three methods, CTF, CGR, amino acid composition (AAC), are applied to formulate the protein samples. By considering different combinations of three methods, we study seven groups of features, and each group is evaluated by the 10-fold cross-validation test. Meanwhile, a new non-redundant dataset containing 474 NR sequences and 500 non-NR sequences is built based on the latest NucleaRDB database. Comparing the results of numerical experiments, the group of combined features with CTF and AAC gets the best result with the accuracy of 96.30% for identifying NRs from non-NRs. Moreover, if it is classified as a NR, it will be further put into the second level, which will classify a NR into one of the eight main subfamilies. At the second level, the group of combined features with CTF and AAC also gets the best accuracy of 94.73%. 
Subsequently, the proposed predictor is compared with two existing methods, and the comparisons show that the accuracies of the two levels increase significantly to 98.79% (NR-2L: 92.56%; iNR-PhysChem: 98.18%; the first level) and 93.71% (NR-2L: 88.68%; iNR-PhysChem: 92.45%; the second level) with the introduction of our CTF-based method. Finally, each component of the CTF features is analyzed via a statistical significance test, and a simplified model with only the resulting top-50 significant features achieves an accuracy of 95.28%. CONCLUSIONS: The experimental results demonstrate that our CTF-based method is an effective way of predicting nuclear receptor proteins. Furthermore, the top-50 significant features obtained from the statistical significance test are considered the "intrinsic features" in predicting NRs based on the analysis of relative importance.


Subjects
Algorithms , Amino Acids/chemistry , Computational Biology/methods , Receptors, Cytoplasmic and Nuclear/metabolism , Databases, Protein , Humans , Receptors, Cytoplasmic and Nuclear/chemistry , Receptors, Cytoplasmic and Nuclear/classification , Support Vector Machine
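The conjoint triad feature (CTF) encoding described in the abstract above can be sketched as follows. The abstract does not list the 7-letter reduced alphabet, so the grouping below is an assumption taken from the grouping commonly used with conjoint triads; the function name and the plain frequency normalization are likewise illustrative choices.

```python
# Seven amino acid classes (assumed; the abstract does not give the grouping).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def conjoint_triad_features(protein):
    """Return the 343-dimensional (7 x 7 x 7) triad frequency vector of a protein.

    Non-standard residues are dropped before triads are formed, so residues on
    either side of a dropped character are treated as neighbors (a simplification).
    """
    classes = [CLASS_OF[aa] for aa in protein.upper() if aa in CLASS_OF]
    counts = [0] * 343
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1  # index of the class triad (a, b, c)
    total = sum(counts)
    return [n / total if total else 0.0 for n in counts]


features = conjoint_triad_features("AGV")
```

Concatenating this vector with a 20-dimensional amino acid composition vector would give the kind of combined CTF and AAC feature group that the abstract reports as performing best.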
17.
Biomed Mater Eng ; 26 Suppl 1: S1829-36, 2015.
Article in English | MEDLINE | ID: mdl-26405954

ABSTRACT

G-protein-coupled receptors (GPCRs) are seven membrane-spanning proteins and regulate many important physiological processes, such as vision, neurotransmission, immune response and so on. GPCRs-related pathways are the targets of a large number of marketed drugs. Therefore, the design of a reliable computational model for predicting GPCRs from amino acid sequence has long been a significant biomedical problem. Chaos game representation (CGR) reveals the fractal patterns hidden in protein sequences, and then fractal dimension (FD) is an important feature of these highly irregular geometries with concise mathematical expression. Here, in order to extract important features from GPCR protein sequences, CGR algorithm, fractal dimension and amino acid composition (AAC) are employed to formulate the numerical features of protein samples. Four groups of features are considered, and each group is evaluated by support vector machine (SVM) and 10-fold cross-validation test. To test the performance of the present method, a new non-redundant dataset was built based on latest GPCRDB database. Comparing the results of numerical experiments, the group of combined features with AAC and FD gets the best result, the accuracy is 99.22% and Matthew's correlation coefficient (MCC) is 0.9845 for identifying GPCRs from non-GPCRs. Moreover, if it is classified as a GPCR, it will be further put into the second level, which will classify a GPCR into one of the five main subfamilies. At this level, the group of combined features with AAC and FD also gets best accuracy 85.73%. Finally, the proposed predictor is also compared with existing methods and shows better performances.


Subjects
Fractals , Pattern Recognition, Automated/methods , Receptors, G-Protein-Coupled/chemistry , Receptors, G-Protein-Coupled/metabolism , Sequence Analysis, Protein/methods , Support Vector Machine , Algorithms , Amino Acid Sequence , Data Mining/methods , Databases, Protein , Molecular Sequence Data , Sequence Alignment/methods , Structure-Activity Relationship
18.
J Theor Biol ; 343: 186-92, 2014 Feb 21.
Article in English | MEDLINE | ID: mdl-24189096

ABSTRACT

DNA-binding proteins play a vitally important role in many biological processes. Prediction of DNA-binding proteins from amino acid sequence is a significant but not fairly resolved scientific problem. Chaos game representation (CGR) investigates the patterns hidden in protein sequences, and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools to measure sizes of complex, highly irregular geometric objects. In order to extract the intrinsic correlation with DNA-binding property from protein sequences, CGR algorithm, fractal dimension and amino acid composition are applied to formulate the numerical features of protein samples in this paper. Seven groups of features are extracted, which can be computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test and Jackknife test. Comparing the results of numerical experiments, the group of amino acid composition and fractal dimension (21-dimension vector) gets the best result, the average accuracy is 81.82% and average Matthew's correlation coefficient (MCC) is 0.6017. This resulting predictor is also compared with existing method DNA-Prot and shows better performances.


Subjects
Computational Biology/methods , DNA-Binding Proteins/metabolism , Fractals , Support Vector Machine , DNA-Binding Proteins/chemistry , Databases, Protein , Models, Molecular , Nonlinear Dynamics , Protein Structure, Tertiary , Regression Analysis , Reproducibility of Results
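Matthew's correlation coefficient (MCC), reported in the abstract above alongside accuracy, is computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are choices made here for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty (a convention).
    return numerator / denominator if denominator else 0.0


mcc = matthews_corrcoef(tp=40, tn=30, fp=20, fn=10)
```

MCC ranges from -1 to 1 and uses all four counts, which makes it a stricter summary than accuracy when the two classes differ in size.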
19.
Protein Pept Lett ; 19(9): 940-8, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22486614

ABSTRACT

Obtaining soluble proteins in sufficient concentrations is a major obstacle in various experimental studies. How to predict the propensity of targets in large-scale proteomics projects to be soluble is a significant but not fairly resolved scientific problem. Chaos game representation (CGR) can investigate the patterns hiding in protein sequences, and can visually reveal previously unknown structure. Fractal dimensions are good tools to measure sizes of complex, highly irregular geometric objects. In this paper, we convert each protein sequence into a high-dimensional vector by CGR algorithm and fractal dimension, and then predict protein solubility by these fractal features together with Chou's pseudo amino acid composition features and support vector machine (SVM). We extract and study six groups of features computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test. As the results of comparisons, the group of 445-dimensional vector gets the best results, the average accuracy is 0.8741 and average MCC is 0.7358. The resulting predictor is also compared with existing methods and shows significant improvement.


Subjects
Fractals , Nonlinear Dynamics , Proteins/chemistry , Amino Acids/chemistry , Models, Chemical , Solubility , Support Vector Machine
20.
J Theor Biol ; 293: 74-81, 2012 Jan 21.
Article in English | MEDLINE | ID: mdl-22001320

ABSTRACT

Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in helping design stable proteins. How to predict whether a DNA sequence is thermophilic is a long-standing but not yet fully resolved problem. Chaos game representation (CGR) can investigate the patterns hidden in DNA sequences, and can visually reveal previously unknown structure. Fractal dimensions are good tools to measure sizes of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high-dimensional vector by the CGR algorithm and fractal dimension, and then predict the DNA sequence thermostability using these fractal features and a support vector machine (SVM). We have conducted experiments on three groups: 17-dimensional vector, 65-dimensional vector, and 257-dimensional vector. Each group is evaluated by the 10-fold cross-validation test. The group of 257-dimensional vectors gets the best results: the average accuracy is 0.9456 and the average MCC is 0.8878. The results are also compared with previous work using single CGR features. The comparison shows the high effectiveness of the new hybrid fractal algorithm.


Subjects
Algorithms , DNA, Bacterial/genetics , Hot Temperature , Sequence Analysis, DNA/methods , Base Sequence , Databases, Nucleic Acid , Fractals , Models, Genetic , Nonlinear Dynamics , Support Vector Machine
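The chaos game representation (CGR) used in the abstract above can be sketched for DNA as follows: each base pulls the current point halfway toward its corner of the unit square, and the visited points are then binned on a grid. The corner assignment, the function names, and the grid binning are assumptions made for illustration; the abstract does not specify them, and the fractal dimension step is not shown.

```python
# One common corner assignment for the unit square (an assumption).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence, starting from the square's center."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous bases such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway toward the corner
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points in a 2^k x 2^k grid laid over the unit square."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for x, y in cgr_points(seq):
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        grid[row][col] += 1
    return grid


grid = cgr_grid("ACGT", 1)
```

Flattening a 2^k x 2^k grid gives 4^k numbers, which could serve as one block of a feature vector; how the paper assembles its 17-, 65-, and 257-dimensional vectors is not stated in the abstract.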