Results 1 - 20 of 22
1.
Brief Bioinform ; 23(5), 2022 09 20.
Article in English | MEDLINE | ID: mdl-36070864

ABSTRACT

The location of microRNAs (miRNAs) in cells determines their function in regulatory activity. Studies have shown that miRNAs are stable in the extracellular environment that mediates cell-to-cell communication and are located in the intracellular region that responds to cellular stress and environmental stimuli. Though in situ detection techniques for miRNAs have made great contributions to the study of the localization and distribution of miRNAs, research on miRNA subcellular localization and its role is still in progress. Recently, some machine learning-based algorithms have been designed for miRNA subcellular location prediction, but their performance is still far from satisfactory. Here, we present a new data partitioning strategy that categorizes functionally similar locations for the precise and instructive prediction of miRNA subcellular location in Homo sapiens. To characterize the localization signals, we adopted one-hot encoding with post padding to represent the whole miRNA sequences, and proposed a deep bidirectional long short-term memory network with multi-head self-attention to model them. The algorithm showed high selectivity in distinguishing extracellular miRNAs from intracellular miRNAs. Moreover, a series of motif analyses were performed to explore the mechanism of miRNA subcellular localization. To improve the convenience of the model, a user-friendly web server named iLoc-miRNA was established (http://iLoc-miRNA.lin-group.cn/).


Subjects
Computational Biology , MicroRNAs , Algorithms , Computational Biology/methods , Humans , Machine Learning , MicroRNAs/genetics
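
The one-hot encoding with post padding mentioned in the abstract above can be illustrated with a minimal sketch. The function name, the four-letter RNA alphabet, the handling of unknown symbols, and the `max_len` value are assumptions made for illustration; the abstract does not specify them.

```python
# Assumed alphabet; the abstract does not state how ambiguous bases are handled.
ALPHABET = "ACGU"

def one_hot_post_pad(seq, max_len):
    """Return a max_len x 4 matrix; rows after the sequence end stay all-zero."""
    seq = seq.upper().replace("T", "U")[:max_len]  # truncate overlong input
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for i, base in enumerate(seq):
        j = ALPHABET.find(base)
        if j >= 0:  # unknown symbols (e.g. N) are left as zero rows
            matrix[i][j] = 1
    return matrix

# Hypothetical 8-nt fragment padded to length 10.
encoded = one_hot_post_pad("UGAGGUAG", max_len=10)
```

Padding after the sequence (rather than before it) keeps the first nucleotide at a fixed position for every input, which is one plausible reason to prefer it for short RNAs.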
2.
Brief Bioinform ; 22(1): 526-535, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31994694

ABSTRACT

Messenger RNAs (mRNAs) shoulder the special responsibility of transmitting the genetic code from DNA to discrete locations in the cytoplasm. The localization process of mRNA might provide spatial and temporal regulation of mRNA and protein functions. In situ hybridization and quantitative transcriptomics analysis can provide detailed information about mRNA subcellular localization; however, they are time-consuming and expensive. It is highly desirable to develop computational tools for timely and effective prediction of mRNA subcellular location. In this work, by using binomial distribution and one-way analysis of variance, the optimal nonamer composition was obtained to represent mRNA sequences. Subsequently, a predictor based on support vector machine was developed to identify the mRNA subcellular localization. In 5-fold cross-validation, results showed that the accuracy was 90.12% for Homo sapiens (H. sapiens). The predictor may provide a reference for the study of mRNA localization mechanisms and mRNA translocation strategies. An online web server was established based on our models, which is available at http://lin-group.cn/server/iLoc-mRNA/.


Subjects
Computational Biology/methods , RNA Transport , Messenger RNA/metabolism , Humans , Messenger RNA/chemistry , RNA Sequence Analysis/methods , Software
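
A nonamer composition is a k-mer frequency profile with k = 9. The sketch below shows one simple way to compute such a profile with overlapping windows; the function name and the normalization by window count are assumptions, and the feature selection step from the abstract is not reproduced here.

```python
from collections import Counter

def kmer_composition(seq, k=9):
    """Map each k-mer in seq to its relative frequency (overlapping windows)."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}  # sequence shorter than k has no k-mers
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

# Toy sequence with k = 3 so the result is easy to inspect by hand.
freqs = kmer_composition("AUGGCUAUGGCUAUGGCU", k=3)
```

With k = 9 there are 4^9 = 262,144 possible nonamers, which is why a selection step such as the one the abstract describes is needed before training a classifier.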
3.
Methods ; 208: 42-47, 2022 12.
Article in English | MEDLINE | ID: mdl-36341922

ABSTRACT

Adaptor proteins play a crucially important role in regulating lymphocyte activation. Rapid and efficient identification of adaptor proteins is essential for understanding their functions. However, biochemical methods involve not only high experimental costs, but also long experimental cycles and more personnel. Therefore, a computational method that can accurately identify adaptor proteins is urgently needed. To solve this issue, we developed a classifier that combined the support vector machine (SVM) with the composition of k-spaced amino acid pairs (CKSAAP) and the amino acid composition (AAC) to identify adaptor proteins. Analysis of variance (ANOVA) was used to select the optimized features which could generate the maximum prediction performance. By examining the proposed model on independent data, we found that the 447 optimized features could achieve an accuracy of 92.39% with an AUC of 0.9766, demonstrating the powerful capabilities of our model. We hope that the proposed model could provide more clues for studying adaptor proteins.


Subjects
Computational Biology , Support Vector Machine , Computational Biology/methods , Amino Acids/metabolism , Analysis of Variance
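
The CKSAAP descriptor named in the abstract counts residue pairs separated by a fixed gap. The following is a minimal sketch of the usual formulation, assuming the 20 standard amino acids, gaps from 0 to `max_gap`, and normalization by the number of windows per gap; the function name and the choice `max_gap=2` are illustrative, not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def cksaap(seq, max_gap=2):
    """Return 400 * (max_gap + 1) pair frequencies, one block per gap size."""
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        windows = len(seq) - gap - 1  # number of (i, i + gap + 1) position pairs
        for i in range(max(windows, 0)):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip non-standard residues
                counts[pair] += 1
        features.extend(counts[p] / windows if windows > 0 else 0.0 for p in PAIRS)
    return features

# Hypothetical 10-residue peptide.
vec = cksaap("MKVLAAGIVK", max_gap=2)
```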
4.
Bioinformatics ; 35(9): 1469-1477, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30247625

ABSTRACT

MOTIVATION: Transcription termination is an important regulatory step of gene expression. If a gene has no terminator, transcription cannot stop, which results in abnormal gene expression. Detecting such terminators can determine the operon structure in bacterial organisms and improve genome annotation. Thus, accurate identification of transcriptional terminators is essential and extremely important in research on transcriptional regulation. RESULTS: In this study, we developed a new predictor called 'iTerm-PseKNC' based on support vector machine to identify transcription terminators. The binomial distribution approach was used to pick out the optimal feature subset derived from pseudo k-tuple nucleotide composition (PseKNC). The 5-fold cross-validation test results showed that our proposed method achieved an accuracy of 95%. To further evaluate the generalization ability of 'iTerm-PseKNC', the model was examined on independent datasets of experimentally confirmed Rho-independent terminators in Escherichia coli and Bacillus subtilis genomes. As a result, all the terminators in E. coli and 87.5% of the terminators in B. subtilis were correctly identified, suggesting that the proposed model could become a powerful tool for bacterial terminator recognition. AVAILABILITY AND IMPLEMENTATION: For the convenience of most wet-lab researchers, the web server for 'iTerm-PseKNC' was established at http://lin-group.cn/server/iTerm-PseKNC/, by which users can easily obtain their desired results without the need to go through the detailed mathematical equations involved.


Subjects
Genetic Transcription , Bacillus subtilis , Escherichia coli , Nucleotides , Operon , Software
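
One common way to use the binomial distribution for feature ranking is to ask how unlikely it is that a k-tuple appears in one class as often as observed purely by chance. The sketch below assumes that formulation; the symbols `n_ij` (occurrences of feature i in class j), `n_i` (occurrences of feature i in all classes) and `q_j` (share of all k-tuple occurrences belonging to class j) are illustrative names, and the abstract does not confirm that this is the exact scoring used.

```python
from math import comb

def binomial_confidence(n_ij, n_i, q_j):
    """Return 1 - P(X >= n_ij) for X ~ Binomial(n_i, q_j).

    A value close to 1 means the feature is over-represented in class j
    beyond what random placement would explain.
    """
    tail = sum(
        comb(n_i, m) * q_j ** m * (1 - q_j) ** (n_i - m)
        for m in range(n_ij, n_i + 1)
    )
    return 1.0 - tail

# Hypothetical counts: a k-tuple seen 10 times overall, 8 of them in one class
# that holds half of all k-tuple occurrences.
score = binomial_confidence(8, 10, 0.5)
```

Features can then be sorted by this confidence value, and the top-ranked subset passed to the classifier.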
5.
Bioinformatics ; 34(24): 4196-4204, 2018 12 15.
Article in English | MEDLINE | ID: mdl-29931187

ABSTRACT

Motivation: Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. They have important functions in cell development and metabolism, such as genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation. Their functions are generally closely related to their localization in the cell. Therefore, knowledge about their subcellular locations can provide very useful clues or preliminary insight into their biological functions. Although biochemical experiments could determine the localization of lncRNAs in a cell, they are both time-consuming and expensive. Therefore, it is highly desirable to develop bioinformatics tools for fast and effective identification of their subcellular locations. Results: We developed a sequence-based bioinformatics tool called 'iLoc-lncRNA' to predict the subcellular locations of lncRNAs by incorporating the 8-tuple nucleotide features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous jackknife tests have shown that the overall accuracy achieved by the new predictor on a stringent benchmark dataset is 86.72%, which is over 20% higher than that of the existing state-of-the-art predictor evaluated on the same tests. Availability and implementation: A user-friendly web server has been established at http://lin-group.cn/server/iLoc-LncRNA, by which users can easily obtain their desired results. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Long Noncoding RNA/genetics , Software , Nucleotides
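
A jackknife test, as referred to in the abstract, is leave-one-out validation: each sample is predicted by a model trained on all the others. The sketch below shows the loop with a toy nearest-neighbour classifier standing in for the real predictor; the function names, the one-dimensional features, and the location labels are all hypothetical.

```python
def jackknife_accuracy(samples, labels, fit_predict):
    """Leave each sample out once, predict it from the rest, return accuracy."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_neighbour(train_x, train_y, query):
    # Toy stand-in classifier: label of the closest 1-D training point.
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_y[best]

acc = jackknife_accuracy(
    [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],
    ["nucleus"] * 3 + ["cytosol"] * 3,
    nearest_neighbour,
)
```

Because every sample is tested exactly once and there is no random split, the jackknife result for a given dataset and model is deterministic, at the cost of training the model once per sample.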
6.
Int J Biol Macromol ; 265(Pt 1): 130659, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38462114

ABSTRACT

Understanding the subcellular localization of lncRNAs is crucial for comprehending their regulatory activities. The conventional detection of lncRNA subcellular location usually uses in situ detection techniques, which are resource intensive. Some machine learning-based algorithms have been proposed for lncRNA subcellular location prediction in mammals. However, due to the low level of conservation of lncRNA sequences, the performance of cross-species models remains unsatisfactory. In this study, we curated a novel dataset containing subcellular location information of lncRNAs in Homo sapiens. Subsequently, based on the BERT pre-trained language algorithm, we developed a model for lncRNA subcellular location prediction. Our model achieved a micro-average area under the receiver operating characteristic curve (AUROC) of 0.791 on the training set and an AUROC of 0.700 on the testing nucleus set. Additionally, we conducted cross-species validation and motif discovery to further investigate underlying patterns. In summary, our study provides valuable guidance and computational analysis tools for exploring the mechanisms of lncRNA subcellular localization and the dynamic spatial changes of RNA in abnormal physiological states.


Subjects
Long Noncoding RNA , Animals , Humans , Long Noncoding RNA/genetics , Algorithms , Machine Learning , Computational Biology/methods , Mammals/genetics
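
A micro-average AUROC, the metric reported in the abstract, pools every (sample, class) decision into one binary problem before computing the area under the curve. The sketch below uses the rank-based (Mann-Whitney) form of AUROC; the function names, class names, and scores are made up for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def micro_auroc(true_classes, class_scores, classes):
    """Flatten one-vs-rest labels and scores over all classes, then score once."""
    flat_labels, flat_scores = [], []
    for true, scores in zip(true_classes, class_scores):
        for c in classes:
            flat_labels.append(1 if true == c else 0)
            flat_scores.append(scores[c])
    return auroc(flat_labels, flat_scores)

# Hypothetical predictions for three transcripts and two locations.
classes = ["nucleus", "cytoplasm"]
truth = ["nucleus", "cytoplasm", "nucleus"]
predicted = [
    {"nucleus": 0.8, "cytoplasm": 0.2},
    {"nucleus": 0.4, "cytoplasm": 0.6},
    {"nucleus": 0.3, "cytoplasm": 0.7},
]
value = micro_auroc(truth, predicted, classes)
```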
7.
IET Syst Biol ; 2024 Mar 26.
Article in English | MEDLINE | ID: mdl-38530028

ABSTRACT

Pancreatic ductal adenocarcinoma (PDAC) accounts for 95% of all pancreatic cancer cases, posing grave challenges to its diagnosis and treatment. Timely diagnosis is pivotal for improving patient survival, necessitating the discovery of precise biomarkers. An innovative approach was introduced to identify gene markers for precision PDAC detection. The core idea of the method is to discover gene pairs that display consistent opposite relative expression and differential co-expression patterns between PDAC and normal samples. Reversal gene pair analysis and differential partial correlation analysis were performed to determine reversal differential partial correlation (RDC) gene pairs. Using incremental feature selection, the authors refined the selected gene set and constructed a machine-learning model for PDAC recognition. The approach identified 10 RDC gene pairs, and the model achieved an accuracy of 96.1% during cross-validation, surpassing gene expression-based models. The experiment on independent validation data confirmed the model's performance. Enrichment analysis revealed the involvement of these genes in essential biological processes and shed light on their potential roles in PDAC pathogenesis. Overall, the findings highlight the potential of these 10 RDC gene pairs as effective diagnostic markers for early PDAC detection, bringing hope for improving patient prognosis and survival.
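
The reversal gene pair idea in the abstract above can be sketched as follows: a pair (a, b) is kept when a is expressed above b in nearly all samples of one group and below b in nearly all samples of the other. This covers only the reversal step, not the differential partial correlation analysis. The function names, the 0.9 threshold, and the gene identifiers are assumptions for illustration.

```python
from itertools import combinations

def fraction_greater(samples, a, b):
    """Fraction of samples in which gene a is expressed above gene b."""
    return sum(s[a] > s[b] for s in samples) / len(samples)

def reversal_pairs(normal, tumour, genes, threshold=0.9):
    """Return gene pairs whose expression ordering flips between the two groups."""
    pairs = []
    for a, b in combinations(genes, 2):
        fn = fraction_greater(normal, a, b)
        ft = fraction_greater(tumour, a, b)
        # Ties count as "not greater", so this is slightly stricter for a > b.
        flips = (fn >= threshold and ft <= 1 - threshold) or (
            fn <= 1 - threshold and ft >= threshold
        )
        if flips:
            pairs.append((a, b))
    return pairs

# Hypothetical expression values for three placeholder genes.
normal = [{"G1": 5, "G2": 2, "G3": 3}, {"G1": 6, "G2": 1, "G3": 4}]
tumour = [{"G1": 1, "G2": 4, "G3": 3}, {"G1": 2, "G2": 5, "G3": 6}]
found = reversal_pairs(normal, tumour, ["G1", "G2", "G3"])
```

Because only the within-sample ordering of two genes is used, the resulting features do not depend on between-sample normalization.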

8.
Front Microbiol ; 14: 1170785, 2023.
Article in English | MEDLINE | ID: mdl-37125199

ABSTRACT

Promoters are genomic regions upstream of genes that are bound by RNA polymerase to start gene transcription. Because they are the most critical elements of gene expression, the recognition of promoters is crucial for understanding the regulation of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. In the model, promoter sequences were encoded by three different kinds of feature descriptors, namely, accumulated nucleotide frequency, k-mer nucleotide composition, and binary encodings. The obtained features were optimized by using correlation and the mRMR-based algorithm. These optimized features were inputted into a random forest (RF) classifier to discriminate promoter sequences from non-promoter sequences in A. tumefaciens strain C58. The examination of 10-fold cross-validation showed that the proposed model could yield an overall accuracy of 0.837. This model will provide help for the study of promoters in A. tumefaciens strain C58.
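
Accumulated nucleotide frequency is often defined position by position: for position i, it is the fraction of the first i nucleotides that equal the nucleotide at position i. The sketch below assumes that definition, which the abstract itself does not spell out; the function name and the example hexamer are illustrative.

```python
def accumulated_nucleotide_frequency(seq):
    """For each position i (1-based), density of seq[i] within the prefix seq[:i]."""
    seq = seq.upper()
    seen = {}
    densities = []
    for i, base in enumerate(seq, start=1):
        seen[base] = seen.get(base, 0) + 1
        densities.append(seen[base] / i)
    return densities

# Hypothetical hexamer; the result is one value per position.
anf = accumulated_nucleotide_frequency("TATAAT")
```

Unlike a plain k-mer count, this descriptor keeps positional information, since the same nucleotide yields different values depending on where it occurs.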

9.
Int J Biol Macromol ; 228: 706-714, 2023 Feb 15.
Article in English | MEDLINE | ID: mdl-36584777

ABSTRACT

CRISPR-Cas, as a tool for gene editing, has received extensive attention in recent years. Anti-CRISPR (Acr) proteins can inactivate the CRISPR-Cas defense system during the interference phase, and can be used as a potential tool for the regulation of gene editing. In-depth study of Anti-CRISPR proteins is of great significance for the implementation of gene editing. In this study, we developed a high-accuracy prediction model based on a two-step model fusion strategy, called AcrPred, which could produce an AUC of 0.952 with independent dataset validation. To further validate the proposed model, we compared it with published tools and correctly identified 9 of 10 new Acr proteins, indicating the strong generalization ability of our model. Finally, for the convenience of wet-lab researchers, a user-friendly web server AcrPred (Anti-CRISPR proteins Prediction) was established at http://lin-group.cn/server/AcrPred, by which users can easily identify potential Anti-CRISPR proteins.


Subjects
CRISPR-Cas Systems , Gene Editing , CRISPR-Cas Systems/genetics , Algorithms , Machine Learning , Viral Proteins/genetics
10.
Comput Struct Biotechnol J ; 21: 2253-2261, 2023.
Article in English | MEDLINE | ID: mdl-37035551

ABSTRACT

Hormone binding proteins (HBPs) belong to the group of soluble carrier proteins. These proteins selectively and non-covalently interact with hormones and promote growth hormone signaling in humans and other animals. HBPs are useful in many medical and commercial fields. Thus, the identification of HBPs is very important because it can help to discover more details about hormone binding proteins. However, experimental methods for recognizing hormone binding proteins are time-consuming and expensive. Computational prediction methods have played significant roles in the correct recognition of hormone binding proteins with the use of sequence information and machine learning (ML) algorithms. In this review, we compared and assessed the implementation of ML-based tools for the recognition of HBPs. We hope that this study will provide sufficient awareness and knowledge for research on HBPs.

11.
Comput Math Methods Med ; 2022: 7493834, 2022.
Article in English | MEDLINE | ID: mdl-35069791

ABSTRACT

Helicobacter pylori (H. pylori) is the most common risk factor for gastric cancer worldwide. The membrane proteins of H. pylori are involved in bacterial adherence and play a vital role in the field of drug discovery. Thus, an accurate and cost-effective computational model is needed to predict the uncharacterized membrane proteins of H. pylori. In this study, a reliable benchmark dataset consisting of 114 membrane and 219 nonmembrane proteins was constructed based on UniProt. A support vector machine (SVM)-based model was developed for discriminating H. pylori membrane proteins from nonmembrane proteins by using sequence information. Cross-validation showed that our method achieved good performance with an accuracy of 91.29%. It is anticipated that the proposed model will be useful for the annotation of H. pylori membrane proteins and the development of new anti-H. pylori agents.


Subjects
Bacterial Proteins/genetics , Helicobacter pylori/genetics , Membrane Proteins/genetics , Amino Acid Sequence , Amino Acids/analysis , Bacterial Proteins/chemistry , Computational Biology , Protein Databases/statistics & numerical data , Helicobacter pylori/chemistry , Helicobacter pylori/pathogenicity , Host Microbial Interactions , Humans , Membrane Proteins/chemistry , Support Vector Machine
12.
Front Microbiol ; 13: 790063, 2022.
Article in English | MEDLINE | ID: mdl-35273581

ABSTRACT

Thermophilic proteins have important application value in biotechnology and industrial processes. The correct identification of thermophilic proteins provides important information for the application of these proteins in engineering. Biochemical methods for identifying thermophilic proteins are laborious, time-consuming, and costly. Therefore, there is an urgent need for a fast and accurate method to identify thermophilic proteins. Considering this urgency, we constructed a reliable benchmark dataset containing 1,368 thermophilic and 1,443 non-thermophilic proteins. A multi-layer perceptron (MLP) model based on a multi-feature fusion strategy was proposed to discriminate thermophilic proteins from non-thermophilic proteins. On an independent dataset, the proposed model could achieve an accuracy of 96.26%, which demonstrates that the model has good application prospects. To make the model convenient to use, a user-friendly software package called iThermo was established and can be freely accessed at http://lin-group.cn/server/iThermo/index.html. The high accuracy of the model and the practicability of the developed software package indicate that this study can accelerate the discovery and engineering application of thermally stable proteins.

13.
Front Biosci (Landmark Ed) ; 27(3): 84, 2022 03 05.
Article in English | MEDLINE | ID: mdl-35345316

ABSTRACT

BACKGROUND: Lipocalins belong to the calycin family, and their sequence length is generally between 165 and 200 residues. They are mainly stable and multifunctional extracellular proteins. Lipocalins play an important role in several stress responses and allergic inflammations. Because the accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize lipocalins. METHODS: In this study, we constructed a prediction model to identify lipocalins. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized by using feature selection techniques. A classifier based on random forest was trained according to the optimal features. RESULTS: The results of 10-fold cross-validation showed that our computational model could classify lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, our computational model produced an accuracy of 89.90%, which was 4.17% higher than that of the existing model. CONCLUSIONS: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the best features which could produce the maximum accuracy. On the basis of the best feature subset, the RF-based classifier obtained the best prediction results.


Subjects
Artificial Intelligence , Lipocalins , Amino Acids , Computational Biology , Lipocalins/chemistry , Machine Learning , Proteins/chemistry
14.
Curr Med Chem ; 29(5): 789-806, 2022.
Article in English | MEDLINE | ID: mdl-34514982

ABSTRACT

Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5'-triphosphate (ATP) is one such ligand that plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions and signaling. Knowing the ATP binding residues of proteins is helpful for annotation of protein function and drug design. However, due to the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is cost-ineffective and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarized the application of machine learning methods in detecting ATP binding residues of proteins. We expect this review will be helpful for further research.


Subjects
Computational Biology , Proteins , Adenosine Triphosphate/metabolism , Amino Acid Sequence , Binding Sites , Computational Biology/methods , Protein Databases , Humans , Machine Learning , Protein Binding , Proteins/metabolism
15.
Comput Struct Biotechnol J ; 20: 4942-4951, 2022.
Article in English | MEDLINE | ID: mdl-36147670

ABSTRACT

Ion binding proteins (IBPs) can selectively and non-covalently interact with ions. IBPs in phages also play an important role in biological processes. Therefore, accurate identification of IBPs is necessary for understanding their biological functions and molecular mechanisms that involve binding to ions. Since molecular biology experimental methods are still labor-intensive and cost-ineffective in identifying IBPs, it is helpful to develop computational methods to identify IBPs quickly and efficiently. In this work, a random forest (RF)-based model was constructed to quickly identify IBPs. Based on the protein sequence information and residues' physicochemical properties, the dipeptide composition combined with the physicochemical correlation between two residues was proposed for the extraction of features. A feature selection technique called analysis of variance (ANOVA) was used to exclude redundant information. By comparing with other classification methods, we demonstrated that our method could identify IBPs accurately. Based on the model, a Python package named IBPred was built with the source code which can be accessed at https://github.com/ShishiYuan/IBPred.
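
ANOVA-based feature selection scores each feature by the ratio of between-class variance to within-class variance (the F statistic) and keeps the highest-scoring ones. The sketch below computes the one-way F value for a single feature; it is a generic textbook formulation with a hypothetical function name and toy values, not code from the IBPred package.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature; groups = list of value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Mean square between classes (k - 1 degrees of freedom).
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    # Mean square within classes (n - k degrees of freedom).
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    return between / within if within > 0 else float("inf")

# Toy feature values for two classes (e.g. binders vs non-binders).
f_value = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

A larger F value means the feature separates the classes better relative to its spread inside each class, so features are typically ranked in descending order of F.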

16.
Comput Math Methods Med ; 2021: 6664362, 2021.
Article in English | MEDLINE | ID: mdl-33505515

ABSTRACT

Bioluminescent proteins (BLPs) are a class of proteins that are widely distributed in many living organisms with various mechanisms of light emission, including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods of cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging as they share poor sequence similarity. In this paper, we briefly reviewed the development of the computational identification of BLPs and subsequently proposed a novel prediction framework for identifying BLPs based on the eXtreme gradient boosting algorithm (XGBoost) and sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryotes, and archaea. Then, to obtain more effective prediction models, we examined the performance of different feature extraction methods and their combinations as well as classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP has been demonstrated by experiments on training and independent datasets. Comparison with other published methods further demonstrated that the proposed method is powerful and could provide good performance for BLP identification. The web server and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.


Subjects
Algorithms , Luminescent Proteins , Amino Acid Sequence , Chemical Phenomena , Computational Biology , Protein Databases , Drug Discovery , Luminescence , Luminescent Proteins/chemistry , Luminescent Proteins/genetics , Luminescent Proteins/metabolism , Machine Learning , Software
17.
Cancer Manag Res ; 13: 3229-3234, 2021.
Article in English | MEDLINE | ID: mdl-33880065

ABSTRACT

PURPOSE: Intensity-modulated radiotherapy (IMRT) can improve the prognosis of patients with esophageal cancer. This study aimed to evaluate clinical factors relevant to the prognosis of patients with esophageal cancer who received IMRT alone. PATIENTS AND METHODS: Data of 103 patients with pathologically confirmed esophageal cancer who were admitted to our hospital between October 2011 and November 2017 were retrospectively reviewed. All patients had squamous cell carcinoma. All patients received IMRT. Patients with stage I-IVA tumors were included to represent real-world clinical practice. We performed univariate and multivariate analyses to identify prognostic factors for overall survival (OS) and progression-free survival (PFS). In univariate analyses, the Kaplan-Meier method was used to estimate OS and PFS for various subgroups. In multivariate analyses, hazard ratios were calculated. RESULTS: Single-factor analysis revealed that T stage (P = 0.019), N stage (P = 0.047), and lesion length (P = 0.000) were associated with the prognosis of esophageal cancer patients who received IMRT. Cox regression analysis revealed that T stage (odds ratio [OR] = 4.68; P < 0.05), N stage (OR = 0.28; P < 0.05), and lesion length (OR = 0.09; P < 0.05) were independent factors relevant to prognosis. CONCLUSION: T stage, N stage, and lesion length influenced the long-term curative effects of IMRT for esophageal cancer and were prognostic factors for patients with esophageal cancer receiving definitive radiotherapy alone. The higher the stage and the longer the tumor, the lower the survival rate.
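
The Kaplan-Meier method named in the abstract above estimates a survival curve from follow-up times that may be censored. The sketch below is a minimal product-limit estimator; the function name and the follow-up times (in months) are invented for illustration and are not data from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate. events: 1 = event observed, 0 = censored.

    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored subjects leave the risk set without an event
    return curve

# Hypothetical cohort of six patients.
curve = kaplan_meier([6, 12, 12, 18, 24, 30], [1, 1, 0, 1, 0, 1])
```

Censored patients contribute to the risk set up to their last follow-up and then drop out, which is what distinguishes this estimate from a simple fraction of survivors.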

18.
Transl Lung Cancer Res ; 10(5): 2172-2192, 2021 May.
Article in English | MEDLINE | ID: mdl-34164268

ABSTRACT

BACKGROUND: In recent years, immunotherapy has made great progress, and the regulatory role of epigenetics has been verified. However, the role of 5-methylcytosine (m5C) in the tumor microenvironment (TME) and immunotherapy response remains unclear. METHODS: Based on 11 m5C regulators, we evaluated the m5C modification patterns of 572 lung adenocarcinoma (LUAD) patients. The m5C score was constructed by principal component analysis (PCA) algorithms in order to quantify the m5C modification pattern of individual LUAD patients. RESULTS: Two m5C methylation modification patterns were identified according to 11 m5C regulators. The two patterns had a remarkably distinct TME immune cell infiltration characterization. Next, 226 differentially expressed genes (DEGs) related to the m5C phenotype were screened. Patients were divided into three different gene cluster subtypes based on these genes, which had different TME immune cell infiltration and prognosis characteristics. The m5C score was constructed to quantify the m5C modification pattern of individual LUAD patients. We found that the high m5C score group had a better prognosis. The role of the m5C score in predicting prognosis was also verified in the dataset GSE31210. CONCLUSIONS: Our study revealed that m5C modification played a significant role in TME regulation of LUAD. Investigation of the m5C regulation mode may have some implications for tumor immunotherapy in the future.

19.
Article in English | MEDLINE | ID: mdl-32292778

ABSTRACT

Hepatocellular carcinoma (HCC) is a serious cancer that ranks fourth among causes of cancer-related death worldwide. Hence, more accurate diagnostic models are urgently needed to aid early HCC diagnosis in clinical scenarios and thus improve HCC treatment and survival. Several conventional methods have been used for discriminating HCC from cirrhosis tissues in patients without HCC (CwoHCC). However, the recognition success rates are still far from satisfactory. In this study, we applied a computational approach based on machine learning to a set of microarray data generated from 1091 HCC samples and 242 CwoHCC samples. The within-sample relative expression orderings (REOs) method was used to extract numerical descriptors from gene expression profile datasets. After removing the unrelated features by using minimum redundancy maximum relevance (mRMR) with incremental feature selection, we obtained an "11-gene-pair" signature which could produce outstanding results. We further investigated the discriminative capability of the "11-gene-pair" signature for HCC recognition on several independent datasets. Excellent results were obtained, demonstrating that the selected gene pairs can serve as a signature for HCC. The proposed computational model can discriminate HCC and adjacent non-cancerous tissues from CwoHCC even for minimum biopsy specimens and inaccurately sampled specimens, which can be practical and effective for aiding early HCC diagnosis at the individual level.
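
Incremental feature selection, as mentioned in the abstract above, takes a ranked feature list (for example from mRMR), evaluates nested subsets of growing size, and keeps the subset with the best score. The sketch below shows that loop with the evaluation function passed in by the caller; the function name, feature names, and scores are placeholders, not results from the study.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the top-1, top-2, ... subsets and return the best one with its score."""
    best_score, best_subset = float("-inf"), []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        score = evaluate(subset)  # e.g. cross-validated accuracy of a classifier
        if score > best_score:  # strict comparison keeps the smallest best subset
            best_score, best_subset = score, subset
    return best_subset, best_score

# Hypothetical ranked gene pairs and made-up accuracies per subset size.
ranked = ["pair_a", "pair_b", "pair_c", "pair_d"]
toy_scores = {1: 0.80, 2: 0.91, 3: 0.95, 4: 0.93}
subset, score = incremental_feature_selection(ranked, lambda s: toy_scores[len(s)])
```

In practice `evaluate` would train and cross-validate a model on the given subset, so the cost grows with the number of ranked features examined.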

20.
Mol Ther Nucleic Acids ; 17: 337-346, 2019 Sep 06.
Article in English | MEDLINE | ID: mdl-31299595

ABSTRACT

A promoter is a fundamental DNA element located around the transcription start site (TSS) that can regulate gene transcription. Promoter recognition is of great significance in determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene functional information. Many models have already been proposed to predict promoters. However, the performance of these methods still needs to be improved. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The Minimum Redundancy Maximum Relevance (mRMR) algorithm and an incremental feature selection strategy were then adopted to find optimal feature subsets. Support vector machine (SVM) was used to distinguish between promoters and non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curves (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established that can be freely accessed (http://lin-group.cn/server/iProEP/).
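
A 10-fold cross-validation test like the one reported above splits the data into ten parts and uses each part once as the test set. The sketch below builds such folds while keeping class proportions roughly equal in every fold; stratification, the function name, the seed, and the label names are assumptions, since the abstract does not say how the folds were drawn.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of sample indices with each class spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds

# Hypothetical balanced dataset of 30 promoters and 30 non-promoters.
labels = ["promoter"] * 30 + ["non-promoter"] * 30
folds = stratified_folds(labels, k=10)
```

Each fold then serves as the held-out set once while the other nine are used for training, and the reported accuracy is the average over the ten runs.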
