Results 1 - 20 of 24
1.
Proteomics ; : e2400044, 2024 Jun 02.
Article in English | MEDLINE | ID: mdl-38824664

ABSTRACT

RNA-dependent liquid-liquid phase separation (LLPS) proteins play critical roles in cellular processes such as stress granule formation, DNA repair, RNA metabolism, germ cell development, and protein translation regulation. The abnormal behavior of these proteins is associated with various diseases, particularly neurodegenerative disorders like amyotrophic lateral sclerosis and frontotemporal dementia, making their identification crucial. However, conventional biochemistry-based methods for identifying these proteins are time-consuming and costly. Addressing this challenge, our study developed a robust computational model for their identification. We constructed a comprehensive dataset containing 137 RNA-dependent and 606 non-RNA-dependent LLPS protein sequences, which were then encoded using amino acid composition, composition of K-spaced amino acid pairs, Geary autocorrelation, and conjoined triad methods. Through a combination of correlation analysis, mutual information scoring, and incremental feature selection, we identified an optimal feature subset. This subset was used to train a random forest model, which achieved an accuracy of 90% when tested against an independent dataset. This study demonstrates the potential of computational methods as efficient alternatives for the identification of RNA-dependent LLPS proteins. To enhance the accessibility of the model, a user-centric web server has been established and can be accessed via the link: http://rpp.lin-group.cn.
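
Two of the encodings named in this abstract, amino acid composition and composition of k-spaced amino acid pairs, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the residue ordering, and the choice to skip non-standard residues are all assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)


def amino_acid_composition(seq):
    """Return the 20 residue frequencies of seq, in AMINO_ACIDS order."""
    seq = seq.upper()
    total = sum(seq.count(a) for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / total for a in AMINO_ACIDS]


def cksaap(seq, k):
    """Return frequencies of the 400 residue pairs separated by exactly k residues."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    windows = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
            windows += 1
    return [counts[p] / windows if windows else 0.0 for p in pairs]


# Hypothetical toy sequence, used only to show the output shapes.
aac = amino_acid_composition("AACD")   # 20 values
gap1 = cksaap("ACDA", k=1)             # 400 values
```

Vectors from several gap values k would typically be concatenated before feature selection; the abstract does not state which values were used.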

2.
Brief Bioinform ; 22(6), 2021 Nov 05.
Article in English | MEDLINE | ID: mdl-34184738

ABSTRACT

The rapid spread of SARS-CoV-2 infection around the globe has caused a massive health and socioeconomic crisis. Identification of phosphorylation sites is an important step for understanding the molecular mechanisms of SARS-CoV-2 infection and the changes within host cell pathways. In this study, we present DeepIPs, the first deep-learning architecture designed specifically to identify phosphorylation sites in host cells infected with SARS-CoV-2. DeepIPs combines a widely used word embedding method with a convolutional neural network-long short-term memory (CNN-LSTM) architecture to make the final prediction. The independent test demonstrates that DeepIPs improves prediction performance compared with existing tools for general phosphorylation site prediction. Based on the proposed model, a web server called DeepIPs was established and is freely accessible at http://lin-group.cn/server/DeepIPs. The source code of DeepIPs is freely available at the repository https://github.com/linDing-group/DeepIPs.


Subjects
COVID-19 Drug Treatment, Phosphorylation/genetics, SARS-CoV-2/chemistry, Software, COVID-19/genetics, COVID-19/virology, Computational Biology, Deep Learning, Humans, Computer Neural Networks, SARS-CoV-2/genetics, SARS-CoV-2/pathogenicity
3.
Brief Bioinform ; 22(2): 1940-1950, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-32065211

ABSTRACT

The locations at which genomic DNA replication is initiated are defined as origins of replication sites (ORIs), which regulate the onset of DNA replication and play significant roles in the DNA replication process. The study of ORIs is essential for understanding the cell-division cycle and gene expression regulation. Accurate identification of ORIs by computational methods will provide important clues for DNA replication research and drug development. In this paper, the first integrated predictor, named iORI-Euk, was built to identify ORIs in multiple eukaryotes and multiple cell types. For the predictor, ORI data from seven eukaryotes (Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Pichia pastoris, Schizosaccharomyces pombe and Kluyveromyces lactis) were collected from public databases to construct benchmark datasets. Subsequently, three feature extraction strategies (k-mer, binary encoding, and the combination of k-mer and binary encoding) were used to formulate DNA sequence samples. We also compared the performance of different classification algorithms. The support vector machine gave the best results in both the 5-fold cross-validation test and the independent dataset test. Based on the optimal model, an online web server called iORI-Euk (http://lin-group.cn/server/iORI-Euk/) was established for identifying novel ORIs.


Subjects
Replication Origin, Algorithms, Animals, Cell Line, Tumor Cell Line, Datasets as Topic, Eukaryota/genetics, Humans, Support Vector Machine
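
The three DNA feature strategies mentioned in this abstract (k-mer, binary encoding, and their combination) can be sketched as below. This is an illustrative sketch under assumptions: the function names are hypothetical, the binary encoding is taken to be a 4-bit one-hot code per base, and ambiguous bases are ignored or zero-filled; the abstract does not specify these details.

```python
from itertools import product

BASES = "ACGT"


def kmer_frequencies(seq, k):
    """Return the normalized counts of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def binary_encoding(seq):
    """One-hot encode each base as 4 bits; unknown bases become all zeros."""
    one_hot = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
               "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    vector = []
    for base in seq.upper():
        vector.extend(one_hot.get(base, [0, 0, 0, 0]))
    return vector


def combined_features(seq, k):
    """Concatenate the k-mer and binary feature vectors."""
    return kmer_frequencies(seq, k) + binary_encoding(seq)


features = combined_features("ACGT", k=2)  # 16 k-mer values + 16 one-hot bits
```

Note that the binary part grows with sequence length, so it only yields fixed-length vectors when all samples have the same length.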
4.
Brief Bioinform ; 22(5), 2021 Sep 02.
Article in English | MEDLINE | ID: mdl-33634313

ABSTRACT

The three-dimensional (3D) architecture of chromosomes is of crucial importance for transcription regulation and DNA replication. Various high-throughput chromosome conformation capture-based methods have revealed that CTCF-mediated chromatin loops are a major component of 3D architecture. However, CTCF-mediated chromatin loops are cell type specific, and most chromatin interaction capture techniques are time-consuming and labor-intensive, which restricts their use across a very large number of cell types. Genomic sequence-based computational models are sophisticated enough to capture important features of chromatin architecture and help to identify chromatin loops. In this work, we develop Deep-loop, a convolutional neural network model that integrates k-tuple nucleotide frequency component, nucleotide pair spectrum encoding, position conservation, position scoring function and natural vector features for the prediction of chromatin loops. In a series of cross-validation tests, Deep-loop shows excellent performance in identifying chromatin loops from different cell types. The source code of Deep-loop is freely available at the repository https://github.com/linDing-group/Deep-loop.


Subjects
CCCTC-Binding Factor/genetics, Chromatin/metabolism, Human Genome, Computer Neural Networks, CCCTC-Binding Factor/metabolism, Chromatin/ultrastructure, Datasets as Topic, Gene Expression Regulation, Humans, K562 Cells, MCF-7 Cells, Molecular Conformation, Nucleotide Motifs, Software
5.
Brief Bioinform ; 22(6), 2021 Nov 05.
Article in English | MEDLINE | ID: mdl-34410360

ABSTRACT

The global pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2, has led to a dramatic loss of human life worldwide. Despite many efforts, the development of effective drugs and vaccines for this novel virus will take considerable time. Artificial intelligence (AI) and machine learning (ML) offer promising solutions that could accelerate the discovery and optimization of new antivirals. Motivated by this, in this paper, we present an extensive survey on the application of AI and ML for combating COVID-19 based on the rapidly emerging literature. Particularly, we point out the challenges and future directions associated with state-of-the-art solutions to effectively control the COVID-19 pandemic. We hope that this review provides researchers with new insights into the ways AI and ML fight and have fought the COVID-19 outbreak.


Subjects
COVID-19 Drug Treatment, COVID-19 Vaccines/genetics, Drug Discovery, SARS-CoV-2/genetics, Artificial Intelligence, COVID-19/genetics, COVID-19/virology, COVID-19 Vaccines/chemistry, Drug Design, Humans, Machine Learning, Pandemics, SARS-CoV-2/chemistry, SARS-CoV-2/pathogenicity
6.
Methods ; 203: 558-563, 2022 Jul.
Article in English | MEDLINE | ID: mdl-34352373

ABSTRACT

N4-methylcytosine (4mC) is a type of DNA modification that can regulate several biological processes, such as transcriptional regulation, replication and gene expression. Precisely recognizing 4mC sites in genomic sequences can provide specific knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in Escherichia coli. In the model, DNA sequences were encoded by the word embedding technique 'word2vec'. The obtained features were fed into a 1-D convolutional neural network (CNN) to discriminate 4mC sites from non-4mC sites in the Escherichia coli genome. Evaluation on an independent dataset showed that our model could yield an overall accuracy of 0.861, which was about 4.3% higher than the existing model. For the convenience of other researchers, we provide the data and source code of the model, which can be freely downloaded from https://github.com/linDing-groups/Deep-4mCW2V.


Subjects
DNA, Escherichia coli, DNA/genetics, Escherichia coli/genetics, Genome, Genomics, Software
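
Before word2vec can be applied to DNA, each sequence has to be turned into a list of "words". A common choice, assumed here because the abstract does not describe the tokenization, is overlapping k-mers with stride 1. The sketch below shows that step and a lookup into an embedding table; the table is a hand-written toy standing in for trained vectors, and all names are hypothetical.

```python
def sequence_to_words(seq, k=3):
    """Split a DNA sequence into overlapping k-mer 'words' (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_words(words, table, dim):
    """Look up each word in an embedding table; unknown words map to zeros."""
    zero = [0.0] * dim
    return [list(table.get(w, zero)) for w in words]


# Toy 2-dimensional table standing in for trained word2vec vectors (assumption).
toy_table = {"ACG": [0.1, 0.9], "CGT": [0.4, 0.2]}

words = sequence_to_words("ACGTA", k=3)
matrix = embed_words(words, toy_table, dim=2)  # one row per word
```

The resulting matrix (words by embedding dimensions) is the kind of input a 1-D CNN would convolve over along the sequence axis.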
7.
Environ Monit Assess ; 195(9): 1028, 2023 Aug 10.
Article in English | MEDLINE | ID: mdl-37558890

ABSTRACT

This study is the first assessment of radiological hazards linked to the sands and rocks of Patuartek Sea Beach, situated along one of the world's longest sea beaches in Cox's Bazar, Bangladesh. Using an HPGe detector, the activity concentrations of 226Ra, 232Th, and 40K were measured; they ranged over 7-23 Bq/kg, 9-58 Bq/kg, and 172-340 Bq/kg, respectively, in soils, and 19-24 Bq/kg, 27-39 Bq/kg, and 340-410 Bq/kg, respectively, in rocks. Some sand samples exhibited elevated levels of 232Th, while the rock samples displayed higher levels of 40K compared to the global average. The radiological hazard parameters were assessed, and no values surpassed the recommended limits set by several international organizations. Hence, the sands and rocks of Patuartek Sea Beach pose no significant radiological risk to residents or tourists. The findings of this study provide crucial insights for the development of a radiological baseline map of the country, which is important given the commissioning of the country's first nuclear power plant, the Rooppur Nuclear Power Plant. The data may also stimulate interest in the rare-earth minerals present in the area, which are important for the electronics industry and for thorium-based nuclear fuel cycles.


Subjects
Radiation Monitoring, Radioactivity, Radium, Radioactive Soil Pollutants, Potassium Radioisotopes/analysis, Silicon Dioxide/analysis, Soil, Sand, Bangladesh, Bathing Beaches, Radioactive Soil Pollutants/analysis, Thorium/analysis, Radium/analysis
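
The abstract does not say which hazard parameters were computed. One index that is widely used in surveys of this kind is the radium equivalent activity, Ra_eq = A_Ra + 1.43 A_Th + 0.077 A_K (all in Bq/kg), which is usually compared with a limit of 370 Bq/kg. The sketch below applies it to the upper ends of the reported ranges; combining the three maxima is an assumption made for illustration, since they need not come from the same sample.

```python
RA_EQ_LIMIT = 370.0  # Bq/kg, the commonly quoted upper limit for Ra_eq


def radium_equivalent(a_ra, a_th, a_k):
    """Radium equivalent activity in Bq/kg from 226Ra, 232Th and 40K activities."""
    return a_ra + 1.43 * a_th + 0.077 * a_k


# Upper ends of the ranges quoted in the abstract (illustrative worst case).
upper_bound = radium_equivalent(a_ra=23, a_th=58, a_k=340)
within_limit = upper_bound < RA_EQ_LIMIT
```

Even this pessimistic combination stays well below 370 Bq/kg, which is consistent with the abstract's conclusion that no recommended limit was exceeded.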
8.
Int J Mol Sci ; 23(17), 2022 Sep 04.
Article in English | MEDLINE | ID: mdl-36077513

ABSTRACT

Thermophilic proteins have various practical applications in theoretical research and in industry. In recent years, the demand for thermophilic proteins on an industrial scale has been increasing; therefore, the engineering of thermophilic proteins has become an active direction in the field of protein engineering. However, the exact mechanism of protein thermostability is not yet known, and understanding its basis is necessary for engineering thermophilic proteins. To understand the basis of thermostability in proteins, we carried out a statistical analysis of the sequences, secondary structures, hydrogen bonds, salt bridges, DHA (Donor-Hydrogen-Acceptor) angles, and bond lengths of ten pairs of thermophilic proteins and their non-thermophilic orthologs. Our findings suggest that polar amino acids contribute to thermostability in proteins by forming hydrogen bonds and salt bridges, which provide resistance against protein denaturation. Shorter bond lengths and wider DHA angles provide greater bond stability in thermophilic proteins. Moreover, the increased frequency of aromatic amino acids in thermophilic proteins contributes to thermal stability by forming more aromatic interactions. Additionally, the coil, helix, and loop in the secondary structure also contribute to thermostability.


Subjects
Amino Acids, Proteins, Amino Acids/chemistry, Hydrogen Bonding, Protein Denaturation, Protein Engineering, Proteins/chemistry, Temperature
9.
Int J Mol Sci ; 23(3), 2022 Jan 23.
Article in English | MEDLINE | ID: mdl-35163174

ABSTRACT

4mC is a type of DNA modification that can regulate multiple biological processes, for example DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using correlation and a gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. These optimized features were then fed into a 1D convolutional neural network (CNN) to distinguish 4mC sites from non-4mC sites in Geobacter pickeringii. On independent data, the proposed model achieved an accuracy of 0.868, which was 4.2% higher than the existing model.


Subjects
Computational Biology/methods, Genetic Epigenesis/genetics, Geobacter/genetics, Algorithms, Cytosine/metabolism, DNA/genetics, DNA Methylation/genetics, Deep Learning, Machine Learning, Mutation/genetics, Computer Neural Networks, Software
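
Incremental feature selection, as named in this abstract, generally works by ranking features once and then evaluating ever-larger prefixes of that ranking. The sketch below shows that loop in isolation. It assumes the ranking already exists (for example from GBDT importances) and that the caller supplies an `evaluate` callback returning a score such as cross-validated accuracy; both the names and the toy scores are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return the top-n prefix of ranked_features with the best score.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning a score
              where higher is better (e.g. cross-validated accuracy).
    """
    best_score = float("-inf")
    best_subset = []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        score = evaluate(subset)
        if score > best_score:  # ties keep the smaller subset
            best_score = score
            best_subset = list(subset)
    return best_subset, best_score


# Toy stand-in for a real train-and-validate step (scores are made up).
toy_scores = {1: 0.70, 2: 0.85, 3: 0.80}
subset, score = incremental_feature_selection(
    ["f_a", "f_b", "f_c"], lambda feats: toy_scores[len(feats)]
)
```

In practice each call to `evaluate` retrains the classifier, so the cost grows with the number of features; evaluating in steps larger than one is a common shortcut.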
10.
Front Microbiol ; 14: 1170785, 2023.
Article in English | MEDLINE | ID: mdl-37125199

ABSTRACT

Promoters are genomic regions upstream of genes that are bound by RNA polymerase to start gene transcription. Because promoters are the most critical element of gene expression, recognizing them is crucial to understanding the regulation of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. In the model, promoter sequences were encoded by three different kinds of feature descriptors, namely accumulated nucleotide frequency, k-mer nucleotide composition, and binary encodings. The obtained features were optimized by using correlation and the mRMR-based algorithm. These optimized features were fed into a random forest (RF) classifier to discriminate promoter sequences from non-promoter sequences in A. tumefaciens strain C58. Ten-fold cross-validation showed that the proposed model could yield an overall accuracy of 0.837. This model will aid the study of promoters in A. tumefaciens strain C58.
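
Accumulated nucleotide frequency, one of the descriptors listed above, is usually defined position by position: the value at position i is the number of times the base at i has occurred in the first i positions, divided by i. A minimal sketch under that assumption (the function name is hypothetical):

```python
def accumulated_nucleotide_frequency(seq):
    """Return, for each position i (1-based), count(seq[i] in seq[:i]) / i."""
    counts = {}
    densities = []
    for position, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / position)
    return densities


anf = accumulated_nucleotide_frequency("ACA")
```

Unlike plain k-mer composition, this encoding keeps positional information, because the same base yields different values depending on where it appears.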

11.
Front Microbiol ; 14: 1200678, 2023.
Article in English | MEDLINE | ID: mdl-37250059

ABSTRACT

Promoters are the basic functional cis-elements to which RNA polymerase binds to initiate the process of gene transcription. A comprehensive understanding of gene expression and regulation depends on the precise identification of promoters, as they are the most important component of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Klebsiella aerogenes (K. aerogenes). In the prediction model, the promoter sequences in the K. aerogenes genome were encoded by pseudo k-tuple nucleotide composition (PseKNC) and position-correlation scoring function (PCSF). Numerical features were obtained and then optimized using mRMR combined with a support vector machine (SVM) and 5-fold cross-validation (CV). Subsequently, these optimized features were fed into an SVM-based classifier to discriminate promoter sequences from non-promoter sequences in K. aerogenes. Results of 10-fold CV showed that the model could yield an overall accuracy of 96.0% and an area under the ROC curve (AUC) of 0.990. We hope that this model will aid the study of promoters and gene regulation in K. aerogenes.

12.
Behav Sci (Basel) ; 13(7), 2023 Jul 12.
Article in English | MEDLINE | ID: mdl-37504025

ABSTRACT

The world faced COVID-19, which was a threat to public health and disturbed the educational system and economic stability. Educational institutes were closed for a long period, and students had difficulty completing their syllabus. The government adopted a policy of "suspending classes without stopping learning" to continue education activities. However, student satisfaction with online education is a growing concern. Student satisfaction is an important indicator of academic quality. Therefore, this study investigates the factors influencing learning satisfaction using information from 335 students from various institutes in Pakistan. This research examined the impact of computer and internet knowledge, instructor and course material, and Learning Management Systems (LMS) on learning satisfaction. The path coefficients were obtained via Partial Least Square-Structural Equation Modeling (PLS-SEM). The LMS is a tool that facilitates the learning process by providing all types of educational material. The path coefficient was largest for the LMS (0.489), which indicates its positive and significant role in attaining learning satisfaction. Instructor and course material ranked second (0.261), which shows that the quality of the instructor and course material also plays a positive role in attaining learning satisfaction. Computers and the internet are essential ingredients of online education and showed a significant and positive path coefficient (0.123), implying that computer and internet knowledge could enhance learning satisfaction. Universities should develop their LMS to implement online education with quality course materials. It is also vital that instructors be up to date with modern learning techniques and that internet connectivity be ensured, especially in rural areas. The government should provide internet connections to students at discounted rates.

13.
Comput Biol Med ; 163: 107165, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37315383

ABSTRACT

MicroRNAs (miRNAs) have a significant role in the emergence of various human disorders. Consequently, it is essential to understand the existing interactions between miRNAs and diseases, as this will help scientists better study and comprehend the diseases' biological mechanisms. Predicted disease-related miRNAs can be employed as biomarkers or drug targets to advance the detection, diagnosis, and treatment of complex human disorders. Because conventional biological experiments are expensive and time-consuming, this study proposed a computational model for predicting potential miRNA-disease associations, called the Collaborative Filtering Neighborhood-based Classification Model (CFNCM). The model generated integrated miRNA and disease similarity matrices using the validated associations and miRNA and disease similarity information and used them as the input features for CFNCM. To produce class labels, we first determined the association scores for new pairs using user-based collaborative filtering. With zero as the threshold, associations with scores >0 were labelled 1, indicating a potential positive association, and all others were labelled 0. Then, we developed classification models using various machine-learning algorithms. By comparison, we found that the support vector machine (SVM) produced the best AUC of 0.96 with 10-fold cross-validation, using the GridSearchCV technique to identify optimal parameter values. In addition, the models were evaluated and verified by analyzing the top 50 breast and lung neoplasm-related miRNAs, of which 46 and 47 associations, respectively, were verified in two authoritative databases, dbDEMC and miR2Disease.


Subjects
Disease, MicroRNAs, Support Vector Machine, Neighborhood Characteristics, MicroRNAs/genetics, MicroRNAs/metabolism, Computer Simulation, Humans, Disease/classification, Algorithms
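
The labelling step described in this abstract (score unknown pairs with user-based collaborative filtering, then threshold at zero) can be sketched as follows. The scoring formula used here, a similarity-weighted average over the other miRNAs, is one standard form of user-based collaborative filtering and is an assumption; the abstract does not give the exact formula. The matrices are toy values and all names are hypothetical.

```python
def predict_score(assoc, sim, m, d):
    """Similarity-weighted average of disease d's known labels over other miRNAs."""
    numerator = 0.0
    denominator = 0.0
    for other in range(len(assoc)):
        if other == m:
            continue
        numerator += sim[m][other] * assoc[other][d]
        denominator += abs(sim[m][other])
    return numerator / denominator if denominator else 0.0


def label_unknown_pairs(assoc, sim, threshold=0.0):
    """Label each unverified (miRNA, disease) pair 1 if its score exceeds threshold."""
    labels = {}
    for m, row in enumerate(assoc):
        for d, known in enumerate(row):
            if known == 0:
                labels[(m, d)] = 1 if predict_score(assoc, sim, m, d) > threshold else 0
    return labels


# Toy data: rows are miRNAs, columns are diseases; 1 marks a validated association.
assoc = [[1, 0],
         [0, 0],
         [0, 1]]
sim = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.0],
       [0.1, 0.0, 1.0]]

labels = label_unknown_pairs(assoc, sim)
```

The resulting 0/1 labels would then serve as training targets for the downstream classifiers mentioned in the abstract.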
14.
Comput Struct Biotechnol J ; 21: 2253-2261, 2023.
Article in English | MEDLINE | ID: mdl-37035551

ABSTRACT

Hormone binding proteins (HBPs) belong to the group of soluble carrier proteins. These proteins selectively and non-covalently interact with hormones and promote growth hormone signaling in humans and other animals. HBPs are useful in many medical and commercial fields. Identifying HBPs is therefore important, because it can help reveal more details about these proteins. However, experimental methods for recognizing HBPs are time-consuming and expensive. Computational prediction methods that use sequence information and machine learning (ML) algorithms have played significant roles in the correct recognition of HBPs. In this review, we compared and assessed ML-based tools for the recognition of HBPs. We hope that this study will provide useful background for research on HBPs.

15.
Front Med (Lausanne) ; 10: 1291352, 2023.
Article in English | MEDLINE | ID: mdl-38298505

ABSTRACT

Snake venom contains many toxic proteins that can destroy the circulatory system or nervous system of prey. Studies have found that these snake venom proteins have the potential to treat cardiovascular and nervous system diseases. Therefore, the study of snake venom proteins is conducive to the development of related drugs. Research technologies based on traditional biochemistry can accurately identify these proteins, but they are costly and time-consuming. Artificial intelligence technology provides a new computational strategy for large-scale screening of snake venom proteins. In this paper, we developed a sequence-based computational method to recognize snake toxin proteins. Specifically, we used three different feature descriptors, namely g-gap, natural vector and word2vec, to encode snake toxin protein sequences. Analysis of variance (ANOVA) and the gradient-boosting decision tree (GBDT) algorithm, combined with incremental feature selection (IFS), were used to optimize the features, and the optimized features were then input into the deep learning model for training. The results show that our model achieves an accuracy of 82.00% in 10-fold cross-validation. The model was further verified on independent data, where the accuracy reaches 81.14%, demonstrating that our model has excellent prediction performance and robustness.

16.
Int J Biol Macromol ; 227: 1174-1181, 2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36470433

ABSTRACT

RNA N4-acetylcytidine (ac4C) is the acetylation of cytidine at the nitrogen-4 position; it is a highly conserved RNA modification and is involved in a variety of biological processes. Hence, accurate identification of genome-wide ac4C sites is vital for understanding the regulatory mechanisms of gene expression. In this work, a novel predictor, named iRNA-ac4C, was established to identify ac4C sites in human mRNA based on three feature extraction methods: nucleotide composition, nucleotide chemical property, and accumulated nucleotide frequency. Subsequently, minimum-Redundancy-Maximum-Relevance combined with incremental feature selection was used to select the optimal feature subset. Using the optimal feature subset, the best ac4C classification model was trained by gradient boosting decision tree with 10-fold cross-validation. The results on the independent testing set indicated that our proposed method showed encouraging generalization ability. For the convenience of other researchers, we established a user-friendly web server, which is freely available at http://lin-group.cn/server/iRNA-ac4C/. We hope that the tool can provide guidance for experimental researchers.


Subjects
Cytidine, RNA, Humans, Messenger RNA/metabolism, Cytidine/genetics, Cytidine/metabolism, RNA/chemistry, Nucleotides
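
The nucleotide chemical property descriptor mentioned above is commonly implemented as a 3-bit code per base: ring structure (purine = 1), functional group (amino = 1), and hydrogen bonding (weak = 1). The sketch below uses that widely used convention, which is an assumption here because the abstract does not spell out the mapping; the function name is hypothetical.

```python
# (ring structure, functional group, hydrogen bond strength) per base
NCP_CODES = {
    "A": (1, 1, 1),  # purine, amino, weak
    "C": (0, 1, 0),  # pyrimidine, amino, strong
    "G": (1, 0, 0),  # purine, keto, strong
    "U": (0, 0, 1),  # pyrimidine, keto, weak
}


def ncp_encode(seq):
    """Encode an RNA sequence as 3 bits per base; unknown bases become zeros."""
    vector = []
    for base in seq.upper().replace("T", "U"):  # accept DNA-style input as well
        vector.extend(NCP_CODES.get(base, (0, 0, 0)))
    return vector


encoded = ncp_encode("ACGU")
```

This code is often concatenated with the accumulated nucleotide frequency at each position, giving four values per base.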
17.
Front Microbiol ; 13: 790063, 2022.
Article in English | MEDLINE | ID: mdl-35273581

ABSTRACT

Thermophilic proteins have important application value in biotechnology and industrial processes. The correct identification of thermophilic proteins provides important information for the application of these proteins in engineering. Biochemistry-based methods for identifying thermophilic proteins are laborious, time-consuming, and costly. Therefore, there is an urgent need for a fast and accurate method to identify thermophilic proteins. To address this need, we constructed a reliable benchmark dataset containing 1,368 thermophilic and 1,443 non-thermophilic proteins. A multi-layer perceptron (MLP) model based on a multi-feature fusion strategy was proposed to discriminate thermophilic proteins from non-thermophilic proteins. On an independent dataset, the proposed model achieved an accuracy of 96.26%, which demonstrates that the model has good application prospects. To make the model convenient to use, a user-friendly software package called iThermo was established and can be freely accessed at http://lin-group.cn/server/iThermo/index.html. The high accuracy of the model and the practicability of the developed software package indicate that this study can accelerate the discovery and engineering application of thermally stable proteins.

18.
Front Biosci (Landmark Ed) ; 27(3): 84, 2022 Mar 05.
Article in English | MEDLINE | ID: mdl-35345316

ABSTRACT

BACKGROUND: Lipocalins belong to the calycin family, and their sequence length is generally between 165 and 200 residues. They are mainly stable and multifunctional extracellular proteins. Lipocalins play an important role in several stress responses and allergic inflammations. Because the accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize lipocalins. METHODS: In this study, we constructed a prediction model to identify lipocalins. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized by using feature selection techniques. A classifier based on random forest (RF) was trained on the optimal features. RESULTS: The results of 10-fold cross-validation showed that our computational model could classify lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, our computational model produced an accuracy of 89.90%, which was 4.17% higher than the existing model. CONCLUSIONS: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the features that produced the maximum accuracy. On the basis of the best feature subset, the RF-based classifier obtained the best prediction results.


Subjects
Artificial Intelligence, Lipocalins, Amino Acids, Computational Biology, Lipocalins/chemistry, Machine Learning, Proteins/chemistry
19.
Comput Math Methods Med ; 2021: 6664362, 2021.
Article in English | MEDLINE | ID: mdl-33505515

ABSTRACT

Bioluminescent proteins (BLPs) are a class of proteins that are widely distributed in many living organisms, with various mechanisms of light emission including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods for cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging because they share little sequence similarity. In this paper, we briefly reviewed the development of the computational identification of BLPs and subsequently proposed a novel prediction framework for identifying BLPs based on the eXtreme gradient boosting algorithm (XGBoost) and sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryotes, and archaea. Then, to obtain more effective prediction models, we examined the performance of different feature extraction methods and their combinations, as well as of different classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP was demonstrated by experiments on training and independent datasets. Comparison with other published methods further demonstrated that the proposed method is powerful and provides good performance for BLP identification. The web server and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.


Subjects
Algorithms, Luminescent Proteins, Amino Acid Sequence, Chemical Phenomena, Computational Biology, Protein Databases, Drug Discovery, Luminescence, Luminescent Proteins/chemistry, Luminescent Proteins/genetics, Luminescent Proteins/metabolism, Machine Learning, Software
20.
Math Biosci Eng ; 18(4): 3348-3363, 2021 Apr 15.
Article in English | MEDLINE | ID: mdl-34198389

ABSTRACT

N4-methylcytosine (4mC) is a kind of DNA modification that can regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The resulting optimal features were fed into a random forest classifier to discriminate 4mC from non-4mC sites in the mouse genome. On the independent dataset, our model yielded an overall accuracy of 85.41%, which was approximately 3.8%-6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL, respectively. The data and source code of the model can be freely downloaded from https://github.com/linDing-groups/model_4mc.


Subjects
Cytosine, DNA, Animals, Computational Biology, Genome, Machine Learning, Mice, Software
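
Several records in this listing, including this one, report results from k-fold cross-validation. The sketch below shows one simple way to build the train/test index splits. It is an illustration only: the abstracts do not say how folds were formed (for example, whether they were stratified by class), and the function name and seed handling are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs over range(n_samples)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the splits reproducible
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every sample once for evaluation.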