Results 1 - 20 of 26
1.
Proteomics ; : e2400044, 2024 Jun 02.
Article in French | MEDLINE | ID: mdl-38824664

ABSTRACT

RNA-dependent liquid-liquid phase separation (LLPS) proteins play critical roles in cellular processes such as stress granule formation, DNA repair, RNA metabolism, germ cell development, and protein translation regulation. The abnormal behavior of these proteins is associated with various diseases, particularly neurodegenerative disorders like amyotrophic lateral sclerosis and frontotemporal dementia, making their identification crucial. However, conventional biochemistry-based methods for identifying these proteins are time-consuming and costly. Addressing this challenge, our study developed a robust computational model for their identification. We constructed a comprehensive dataset containing 137 RNA-dependent and 606 non-RNA-dependent LLPS protein sequences, which were then encoded using amino acid composition, composition of K-spaced amino acid pairs, Geary autocorrelation, and conjoined triad methods. Through a combination of correlation analysis, mutual information scoring, and incremental feature selection, we identified an optimal feature subset. This subset was used to train a random forest model, which achieved an accuracy of 90% when tested against an independent dataset. This study demonstrates the potential of computational methods as efficient alternatives for the identification of RNA-dependent LLPS proteins. To enhance the accessibility of the model, a user-centric web server has been established and can be accessed via the link: http://rpp.lin-group.cn.
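Two of the sequence encodings named in this abstract, amino acid composition and the composition of k-spaced amino acid pairs, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_gap` parameter, and the choice to skip non-standard residues are all hypothetical.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    seq = seq.upper()
    total = sum(seq.count(a) for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / total for a in AMINO_ACIDS]


def cksaap(seq, max_gap=2):
    """Composition of k-spaced amino acid pairs for k = 0..max_gap.

    For each gap k, count residue pairs separated by k positions and
    normalise by the number of valid pairs. Assumption: pairs containing
    a non-standard residue are skipped rather than counted.
    """
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        windows = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                windows += 1
        features.extend(counts[p] / windows if windows else 0.0 for p in pairs)
    return features
```

Each gap value contributes 400 features, so `cksaap(seq, max_gap=2)` yields a 1200-dimensional vector that could be concatenated with the 20 composition values before feature selection.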

2.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34184738

ABSTRACT

The rapid spread of SARS-CoV-2 infection around the globe has caused a massive health and socioeconomic crisis. Identifying phosphorylation sites is an important step toward understanding the molecular mechanisms of SARS-CoV-2 infection and the resulting changes in host cell pathways. In this study, we present DeepIPs, a first deep-learning architecture designed specifically to identify phosphorylation sites in host cells infected with SARS-CoV-2. DeepIPs combines the most popular word embedding method with a convolutional neural network-long short-term memory (CNN-LSTM) architecture to make the final prediction. An independent test demonstrates that DeepIPs improves prediction performance compared with existing tools for general phosphorylation site prediction. Based on the proposed model, a web server called DeepIPs was established and is freely accessible at http://lin-group.cn/server/DeepIPs. The source code of DeepIPs is freely available at the repository https://github.com/linDing-group/DeepIPs.


Subject(s)
COVID-19 Drug Treatment; Phosphorylation/genetics; SARS-CoV-2/chemistry; Software; COVID-19/genetics; COVID-19/virology; Computational Biology; Deep Learning; Humans; Neural Networks, Computer; SARS-CoV-2/genetics; SARS-CoV-2/pathogenicity
3.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33634313

ABSTRACT

The three-dimensional (3D) architecture of chromosomes is of crucial importance for transcription regulation and DNA replication. Various high-throughput chromosome conformation capture-based methods have revealed that CTCF-mediated chromatin loops are a major component of 3D architecture. However, CTCF-mediated chromatin loops are cell type specific, and most chromatin interaction capture techniques are time-consuming and labor-intensive, which restricts their use across a very large number of cell types. Genomic sequence-based computational models are sophisticated enough to capture important features of chromatin architecture and help to identify chromatin loops. In this work, we develop Deep-loop, a convolutional neural network model that integrates k-tuple nucleotide frequency component, nucleotide pair spectrum encoding, position conservation, position scoring function and natural vector features for the prediction of chromatin loops. In a series of cross-validation tests, Deep-loop shows excellent performance in identifying chromatin loops from different cell types. The source code of Deep-loop is freely available at the repository https://github.com/linDing-group/Deep-loop.


Subject(s)
CCCTC-Binding Factor/genetics; Chromatin/metabolism; Genome, Human; Neural Networks, Computer; CCCTC-Binding Factor/metabolism; Chromatin/ultrastructure; Datasets as Topic; Gene Expression Regulation; Humans; K562 Cells; MCF-7 Cells; Molecular Conformation; Nucleotide Motifs; Software
4.
Brief Bioinform ; 22(2): 1940-1950, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32065211

ABSTRACT

The locations at which genomic DNA replication initiates are defined as origins of replication (ORIs); they regulate the onset of DNA replication and play significant roles in the replication process. The study of ORIs is essential for understanding the cell-division cycle and the regulation of gene expression. Computational methods that accurately identify ORIs can provide important clues for DNA replication research and drug development. In this paper, the first integrated predictor, named iORI-Euk, was built to identify ORIs in multiple eukaryotes and multiple cell types. ORI data from seven eukaryotes (Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Pichia pastoris, Schizosaccharomyces pombe and Kluyveromyces lactis) were collected from public databases to construct benchmark datasets. Subsequently, three feature extraction strategies (k-mer, binary encoding, and a combination of k-mer and binary encoding) were used to formulate DNA sequence samples. We also compared the performance of different classification algorithms. The best results were obtained with a support vector machine in both the 5-fold cross-validation test and the independent dataset test. Based on the optimal model, an online web server called iORI-Euk (http://lin-group.cn/server/iORI-Euk/) was established for identifying novel ORIs.
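The three feature extraction strategies mentioned here (k-mer, binary encoding, and their combination) could look roughly like the sketch below. It is illustrative only: the function names, the default `k=2`, and the all-zero vector for ambiguous bases are assumptions, and the abstract does not state the k values or sequence lengths actually used.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_frequencies(seq, k=2):
    """Return the normalised frequency of every possible k-mer (4**k values)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def binary_encoding(seq):
    """One-hot encode each base as 4 bits; unknown bases become all zeros."""
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    encoded = []
    for base in seq.upper():
        encoded.extend(table.get(base, [0, 0, 0, 0]))
    return encoded


def combined_features(seq, k=2):
    """Concatenate k-mer frequencies with the position-wise binary encoding."""
    return kmer_frequencies(seq, k) + binary_encoding(seq)
```

Note that the binary encoding grows with sequence length, so all samples would need to share one length before the combined vector can be passed to a classifier.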


Subject(s)
Replication Origin; Algorithms; Animals; Cell Line; Cell Line, Tumor; Datasets as Topic; Eukaryota/genetics; Humans; Support Vector Machine
5.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34410360

ABSTRACT

The global pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2, has led to a dramatic loss of human life worldwide. Despite many efforts, the development of effective drugs and vaccines for this novel virus will take considerable time. Artificial intelligence (AI) and machine learning (ML) offer promising solutions that could accelerate the discovery and optimization of new antivirals. Motivated by this, in this paper, we present an extensive survey on the application of AI and ML for combating COVID-19 based on the rapidly emerging literature. Particularly, we point out the challenges and future directions associated with state-of-the-art solutions to effectively control the COVID-19 pandemic. We hope that this review provides researchers with new insights into the ways AI and ML fight and have fought the COVID-19 outbreak.


Subject(s)
COVID-19 Drug Treatment; COVID-19 Vaccines/genetics; Drug Discovery; SARS-CoV-2/genetics; Artificial Intelligence; COVID-19/genetics; COVID-19/virology; COVID-19 Vaccines/chemistry; Drug Design; Humans; Machine Learning; Pandemics; SARS-CoV-2/chemistry; SARS-CoV-2/pathogenicity
6.
Methods ; 203: 558-563, 2022 07.
Article in English | MEDLINE | ID: mdl-34352373

ABSTRACT

N4-methylcytosine (4mC) is a type of DNA modification that can regulate several biological processes, such as transcription regulation, replication, and gene expression. Precisely recognizing 4mC sites in genomic sequences can provide specific knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in Escherichia coli. In the model, DNA sequences were encoded with the word embedding technique 'word2vec'. The resulting features were fed into a 1-D convolutional neural network (CNN) to discriminate 4mC sites from non-4mC sites in the Escherichia coli genome. Evaluation on an independent dataset showed that the model yielded an overall accuracy of 0.861, about 4.3% higher than the existing model. For the convenience of other researchers, the data and source code of the model can be freely downloaded from https://github.com/linDing-groups/Deep-4mCW2V.
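Before a word embedding such as word2vec can be applied to DNA, each sequence has to be turned into a "sentence" of words. A common way to do this, shown below as a hedged sketch, is to slide a window over the sequence and treat each overlapping k-mer as a word. The abstract does not state the k-mer length or stride used, so `k=3`, `stride=1`, and all function names here are assumptions; training the embedding itself (for example with a word2vec library) is not shown.

```python
def dna_to_words(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sentences):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Map a sentence of k-mer words to its list of integer ids."""
    return [vocab[word] for word in sentence]
```

The integer ids would then index rows of a learned embedding matrix, and the resulting sequence of vectors is the kind of input a 1-D CNN can consume.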


Subject(s)
DNA; Escherichia coli; DNA/genetics; Escherichia coli/genetics; Genome; Genomics; Software
7.
Environ Monit Assess ; 195(9): 1028, 2023 Aug 10.
Article in English | MEDLINE | ID: mdl-37558890

ABSTRACT

This study marks the first assessment of radiological hazards linked to the sands and rocks of Patuartek Sea Beach, situated along one of the world's longest sea beaches in Cox's Bazar, Bangladesh. Using an HPGe detector, the activity concentrations of 226Ra, 232Th, and 40K were analyzed; they ranged over 7-23 Bq/kg, 9-58 Bq/kg, and 172-340 Bq/kg, respectively, in soils, and 19-24 Bq/kg, 27-39 Bq/kg, and 340-410 Bq/kg, respectively, in rocks. Some sand samples exhibited elevated levels of 232Th, while the rock samples displayed higher levels of 40K compared to the global average. The radiological hazard parameters were assessed, and no values surpassed the recommended limits set by several international organizations. Hence, the sands and rocks of Patuartek Sea Beach pose no significant radiological risk to residents or tourists. The findings of this study provide crucial insights for the development of a radiological baseline map in the country, which is important due to the commissioning of the country's first nuclear power plant, the Rooppur Nuclear Power Plant. The data may also stimulate interest in the rare-earth minerals present in the area, which are important for the electronics industry and thorium-based nuclear fuel cycles.
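The abstract does not say which hazard parameters were computed. One widely used index in this kind of survey is the radium equivalent activity, Ra_eq = A_Ra + 1.43 A_Th + 0.077 A_K (all in Bq/kg), usually compared against a limit of 370 Bq/kg. The sketch below assumes that index purely for illustration; the function names are hypothetical, and plugging in the upper end of each reported range gives a worst-case combination, not a value for any actual sample.

```python
# Commonly cited upper limit for radium equivalent activity, in Bq/kg.
RAEQ_LIMIT = 370.0


def radium_equivalent(ra226, th232, k40):
    """Radium equivalent activity (Bq/kg) from activity concentrations in Bq/kg."""
    return ra226 + 1.43 * th232 + 0.077 * k40


def within_limit(ra226, th232, k40, limit=RAEQ_LIMIT):
    """True if the radium equivalent activity does not exceed the limit."""
    return radium_equivalent(ra226, th232, k40) <= limit


# Worst case built from the upper bounds quoted in the abstract (assumption:
# combining range maxima; real samples would each be evaluated separately).
worst_case_first_range = radium_equivalent(23, 58, 340)
worst_case_rock_range = radium_equivalent(24, 39, 410)
```

Even these worst-case combinations fall well below 370 Bq/kg, which is consistent with the abstract's statement that no hazard parameter exceeded recommended limits.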


Subject(s)
Radiation Monitoring; Radioactivity; Radium; Soil Pollutants, Radioactive; Potassium Radioisotopes/analysis; Silicon Dioxide/analysis; Soil; Sand; Bangladesh; Soil Pollutants, Radioactive/analysis; Bathing Beaches; Thorium/analysis; Radium/analysis
8.
Int J Mol Sci ; 23(17)2022 Sep 04.
Article in English | MEDLINE | ID: mdl-36077513

ABSTRACT

Thermophilic proteins have various practical applications in theoretical research and in industry. In recent years, industrial-scale demand for thermophilic proteins has been increasing; therefore, the engineering of thermophilic proteins has become an active area of protein engineering. However, the exact mechanism of protein thermostability is not yet known, and engineering thermophilic proteins requires knowing the basis of thermostability. To understand this basis, we performed a statistical analysis of the sequences, secondary structures, hydrogen bonds, salt bridges, DHA (Donor-Hydrogen-Acceptor) angles, and bond lengths of ten pairs of thermophilic proteins and their non-thermophilic orthologs. Our findings suggest that polar amino acids contribute to thermostability by forming hydrogen bonds and salt bridges, which provide resistance against protein denaturation. Shorter bond lengths and wider DHA angles provide greater bond stability in thermophilic proteins. Moreover, the increased frequency of aromatic amino acids in thermophilic proteins contributes to thermal stability by forming more aromatic interactions. Additionally, the coil, helix, and loop in the secondary structure also contribute to thermostability.


Subject(s)
Amino Acids; Proteins; Amino Acids/chemistry; Hydrogen Bonding; Protein Denaturation; Protein Engineering; Proteins/chemistry; Temperature
9.
Int J Mol Sci ; 23(3)2022 Jan 23.
Article in English | MEDLINE | ID: mdl-35163174

ABSTRACT

4mC is a type of DNA modification that can coordinate multiple biological processes, for example DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The fused features were optimized using correlation and a gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. These optimized features were then fed into a 1D convolutional neural network (CNN) to distinguish 4mC sites from non-4mC sites in Geobacter pickeringii. On independent data, the proposed model achieved an accuracy of 0.868, which was 4.2% higher than the existing model.
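Incremental feature selection, as named in this abstract, generally works on a list of features already ranked by importance (here, presumably by the GBDT-based scores): features are added one at a time in rank order, a model is evaluated on each growing subset, and the best-scoring subset is kept. The sketch below shows that loop in generic form. The `evaluate` callback and the toy scoring function are hypothetical stand-ins for a real cross-validated classifier, which the abstract does not specify in detail.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate growing prefixes of a ranked feature list and keep the best.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning a score
              (for example, cross-validated accuracy); higher is better.
    Returns (best_subset, best_score, history).
    """
    best_score = float("-inf")
    best_subset = []
    history = []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        score = evaluate(subset)
        history.append((n, score))
        if score > best_score:  # strict: ties keep the smaller subset
            best_score, best_subset = score, list(subset)
    return best_subset, best_score, history


def toy_evaluate(subset):
    """Hypothetical score: rewards three informative features, penalises size."""
    informative = {"f1", "f2", "f3"}
    return len(informative & set(subset)) - 0.1 * len(subset)
```

With the ranking `["f1", "f2", "f3", "n1", "n2"]`, the toy score peaks at the first three features, so the loop returns that prefix; in practice `evaluate` would train and cross-validate the chosen classifier on each subset.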


Subject(s)
Computational Biology/methods; Epigenesis, Genetic/genetics; Geobacter/genetics; Algorithms; Cytosine/metabolism; DNA/genetics; DNA Methylation/genetics; Deep Learning; Machine Learning; Mutation/genetics; Neural Networks, Computer; Software
10.
Methods Mol Biol ; 2844: 33-44, 2024.
Article in English | MEDLINE | ID: mdl-39068330

ABSTRACT

Promoters are the genomic regions upstream of genes that RNA polymerase binds in order to initiate gene transcription. Understanding the regulation of gene expression depends on being able to identify promoters, because they are the most important component of gene expression. The goal of this study was to create a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. Nucleotide density (ND), k-mer, and one-hot encodings were used to represent the promoter sequences. The generated features were optimized using incremental feature selection (IFS) with a support vector machine (SVM) under fivefold cross-validation. These optimized features were then fed into a random forest (RF) classifier to distinguish promoter sequences. Tenfold cross-validation (CV) analysis showed that the proposed model achieves an accuracy of 84.22%.


Subject(s)
Agrobacterium tumefaciens; Artificial Intelligence; Promoter Regions, Genetic; Support Vector Machine; Agrobacterium tumefaciens/genetics; Computational Biology/methods; Algorithms
11.
Int J Biol Macromol ; 277(Pt 4): 134146, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39067723

ABSTRACT

Liquid-liquid phase separation (LLPS) regulates many biological processes including RNA metabolism, chromatin rearrangement, and signal transduction. Aberrant LLPS potentially leads to serious diseases. Therefore, the identification of LLPS proteins is crucial. Traditional biochemistry-based methods for identifying LLPS proteins are costly, time-consuming, and laborious. In contrast, artificial intelligence-based approaches are fast and cost-effective and can be a better alternative to biochemistry-based methods. Previous research methods employed word2vec in conjunction with machine learning or deep learning algorithms. Although word2vec captures word semantics and relationships, it might not be effective in capturing features relevant to protein classification, such as physicochemical properties, evolutionary relationships, or structural features. Additionally, other studies often focused on a limited set of features for model training, including planar π contact frequency, π-π, and β-pairing propensities. To overcome these shortcomings, this study first constructed a reliable dataset containing 1206 protein sequences, including 603 LLPS and 603 non-LLPS protein sequences. A computational model was then proposed to efficiently identify LLPS proteins by perceiving the semantic information of protein sequences directly, using an ESM2-36 pre-trained model based on the transformer architecture in conjunction with a convolutional neural network. The model achieved accuracies of 85.68% and 89.67% on the training data and test data, respectively, surpassing the accuracy of previous studies. The performance demonstrates the potential of our computational methods as efficient alternatives for identifying LLPS proteins.


Subject(s)
Proteins; Proteins/chemistry; Proteins/isolation & purification; Algorithms; Neural Networks, Computer; Machine Learning; Liquid-Liquid Extraction/methods; Computational Biology/methods; Phase Separation
12.
Front Microbiol ; 14: 1200678, 2023.
Article in English | MEDLINE | ID: mdl-37250059

ABSTRACT

Promoters are the basic functional cis-elements to which RNA polymerase binds to initiate gene transcription. A comprehensive understanding of gene expression and regulation depends on the precise identification of promoters, as they are the most important component of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Klebsiella aerogenes (K. aerogenes). In the prediction model, the promoter sequences in the K. aerogenes genome were encoded by pseudo k-tuple nucleotide composition (PseKNC) and position-correlation scoring function (PCSF). The resulting numerical features were optimized using mRMR combined with a support vector machine (SVM) and 5-fold cross-validation (CV). These optimized features were then fed into an SVM-based classifier to discriminate promoter sequences from non-promoter sequences in K. aerogenes. Results of 10-fold CV showed that the model yielded an overall accuracy of 96.0% and an area under the ROC curve (AUC) of 0.990. We hope that this model will aid the study of promoters and gene regulation in K. aerogenes.

13.
Front Microbiol ; 14: 1170785, 2023.
Article in English | MEDLINE | ID: mdl-37125199

ABSTRACT

Promoters are genomic regions upstream of genes that are bound by RNA polymerase to start gene transcription. Because promoters are the most critical element of gene expression, recognizing them is crucial to understanding the regulation of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. In the model, promoter sequences were encoded by three different kinds of feature descriptors, namely accumulated nucleotide frequency, k-mer nucleotide composition, and binary encodings. The obtained features were optimized using correlation and the mRMR-based algorithm. These optimized features were fed into a random forest (RF) classifier to discriminate promoter sequences from non-promoter sequences in A. tumefaciens strain C58. Ten-fold cross-validation showed that the proposed model could yield an overall accuracy of 0.837. This model will aid the study of promoters in A. tumefaciens strain C58.

14.
Comput Struct Biotechnol J ; 21: 2253-2261, 2023.
Article in English | MEDLINE | ID: mdl-37035551

ABSTRACT

Hormone binding proteins (HBPs) belong to the group of soluble carrier proteins. These proteins selectively and non-covalently interact with hormones and promote growth hormone signaling in humans and other animals. HBPs are useful in many medical and commercial fields, so identifying them is very important because it can reveal more details about these proteins. However, experimental methods for recognizing HBPs are time-consuming and expensive. Computational prediction methods that use sequence information and machine learning (ML) algorithms have played significant roles in the correct recognition of HBPs. In this review, we compared and assessed ML-based tools for the recognition of HBPs. We hope that this study will provide sufficient awareness and knowledge for research on HBPs.

15.
Behav Sci (Basel) ; 13(7)2023 Jul 12.
Article in English | MEDLINE | ID: mdl-37504025

ABSTRACT

The world faced COVID-19, which threatened public health and disrupted educational systems and economic stability. Educational institutes were closed for a long period, and students had difficulty completing their syllabus. The government adopted a policy of "suspending classes without stopping learning" to continue educational activities. However, student satisfaction with online education is a growing concern. Student satisfaction is an important indicator of academic quality. Therefore, this study investigates the factors influencing learning satisfaction using information from 335 students from various institutes in Pakistan. This research examined the impact of computer and internet knowledge, instructor and course material, and Learning Management Systems (LMS) on learning satisfaction. The path coefficients were obtained via Partial Least Squares Structural Equation Modeling (PLS-SEM). The LMS is a tool that facilitates the learning process by providing all types of educational material. The path coefficient was largest for LMS (0.489), which indicates its positive and significant role in attaining learning satisfaction. Instructor and course material ranked second (0.261), which shows that the quality of the instructor and course material also plays a positive role in attaining learning satisfaction. Computer and internet knowledge, an essential ingredient of online education, showed a significant and positive path coefficient (0.123), implying that it could enhance learning satisfaction. Universities should develop their LMS to implement online education with quality course materials. It is also vital that instructors stay up to date with modern learning techniques and that internet connectivity is ensured, especially in rural areas. The government should provide internet connections to students at discounted rates.

16.
Comput Biol Med ; 163: 107165, 2023 09.
Article in English | MEDLINE | ID: mdl-37315383

ABSTRACT

MicroRNAs (miRNAs) have a significant role in the emergence of various human disorders. Consequently, it is essential to understand the interactions between miRNAs and diseases, as this will help scientists better study and comprehend the diseases' biological mechanisms. By predicting possible disease-related miRNAs, findings can be employed as biomarkers or drug targets to advance the detection, diagnosis, and treatment of complex human disorders. Because conventional biological experiments are expensive and time-consuming, this study proposed a computational model for predicting potential miRNA-disease associations, called the Collaborative Filtering Neighborhood-based Classification Model (CFNCM). The model generated integrated miRNA and disease similarity matrices using the validated associations and miRNA and disease similarity information, and used them as the input features for CFNCM. To produce class labels, we first determined the association scores for new pairs using user-based collaborative filtering. With zero as the threshold, associations with scores >0 were labelled 1, indicating a potential positive association; all others were labelled 0. Then, we developed classification models using various machine-learning algorithms. By comparison, we found that the support vector machine (SVM) produced the best AUC of 0.96 with 10-fold cross-validation, with optimal parameter values identified through the GridSearchCV technique. In addition, the models were evaluated and verified by analyzing the top 50 breast and lung neoplasm-related miRNAs, of which 46 and 47 associations were verified in two authoritative databases, dbDEMC and miR2Disease.
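The labelling step described here (score unknown pairs with user-based collaborative filtering, then threshold at zero) can be illustrated with a generic neighbour-weighted score. This is a sketch under assumptions, not the CFNCM implementation: the abstract does not give the exact scoring formula or neighbourhood size, so the similarity-weighted average over all other miRNAs, and every name below, is hypothetical.

```python
def cf_scores(assoc, sim):
    """Score each (miRNA, disease) pair from similar miRNAs' known associations.

    assoc[i][j] is 1 if miRNA i has a validated association with disease j.
    sim[i][k] is the similarity between miRNA i and miRNA k.
    Each score is the similarity-weighted average of the other miRNAs'
    associations with that disease (the miRNA itself is excluded).
    """
    n, m = len(assoc), len(assoc[0])
    scores = [[0.0] * m for _ in range(n)]
    for i in range(n):
        norm = sum(sim[i][k] for k in range(n) if k != i)
        if norm <= 0:
            continue  # no similar neighbours: leave scores at zero
        for j in range(m):
            weighted = sum(sim[i][k] * assoc[k][j] for k in range(n) if k != i)
            scores[i][j] = weighted / norm
    return scores


def label_pairs(assoc, scores, threshold=0.0):
    """Label pairs without a validated association: 1 if score > threshold, else 0."""
    labels = {}
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            if assoc[i][j] == 0:
                labels[(i, j)] = 1 if score > threshold else 0
    return labels
```

The resulting 0/1 labels, paired with the integrated similarity features, are what a downstream classifier such as an SVM would be trained on.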


Subject(s)
Disease; MicroRNAs; Support Vector Machine; Neighborhood Characteristics; MicroRNAs/genetics; MicroRNAs/metabolism; Computer Simulation; Humans; Disease/classification; Algorithms
17.
Front Med (Lausanne) ; 10: 1291352, 2023.
Article in English | MEDLINE | ID: mdl-38298505

ABSTRACT

Snake venom contains many toxic proteins that can destroy the circulatory or nervous system of prey. Studies have found that these snake venom proteins have the potential to treat cardiovascular and nervous system diseases. Therefore, the study of snake venom proteins is conducive to the development of related drugs. Research techniques based on traditional biochemistry can accurately identify these proteins, but the experiments are costly and time-consuming. Artificial intelligence technology provides a new computational strategy for large-scale screening of snake venom proteins. In this paper, we developed a sequence-based computational method to recognize snake toxin proteins. Specifically, we used three different feature descriptors, namely g-gap, natural vector, and word2vec, to encode snake toxin protein sequences. Analysis of variance (ANOVA) and the gradient-boosting decision tree (GBDT) algorithm, combined with incremental feature selection (IFS), were used to optimize the features, and the optimized features were then input into the deep learning model for training. The results show that our model achieves an accuracy of 82.00% in 10-fold cross-validation. The model was further verified on independent data, where the accuracy reached 81.14%, demonstrating that the model has excellent prediction performance and robustness.

18.
Int J Biol Macromol ; 227: 1174-1181, 2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36470433

ABSTRACT

RNA N4-acetylcytidine (ac4C) is the acetylation of cytidine at the nitrogen-4 position; it is a highly conserved RNA modification involved in a variety of biological processes. Hence, accurate identification of genome-wide ac4C sites is vital for understanding the regulatory mechanisms of gene expression. In this work, a novel predictor, named iRNA-ac4C, was established to identify ac4C sites in human mRNA based on three feature extraction methods: nucleotide composition, nucleotide chemical property, and accumulated nucleotide frequency. Subsequently, minimum-Redundancy-Maximum-Relevance combined with incremental feature selection was used to select the optimal feature subset. Using the optimal feature subset, the best ac4C classification model was trained by a gradient boosting decision tree with 10-fold cross-validation. Results on the independent testing set indicated that the proposed method has encouraging generalization capability. For the convenience of other researchers, we established a user-friendly web server, which is freely available at http://lin-group.cn/server/iRNA-ac4C/. We hope that the tool can provide guidance for experimental researchers.
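Two of the descriptors listed in this abstract have simple, commonly used definitions that can be sketched directly. Nucleotide chemical property maps each base to three bits (ring structure, functional group, hydrogen-bond strength), and accumulated nucleotide frequency records, at each position, how often that base has occurred so far divided by the position index. The code below follows those standard definitions as an assumption, since the abstract does not spell them out; the function names and the convenience mapping of T to U are hypothetical.

```python
# (ring structure: purine=1, functional group: amino=1, hydrogen bonds: weak=1)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def normalise(seq):
    """Upper-case the sequence and treat T as U (assumption for DNA-style input)."""
    return seq.upper().replace("T", "U")


def nucleotide_chemical_property(seq):
    """Encode each base as three chemical-property bits; unknown bases give zeros."""
    encoded = []
    for base in normalise(seq):
        encoded.extend(NCP.get(base, (0, 0, 0)))
    return encoded


def accumulated_nucleotide_frequency(seq):
    """For position i (1-based), the count of that base in seq[:i] divided by i."""
    counts = {}
    densities = []
    for i, base in enumerate(normalise(seq), start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / i)
    return densities
```

For a window of length L this gives 3L chemical-property values and L frequency values, which could be concatenated with nucleotide composition before the feature selection step.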


Subject(s)
Cytidine; RNA; Humans; RNA, Messenger/metabolism; Cytidine/genetics; Cytidine/metabolism; RNA/chemistry; Nucleotides
19.
Front Biosci (Landmark Ed) ; 27(3): 84, 2022 03 05.
Article in English | MEDLINE | ID: mdl-35345316

ABSTRACT

BACKGROUND: Lipocalins belong to the calycin family, and their sequence length is generally between 165 and 200 residues. They are mainly stable, multifunctional extracellular proteins. Lipocalins play an important role in several stress responses and allergic inflammations. Because the accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize lipocalins. METHODS: In this study, we constructed a prediction model to identify lipocalins. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized using feature selection techniques. A classifier based on random forest (RF) was trained on the optimal features. RESULTS: The results of 10-fold cross-validation showed that our computational model classified lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, our computational model achieved an accuracy of 89.90%, which was 4.17% higher than the existing model. CONCLUSIONS: In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the best features, which produced the maximum accuracy. On the basis of the best feature subset, the RF-based classifier obtained the best prediction results.


Subject(s)
Artificial Intelligence; Lipocalins; Amino Acids; Computational Biology; Lipocalins/chemistry; Machine Learning; Proteins/chemistry
20.
Front Microbiol ; 13: 790063, 2022.
Article in English | MEDLINE | ID: mdl-35273581

ABSTRACT

Thermophilic proteins have important application value in biotechnology and industrial processes. The correct identification of thermophilic proteins provides important information for their application in engineering. Biochemistry-based identification of thermophilic proteins is laborious, time-consuming, and costly. Therefore, there is an urgent need for a fast and accurate method to identify thermophilic proteins. To address this need, we constructed a reliable benchmark dataset containing 1,368 thermophilic and 1,443 non-thermophilic proteins. A multi-layer perceptron (MLP) model based on a multi-feature fusion strategy was proposed to discriminate thermophilic proteins from non-thermophilic proteins. On an independent dataset, the proposed model achieved an accuracy of 96.26%, which demonstrates that the model has good application prospects. To make the model convenient to use, a user-friendly software package called iThermo was established and can be freely accessed at http://lin-group.cn/server/iThermo/index.html. The high accuracy of the model and the practicability of the developed software package indicate that this study can accelerate the discovery and engineering application of thermally stable proteins.
