ABSTRACT
The transport, compartmentation and allocation of sugar are critical for plant growth and development, as well as for stress resistance, but sugar transporter genes have not been comprehensively characterized in soybean. Here, we performed genome-wide identification and expression analyses of sugar transporter genes in soybean to reveal their putative functions. A total of 122 genes encoding sucrose transporters (SUTs) and monosaccharide transporters (MSTs) were identified in soybean. They were classified into 8 subfamilies according to their phylogenetic relationships and conserved motifs. Comparative genomics analysis indicated that whole genome duplication/segmental duplication and tandem duplication contributed to the expansion of sugar transporter genes in soybean. Expression analysis of retrieved transcriptome datasets suggested that most of these sugar transporter genes were expressed in various tissues, and a number of genes exhibited tissue-specific expression patterns. Several genes, including GmSTP21, GmSFP8, and GmPLT5/6/7/8/9, were predominantly expressed in nodules, and GmPLT8 was significantly induced by rhizobia inoculation in root hairs. Transcript profiling and qRT-PCR analyses suggested that half of these sugar transporter genes were significantly induced or repressed under stresses such as salt, drought, and cold. In addition, GmSTP22 was found to be localized in the plasma membrane, and its overexpression promoted plant growth and salt tolerance in transgenic Arabidopsis supplemented with glucose or sucrose. This study provides insights into the evolutionary expansion, expression patterns and functional divergence of the sugar transporter gene family, and will enable further understanding of their biological functions in the regulation of growth, yield formation and stress resistance in soybean.
ABSTRACT
Post-translational modification (PTM) refers to the covalent and enzymatic modification of proteins after protein biosynthesis, which orchestrates a variety of biological processes. Detecting PTM sites at the proteome scale is one of the key steps toward an in-depth understanding of their regulatory mechanisms. In this study, we present an integrated method based on eXtreme Gradient Boosting (XGBoost), called iRice-MS, to identify 2-hydroxyisobutyrylation, crotonylation, malonylation, ubiquitination, succinylation and acetylation sites in rice. For each PTM-specific model, we adopted eight feature encoding schemes, including sequence-based features, physicochemical property-based features and spatial mapping information-based features. The optimal feature set was identified from each encoding, and the respective models were established. Extensive experimental results show that iRice-MS consistently displays excellent performance in 5-fold cross-validation and independent dataset tests. In addition, our approach outperforms other existing tools in terms of AUC value. Based on the proposed model, a web server named iRice-MS was established and is freely accessible at http://lin-group.cn/server/iRice-MS.
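The AUC used for the comparison above can be illustrated with a minimal sketch. It computes the area under the ROC curve as the fraction of (positive, negative) pairs that a classifier ranks correctly, counting ties as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from iRice-MS.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = modified site, assumed convention).
    scores: iterable of predicted scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores (not real predictions).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of samples; for large datasets a rank-based formulation or a library routine would be the more practical choice.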
Subjects
Oryza , Post-Translational Protein Processing , Acetylation , Computational Biology , Biological Models , Oryza/metabolism , Post-Translational Protein Processing/physiology , Proteome/metabolism , Ubiquitination
ABSTRACT
As a newly discovered protein post-translational modification, histone lysine crotonylation (Kcr) is involved in cellular regulation and human diseases. Various proteomics technologies have been developed to detect Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and labor-intensive, which makes them difficult to apply widely across many species. Computational approaches are cost-effective and can be used in a high-throughput manner to generate relatively precise identifications. In this study, we develop a deep learning-based method termed Deep-Kcr for Kcr site prediction by combining sequence-based features, physicochemical property-based features and numerical space-derived information with information gain feature selection. We investigate the performance of a convolutional neural network (CNN) and five commonly used classifiers (long short-term memory network, random forest, LogitBoost, naive Bayes and logistic regression) using 10-fold cross-validation and an independent test set. Results show that the CNN consistently displays the best performance with high computational efficiency on large datasets. We also compare Deep-Kcr with other existing tools to demonstrate the predictive power and robustness of our method. Based on the proposed model, a web server called Deep-Kcr was established and is freely accessible at http://lin-group.cn/server/Deep-Kcr.
Subjects
Crotonates/metabolism , Protein Databases , Deep Learning , Post-Translational Protein Processing , Protein Sequence Analysis , Acylation , Humans , Lysine/metabolism
ABSTRACT
MicroRNAs (miRNAs), a group of short non-coding RNA molecules, can regulate gene expression. Many diseases are associated with abnormal expression of miRNAs. Therefore, accurate identification of miRNA precursors (pre-miRNAs) is necessary. In the past 10 years, experimental methods, comparative genomics methods, and artificial intelligence methods have been used to identify pre-miRNAs. However, experimental methods and comparative genomics methods have disadvantages, such as being time-consuming. In contrast, machine learning-based methods are a better choice. Therefore, this review summarizes the current advances in pre-miRNA recognition based on computational methods, including the construction of benchmark datasets, feature extraction methods, prediction algorithms, and the results of the models. We also provide information about the predictors currently available. Finally, we give future perspectives on the identification of pre-miRNAs. The review provides scholars with a comprehensive background on pre-miRNA identification using machine learning methods, which can help researchers gain a clear understanding of research progress in this field.
ABSTRACT
5hmC, 6mA, and 4mC are three common DNA modifications and are involved in a variety of biological processes. Accurate genome-wide identification of these sites is invaluable for better understanding their biological functions. Owing to the labor-intensive and expensive nature of experimental methods, there is an urgent need to develop computational methods for the genome-wide detection of these sites. With this in mind, the current study was devoted to constructing a computational method to identify 5hmC, 6mA, and 4mC. We initially used K-tuple nucleotide component, nucleotide chemical property and nucleotide frequency, and mono-nucleotide binary encoding schemes to formulate samples. Subsequently, random forest was utilized to identify 5hmC, 6mA, and 4mC sites. Cross-validated results showed that the proposed method achieved excellent generalization ability in the identification of the three modification sites. Based on the proposed model, a web server called iDNA-MS was established and is freely accessible at http://lin-group.cn/server/iDNA-MS.
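As a rough illustration of a K-tuple nucleotide composition feature, the sketch below counts every k-mer in a DNA sequence and normalizes the counts to frequencies. The function name `kmer_composition`, the default k = 2, the lexicographic ordering of features, and the handling of ambiguous bases are assumptions made for this example; the abstract does not specify how iDNA-MS implements its encoding.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized frequencies of all k-mers over `alphabet`.

    The output has len(alphabet) ** k entries, ordered lexicographically
    (AA, AC, AG, AT, CA, ... for k = 2).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Toy sequence, for illustration only.
features = kmer_composition("ACGTACGT", k=2)
print(len(features), features[:4])
```

A feature vector of this kind could then be passed to any standard classifier, for example a random forest as named in the abstract.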
ABSTRACT
The chloroplast is a subcellular organelle of green plants and eukaryotic algae that plays an important role in photosynthesis. Since the function of a protein correlates with its location, knowing its subchloroplast localization is helpful for elucidating its function. However, owing to the large number of chloroplast proteins, it is costly and time-consuming to design biological experiments to determine the subchloroplast localizations of these proteins. To address this problem, twelve computational methods have been developed during the past ten years to predict protein subchloroplast localization. This review summarizes the research progress in this area. We hope the review can provide an important guide for further computational study of protein subchloroplast localization.
Subjects
Chloroplast Proteins/genetics , Chloroplasts/genetics , Plant Gene Expression Regulation , Machine Learning , Statistical Models , Proteome/genetics , Amino Acid Sequence , Chloroplast Proteins/classification , Chloroplast Proteins/metabolism , Chloroplasts/metabolism , Computational Biology/methods , Computational Biology/statistics & numerical data , Datasets as Topic , Plants/genetics , Plants/metabolism , Protein Transport , Proteome/classification , Proteome/metabolism
ABSTRACT
Nuclear receptors (NRs) are a superfamily of ligand-dependent transcription factors that are closely related to cell development, differentiation, reproduction, homeostasis, and metabolism. According to the alignments of the conserved domains, NRs are classified into seven or eight subfamilies: (1) NR1: thyroid hormone-like (thyroid hormone, retinoic acid, RAR-related orphan receptor, peroxisome proliferator activated, vitamin D3-like), (2) NR2: HNF4-like (hepatocyte nuclear factor 4, retinoic acid X, tailless-like, COUP-TF-like, USP), (3) NR3: estrogen-like (estrogen, estrogen-related, glucocorticoid-like), (4) NR4: nerve growth factor IB-like (NGFI-B-like), (5) NR5: fushi tarazu-F1-like (fushi tarazu-F1-like), (6) NR6: germ cell nuclear factor-like (germ cell nuclear factor), and (7) NR0: knirps-like (knirps, knirps-related, embryonic gonad protein, ODR7, trithorax) and DAX-like (DAX, SHP); alternatively, NR0 is divided into (7) NR7: knirps-like and (8) NR8: DAX-like. Different NR families have different structural features and functions. Since the function of an NR is closely correlated with the subfamily to which it belongs, it is highly desirable to identify NRs and their subfamilies rapidly and effectively. The knowledge acquired is essential for a proper understanding of normal and abnormal cellular mechanisms. With the advent of the post-genomics era, the number of proteins with known sequences has increased explosively. Conventional methods for accurately classifying the family of NRs are experimental, with high cost and low efficiency. This has created a greater need for bioinformatics tools that effectively recognize NRs and their subfamilies for the purpose of understanding their biological function. In this review, we summarize the application of machine learning methods in the prediction of NRs from different aspects. We hope that this review will provide a reference for further research on the classification of NRs and their families.
Subjects
Machine Learning , Cytoplasmic and Nuclear Receptors/genetics , Animals , Computational Biology , Humans , Cytoplasmic and Nuclear Receptors/metabolism
ABSTRACT
Mycobacterium tuberculosis (MTB) causes tuberculosis (TB), which is reported to be one of the most devastating epidemics. Although many biochemical molecular drugs have been developed to cope with this disease, drug resistance, especially multidrug resistance (MDR) and extensive drug resistance (XDR), poses a huge threat to treatment. Moreover, traditional biochemical experimental methods for tackling TB are time-consuming and costly. With the availability of enormous amounts of genomic and proteomic sequence data, TB can be tackled via sequence-based computational approaches, that is, bioinformatics. Studies on predicting the subcellular localization of mycobacterial proteins (MBPs) with high precision and efficiency may help to determine the biological function of these proteins and thus provide useful insights for protein function annotation as well as drug design. In this review, we report the progress that has been made in the computational prediction of subcellular localization of MBPs, covering the following aspects: (1) construction of benchmark datasets; (2) methods of feature extraction; (3) techniques of feature selection; (4) application of several published prediction algorithms; (5) the published results; and (6) further study on the prediction of subcellular localization of MBPs.
Subjects
Bacterial Proteins/genetics , Machine Learning , Mycobacterium tuberculosis/genetics , Bacterial Proteins/metabolism , Computational Biology , Mycobacterium tuberculosis/metabolism
ABSTRACT
Bioluminescent proteins (BLPs) are widely distributed in many living organisms and play a key role in light emission in bioluminescence. Bioluminescence serves various functions, such as finding food and protecting organisms from predators. With the routine biotechnological application of bioluminescence, it is recognized as essential for many medical, commercial and other general technological advances. Therefore, the prediction and characterization of BLPs are significant and can help to reveal more about bioluminescence and promote the development of its applications. Since experimental methods for BLP identification are costly and time-consuming, bioinformatics tools have played an important role in the fast and accurate prediction of BLPs by combining their sequence information with machine learning methods. In this review, we summarize and compare the application of machine learning methods in the prediction of BLPs from different aspects. We hope that this review will provide insights and inspiration for research on BLPs.
Subjects
Computational Biology , Luminescent Proteins/chemistry , Machine Learning
ABSTRACT
DNA N6-methyladenine (6mA) is a dominant form of DNA modification and is involved in many biological functions. The accurate genome-wide identification of 6mA sites may increase understanding of its biological functions. Experimental methods for 6mA detection in eukaryotic genomes are laborious and expensive. Therefore, it is necessary to develop computational methods to identify 6mA sites on a genomic scale, especially for plant genomes. Based on this consideration, the study aims to develop a machine learning-based method for predicting 6mA sites in the rice genome. We initially used mono-nucleotide binary encoding to formulate positive and negative samples. Subsequently, the Random Forest machine learning algorithm was utilized to perform the classification for identifying 6mA sites. Our proposed method achieved an area under the receiver operating characteristic curve of 0.964 with an overall accuracy of 0.917, as indicated by the fivefold cross-validation test. Furthermore, an independent dataset was established to assess the generalization ability of our method. On this dataset, an area under the receiver operating characteristic curve of 0.981 was obtained, suggesting that the proposed method performs well in predicting 6mA sites in the rice genome. For the convenience of retrieving 6mA sites, on the basis of the computational method, we built a freely accessible web server named iDNA6mA-Rice at http://lin-group.cn/server/iDNA6mA-Rice.
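Mono-nucleotide binary encoding is commonly implemented as a one-hot code in which each base becomes a 4-bit vector. The sketch below shows one such encoding. The specific bit assignment (A = 1000, C = 0100, G = 0010, T = 0001), the all-zero code for unknown bases, and the names `ONE_HOT` and `binary_encode` are assumptions for illustration; the abstract does not state the exact scheme used by iDNA6mA-Rice.

```python
# Assumed bit assignment; the published tool may order the bases differently.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(seq):
    """Encode a DNA sequence as a flat list of 4 * len(seq) binary values.

    Bases outside A/C/G/T (for example N) are encoded as four zeros,
    which is an assumption made for this sketch.
    """
    vector = []
    for base in seq.upper():
        vector.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vector


# Toy fragment, for illustration only.
print(binary_encode("ACGT"))
```

Because every sample window has the same length, the encoded vectors have a fixed dimension and could be fed directly to a classifier such as the random forest mentioned in the abstract.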