ABSTRACT
Data science approaches are increasingly needed in environmental fields that deal with marine, weather, and soil data. Some of these datasets are large, but many are small because they consist of observational data for which the range of possible analyses is limited. In such cases, the data are typically evaluated statistically by differential analysis of increases or decreases in factor levels, from which the essential factors are estimated. However, there is no consistent approach for assessing strong associations among factors as a group. Causal inference methods can potentially produce useful results for small datasets and do not necessarily require big data; the results are expected to provide important information for understanding potentially strong associations between factors. Here, as an additional discussion of the DirectLiNGAM portion of the related research article, we describe essential checkpoints and settings for calculations with a direct method for learning a linear non-Gaussian structural equation model (DirectLiNGAM), together with methods for validating the calculation results, using small-scale model data. Thus, this study provides statistical validation methods for the association networks, treatments, and interventions used in structural inference for a group of essential factors.
• Causal inference with DirectLiNGAM
• Validation of correlation coefficient and feature importance
• Validation using causal effect object and propensity scores
ABSTRACT
Reducing antibiotic usage among livestock animals to prevent antimicrobial resistance has become an urgent issue worldwide. This study evaluated the effects of administering chlortetracycline (CTC), a versatile antibacterial agent, on the performance, blood components, fecal microbiota, and organic acid concentrations of calves. Japanese Black calves were fed with milk replacers containing CTC at 10 g/kg (CON group) or 0 g/kg (EXP group). Growth performance was not affected by CTC administration. However, CTC administration altered the correlation between fecal organic acids and bacterial genera. Machine learning (ML) methods such as association analysis, linear discriminant analysis, and energy landscape analysis revealed that CTC administration affected populations of various types of fecal bacteria. Interestingly, the abundance of several methane-producing bacteria at 60 days of age was high in the CON group, and the abundance of Lachnospiraceae, a butyrate-producing bacterium, was high in the EXP group. Furthermore, statistical causal inference based on ML data estimated that CTC treatment affected the entire intestinal environment, potentially suppressing butyrate production, which may be attributed to methanogens in feces. Thus, these observations highlight the multiple harmful impacts of antibiotics on the intestinal health of calves and the potential production of greenhouse gases by calves.
Subject(s)
Anti-Bacterial Agents; Chlortetracycline; Animals; Cattle; Anti-Bacterial Agents/pharmacology; Dysbiosis; Chlortetracycline/pharmacology; Feces/microbiology; Bacteria; Butyrates; Animal Feed/analysis; Diet/veterinary
ABSTRACT
Compost is used worldwide as a soil conditioner for crops, but its functions are still being explored. Here, the omics profiles of carrots, as a model root vegetable, were investigated for growth and quality indices in a field amended with compost fermented with thermophilic Bacillaceae. Exposure to compost significantly increased the productivity, antioxidant activity, color, and taste of the carrot root and altered the soil bacterial composition along with the levels of characteristic metabolites of the leaf, root, and soil. Based on the data, structural equation modeling (SEM) estimated that amino acids, antioxidant activity, flavonoids and/or carotenoids in plants were optimally linked by exposure to compost. The SEM of the soil estimated that the genus Paenibacillus and nitrogen compounds were optimally involved during exposure. These estimates did not contradict the whole-genome analysis of compost-derived Paenibacillus isolates and the bioactivity data, suggesting the presence of a complex cascade of plant growth-promoting effects and modulation of the nitrogen cycle by the compost itself. These observations provide information on the qualitative indicators of compost in complex soil-plant interactions and offer a new perspective for chemically independent sustainable agriculture through the efficient use of natural nitrogen.
ABSTRACT
Coastal seagrass meadows are essential for blue carbon and aquatic ecosystem services. However, these ecosystems have suffered severe eutrophication and destruction due to the expansion of aquaculture, and methods for promoting seagrass growth are still being explored. Here, data from 49 public coastal surveys on the distribution of seagrass and seaweed around onshore aquaculture facilities were revalidated, and an exceptional area where the seagrass Zostera marina thrives was found near the shore downstream of an onshore aquaculture facility. To evaluate the characteristics of sediment that supports seagrass growth, physicochemical properties and the bacterial ecology of the sediment were assessed. Evaluation of chemical properties in seagrass sediments confirmed a significant increase in total carbon and a decrease in zinc content. Association analysis and linear discriminant analysis narrowed down the bacterial candidates specific to sediments with and without seagrass overgrowth. Energy landscape analysis indicated that the symbiotic bacterial groups of seagrass sediment were strongly affected by proximity to the aquaculture facility near which the seagrass grows, even though the bacterial populations appeared to fluctuate seasonally. The bacterial population there showed an apparent decrease in pathogen candidates belonging to the order Flavobacteriales. Moreover, structural equation modeling and a linear non-Gaussian acyclic model based on the machine learning data estimated an optimal candidate group of sediment symbiotic bacteria for seagrass growth as follows: the Lachnospiraceae and Ruminococcaceae families as gut-inhabitant bacteria, Rhodobacteraceae as photosynthetic bacteria, and Desulfobulbaceae as cable bacteria modulating oxygen or nitrate reduction and oxidation of sulfide. These observations offer a novel perspective on the sediment symbiotic bacterial structures critical for blue carbon and low-pathogenic marine ecosystems in aquaculture.
Subject(s)
Ecosystem; Zosteraceae; Humans; Geologic Sediments/analysis; Aquaculture; Carbon/analysis; Bacteria
ABSTRACT
In the development of polymer materials, exploring the complex relationships between domain structure and physical properties is an important issue. In the domain structure analysis of polymer materials, 1H-static solid-state NMR (ssNMR) spectra can provide information on mobile, rigid, and intermediate domains. However, estimating domain structure from such spectra is difficult because the spectra from multiple domains overlap widely. Therefore, we have developed a materials informatics approach that combines domain modeling ( http://dmar.riken.jp/matrigica/ ) with the integrated analysis of meta-information (the elements, functional groups, additives, and physical properties) of polymer materials. First, the 1H-static ssNMR data of 120 polymer materials were subjected to a short-time Fourier transform to obtain the frequency, intensity, and T2 relaxation time for domains with different mobility. The average T2 relaxation time of each domain was 0.96 ms for Mobile, 0.55 ms for Intermediate (Mobile), 0.32 ms for Intermediate (Rigid), and 0.11 ms for Rigid. Second, the estimated domain proportions were integrated with meta-information such as elements, functional groups, and thermophysical properties and were analyzed using a self-organizing map and market basket analysis. The proposed method can contribute to exploring the structure-property relationships of polymer materials with multiple domains.
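As a rough illustration of the market basket analysis step mentioned above, the following minimal Python sketch computes support, confidence, and lift for pairwise rules over sets of labels. The label names and the `min_support` threshold are hypothetical and chosen only for illustration; the abstract does not specify the authors' actual implementation or settings.

```python
from itertools import combinations


def association_rules(transactions, min_support=0.5):
    """Return pairwise rules as (antecedent, consequent, support, confidence, lift)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for ante, cons in ((a, b), (b, a)):
            confidence = pair_support / support({ante})
            lift = confidence / support({cons})
            rules.append((ante, cons, pair_support, confidence, lift))
    return rules


# Hypothetical meta-information labels, one set per polymer material.
materials = [
    {"rigid_domain_high", "aromatic_ring", "high_melting_point"},
    {"rigid_domain_high", "aromatic_ring"},
    {"mobile_domain_high", "ether_group"},
    {"rigid_domain_high", "high_melting_point"},
]

for ante, cons, sup, conf, lift in association_rules(materials):
    print(f"{ante} -> {cons}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 indicates that two labels co-occur more often than expected if they were independent, which is the kind of signal used to link domain proportions with material properties.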
Subject(s)
Magnetic Resonance Imaging; Polymers; Informatics; Magnetic Resonance Spectroscopy/methods; Polymers/chemistry
ABSTRACT
Effective biological utilization of wood biomass is necessary worldwide. Since several insect larvae can use wood biomass as a nutrient source, studies on their digestive microbial structures are expected to reveal a novel rule underlying wood biomass processing. Here, structural inferences for inhabitant bacteria involved in carbon and nitrogen metabolism in beetle larvae, an insect model, were performed to explore the potential rules. Bacterial analysis of larval feces showed enrichment of the phyla Chloroflexi, Gemmatimonadetes, and Planctomycetes, and the genera Bradyrhizobium, Cohnella, Corallococcus, Gemmata, Hyphomicrobium, Lutibacterium, Paenibacillus, and Rhodoplanes, as bacteria potentially involved in plant growth promotion, nitrogen cycle modulation, and/or environmental protection. The fecal abundances of these bacteria were not necessarily positively correlated with their abundances in the habitat, indicating that they were selectively enriched in the feces of the larvae. Correlation and association analyses predicted that common fecal bacteria might affect carbon and nitrogen metabolism. Based on these hypotheses, structural equation modeling (SEM) statistically estimated that inhabitant bacterial groups involved in carbon and nitrogen metabolism were composed of the phyla Gemmatimonadetes and Planctomycetes, and the genera Bradyrhizobium, Corallococcus, Gemmata, and Paenibacillus, which were among the fecal-enriched bacteria. Nevertheless, the selected common bacteria, i.e., the phyla Acidobacteria, Armatimonadetes, and Bacteroidetes and the genera Candidatus Solibacter, Devosia, Fimbriimonas, Gemmatimonas, Opitutus, Sphingobium, and Methanobacterium, were necessary to obtain good fit indices in the SEM. In addition, the composition of the bacterial groups differed depending upon the metabolic targets, carbon and nitrogen, and their stable isotopes, δ13C and δ15N, respectively.
Thus, the statistically derived causal structural models highlighted that the larval fecal-enriched bacteria and common symbiotic bacteria might selectively play a role in wood biomass carbon and nitrogen metabolism. This information could confer a new perspective that helps us use wood biomass more efficiently and might stimulate innovation in environmental industries in the future.
Subject(s)
Carbon; Coleoptera; Acidobacteria/metabolism; Animals; Bacteria/metabolism; Carbon/metabolism; Coleoptera/metabolism; Larva/metabolism; Nitrogen/metabolism; Wood/metabolism
ABSTRACT
The protein isoelectric point (pI) can be calculated computationally from an amino acid sequence, in good agreement with experimental data. The availability of whole-genome sequences enables comparative studies of proteome-wide pI distributions. It was found that the whole-proteome distributions of protein pI values are multimodal in different species. It was further hypothesized that the observed multimodality is associated with subcellular localization-specific differences in local pI distributions. Here, we review the multimodality of proteome-wide pI distributions in different organisms, focusing on the relationships between protein pI and subcellular localization. We also discuss the probable factors responsible for variation of the intracellular localization-specific pI profiles.
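To make the idea of calculating pI from a sequence concrete, here is a minimal Python sketch that finds the pH at which the Henderson-Hasselbalch net charge is zero, using bisection. The pKa values are illustrative assumptions (published pKa scales differ, and the abstract does not state which one was used), so the numbers it returns should be treated as approximate.

```python
# Illustrative pKa values (an assumption; published scales differ).
POSITIVE_PKA = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Henderson-Hasselbalch net charge of a peptide at the given pH."""
    counts = {"N_term": 1, "C_term": 1}
    for residue in sequence.upper():
        counts[residue] = counts.get(residue, 0) + 1
    positive = sum(
        counts.get(group, 0) / (1 + 10 ** (ph - pka))
        for group, pka in POSITIVE_PKA.items()
    )
    negative = sum(
        counts.get(group, 0) / (1 + 10 ** (pka - ph))
        for group, pka in NEGATIVE_PKA.items()
    )
    return positive - negative


def isoelectric_point(sequence, tol=1e-4):
    """Bisection on pH in [0, 14]; net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # negatively charged (or neutral): pI lies at lower pH
    return (low + high) / 2


print(round(isoelectric_point("DDDDEE"), 2))  # acidic peptide, low pI
print(round(isoelectric_point("KKRRKK"), 2))  # basic peptide, high pI
```

Applying such a function to every sequence in a proteome and plotting a histogram of the results is one simple way to inspect the multimodal pI distributions discussed above.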
ABSTRACT
Materials informatics is an emerging field that allows us to predict the properties of materials and has been applied in various research and development fields, such as materials science. In particular, solubility factors such as the Hansen and Hildebrand solubility parameters (HSPs and SP, respectively) and Log P are important values for understanding the physical properties of various substances. In this study, we established a solubility prediction tool using a unique machine learning method called the in-phase deep neural network (ip-DNN), which starts exclusively from analytical input data (e.g., NMR information, refractive index, and density) and predicts solubility in multiple steps by first predicting intermediate elements, such as molecular components and molecular descriptors. To improve the accuracy of the prediction, intermediate regression models were employed when performing in-phase machine learning. In addition, we developed a website dedicated to the established solubility prediction method, which is freely available at "http://dmar.riken.jp/matsolca/".
ABSTRACT
Nuclear magnetic resonance (NMR) spectroscopy is commonly used to characterize molecular complexity because it produces informative atomic-resolution data on the chemical structure and molecular mobility of samples non-invasively by means of various acquisition parameters and pulse programs. However, analyzing the accumulated NMR data of mixtures is challenging due to noise and signal overlap. Therefore, data-cleansing steps, such as quality checking, noise reduction, and signal deconvolution, are important processes before spectrum analysis. Here, we have developed an NMR measurement informatics tool for data cleansing that combines short-time Fourier transform (STFT; a time-frequency analytical method) and probabilistic sparse matrix factorization (PSMF) for signal deconvolution and noise factor analysis. Our tool can be applied to the original free induction decay (FID) signals of a one-dimensional NMR spectrum. We show that the signal deconvolution method reduces the noise of FID signals, increasing the signal-to-noise ratio (SNR) about tenfold, and its application to diffusion-edited spectra allows signals of macromolecules and unsuppressed small molecules to be separated by the length of the T2* relaxation time. Noise factor analysis of NMR datasets identified correlations between SNR and acquisition parameters, identifying major experimental factors that can lower SNR.
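The short-time Fourier transform step can be sketched with the Python standard library alone. The example below builds a synthetic, noise-free FID (a single complex exponential with exponential decay; all parameter values are assumptions chosen for illustration), applies a Hann-windowed STFT, and estimates the T2* relaxation time from how the peak magnitude decays between frames. It does not implement the PSMF step or the authors' actual tool.

```python
import cmath
import math


def synthetic_fid(n_points=256, dt=0.001, freq_hz=125.0, t2_star=0.05):
    """Single-component FID: complex oscillation with exponential T2* decay."""
    return [
        cmath.exp(2j * math.pi * freq_hz * i * dt) * math.exp(-i * dt / t2_star)
        for i in range(n_points)
    ]


def stft(signal, window=64, hop=32):
    """Magnitude STFT using a periodic Hann window and a direct DFT per frame."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window) for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        segment = [signal[start + n] * hann[n] for n in range(window)]
        spectrum = [
            abs(sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window)
                    for n in range(window)))
            for k in range(window)
        ]
        frames.append(spectrum)
    return frames


def estimate_t2_star(frames, hop=32, dt=0.001):
    """Estimate T2* from the decay of the peak magnitude between the first two frames."""
    ratio = max(frames[0]) / max(frames[1])
    return hop * dt / math.log(ratio)


frames = stft(synthetic_fid())
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(peak_bins)                  # frequency bin of the strongest component per frame
print(estimate_t2_star(frames))   # close to the simulated T2* of 0.05 s
```

In this toy case each frame is an exactly scaled copy of the previous one, so the decay estimate is clean; with real FIDs, noise and overlapping components make the per-component decay harder to read, which is where a factorization step such as PSMF comes in.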
Subject(s)
Magnetic Resonance Spectroscopy/methods; Magnetic Resonance Spectroscopy/standards; Algorithms; Factor Analysis, Statistical; Models, Theoretical; Signal-To-Noise Ratio
ABSTRACT
InterSpin (http://dmar.riken.jp/interspin/) comprises integrated, supportive, and freely accessible preprocessing webtools and a database to advance signal assignment in low- and high-field NMR analyses of molecular complexities ranging from small molecules to macromolecules for food, material, and environmental applications. To support handling of the broad spectra obtained from solid-state NMR or low-field benchtop NMR, we have developed and evaluated two preprocessing tools: sensitivity improvement with spectral integration, which enhances the signal-to-noise ratio by spectral integration, and peaks separation, which separates overlapping peaks by several algorithms, such as non-negative sparse coding. In addition, the InterSpin Laboratory Information Management System (SpinLIMS) database stores numerous standard spectra ranging from small molecules to macromolecules in solid and solution states (dissolved in polar/nonpolar solvents), and can be searched under various conditions using the following molecular assignment tools. SpinMacro supports easy assignment of macromolecules in natural mixtures via solid-state 13C peaks and dimethyl sulfoxide-dissolved 1H-13C correlation peaks. InterAnalysis improves the accuracy of molecular assignment by integrated analysis of 1H-13C correlation peaks and 1H-J correlation peaks of small molecules dissolved in D2O or deuterated methanol, which supports easy narrowing down of metabolite candidates. Finally, by enabling database interoperability, SpinLIMS's client software will ultimately support scientific discovery by facilitating sharing and reusing of NMR data.
ABSTRACT
BACKGROUND: Whole-proteome distributions of protein isoelectric point (pI) values in different organisms are bi- or trimodal with some variations. It was suggested that the observed multimodality of the proteome-wide pI distributions is associated with subcellular localization-specific differences in the local pI distributions. However, the factors responsible for variation of the intracellular localization-specific pI profiles have not been investigated in detail. RESULTS: In this work, we explored proteome-wide pI distributions of 32,138 human proteins predicted to reside in 10 subcellular compartments, as well as the pI distributions of experimentally observed lysosomal and Golgi proteins. The distributions were found to differ significantly, although all of them adhered to the major recurrent bimodal pattern. Grossly, acid-biased and alkaline-biased patterns with various minor statistical features were observed at different subcellular locations. Bioinformatics analysis revealed the existence of strong statistically significant correlations between protein pI and subcellular localization. Most markedly, protein pI was found to correlate positively with nuclear and mitochondrial locations and negatively with cytoskeletal, cytoplasmic, lysosomal and peroxisomal environment. Further analysis demonstrated that subcellular compartment-specific pI distributions are greatly influenced by local pH and organelle membrane charge. Multiple nonlinear regression analysis identified a polynomial function of the two variables that best fitted the mean pI values of the localization-specific pI distributions. A high coefficient of determination calculated for this regression (R² = 0.98) suggests that local pH and organelle membrane charge are the major factors responsible for variation of the intracellular localization-specific pI profiles. CONCLUSIONS: Our study demonstrates that strong correlations exist between protein pI and subcellular localization. 
The specific pI distributions at different subcellular locations are defined by local environment. Predominantly, it is the local pH and membrane charge that shape the organelle-specific protein pI patterns. These findings expand our understanding of spatial organization of the human proteome.
Subject(s)
Cell Membrane/metabolism; Proteome/metabolism; Golgi Apparatus/metabolism; Humans; Hydrogen-Ion Concentration; Isoelectric Point; Lysosomes/metabolism; Regression Analysis; Subcellular Fractions/metabolism
ABSTRACT
Information about transcription start sites (TSSs) provides baseline data for the analysis of promoter architecture. In this paper we used paired- and single-end deep sequencing to analyze Arabidopsis TSS tags from several libraries prepared from roots, shoots, flowers and etiolated seedlings. The clustering of approximately 33 million mapped TSS tags led to the identification of 324 461 promoters that covered 79.7% (21 672/27 206) of protein-coding genes in the Arabidopsis genome. In addition we identified intragenic, antisense and orphan promoters that were not associated with any gene models. Of these, intragenic promoters exhibited unique characteristics regarding dinucleotide sequences at TSSs and core promoter element composition, suggesting that these promoters use different mechanisms of transcriptional initiation. An analysis of base composition with regard to promoter position revealed a low GC content throughout the promoter region and several local strand biases that were evident for TATA-type promoters, but not for Coreless-type promoters. Most observed strand biases coincided with strand biases of single nucleotide polymorphism rate. Our analysis also revealed that transcription of a gene is supported by an average of 2.7 genic promoters, among which one specific promoter, designated as a top promoter, substantially determines the expression level of the gene.
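The clustering of mapped TSS tags into promoters can be illustrated with a simple distance-based grouping. The Python sketch below is a generic single-linkage approach on one strand of one chromosome; the `max_gap` threshold and the example coordinates are hypothetical, and the abstract does not state which clustering procedure the authors actually used.

```python
from collections import Counter


def cluster_tss_tags(positions, max_gap=20):
    """Group tag positions so that neighbors within `max_gap` bp share a cluster."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def summarize(cluster):
    """Report the span, tag count, and most frequently tagged position of a cluster."""
    counts = Counter(cluster)
    # Most frequent position; ties are broken in favor of the smallest coordinate.
    dominant = min(counts, key=lambda p: (-counts[p], p))
    return {
        "start": cluster[0],
        "end": cluster[-1],
        "tags": len(cluster),
        "dominant_tss": dominant,
    }


# Hypothetical 5' end coordinates of mapped tags on one strand.
tags = [100, 102, 102, 105, 300, 301, 301, 301, 900]
for cluster in cluster_tss_tags(tags):
    print(summarize(cluster))
```

Counting the tags per cluster for each gene is then one way to rank the promoters of a gene and pick out the dominant one, in the spirit of the "top promoter" described above.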
Subject(s)
Arabidopsis/genetics; Promoter Regions, Genetic/genetics; Transcription Initiation Site/physiology; Arabidopsis Proteins/genetics; Gene Expression Regulation, Plant/genetics; Gene Expression Regulation, Plant/physiology
ABSTRACT
Algae are smaller organisms than land plants and offer clear advantages in research over terrestrial species in terms of rapid production, short generation time and varied commercial applications. Thus, studies investigating the practical development of effective algal production are important and will improve our understanding of both aquatic and terrestrial plants. In this study we estimated multiple physicochemical and secondary structural properties of protein sequences, the predicted presence of post-translational modification (PTM) sites, and subcellular localization using a total of 510,123 protein sequences from the proteomes of 31 algal and three plant species. Algal species were broadly selected from green and red algae, glaucophytes, oomycetes, diatoms and other microalgal groups. The results were deposited in the Algal Protein Annotation Suite database (Alga-PrAS; http://alga-pras.riken.jp/), which can be freely accessed online.
Subject(s)
Algal Proteins/metabolism; Databases, Protein; Microalgae/metabolism; Proteome/metabolism; Algal Proteins/classification; Chlorophyta/classification; Chlorophyta/metabolism; Cluster Analysis; Computational Biology/methods; Cyanophora/metabolism; Diatoms/classification; Diatoms/metabolism; Internet; Microalgae/classification; Oomycetes/classification; Oomycetes/metabolism; Plant Proteins/classification; Plant Proteins/metabolism; Plants/classification; Plants/metabolism; Rhodophyta/classification; Rhodophyta/metabolism
ABSTRACT
Cassava anthracnose disease (CAD), caused by the fungus Colletotrichum gloeosporioides f. sp. manihotis, is a serious disease of cassava (Manihot esculenta) worldwide. In this study, we established a cassava oligonucleotide-DNA microarray comprising 59,079 probes corresponding to approximately 30,000 genes, based on original expressed sequence tags and RNA-seq information from cassava, and applied it to investigate the molecular mechanisms of resistance to fungal infection using two cassava cultivars, Huay Bong 60 (HB60, resistant to CAD) and Hanatee (HN, sensitive to CAD). Based on quantitative real-time reverse transcription PCR and expression profiling by the microarray, we showed that the expression levels of various plant defense-related genes, such as pathogenesis-related (PR) genes, cell wall-related genes, detoxification enzyme genes, genes related to the response to bacterium, mitogen-activated protein kinase (MAPK) genes, and genes related to the salicylic acid, jasmonic acid and ethylene pathways, were higher in HB60 than in HN. Our results indicated that the induction of PR genes in HB60 by fungal infection and the higher expression of defense response-related genes in HB60 compared with HN are likely responsible for the fungal resistance of HB60. We also showed that the use of our cassava oligo microarray could improve our understanding of cassava molecular mechanisms related to environmental responses and development, and advance the molecular breeding of useful cassava plants.
Subject(s)
Colletotrichum/physiology; Gene Expression Profiling/methods; Gene Expression Regulation, Plant; Manihot/genetics; Manihot/microbiology; Oligonucleotide Array Sequence Analysis/methods; Plant Diseases/genetics; Plant Diseases/microbiology; Cyclopentanes/metabolism; Ethylenes/metabolism; Gene Ontology; Genes, Plant; Oxylipins/metabolism; Real-Time Polymerase Chain Reaction; Reproducibility of Results; Salicylic Acid/metabolism; Signal Transduction/genetics; Up-Regulation/genetics
ABSTRACT
Cell-free protein synthesis is used to produce proteins with various structural traits. Recent bioinformatics analyses indicate that more than half of eukaryotic proteins possess long intrinsically disordered regions. However, no systematic study concerning the connection between intrinsic disorder and expression success of cell-free protein synthesis has been presented until now. To address this issue, we examined correlations of the experimentally observed cell-free protein expression yields with the contents of intrinsic disorder bioinformatically predicted in the expressed sequences. This analysis revealed strong relationships between intrinsic disorder and protein amenability to heterologous cell-free expression. On the one hand, elevated disorder content was associated with the increased ratio of soluble expression. On the other hand, overall propensity for detectable protein expression decreased with disorder content. We further demonstrated that these tendencies are rooted in some distinct features of intrinsically disordered regions, such as low hydrophobicity, elevated surface accessibility and high abundance of sequence motifs for proteolytic degradation, including sites of ubiquitination and PEST sequences. Our findings suggest that identification of intrinsically disordered regions in the expressed amino acid sequences can be of practical use for predicting expression success and optimizing cell-free protein synthesis.
Subject(s)
Proteins/metabolism; Amino Acid Sequence; Cell-Free System; Hydrophobic and Hydrophilic Interactions; Protein Structure, Tertiary; Proteins/chemistry; Proteins/genetics; Recombinant Proteins/biosynthesis; Recombinant Proteins/genetics; Ubiquitination
ABSTRACT
Recent proteome analyses have reported that intrinsically disordered regions (IDRs) of proteins play important roles in biological processes. In higher plants whose genomes have been sequenced, a correlation between IDRs and post-translational modifications (PTMs) has been reported. The genomes of various eukaryotic algae as common ancestors of plants have also been sequenced. However, no analysis of the relationship between IDRs and protein properties such as structure and PTMs in algae has been reported. Here, we describe correlations between IDR content and the number of PTM sites for phosphorylation, glycosylation, and ubiquitination, and between IDR content and regions rich in proline, glutamic acid, serine, and threonine (PEST) and transmembrane helices in the sequences of 20 algal proteomes. Phosphorylation, O-glycosylation, ubiquitination, and PEST preferentially occurred in disordered regions. In contrast, transmembrane helices were favored in ordered regions. N-glycosylation tended to occur in ordered regions in most of the studied algae; however, it correlated positively with disordered protein content in diatoms. Additionally, we observed that disordered protein content and the number of PTM sites were significantly increased in the species-specific protein clusters compared to common protein clusters among the algae. Moreover, there were specific relationships between IDRs and PTMs among the algae from different groups.
Subject(s)
Algal Proteins/metabolism; Computational Biology/methods; Intrinsically Disordered Proteins/metabolism; Protein Processing, Post-Translational; Algal Proteins/chemistry; Chlorophyta/metabolism; Computer Simulation; Diatoms/metabolism; Intrinsically Disordered Proteins/chemistry; Oomycetes/metabolism; Protein Conformation; Rhodophyta/metabolism; Species Specificity
ABSTRACT
Aminoacyl-tRNA synthetases (ARSs) play an essential role in protein synthesis by catalyzing the attachment of their cognate amino acids to tRNAs. Unlike their prokaryotic counterparts, ARSs in higher eukaryotes form a multiaminoacyl-tRNA synthetase (MARS) complex, consisting of a subset of ARS polypeptides and three auxiliary proteins. The intriguing feature of the MARS complex is the presence of only nine out of twenty ARSs, specific for Arg, Asp, Gln, Glu, Ile, Leu, Lys, Met, and Pro, regardless of the organism, cell, or tissue type. Although the existence of the MARS complex in higher eukaryotes has been known for more than four decades, its functional significance remains elusive. We found that seven of the nine corresponding amino acids (Arg, Gln, Glu, Ile, Leu, Lys, and Met) together with Ala form a predictor of protein α-helicity. Remarkably, all amino acids (besides Ala) in the predictor have the highest possible number of side-chain rotamers. Therefore, the compositional bias of a typical α-helix can contribute to the helix's stability by increasing the entropy of the folded state. It also appears that position-specific α-helical propensity, specifically the periodic alternation of charged and hydrophobic residues in the helices, may well be provided by the structural organization of the complex. Considering the characteristics of the MARS complex from the perspective of α-helicity, we hypothesize that the specific composition and structure of the complex represent a functional mechanism for coordinating translation with the fast and correct folding of amphiphilic α-helices.
Subject(s)
Amino Acyl-tRNA Synthetases/chemistry; Amino Acyl-tRNA Synthetases/metabolism; Amino Acid Sequence; Amino Acids/chemistry; Peptide Fragments/chemistry; Protein Folding; Protein Modification, Translational; Protein Structure, Secondary
ABSTRACT
Arabidopsis thaliana is an important model species for studies of plant gene functions. Research on Arabidopsis has resulted in the generation of high-quality genome sequences, annotations and related post-genomic studies. The amount of annotation, such as gene-coding regions and structures, is steadily growing in the field of plant research. In contrast to the genomics resource of animals and microorganisms, there are still some difficulties with characterization of some gene functions in plant genomics studies. The acquisition of information on protein structure can help elucidate the corresponding gene function because proteins encoded in the genome possess highly specific structures and functions. In this study, we calculated multiple physicochemical and secondary structural parameters of protein sequences, including length, hydrophobicity, the amount of secondary structure, the number of intrinsically disordered regions (IDRs) and the predicted presence of transmembrane helices and signal peptides, using a total of 208,333 protein sequences from the genomes of six representative plant species, Arabidopsis thaliana, Glycine max (soybean), Populus trichocarpa (poplar), Oryza sativa (rice), Physcomitrella patens (moss) and Cyanidioschyzon merolae (alga). Using the PASS tool and the Rosetta Stone method, we annotated the presence of novel functional regions in 1,732 protein sequences that included unannotated sequences from the Arabidopsis and rice proteomes. These results were organized into the Plant Protein Annotation Suite database (Plant-PrAS), which can be freely accessed online at http://plant-pras.riken.jp/.
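Two of the simpler per-sequence parameters mentioned above, length and hydrophobicity, can be computed as follows. This Python sketch uses the Kyte-Doolittle hydropathy scale and reports the grand average of hydropathy (GRAVY); the choice of that scale is an assumption made for illustration, since the abstract does not say which hydrophobicity measure the database uses, and the example sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence must not be empty")
    # Raises KeyError for non-standard residue codes such as X or B.
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


def basic_properties(sequence):
    """Return the length and GRAVY score of a protein sequence."""
    return {"length": len(sequence), "gravy": gravy(sequence)}


# Hypothetical example sequences: a hydrophobic stretch and a charged stretch.
print(basic_properties("ILVILVAAFL"))
print(basic_properties("KRDEKRDEKR"))
```

Positive GRAVY values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one.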
Subject(s)
Databases, Protein; Information Storage and Retrieval; Plant Proteins/chemistry; Plants/metabolism; Proteome; Arabidopsis/genetics; Arabidopsis/metabolism; Bryopsida/genetics; Bryopsida/metabolism; Chromosome Mapping; Internet; Molecular Sequence Annotation; Open Reading Frames; Oryza/genetics; Oryza/metabolism; Plant Proteins/genetics; Plant Proteins/metabolism; Plants/genetics; Populus/genetics; Populus/metabolism; Rhodophyta/genetics; Rhodophyta/metabolism
ABSTRACT
MOTIVATION: Protein structural research in plants lags behind that in animal and bacterial species. This lag concerns both the structural analysis of individual proteins and the proteome-wide characterization of structure-related properties. Until now, no systematic study concerning the relationships between protein disorder and multiple post-translational modifications (PTMs) in plants has been presented. RESULTS: In this work, we calculated the global degree of intrinsic disorder in the complete proteomes of eight typical monocotyledonous and dicotyledonous plant species. We further predicted multiple sites for phosphorylation, glycosylation, acetylation and methylation and examined the correlations of protein disorder with the presence of the predicted PTM sites. It was found that phosphorylation, acetylation and O-glycosylation displayed a clear preference for occurrence in disordered regions of plant proteins. In contrast, methylation tended to avoid disordered sequence, whereas N-glycosylation did not show a universal structural preference in monocotyledonous and dicotyledonous plants. In addition, the analysis performed revealed significant differences between the integral characteristics of monocot and dicot proteomes. They included elevated disorder degree, increased rate of O-glycosylation and R-methylation, decreased rate of N-glycosylation, K-acetylation and K-methylation in monocotyledonous plant species, as compared with dicotyledonous species. Altogether, our study provides the most compelling evidence so far for the connection between protein disorder and multiple PTMs in plants. CONTACT: tokmak@phoenix.kobe-u.ac.jp or tetsuya.sakurai@riken.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Plant Proteins/chemistry; Plants/chemistry; Protein Processing, Post-Translational; Acetylation; Glycosylation; Methylation; Phosphorylation; Proteome/chemistry
ABSTRACT
Cell-free protein synthesis offers substantial advantages over cell-based expression, allowing direct access to the protein synthetic reaction and meticulous control over the reaction conditions. Recently, we identified a number of statistically significant correlations between calculated and predicted properties of amino acid sequences and their amenability to heterologous cell-free expression. These correlations can be of practical use for predicting expression success and optimizing cell-free protein synthesis. In this chapter, we describe our approach and demonstrate how computational and predictive bioinformatics can be used to analyze and optimize cell-free protein expression.