ABSTRACT
BACKGROUND: Although the development of sequencing technologies has provided a large number of protein sequences, determining the function of each one remains difficult because of the effort that laboratory methods require, which makes computational methods necessary to narrow this gap. As the main source of information available about proteins is their sequence, approaches that use this information, such as classification based on amino acid patterns and inference based on sequence similarity using alignment tools, can predict functions for a large collection of proteins. The methods available in the literature that use this type of feature can achieve good results; however, they restrict the length of the protein sequences accepted as input to their models. In this work, we present a new method, called TEMPROT, based on the fine-tuning and extraction of embeddings from an available architecture pre-trained on protein sequences. We also describe TEMPROT+, an ensemble of TEMPROT and BLASTp, a local alignment tool that analyzes sequence similarity, which improves on the results of the former approach. RESULTS: The evaluation of our proposed classifiers against approaches from the literature was conducted on our dataset, which was derived from the CAFA3 challenge database. Both TEMPROT and TEMPROT+ achieved competitive results on the [Formula: see text], [Formula: see text], AuPRC and IAuPRC metrics on the Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) ontologies compared to state-of-the-art models, with the main results equal to 0.581, 0.692 and 0.662 of [Formula: see text] on BP, CC and MF, respectively. CONCLUSIONS: The comparison with the literature showed that our model presented competitive results compared with the state-of-the-art approaches considering amino acid sequence pattern recognition and homology analysis. Our model also presented improvements related to the input size that the model can use for training compared to the literature methods.
Subject(s)
Amino Acids , Proteins , Proteins/chemistry , Molecular Sequence Annotation , Amino Acid Sequence , Amines
ABSTRACT
BACKGROUND: Driver mutations are the genetic components responsible for tumor initiation and progression. These variants, which may be inherited, influence cancer risk and therefore underlie many familial cancers. The present study examines the potential association between SNPs in driver genes SF3B1 (rs4685), TBX3 (rs12366395, rs8853, and rs1061651) and MAP3K1 (rs72758040) and BC in BRCA1/2-negative Chilean families. METHODS: The SNPs were genotyped in 486 BC cases and 1258 controls by TaqMan Assay. RESULTS: Our data do not support an association between rs4685:C > T, rs8853:T > C, or rs1061651:T > C and BC risk. However, the rs12366395-G allele (A/G + G/G) was associated with risk in families with a strong history of BC (OR = 1.2 [95% CI 1.0-1.6] p = 0.02 and OR = 1.5 [95% CI 1.0-2.2] p = 0.02, respectively). Moreover, rs72758040-C was associated with increased risk in cases with a moderate-to-strong family history of BC (OR = 1.3 [95% CI 1.0-1.7] p = 0.02 and OR = 1.3 [95% CI 1.0-1.8] p = 0.03 respectively). Finally, risk was significantly higher in homozygous C/C cases from families with a moderate-to-strong BC history (OR = 1.8 [95% CI 1.0-3.1] p = 0.03 and OR = 1.9 [95% CI 1.1-3.4] p = 0.01, respectively). We also evaluated the combined impact of rs12366395-G and rs72758040-C. Familial BC risk increased in a dose-dependent manner with risk allele count, reflecting an additive effect (p-trend = 0.0002). CONCLUSIONS: Our study suggests that germline variants in driver genes TBX3 (rs12366395) and MAP3K1 (rs72758040) may influence BC risk in BRCA1/2-negative Chilean families. Moreover, the presence of rs12366395-G and rs72758040-C could increase BC risk in a Chilean population.
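The odds ratios and 95% confidence intervals reported above follow the usual arithmetic on a 2x2 table of carriers and non-carriers among cases and controls. The following is a minimal sketch of that arithmetic, assuming the Woolf (log-based) interval; the abstract does not state which interval method or adjustment the authors used. The function name and the counts in the usage note are hypothetical and are not the study's data.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    Uses the Woolf method: the standard error of log(OR) is the square
    root of the sum of the reciprocals of the four cell counts.
    z=1.96 gives an approximate 95% confidence interval.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, with made-up counts of 60 carriers and 40 non-carriers among cases and 50 of each among controls, `odds_ratio_ci(60, 40, 50, 50)` gives an odds ratio of 1.5 together with its interval bounds.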
Subject(s)
Breast Neoplasms , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Chile/epidemiology , Female , Genetic Predisposition to Disease/genetics , Genomics , Humans
ABSTRACT
We present a detailed heuristic method to quantify the degree of local energetic frustration manifested by protein molecules. Current applications are realized in computational experiments where a protein structure is visualized highlighting the energetic conflicts or the concordance of the local interactions in that structure. Minimally frustrated linkages highlight the stable folding core of the molecule. Sites of high local frustration, in contrast, often indicate functionally relevant regions such as binding, active, or allosteric sites.
Subject(s)
Protein Conformation , Models, Molecular , Protein Folding , Proteins , Thermodynamics
ABSTRACT
Cell penetrating peptides (CPPs) are molecules capable of passing through biological membranes. This capacity has been used to deliver impermeable molecules into cells, such as drugs and DNA probes, among others. However, the internalization of these peptides lacks specificity: CPPs are internalized indiscriminately by different cell types. Two major approaches have been described to address this problem: (i) targeting, in which a receptor-recognizing sequence is added to a CPP, and (ii) activation, where a non-active form of the CPP is activated once it interacts with cell target components. These strategies result in multifunctional peptides (i.e., penetration and target recognition) that increase the CPP's length, the cost of synthesis and the likelihood of being degraded or becoming antigenic. In this work we describe the use of machine-learning methods to design short selective CPPs; the reduction in size is accomplished by embedding two or more activities within a single CPP domain, hence we refer to these as moonlighting CPPs. We provide experimental evidence that these designed moonlighting peptides penetrate selectively into targeted cells and discuss areas of opportunity to improve the design of these peptides.
ABSTRACT
We present a new phylogenetic approach, selection on amino acids and codons (SelAC), whose substitution rates are based on a nested model linking protein expression to population genetics. Unlike simpler codon models that assume a single substitution matrix for all sites, our model more realistically represents the evolution of protein-coding DNA under the assumption of consistent, stabilizing selection using a cost-benefit approach. This cost-benefit approach allows us to generate a set of 20 optimal amino acid-specific matrix families using just a handful of parameters and naturally links the strength of stabilizing selection to protein synthesis levels, which we can estimate. Using a yeast data set of 100 orthologs for 6 taxa, we find SelAC fits the data much better than popular models by 10^4-10^5 Akaike information criterion units adjusted for small sample bias. Our results also indicated that nested, mechanistic models better predict observed data patterns, highlighting the improvement in biological realism in amino acid sequence evolution that our model provides. Additional parameters estimated by SelAC indicate that a large amount of nonphylogenetic, but biologically meaningful, information can be inferred from existing data. For example, SelAC prediction of gene-specific protein synthesis rates correlates well with both empirical (r=0.33-0.48) and other theoretical predictions (r=0.45-0.64) for multiple yeast species. SelAC also provides estimates of the optimal amino acid at each site. Finally, because SelAC is a nested approach based on clearly stated biological assumptions, future modifications, such as including shifts in the optimal amino acid sequence within or across lineages, are possible.
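The Akaike information criterion adjusted for small sample bias is commonly written AICc. As a rough illustration of how such model-fit differences are computed, here is a minimal sketch using the standard AICc formula; the function and parameter names are hypothetical, and this is not code from SelAC itself.

```python
def aicc(log_likelihood, n_params, n_samples):
    """Small-sample corrected Akaike information criterion.

    AIC  = 2k - 2 ln(L)
    AICc = AIC + 2k(k + 1) / (n - k - 1)
    where k is the number of free parameters and n the sample size.
    Lower values indicate a better trade-off between fit and complexity.
    """
    k, n = n_params, n_samples
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n_samples > n_params + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def delta_aicc(model_a, model_b):
    """Difference in AICc between two (log_likelihood, k, n) tuples.

    A positive result means model_b has the lower (better) AICc.
    """
    return aicc(*model_a) - aicc(*model_b)
```

Two models fitted to the same data can then be compared with `delta_aicc`; a difference on the order of thousands of units, as reported above, is far beyond what the correction term alone would produce.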
Subject(s)
Amino Acid Substitution , Genetic Techniques , Models, Genetic , Phylogeny , Selection, Genetic , Genetics, Population/methods
ABSTRACT
Root-knot nematodes are a group of endoparasitic species that induce the formation of giant cells in their hosts, by which they guarantee their feeding and development. Meloidogyne species infect over 2000 plant species and are highly destructive, causing damage to many crops around the world. M. enterolobii is considered the most aggressive species in tropical regions, such as Africa and South America. Phytonematodes are able to penetrate and migrate within plant tissues, establishing a sophisticated interaction with their hosts through parasitism factors, which include a series of cell wall degradation enzymes and plant cell modification. Among the parasitism factors documented in M. enterolobii is cellulose binding protein (CBP), a protein excreted by the nematode that appears to be associated with the breakdown of cellulose present in the plant cell wall. In silico analysis can be of great importance for the identification and the structural and functional characterization of genomic sequences, besides making possible the prediction of the structures and functions of proteins. The present work characterized 12 sequences of the CBP protein of nematodes of the genus Meloidogyne present in genomic databases. The results showed that all CBP sequences had a signal peptide and that, after its removal, they had an isoelectric point that characterized them as unstable in an acid medium. The values of the average hydrophilicity demonstrated the hydrophilic character of the analyzed sequences. Phylogenetic analyses were also consistent with the taxonomic classification of the nematode species of this study. Five motifs were identified, which are present in all sequences analyzed. These results may provide theoretical grounds for future studies of plant resistance to nematode infection.
Subject(s)
Parasitic Diseases , Computer Simulation , Cell Wall , Computational Biology/methods , Nematoda
ABSTRACT
Protein function is a concept that can have different interpretations in different biological contexts, and the number and diversity of novel proteins identified by large-scale "omics" technologies pose increasingly new challenges. In this review we explore current strategies used to predict protein function focused on high-throughput sequence analysis, for example, inference based on sequence similarity, sequence composition, structure, and protein-protein interaction. Various prediction strategies are discussed together with illustrative workflows highlighting the use of some benchmark tools and knowledge bases in the field.
Subject(s)
Computational Biology/methods , Proteins/chemistry , Software , Algorithms , Databases, Protein , Phylogeny , Proteins/classification , Sequence Alignment , Sequence Analysis, Protein
ABSTRACT
Two-component systems (TCS) are protein machineries that enable cells to respond to input signals. Histidine kinases (HK) are the sensory component, transferring information toward downstream response regulators (RR). HKs transfer phosphoryl groups to their specific RRs, but also dephosphorylate them, overall ensuring proper signaling. The mechanisms by which HKs discriminate between such disparate directions are as yet unknown. We now disclose crystal structures of the HK:RR complex DesK:DesR from Bacillus subtilis, comprising snapshots of the phosphotransfer and the dephosphorylation reactions. The HK dictates the reactional outcome through conformational rearrangements that include the reactive histidine. The phosphotransfer center is asymmetric, poised for dissociative nucleophilic substitution. The structural bases of HK phosphatase/phosphotransferase control are uncovered, and the unexpected discovery of a dissociative reactional center sheds light on the evolution of TCS phosphotransfer reversibility. Our findings should be applicable to a broad range of signaling systems and instrumental in synthetic TCS rewiring.
Subject(s)
Bacillus subtilis/enzymology , Histidine Kinase/chemistry , Histidine Kinase/metabolism , Signal Transduction , Transcription Factors/chemistry , Transcription Factors/metabolism , Crystallography, X-Ray , Models, Molecular , Phosphorylation , Protein Conformation , Protein Processing, Post-Translational
ABSTRACT
BACKGROUND: Hierarchical Multi-Label Classification is a classification task where the classes to be predicted are hierarchically organized. Each instance can be assigned to classes belonging to more than one path in the hierarchy. This scenario is typically found in protein function prediction, considering that each protein may perform many functions, which can be further specialized into sub-functions. We present a new hierarchical multi-label classification method based on multiple neural networks for the task of protein function prediction. A set of neural networks is incrementally trained, each being responsible for the prediction of the classes belonging to a given level. RESULTS: The method proposed here is an extension of our previous work. Here we use the neural network output of a level to complement the feature vectors used as input to train the neural network in the next level. We experimentally compare this novel method with several other reduction strategies, showing that it obtains the best predictive performance. Empirical results also show that the proposed method achieves better or comparable predictive performance when compared with state-of-the-art methods for hierarchical multi-label classification in the context of protein function prediction. CONCLUSIONS: The experiments showed that using the output in one level as input to the next level contributed to better classification results. We believe the method was able to learn the relationships between the protein functions during training, and this information was useful for classification. We also identified in which functional classes our method performed better.
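The level-by-level scheme described above, in which the output of one level's network complements the feature vector given to the next level's network, can be sketched as follows. This is only an illustration under stated assumptions: `TinyMLP`, `train_levels` and `predict_levels` are hypothetical names, the one-hidden-layer network trained by plain gradient descent is a stand-in for whatever architecture and training procedure the authors used, and feeding each level the original features plus the previous level's output (rather than an accumulating vector) is an assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyMLP:
    """One-hidden-layer network with sigmoid outputs (one per class)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def _forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        return H, sigmoid(H @ self.W2 + self.b2)

    def fit(self, X, Y, epochs=200, lr=0.5):
        n = len(X)
        for _ in range(epochs):
            H, P = self._forward(X)
            dZ2 = (P - Y) / n                    # cross-entropy gradient at the output logits
            dH = (dZ2 @ self.W2.T) * (1 - H**2)  # backpropagate through tanh
            self.W2 -= lr * (H.T @ dZ2)
            self.b2 -= lr * dZ2.sum(axis=0)
            self.W1 -= lr * (X.T @ dH)
            self.b1 -= lr * dH.sum(axis=0)
        return self

    def predict_proba(self, X):
        return self._forward(X)[1]


def train_levels(X, Y_levels, n_hidden=8, seed=0):
    """Train one network per hierarchy level, top level first.

    Y_levels[i] is a binary matrix (samples x classes at level i).
    """
    rng = np.random.default_rng(seed)
    nets, inputs = [], X
    for Y in Y_levels:
        net = TinyMLP(inputs.shape[1], n_hidden, Y.shape[1], rng).fit(inputs, Y)
        nets.append(net)
        # Next level sees the original features plus this level's output.
        inputs = np.hstack([X, net.predict_proba(inputs)])
    return nets


def predict_levels(nets, X):
    """Return one probability matrix per level, chaining outputs as in training."""
    outputs, inputs = [], X
    for net in nets:
        P = net.predict_proba(inputs)
        outputs.append(P)
        inputs = np.hstack([X, P])
    return outputs
```

Calling `train_levels(X, [Y_level1, Y_level2])` and then `predict_levels(nets, X)` yields per-level class probabilities; thresholding them and enforcing consistency with the class hierarchy are left out of this sketch.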
Subject(s)
Neural Networks, Computer , Proteins/physiology , Proteins/classification , Proteins/metabolism
ABSTRACT
Structural differences between conformers sustain protein biological function. Here, using a large dataset of 745 intrinsically disordered proteins, we studied how order-disorder transitions modulate structural differences between conformers as derived from crystallographic data. We found that almost 50% of the proteins studied show no transitions and have low conformational diversity, while the rest show transitions and a higher conformational diversity. In this last subset, 60% of the proteins become more ordered after ligand binding, while 40% become more disordered. As protein conformational diversity is inherently connected with protein function, our analysis suggests differences in structure-function relationships related to order-disorder transitions.
Subject(s)
Databases, Protein , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/genetics , Protein Conformation
ABSTRACT
The NADH oxidase family of enzymes catalyzes the oxidation of NADH by reducing molecular O2 to H2O2, H2O or both. In the protozoan parasite Giardia lamblia, the NADH oxidase enzyme (GlNOX) produces H2O as end product without production of H2O2. GlNOX has been implicated in the parasite metabolism, the intracellular redox regulation and the resistance to drugs currently used against giardiasis; therefore, it is an interesting protein from diverse perspectives. In this work, the GlNOX gene was amplified from genomic G. lamblia DNA and expressed in Escherichia coli as a His-Tagged protein; then, the enzyme was purified by immobilized metal affinity chromatography, characterized, and its properties compared with those of the endogenous enzyme previously isolated from trophozoites (Brown et al. in Eur J Biochem 241(1):155-161, 1996). In comparison with the trophozoite-extracted enzyme, which was scarce and unstable, the recombinant heterologous expression system and one-step purification method produce a stable protein preparation with high yield and purity. The recombinant enzyme mostly resembles the endogenous protein; where differences were found, these were attributable to methodological discrepancies or artifacts. This homogenous, pure and functional protein preparation can be used for detailed structural or functional studies of GlNOX, which will provide a deeper understanding of the biology and pathogeny of G. lamblia.
Subject(s)
Giardia lamblia/enzymology , Multienzyme Complexes/isolation & purification , Multienzyme Complexes/metabolism , NADH, NADPH Oxidoreductases/isolation & purification , NADH, NADPH Oxidoreductases/metabolism , Protozoan Proteins/isolation & purification , Protozoan Proteins/metabolism , Recombinant Proteins/isolation & purification , Recombinant Proteins/metabolism , Amino Acid Sequence , Cloning, Molecular , Escherichia coli/genetics , Giardia lamblia/genetics , Kinetics , Molecular Sequence Data , Multienzyme Complexes/chemistry , Multienzyme Complexes/genetics , NADH, NADPH Oxidoreductases/chemistry , NADH, NADPH Oxidoreductases/genetics , Oxidation-Reduction , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Recombinant Proteins/chemistry , Recombinant Proteins/genetics , Sequence Alignment
ABSTRACT
Background: Rhipicephalus microplus is a monogenetic, hematophagous ectoparasite that has a large economic impact due to associated losses in the cattle industry. Glycogen synthase kinase 3 is a highly conserved and ubiquitously expressed protein in several species. It has been identified as a GSK-3 isoform in the cattle tick, and is involved in the modulation of glycogen synthase activity, as a regulator of glycogen synthesis with a role in the energy metabolism of R. microplus. It is a fundamental kinase for embryo development, and is directly linked with the R. microplus reproductive process. Thus, the aim of this study was to investigate the role of GSK-3 in R. microplus physiology by inhibiting its activity. Materials, Methods & Results: In vitro and in vivo assays were carried out to test the effects of immunological and chemical GSK-3 inhibition. Synthetic peptides were designed by an in silico analysis of the protein antigenic index. Rabbit antibodies were raised against the designed synthetic peptides based on the R. microplus GSK-3 sequence (anti-SRm0218 and anti-SRm86100). To assess whether the inhibition of GSK-3 would result in a decrease in fertility and hatching, the purified IgG from rabbit sera was used to feed partially engorged tick females. Results show that the antibodies were not capable of affecting egg production. The same antibodies were used in an in vitro...
Subject(s)
Glycogen Synthase Kinase 3 beta/analysis , Rhipicephalus/physiology , Glycogen Synthase Kinase 3 beta/antagonists & inhibitors , In Vitro Techniques
ABSTRACT
Predicting enzyme class from protein structure parameters is a challenging problem in protein analysis. We developed a method to predict enzyme class that combines the strengths of statistical and data-mining methods. This method has a strong mathematical foundation and is simple to implement, achieving an accuracy of 45%. A comparison with the methods found in the literature designed to predict enzyme class showed that our method outperforms the existing methods.