Results 1 - 20 of 48
1.
Comput Struct Biotechnol J ; 20: 1702-1715, 2022.
Article in English | MEDLINE | ID: mdl-35495120

ABSTRACT

SPARC facilitates the generation of plausible hypotheses regarding underlying biochemical mechanisms by structurally characterizing protein sequence constraints. Such constraints appear as residues co-conserved in functionally related subgroups, as subtle pairwise correlations (i.e., direct couplings), and as correlations among these sequence features or with structural features. SPARC performs three types of analyses. First, based on pairwise sequence correlations, it estimates the biological relevance of alternative conformations and of homomeric contacts, as illustrated here for death domains. Second, it estimates the statistical significance of the correspondence between directly coupled residue pairs and interactions at heterodimeric interfaces. Third, given molecular dynamics simulated structures, it characterizes interactions among constrained residues or between such residues and ligands that: (a) are stably maintained during the simulation; (b) undergo correlated formation and/or disruption of interactions with other constrained residues; or (c) switch between alternative interactions. We illustrate this for two homohexameric complexes: the bacterial enhancer binding protein (bEBP) NtrC1, which activates transcription by remodeling RNA polymerase (RNAP) containing σ54, and DnaB helicase, which opens DNA at the bacterial replication fork. Based on the NtrC1 analysis, we hypothesize possible mechanisms for inhibiting ATP hydrolysis until ADP is released from an adjacent subunit and for coupling ATP hydrolysis to restructuring of σ54 binding loops. Based on the DnaB analysis, we hypothesize that DnaB 'grabs' ssDNA by flipping every fourth base and inserting it into cavities between subunits and that flipping of a DnaB-specific glutamine residue triggers ATP hydrolysis.

2.
Int J Mol Sci ; 23(6), 2022 Mar 11.
Article in English | MEDLINE | ID: mdl-35328445

ABSTRACT

Semaphorin 4A (Sema4A) exerts a stabilizing effect on human Treg cells in PBMC and CD4+ T cell cultures by engaging Plexin B1. Sema4A deficient mice display enhanced allergic airway inflammation accompanied by fewer Treg cells, while Sema4D deficient mice display reduced inflammation and increased Treg cell numbers, even though both Sema4 subfamily members engage Plexin B1. The main objectives of this study were (1) to compare the in vitro effects of Sema4A and Sema4D proteins on human Treg cells and (2) to identify function-determining residues in Sema4A critical for binding to Plexin B1 based on Sema4D homology modeling. We report here that Sema4A and Sema4D display opposite effects on human Treg cells in in vitro PBMC cultures; Sema4D reduced CD4+CD25+Foxp3+ cell numbers and CD25/Foxp3 expression. Sema4A and Sema4D competitively bind to Plexin B1 in vitro and hence may do so in vivo as well. Bayesian Partitioning with Pattern Selection (BPPS) partitioned 4505 Sema domains from diverse organisms into subgroups based on distinguishing sequence patterns that are likely responsible for functional differences. BPPS groups Sema3 and Sema4 into one family and further separates Sema4A and Sema4D into distinct subfamilies. Residues distinctive of the Sema3,4 family and of Sema4A (and by homology of Sema4D) tend to cluster around the Plexin B1 binding site. This suggests that the residues both common to and distinctive of Sema4A and Sema4D may mediate binding to Plexin B1, with subfamily residues mediating functional specificity. We mutated the Sema4A-specific residues M198 and F223 to alanine; notably, F223 in Sema4A corresponds to alanine in Sema4D. Mutant proteins were assayed for Plexin B1-binding and Treg stimulation activities. The F223A mutant was unable to stimulate Treg stability in in vitro PBMC cultures despite binding Plexin B1 with an affinity similar to the WT protein. This research is a first step in generating potent mutant Sema4A molecules with stimulatory function for Treg cells with a view to designing immunotherapeutics for asthma.


Subject(s)
Leukocytes, Mononuclear , Semaphorins/metabolism , Alanine , Animals , Bayes Theorem , Forkhead Transcription Factors/genetics , Humans , Inflammation , Leukocytes, Mononuclear/metabolism , Mice , Nerve Tissue Proteins/metabolism
4.
Sci Rep ; 11(1): 17663, 2021 09 03.
Article in English | MEDLINE | ID: mdl-34480063

ABSTRACT

De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing , Transcriptome , Bayes Theorem , Computer Simulation , Humans , Sequence Analysis, RNA
5.
PLoS Comput Biol ; 17(7): e1009203, 2021 07.
Article in English | MEDLINE | ID: mdl-34292930

ABSTRACT

Transcription factors (TFs) often function as a module including both master factors and mediators binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, often many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the ChIP-GSM inferred modules form as groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.


Subject(s)
Chromatin Immunoprecipitation Sequencing/methods , Gene Regulatory Networks , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Bayes Theorem , Binding Sites/genetics , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation Sequencing/statistics & numerical data , Computational Biology , Enhancer Elements, Genetic , Epigenesis, Genetic , Gene Expression Regulation , Humans , K562 Cells , MCF-7 Cells , Models, Statistical , Promoter Regions, Genetic
6.
Bioinformatics ; 37(20): 3456-3463, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-33983436

ABSTRACT

MOTIVATION: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. RESULTS: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. AVAILABILITY AND IMPLEMENTATION: The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
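
The congruence between high-scoring directly coupled pairs and 3D contacts can be illustrated with a much simpler descriptive quantity: the fraction of top-ranked pairs that are structural contacts. The sketch below is not the eCOMPASS significance measure. The function names, the 8 Å cutoff, the minimum sequence separation, and the one-coordinate-per-residue input are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def contact_pairs(coords, cutoff=8.0, min_separation=5):
    """Return residue index pairs (i, j), i < j, that count as 3D contacts.

    coords: one (x, y, z) tuple per residue (e.g. a representative atom).
    cutoff and min_separation are illustrative defaults, not values
    taken from the paper.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if j - i >= min_separation and dist(coords[i], coords[j]) <= cutoff
    }


def top_pair_precision(dc_scores, contacts, n_top):
    """Fraction of the n_top highest-scoring pairs that are contacts.

    dc_scores: dict mapping (i, j) pairs to a direct coupling score.
    """
    ranked = sorted(dc_scores, key=dc_scores.get, reverse=True)[:n_top]
    if not ranked:
        return 0.0
    return sum(pair in contacts for pair in ranked) / len(ranked)
```

A misaligned MSA would be expected to lower this fraction, because couplings would then be assigned to columns that do not correspond to the residues in contact; turning that intuition into a significance estimate is the part this sketch leaves out.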

7.
BMC Bioinformatics ; 22(1): 193, 2021 Apr 15.
Article in English | MEDLINE | ID: mdl-33858322

ABSTRACT

BACKGROUND: ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq 'peak' observations and are well captured, other regions are only weakly bound by proteins and show relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet, it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating these weak binding sites from the amplified background DNA. RESULTS: ChIP-BIT2 (http://sourceforge.net/projects/chipbitc/) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability. CONCLUSION: ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend the existing binding site collection, providing additional regulatory regions for decoding the mechanism of gene expression regulation.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Bayes Theorem , Binding Sites , Chromatin Immunoprecipitation , Oligonucleotide Array Sequence Analysis , Sequence Analysis, DNA
8.
Sci Rep ; 11(1): 385, 2021 01 11.
Article in English | MEDLINE | ID: mdl-33432018

ABSTRACT

Exploring complex modularization of intracellular signal transduction pathways is critical to understanding aberrant cellular responses during disease development and drug treatment. IMPALA (Inferred Modularization of PAthway LAndscapes) integrates information from high throughput gene expression experiments and genome-scale knowledge databases to identify aberrant pathway modules, thereby providing a powerful sampling strategy to reconstruct and explore pathway landscapes. Here IMPALA identifies pathway modules associated with breast cancer recurrence and Tamoxifen resistance. Focusing on estrogen-receptor (ER) signaling, IMPALA identifies alternative pathways from gene expression data of Tamoxifen treated ER positive breast cancer patient samples. These pathways were often interconnected through cytoplasmic genes such as IRS1/2, JAK1, YWHAZ, CSNK2A1, MAPK1 and HSP90AA1 and significantly enriched with ErbB, MAPK, and JAK-STAT signaling components. Characterization of the pathway landscape revealed key modules associated with ER signaling and with cell cycle and apoptosis signaling. We validated IMPALA-identified pathway modules using data from four different breast cancer cell lines, including models sensitive and resistant to Tamoxifen. Results showed that a majority of genes in cell cycle/apoptosis modules that were up-regulated in breast cancer patients with short survival (< 5 years) were also over-expressed in drug resistant cell lines, whereas the transcription factors JUN, FOS, and STAT3 were down-regulated in both patient samples and drug resistant cell lines. Hence, IMPALA identified pathways were associated with Tamoxifen resistance and an increased risk of breast cancer recurrence. The IMPALA package is available at https://dlrl.ece.vt.edu/software/.


Subject(s)
Breast Neoplasms/pathology , Computational Biology , Neoplasm Recurrence, Local/genetics , Algorithms , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Drug Resistance, Neoplasm/genetics , Female , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks/physiology , Genes, BRCA1 , Humans , Neoplasm Metastasis , Neoplasm Recurrence, Local/metabolism , Receptor, ErbB-2/genetics , Receptor, ErbB-2/metabolism , Receptors, Estrogen/genetics , Receptors, Estrogen/metabolism , Signal Transduction/genetics , Tamoxifen/pharmacology , Tamoxifen/therapeutic use
9.
Bioinformatics ; 37(5): 650-658, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33016988

ABSTRACT

MOTIVATION: High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. RESULTS: We present a Bayesian method, integrated assembly of phenotype-specific transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real datasets show that IntAPT consistently outperforms existing methods. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits a robust performance among multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. AVAILABILITY AND IMPLEMENTATION: The IntAPT package is available at http://github.com/henryxushi/IntAPT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
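
To make the idea of a spike-and-slab prior concrete, the following minimal sketch draws isoform abundances from a generic prior of this kind: with probability `p_expressed` an isoform is "on" and its abundance comes from a continuous slab, otherwise it sits exactly at zero (the spike). This is not the IntAPT model. The gamma slab, its parameters, and all names are assumptions chosen only for illustration.

```python
import random


def sample_spike_and_slab(n_isoforms, p_expressed,
                          slab_shape=2.0, slab_scale=1.0, rng=None):
    """Draw (expressed, abundance) pairs from a generic spike-and-slab prior.

    Spike: a point mass at abundance 0 for isoforms that are not expressed.
    Slab:  a gamma distribution (an assumed choice) for expressed isoforms.
    """
    if rng is None:
        rng = random.Random()
    draws = []
    for _ in range(n_isoforms):
        expressed = rng.random() < p_expressed
        abundance = rng.gammavariate(slab_shape, slab_scale) if expressed else 0.0
        draws.append((expressed, abundance))
    return draws
```

A small `p_expressed` makes most draws exactly zero, which is how such a prior favors sparse sets of expressed isoforms.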


Subject(s)
Gene Expression Profiling , Transcriptome , Bayes Theorem , Phenotype , RNA-Seq , Sequence Analysis, RNA , Software
10.
Sci Rep ; 10(1): 16962, 2020 Oct 07.
Article in English | MEDLINE | ID: mdl-33028952

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

11.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32500917

ABSTRACT

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from these. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.


Subject(s)
Databases, Protein , Proteins , Sequence Alignment/methods , Machine Learning , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein , Software
12.
Sci Rep ; 10(1): 7960, 2020 05 14.
Article in English | MEDLINE | ID: mdl-32409786

ABSTRACT

Genome-wide transcription factor (TF) binding signal analyses reveal co-localization of TF binding sites based on inferred cis-regulatory modules (CRMs). CRMs play a key role in understanding the cooperation of multiple TFs under specific conditions. However, the functions of CRMs and their effects on nearby gene transcription are highly dynamic and context-specific and therefore are challenging to characterize. BICORN (Bayesian Inference of COoperative Regulatory Network) builds a hierarchical Bayesian model and infers context-specific CRMs based on TF-gene binding events and gene expression data for a particular cell type. BICORN automatically searches for a list of candidate CRMs based on the input TF bindings at regulatory regions associated with genes of interest. Applying Gibbs sampling, BICORN iteratively estimates model parameters of CRMs, TF activities, and corresponding regulation on gene transcription, which it models as a sparse network of functional CRMs regulating target genes. The BICORN package is implemented in R (version 3.4 or later) and is publicly available on the CRAN server at https://cran.r-project.org/web/packages/BICORN/index.html.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Regulatory Sequences, Nucleic Acid/genetics , Bayes Theorem , Cell Line , Humans , Software
13.
Immunogenetics ; 72(3): 181-203, 2020 04.
Article in English | MEDLINE | ID: mdl-32002590

ABSTRACT

Toll-interleukin-1R resistance (TIR) domains are ubiquitously present in all forms of cellular life. They are most commonly found in signaling proteins, as units responsible for signal-dependent formation of protein complexes that enable amplification and spatial propagation of the signal. A less common function of TIR domains is their ability to catalyze nicotinamide adenine dinucleotide degradation. This survey analyzes 26,414 TIR domains, automatically classified based on group-specific sequence patterns presumably determining biological function, using a statistical approach termed Bayesian partitioning with pattern selection (BPPS). We examine these groups and patterns in the light of available structures and biochemical analyses. Proteins within each of thirteen eukaryotic groups (10 metazoans and 3 plants) typically appear to perform similar functions, whereas proteins within each prokaryotic group typically exhibit diverse domain architectures, suggesting divergent functions. Groups are often uniquely characterized by structural fold variations associated with group-specific sequence patterns and by herein identified sequence motifs defining TIR domain functional divergence. For example, BPPS identifies, in helices C and D of TIRAP and MyD88 orthologs, conserved surface-exposed residues apparently responsible for specificity of TIR domain interactions. In addition, BPPS clarifies the functional significance of the previously described Box 2 and Box 3 motifs, each of which is a part of a larger, group-specific block of conserved, intramolecularly interacting residues.


Subject(s)
Adaptor Proteins, Signal Transducing/genetics , Protein Domains/genetics , Protein Domains/physiology , Adaptor Proteins, Signal Transducing/metabolism , Amino Acid Sequence , Animals , Bayes Theorem , Databases, Genetic , Drosophila Proteins/genetics , Drosophila Proteins/metabolism , Humans , Interleukins , Models, Molecular , Myeloid Differentiation Factor 88/genetics , Myeloid Differentiation Factor 88/metabolism , Protein Structure, Secondary , Receptors, Interleukin-1/genetics , Receptors, Interleukin-1/metabolism , Signal Transduction/genetics , Signal Transduction/physiology , Toll-Like Receptors/genetics , Toll-Like Receptors/metabolism
14.
Sci Rep ; 10(1): 1691, 2020 02 03.
Article in English | MEDLINE | ID: mdl-32015389

ABSTRACT

Protein functional constraints are manifest as superfamily and functional-subgroup conserved residues, and as pairwise correlations. Deep Analysis of Residue Constraints (DARC) aids the visualization of these constraints, characterizes how they correlate with each other and with structure, and estimates statistical significance. This can identify determinants of protein functional specificity, as we illustrate for bacterial DNA clamp loader ATPases. These load ring-shaped sliding clamps onto DNA to keep polymerase attached during replication and contain one δ, three γ, and one δ' AAA+ subunits semi-circularly arranged in the order δ-γ1-γ2-γ3-δ'. Only γ is active, though both γ and δ' functionally influence an adjacent γ subunit. DARC identifies, as functionally-congruent features linking allosterically the ATP, DNA, and clamp binding sites: residues distinctive of γ and of γ/δ' that mutually interact in trans, centered on the catalytic base; several γ/δ'-residues and six γ/δ'-covariant residue pairs within the DNA binding N-termini of helices α2 and α3; and γ/δ'-residues associated with the α2 C-terminus and the clamp-binding loop. Most notable is a trans-acting γ/δ' hydroxyl group that 99% of other AAA+ proteins lack. Mutation of this hydroxyl to a methyl group impedes clamp binding and opening, DNA binding, and ATP hydrolysis, implying a remarkably clamp-loader-specific function.


Subject(s)
DNA-Binding Proteins/metabolism , Protein Subunits/metabolism , Adenosine Triphosphatases/metabolism , Adenosine Triphosphate/metabolism , Binding Sites/physiology , DNA Polymerase III/metabolism , DNA, Bacterial/metabolism , Escherichia coli/metabolism , Hydrolysis , Protein Structure, Secondary , Sensitivity and Specificity
15.
Elife ; 7, 2018 01 16.
Article in English | MEDLINE | ID: mdl-29336305

ABSTRACT

Residues responsible for allostery, cooperativity, and other subtle but functionally important interactions remain difficult to detect. To aid such detection, we employ statistical inference based on the assumption that residues distinguishing a protein subgroup from evolutionarily divergent subgroups often constitute an interacting functional network. We identify such networks with the aid of two measures of statistical significance. One measure aids identification of divergent subgroups based on distinguishing residue patterns. For each subgroup, a second measure identifies structural interactions involving pattern residues. Such interactions are derived either from atomic coordinates or from Direct Coupling Analysis scores, used as surrogates for structural distances. Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. These and similar analyses can aid the design of drugs targeting allosteric sites.


Subject(s)
Computational Biology/methods , Enzymes/chemistry , Enzymes/metabolism , Protein Conformation
16.
J Comput Biol ; 25(2): 121-129, 2018 02.
Article in English | MEDLINE | ID: mdl-28771374

ABSTRACT

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1[Formula: see text], these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.
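
The abstract problem can be made concrete with a small sketch. The paper approaches it through the minimum description length principle; the code below swaps in a simpler, cruder stand-in: for each prefix of the 0/1 string it computes a hypergeometric tail probability of seeing at least that many 1s by chance, and reports the prefix where this probability is smallest. The function names are hypothetical, and the minimum over prefixes is not corrected for having examined many prefixes, so it should not be read as a p-value.

```python
from math import comb


def hypergeom_tail(L, D, n, k):
    """P(X >= k) when n of L positions are chosen at random and D of the L are 1s."""
    upper = min(D, n)
    favourable = sum(comb(D, i) * comb(L - D, n - i) for i in range(k, upper + 1))
    return favourable / comb(L, n)


def best_prefix(bits):
    """Return (prefix_length, ones_in_prefix, tail_probability) for the prefix
    whose count of 1s is least likely under random placement.

    Returns None for strings shorter than two characters.
    """
    L, D = len(bits), sum(bits)
    best = None
    ones = 0
    for n in range(1, L):          # proper prefixes only
        ones += bits[n - 1]
        p = hypergeom_tail(L, D, n, ones)
        if best is None or p < best[2]:
            best = (n, ones, p)
    return best
```

For the string 1110000000 (L = 10, D = 3), the most surprising prefix is the first three positions, with tail probability 1/120. In the protein setting described above, the positions would be ordered by some criterion (for example distance from a site of interest) and the 1s would mark the distinguishing residues.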


Subject(s)
Computational Biology/methods , GTP Phosphohydrolase-Linked Elongation Factors/chemistry , Saccharomyces cerevisiae Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis
17.
PLoS Comput Biol ; 14(12): e1006237, 2018 12.
Article in English | MEDLINE | ID: mdl-30596639

ABSTRACT

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.


Subject(s)
Binding Sites/physiology , Proteins/chemistry , Sequence Analysis, Protein/methods , Algorithms , Models, Molecular , Protein Conformation , Protein Interaction Domains and Motifs/physiology , Protein Structural Elements , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
18.
PLoS Comput Biol ; 12(12): e1005294, 2016 12.
Article in English | MEDLINE | ID: mdl-28002465

ABSTRACT

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).


Subject(s)
Acetyltransferases/chemistry , Models, Molecular , Sequence Analysis, Protein/methods , Acetyltransferases/genetics , Acetyltransferases/metabolism , Amino Acid Sequence , Animals , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans Proteins/metabolism , Computational Biology , Humans , Markov Chains , Monte Carlo Method , Sequence Alignment/methods
19.
Curr Opin Struct Biol ; 38: 1-8, 2016 06.
Article in English | MEDLINE | ID: mdl-27179293

ABSTRACT

The availability of vast amounts of protein sequence data facilitates detection of subtle statistical correlations due to imposed structural and functional constraints. Recent breakthroughs using Direct Coupling Analysis (DCA) and related approaches have tapped into correlations believed to be due to compensatory mutations. This has yielded some remarkable results, including substantially improved prediction of protein intra- and inter-domain 3D contacts, of membrane and globular protein structures, of substrate binding sites, and of protein conformational heterogeneity. A complementary approach is Bayesian Partitioning with Pattern Selection (BPPS), which partitions related proteins into hierarchically-arranged subgroups based on correlated residue patterns. These correlated patterns are presumably due to structural and functional constraints associated with evolutionary divergence rather than to compensatory mutations. Hence joint application of DCA- and BPPS-based approaches should help sort out the structural and functional constraints contributing to sequence correlations.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Proteins/metabolism , Sequence Alignment , Amino Acid Sequence , Interatrial Block , Models, Molecular
20.
PLoS Comput Biol ; 12(5): e1004936, 2016 05.
Article in English | MEDLINE | ID: mdl-27192614

ABSTRACT

We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
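
For readers unfamiliar with the sum-of-the-pairs scoring system that the abstract contrasts with GISMO's Bayesian measure, here is a minimal sketch. It scores an alignment by summing a pairwise score over every pair of sequences in every column. The match, mismatch, and gap values are hypothetical placeholders; real programs typically use a substitution matrix and more elaborate gap penalties.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    '-' marks a gap. A gap paired with a gap contributes nothing.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

Because this score is defined for any arrangement of any sequences, an optimizer driven by it will always return some alignment, including for unrelated input; it carries no built-in notion of whether aligning a region is statistically justified.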


Subject(s)
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Bayes Theorem , Computational Biology , Databases, Protein , Markov Chains , Monte Carlo Method , Sequence Alignment/standards , Software