Results 1 - 19 of 19
1.
Cell ; 150(5): 1068-81, 2012 Aug 31.
Article in English | MEDLINE | ID: mdl-22939629

ABSTRACT

Cellular processes often depend on stable physical associations between proteins. Despite recent progress, knowledge of the composition of human protein complexes remains limited. To close this gap, we applied an integrative global proteomic profiling approach, based on chromatographic separation of cultured human cell extracts into more than one thousand biochemical fractions that were subsequently analyzed by quantitative tandem mass spectrometry, to systematically identify a network of 13,993 high-confidence physical interactions among 3,006 stably associated soluble human proteins. Most of the 622 putative protein complexes we report are linked to core biological processes and encompass both candidate disease genes and unannotated proteins to inform on mechanism. Strikingly, whereas larger multiprotein assemblies tend to be more extensively annotated and evolutionarily conserved, human protein complexes with five or fewer subunits are far more likely to be functionally unannotated or restricted to vertebrates, suggesting more recent functional innovations.


Subject(s)
Multiprotein Complexes/analysis , Protein Interaction Maps , Proteins/chemistry , Proteomics/methods , Humans , Tandem Mass Spectrometry
2.
Biochim Biophys Acta Proteins Proteom ; 1865(1): 43-54, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27718363

ABSTRACT

Therapeutic protein kinase inhibitors are designed on the basis of kinase structures. Here, we define intrinsically disordered regions (IDRs) in structurally hybrid kinases. We reveal that 65% of kinases have an IDR adjacent to their kinase domain (KD). These IDRs are evolutionarily more conserved than IDRs distant to KDs. Strikingly, 36 kinases have adjacent IDRs extending into their KDs, defining a unique structural and functional subset of the kinome. Functional network analysis of this subset of the kinome uncovered FAK1 as topologically the most connected hub kinase. We identify that KD-flanking IDR of FAK1 is more conserved and undergoes more post-translational modifications than other IDRs. It preferentially interacts with proteins regulating scaffolding and kinase activity, which contribute to cytoskeletal remodeling. In summary, spatially and evolutionarily conserved IDRs in kinases may influence their functions, which can be exploited for targeted therapies in diseases including those that involve aberrant cytoskeletal remodeling.


Subject(s)
Cytoskeleton/metabolism , Focal Adhesion Kinase 1/chemistry , Cytoskeleton/enzymology , Focal Adhesion Kinase 1/metabolism , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Protein Conformation , Protein Processing, Post-Translational
3.
PLoS Genet ; 9(2): e1003280, 2013.
Article in English | MEDLINE | ID: mdl-23468640

ABSTRACT

Expansions of trinucleotide CAG/CTG repeats in somatic tissues are thought to contribute to ongoing disease progression through an affected individual's life with Huntington's disease or myotonic dystrophy. Broad ranges of repeat instability arise between individuals with expanded repeats, suggesting the existence of modifiers of repeat instability. Mice with expanded CAG/CTG repeats show variable levels of instability depending upon mouse strain. However, to date the genetic modifiers underlying these differences have not been identified. We show that in liver and striatum the R6/1 Huntington's disease (HD) (CAG)∼100 transgene, when present in a congenic C57BL/6J (B6) background, incurred expansion-biased repeat mutations, whereas the repeat was stable in a congenic BALB/cByJ (CBy) background. Reciprocal congenic mice revealed the Msh3 gene as the determinant for the differences in repeat instability. Expansion bias was observed in congenic mice homozygous for the B6 Msh3 gene on a CBy background, while the CAG tract was stabilized in congenics homozygous for the CBy Msh3 gene on a B6 background. The CAG stabilization was as dramatic as genetic deficiency of Msh2. The B6 and CBy Msh3 genes had identical promoters but differed in coding regions and showed strikingly different protein levels. B6 MSH3 variant protein is highly expressed and associated with CAG expansions, while the CBy MSH3 variant protein is expressed at barely detectable levels, associating with CAG stability. The DHFR protein, which is divergently transcribed from a promoter shared by the Msh3 gene, did not show varied levels between mouse strains. Thus, naturally occurring MSH3 protein polymorphisms are modifiers of CAG repeat instability, likely through variable MSH3 protein stability. 
Since evidence supports that somatic CAG instability is a modifier and predictor of disease, our data are consistent with the hypothesis that variable levels of CAG instability associated with polymorphisms of DNA repair genes may have prognostic implications for various repeat-associated diseases.


Subject(s)
Huntington Disease/genetics , Proteins/genetics , Trinucleotide Repeat Expansion/genetics , Trinucleotide Repeats/genetics , Animals , Corpus Striatum/metabolism , Disease Models, Animal , Genomic Instability , Humans , Mice , MutS Homolog 3 Protein , Myotonic Dystrophy/genetics , Myotonic Dystrophy/metabolism , Neostriatum/metabolism , Nerve Tissue Proteins/genetics , Nerve Tissue Proteins/metabolism , Polymorphism, Genetic , Protein Stability
4.
Mol Biol Evol ; 30(2): 332-46, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22977115

ABSTRACT

Protein interaction networks play central roles in biological systems, from simple metabolic pathways through complex programs permitting the development of organisms. Multicellularity could only have arisen from a careful orchestration of cellular and molecular roles and responsibilities, all properly controlled and regulated. Disease reflects a breakdown of this organismal homeostasis. To better understand the evolution of interactions whose dysfunction may be contributing factors to disease, we derived the human protein coevolution network using our MatrixMatchMaker algorithm and using the Orthologous MAtrix project (OMA) database as a source for protein orthologs from 103 eukaryotic genomes. We annotated the coevolution network using protein-protein interaction data, many functional data sources, and we explored the evolutionary rates and dates of emergence of the proteins in our data set. Strikingly, clustering based only on the topology of the coevolution network partitions it into two subnetworks, one generally representing ancient eukaryotic functions and the other functions more recently acquired during animal evolution. That latter subnetwork is enriched for proteins with roles in cell-cell communication, the control of cell division, and related multicellular functions. Further annotation using data from genetic disease databases and cancer genome sequences strongly implicates these proteins in both ciliopathies and cancer. The enrichment for such disease markers in the animal network suggests a functional link between these coevolving proteins. Genetic validation corroborates the recruitment of ancient cilia in the evolution of multicellularity.


Subject(s)
Biological Evolution , Cell Communication/physiology , Proteins/genetics , Proteins/metabolism , Animals , Ciliary Motility Disorders/genetics , Ciliary Motility Disorders/metabolism , Cluster Analysis , Databases, Protein , Female , Gene Expression , Humans , Male , Mutation , Neoplasms/genetics , Neoplasms/metabolism , Protein Binding , Protein Interaction Mapping , Protein Interaction Maps
5.
Genome Res ; 19(10): 1861-71, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19696150

ABSTRACT

Coevolution maintains interactions between phenotypic traits through the process of reciprocal natural selection. Detecting molecular coevolution can expose functional interactions between molecules in the cell, generating insights into biological processes, pathways, and the networks of interactions important for cellular function. Prediction of interaction partners from different protein families exploits the property that interacting proteins can follow similar patterns and relative rates of evolution. Current methods for detecting coevolution based on the similarity of phylogenetic trees or evolutionary distance matrices have, however, been limited by requiring coevolution over the entire evolutionary history considered and are inaccurate in the presence of paralogous copies. We present a novel method for determining coevolving protein partners by finding the largest common submatrix in a given pair of distance matrices, with the size of the largest common submatrix measuring the strength of coevolution. This approach permits us to consider matrices of different size and scale, to find lineage-specific coevolution, and to predict multiple interaction partners. We used MatrixMatchMaker to predict protein-protein interactions in the human genome. We show that proteins that are known to interact physically are more strongly coevolving than proteins that simply belong to the same biochemical pathway. The human coevolution network is highly connected, suggesting many more protein-protein interactions than are currently known from high-throughput and other experimental evidence. These most strongly coevolving proteins suggest interactions that have been maintained over long periods of evolutionary time, and that are thus likely to be of fundamental importance to cellular function.


Subject(s)
Evolution, Molecular , Gene Regulatory Networks/genetics , Proteins/genetics , Calibration , Computational Biology/methods , Databases, Protein , Forecasting , Genetic Variation , Humans , Metabolic Networks and Pathways/genetics , Phylogeny , Protein Binding/genetics , Protein Interaction Domains and Motifs/genetics , Proteins/metabolism , Sensitivity and Specificity , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/standards , Software/standards
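
The abstract of entry 5 describes scoring coevolution by the size of the largest common submatrix of two distance matrices. The following is a minimal brute-force sketch of that idea, not the MatrixMatchMaker implementation. The function names, the relative tolerance `tol`, and the exhaustive backtracking search are assumptions made for illustration, and the search is only practical for small matrices.

```python
def compatible(a, b, pairs, i, j, tol):
    """True if adding the match i->j keeps every distance to the
    already matched sequences within a relative tolerance."""
    for p, q in pairs:
        d1, d2 = a[i][p], b[j][q]
        if abs(d1 - d2) > tol * max(d1, d2):
            return False
    return True


def largest_common_submatrix(a, b, tol=0.1):
    """Return the largest list of (row in a, row in b) matches whose
    pairwise distances agree in both matrices (exhaustive search)."""
    best = []

    def extend(pairs, start):
        nonlocal best
        if len(pairs) > len(best):
            best = list(pairs)
        used_j = {q for _, q in pairs}
        for i in range(start, len(a)):
            # Prune: the remaining rows cannot beat the current best.
            if len(pairs) + (len(a) - i) <= len(best):
                break
            for j in range(len(b)):
                if j not in used_j and compatible(a, b, pairs, i, j, tol):
                    pairs.append((i, j))
                    extend(pairs, i + 1)
                    pairs.pop()

    extend([], 0)
    return best


# Toy example: family b contains the same three-way distances as
# family a (in a different order) plus one unrelated sequence.
a = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
b = [[0, 10, 1, 3],
     [10, 0, 10, 10],
     [1, 10, 0, 2],
     [3, 10, 2, 0]]
match = largest_common_submatrix(a, b)
```

In this sketch the length of `match` plays the role of the coevolution strength, and the matched index pairs are the predicted partners. Because the two matrices may have different sizes and only a subset of rows needs to match, the search also illustrates how lineage-specific agreement can be found.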
6.
Proteins ; 78(3): 548-58, 2010 Feb 15.
Article in English | MEDLINE | ID: mdl-19768681

ABSTRACT

Correlated mutation analysis (CMA) is an effective approach for predicting functional and structural residue interactions from multiple sequence alignments (MSAs) of proteins. As nearby residues may also play a role in a given functional interaction, we were interested in seeing whether covarying sites were clustered, and whether this could be used to enhance the predictive power of CMA. A large-scale search for coevolving regions within protein domains revealed that if two sites in an MSA covary, then neighboring sites in the alignment also typically covary, resulting in clusters of covarying residues. The program PatchD (http://www.uhnres.utoronto.ca/labs/tillier/) was developed to measure the covariation between disconnected sequence clusters to reveal patch covariation. Patches that exhibit strong covariation identify multiple residues that are generally nearby in the protein structure, suggesting that the detection of covarying patches can be used in conjunction with traditional CMA approaches to reveal functional interaction partners.


Subject(s)
DNA Mutational Analysis/methods , Models, Genetic , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Binding Sites , Cluster Analysis , Conserved Sequence , Genetic Variation , Models, Molecular , Phylogeny , Proteins/metabolism , Sequence Alignment
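
To make the notion of covarying sites and patches in entry 6 concrete, here is a small sketch that scores two alignment columns by mutual information and averages that score over two groups of columns. Mutual information is one common covariation measure and is swapped in here as an assumption; the abstract does not state which statistic PatchD uses, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def column(msa, k):
    """Extract column k from an alignment given as equal-length strings."""
    return [row[k] for row in msa]


def patch_covariation(msa, patch_a, patch_b):
    """Average mutual information over all column pairs between two
    patches (lists of column indices). Averaging is an assumption."""
    scores = [mutual_information(column(msa, i), column(msa, j))
              for i in patch_a for j in patch_b]
    return sum(scores) / len(scores)


# Toy alignment: columns 0 and 1 change together, column 2 is constant.
msa = ["ALK", "ALK", "VIK", "VIK"]
mi_01 = mutual_information(column(msa, 0), column(msa, 1))
mi_02 = mutual_information(column(msa, 0), column(msa, 2))
patch_score = patch_covariation(msa, [0], [1, 2])
```

Note that a raw score like this also picks up correlations caused by shared ancestry, which is the problem addressed in entry 18.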
7.
Biochem Cell Biol ; 88(2): 185-94, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20453921

ABSTRACT

GroEL is a chaperone thought of as essential for bacterial life. However, some species of Mollicutes are missing GroEL. We use phylogenetic analysis to show that the presence of GroEL is polyphyletic among the Mollicutes, and that there is evidence for lateral gene transfer of GroEL to Mycoplasma penetrans from the Proteobacteria. Furthermore, we propose that the presence of GroEL in Mycoplasma may be required for invasion of host tissue, suggesting that GroEL may act as an adhesin-invasin.


Subject(s)
Chaperonin 60/genetics , Chaperonin 60/metabolism , Tenericutes/genetics , Tenericutes/metabolism , Chaperonin 60/chemistry , Phylogeny , Tenericutes/chemistry
8.
Bioinformatics ; 23(10): 1195-202, 2007 May 15.
Article in English | MEDLINE | ID: mdl-17392329

ABSTRACT

MOTIVATION: With hundreds of completely sequenced microbial genomes available, and advancements in DNA microarray technology, the detection of genes in microbial communities consisting of hundreds of thousands of sequences may be possible. The existing strategies developed for DNA probe design, geared toward identifying specific sequences, are not suitable due to the lack of coverage, flexibility and efficiency necessary for applications in metagenomics. METHODS: ProDesign is a tool developed for the selection of oligonucleotide probes to detect members of gene families present in environmental samples. Gene family-specific probe sequences are generated based on specific and shared words, which are found with the spaced seed hashing algorithm. To detect more sequences, those sharing some common words are re-clustered into new families, then probes specific for the new families are generated. RESULTS: The program is very flexible in that it can be used for designing probes for detecting many gene families simultaneously and specifically in one or more genomes. Neither the length nor the melting temperature of the probes needs to be predefined. We have found that ProDesign provides more flexibility, coverage and speed than other software programs used in the selection of probes for genomic and gene family arrays. AVAILABILITY: ProDesign is licensed free of charge to academic users. ProDesign and Supplementary Material can be obtained by contacting the authors. A web server for ProDesign is available at http://www.uhnresearch.ca/labs/tillier/ProDesign/ProDesign.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology/methods , Multigene Family , Oligonucleotide Probes/genetics , Bacteria/genetics , Genome, Bacterial , Microarray Analysis , Oligonucleotide Array Sequence Analysis , Software
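
The "specific and shared words" found with spaced seeds in entry 8 can be illustrated with a short sketch. It is not ProDesign's algorithm: the seed pattern, the function names, and the use of plain Python sets in place of a hash table are all assumptions chosen to keep the example small.

```python
def spaced_words(seq, seed="1101"):
    """Return the set of words read from seq through a spaced seed.
    Positions marked '1' in the seed are kept, '0' positions are ignored."""
    keep = [i for i, c in enumerate(seed) if c == "1"]
    return {"".join(seq[start + i] for i in keep)
            for start in range(len(seq) - len(seed) + 1)}


def shared_words(family, seed="1101"):
    """Words present in every sequence of a family."""
    word_sets = [spaced_words(s, seed) for s in family]
    return set.intersection(*word_sets)


def family_specific_words(family, others, seed="1101"):
    """Shared words of a family that occur in no sequence outside it."""
    outside = set()
    for s in others:
        outside |= spaced_words(s, seed)
    return shared_words(family, seed) - outside


# Toy example with the seed 101 (match, ignore, match).
family = ["ACGTA", "ATGTA"]
others = ["CCGAA"]
words = spaced_words("ACGTA", "101")
shared = shared_words(family, "101")
specific = family_specific_words(family, others, "101")
```

Because the ignored seed positions tolerate mismatches, two family members that differ at a single base (here position 1) still share words, which is what makes spaced seeds attractive for covering divergent gene families.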
9.
Biomol Eng ; 24(3): 321-6, 2007 Sep.
Article in English | MEDLINE | ID: mdl-17502167

ABSTRACT

RNA sequences can form structures which are conserved throughout evolution, and the question of aligning two RNA secondary structures has been extensively studied. Most of the previous alignment algorithms require the input of gap opening and gap extension penalty parameters. The choice of appropriate parameter values is controversial as there is little biological information to guide their assignment. In this paper, we present an algorithm which circumvents this problem. Instead of finding an optimal alignment with a predefined gap opening penalty, the algorithm finds the optimal alignment with an exact number of aligned blocks.


Subject(s)
Algorithms , RNA/chemistry , RNA/genetics , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Base Sequence , Molecular Sequence Data , Nucleic Acid Conformation , Sequence Homology, Nucleic Acid
10.
Data Brief ; 10: 315-324, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28004021

ABSTRACT

We present data on the evolution of intrinsically disordered regions (IDRs) taking into account the entire human protein kinome. The evolutionary data of the IDRs with respect to the kinase domains (KDs) and kinases as a whole protein (WP) are reported. Further, we report the post-translational modifications of FAK1 IDRs and their contribution to cytoskeletal remodeling. We also report the data to build a protein-protein interaction (PPI) network of primary and secondary FAK1-interacting hybrid proteins. Detailed analysis of the data and its effect on FAK1-related functions is described in "Structural pliability adjacent to the kinase domain highlights contribution of FAK1 IDRs to cytoskeletal remodeling" (Kathiriya et al., 2016) [1].

11.
BMC Bioinformatics ; 7: 471, 2006 Oct 24.
Article in English | MEDLINE | ID: mdl-17062146

ABSTRACT

BACKGROUND: There have been many algorithms and software programs implemented for the inference of multiple sequence alignments of protein and DNA sequences. The "true" alignment is usually unknown due to the incomplete knowledge of the evolutionary history of the sequences, making it difficult to gauge the relative accuracy of the programs. RESULTS: We tested nine of the most often used protein alignment programs and compared their results using sequences generated with the simulation software Simprot which creates known alignments under realistic and controlled evolutionary scenarios. We have simulated more than 30,000 alignment sets using various evolutionary histories in order to define strengths and weaknesses of each program tested. We found that alignment accuracy is extremely dependent on the number of insertions and deletions in the sequences, and that indel size has a weaker effect. We also considered benchmark alignments from the latest version of BAliBASE and the results relative to BAliBASE- and Simprot-generated data sets were consistent in most cases. CONCLUSION: Our results indicate that employing Simprot's simulated sequences allows the creation of a more flexible and broader range of alignment classes than the usual methods for alignment accuracy assessment. Simprot also allows for a quick and efficient analysis of a wider range of possible evolutionary histories that might not be present in currently available alignment sets. Among the nine programs tested, the iterative approach available in Mafft (L-INS-i) and ProbCons were consistently the most accurate, with Mafft being the faster of the two.


Subject(s)
Amino Acid Sequence , Proteins/chemistry , Sequence Alignment/methods , Software , Computational Biology , Computer Simulation , Databases, Protein , Gene Deletion , Mutation , Protein Conformation , Proteins/genetics , Sequence Alignment/standards
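
When the true alignment is known, as with the simulated data in entry 11, an inferred alignment can be scored against it directly. The sketch below uses the sum-of-pairs score, a widely used accuracy measure; the abstract does not say which measure the authors used, so treat the choice and the function names as assumptions.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column. Each residue is
    identified by (row index, position in the ungapped sequence)."""
    pairs = set()
    position = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for row_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((row_index, position[row_index]))
                position[row_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


# Toy example: the test alignment misplaces the gap in the first row.
reference = ["AC-G",
             "ACTG"]
test = ["ACG-",
        "ACTG"]
perfect = sum_of_pairs_score(reference, reference)
score = sum_of_pairs_score(test, reference)
```

Here two of the three reference pairs survive in the test alignment, so the score is 2/3. A single misplaced gap already costs a third of the score on this tiny example, which gives some intuition for why accuracy is sensitive to the number of indels.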
12.
Proteins ; 63(4): 822-31, 2006 Jun 01.
Article in English | MEDLINE | ID: mdl-16634043

ABSTRACT

Approaches for the determination of interacting partners from different protein families (such as ligands and their receptors) have made use of the property that interacting proteins follow similar patterns and relative rates of evolution. Interacting protein partners can then be predicted from the similarity of their phylogenetic trees or evolutionary distance matrices. We present a novel method called Codep, for the determination of interacting protein partners by maximizing co-evolutionary signals. The order of sequences in the multiple sequence alignments from two protein families is determined in such a manner as to maximize the similarity of substitution patterns at amino acid sites in the two alignments and, thus, phylogenetic congruency. This is achieved by maximizing the total number of interdependencies of amino acid sites between the alignments. Once ordered, the corresponding sequences in the two alignments indicate the predicted interacting partners. We demonstrate the efficacy of this approach with computer simulations and in analyses of several protein families. A program implementing our method, Codep, is freely available to academic users from our website: http://www.uhnresearch.ca/labs/tillier/.


Subject(s)
Evolution, Molecular , Proteins/genetics , Proteins/metabolism , Computer Simulation , Phylogeny , Protein Binding , Proteins/chemistry , Software
13.
BMC Bioinformatics ; 6: 236, 2005 Sep 27.
Article in English | MEDLINE | ID: mdl-16188037

ABSTRACT

BACKGROUND: General protein evolution models help determine the baseline expectations for the evolution of sequences, and they have been extensively useful in sequence analysis and for the computer simulation of artificial sequence data sets. RESULTS: We have developed a new method of simulating protein sequence evolution, including insertion and deletion (indel) events in addition to amino-acid substitutions. The simulation generates both the simulated sequence family and a true sequence alignment that captures the evolutionary relationships between amino acids from different sequences. Our statistical model for indel evolution is based on the empirical indel distribution determined by Qian and Goldstein. We have parameterized this distribution so that it applies to sequences diverged by varying evolutionary times and generalized it to provide flexibility in simulation conditions. Our method uses a Monte-Carlo simulation strategy, and has been implemented in a C++ program named Simprot. CONCLUSION: Simprot will be useful for testing methods of analysis of protein sequence families particularly alignment methods, phylogenetic tree building, detection of recombination and horizontal gene transfer, and homology detection, where knowing the true course of sequence evolution is essential.


Subject(s)
Computer Simulation , Evolution, Molecular , Models, Genetic , Sequence Analysis, Protein/methods , Software , Amino Acid Substitution , Models, Statistical , Monte Carlo Method , Phylogeny , Selection, Genetic , Software Design
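
Entry 13 describes a Monte-Carlo simulation that returns both the evolved sequences and their true alignment. The sketch below shows that bookkeeping for a single parent-to-child step. It is not Simprot: uniform substitutions and a geometric indel length stand in for the empirical substitution model and the Qian and Goldstein indel distribution, and every rate and parameter name is a hypothetical placeholder.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def indel_length(rng, extend):
    """Geometric indel length (an assumption, not the empirical model)."""
    length = 1
    while rng.random() < extend:
        length += 1
    return length


def evolve(parent, rng, sub_rate=0.1, indel_rate=0.05, extend=0.5):
    """Evolve one child from a parent sequence and return the pair as
    two gapped rows of equal length (the true pairwise alignment)."""
    parent_row, child_row = [], []
    i = 0
    while i < len(parent):
        r = rng.random()
        if r < indel_rate / 2:
            # Insertion before residue i: gaps in the parent row.
            for _ in range(indel_length(rng, extend)):
                parent_row.append("-")
                child_row.append(rng.choice(AMINO_ACIDS))
        elif r < indel_rate:
            # Deletion starting at residue i: gaps in the child row.
            deleted = parent[i:i + indel_length(rng, extend)]
            for residue in deleted:
                parent_row.append(residue)
                child_row.append("-")
            i += len(deleted)
            continue
        # Copy residue i, possibly substituting it (the draw may
        # return the same amino acid, which leaves the site unchanged).
        residue = parent[i]
        parent_row.append(residue)
        if rng.random() < sub_rate:
            child_row.append(rng.choice(AMINO_ACIDS))
        else:
            child_row.append(residue)
        i += 1
    return "".join(parent_row), "".join(child_row)


parent = "MKTAYIAKQRQISFVKSHFSRQ"
parent_row, child_row = evolve(parent, random.Random(42))
child = child_row.replace("-", "")
```

Recording the gapped rows while the events are applied is what yields a known alignment for free, so the output can serve as ground truth when testing alignment or tree-building programs.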
14.
J Comput Biol ; 10(6): 997-1010, 2003.
Article in English | MEDLINE | ID: mdl-14980022

ABSTRACT

Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices. The model has the additional advantage of being easily implemented.


Subject(s)
Amino Acid Substitution , Evolution, Molecular , Models, Genetic , Probability , Proteins/chemistry , Computational Biology , Databases, Protein
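
Entry 14 turns on the difference between the observed fraction of differing sites and the actual number of substitutions, which is larger because repeated changes at one site are seen only once. As a worked illustration, the sketch below applies the simplest such correction, which assumes 20 equally frequent amino acids and equal rates between them. This is a textbook simplification swapped in for illustration and is not the PMB model; the function names are hypothetical.

```python
from math import exp, log


def corrected_distance(p, states=20):
    """Expected substitutions per site given an observed fraction p of
    differing sites, under equal rates among `states` equally frequent
    states: d = -b * ln(1 - p / b), with b = (states - 1) / states."""
    b = (states - 1) / states
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (states - 1) / states)")
    return -b * log(1 - p / b)


def expected_difference(d, states=20):
    """Inverse relation: observed fraction of differing sites expected
    after d substitutions per site under the same simple model."""
    b = (states - 1) / states
    return b * (1 - exp(-d / b))


d_half = corrected_distance(0.5)
round_trip = corrected_distance(expected_difference(1.3))
```

For small `p` the corrected distance is close to `p`, and it grows without bound as `p` approaches 0.95, the saturation level for 20 states. An evolutionary model such as the one in the abstract supplies this kind of mapping with realistic, unequal substitution probabilities.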
15.
Methods Mol Biol ; 781: 237-56, 2011.
Article in English | MEDLINE | ID: mdl-21877284

ABSTRACT

Bioinformatic methods to predict protein-protein interactions (PPI) via coevolutionary analysis have positioned themselves to compete alongside established in vitro methods, despite a lack of understanding for the underlying molecular mechanisms of the coevolutionary process. Investigating the alignment of coevolutionary predictions of PPI with experimental data can focus the effective scope of prediction and lead to better accuracies. A new rate-based coevolutionary method, MMM, preferentially finds obligate interacting proteins that form complexes, conforming to results from studies based on coimmunoprecipitation coupled with mass spectrometry. Using gold-standard databases as a benchmark for accuracy, MMM surpasses methods based on abundance ratios, suggesting that correlated evolutionary rates may yet be better than coexpression at predicting interacting proteins. At the level of protein domains, coevolution is difficult to detect, even with MMM, except when considering small-scale experimental data involving proteins with multiple domains. Overall, these findings confirm that coevolutionary methods can be confidently used in predicting PPI, either independently or as drivers of coimmunoprecipitation experiments.


Subject(s)
Biological Evolution , Computational Biology , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Algorithms , Immunoprecipitation , Phylogeny , Protein Binding
16.
Microb Biotechnol ; 3(6): 677-90, 2010 Nov.
Article in English | MEDLINE | ID: mdl-21255363

ABSTRACT

One hundred and seventy-one genes encoding potential esterases from 11 bacterial genomes were cloned and overexpressed in Escherichia coli; 74 of the clones produced soluble proteins. All 74 soluble proteins were purified and screened for esterase activity; 36 proteins showed carboxyl esterase activity on short-chain esters, 17 demonstrated arylesterase activity, while 38 proteins did not exhibit any activity towards the test substrates. Esterases from Rhodopseudomonas palustris (RpEST-1, RpEST-2 and RpEST-3), Pseudomonas putida (PpEST-1, PpEST-2 and PpEST-3), Pseudomonas aeruginosa (PaEST-1) and Streptomyces avermitilis (SavEST-1) were selected for detailed biochemical characterization. All of the enzymes showed optimal activity at neutral or alkaline pH, and the half-life of each enzyme at 50°C ranged from < 5 min to over 5 h. PpEST-3, RpEST-1 and RpEST-2 demonstrated the highest specific activity with pNP-esters; these enzymes were also among the most stable at 50°C and in the presence of detergents, polar and non-polar organic solvents, and imidazolium ionic liquids. Accordingly, these enzymes are particularly interesting targets for subsequent application trials. Finally, biochemical and bioinformatic analyses were compared to reveal sequence features that could be correlated to enzymes with arylesterase activity, facilitating subsequent searches for new esterases in microbial genome sequences.


Subject(s)
Bacteria/enzymology , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Carboxylic Ester Hydrolases/genetics , Carboxylic Ester Hydrolases/metabolism , Genome, Bacterial , Bacterial Proteins/chemistry , Bacterial Proteins/isolation & purification , Carboxylic Ester Hydrolases/chemistry , Carboxylic Ester Hydrolases/isolation & purification , Computational Biology , Enzyme Stability , Hydrogen-Ion Concentration , Substrate Specificity , Temperature
17.
Evol Bioinform Online ; 2: 77-90, 2007 Jan 14.
Article in English | MEDLINE | ID: mdl-19455203

ABSTRACT

In comparative genomic studies, syntenic groups of homologous sequences in the same order have been used as supplementary information to help determine the orthology of the compared sequences. The assumption is that orthologous gene copies are more likely to share the same genome positions and share the same gene neighbors. In this study we have defined positional homologs as those that also have homologous neighboring genes and we investigated the usefulness of this distinction for bacterial comparative genomics. We considered the identification of positionally homologous gene pairs in bacterial genomes using protein and DNA sequence level alignments and found that the positional homologs had on average relatively lower rates of substitution at the DNA level (synonymous substitutions) than duplicate homologs in different genomic locations, regardless of the level of protein sequence divergence (measured with non-synonymous substitution rate). Since gene order conservation can indicate accuracy of orthology assignments, we also considered the effect of imposing certain alignment quality requirements on the sensitivity and specificity of identification of protein pairs by BLAST and FASTA when neighboring information is not available and in comparisons where gene order is not conserved. We found that the addition of a stringency filter based on the second best hits was an efficient way to remove dubious ortholog identifications in BLAST and FASTA analyses. Gene order conservation and DNA sequence homology are useful to consider in comparative genomic studies as they may indicate different orthology assignments than protein sequence homology alone.
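
A stringency filter based on second best hits, as mentioned in entry 17, could look like the sketch below: a best hit is kept as a putative ortholog only when the runner-up scores clearly lower. The abstract does not give the actual criterion, so the score-ratio rule, the `max_ratio` threshold, the function name, and the input layout are all assumptions.

```python
def filter_by_second_best(hits, max_ratio=0.8):
    """Keep the best hit of each query only if the second best hit
    scores at most max_ratio times the best score.

    hits maps a query name to a list of (subject name, score) tuples,
    for example bit scores parsed from BLAST or FASTA output."""
    kept = {}
    for query, scored in hits.items():
        ranked = sorted(scored, key=lambda hit: hit[1], reverse=True)
        if not ranked:
            continue
        best_subject, best_score = ranked[0]
        if len(ranked) == 1 or ranked[1][1] <= max_ratio * best_score:
            kept[query] = best_subject
    return kept


# Toy example: geneB has two nearly equal hits, so its assignment
# is treated as dubious and dropped.
hits = {
    "geneA": [("x1", 500.0), ("x2", 120.0)],
    "geneB": [("y1", 300.0), ("y2", 290.0)],
    "geneC": [("z1", 80.0)],
}
orthologs = filter_by_second_best(hits)
```

A near tie between the two top hits usually signals paralogs that sequence similarity alone cannot separate, which is the situation where gene order information would otherwise be needed.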

18.
Bioinformatics ; 19(6): 750-5, 2003 Apr 12.
Article in English | MEDLINE | ID: mdl-12691987

ABSTRACT

MOTIVATION: Multiple sequence alignments of homologous proteins are useful for inferring their phylogenetic history and to reveal functionally important regions in the proteins. Functional constraints may lead to co-variation of two or more amino acids in the sequence, such that a substitution at one site is accompanied by compensatory substitutions at another site. It is not sufficient to find the statistical correlations between sites in the alignment because these may be the result of several undetermined causes. In particular, phylogenetic clustering will lead to many strong correlations. RESULTS: A procedure is developed to detect statistical correlations stemming from functional interaction by removing the strong phylogenetic signal that leads to the correlations of each site with many others in the sequence. Our method relies upon the accuracy of the alignment but it does not require any assumptions about the phylogeny or the substitution process. The effectiveness of the method was verified using computer simulations and then applied to predict functional interactions between amino acids in the Pfam database of alignments.


Subject(s)
Algorithms , Models, Molecular , Phylogeny , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Models, Statistical , Molecular Sequence Data , Protein Conformation , Protein Structure, Secondary , Quality Control , Sequence Homology, Amino Acid
19.
Mol Biol Evol ; 21(3): 419-27, 2004 Mar.
Article in English | MEDLINE | ID: mdl-14660689

ABSTRACT

Empirical models of substitution are often used in protein sequence analysis because the large alphabet of amino acids requires that many parameters be estimated in all but the simplest parametric models. When information about structure is used in the analysis of substitutions in structured RNA, a similar situation occurs. The number of parameters necessary to adequately describe the substitution process increases in order to model the substitution of paired bases. We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA (Wuyts et al. 2001. Nucleic Acids Res. 29:175-177; Wuyts et al. 2002. Nucleic Acids Res. 30:183-185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions (Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:997-1010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix. The universal rate matrices were incorporated into the Phylip Software package (Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods.
The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes. Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.


Subject(s)
Amino Acid Substitution , Models, Genetic , RNA, Ribosomal/genetics , Sequence Alignment/methods , Animals , Computer Simulation , Databases, Nucleic Acid , Evolution, Molecular , Likelihood Functions , Phylogeny
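
The 20-symbol code in entry 19 (four single bases plus 16 ordered base pairs) can be sketched as a small encoder. Two details are assumptions not stated in the abstract: the structure is supplied in dot-bracket notation, and each base pair is emitted once, at the position of its 5' partner. The function name is hypothetical.

```python
BASES = "ACGU"
# 4 unpaired symbols followed by the 16 ordered pairs: 20 symbols in total.
ALPHABET = list(BASES) + [x + y for x in BASES for y in BASES]


def encode(seq, structure):
    """Convert an RNA sequence and its dot-bracket structure into a
    list of symbols from the 20-symbol alphabet."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    # Find the 3' partner of every opening bracket.
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            partner[stack.pop()] = i
    coded = []
    for i, ch in enumerate(structure):
        if ch == ".":
            coded.append(seq[i])                    # unpaired base
        elif ch == "(":
            coded.append(seq[i] + seq[partner[i]])  # ordered pair
        # ')' positions are already represented by their 5' partner.
    return coded


# Toy hairpin: two G-C pairs closing a three-base loop.
coded = encode("GGAAACC", "((...))")
```

Because the coded alphabet has exactly 20 symbols, a coded RNA sequence can be passed to software written for amino acid sequences, which is the practical advantage the abstract points out.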