ABSTRACT
We identified tryptic peptides in yeast cell lysates that map to translation initiation sites downstream of the annotated start sites using the peptide-spectrum matching algorithms OMSSA and Mascot. To increase the accuracy of peptide-spectrum matching, both algorithms were run using several standardized parameter sets, and Mascot was run utilizing a, b, and y ions from collision-induced dissociation. A large fraction (22%) of the detected N-terminal peptides mapped to translation initiation downstream of the annotated initiation sites. Expression of several truncated proteins from downstream initiation in the same reading frame as the full-length protein (frame 1) was verified by western analysis. To facilitate analysis of the larger proteome of Drosophila, we created a streamlined sequence library from which all duplicated trypsin fragments had been removed. OMSSA assessment using this "stripped" library revealed 171 peptides that map to downstream translation initiation sites, 76% of which are in the same reading frame as the full-length annotated proteins, although some are in different reading frames creating new protein sequences not in the annotated proteome. Sequences surrounding implicated downstream AUG start codons are associated with nucleotide preferences with a pronounced three-base periodicity N1^G2^A3.
Subject(s)
Protein Databases/standards, Drosophila Proteins/analysis, Fungal Proteins/analysis, Peptides/analysis, Proteomics/methods, Tandem Mass Spectrometry/standards, Algorithms, Amino Acid Sequence, Animals, Initiator Codon, Molecular Sequence Annotation, Proteomics/standards, Reading Frames, Reference Standards
ABSTRACT
Peptide mass spectrometry relies crucially on algorithms that match peptides to spectra. We describe a method to evaluate the accuracy of these algorithms based on the masses of parent proteins before trypsin endoprotease digestion. Measurement of conformance to parent proteins provides a score for comparing the performance of different algorithms as well as alternative parameter settings for a given algorithm. Tracking of conformance scores for spectrum matches to proteins with progressively lower expression levels revealed that conformance scores are not uniform within data sets but are significantly lower for less abundant proteins. Similarly, peptides with lower algorithm peptide-spectrum match scores have lower conformance. Although peptide mass spectrometry data are typically filtered through decoy analysis to ensure a low false discovery rate, this analysis confirms that the filtered data should not be considered as having uniform confidence. The analysis suggests that use of different algorithms and multiple standardized parameter settings of these algorithms can significantly increase the number of peptides identified. This data set can be used as a resource for future algorithm assessment.
Subject(s)
Algorithms, Peptide Mapping/methods, Proteomics/methods, Tandem Mass Spectrometry/methods, Protein Databases, Humans, Peptide Fragments/analysis, Peptide Fragments/chemistry, Proteins/analysis, Proteins/chemistry, Trypsin
ABSTRACT
In the absence of chaperone molecules, RNA folding is believed to depend on the distribution of kinetic traps in the energy landscape of all secondary structures. Kinetic traps in the Nussinov energy model are precisely those secondary structures that are saturated, meaning that no base pair can be added without introducing either a pseudoknot or base triple. In this paper, we compute the asymptotic expected number of hairpins in saturated structures. For instance, if every hairpin is required to contain at least θ=3 unpaired bases and the probability that any two positions can base-pair is p=3/8, then the asymptotic number of saturated structures is 1.34685 · n^(-3/2) · 1.62178^n, and the asymptotic expected number of hairpins follows a normal distribution with mean [Formula: see text]. Similar results are given for values θ=1,3, and p=1,1/2,3/8; for instance, when θ=1 and p=1, the asymptotic expected number of hairpins in saturated secondary structures is 0.123194 · n, a value greater than the asymptotic expected number 0.105573 · n of hairpins over all secondary structures. Since RNA binding targets are often found in hairpin regions, it follows that saturated structures present potentially more binding targets than nonsaturated structures, on average. Next, we describe a novel algorithm to compute the hairpin profile of a given RNA sequence: given RNA sequence a_1, ..., a_n, for each integer k, we compute that secondary structure S_k having minimum energy in the Nussinov energy model, taken over all secondary structures having k hairpins. We expect that an extension of our algorithm to the Turner energy model may provide more accurate structure prediction for particular RNAs, such as tRNAs and purine riboswitches, known to have a particular number of hairpins.
Mathematica(™) computations, C and Python source code, and additional supplementary information are available at the website http://bioinformatics.bc.edu/clotelab/RNAhairpinProfile/ .
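The definition of a saturated structure used in this abstract can be illustrated with a small brute-force check. The following is only a sketch under stated assumptions, not the authors' code: positions are 0-indexed, every base pair is a tuple (i, j) with i < j, the function names are hypothetical, and `can_pair` defaults to "any two positions may pair" (the p=1 case). A hairpin is counted as a base pair with no other base pair nested inside it.

```python
from itertools import combinations

def crosses(p, q):
    """True if base pairs p and q cross, i.e. would form a pseudoknot."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l

def is_saturated(n, pairs, theta=1, can_pair=lambda i, j: True):
    """Check that no base pair can be added to `pairs` on n positions.

    A candidate (i, j) is rejected if it reuses a paired position
    (base triple), encloses fewer than `theta` unpaired bases,
    is forbidden by `can_pair`, or crosses an existing pair (pseudoknot).
    """
    paired = {x for p in pairs for x in p}
    for i, j in combinations(range(n), 2):
        if i in paired or j in paired:
            continue  # would introduce a base triple
        if j - i <= theta:
            continue  # hairpin loop would be too short
        if not can_pair(i, j):
            continue  # these two nucleotides cannot pair
        if any(crosses((i, j), p) for p in pairs):
            continue  # would introduce a pseudoknot
        return False  # (i, j) can still be added
    return True

def count_hairpins(pairs):
    """Count base pairs that have no other base pair nested inside them."""
    return sum(
        1 for (i, j) in pairs
        if not any(i < k and l < j for (k, l) in pairs)
    )
```

For example, on five positions with θ=1 the structure {(0, 4)} is not saturated because (1, 3) can still be added, while {(0, 4), (1, 3)} is saturated and contains one hairpin.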
Subject(s)
Nucleic Acid Conformation, RNA/chemistry, RNA/genetics, Algorithms, Computational Biology, Inverted Repeat Sequences, Mathematical Concepts, Molecular Models
ABSTRACT
Comprehensive knowledge of proteome complexity is crucial to understanding cell function. Amino termini of yeast proteins were identified through peptide mass spectrometry on glutaraldehyde-treated cell lysates as well as a parallel assessment of publicly deposited spectra. An unexpectedly large fraction of detected amino-terminal peptides (35%) mapped to translation initiation at AUG codons downstream of the annotated start codon. Many of the implicated genes have suboptimal sequence contexts for translation initiation near their annotated AUG, and their ribosome profiles show elevated tag densities consistent with translation initiation at downstream AUGs as well as their annotated AUGs. These data suggest that a significant fraction of the yeast proteome derives from initiation at downstream AUGs, increasing significantly the repertoire of encoded proteins and their potential functions and cellular localizations.
Subject(s)
Initiator Codon/metabolism, Fungal Proteins/metabolism, Peptide Mapping/methods, Proteome/analysis, Saccharomycetales/metabolism, Acetylation, Algorithms, Initiator Codon/genetics, Protein Databases, Fungal Proteins/genetics, Fungal Genes, Glutaral/metabolism, Molecular Sequence Annotation, Open Reading Frames, Translational Peptide Chain Initiation, Proteolysis, Proteome/metabolism, Proteomics/methods, Ribosomes/metabolism, Saccharomycetales/genetics, Protein Sequence Analysis, Tandem Mass Spectrometry
ABSTRACT
Let S denote the set of (possibly noncanonical) base pairs {i, j} of an RNA tertiary structure; i.e. {i, j} ∈ S if there is a hydrogen bond between the ith and jth nucleotide. The page number of S, denoted π(S), is the minimum number k such that S can be decomposed into a disjoint union of k secondary structures. Here, we show that computing the page number is NP-complete; we describe an exact computation of page number, using constraint programming, and determine the page number of a collection of RNA tertiary structures, for which the topological genus is known. We describe an approximation algorithm from which it follows that ω(S) ≤ π(S) ≤ ω(S) · log n, where the clique number of S, ω(S), denotes the maximum number of base pairs that pairwise cross each other.
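The relation between clique number and page number can be explored on small examples. The sketch below is illustrative only and is neither the constraint-programming method nor the approximation algorithm of the paper: it assumes each position occurs in at most one base pair (so only crossings matter), computes ω(S) by exhaustive search, and obtains an upper bound on π(S) by a first-fit greedy assignment of pairs to pages. All function names are hypothetical.

```python
from itertools import combinations

def crosses(p, q):
    """True if base pairs p = (i, j) and q = (k, l) cross each other."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l

def clique_number(pairs):
    """Largest set of pairwise crossing base pairs (brute force, small inputs only)."""
    pairs = list(pairs)
    for size in range(len(pairs), 0, -1):
        for subset in combinations(pairs, size):
            if all(crosses(p, q) for p, q in combinations(subset, 2)):
                return size
    return 0

def greedy_pages(pairs):
    """First-fit split into crossing-free pages; len(result) is an upper bound on the page number."""
    pages = []
    for p in sorted(pairs):
        for page in pages:
            if not any(crosses(p, q) for q in page):
                page.append(p)
                break
        else:
            pages.append([p])  # p crosses something on every existing page
    return pages
```

For the three mutually crossing pairs (0, 5), (2, 7), (4, 9), both functions report 3; for the nested pairs (0, 9), (1, 8), (2, 7), both report 1. In general the greedy result only bounds π(S) from above, since first-fit is a heuristic.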
Subject(s)
Base Pairing, Chemical Models, Nucleic Acid Conformation, RNA/chemistry, Hydrogen Bonding, Genetic Models, Molecular Models, Thermodynamics
ABSTRACT
The central questions of bacterial ecology and evolution require a method to consistently demarcate, from the vast and diverse set of bacterial cells within a natural community, the groups playing ecologically distinct roles (ecotypes). Because of a lack of theory-based guidelines, current methods in bacterial systematics fail to divide the bacterial domain of life into meaningful units of ecology and evolution. We introduce a sequence-based approach ("ecotype simulation") to model the evolutionary dynamics of bacterial populations and to identify ecotypes within a natural community, focusing here on two Bacillus clades surveyed from the "Evolution Canyons" of Israel. This approach has identified multiple ecotypes within traditional species, with each predicted to be an ecologically distinct lineage; many such ecotypes were confirmed to be ecologically distinct, with specialization to different canyon slopes with different solar exposures. Ecotype simulation provides a long-needed natural foundation for microbial ecology and systematics.
Subject(s)
Bacillus/classification, Ecology, Algorithms, Computer Simulation, Environmental Pollution, Molecular Sequence Data, Phylogeny
ABSTRACT
Microbial ecologists and systematists are challenged to discover the early ecological changes that drive the splitting of one bacterial population into two ecologically distinct populations. We have aimed to identify newly divergent lineages ("ecotypes") bearing the dynamic properties attributed to species, with the rationale that discovering their ecological differences would reveal the ecological dimensions of speciation. To this end, we have sampled bacteria from the Bacillus subtilis-Bacillus licheniformis clade from sites differing in solar exposure and soil texture within a Death Valley canyon. Within this clade, we hypothesized ecotype demarcations based on DNA sequence diversity, through analysis of the clade's evolutionary history by Ecotype Simulation (ES) and AdaptML. Ecotypes so demarcated were found to be significantly different in their associations with solar exposure and soil texture, suggesting that these and covarying environmental parameters are among the dimensions of ecological divergence for newly divergent Bacillus ecotypes. Fatty acid composition appeared to contribute to ecotype differences in temperature adaptation, since those ecotypes with more warm-adapting fatty acids were isolated more frequently from sites with greater solar exposure. The recognized species and subspecies of the B. subtilis-B. licheniformis clade were found to be nearly identical to the ecotypes demarcated by ES, with a few exceptions where a recognized taxon is split at most into three putative ecotypes. Nevertheless, the taxa recognized do not appear to encompass the full ecological diversity of the B. subtilis-B. licheniformis clade: ES and AdaptML identified several newly discovered clades as ecotypes that are distinct from any recognized taxon.
Subject(s)
Bacillus/classification, Bacillus/genetics, Biodiversity, Ecosystem, Environmental Microbiology, Bacillus/chemistry, Bacillus/isolation & purification, Cluster Analysis, Bacterial DNA/chemistry, Bacterial DNA/genetics, Fatty Acids/analysis, Genetic Speciation, Genotype, Molecular Sequence Data, Phylogeny, DNA Sequence Analysis, Sequence Homology, United States
ABSTRACT
It is a classical result of Stein and Waterman that the asymptotic number of RNA secondary structures is 1.104366 · n^(-3/2) · 2.618034^n. In this paper, we study combinatorial asymptotics for two special subclasses of RNA secondary structures: canonical and saturated structures. Canonical secondary structures are defined to have no lonely (isolated) base pairs. This class of secondary structures was introduced by Bompfünewerer et al., who noted that the run time of Vienna RNA Package is substantially reduced when restricting computations to canonical structures. Here we provide an explanation for the speed-up, by proving that the asymptotic number of canonical RNA secondary structures is 2.1614 · n^(-3/2) · 1.96798^n and that the expected number of base pairs in a canonical secondary structure is 0.31724 · n. The asymptotic number of canonical secondary structures was obtained much earlier by Hofacker, Schuster and Stadler using a different method. Saturated secondary structures have the property that no base pairs can be added without violating the definition of secondary structure (i.e. introducing a pseudoknot or base triple). Here we show that the asymptotic number of saturated structures is 1.07427 · n^(-3/2) · 2.35467^n, the asymptotic expected number of base pairs is 0.337361 · n, and the asymptotic number of saturated stem-loop structures is 0.323954 · 1.69562^n, in contrast to the number 2^(n-2) of (arbitrary) stem-loop structures as classically computed by Stein and Waterman. Finally, we apply the work of Drmota to show that the density of states for [all resp. canonical resp. saturated] secondary structures is asymptotically Gaussian. We introduce a stochastic greedy method to sample random saturated structures, called quasi-random saturated structures, and show that the expected number of base pairs is 0.340633 · n.
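The Stein and Waterman count quoted above can be compared against exact values from a short dynamic program. This is a minimal sketch, assuming the standard setting in which any two positions may pair and every hairpin loop contains at least θ=1 unpaired base; the function names are hypothetical. The recurrence splits on whether the last position is unpaired or paired with some earlier position i.

```python
def count_structures(n, theta=1):
    """Exact number of secondary structures on n positions.

    N[m] = N[m-1]                         (position m unpaired)
         + sum_i N[i-1] * N[m-i-1]        (position m paired with i, m - i > theta)
    """
    N = [1] * (n + 1)  # N[0] = 1: the empty structure
    for m in range(1, n + 1):
        total = N[m - 1]
        for i in range(1, m - theta):  # i = 1 .. m - theta - 1
            total += N[i - 1] * N[m - i - 1]
        N[m] = total
    return N[n]

def stein_waterman_asymptotic(n):
    """Asymptotic estimate 1.104366 * n^(-3/2) * 2.618034^n quoted in the abstract."""
    return 1.104366 * n ** -1.5 * 2.618034 ** n

if __name__ == "__main__":
    for n in (10, 50, 100, 200):
        exact = count_structures(n)
        print(n, exact / stein_waterman_asymptotic(n))
```

The first exact values are 1, 1, 1, 2, 4, 8, 17, 37 for n = 0 through 7; the printed ratio of exact to asymptotic counts lets a reader see how quickly the estimate becomes accurate as n grows.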
Subject(s)
Computational Biology/methods, Nucleic Acid Conformation, RNA/chemistry, Base Sequence, Computer Simulation, Methanococcaceae/chemistry, Methanococcaceae/genetics, Molecular Models, Statistical Models, Molecular Sequence Data, Archaeal RNA/chemistry, Archaeal RNA/genetics, 5S Ribosomal RNA/chemistry, 5S Ribosomal RNA/genetics, Software, Stochastic Processes
ABSTRACT
Identification of closely related, ecologically distinct populations of bacteria would benefit microbiologists working in many fields including systematics, epidemiology and biotechnology. Several laboratories have recently developed algorithms aimed at demarcating such 'ecotypes'. We examine the ability of four of these algorithms to correctly identify ecotypes from sequence data. We tested the algorithms on synthetic sequences, with known history and habitat associations, generated under the stable ecotype model and on data from Bacillus strains isolated from Death Valley where previous work has confirmed the existence of multiple ecotypes. We found that one of the algorithms (ecotype simulation) performs significantly better than the others (AdaptML, GMYC, BAPS) in both instances. Unfortunately, it was also shown to be the least efficient of the four. While ecotype simulation is the most accurate, it is by a large margin the slowest of the algorithms tested. Attempts at improving its efficiency are underway.
Subject(s)
Algorithms, Bacillus/classification, Computational Biology/methods, Ecotype, DNA Sequence Analysis/methods, Bacillus/genetics, Bacterial Genes, Statistical Models, Software, Species Specificity
ABSTRACT
BACKGROUND: RNA folding depends on the distribution of kinetic traps in the landscape of all secondary structures. Kinetic traps in the Nussinov energy model are precisely those secondary structures that are saturated, meaning that no base pair can be added without introducing either a pseudoknot or base triple. In previous work, we investigated asymptotic combinatorics of both random saturated structures and of quasi-random saturated structures, where the latter are constructed by a natural stochastic process. RESULTS: We prove that for quasi-random saturated structures with the uniform distribution, the asymptotic expected number of external loops is O(log n) and the asymptotic expected maximum stem length is O(log n), while under the Zipf distribution, the asymptotic expected number of external loops is O(log^2 n) and the asymptotic expected maximum stem length is O(log n / log log n). CONCLUSIONS: Quasi-random saturated structures are generated by a stochastic greedy method, which is simple to implement. Structural features of random saturated structures appear to resemble those of quasi-random saturated structures, and the latter appear to constitute a class for which both the generation of sampled structures as well as a combinatorial investigation of structural features may be simpler to undertake.
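A stochastic greedy sampler of the kind mentioned in the conclusions can be sketched in a few lines. The abstract does not spell out the procedure, so the version below is one plausible reading and should be treated as an assumption: starting from the empty structure, a base pair is drawn uniformly at random from all pairs that can still be added, until none remains. It covers only the uniform case (not the Zipf distribution), assumes any two positions may pair, and uses hypothetical function names.

```python
import random
from itertools import combinations

def crosses(p, q):
    """True if base pairs p and q cross (pseudoknot)."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l

def addable_pairs(n, pairs, theta=1):
    """All (i, j) that can be added without a base triple, a pseudoknot,
    or a hairpin loop with fewer than `theta` unpaired bases."""
    paired = {x for p in pairs for x in p}
    return [
        (i, j)
        for i, j in combinations(range(n), 2)
        if i not in paired and j not in paired
        and j - i > theta
        and not any(crosses((i, j), p) for p in pairs)
    ]

def sample_saturated(n, theta=1, rng=None):
    """Add uniformly chosen addable pairs until the structure is saturated."""
    rng = rng or random.Random()
    pairs = []
    while True:
        candidates = addable_pairs(n, pairs, theta)
        if not candidates:
            return sorted(pairs)  # nothing can be added: saturated
        pairs.append(rng.choice(candidates))

if __name__ == "__main__":
    print(sample_saturated(20, theta=1, rng=random.Random(0)))
```

The loop terminates because each step uses two previously unpaired positions, and the result is saturated by construction. Passing a seeded `random.Random` makes a run reproducible.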
ABSTRACT
Microbiologists are challenged to explain the origins of enormous numbers of bacterial species worldwide. Contributing to this extreme diversity may be a simpler process of speciation in bacteria than in animals and plants, requiring neither sexual nor geographical isolation between nascent species. Here, we propose and test a novel hypothesis for the extreme diversity of bacterial species: that splitting of one population into multiple ecologically distinct populations (cladogenesis) may be as frequent as adaptive improvements within a single population's lineage (anagenesis). We employed a set of experimental microcosms to address the relative rates of adaptive cladogenesis and anagenesis among the descendants of a Bacillus subtilis clone, in the absence of competing species. Analysis of the evolutionary trajectories of genetic markers indicated that in at least 7 of 10 replicate microcosm communities, the original population founded one or more new, ecologically distinct populations (ecotypes) before a single anagenetic event occurred within the original population. We were able to support this inference by identifying putative ecotypes formed in these communities through differences in genetic marker association, colony morphology and microhabitat association; we then confirmed the ecological distinctness of these putative ecotypes in competition experiments. Adaptive mutations leading to new ecotypes appeared to be about as common as those improving fitness within an existing ecotype. These results suggest near parity of anagenesis and cladogenesis rates in natural populations that are depauperate of bacterial diversity.
Subject(s)
Bacillus subtilis/classification, Bacillus subtilis/genetics, Genetic Speciation, Physiological Adaptation, Bacillus subtilis/physiology, Biological Evolution, Ecotype, Population Genetics, Geography
ABSTRACT
We present results of computer experiments that indicate that several RNAs for which the native state (minimum free energy secondary structure) is functionally important (type III hammerhead ribozymes, signal recognition particle RNAs, U2 small nucleolar spliceosomal RNAs, certain riboswitches, etc.) all have lower folding energy than random RNAs of the same length and dinucleotide frequency. Additionally, we find that whole mRNA as well as 5'-UTR, 3'-UTR, and cds regions of mRNA have folding energies comparable to that of random RNA, although there may be a statistically insignificant trace signal in 3'-UTR and cds regions. Various authors have used nucleotide (approximate) pattern matching and the computation of minimum free energy as filters to detect potential RNAs in ESTs and genomes. We introduce a new concept of the asymptotic Z-score and describe a fast, whole-genome scanning algorithm to compute asymptotic minimum free energy Z-scores of moving-window contents. Asymptotic Z-score computations offer another filter, to be used along with nucleotide pattern matching and minimum free energy computations, to detect potential functional RNAs in ESTs and genomic regions.
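The comparison of a native folding energy with the energies of random RNAs is commonly summarized as a Z-score. The sketch below shows only the ordinary Z-score, not the asymptotic Z-score introduced in the paper, whose definition is not given in this abstract. It assumes the minimum free energies have already been computed by some folding program; the function name and the numbers in the usage example are hypothetical.

```python
from statistics import mean, pstdev

def energy_z_score(native_energy, random_energies):
    """Z-score of a native folding energy against energies of randomized sequences.

    A strongly negative value means the native sequence folds to a lower
    energy than random sequences of the same length and composition.
    """
    mu = mean(random_energies)
    sigma = pstdev(random_energies)  # population standard deviation
    if sigma == 0:
        raise ValueError("random energies have zero variance")
    return (native_energy - mu) / sigma

if __name__ == "__main__":
    # Hypothetical energies in kcal/mol, for illustration only.
    print(energy_z_score(-30.0, [-20.0, -22.0, -18.0, -20.0]))
```

In practice the random energies would come from many shuffled sequences that preserve dinucleotide frequency, as the abstract describes, and a sample standard deviation could be used instead of the population one.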
Subject(s)
Nucleic Acid Conformation, Nucleotides/analysis, RNA/chemistry, RNA/genetics, 3' Untranslated Regions/chemistry, 3' Untranslated Regions/genetics, 3' Untranslated Regions/metabolism, 5' Untranslated Regions/chemistry, 5' Untranslated Regions/genetics, 5' Untranslated Regions/metabolism, Algorithms, Base Composition, Base Sequence, Computational Biology, Computer Simulation, Expressed Sequence Tags, Markov Chains, Nucleotides/chemistry, Nucleotides/genetics, Nucleotides/metabolism, RNA/metabolism, Thermodynamics
ABSTRACT
It is known (Reidys et al., 1997b. Bull. Math. Biol. 59(2), 339-397) that for any two secondary structures S, S' there exists an RNA sequence compatible with both, and that this result does not extend to more than two secondary structures. Indeed, a simple formula for the number of RNA sequences compatible with secondary structures S, S' plays a role in the algorithms of Flamm et al. (2001. RNA 7, 254-265) and of Abfalter et al. (2003. Proceedings of the German Conference on Bioinformatics, ) to design an RNA switch. Here we show that a natural extension of this problem is NP-complete. Unless P=NP, there is no polynomial time algorithm which, when given secondary structures S_1, ..., S_k, for k ≥ 4, determines the least number of positions such that, after removal of all base pairs incident to these positions, there exists an RNA nucleotide sequence compatible with the given secondary structures. We also consider a restricted version of this problem with a "fixed maximum" number of possible stars and show that it has a simple polynomial time solution.
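Compatibility of a sequence with several structures is easy to check directly, which makes the contrast between two and three structures concrete. The sketch below assumes that the allowed base pairs are the Watson-Crick pairs plus the GU wobble pair (the abstract does not list them), ignores any minimum hairpin-loop length, and uses hypothetical function names. Because every allowed pair joins a purine to a pyrimidine, three structures whose base pairs form a triangle admit no compatible sequence.

```python
from itertools import product

# Assumed pairing rules: Watson-Crick pairs and the GU wobble pair.
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def is_compatible(seq, pairs):
    """True if every base pair (i, j) of the structure joins allowed nucleotides."""
    return all((seq[i], seq[j]) in ALLOWED for i, j in pairs)

def compatible_with_all(seq, structures):
    """True if seq is compatible with each structure in the list."""
    return all(is_compatible(seq, s) for s in structures)

def exists_compatible_sequence(n, structures):
    """Brute-force search over all 4^n sequences (small n only)."""
    return any(
        compatible_with_all(seq, structures)
        for seq in product("ACGU", repeat=n)
    )

if __name__ == "__main__":
    two = [[(0, 3)], [(1, 3)]]            # two structures sharing position 3
    three = [[(0, 1)], [(1, 2)], [(0, 2)]]  # pairs form a triangle
    print(exists_compatible_sequence(4, two))    # a compatible sequence exists, e.g. GGAC
    print(exists_compatible_sequence(3, three))  # no sequence can satisfy all three
```

The exhaustive search is exponential in n and is meant only to illustrate the definition on toy inputs.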