ABSTRACT
BACKGROUND: For decades, codon usage has been used as a measure of adaptation for translational efficiency and translation accuracy of a gene's coding sequence. These patterns of codon usage reflect both the selective and mutational environment in which the coding sequences evolved. Over this same period, gene transfer between lineages has become widely recognized as an important biological phenomenon. Nevertheless, most studies of codon usage implicitly assume that all genes within a genome evolved under the same selective and mutational environment, an assumption violated when introgression occurs. In order to better understand the effects of introgression on codon usage patterns and vice versa, we examine the patterns of codon usage in Lachancea kluyveri, a yeast which has experienced a large introgression. We quantify the effects of mutation bias and selection for translation efficiency on the codon usage pattern of the endogenous and introgressed exogenous genes using a Bayesian mixture model, ROC SEMPPR, which is built on mechanistic assumptions about protein synthesis and grounded in population genetics. RESULTS: We find substantial differences in codon usage between the endogenous and exogenous genes, and show that these differences can be largely attributed to differences in mutation bias favoring A/T ending codons in the endogenous genes while favoring C/G ending codons in the exogenous genes. Recognizing the two different signatures of mutation bias and selection improves our ability to predict protein synthesis rate by 42% and allows us to accurately assess the decaying signal of endogenous codon mutation and preferences. 
In addition, using our estimates of mutation bias and selection, we identify Eremothecium gossypii as the closest relative to the exogenous genes, providing an alternative hypothesis about the origin of the exogenous genes, estimate that the introgression occurred ~6 × 10^8 generations ago, and estimate its historic and current selection against mismatched codon usage. CONCLUSIONS: Our work illustrates how mechanistic, population genetic models like ROC SEMPPR can separate the effects of mutation and selection on codon usage and provide quantitative estimates from sequence data.
Subjects
Codon Usage; Genetics, Population; Models, Genetic; Saccharomycetales/genetics; Selection, Genetic; Bayes Theorem; Mutation
ABSTRACT
We present a new phylogenetic approach, selection on amino acids and codons (SelAC), whose substitution rates are based on a nested model linking protein expression to population genetics. Unlike simpler codon models that assume a single substitution matrix for all sites, our model more realistically represents the evolution of protein-coding DNA under the assumption of consistent, stabilizing selection using a cost-benefit approach. This cost-benefit approach allows us to generate a set of 20 optimal amino acid-specific matrix families using just a handful of parameters and naturally links the strength of stabilizing selection to protein synthesis levels, which we can estimate. Using a yeast data set of 100 orthologs for 6 taxa, we find SelAC fits the data much better than popular models by 10^4 to 10^5 Akaike information criterion units adjusted for small sample bias. Our results also indicated that nested, mechanistic models better predict observed data patterns, highlighting the improvement in biological realism in amino acid sequence evolution that our model provides. Additional parameters estimated by SelAC indicate that a large amount of nonphylogenetic, but biologically meaningful, information can be inferred from existing data. For example, SelAC prediction of gene-specific protein synthesis rates correlates well with both empirical (r=0.33-0.48) and other theoretical predictions (r=0.45-0.64) for multiple yeast species. SelAC also provides estimates of the optimal amino acid at each site. Finally, because SelAC is a nested approach based on clearly stated biological assumptions, future modifications, such as including shifts in the optimal amino acid sequence within or across lineages, are possible.
Subjects
Amino Acid Substitution; Genetic Techniques; Models, Genetic; Phylogeny; Selection, Genetic; Genetics, Population/methods
ABSTRACT
Summary: AnaCoDa is an R package for estimating biologically relevant parameters of mixture models, such as selection against translation inefficiency, nonsense errors and ribosome pausing time, from genomic and high-throughput datasets. AnaCoDa provides an adaptive Bayesian MCMC algorithm, fully implemented in C++ for high performance, with an ergonomic R interface to improve usability. AnaCoDa employs a generic object-oriented design to allow users to extend the framework and implement their own models. Current models implemented in AnaCoDa can accurately estimate biologically relevant parameters given either protein coding sequences or ribosome foot-printing data. Optionally, AnaCoDa can utilize additional data sources, such as gene expression measurements, to aid model fitting and parameter estimation. By utilizing a hierarchical object structure, some parameters can vary between sets of genes while others can be shared. Genes may be assigned to clusters or membership may be estimated by AnaCoDa. This flexibility allows users to estimate the same model parameter under different biological conditions and categorize genes into different sets based on shared model properties embedded within the data. AnaCoDa also allows users to generate simulated data which can be used to aid model development and model analysis as well as evaluate model adequacy. Finally, AnaCoDa contains a set of visualization routines and the ability to revisit or re-initiate previous model fitting, providing researchers with a well-rounded, easy-to-use framework to analyze genome-scale data. Availability and implementation: AnaCoDa is freely available under the Mozilla Public License 2.0 on CRAN (https://cran.r-project.org/web/packages/AnaCoDa/).
Subjects
Codon; Genomics/methods; Models, Genetic; Sequence Analysis, DNA/methods; Software; Algorithms; Bayes Theorem
ABSTRACT
BACKGROUND: Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow for multiple tags to be generated from a given mRNA transcript. The probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3'-most tag, but doing so discards a large fraction of the observed data. Taking this bias into account should allow more of the available data to be used, leading to increased statistical power. RESULTS: Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression. A description is given as to how these algorithms can be applied in that context. CONCLUSIONS: Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm, providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and negative findings due to variation in tag formation probabilities across samples when testing for differential expression.
Subjects
Bayes Theorem; Bias; Gene Expression Profiling; RNA, Messenger/genetics
ABSTRACT
Background: Identifying social determinants of myocardial infarction (MI) hospitalizations is crucial for reducing/eliminating health disparities. Therefore, our objectives were to identify sociodemographic determinants of MI hospitalization risks and to assess if the impacts of these determinants vary by geographic location in Florida. Methods and Results: This is a retrospective ecologic study at the county level. We obtained data for principal and secondary MI hospitalizations for Florida residents for the 2005-2014 period and calculated age- and sex-adjusted MI hospitalization risks. We used a multivariable negative binomial model to identify sociodemographic determinants of MI hospitalization risks and a geographically weighted negative binomial model to assess if the strength of associations varies by location. There were 645 935 MI hospitalizations (median age, 72 years; 58.1%, men; 73.9%, white). Age- and sex-adjusted risks ranged from 18.49 to 69.48 cases/10 000 persons, and they were significantly higher in counties with low education levels (risk ratio [RR]=1.033, P<0.0001) and high divorce rate (RR, 1.026; P=0.032). However, they were significantly lower in counties with high proportions of rural (RR, 0.996; P<0.0001), black (RR, 0.995; P=0.018), and uninsured populations (RR, 0.983; P=0.040). Associations of MI hospitalization risks with education level and uninsured rate varied geographically (P for non-stationarity test=0.001 and 0.043, respectively), with strongest associations in southern Florida (RR for ...
Subjects
Hospitalization; Myocardial Infarction/epidemiology; Social Determinants of Health; Socioeconomic Factors; Adolescent; Adult; Black or African American; Aged; Child; Child, Preschool; Divorce; Educational Status; Female; Florida/epidemiology; Humans; Infant; Infant, Newborn; Male; Medically Uninsured; Middle Aged; Myocardial Infarction/diagnosis; Myocardial Infarction/therapy; Race Factors; Retrospective Studies; Risk Assessment; Risk Factors; Rural Population; Time Factors; Young Adult
ABSTRACT
BACKGROUND: Serial Analysis of Gene Expression (SAGE) is a high-throughput method for inferring mRNA expression levels from the experimentally generated sequence based tags. Standard analyses of SAGE data, however, ignore the fact that the probability of generating an observable tag varies across genes and between experiments. As a consequence, these analyses result in biased estimators and posterior probability intervals for gene expression levels in the transcriptome. RESULTS: Using the yeast Saccharomyces cerevisiae as an example, we introduce a new Bayesian method of data analysis which is based on a model of SAGE tag formation. Our approach incorporates the variation in the probability of tag formation into the interpretation of SAGE data and allows us to derive exact joint and approximate marginal posterior distributions for the mRNA frequency of genes detectable using SAGE. Our analysis of these distributions indicates that the frequency of a gene in the tag pool is influenced by its mRNA frequency, the cleavage efficiency of the anchoring enzyme (AE), and the number of informative and uninformative AE cleavage sites within its mRNA. CONCLUSION: With a mechanistic, model based approach for SAGE data analysis, we find that inter-genic variation in SAGE tag formation is large. However, this variation can be estimated and, importantly, accounted for using the methods we develop here. As a result, SAGE based estimates of mRNA frequencies can be adjusted to remove the bias introduced by the SAGE tag formation process.
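The dependence of tag-pool frequency on cleavage efficiency and on the number of informative and uninformative sites can be illustrated with a toy calculation. This is a minimal sketch, not the authors' model: it assumes each AE site is cleaved independently with probability `p` and that the tag comes from the 3'-most cleaved site. The function names and the independence assumption are hypothetical.

```python
def tag_probabilities(n_sites: int, p: float) -> list[float]:
    """Probability that each AE site yields the tag.

    Sites are indexed from the 3' end. Under the (assumed) independent-cleavage
    model, site k yields the tag only if it is cut and the k sites closer to
    the 3' end are all uncut.
    """
    return [p * (1.0 - p) ** k for k in range(n_sites)]


def observable_fraction(informative: list[bool], p: float) -> float:
    """Fraction of transcripts of one gene expected to yield an informative tag.

    `informative[k]` marks whether the tag from site k maps uniquely to the gene.
    """
    probs = tag_probabilities(len(informative), p)
    return sum(pr for pr, ok in zip(probs, informative) if ok)


if __name__ == "__main__":
    # Three sites, the middle one uninformative, 50% cleavage efficiency.
    print(observable_fraction([True, False, True], p=0.5))
```

Under these assumptions, two genes with equal mRNA frequency but different site layouts contribute different numbers of usable tags, which is the kind of inter-genic variation the abstract describes.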
Subjects
Expressed Sequence Tags; Gene Expression Profiling/methods; Models, Genetic; Sequence Analysis, DNA/methods; Transcription Factors/genetics; Bayes Theorem; Computer Simulation; Data Interpretation, Statistical; Databases, Genetic; Reproducibility of Results; Sensitivity and Specificity
ABSTRACT
Extracting biologically meaningful information from the continuing flood of genomic data is a major challenge in the life sciences. Codon usage bias (CUB) is a general feature of most genomes and is thought to reflect the effects of both natural selection for efficient translation and mutation bias. Here we present a mechanistically interpretable, Bayesian model (ribosome overhead costs Stochastic Evolutionary Model of Protein Production Rate [ROC SEMPPR]) to extract meaningful information from patterns of CUB within a genome. ROC SEMPPR is grounded in population genetics and allows us to separate the contributions of mutational biases and natural selection against translational inefficiency on a gene-by-gene and codon-by-codon basis. Until now, the primary disadvantage of similar approaches was the need for genome scale measurements of gene expression. Here, we demonstrate that it is possible to both extract accurate estimates of codon-specific mutation biases and translational efficiencies while simultaneously generating accurate estimates of gene expression, rather than requiring such information. We demonstrate the utility of ROC SEMPPR using the Saccharomyces cerevisiae S288c genome. When we compare our model fits with previous approaches we observe an exceptionally high agreement between estimates of both codon-specific parameters and gene expression levels ([Formula: see text] in all cases). We also observe strong agreement between our parameter estimates and those derived from alternative data sets. For example, our estimates of mutation bias and those from mutational accumulation experiments are highly correlated ([Formula: see text]). 
Our estimates of codon-specific translational inefficiencies and tRNA copy number-based estimates of ribosome pausing time ([Formula: see text]), and mRNA and ribosome profiling footprint-based estimates of gene expression ([Formula: see text]) are also highly correlated, thus supporting the hypothesis that selection against translational inefficiency is an important force driving the evolution of CUB. Surprisingly, we find that for particular amino acids, codon usage in highly expressed genes can still be largely driven by mutation bias and that failing to take mutation bias into account can lead to the misidentification of an amino acid's "optimal" codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome-scale patterns of codon usage; accessing this information does not require gene expression measurements, but instead carefully formulated, biologically interpretable models.
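How mutation bias and selection can pull codon frequencies in opposite directions may be easier to see numerically. The sketch below assumes a multinomial-logit form in which the probability of codon i within an amino acid is proportional to exp(-ΔM_i - Δη_i·φ), with ΔM a mutation-bias term, Δη a translational-inefficiency term, and φ the gene's expression level. This functional form, the symbol names and the numeric values are assumptions for illustration; the abstract itself does not give the equations.

```python
import math


def codon_probabilities(delta_m, delta_eta, phi):
    """Synonymous codon probabilities for one amino acid in one gene.

    delta_m[i]   : assumed mutation-bias term for codon i (relative to a reference)
    delta_eta[i] : assumed translational-inefficiency term for codon i
    phi          : assumed expression level (protein synthesis rate) of the gene
    """
    weights = [math.exp(-dm - de * phi) for dm, de in zip(delta_m, delta_eta)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Codon 0 is the reference; codon 1 is favored by mutation (dM < 0)
    # but disfavored by selection (d_eta > 0). Values are illustrative only.
    delta_m = [0.0, -0.5]
    delta_eta = [0.0, 0.4]
    for phi in (0.1, 5.0):
        print(phi, codon_probabilities(delta_m, delta_eta, phi))
```

With these illustrative values the mutationally favored codon dominates in the low-expression gene and the selectively favored codon dominates in the high-expression gene, which is why ignoring mutation bias can mislabel the "optimal" codon.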
Subjects
Codon; Evolution, Molecular; Genomics/methods; Mutation; Protein Biosynthesis; Selection, Genetic; Gene Expression; Models, Genetic; Saccharomyces cerevisiae/genetics
ABSTRACT
OBJECTIVE: Previous studies of pavement management factors that relate to the occurrence of traffic-related crashes are rare. Traditional research has mostly employed summary statistics of bidirectional pavement quality measurements in extended longitudinal road segments over a long time period, which may cause a loss of important information and result in biased parameter estimates. The research presented in this article focuses on crash risk of roadways with overall fair to good pavement quality. Real-time and location-specific data were employed to estimate the effects of pavement management factors on the occurrence of crashes. METHODS: This research is based on the crash data and corresponding pavement quality data for the Tennessee state route highways from 2004 to 2009. The potential temporal and spatial correlations among observations caused by unobserved factors were considered. Overall 6 models were built accounting for no correlation, temporal correlation only, and both the temporal and spatial correlations. These models included Poisson, negative binomial (NB), one random effect Poisson and negative binomial (OREP, ORENB), and two random effect Poisson and negative binomial (TREP, TRENB) models. The Bayesian method was employed to construct these models. The inference is based on the posterior distribution from the Markov chain Monte Carlo (MCMC) simulation. These models were compared using the deviance information criterion. RESULTS: Analysis of the posterior distribution of parameter coefficients indicates that the pavement management factors indexed by Present Serviceability Index (PSI) and Pavement Distress Index (PDI) had significant impacts on the occurrence of crashes, whereas the variable rutting depth was not significant. Among other factors, lane width, median width, type of terrain, and posted speed limit were significant in affecting crash frequency. 
CONCLUSIONS: The findings of this study indicate that a reduction in pavement roughness would reduce the likelihood of traffic-related crashes. Hence, maintaining a low level of pavement roughness is strongly suggested. In addition, the results suggested that the temporal correlation among observations was significant and that the ORENB model outperformed all other models.
Subjects
Accidents, Traffic/prevention & control; Environment Design/statistics & numerical data; Safety/statistics & numerical data; Bayes Theorem; Friction; Humans; Models, Statistical; Risk Assessment
ABSTRACT
OBJECTIVE: The severity of traffic-related injuries has been studied by many researchers in recent decades. However, the evaluation of many factors is still in dispute and, until this point, few studies have taken into account pavement management factors as points of interest. The objective of this article is to evaluate the combined influences of pavement management factors and traditional traffic engineering factors on the injury severity of 2-vehicle crashes. METHODS: This study examines 2-vehicle rear-end, sideswipe, and angle collisions that occurred on Tennessee state routes from 2004 to 2008. Both the traditional ordered probit (OP) model and Bayesian ordered probit (BOP) model with weak informative prior were fitted for each collision type. The performances of these models were evaluated based on the parameter estimates and deviances. RESULTS: The results indicated that pavement management factors played identical roles in all 3 collision types. Pavement serviceability produces significant positive effects on the severity of injuries. The pavement distress index (PDI), rutting depth (RD), and rutting depth difference between right and left wheels (RD_df) were not significant in any of these 3 collision types. The effects of traffic engineering factors varied across collision types, except that a few were consistently significant in all 3 collision types, such as annual average daily traffic (AADT), rural-urban location, speed limit, peaking hour, and light condition. CONCLUSIONS: The findings of this study indicated that improved pavement quality does not necessarily lessen the severity of injuries when a 2-vehicle crash occurs. The effects of traffic engineering factors are not universal but vary by the type of crash. The study also found that the BOP model with a weak informative prior can be used as an alternative but was not superior to the traditional OP model in terms of overall performance.
Subjects
Accidents, Traffic/statistics & numerical data; Models, Statistical; Trauma Severity Indices; Wounds and Injuries/etiology; Adolescent; Adult; Aged; Bayes Theorem; Environment Design/statistics & numerical data; Ergonomics; Female; Humans; Male; Middle Aged; Risk Factors; Tennessee; Transportation; Young Adult
ABSTRACT
The severity of traffic-related injuries has been studied by many researchers in recent decades. However, previous research has seldom accounted for the effects of curbed outside shoulders on traffic-related injury severity. This study applies the zero-inflated ordered probit (ZIOP) model to evaluate the influences of curbed outside shoulders, speed limit change, as well as other traditional factors on the injury severity of single-vehicle crashes. Crash data from 2003 to 2007 in the Illinois Highway Safety Database were employed in this study. The ZIOP model assumes that injury severity comes from two distinct sources: injury propensity, and injury severity when the crash falls into the injury-prone category. The modeling results show that, on one hand, single-vehicle crashes that occur on roadways with curbed outside shoulders are more likely to be injury prone. On the other hand, the existence of a curb decreases the likelihood of severe injury if the crash was in the injury-prone category. As a result, the marginal effect analysis implies that the presence of curbs is associated with a higher likelihood of no-injury and minor-injury crashes, but a lower likelihood of incapacitating-injury and fatal crashes. In addition, in the presence of curbed outside shoulders, the change of speed limit adds no significant impact to the injury severity of single-vehicle crashes. Moreover, the modeling results also highlight some interesting effects caused by vehicle type, light and weather conditions, and drivers' characteristics, as well as crash type and location. Through a comprehensive evaluation of the modeling results, the authors find that the ZIOP model performs well relative to the traditional ordered probit (OP) model, and can serve as an alternative in future studies of crash injury severity.
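The two-source structure of the ZIOP model can be sketched as a probit "injury-prone" stage multiplied into an ordered probit severity stage, with the not-prone mass added to the lowest category. The following is a minimal sketch of that textbook form. The linear-index inputs, cutpoints and function names are hypothetical, and the study's fitted specification is not given in the abstract.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ziop_probabilities(propensity_index, severity_index, cutpoints):
    """Category probabilities under a zero-inflated ordered probit.

    propensity_index : linear predictor of the probit 'injury-prone' stage
    severity_index   : linear predictor of the ordered probit stage
    cutpoints        : increasing thresholds separating severity categories
    Returns one probability per category (lowest category first).
    """
    p_prone = norm_cdf(propensity_index)
    bounds = [-math.inf] + list(cutpoints) + [math.inf]
    cdf = [norm_cdf(b - severity_index) for b in bounds]
    ordered = [cdf[j + 1] - cdf[j] for j in range(len(bounds) - 1)]
    probs = [p_prone * q for q in ordered]
    probs[0] += 1.0 - p_prone  # crashes that are not injury prone
    return probs


if __name__ == "__main__":
    print(ziop_probabilities(0.0, 0.0, cutpoints=[0.0, 1.0]))
```

In this form a covariate can raise the propensity index while lowering the severity index, which is the pattern the abstract reports for curbed outside shoulders.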
Subjects
Accidents, Traffic/statistics & numerical data; Environment Design; Models, Statistical; Probability; Wounds and Injuries/epidemiology; Accidents, Traffic/mortality; Automobile Driving/statistics & numerical data; Humans; Illinois/epidemiology; Injury Severity Score; Wounds and Injuries/pathology
ABSTRACT
Codon usage bias (CUB) has been documented across a wide range of taxa and is the subject of numerous studies. While most explanations of CUB invoke some type of natural selection, most measures of CUB adaptation are heuristically defined. In contrast, we present a novel and mechanistic method for defining and contextualizing CUB adaptation to reduce the cost of nonsense errors during protein translation. Using a model of protein translation, we develop a general approach for measuring the protein production cost in the face of nonsense errors of a given allele as well as the mean and variance of these costs across its coding synonyms. We then use these results to define the nonsense error adaptation index (NAI) of the allele or a contiguous subset thereof. Conceptually, the NAI value of an allele is a relative measure of its elevation on a specific and well-defined adaptive landscape. To illustrate its utility, we calculate NAI values for the entire coding sequence and across a set of nonoverlapping windows for each gene in the Saccharomyces cerevisiae S288c genome. Our results provide clear evidence of adaptation to reduce the cost of nonsense errors and increasing adaptation with codon position and expression. The magnitude and nature of this adaptation are also largely consistent with simulation results in which nonsense errors are the only selective force driving CUB evolution. Because NAI is derived from mechanistic models, it is both easier to interpret and more amenable to future refinement than other commonly used measures of codon bias. Further, our approach can also be used as a starting point for developing other mechanistically derived measures of adaptation such as for translational accuracy.
Subjects
Adaptation, Physiological; Codon, Nonsense; Codon/genetics; Codon/metabolism; Protein Biosynthesis/genetics; Alleles; Genome, Fungal/genetics; Models, Genetic; Saccharomyces cerevisiae/genetics
ABSTRACT
We introduce the Skill Plot, a method that is directly relevant to a decision maker who must use a diagnostic test. In contrast to ROC curves, the skill curve allows easy graphical inspection of the optimal cutoff or decision rule for a diagnostic test. The skill curve and test also determine whether diagnoses based on this cutoff improve upon a naive forecast (of always present or of always absent). The skill measure makes it easy to directly compare the predictive utility of two different classifiers, in analogy to the area under the curve statistic related to ROC analysis. Finally, this article shows that the skill-based cutoff inferred from the plot is equivalent to the cutoff indicated by optimizing the posterior odds in accordance with Bayesian decision theory. A method for constructing a confidence interval for this optimal point is presented and briefly discussed.
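The idea of scoring each cutoff against a naive forecast can be sketched as follows. This is an illustrative stand-in, not the article's definition: skill is taken here to be 1 minus the ratio of the rule's misclassifications to those of the better of the two naive forecasts (always present or always absent), with equal costs for both error types. The function name and that skill definition are assumptions.

```python
def skill_curve(scores, labels, cutoffs):
    """Return (cutoff, skill) pairs for a diagnostic score.

    scores  : test values, higher meaning 'more likely present'
    labels  : 1 if the condition is present, 0 otherwise
    cutoffs : candidate decision thresholds (diagnose present if score >= cutoff)
    Skill > 0 means the rule beats the best naive forecast (assumed definition).
    """
    n_pos = sum(labels)
    naive_errors = min(n_pos, len(labels) - n_pos)
    if naive_errors == 0:
        raise ValueError("labels must contain both classes")
    curve = []
    for c in cutoffs:
        errors = sum((s >= c) != bool(y) for s, y in zip(scores, labels))
        curve.append((c, 1.0 - errors / naive_errors))
    return curve


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [0, 0, 1, 1]
    curve = skill_curve(scores, labels, cutoffs=[0.0, 0.3, 0.5, 0.9])
    best_cutoff, best_skill = max(curve, key=lambda pair: pair[1])
    print(curve, best_cutoff, best_skill)
```

Plotting skill against the cutoff gives a curve whose maximum marks the preferred decision rule, and any cutoff with skill at or below zero does no better than the naive forecast.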