ABSTRACT
Sampling for prevalence estimation of infection is subject to bias from both oversampling of symptomatic individuals and error-prone tests. As a result, naïve estimators of prevalence (i.e., the proportion of observed infected individuals in the sample) can be very far from the true proportion of infected individuals. In this work, we present a method of prevalence estimation that reduces the bias due to both testing errors and oversampling of symptomatic individuals, eliminating it altogether in some scenarios. Moreover, this procedure considers stratified errors in which tests have different error rate profiles for symptomatic and asymptomatic individuals. This results in easily implementable algorithms, for which code is provided, that produce better prevalence estimates than other methods (in terms of reducing and/or removing bias), as demonstrated by formal results, simulations, and COVID-19 data from the Israeli Ministry of Health.
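For orientation on the test-error part of the problem only, the classical Rogan-Gladen correction adjusts the naive positive fraction using a known sensitivity and specificity. The sketch below is not the method of this paper and does not address oversampling of symptomatic individuals. The function name and all numeric values are hypothetical.

```python
def rogan_gladen(naive_prevalence, sensitivity, specificity):
    """Classical Rogan-Gladen correction of an apparent prevalence.

    Assumes sensitivity and specificity are known and that
    sensitivity + specificity > 1. The result is clipped to [0, 1].
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("the test must perform better than chance")
    corrected = (naive_prevalence + specificity - 1.0) / denom
    return min(1.0, max(0.0, corrected))


# Hypothetical values: true prevalence 0.10, sensitivity 0.80, specificity 0.95.
true_p = 0.10
# Expected fraction of positive tests in an unbiased sample.
apparent = true_p * 0.80 + (1 - true_p) * (1 - 0.95)
estimate = rogan_gladen(apparent, 0.80, 0.95)
print(apparent, estimate)
```

With these assumed values the naive estimator overstates prevalence, while the corrected value recovers the assumed true prevalence up to rounding.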
ABSTRACT
The variance effective population size ([Formula: see text]) is frequently used to quantify the expected rate at which a population's allele frequencies change over time. The purpose of this paper is to find expressions for the global [Formula: see text] of a spatially structured population that are of interest for conservation of species. Since [Formula: see text] depends on allele frequency change, we start by dividing the cause of allele frequency change into genetic drift within subpopulations (I) and a second component mainly due to migration between subpopulations (II). We investigate in detail how these two components depend on the way in which subpopulations are weighted, as well as on parameters of the model such as migration rates and local effective and census sizes. It is shown that under certain conditions the impact of II is eliminated, and [Formula: see text] of the metapopulation is maximized, when subpopulations are weighted proportionally to their long-term reproductive contributions. This maximal [Formula: see text] is the sought-for global effective size, since it approximates the gene diversity effective size [Formula: see text], a quantifier of the rate of loss of genetic diversity that is relevant for conservation of species and populations. We also propose two novel versions of [Formula: see text], one of which (the backward version of [Formula: see text]) is most stable, exists for most populations, and is closer to [Formula: see text] than the classical notion of [Formula: see text]. Expressions for the optimal length of the time interval for measuring genetic change are developed that make it possible to estimate any version of [Formula: see text] with maximal accuracy.
Subject(s)
Genetic Drift, Animals, Gene Frequency, Population Density, Time
ABSTRACT
The sympatric existence of genetically distinguishable populations of the same species remains a puzzle in ecology. Coexisting salmonid fish populations are known from over 100 freshwater lakes. Most studies of sympatric populations have used limited numbers of genetic markers, making it unclear whether genetic divergence involves certain parts of the genome. We returned to the first reported case of salmonid sympatry, initially detected through contrasting homozygosity at a single allozyme locus (coding for lactate dehydrogenase A) in brown trout in the small Lakes Bunnersjöarna, Sweden. First, we verified the existence of the two coexisting demes using a 96-SNP Fluidigm array. We then applied whole-genome resequencing of pooled DNA to explore genome-wide diversity within and between these demes; nucleotide diversity was higher in deme I than in deme II. Strong genetic divergence is observed, with genome-wide FST ≈ 0.2. Compared with data from populations of similar small lakes, this divergence is of similar magnitude to that between reproductively isolated populations. Individual whole-genome resequencing of two individuals per deme suggests higher inbreeding in deme II than in deme I, indicating different degrees of isolation. We located two gene copies for LDH-A and found divergence between demes in a regulatory section of one of these genes. However, we did not find a perfect fit between the sequence data and previous allozyme results, and this will require further research. Our data demonstrate genome-wide divergence governed mostly by genetic drift but also by diversifying selection in coexisting populations. This type of hidden biodiversity needs consideration in conservation management.
Subject(s)
Reproductive Isolation, Sympatry, Animals, Genetic Variation, Population Genetics, Humans, Isoenzymes, Trout/genetics
ABSTRACT
A general framework is introduced to estimate how much external information has been infused into a search algorithm, the so-called active information. This is rephrased as a test of fine-tuning, where tuning corresponds to the amount of pre-specified knowledge that the algorithm makes use of in order to reach a certain target. A function f quantifies specificity for each possible outcome x of a search, so that the target of the algorithm is a set of highly specified states, whereas fine-tuning occurs if it is much more likely for the algorithm to reach the target as intended than by chance. The distribution of a random outcome X of the algorithm involves a parameter θ that quantifies how much background information has been infused. A simple choice of this parameter is to use θf in order to exponentially tilt the distribution of the outcome of the search algorithm under the null distribution of no tuning, so that an exponential family of distributions is obtained. Such algorithms are obtained by iterating a Metropolis-Hastings type of Markov chain, which makes it possible to compute their active information under the equilibrium and non-equilibrium of the Markov chain, with or without stopping when the targeted set of fine-tuned states has been reached. Other choices of tuning parameters θ are discussed as well. Nonparametric and parametric estimators of active information and tests of fine-tuning are developed when repeated and independent outcomes of the algorithm are available. The theory is illustrated with examples from cosmology, student learning, reinforcement learning, a Moran type model of population genetics, and evolutionary programming.
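The exponential tilting described above can be illustrated on a small finite outcome space. This is a minimal sketch under stated assumptions: active information is taken to be the log ratio of the target probability under the tilted distribution to that under the null, and the target is taken to be the set of outcomes with specificity at or above a threshold. The four-point outcome space, the function names, and all numbers are hypothetical.

```python
import math


def tilt(p0, f, theta):
    """Exponentially tilt the null distribution p0 by theta * f."""
    weights = [p * math.exp(theta * fx) for p, fx in zip(p0, f)]
    total = sum(weights)
    return [w / total for w in weights]


def active_information(p0, f, theta, threshold):
    """log( P_theta(target) / P_0(target) ), target = {x : f(x) >= threshold}."""
    in_target = [fx >= threshold for fx in f]
    p_null = sum(p for p, hit in zip(p0, in_target) if hit)
    p_tilted = sum(p for p, hit in zip(tilt(p0, f, theta), in_target) if hit)
    return math.log(p_tilted / p_null)


# Hypothetical example: uniform null distribution on four outcomes.
p0 = [0.25, 0.25, 0.25, 0.25]
f = [0.0, 1.0, 2.0, 3.0]  # specificity of each outcome
info_none = active_information(p0, f, theta=0.0, threshold=3.0)
info_tuned = active_information(p0, f, theta=1.0, threshold=3.0)
print(info_none, info_tuned)
```

With theta = 0 the tilted distribution equals the null and the active information is zero; a positive theta moves mass toward highly specified outcomes and gives a positive value.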
ABSTRACT
Philosophers frequently define knowledge as justified, true belief. We built a mathematical framework that makes it possible to define learning (increasing number of true beliefs) and knowledge of an agent in precise ways, by phrasing belief in terms of epistemic probabilities, defined from Bayes' rule. The degree of true belief is quantified by means of active information I+: a comparison between the degree of belief of the agent and a completely ignorant person. Learning has occurred when either the agent's strength of belief in a true proposition has increased in comparison with the ignorant person (I+>0), or the strength of belief in a false proposition has decreased (I+<0). Knowledge additionally requires that learning occurs for the right reason, and in this context we introduce a framework of parallel worlds that correspond to parameters of a statistical model. This makes it possible to interpret learning as a hypothesis test for such a model, whereas knowledge acquisition additionally requires estimation of a true world parameter. Our framework of learning and knowledge acquisition is a hybrid between frequentism and Bayesianism. It can be generalized to a sequential setting, where information and data are updated over time. The theory is illustrated using examples of coin tossing, historical and future events, replication of studies, and causal inference. It can also be used to pinpoint shortcomings of machine learning, where typically learning rather than knowledge acquisition is in focus.
ABSTRACT
In this paper we consider the time evolution of a population of size N with overlapping generations, in the vicinity of m genes. We assume that this population is subject to point mutations, genetic drift, and selection. More specifically, we analyze the statistical distribution of the waiting time Tm until the expression of these genes has changed for all individuals, when transcription factors recognize and attach to short DNA sequences (binding sites) within regulatory sequences in the neighborhoods of the m genes. The evolutionary dynamics is described by a multitype Moran process, where each individual is assigned an m×L regulatory array that consists of regulatory sequences with L nucleotides for all m genes. We study how the waiting time distribution depends on the number of genes, the mutation rate, the length of the binding sites, the length of the regulatory sequences, and the way in which the targeted binding sites are coordinated for different genes in terms of selection coefficients. These selection coefficients depend on how many binding sites have appeared so far, and possibly on their order of appearance. We also allow for back mutations, whereby some acquired binding sites may be lost over time. It is further assumed that the mutation rate is small enough to warrant a fixed state population, so that all individuals have the same regulatory array at any given time point, until the next successful mutation arrives in some individual and spreads to the rest of the population. We further incorporate stochastic tunneling, whereby successful mutations get mutated before their fixation. A crucial part of our approach is to divide the huge state space of regulatory arrays into a small number of components, assuming that the array component varies as a Markov process over time. This implies that Tm is the time until this Markov process hits an absorbing state, with a phase-type distribution. A number of interesting results can be derived from our general setup, for instance that the expected waiting time increases exponentially with m, for a selectively neutral model, when back mutations are possible.
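The phase-type structure mentioned above means that expected waiting times solve a linear system in the transient part of the generator. The sketch below illustrates this on a toy continuous-time chain, not the paper's model: states 0, ..., m-1 count acquired binding sites, state m is absorbing, and the constant forward and back rates are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def expected_waiting_times(m, forward, back):
    """Expected time to absorption in state m from each state 0..m-1.

    S is the sub-intensity matrix of the transient states; the vector t of
    expected absorption times satisfies S t = -1.
    """
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        loss = back if i > 0 else 0.0  # no back mutation from state 0
        S[i][i] = -(forward + loss)
        if i + 1 < m:
            S[i][i + 1] = forward  # the jump from m-1 to m leaves the block
        if i > 0:
            S[i][i - 1] = back
    return solve(S, [-1.0] * m)


times_with_back = expected_waiting_times(2, forward=1.0, back=1.0)
times_no_back = expected_waiting_times(3, forward=1.0, back=0.0)
print(times_with_back[0], times_no_back[0])
```

In this toy chain, allowing back mutations lengthens the expected waiting time from state 0 relative to a pure forward chain with the same number of steps.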
Subject(s)
Genetic Drift, Genetic Models, Binding Sites/genetics, Molecular Evolution, Humans, Mutation, Genetic Selection, Time Factors, Transcription Factors/genetics
ABSTRACT
In this paper we develop a general framework for how the genetic composition of a structured population with strong migration between its subunits evolves over time. The dynamics is described in terms of a vector-valued Markov process of allele, genotype or haplotype frequencies that varies on two time scales. The more rapid changes are random fluctuations, in terms of a multivariate autoregressive process, around a quasi-equilibrium fixed point, whereas the fixed point itself varies more slowly according to a diffusion process along a lower-dimensional subspace. Under mild regularity conditions, the fluctuations have a magnitude inversely proportional to the square root of the population size N, and hence can be used to estimate a broad class of genetically effective population sizes Ne with genetic data from one time point only. In this way we are able to unify a number of existing notions of effective size, as well as to propose new ones, for instance one that quantifies the extent to which genotype frequencies fluctuate around Hardy-Weinberg equilibrium.
Subject(s)
Alleles, Gene Frequency, Genotype, Haplotypes, Genetic Models, Population Density
ABSTRACT
BACKGROUND: HLA-DRB1*15:01, absence of HLA-A*02:01, and smoking interact to increase multiple sclerosis (MS) risk. OBJECTIVE: To analyze whether MS-associated human leukocyte antigen (HLA) alleles, apart from DRB1*15:01 and absence of A*02:01, interact with smoking in MS development, and to explore whether the established HLA-smoking interaction is affected by the DQA1*01:01 allele, which confers a protective effect only in the presence of DRB1*15:01. METHODS: In two Swedish population-based case-control studies (5838 cases, 5412 controls), subjects with different genotypes and smoking habits were compared regarding MS risk, by calculating odds ratios with 95% confidence intervals employing logistic regression. Interaction on the additive scale between different genotypes and smoking was evaluated. RESULTS: The DRB1*08:01 allele interacted with smoking to increase MS risk. The interaction between DRB1*15:01 and both the absence of A*02:01 and smoking was confined to DQA1*01:01 negative subjects, whereas no interactions occurred among DQA1*01:01 positive subjects. CONCLUSION: Multifaceted interactions take place between different class II alleles and smoking in MS development. The influence of DRB1*15:01 and its interaction with the absence of A*02:01 and smoking is dependent on DQA1*01:01 status which may be due to differences in the responding T-cell repertoires.
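Interaction on the additive scale is commonly summarized by the relative excess risk due to interaction (RERI) and the attributable proportion due to interaction (AP), computed from odds ratios relative to the doubly unexposed group. The abstract does not state which measure was used, so the sketch below is only an illustration of the concept. The function name and the odds ratios are hypothetical, not values from the study.

```python
def additive_interaction(or_both, or_gene_only, or_smoking_only):
    """Return (RERI, AP) from odds ratios relative to the doubly unexposed group.

    RERI = OR11 - OR10 - OR01 + 1; AP = RERI / OR11.
    RERI = 0 corresponds to no departure from additivity of effects.
    """
    reri = or_both - or_gene_only - or_smoking_only + 1.0
    ap = reri / or_both
    return reri, ap


# Hypothetical odds ratios for illustration only.
reri, ap = additive_interaction(or_both=6.0, or_gene_only=3.0, or_smoking_only=1.5)
print(reri, ap)
```

An AP above zero indicates that the joint effect exceeds the sum of the separate effects; confidence intervals, which a real analysis would require, are omitted here.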
Subject(s)
HLA Antigens, Multiple Sclerosis, Alleles, Gene Frequency, Genetic Predisposition to Disease, HLA-DQ beta-Chains/genetics, HLA-DRB1 Chains/genetics, Haplotypes, Humans, Multiple Sclerosis/genetics, Smoking/adverse effects
ABSTRACT
Fine-tuning has received much attention in physics, where it states that the fundamental constants of physics are finely tuned to precise values that permit a rich chemistry and life. It has not yet been applied in a broad manner to molecular biology. However, in this paper we argue that biological systems present fine-tuning at different levels, e.g. functional proteins, complex biochemical machines in living cells, and cellular networks. This paper describes molecular fine-tuning, how it can be used in biology, and how it challenges conventional Darwinian thinking. We also discuss the statistical methods underpinning fine-tuning and present a framework for such analysis.
ABSTRACT
Potential long-term consequences of hypnotics remain controversial. We used the prospective Swedish National March Cohort, a study based on 41,695 participants with a mean follow-up duration of 18.9 years. Logistic regression models and Cox proportional hazards models with attained age as timescale were used to assess associations of hypnotic use with short- and long-term mortality. The proportion of subjects who initiated or discontinued hypnotic use during follow-up was substantial. All groups of hypnotics were associated with increased mortality within 2 years after a first prescription, with an overall OR of 2.38 (95% CI, 2.13-2.66). The association was more pronounced among subjects younger than 60 years (OR, 6.16; 95% CI, 3.98-9.52). There was no association between hypnotic use and long-term mortality. The association between hypnotic use and increased mortality was thus restricted to a relatively short period after treatment initiation, and may be explained in terms of confounding by indication.
Subject(s)
Hypnotics and Sedatives/adverse effects, Sleep Initiation and Maintenance Disorders/mortality, Aged, Female, Humans, Male, Middle Aged, Mortality, Prospective Studies
ABSTRACT
Estimation of effective population size (Ne) from genetic marker data is a major focus for biodiversity conservation because it is essential to know at what rates inbreeding is increasing and additive genetic variation is lost. But are these the rates assessed when applying commonly used Ne estimation techniques? Here we use recently developed analytical tools and demonstrate that in the case of substructured populations the answer is no. The reason is as follows. Genetic change can be quantified in several ways reflecting different types of Ne, such as inbreeding (NeI), variance (NeV), additive genetic variance (NeAV), linkage disequilibrium equilibrium (NeLD), eigenvalue (NeE) and coalescence (NeCo) effective size. They are all the same for an isolated population of constant size, but the realized values of these effective sizes can differ dramatically in populations under migration. Commonly applied Ne-estimators target NeV or NeLD of individual subpopulations. While such estimates are safe proxies for the rates of inbreeding and loss of additive genetic variation under isolation, we show that they are poor indicators of these rates in populations affected by migration. In fact, both the local and global inbreeding (NeI) and additive genetic variance (NeAV) effective sizes are consistently underestimated in a subdivided population. This is serious because these are the effective sizes that are relevant to the widely accepted 50/500 rule for short- and long-term genetic conservation. The bias can be infinitely large and is due to inappropriate parameters being estimated when applying theory for isolated populations to subdivided ones.
Subject(s)
Genetic Markers/genetics, Genetic Variation/genetics, Population Genetics, Population Density, Animals, Gene Flow, Inbreeding, Linkage Disequilibrium, Genetic Models, Population Dynamics/statistics & numerical data
ABSTRACT
Interactions between environment and genetics may contribute to multiple sclerosis (MS) development. We investigated whether the previously observed interaction between smoking and HLA genotype in the Swedish population could be replicated, refined and extended to include other populations. We used six independent case-control studies from five different countries (Sweden, Denmark, Norway, Serbia, United States). A pooled analysis was performed for replication of previous observations (7190 cases, 8876 controls). Refined detailed analyses were carried out by combining the genetically similar populations from the Nordic studies (6265 cases, 8401 controls). In both the pooled analyses and the combined Nordic material, interactions were observed between HLA-DRB1*15 and absence of HLA-A*02 and between smoking and each of the genetic risk factors. Two-way interactions were observed between each combination of the three variables, invariant over categories of the third. Further, there was also a three-way interaction between the risk factors. The difference in MS risk between the extremes was considerable; smokers carrying HLA-DRB1*15 and lacking HLA-A*02 had a 13-fold increased risk compared with never smokers without these genetic risk factors (OR 12.7, 95% CI 10.8-14.9). The risk of MS associated with HLA genotypes is strongly influenced by smoking status and vice versa. Since the function of HLA molecules is to present peptide antigens to T cells, the demonstrated interactions strongly suggest that smoking alters MS risk through actions on adaptive immunity.
Subject(s)
Genetic Predisposition to Disease, HLA-A Antigens/genetics, HLA-DRB1 Chains/genetics, Multiple Sclerosis/epidemiology, Smoking/adverse effects, Case-Control Studies, Female, Gene-Environment Interaction, Genotype, Humans, Male, Middle Aged, Multiple Sclerosis/genetics, Multiple Sclerosis/immunology, Risk Factors, Smoking/immunology, Sweden/epidemiology
ABSTRACT
An exact Markov chain is developed for a Moran model of random mating for monoecious diploid individuals with a given probability of self-fertilization. The model captures the dynamics of genetic variation at a biallelic locus. We compare the model with the corresponding diploid Wright-Fisher (WF) model. We also develop a novel diffusion approximation of both models, where the genotype frequency distribution dynamics is described by two partial differential equations, on different time scales. The first equation captures the more slowly varying allele frequencies, and it is the same for the Moran and WF models. The other equation captures departures of the fraction of heterozygous genotypes from a large population equilibrium curve that equals Hardy-Weinberg proportions in the absence of selfing. It is the distribution of a continuous time Ornstein-Uhlenbeck process for the Moran model and a discrete time autoregressive process for the WF model. One application of our results is to capture dynamics of the degree of non-random mating of both models, in terms of the fixation index fIS. Although fIS has a stable fixed point that only depends on the degree of selfing, the normally distributed oscillations around this fixed point are stochastically larger for the Moran than for the WF model.
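The statement that fIS has a stable fixed point depending only on the degree of selfing can be illustrated with the textbook large-population recursion for the inbreeding coefficient under partial selfing, F' = s(1 + F)/2, whose fixed point is s/(2 - s). This deterministic recursion is an assumption of the sketch and is not the paper's Moran or Wright-Fisher chain; it ignores the random fluctuations that the paper studies. Function names and the selfing rate are hypothetical.

```python
def selfing_fixed_point(s):
    """Fixed point of F' = s * (1 + F) / 2 for selfing probability s."""
    return s / (2.0 - s)


def iterate_inbreeding(s, generations, f0=0.0):
    """Iterate the deterministic large-population recursion from F = f0."""
    f = f0
    for _ in range(generations):
        f = s * (1.0 + f) / 2.0
    return f


s = 0.5  # hypothetical selfing probability
print(iterate_inbreeding(s, 100), selfing_fixed_point(s))
```

Because the recursion contracts by a factor s/2 per generation, the iterate converges to the fixed point from any starting value in [0, 1].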
Subject(s)
Diploidy, Genetic Models, Reproduction/genetics, Computer Simulation, Computer-Assisted Numerical Analysis, Population Density
ABSTRACT
The variance effective population size for age-structured populations is generally hard to estimate, and the temporal method often gives biased estimates. Here, we give an explicit expression for a correction factor which, combined with estimates from the temporal method, yields approximately unbiased estimates. The calculation of the correction factor requires knowledge of the age-specific offspring distribution and survival probabilities, as well as possible correlation between survival and reproductive success. In order to relax these requirements, we show that only first-order moments of these distributions need to be known if the time between samples is large, or if individuals from all age classes which reproduce are sampled. A very explicit approximate expression for the asymptotic coefficient of standard deviation of the estimator is derived, and it can be used to construct confidence intervals and optimal ways of weighting information from different markers. The asymptotic coefficient of standard deviation can also be used to design studies, and we show that in order to maximize the precision for a given sample size, individuals from older age classes should be sampled, since their expected variance of allele frequency change is higher and easier to estimate. However, for populations with fluctuating age class sizes, the accuracy of the method is reduced when samples are taken from older age classes with high demographic variation. We also present a method for simultaneous estimation of the variance effective and census population size.
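For context, the uncorrected temporal method that the correction factor is applied to can be sketched as follows. The sketch assumes discrete generations, the Nei-Tajima standardized variance Fc, and samples of diploid individuals taken before reproduction and not replaced; none of these choices is stated in the abstract, and the age-structure correction itself is not implemented. The function names and allele frequencies are hypothetical.

```python
def nei_tajima_fc(freqs_0, freqs_t):
    """Standardized variance of allele frequency change, averaged over alleles."""
    terms = [
        (x - y) ** 2 / ((x + y) / 2.0 - x * y)
        for x, y in zip(freqs_0, freqs_t)
    ]
    return sum(terms) / len(terms)


def temporal_ne(fc, generations, sample_0, sample_t):
    """Temporal-method estimate of the variance effective size.

    Subtracts the expected sampling noise 1/(2*S0) + 1/(2*St) from Fc;
    returns infinity when no drift signal remains.
    """
    drift = fc - 1.0 / (2 * sample_0) - 1.0 / (2 * sample_t)
    if drift <= 0:
        return float("inf")
    return generations / (2.0 * drift)


# Hypothetical biallelic locus sampled 4 generations apart, 50 individuals each time.
fc = nei_tajima_fc([0.5, 0.5], [0.6, 0.4])
ne_hat = temporal_ne(fc, generations=4, sample_0=50, sample_t=50)
print(fc, ne_hat)
```

For an age-structured population, this raw estimate would then be adjusted by a correction factor of the kind the abstract describes.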
Subject(s)
Demography/methods, Population Genetics/methods, Theoretical Models, Population Density, Age Distribution, Censuses, Computer Simulation, Confidence Intervals, Gene Frequency, Genetic Drift, Humans, Life Tables, Logistic Models
ABSTRACT
Motivated by problems in conservation biology we study genetic dynamics in structured populations of diploid organisms (monoecious or dioecious). Our analysis provides an analytical framework that unifies substantial parts of previous work in terms of exact identity by descent (IBD) and identity by state (IBS) recursions. We provide exact conditions under which two structured haploid and diploid populations are equivalent, and some sufficient conditions under which a dioecious diploid population can be treated as a monoecious diploid one. The IBD recursions are used for computing local and metapopulation inbreeding and coancestry effective population sizes and for predictions of several types of fixation indices over different time horizons.
Subject(s)
Biological Evolution, Diploidy, Population Genetics, Inbreeding, Animals, Female, Male, Genetic Models, Population Density, Population Dynamics
ABSTRACT
A general theory is developed for the eigenvalue effective size (NeE) of structured populations in which a gene with two alleles segregates in discrete time. Generalizing results of Ewens (Theor Popul Biol 21:373-378, 1982), we characterize NeE in terms of the largest non-unit eigenvalue of the transition matrix of a Markov chain of allele frequencies. We use the Perron-Frobenius theorem to prove that the same eigenvalue appears in a linear recursion of predicted gene diversities between all pairs of subpopulations. Coalescence theory is employed in order to characterize this recursion, so that explicit novel expressions for NeE can be derived. We then study NeE asymptotically, when the inverse size and/or the overall migration rate between subpopulations tends to zero. It is demonstrated that several previously known results can be deduced as special cases. In particular, when the coalescence effective size NeC exists, it is an asymptotic version of NeE in the limit of large populations.
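The link between the largest non-unit eigenvalue and the decay of gene diversity can be checked numerically in the simplest unstructured case. The sketch below uses a single Wright-Fisher population with a hypothetical number of gene copies, and assumes the convention NeE = 1 / (2 (1 - λ)); it verifies that h(x) = x(M - x) is multiplied by λ = 1 - 1/M in one generation. The structured-population theory of the paper is not implemented.

```python
from math import comb


def wright_fisher_row(x, genes):
    """Transition probabilities from x copies of an allele among `genes` copies."""
    p = x / genes
    return [comb(genes, y) * p ** y * (1 - p) ** (genes - y) for y in range(genes + 1)]


def heterozygosity_decay_rate(genes, x):
    """E[h(X_next) | X = x] / h(x) for h(x) = x * (genes - x), with 0 < x < genes."""
    row = wright_fisher_row(x, genes)
    expected = sum(prob * y * (genes - y) for y, prob in enumerate(row))
    return expected / (x * (genes - x))


def eigenvalue_effective_size(eigenvalue):
    """Assumed convention: eigenvalue = 1 - 1 / (2 * NeE)."""
    return 1.0 / (2.0 * (1.0 - eigenvalue))


genes = 20  # hypothetical: N = 10 diploid individuals, 2N = 20 gene copies
lam = heterozygosity_decay_rate(genes, x=7)
print(lam, eigenvalue_effective_size(lam))
```

The ratio is the same for every interior starting count x, which is what makes h an eigenvector; in this ideal case the resulting NeE equals the number of diploid individuals.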
Subject(s)
Population Genetics, Genetic Models, Population Density, Animals, Gene Frequency, Genetic Variation, Humans, Loss of Heterozygosity, Markov Chains, Mathematical Concepts, Biological Models, Population Dynamics/statistics & numerical data
ABSTRACT
In this paper, we develop a method for computing the variance effective size NeV, the fixation index FST and the coefficient of gene differentiation GST of a structured population under equilibrium conditions. The subpopulation sizes are constant in time, with migration and reproduction schemes that can be chosen with great flexibility. Our quasi-equilibrium approach is conditional on non-fixation of alleles. This is of relevance when migration rates are of a larger order of magnitude than the mutation rates, so that new mutations can be ignored before equilibrium balance between genetic drift and migration is obtained. The vector-valued time series of subpopulation allele frequencies is divided into two parts: one corresponding to genetic drift of the whole population and one corresponding to differences in allele frequencies among subpopulations. We give conditions under which the first two moments of the latter, after a simple standardization, are well approximated by quantities that can be explicitly calculated. This enables us to compute approximations of the quasi-equilibrium values of NeV, FST and GST. Our findings are illustrated for several reproduction and migration scenarios, including the island model, stepping stone models and a model where one subpopulation acts as a demographic reservoir. We also make detailed comparisons with a backward approach based on coalescence probabilities.
Subject(s)
Gene Frequency/genetics, Genetic Drift, Genetic Variation/genetics, Genetic Models, Mutation/genetics, Algorithms, Animals, Humans, Computer-Assisted Numerical Analysis, Population Density
ABSTRACT
A large hindrance to analyzing information in genetic or protein sequence data has been the lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants have a selective advantage and therefore appear more frequently compared to prior expectations. This includes distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where functional entities are retained, whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is also retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been successfully applied to bioinformatics data sets, for proteins and sequence alignment of protein families.
Subject(s)
Computational Biology, Computational Biology/methods, Multigene Family, Genetic Models, Humans, Algorithms, Molecular Evolution
ABSTRACT
There are two primary measures of the amount of genetic variation in a population at a locus: heterozygosity and the number of alleles. Effective population size (Ne) provides both an expectation of the amount of heterozygosity in a population at drift-mutation equilibrium and the rate of loss of heterozygosity because of genetic drift. In contrast, the number of alleles in a population at drift-mutation equilibrium is a function of both Ne and census size (NC). In addition, populations with the same Ne can lose allelic variation at very different rates. Allelic variation is generally much more sensitive to bottlenecks than heterozygosity. Expressions used to adjust for the effects of violations of the ideal population on Ne do not provide good predictions of the loss of allelic variation. These effects are much greater for loci with many alleles, which are often important for adaptation. We show that there is a linear relationship between the reduction of NC and the corresponding reduction of the expected number of alleles at drift-mutation equilibrium. This makes it possible to predict the expected effect of a bottleneck on allelic variation. Heterozygosity provides good estimates of the rate of adaptive change in the short term, but allelic variation provides important information about long-term adaptive change. The guideline of long-term Ne being greater than 500 is often used as a primary genetic metric for evaluating conservation status. We recommend that this guideline be expanded to take into account allelic variation as well as heterozygosity.
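One way to see how the number of alleles can depend on both Ne and NC is Ewens's formula for the expected number of distinct alleles among n gene copies under the infinite alleles model, with θ = 4·Ne·μ. Treating the 2·NC gene copies of the whole population as the "sample" is a heuristic assumption of this sketch, not necessarily the paper's derivation. The function name, mutation rate, and population sizes are hypothetical.

```python
def expected_alleles(theta, n_genes):
    """Ewens's expected number of distinct alleles among n_genes gene copies."""
    return sum(theta / (theta + i) for i in range(n_genes))


mu = 1e-4   # hypothetical mutation rate per generation
ne = 1000   # hypothetical effective size
theta = 4 * ne * mu

# Same Ne, different census sizes (2 * NC gene copies each).
alleles_small_census = expected_alleles(theta, 2 * 1000)
alleles_large_census = expected_alleles(theta, 2 * 10000)
print(alleles_small_census, alleles_large_census)
```

Under these assumptions, two populations with the same Ne but different census sizes are expected to carry different numbers of alleles, whereas expected heterozygosity at equilibrium depends on θ alone.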
ABSTRACT
The variance effective population size (NeV) is a key concept in population biology, because it quantifies the microevolutionary process of random genetic drift, and understanding the characteristics of NeV is thus of central importance. Current formulas for NeV for populations with overlapping generations weight age classes according to their reproductive values (i.e. reflecting the contribution of genes from separate age classes to the population growth) to obtain a correct measure of genetic drift when computing the variance of the allele frequency change over time. In this paper, we examine the effect of applying different weights to the age classes using a novel analytical approach for exploring NeV. We consider a haploid organism with overlapping generations and populations of increasing, declining, or constant expected size and stochastic variation with respect to the number of individuals in the separate age classes. We define NeV as a function of how the age classes are weighted and of the span between the two points in time when measuring allele frequency change. With this model, time profiles for NeV can be calculated for populations with various life histories and with fluctuations in life history composition, using different weighting schemes. We examine analytically and by simulations when NeV, using a weighting scheme with respect to reproductive contribution of separate age classes, accurately reflects the variance of the allele frequency change due to genetic drift over time. We show that the discrepancy of NeV, calculated with reproductive values as weights, compared to when individuals are weighted equally, tends to a constant when the time span between the two measurements increases. This constant is zero only for a population with a constant expected population size. Our results confirm that the effect of ignoring overlapping generations, when empirically assessing NeV from allele frequency shifts, gets smaller as the time interval between samples increases. Our model has empirical applications including assessment of (i) time intervals necessary to permit ignoring the effect of overlapping generations for NeV estimation by means of the temporal method, and (ii) effects of life table manipulation on NeV over varying time periods.