Results 1 - 20 of 29
1.
Theor Popul Biol ; 148: 28-39, 2022 12.
Article in English | MEDLINE | ID: mdl-36208800

ABSTRACT

The concept of individual admixture (IA) assumes that the genome of individuals is composed of alleles inherited from K ancestral populations. Each copy of each allele has the same chance q_k of originating from population k; together with the allele frequencies p in all populations at all M markers, these probabilities comprise the admixture model. Here, we assume a supervised scheme, i.e., allele frequencies p are given through a reference database of size N, and q is estimated via maximum likelihood for a single sample. We study laws of large numbers and central limit theorems describing the effects of the finiteness of both M and N on the estimate of q. We recall results for the effect of finite M, provide a central limit theorem for the effect of finite N, introduce a new way to express the uncertainty of estimates in standard barplots, give simulation results, and discuss applications in forensic genetics.


Subject(s)
Genetics, Population , Computer Simulation , Gene Frequency , Likelihood Functions , Uncertainty
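As an illustration of the supervised scheme described in the abstract above, the following is a minimal sketch of maximum-likelihood estimation of q for one sample when the allele frequencies p are treated as known. It is not the authors' implementation: the EM update, the biallelic genotype coding (0, 1, or 2 copies of a reference allele), and the names `estimate_admixture` and `simulate_genotypes` are assumptions made for this example.

```python
import random

def estimate_admixture(genotypes, freqs, n_iter=500, tol=1e-10):
    """EM sketch for the admixture proportions q of a single sample.

    genotypes: length-M list of reference-allele counts in {0, 1, 2} (assumed coding).
    freqs: K lists of length M; freqs[k][m] is the reference-allele frequency of
           marker m in population k, treated as known (supervised scheme).
    """
    K, M = len(freqs), len(genotypes)
    eps = 1e-9  # keep frequencies away from 0 and 1 to avoid division by zero
    p = [[min(max(x, eps), 1.0 - eps) for x in row] for row in freqs]
    q = [1.0 / K] * K  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * K
        for m, g in enumerate(genotypes):
            # Probability that one allele copy at marker m is the reference allele.
            beta = sum(q[k] * p[k][m] for k in range(K))
            for k in range(K):
                # Posterior weight of population k for reference / alternative copies.
                new_q[k] += g * q[k] * p[k][m] / beta
                new_q[k] += (2 - g) * q[k] * (1.0 - p[k][m]) / (1.0 - beta)
        new_q = [x / (2.0 * M) for x in new_q]
        change = max(abs(a - b) for a, b in zip(new_q, q))
        q = new_q
        if change < tol:
            break
    return q

def simulate_genotypes(q, freqs, rng):
    """Draw one genotype per marker: two independent allele copies (hypothetical helper)."""
    K, M = len(freqs), len(freqs[0])
    genotypes = []
    for m in range(M):
        beta = sum(q[k] * freqs[k][m] for k in range(K))
        genotypes.append(sum(rng.random() < beta for _ in range(2)))
    return genotypes
```

With many well-differentiated markers the estimate should lie close to the simulated q, while fewer markers give noisier estimates. This is the finite-M effect the abstract refers to; the finite-N effect of the reference database is not modelled in this sketch.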
2.
PLoS Comput Biol ; 18(8): e1010407, 2022 08.
Article in English | MEDLINE | ID: mdl-35921376

ABSTRACT

Estimating the mutation rate, or equivalently effective population size, is a common task in population genetics. If recombination is low or high, optimal linear estimation methods are known and well understood. For intermediate recombination rates, the calculation of optimal estimators is more challenging. As an alternative to model-based estimation, neural networks and other machine learning tools could help to develop good estimators in these involved scenarios. However, if no benchmark is available it is difficult to assess how well suited these tools are for different applications in population genetics. Here we investigate feedforward neural networks for the estimation of the mutation rate based on the site frequency spectrum and compare their performance with model-based estimators. For this we use the model-based estimators introduced by Fu, Futschik et al., and Watterson that minimize the variance or mean squared error for no and free recombination. We find that neural networks reproduce these estimators if provided with the appropriate features and training sets. Remarkably, using the model-based estimators to adjust the weights of the training data, only one hidden layer is necessary to obtain a single estimator that performs almost as well as model-based estimators for low and high recombination rates, and at the same time provides a superior estimation method for intermediate recombination rates. We apply the method to simulated data based on the human chromosome 2 recombination map, highlighting its robustness in a realistic setting where local recombination rates vary and/or are unknown.


Subject(s)
Genetics, Population , Mutation Rate , Computer Simulation , Humans , Neural Networks, Computer , Recombination, Genetic/genetics
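For orientation, the simplest of the model-based estimators named in the abstract above, Watterson's estimator, can be sketched as a linear function of the site frequency spectrum. The function names and the list layout of the spectrum (entry k-1 holds S(k) for k = 1, ..., n-1) are assumptions made for this example; the estimators of Fu and of Futschik et al. use different weights and are not reproduced here.

```python
def watterson_weights(n):
    """Weights 1/a_n for each frequency class, with a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n))
    return [1.0 / a_n] * (n - 1)

def linear_theta(sfs, weights):
    """Generic linear estimator: weighted sum of the site frequency spectrum."""
    return sum(w * s for w, s in zip(weights, sfs))

def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1  # sample size implied by a spectrum with n-1 classes
    return linear_theta(sfs, watterson_weights(n))

# Example spectrum for n = 5 sequences: S(1)=4, S(2)=2, S(3)=1, S(4)=1.
example_theta = watterson_theta([4, 2, 1, 1])
```

Writing the estimator as a weight vector makes it easy to compare against other linear estimators, or against a network whose output is a learned function of the same spectrum.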
3.
Forensic Sci Int Genet ; 56: 102593, 2022 01.
Article in English | MEDLINE | ID: mdl-34735936

ABSTRACT

The inference of biogeographic ancestry (BGA) has become a focus of forensic genetics. Misinference of BGA can have profound unwanted consequences for investigations and society. We show that recent admixture can lead to misclassification and erroneous inference of ancestry proportions, using state-of-the-art analysis tools with (i) simulations, (ii) 1000 Genomes Project data, and (iii) two individuals analyzed using the ForenSeq DNA Signature Prep Kit. Subsequently, we extend existing tools for estimation of individual ancestry (IA) by allowing for different IA in the two parents, leading to estimates of parental individual ancestry (PIA), and a statistical test for recent admixture. Estimation of PIA outperforms IA in most scenarios of recent admixture. Furthermore, PIA can provide additional information about parental ancestry that may guide casework.


Subject(s)
Genetics, Population , Polymorphism, Single Nucleotide , Genotype , Humans
6.
Bioinformatics ; 37(18): 3061-3063, 2021 09 29.
Article in English | MEDLINE | ID: mdl-33738486

ABSTRACT

MOTIVATION: Genome-wide association studies conventionally use the additive genetic model to explore whether a single nucleotide polymorphism (SNP) is associated with a quantitative trait. However, for variants that do not follow an intermediate mode of inheritance (MOI), the recessive or the dominant genetic model can have more power to detect associations; furthermore, the MOI is important for downstream analyses and clinical interpretation. When multiple MOIs are modelled, the question arises which one best describes the true underlying MOI. RESULTS: We developed an R package that makes it possible, for the first time, to determine study-specific critical values for when one of the three models is more informative than the others for a quantitative trait locus. The package allows for user-friendly simulations to determine these critical values with predefined minor allele frequencies and study sizes. For application scenarios with extensive multiple testing, we integrated an interpolation functionality that determines critical values based on only a moderate number of random draws. AVAILABILITY AND IMPLEMENTATION: The R package pgainsim is freely available for download on GitHub at https://github.com/genepi-freiburg/pgainsim. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome-Wide Association Study , Quantitative Trait Loci , Phenotype , Inheritance Patterns , Polymorphism, Single Nucleotide , Software
7.
Forensic Sci Int Genet ; 46: 102259, 2020 05.
Article in English | MEDLINE | ID: mdl-32105949

ABSTRACT

Inference of the Biogeographical Ancestry (BGA) of a person or trace relies on three ingredients: (1) a reference database of DNA samples including BGA information; (2) a statistical clustering method; (3) a set of loci that segregate depending on geographical location, i.e., a set of so-called Ancestry Informative Markers (AIMs). We used the theory of feature selection from statistical learning in order to obtain AIMsets for BGA inference. Using simulations, we show that this learning procedure works in various cases and outperforms ad hoc methods based on statistics like F_ST or informativeness for the choice of AIMs. Applying our method to data from the 1000 Genomes Project (excluding Admixed Americans), we identified an AIMset of 12 SNPs, which gives a vanishing misclassification error on a continental scale, as do other published AIMsets. In fact, cross validation shows that there exists a multitude of sets with comparable performance to the optimal AIMset. On a sub-continental scale, we find a set of 55 SNPs for distinguishing the five European populations. The misclassification error is reduced by a factor of two relative to published AIMsets, but is still 30% and therefore too large to be useful in forensic applications.


Subject(s)
Databases, Genetic , Genetic Markers , Polymorphism, Single Nucleotide , Racial Groups/genetics , Forensic Genetics , Humans , Models, Genetic , Models, Statistical
8.
Theor Popul Biol ; 131: 2-11, 2020 02.
Article in English | MEDLINE | ID: mdl-31759974

ABSTRACT

For a panmictic population of constant size evolving under neutrality, Kingman's coalescent describes the genealogy of a population sample in equilibrium. However, for genealogical trees under selection, not even the expectations of the most basic quantities, such as the height and length of the resulting random tree, are known. Here, we give an analytic expression for the distribution of the total tree length of a sample of size n under low levels of selection in a two-alleles model. We can prove that trees are shorter than under neutrality under genic selection and if the beneficial mutant has dominance h < 1/2, but longer for h > 1/2. The difference from neutrality is O(α^2) for genic selection with selection intensity α and O(α) for other modes of dominance.


Subject(s)
Alleles , Genetics, Population , Models, Genetic , Selection, Genetic , Pedigree
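The neutral baseline against which the abstract above compares tree lengths can be sketched with a short simulation of Kingman's coalescent. The code covers only the neutral case, not the selection results of the paper. It assumes the standard time scaling in which each pair of lineages coalesces at rate 1, so that the expected total tree length is 2 * sum_{i=1}^{n-1} 1/i. The function names are chosen for this example.

```python
import random

def expected_tree_length(n):
    """Expected total branch length of a neutral Kingman coalescent tree."""
    return 2.0 * sum(1.0 / i for i in range(1, n))

def simulate_tree_length(n, rng):
    """Simulate one total tree length for a sample of size n under neutrality."""
    total = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the next coalescence occurs at rate k(k-1)/2.
        rate = k * (k - 1) / 2.0
        waiting_time = rng.expovariate(rate)
        # All k lineages extend for the duration of this waiting time.
        total += k * waiting_time
    return total

def mean_simulated_length(n, replicates, seed=42):
    """Average simulated tree length over independent replicates."""
    rng = random.Random(seed)
    return sum(simulate_tree_length(n, rng) for _ in range(replicates)) / replicates
```

Comparing `mean_simulated_length(10, 20000)` with `expected_tree_length(10)` gives a quick sanity check of the neutral expectation that the selection results are measured against.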
9.
G3 (Bethesda) ; 10(1): 211-223, 2020 01 07.
Article in English | MEDLINE | ID: mdl-31699776

ABSTRACT

With up to millions of nearly neutral polymorphisms now being routinely sampled in population-genomic surveys, it is possible to estimate the site-frequency spectrum of such sites with high precision. Each frequency class reflects a mixture of potentially unique demographic histories, which can be revealed using theory for the probability distributions of the starting and ending points of branch segments over all possible coalescence trees. Such distributions are completely independent of past population history, which only influences the segment lengths, providing the basis for estimating average population sizes separating tree-wide coalescence events. The history of population-size change experienced by a sample of polymorphisms can then be dissected in a model-flexible fashion, and extension of this theory allows estimation of the mean and full distribution of long-term effective population sizes and ages of alleles of specific frequencies. Here, we outline the basic theory underlying the conceptual approach, develop and test an efficient statistical procedure for parameter estimation, and apply this to multiple population-genomic datasets for the microcrustacean Daphnia pulex.


Subject(s)
Biomass , Models, Genetic , Polymorphism, Single Nucleotide , Animals , Daphnia/genetics , Daphnia/growth & development
10.
Bioinformatics ; 35(11): 1813-1819, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30395202

ABSTRACT

MOTIVATION: Unique sequence regions are associated with genetic function in vertebrate genomes. However, measuring uniqueness, or absence of long repeats, along a genome is conceptually and computationally difficult. Here we use a variant of the Lempel-Ziv complexity, the match complexity, Cm, and augment it by deriving its null distribution for random sequences. We then apply Cm to the human and mouse genomes to investigate the relationship between sequence complexity and function. RESULTS: We implemented Cm in the program macle and show through simulation that the newly derived null distribution of Cm is accurate. This allows us to delineate high-complexity regions in the human and mouse genomes. Using our program macle2go, we find that these regions are twofold enriched for genes. Moreover, the genes contained in these regions are more than 10-fold enriched for developmental functions. AVAILABILITY AND IMPLEMENTATION: Source code for macle and macle2go is available from www.github.com/evolbioinf/macle and www.github.com/evolbioinf/macle2go, respectively; Cm browser tracks from guanine.evolbio.mgp.de/complexity. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Genomics , Animals , Genes, Developmental , Humans , Mammals , Mice , Software
11.
J Math Biol ; 77(4): 1153-1191, 2018 10.
Article in English | MEDLINE | ID: mdl-29797051

ABSTRACT

Gene expression is influenced by extrinsic noise (involving a fluctuating environment of cellular processes) and intrinsic noise (referring to fluctuations within a cell under constant environment). We study the standard model of gene expression including an (in-)active gene, mRNA and protein. Gene expression is regulated in the sense that the protein feeds back and either represses (negative feedback) or enhances (positive feedback) its production at the stage of transcription. While it is well-known that negative (positive) feedback reduces (increases) intrinsic noise, we give a precise result on the resulting fluctuations in protein numbers. The technique we use is an extension of the Langevin approximation and is an application of a central limit theorem under stochastic averaging for Markov jump processes (Kang et al. in Ann Appl Probab 24:721-759, 2014). We find that (under our scaling and in equilibrium), negative feedback leads to a reduction in the Fano factor of at most 2, while the noise under positive feedback is potentially unbounded. The fit with simulations is very good and improves on known approximations.


Subject(s)
Gene Expression Regulation , Models, Genetic , Biochemical Phenomena , Computer Simulation , Feedback, Physiological , Homeostasis/genetics , Markov Chains , Mathematical Concepts , Monte Carlo Method , Protein Biosynthesis , RNA, Messenger/genetics , Stochastic Processes , Transcription, Genetic
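The Fano factor used in the abstract above to quantify noise is the variance of the protein count divided by its mean. A minimal sketch of computing it from a list of observed counts follows; the use of the population variance and the function name are assumptions made for this example, and the sample data are purely illustrative.

```python
from statistics import mean, pvariance

def fano_factor(counts):
    """Variance-to-mean ratio of a list of molecule counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("Fano factor is undefined for a zero mean")
    # Population variance is used here; a sample variance would also be defensible.
    return pvariance(counts) / m

# Illustrative counts only: mean 5, population variance 4.
example_fano = fano_factor([2, 4, 4, 4, 5, 5, 7, 9])
```

A Poisson-distributed count has a Fano factor of 1, so values below 1 indicate noise reduction relative to that reference and values above 1 indicate noise amplification.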
13.
J Lipid Res ; 57(5): 882-93, 2016 05.
Article in English | MEDLINE | ID: mdl-27015744

ABSTRACT

Lipoproteins play a key role in the development of CVD, but the dynamics of lipoprotein metabolism are difficult to address experimentally. This article describes a novel two-step combined in vitro and in silico approach that enables the estimation of key reactions in lipoprotein metabolism using just one blood sample. Lipoproteins were isolated by ultracentrifugation from fasting plasma stored at 4°C. Plasma incubated at 37°C is no longer in a steady state, and changes in composition may be determined. From these changes, we estimated rates for reactions like LCAT (56.3 µM/h), β-LCAT (15.62 µM/h), and cholesteryl ester (CE) transfer protein-mediated flux of CE from HDL to IDL/VLDL (21.5 µM/h) based on data from 15 healthy individuals. In a second step, we estimated LDL's HL activity (3.19 pools/day) and, for the very first time, selective CE efflux from LDL (8.39 µM/h) by relying on the previously derived reaction rates. The estimated metabolic rates were then confirmed in an independent group (n = 10). Although measurement uncertainties do not permit us to estimate parameters in individuals, the novel approach we describe here offers the unique possibility to investigate lipoprotein dynamics in various diseases like atherosclerosis or diabetes.


Subject(s)
Lipoproteins, LDL/blood , Adult , Algorithms , Cholesterol Ester Transfer Proteins/physiology , Computer Simulation , Esterification , Female , Humans , Hydrolysis , Male , Middle Aged , Models, Biological , Phosphatidylcholine-Sterol O-Acyltransferase/physiology , Triglycerides/physiology , Young Adult
14.
J R Soc Interface ; 12(104): 20141106, 2015 Mar 06.
Article in English | MEDLINE | ID: mdl-25652460

ABSTRACT

Spatial heterogeneity in cells can be modelled using distinct compartments connected by molecular movement between them. In addition to movement, changes in the amount of molecules are due to biochemical reactions within compartments, often such that some molecular types fluctuate on a slower timescale than others. It is natural to ask the following questions: how sensitive is the dynamics of molecular types to their own spatial distribution, and how sensitive are they to the distribution of others? What conditions lead to effective homogeneity in biochemical dynamics despite heterogeneity in molecular distribution? What kind of spatial distribution is optimal from the point of view of some downstream product? Within a spatially heterogeneous multiscale model, we consider two notions of dynamical homogeneity (full homogeneity and homogeneity for the fast subsystem), and consider their implications under different timescales for the motility of molecules between compartments. We derive rigorous results for their dynamics and long-term behaviour, and illustrate them with examples of a shared pathway, Michaelis-Menten enzymatic kinetics and autoregulating feedbacks. Using stochastic averaging of fast fluctuations to their quasi-steady-state distribution, we obtain simple analytic results that significantly reduce the complexity and expedite simulation of stochastic compartment models of chemical reactions.


Subject(s)
Biophysics/methods , Algorithms , Computer Simulation , Kinetics , Models, Biological , Models, Chemical , Models, Statistical , RNA, Messenger/metabolism , Signal Transduction , Stochastic Processes
15.
J Biotechnol ; 198: 3-14, 2015 Mar 20.
Article in English | MEDLINE | ID: mdl-25661839

ABSTRACT

Phenotypic heterogeneity, defined as the unequal behavior of individuals in an isogenic population, is prevalent in microorganisms. It has a significant impact both on industrial bioprocesses and microbial ecology. We introduce a new versatile reporter system designed for simultaneous monitoring of the activities of three different promoters, where each promoter is fused to a dedicated fluorescent reporter gene (cerulean, mCherry, and mVenus). The compact 3.1 kb triple reporter cassette can either be carried on a replicating plasmid or integrated into the genome avoiding artifacts associated with variation in copy number of plasmid-borne reporter constructs. This construct was applied to monitor promoter activities related to quorum sensing (sinI promoter) and biosynthesis of the exopolysaccharide galactoglucan (wgeA promoter) at single cell level in colonies of the symbiotic nitrogen-fixing alpha-proteobacterium Sinorhizobium meliloti growing in a microfluidics system. The T5-promoter served as a constitutive and homogeneously active control promoter indicating cell viability. wgeA promoter activity was heterogeneous over the whole period of colony development, whereas sinI promoter activity passed through a phase of heterogeneity before becoming homogeneous at late stages. Although quorum sensing-dependent regulation is a major factor activating galactoglucan production, activities of both promoters did not correlate at single cell level. We developed a novel mathematical strategy for classification of the gene expression status in cell populations based on the increase in fluorescence over time in each individual. With respect to galactoglucan biosynthesis, cells in the population were classified into non-contributors, weak contributors, and strong contributors.


Subject(s)
Promoter Regions, Genetic/genetics , Sinorhizobium meliloti/genetics , Bacterial Proteins/genetics , Galactans/genetics , Gene Expression Regulation, Bacterial/genetics , Genes, Reporter/genetics , Glucans/genetics , Green Fluorescent Proteins/genetics , Polysaccharides, Bacterial/genetics , Quorum Sensing/genetics
16.
J Theor Biol ; 364: 355-63, 2015 Jan 07.
Article in English | MEDLINE | ID: mdl-25285895

ABSTRACT

The expression of genes usually follows a two-step procedure. First, a gene (encoded in the genome) is transcribed resulting in a strand of (messenger) RNA. Afterwards, the RNA is translated into protein. We extend the classical stochastic jump model by adding delays (with arbitrary distributions) to transcription and translation. Already in the classical model, production of RNA and protein comes in bursts by activation and deactivation of the gene, resulting in a large variance of the number of RNA and proteins in equilibrium. We derive precise formulas for this second-order structure with the model including delay in equilibrium.


Subject(s)
Gene Expression Profiling , Gene Expression Regulation , Stochastic Processes , Animals , Bacteria , Binding Sites , Computer Simulation , Markov Chains , Models, Genetic , Oscillometry , Poisson Distribution , Protein Biosynthesis , Proteins/chemistry , RNA/chemistry , Transcription, Genetic
17.
Bioinformatics ; 31(8): 1169-75, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25504847

ABSTRACT

MOTIVATION: A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes. RESULTS: Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays and our implementation requires approximately only 1 s and 45 MB RAM/Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae. AVAILABILITY AND IMPLEMENTATION: We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi for ANchor DIstances. C sources and documentation are posted at http://github.com/evolbioinf/andi/ CONTACT: haubold@evolbio.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Biological Evolution , Genome , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Animals , Databases, Genetic , Humans , Phylogeny
18.
Genetics ; 198(1): 269-81, 2014 Sep.
Article in English | MEDLINE | ID: mdl-24948778

ABSTRACT

Although the analysis of linkage disequilibrium (LD) plays a central role in many areas of population genetics, the sampling variance of LD is known to be very large with high sensitivity to numbers of nucleotide sites and individuals sampled. Here we show that a genome-wide analysis of the distribution of heterozygous sites within a single diploid genome can yield highly informative patterns of LD as a function of physical distance. The proposed statistic, the correlation of zygosity, is closely related to the conventional population-level measure of LD, but is agnostic with respect to allele frequencies and hence likely less prone to outlier artifacts. Application of the method to several vertebrate species leads to the conclusion that >80% of recombination events are typically resolved by gene-conversion-like processes unaccompanied by crossovers, with the average lengths of conversion patches being on the order of one to several kilobases in length. Thus, contrary to common assumptions, the recombination rate between sites does not scale linearly with distance, often even up to distances of 100 kb. In addition, the amount of LD between sites separated by <200 bp is uniformly much greater than can be explained by the conventional neutral model, possibly because of the nonindependent origin of mutations within this spatial scale. These results raise questions about the application of conventional population-genetic interpretations to LD on short spatial scales and also about the use of spatial patterns of LD to infer demographic histories.


Subject(s)
Genome, Human , Linkage Disequilibrium , Models, Genetic , Animals , Gene Conversion , Gene Frequency , Heterozygote , Humans
19.
PLoS One ; 8(12): e81738, 2013.
Article in English | MEDLINE | ID: mdl-24339959

ABSTRACT

In the area of evolutionary theory, a key question is which portions of the genome of a species are targets of natural selection. Genetic hitchhiking is a theoretical concept that has helped to identify various such targets in natural populations. In the presence of recombination, a severe reduction in sequence diversity is expected around a strongly beneficial allele. The site frequency spectrum is an important tool in genome scans for selection and is composed of the numbers S(1),...,S(n-1), where S(k) is the number of single nucleotide polymorphisms (SNPs) present in k of n individuals. Previous work has shown that the numbers of both low- and high-frequency variants are elevated relative to neutral evolution when a strongly beneficial allele fixes. Here, we follow a recent investigation of genetic hitchhiking using a marked Yule process to obtain an analytical prediction of the site frequency spectrum in a panmictic population at the time of fixation of a highly beneficial mutation. We combine standard results from the neutral case with the effects of a selective sweep. As simulations show, the resulting formula produces predictions that are more accurate than previous approaches for the whole frequency spectrum. In particular, the formula correctly predicts the elevation of low- and high-frequency variants and is significantly more accurate than previously derived formulas for intermediate frequency variants.


Subject(s)
Evolution, Molecular , Models, Genetic , Selection, Genetic
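The site frequency spectrum S(1),...,S(n-1) defined in the abstract above can be tabulated directly from aligned sequences. The sketch below assumes a hypothetical input format: one row per sampled sequence, one column per site, with 1 marking the derived allele and 0 the ancestral allele. It also returns the standard neutral expectation E[S(k)] = θ/k for comparison. The formula for the spectrum after a selective sweep derived in the paper is not reproduced here.

```python
def site_frequency_spectrum(haplotypes):
    """Return [S(1), ..., S(n-1)] for n sequences coded as lists of 0/1."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    # zip(*haplotypes) iterates over sites (columns) of the alignment.
    for column in zip(*haplotypes):
        k = sum(column)  # number of sequences carrying the derived allele
        if 0 < k < n:    # skip sites that are not polymorphic in the sample
            sfs[k - 1] += 1
    return sfs

def neutral_expected_sfs(theta, n):
    """Expected spectrum under the standard neutral model: E[S(k)] = theta / k."""
    return [theta / k for k in range(1, n)]

# Three sequences, four sites: one singleton and two doubletons.
example_sfs = site_frequency_spectrum([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])
```

Comparing an observed spectrum with `neutral_expected_sfs` class by class is one simple way to see the excess of low- and high-frequency variants described in the abstract.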
20.
Theor Popul Biol ; 90: 1-11, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24051161

ABSTRACT

Beneficial mutations can co-occur when population structure slows down adaptation. Here, we consider the process of adaptation in asexual populations distributed over several locations ("islands"). New beneficial mutations arise at constant rate u_b, and each mutation has the same selective advantage s > 0. We assume that populations evolve within islands according to the successional mutations regime of Desai and Fisher (2007), that is, the time to local fixation of a mutation is short compared to the expected waiting time until the next mutation occurs. To study the rate of adaptation, we introduce an approximate model, the successional mutations (SM) model, which can be simulated efficiently and yields accurate results for a wide range of parameters. In the SM model, mutations fix instantly within islands, and migrants can take over the destination island if they are fitter than the residents. For the special case of a population distributed equally across two islands with population size N, we approximate the model further for small and large migration rates in comparison to the mutation rate. These approximations lead to explicit formulas for the rate of adaptation which fit the original model for a large range of parameter values. For the d-island case, we provide some heuristics on how to extend the explicit formulas and check these with computer simulations. We conclude that the SM model is a good approximation of the adaptation process in a structured population, at least if mutation or migration is limited.


Subject(s)
Adaptation, Physiological , Population Dynamics , Models, Theoretical , Mutation