Results 1 - 20 of 52
1.
PLoS Genet ; 20(2): e1010836, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38330138

ABSTRACT

Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.


Subject(s)
Genetic Speciation , Hominidae , Animals , Humans , Hominidae/genetics , Pan troglodytes/genetics , Phylogeny , Genetics, Population , Models, Genetic
2.
Stat Appl Genet Mol Biol ; 23(1), 2024 Jan 01.
Article in English | MEDLINE | ID: mdl-38753402

ABSTRACT

Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study where we compare to state-of-the-art methods, and show results for three data sets of somatic mutation counts from patients with cancer in the breast, liver and urinary tract.


Subject(s)
Algorithms , Mutation , Neoplasms , Humans , Neoplasms/genetics , Models, Genetic , Computer Simulation , Models, Statistical
3.
Theor Popul Biol ; 156: 1-4, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38184209

ABSTRACT

Consider the problem of estimating the branch lengths in a symmetric 2-state substitution model with a known topology and a general, clock-like or star-shaped tree with three leaves. We show that the maximum likelihood estimates are analytically tractable and can be obtained from pairwise sequence comparisons. Furthermore, we demonstrate that this property does not generalize to larger state spaces, more complex models or larger trees. Our arguments are based on an enumeration of the free parameters of the model and the dimension of the minimal sufficient data vector. Our interest in this problem arose from discussions with our former colleague Freddy Bugge Christiansen.


Subject(s)
Evolution, Molecular , Models, Genetic , Likelihood Functions , Phylogeny
4.
Theor Popul Biol ; 157: 14-32, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38460602

ABSTRACT

A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. 
Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
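
The matrix view described in this abstract can be illustrated with a minimal sketch. It assumes Kingman's standard coalescent with time scaled so that each pair of lineages coalesces at rate 1, and it uses the standard phase-type mean formula E = alpha (-S)^{-1} r, where S is the sub-intensity matrix and r a reward vector. The function names are hypothetical and are not taken from the authors' R code.

```python
from fractions import Fraction
from math import comb

def coalescent_subintensity(n):
    """Sub-intensity matrix S; transient states hold n, n-1, ..., 2 lineages."""
    m = n - 1
    S = [[Fraction(0)] * m for _ in range(m)]
    for row, k in enumerate(range(n, 1, -1)):
        rate = Fraction(comb(k, 2))
        S[row][row] = -rate
        if row + 1 < m:
            S[row][row + 1] = rate  # k -> k-1 lineages; from k = 2 the chain is absorbed
    return S

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (exact with Fractions)."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]

def expected_reward(n, reward):
    """alpha (-S)^{-1} r with alpha = (1, 0, ..., 0), i.e. start with n lineages."""
    S = coalescent_subintensity(n)
    neg_S = [[-v for v in row] for row in S]
    r = [Fraction(reward(k)) for k in range(n, 1, -1)]
    return solve(neg_S, r)[0]

# Reward 1 per unit time gives the tree height (time to the MRCA);
# reward k (number of lineages) gives the total branch length.
mean_height = expected_reward(5, lambda k: 1)
mean_length = expected_reward(5, lambda k: k)
```

For n = 5 these agree with the classical closed forms 2(1 - 1/n) and 2(1 + 1/2 + ... + 1/(n-1)), which is the kind of first-step result the abstract says phase-type calculations reproduce.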


Subject(s)
Genetics, Population , Markov Chains , Models, Genetic , Humans
5.
BMC Bioinformatics ; 24(1): 187, 2023 May 08.
Article in English | MEDLINE | ID: mdl-37158829

ABSTRACT

BACKGROUND: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. RESULTS: We propose a Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison where we show that state-of-the-art methods are highly overestimating the number of signatures when overdispersion is present. We apply our proposed analysis on a wide range of simulated data and on two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice. CONCLUSIONS: With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. 
Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS .
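
For orientation, the Poisson baseline that this abstract contrasts with can be sketched as NMF under the generalized Kullback-Leibler divergence, fitted with the classical Lee-Seung multiplicative updates. This is not the Negative Binomial update rule of SigMoS; the matrix layout (rows as patients, columns as mutation types), the toy counts and all function names are assumptions made for illustration.

```python
import math
import random

def matmul(W, H):
    """Plain-Python matrix product W @ H."""
    return [[sum(W[i][k] * H[k][j] for k in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]

def gkl(V, M):
    """Generalized KL divergence D(V || M): the negative Poisson
    log-likelihood up to terms that do not depend on M."""
    total = 0.0
    for v_row, m_row in zip(V, M):
        for v, m in zip(v_row, m_row):
            total += (v * math.log(v / m) if v > 0 else 0.0) - v + m
    return total

def poisson_nmf(V, rank, iters=100, seed=0):
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in range(n)]  # exposures
    H = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(rank)]  # signatures
    history = [gkl(V, matmul(W, H))]
    for _ in range(iters):
        WH = matmul(W, H)  # held fixed while every entry of H is updated
        for k in range(rank):
            w_sum = sum(W[i][k] for i in range(n))
            for j in range(m):
                H[k][j] *= sum(W[i][k] * V[i][j] / WH[i][j] for i in range(n)) / w_sum
        WH = matmul(W, H)  # held fixed while every entry of W is updated
        for i in range(n):
            for k in range(rank):
                h_sum = sum(H[k])
                W[i][k] *= sum(H[k][j] * V[i][j] / WH[i][j] for j in range(m)) / h_sum
        history.append(gkl(V, matmul(W, H)))
    return W, H, history

# Hypothetical toy counts: 5 patients x 4 mutation types.
V = [[10, 2, 1, 7], [8, 1, 2, 6], [1, 9, 8, 2], [2, 11, 7, 1], [5, 5, 5, 5]]
W, H, history = poisson_nmf(V, rank=2, iters=100, seed=1)
```

The divergence in `history` should not increase from one iteration to the next. Choosing `rank` is the model selection problem the abstract addresses, and this sketch does not attempt it.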


Subject(s)
Algorithms , Breast , Male , Humans , Mutation , Binomial Distribution , Computer Simulation
6.
J Math Biol ; 83(6-7): 63, 2021 Nov 16.
Article in English | MEDLINE | ID: mdl-34783900

ABSTRACT

Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package PhaseTypeR, and R code for the reproduction of our results is available as an accompanying vignette.
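
As a concrete instance of the linear functions mentioned in this abstract, the sketch below writes Watterson's estimator and the average number of pairwise differences as weighted sums of the unfolded SFS. The weights are the textbook ones; the function names and the example spectrum are illustrative and unrelated to PhaseTypeR.

```python
from fractions import Fraction

def watterson_weights(n):
    """Weight 1/a_n on every SFS entry, with a_n = 1 + 1/2 + ... + 1/(n-1)."""
    a_n = sum(Fraction(1, i) for i in range(1, n))
    return [1 / a_n] * (n - 1)

def pairwise_weights(n):
    """Weight i(n-i)/C(n,2) on entry i: the chance that two sampled sequences differ at such a site."""
    pairs = Fraction(n * (n - 1), 2)
    return [Fraction(i * (n - i)) / pairs for i in range(1, n)]

def linear_statistic(weights, sfs):
    """Generic linear function of the SFS: sum_i w_i * xi_i."""
    return sum(w * x for w, x in zip(weights, sfs))

sfs = [1, 1, 1]          # hypothetical unfolded SFS for n = 4 sequences
n = len(sfs) + 1
theta_w = linear_statistic(watterson_weights(n), sfs)
theta_pi = linear_statistic(pairwise_weights(n), sfs)
tajima_numerator = theta_pi - theta_w   # the numerator of Tajima's D
```

Tajima's D divides this numerator by an estimated standard deviation that itself depends on the number of segregating sites, so the full statistic is a ratio and no longer a plain linear function of the SFS.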


Subject(s)
Models, Genetic , Mutation Rate , Genetics, Population , Likelihood Functions , Mutation
7.
PLoS Genet ; 14(9): e1007641, 2018 Sep.
Article in English | MEDLINE | ID: mdl-30226838

ABSTRACT

Human populations outside of Africa have experienced at least two bouts of introgression from archaic humans, from Neanderthals and Denisovans. In Papuans there is prior evidence of both these introgressions. Here we present a new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome. The approach is based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants (SNVs) not seen in unadmixed populations. We show using simulations that this provides a powerful approach to identifying segments of archaic introgression with a low rate of false detection, given data from a suitable outgroup population is available, without the archaic introgression but containing a majority of the variation that arose since initial separation from the archaic lineage. Furthermore our approach is able to infer admixture proportions and the times both of admixture and of initial divergence between the human and archaic populations. We apply the model to detect archaic introgression in 89 Papuans and show how the identified segments can be assigned to likely Neanderthal or Denisovan origin. We report more Denisovan admixture than previous studies and find a shift in size distribution of fragments of Neanderthal and Denisovan origin that is compatible with a difference in admixture time. Furthermore, we identify small amounts of Denisova ancestry in South East Asians and South Asians.
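
The general idea of flagging genomic windows by posterior state probability can be sketched with a two-state hidden Markov model and scaled forward-backward decoding. Everything specific here is an assumption for illustration: the Bernoulli emission (whether a window contains a variant absent from the outgroup), the parameter values and the function name. The paper's own emission model and fitted parameters are not reproduced.

```python
def posterior_decoding(obs, init, trans, emit):
    """Scaled forward-backward; returns P(state at t = k | all observations)."""
    n_states, T = len(init), len(obs)
    fwd, scales = [], []
    for t in range(T):
        if t == 0:
            raw = [init[k] * emit[k][obs[0]] for k in range(n_states)]
        else:
            raw = [sum(fwd[-1][j] * trans[j][k] for j in range(n_states)) * emit[k][obs[t]]
                   for k in range(n_states)]
        c = sum(raw)
        scales.append(c)                      # c_t = P(o_t | o_1..o_{t-1})
        fwd.append([p / c for p in raw])
    bwd = [[1.0] * n_states for _ in range(T)]
    for t in range(T - 2, -1, -1):
        bwd[t] = [sum(trans[k][j] * emit[j][obs[t + 1]] * bwd[t + 1][j]
                      for j in range(n_states)) / scales[t + 1]
                  for k in range(n_states)]
    return [[fwd[t][k] * bwd[t][k] for k in range(n_states)] for t in range(T)]

# State 0 = non-introgressed, state 1 = archaic (hypothetical parameter values).
init = [0.9, 0.1]
trans = [[0.95, 0.05], [0.10, 0.90]]
emit = [[0.9, 0.1], [0.4, 0.6]]   # emit[state][obs]; obs = 1 if the window has a private variant
obs = [0] * 10 + [1, 1, 0, 1, 1, 1, 0, 1] + [0] * 10
posterior = posterior_decoding(obs, init, trans, emit)
```

Windows inside the variant-dense stretch receive a higher posterior probability for state 1 than windows in the sparse flanks, which is the signal such a scan thresholds on.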


Subject(s)
Genome, Human/genetics , Hominidae/genetics , Hybridization, Genetic/genetics , Neanderthals/genetics , Animals , Asian People/genetics , Black People/genetics , Fossils , Humans , Native Hawaiian or Other Pacific Islander/genetics , Phylogeny , White People/genetics
8.
Bioinformatics ; 35(2): 189-199, 2019 Jan 15.
Article in English | MEDLINE | ID: mdl-29945188

ABSTRACT

Motivation: Understanding the mutational processes that act during cancer development is a key topic of cancer biology. Nevertheless, much remains to be learned, as a complex interplay of processes with dependencies on a range of genomic features creates highly heterogeneous cancer genomes. Accurate driver detection relies on unbiased models of the mutation rate that also capture rate variation from uncharacterized sources. Results: Here, we analyse patterns of observed-to-expected mutation counts across 505 whole cancer genomes, and find that genomic features missing from our mutation-rate model likely operate on a megabase length scale. We extend our site-specific model of the mutation rate to include the additional variance from these sources, which leads to robust significance evaluation of candidate cancer drivers. We thus present ncdDetect v.2, with greatly improved cancer driver detection specificity. Finally, we show that ranking candidates by their posterior mean value of their effect sizes offers an equivalent and more computationally efficient alternative to ranking by their P-values. Availability and implementation: ncdDetect v.2 is implemented as an R-package and is freely available at http://github.com/TobiasMadsen/ncdDetect2. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Models, Genetic , Mutation Rate , Neoplasms/genetics , Computational Biology , Genomics , Humans , Software
9.
Theor Popul Biol ; 127: 16-32, 2019 Jun.
Article in English | MEDLINE | ID: mdl-30822431

ABSTRACT

Probability modelling for DNA sequence evolution is well established and provides a rich framework for understanding genetic variation between samples of individuals from one or more populations. We show that both classical and more recent models for coalescence (with or without recombination) can be described in terms of the so-called phase-type theory, where complicated and tedious calculations are circumvented by the use of matrix manipulations. The application of phase-type theory in population genetics consists of describing the biological system as a Markov model by appropriately setting up a state space and calculating the corresponding intensity and reward matrices. Formulae of interest are then expressed in terms of these aforementioned matrices. We illustrate this procedure by a number of examples: (a) Calculating the mean, (co)variance and even higher order moments of the site frequency spectrum in multiple merger coalescent models, (b) Analysing a sample of DNA sequences from the Atlantic Cod using the Beta-coalescent, and (c) Determining the correlation of the number of segregating sites for multiple samples in the two-locus ancestral recombination graph. We believe that phase-type theory has great potential as a tool for analysing probability models in population genetics. The compact matrix notation is useful for clarification of current models, and in particular their formal manipulation and calculations, but also for further development or extensions.


Subject(s)
Genetics, Population , Models, Genetic , Algorithms , Humans , Markov Chains , Population Density , Recombination, Genetic
10.
Stat Appl Genet Mol Biol ; 17(3), 2018 Jun 11.
Article in English | MEDLINE | ID: mdl-29886455

ABSTRACT

Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n - 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n - i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
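
The SFS definition given in this abstract translates directly into code. The sketch below assumes haplotypes encoded as equal-length lists of 0 (ancestral base) and 1 (mutant base). The function names and the toy data are illustrative and are not part of CubSFS.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i (1 <= i <= n-1) counts sites where the mutant base appears i times."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = Counter(sum(h[s] for h in haplotypes) for s in range(n_sites))
    # Sites with 0 or n mutant copies are not segregating and are left out.
    return [counts.get(i, 0) for i in range(1, n)]

def fold(sfs):
    """Folded SFS: merge classes i and n - i, for use when the ancestral base is unknown."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        folded.append(sfs[i - 1] + (sfs[j - 1] if j != i else 0))
    return folded

# Four sampled sequences (n = 4) and five sites.
haplotypes = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
sfs = site_frequency_spectrum(haplotypes)   # vector of length n - 1 = 3
folded_sfs = fold(sfs)
```

The two non-segregating sites in the toy data contribute nothing, so the unfolded spectrum has one site in each of its three classes.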


Subject(s)
Gene Frequency , Models, Genetic , Population Density , Asian People/genetics , Black People/genetics , Genetics, Population , Genome, Human , Human Genetics/methods , Human Genetics/statistics & numerical data , Human Genome Project , Humans , Software , White People/genetics
11.
Nature ; 499(7459): 471-5, 2013 Jul 25.
Article in English | MEDLINE | ID: mdl-23823723

ABSTRACT

Most great ape genetic variation remains uncharacterized; however, its study is critical for understanding population history, recombination, selection and susceptibility to disease. Here we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report 88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/western and central/eastern populations. We find extensive inbreeding in almost all wild populations, with eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at, or close to, genes in almost all species. We discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalogue of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.


Subject(s)
Genetic Variation , Hominidae/genetics , Africa , Animals , Animals, Wild/genetics , Animals, Zoo/genetics , Asia, Southeastern , Evolution, Molecular , Gene Flow/genetics , Genetics, Population , Genome/genetics , Gorilla gorilla/classification , Gorilla gorilla/genetics , Hominidae/classification , Humans , Inbreeding , Pan paniscus/classification , Pan paniscus/genetics , Pan troglodytes/classification , Pan troglodytes/genetics , Phylogeny , Polymorphism, Single Nucleotide/genetics , Population Density
12.
J Math Biol ; 78(6): 1727-1769, 2019 May.
Article in English | MEDLINE | ID: mdl-30734077

ABSTRACT

In population genetics, the Dirichlet (also called the Balding-Nichols) model has for 20 years been considered the key model to approximate the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is approximate because positive correlations among alleles cannot be accommodated under the Dirichlet model. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper attempts to address this problem by providing a general overview of how allele fraction data under the most common multi-allelic mutational structures should be modeled. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright-Fisher process with mutation, and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is only an exceptionally good approximation for the pure drift, Jukes-Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required and proposed for the other mutation processes, such as a Beta-Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa-Kishino-Yano and Tamura-Nei processes. Finally, a novel Hierarchical Beta approximation is developed, a Pyramidal Hierarchical Beta model, for the generalized time-reversible and single-step mutation processes.


Subject(s)
Alleles , Data Analysis , Genetics, Population/methods , Models, Genetic , Computer Simulation , Datasets as Topic , Humans , Mutation Rate
13.
BMC Bioinformatics ; 19(1): 147, 2018 Apr 19.
Article in English | MEDLINE | ID: mdl-29673314

ABSTRACT

BACKGROUND: Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration. RESULTS: To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and site-specific models. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than regional based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures. CONCLUSION: We find that our site-specific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. 
Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.


Subject(s)
Genome, Human , Models, Genetic , Mutation Rate , Mutation/genetics , Neoplasms/genetics , Databases, Genetic , Epigenomics , Humans , Polymorphism, Single Nucleotide/genetics , Regression Analysis
14.
Syst Biol ; 66(1): e30-e46, 2017 Jan 01.
Article in English | MEDLINE | ID: mdl-28173553

ABSTRACT

The Wright-Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright-Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright-Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright-Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.
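
The basic bi-allelic model with drift as the only force, which this review starts from, can be simulated in a few lines. The sketch assumes a constant number of gene copies (2N for a diploid population) and binomial resampling every generation. The function name and parameter values are illustrative, and the approximation methods compared in the review are not implemented here.

```python
import random

def wright_fisher_trajectory(n_copies, p0, generations, seed=0):
    """Allele-frequency path under pure drift (no mutation, migration or selection)."""
    rng = random.Random(seed)
    freq = p0
    path = [freq]
    for _ in range(generations):
        # Binomial(n_copies, freq) draw: each gene copy inherits the allele with probability freq.
        copies = sum(rng.random() < freq for _ in range(n_copies))
        freq = copies / n_copies
        path.append(freq)
    return path

path = wright_fisher_trajectory(n_copies=20, p0=0.5, generations=200, seed=1)
```

Repeating the simulation over many seeds and recording the frequency at a fixed generation gives an empirical distribution of allele frequencies, the quantity that the diffusion-based and moment-based methods approximate analytically.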


Subject(s)
Gene Frequency/genetics , Models, Genetic , Evolution, Molecular , Genetic Drift , Mutation
15.
Nature ; 486(7404): 527-31, 2012 Jun 28.
Article in English | MEDLINE | ID: mdl-22722832

ABSTRACT

Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.


Subject(s)
Evolution, Molecular , Genetic Variation/genetics , Genome, Human/genetics , Genome/genetics , Pan paniscus/genetics , Pan troglodytes/genetics , Animals , DNA Transposable Elements/genetics , Gene Duplication/genetics , Genotype , Humans , Molecular Sequence Data , Phenotype , Phylogeny , Species Specificity
16.
Nature ; 483(7388): 169-75, 2012 Mar 07.
Article in English | MEDLINE | ID: mdl-22398555

ABSTRACT

Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.


Subject(s)
Evolution, Molecular , Genetic Speciation , Genome/genetics , Gorilla gorilla/genetics , Animals , Female , Gene Expression Regulation , Genetic Variation/genetics , Genomics , Humans , Macaca mulatta/genetics , Molecular Sequence Data , Pan troglodytes/genetics , Phylogeny , Pongo/genetics , Proteins/genetics , Sequence Alignment , Species Specificity , Transcription, Genetic
17.
Proc Natl Acad Sci U S A ; 112(20): 6413-8, 2015 May 19.
Article in English | MEDLINE | ID: mdl-25941379

ABSTRACT

The unique inheritance pattern of the X chromosome exposes it to natural selection in a way that is different from that of the autosomes, potentially resulting in accelerated evolution. We perform a comparative analysis of X chromosome polymorphism in 10 great ape species, including humans. In most species, we identify striking megabase-wide regions, where nucleotide diversity is less than 20% of the chromosomal average. Such regions are found exclusively on the X chromosome. The regions overlap partially among species, suggesting that the underlying targets are partly shared among species. The regions have higher proportions of singleton SNPs, higher levels of population differentiation, and a higher nonsynonymous-to-synonymous substitution ratio than the rest of the X chromosome. We show that the extent to which diversity is reduced is incompatible with direct selection or the action of background selection and soft selective sweeps alone, and therefore, we suggest that very strong selective sweeps have independently targeted these specific regions in several species. The only genomic feature that we can identify as strongly associated with loss of diversity is the location of testis-expressed ampliconic genes, which also have reduced diversity around them. We hypothesize that these genes may be responsible for selective sweeps in the form of meiotic drive caused by an intragenomic conflict in male meiosis.


Subject(s)
Genetic Variation , Hominidae/genetics , Polymorphism, Genetic , Selection, Genetic/genetics , X Chromosome/genetics , Animals , Computational Biology , Databases, Genetic , Genetics, Population , Models, Genetic , Species Specificity
18.
BMC Bioinformatics ; 18(1): 199, 2017 Mar 31.
Article in English | MEDLINE | ID: mdl-28359297

ABSTRACT

BACKGROUND: Factor graphs provide a flexible and general framework for specifying probability distributions. They can capture a range of popular and recent models for analysis of both genomics data as well as data from other scientific fields. Owing to the ever larger data sets encountered in genomics and the multiple-testing issues accompanying them, accurate significance evaluation is of great importance. We here address the problem of evaluating statistical significance of observations from factor graph models. RESULTS: Two novel numerical approximations for evaluation of statistical significance are presented: first, a method using importance sampling; second, a method based on a saddlepoint approximation. We develop algorithms to efficiently compute the approximations and compare them to naive sampling and the normal approximation. The individual merits of the methods are analysed both from a theoretical viewpoint and with simulations. A guideline for choosing between the normal approximation, saddlepoint approximation and importance sampling is also provided. Finally, the applicability of the methods is demonstrated with examples from cancer genomics, motif-analysis and phylogenetics. CONCLUSIONS: The applicability of saddlepoint approximation and importance sampling is demonstrated on known models in the factor graph framework. Using the two methods we can substantially improve computational cost without compromising accuracy. This contribution allows analyses of large datasets in the general factor graph framework.


Subject(s)
Algorithms , Computational Biology/methods , Models, Theoretical , Amino Acid Sequence , CCCTC-Binding Factor , Genomics , Humans , MCF-7 Cells , Neoplasms/diagnosis , Neoplasms/genetics , Phylogeny , Probability , Protein Interaction Domains and Motifs , Repressor Proteins , Sequence Alignment
19.
Theor Popul Biol ; 114: 88-94, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28041892

ABSTRACT

Recently, Burden and Tang (2016) provided an analytical expression for the stationary distribution of the multivariate neutral Wright-Fisher model with low mutation rates. In this paper we present a simple, alternative derivation that illustrates the approximation. Our proof is based on the discrete multivariate boundary mutation model which has three key ingredients. First, the decoupled Moran model is used to describe genetic drift. Second, low mutation rates are assumed by limiting mutations to monomorphic states. Third, the mutation rate matrix is separated into a time-reversible part and a flux part, as suggested by Burden and Tang (2016). An application of our result to data from several great apes reveals that the assumption of stationarity may be inadequate or that other evolutionary forces like selection or biased gene conversion are acting. Furthermore we find that the model with a reversible mutation rate matrix provides a reasonably good fit to the data compared to the one with a non-reversible mutation rate matrix.


Subject(s)
Biological Evolution , Gene Frequency , Genetic Drift , Mutation Rate , Genetics, Population , Models, Genetic , Mutation , Selection, Genetic
20.
Theor Popul Biol ; 108: 36-50, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26612605

ABSTRACT

We consider the diffusion approximation of the multivariate Wright-Fisher process with mutation. Analytically tractable formulas for the first- and second-order moments of the allele frequency distribution are derived, and the moments are subsequently used to better understand key population genetics parameters and modeling frameworks. In particular we investigate the behavior of the expected homozygosity (the probability that two randomly sampled genes are identical) in the transient and stationary phases, and how appropriate the Dirichlet distribution is for modeling the allele frequency distribution at different evolutionary time scales. We find that the Dirichlet distribution is adequate for the pure drift model (no mutations allowed), but the distribution is not sufficiently flexible for more general mutation models. We suggest a new hierarchical Beta distribution for the allele frequencies in the Wright-Fisher process with a mutation model on the nucleotide level that distinguishes between transitions and transversions.
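
For the pure drift case, expected homozygosity has a simple closed form in the discrete-generation Wright-Fisher model, sketched below. This is the textbook identity 1 - F_t = (1 - F_0)(1 - 1/M)^t for M gene copies, with the two genes sampled with replacement. It is not the diffusion-approximation moment formula derived in the paper, and the function names are illustrative.

```python
def homozygosity(freqs):
    """Probability that two genes drawn with replacement carry the same allele."""
    return sum(p * p for p in freqs)

def expected_homozygosity_drift(freqs0, n_copies, t):
    """Expected homozygosity after t generations of pure drift among n_copies gene copies."""
    f0 = homozygosity(freqs0)
    return 1 - (1 - f0) * (1 - 1 / n_copies) ** t

start = [0.25, 0.25, 0.25, 0.25]          # four equally frequent alleles
f_now = expected_homozygosity_drift(start, n_copies=10, t=0)
f_next = expected_homozygosity_drift(start, n_copies=10, t=1)
f_late = expected_homozygosity_drift(start, n_copies=10, t=500)
```

Without mutation the expected homozygosity rises monotonically towards 1 as alleles are lost. With mutation it settles at a stationary value below 1, which is the transient-versus-stationary distinction the abstract draws.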


Subject(s)
Genetics, Population , Models, Genetic , Mutation , Gene Frequency , Humans