1.
PLoS Genet ; 20(2): e1010836, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38330138

ABSTRACT

Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.


Subjects
Genetic Speciation , Hominidae , Animals , Humans , Hominidae/genetics , Pan troglodytes/genetics , Phylogeny , Population Genetics , Genetic Models
2.
Stat Appl Genet Mol Biol ; 23(1)2024 Jan 01.
Article in English | MEDLINE | ID: mdl-38753402

ABSTRACT

Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study where we compare to state-of-the-art methods, and show results for three data sets of somatic mutation counts from patients with cancer in the breast, liver and urinary tract.
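To make the NMF step concrete, the following is a minimal sketch of the classical multiplicative updates for NMF under a Poisson (Kullback-Leibler) objective. It is not the authors' EM and quasi-Poisson regression procedure; the function names, the random initialisation and the iteration count are all assumptions chosen for illustration.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kl_divergence(V, M):
    """Generalised Kullback-Leibler divergence between counts V and fit M."""
    total = 0.0
    for v_row, m_row in zip(V, M):
        for v, m in zip(v_row, m_row):
            total += m - v + (v * math.log(v / m) if v > 0 else 0.0)
    return total

def poisson_nmf(V, rank, iterations=200, seed=0):
    """Factorise V (rows x columns) as W @ H with non-negative entries.

    Assumes every row and column of V has at least one positive count.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # Update W with H fixed.
        WH = matmul(W, H)
        ratio = [[v / m for v, m in zip(vr, mr)] for vr, mr in zip(V, WH)]
        h_sums = [sum(row) for row in H]
        num = matmul(ratio, [list(col) for col in zip(*H)])  # ratio @ H^T
        W = [[W[i][k] * num[i][k] / h_sums[k] for k in range(rank)]
             for i in range(n_rows)]
        # Update H with the new W fixed.
        WH = matmul(W, H)
        ratio = [[v / m for v, m in zip(vr, mr)] for vr, mr in zip(V, WH)]
        w_sums = [sum(W[i][k] for i in range(n_rows)) for k in range(rank)]
        num = matmul([list(col) for col in zip(*W)], ratio)  # W^T @ ratio
        H = [[H[k][j] * num[k][j] / w_sums[k] for j in range(n_cols)]
             for k in range(rank)]
    return W, H
```

In a mutational-signature setting one factor would hold the signatures and the other the per-patient exposures; which is which depends on how the count matrix is oriented.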


Subjects
Algorithms , Mutation , Neoplasms , Humans , Neoplasms/genetics , Genetic Models , Computer Simulation , Statistical Models
3.
Theor Popul Biol ; 156: 1-4, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38184209

ABSTRACT

Consider the problem of estimating the branch lengths in a symmetric 2-state substitution model with a known topology and a general, clock-like or star-shaped tree with three leaves. We show that the maximum likelihood estimates are analytically tractable and can be obtained from pairwise sequence comparisons. Furthermore, we demonstrate that this property does not generalize to larger state spaces, more complex models or larger trees. Our arguments are based on an enumeration of the free parameters of the model and the dimension of the minimal sufficient data vector. Our interest in this problem arose from discussions with our former colleague Freddy Bugge Christiansen.
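As an illustration of how pairwise comparisons can yield branch lengths on a three-leaf tree, here is a minimal sketch using the standard distance correction for the symmetric 2-state model, d = -(1/2) ln(1 - 2p), followed by the three-point formula. Whether this coincides with the closed-form maximum likelihood estimates derived in the paper is an assumption; the function names are hypothetical.

```python
import math

def two_state_distance(seq_a, seq_b):
    """Expected substitutions per site under the symmetric 2-state model."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.5:
        raise ValueError("sequences are saturated; the distance is undefined")
    return -0.5 * math.log(1.0 - 2.0 * p)

def three_leaf_branch_lengths(seq_a, seq_b, seq_c):
    """Branch lengths from the internal node to leaves a, b and c."""
    d_ab = two_state_distance(seq_a, seq_b)
    d_ac = two_state_distance(seq_a, seq_c)
    d_bc = two_state_distance(seq_b, seq_c)
    t_a = (d_ab + d_ac - d_bc) / 2.0
    t_b = (d_ab + d_bc - d_ac) / 2.0
    t_c = (d_ac + d_bc - d_ab) / 2.0
    return t_a, t_b, t_c
```

A negative value from the three-point formula signals that the pairwise distances are not tree-like for that branch; how such boundary cases should be handled is not addressed by this sketch.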


Subjects
Molecular Evolution , Genetic Models , Likelihood Functions , Phylogeny
4.
Theor Popul Biol ; 157: 14-32, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38460602

ABSTRACT

A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
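The matrix formulation can be sketched for the standard coalescent. With sub-intensity matrix S over the states "k lineages remain" and a reward vector r, the expected accumulated reward is α(-S)⁻¹r, where α is the initial distribution. The code below is an independent Python illustration, not the authors' R code; the function names are hypothetical, and time is measured in coalescent units in which a pair of lineages coalesces at rate 1.

```python
from fractions import Fraction

def kingman_subintensity(n):
    """Sub-intensity matrix over the transient states n, n-1, ..., 2 lineages."""
    states = list(range(n, 1, -1))
    size = len(states)
    S = [[Fraction(0)] * size for _ in range(size)]
    for row, k in enumerate(states):
        rate = Fraction(k * (k - 1), 2)
        S[row][row] = -rate
        if row + 1 < size:
            S[row][row + 1] = rate  # k lineages -> k - 1 lineages
    return states, S

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def expected_reward(n, reward):
    """alpha (-S)^{-1} r with alpha = (1, 0, ..., 0): start with n lineages."""
    states, S = kingman_subintensity(n)
    neg_S = [[-v for v in row] for row in S]
    r = [Fraction(reward(k)) for k in states]
    return solve(neg_S, r)[0]

def expected_tree_height(n):
    return expected_reward(n, lambda k: 1)   # reward 1 per unit time

def expected_tree_length(n):
    return expected_reward(n, lambda k: k)   # reward k while k lineages exist
```

Changing only the reward vector switches between tree height and total branch length, which is the kind of reward transformation the abstract refers to.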


Subjects
Population Genetics , Markov Chains , Genetic Models , Humans
5.
BMC Bioinformatics ; 24(1): 187, 2023 May 08.
Article in English | MEDLINE | ID: mdl-37158829

ABSTRACT

BACKGROUND: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. RESULTS: We propose a Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison where we show that state-of-the-art methods are highly overestimating the number of signatures when overdispersion is present. We apply our proposed analysis on a wide range of simulated data and on two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice. CONCLUSIONS: With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS .
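The distributional point can be illustrated with the two log-likelihoods side by side. This sketch is not taken from SigMoS; it assumes a mean/dispersion parameterisation in which the Negative Binomial variance is mean + mean²/dispersion, so that a large dispersion value recovers the Poisson case. The function and parameter names are hypothetical.

```python
import math

def poisson_logpmf(k, mean):
    """Log-probability of count k under a Poisson with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def neg_binomial_logpmf(k, mean, dispersion):
    """Log-probability of count k under a Negative Binomial with the given
    mean and variance mean + mean**2 / dispersion."""
    p = mean / (mean + dispersion)
    return (math.lgamma(k + dispersion) - math.lgamma(dispersion)
            - math.lgamma(k + 1)
            + dispersion * math.log(1.0 - p) + k * math.log(p))
```

For overdispersed counts, the Negative Binomial assigns more probability to values far from the mean than a Poisson with the same mean does.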


Subjects
Algorithms , Breast , Male , Humans , Mutation , Binomial Distribution , Computer Simulation
6.
J Math Biol ; 83(6-7): 63, 2021 11 16.
Article in English | MEDLINE | ID: mdl-34783900

ABSTRACT

Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package PhaseTypeR, and R code for the reproduction of our results is available as an accompanying vignette.
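Two of the classical estimators mentioned above can be written explicitly as weighted sums of the SFS entries. The sketch below is a plain Python illustration, not the PhaseTypeR API, and its function names are hypothetical. It uses the standard weights for Watterson's estimator (based on the number of segregating sites) and for the pairwise-difference estimator, with exact rational arithmetic.

```python
from fractions import Fraction

def linear_sfs_statistic(sfs, weights):
    """Weighted sum of SFS entries; sfs[i - 1] is the number of sites where
    the mutant base appears i times in a sample of size n = len(sfs) + 1."""
    return sum(w * count for w, count in zip(weights, sfs))

def watterson_weights(n):
    """Every segregating site gets weight 1 / a_n, a_n = 1 + 1/2 + ... + 1/(n-1)."""
    a_n = sum(Fraction(1, i) for i in range(1, n))
    return [1 / a_n] * (n - 1)

def pairwise_weights(n):
    """A site with i mutant copies differs in i * (n - i) of the C(n, 2) pairs."""
    pairs = Fraction(n * (n - 1), 2)
    return [Fraction(i * (n - i)) / pairs for i in range(1, n)]

def theta_watterson(sfs):
    return linear_sfs_statistic(sfs, watterson_weights(len(sfs) + 1))

def theta_pairwise(sfs):
    return linear_sfs_statistic(sfs, pairwise_weights(len(sfs) + 1))
```

The difference theta_pairwise(sfs) - theta_watterson(sfs) is the numerator of Tajima's D; the normalising denominator is omitted from this sketch.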


Subjects
Genetic Models , Mutation Rate , Population Genetics , Likelihood Functions , Mutation
7.
PLoS Genet ; 14(9): e1007641, 2018 09.
Article in English | MEDLINE | ID: mdl-30226838

ABSTRACT

Human populations outside of Africa have experienced at least two bouts of introgression from archaic humans, from Neanderthals and Denisovans. In Papuans there is prior evidence of both these introgressions. Here we present a new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome. The approach is based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants (SNVs) not seen in unadmixed populations. We show using simulations that this provides a powerful approach to identifying segments of archaic introgression with a low rate of false detection, given data from a suitable outgroup population is available, without the archaic introgression but containing a majority of the variation that arose since initial separation from the archaic lineage. Furthermore our approach is able to infer admixture proportions and the times both of admixture and of initial divergence between the human and archaic populations. We apply the model to detect archaic introgression in 89 Papuans and show how the identified segments can be assigned to likely Neanderthal or Denisovan origin. We report more Denisovan admixture than previous studies and find a shift in size distribution of fragments of Neanderthal and Denisovan origin that is compatible with a difference in admixture time. Furthermore, we identify small amounts of Denisova ancestry in South East Asians and South Asians.
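The idea of flagging windows with a high density of variants absent from the outgroup can be sketched as a two-state hidden Markov model with Poisson emissions. This is not the authors' implementation: the emission rates, transition probability and starting distribution are made-up illustrative values, and Viterbi decoding is used here as a simple stand-in for whatever decoding the paper applies.

```python
import math

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi(counts, rates=(0.5, 5.0), stay=0.95, start=(0.9, 0.1)):
    """Most likely state path for per-window counts of private variants.

    State 0 = non-archaic (low density), state 1 = archaic (high density).
    All parameter values are hypothetical defaults for illustration.
    """
    states = range(2)
    log_trans = [[math.log(stay if i == j else 1.0 - stay) for j in states]
                 for i in states]
    score = [math.log(start[s]) + poisson_logpmf(counts[0], rates[s])
             for s in states]
    back = []
    for k in counts[1:]:
        pointers, new_score = [], []
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + poisson_logpmf(k, rates[s]))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With the default values, a run of windows containing several private variants is labelled as state 1 while isolated single variants are not, because each state switch carries a transition penalty.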


Subjects
Human Genome/genetics , Hominidae/genetics , Genetic Hybridization/genetics , Neanderthals/genetics , Animals , Asian People/genetics , Black People/genetics , Fossils , Humans , Native Hawaiian or Other Pacific Islander/genetics , Phylogeny , White People/genetics
8.
Bioinformatics ; 35(2): 189-199, 2019 01 15.
Article in English | MEDLINE | ID: mdl-29945188

ABSTRACT

Motivation: Understanding the mutational processes that act during cancer development is a key topic of cancer biology. Nevertheless, much remains to be learned, as a complex interplay of processes with dependencies on a range of genomic features creates highly heterogeneous cancer genomes. Accurate driver detection relies on unbiased models of the mutation rate that also capture rate variation from uncharacterized sources. Results: Here, we analyse patterns of observed-to-expected mutation counts across 505 whole cancer genomes, and find that genomic features missing from our mutation-rate model likely operate on a megabase length scale. We extend our site-specific model of the mutation rate to include the additional variance from these sources, which leads to robust significance evaluation of candidate cancer drivers. We thus present ncdDetect v.2, with greatly improved cancer driver detection specificity. Finally, we show that ranking candidates by their posterior mean value of their effect sizes offers an equivalent and more computationally efficient alternative to ranking by their P-values. Availability and implementation: ncdDetect v.2 is implemented as an R-package and is freely available at http://github.com/TobiasMadsen/ncdDetect2. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Genetic Models , Mutation Rate , Neoplasms/genetics , Computational Biology , Genomics , Humans , Software
9.
Theor Popul Biol ; 127: 16-32, 2019 06.
Article in English | MEDLINE | ID: mdl-30822431

ABSTRACT

Probability modelling for DNA sequence evolution is well established and provides a rich framework for understanding genetic variation between samples of individuals from one or more populations. We show that both classical and more recent models for coalescence (with or without recombination) can be described in terms of the so-called phase-type theory, where complicated and tedious calculations are circumvented by the use of matrix manipulations. The application of phase-type theory in population genetics consists of describing the biological system as a Markov model by appropriately setting up a state space and calculating the corresponding intensity and reward matrices. Formulae of interest are then expressed in terms of these aforementioned matrices. We illustrate this procedure by a number of examples: (a) Calculating the mean, (co)variance and even higher order moments of the site frequency spectrum in multiple merger coalescent models, (b) Analysing a sample of DNA sequences from the Atlantic Cod using the Beta-coalescent, and (c) Determining the correlation of the number of segregating sites for multiple samples in the two-locus ancestral recombination graph. We believe that phase-type theory has great potential as a tool for analysing probability models in population genetics. The compact matrix notation is useful for clarification of current models, and in particular their formal manipulation and calculations, but also for further development or extensions.


Subjects
Population Genetics , Genetic Models , Algorithms , Humans , Markov Chains , Population Density , Genetic Recombination
10.
Stat Appl Genet Mol Biol ; 17(3)2018 06 11.
Article in English | MEDLINE | ID: mdl-29886455

ABSTRACT

Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n - 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n - i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
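The definition of the SFS given above translates directly into code. This minimal sketch assumes the data arrive as n aligned haplotypes coded 0 for the ancestral base and 1 for the mutant base; the function names are hypothetical and unrelated to the CubSFS software. The folded version, used when the ancestral state is unknown, pools entries i and n - i.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i - 1 counts sites where the mutant base appears
    i times among the n haplotypes (0 = ancestral, 1 = mutant)."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for column in zip(*haplotypes):
        mutant_copies = sum(column)
        if 0 < mutant_copies < n:  # skip sites that are not segregating
            sfs[mutant_copies - 1] += 1
    return sfs

def fold(sfs):
    """Folded SFS: entry j - 1 counts sites whose minor allele appears j times."""
    n = len(sfs) + 1
    folded = [0] * (n // 2)
    for i, count in enumerate(sfs, start=1):
        folded[min(i, n - i) - 1] += count
    return folded
```

For example, four haplotypes over five sites give an unfolded SFS of length 3 and a folded SFS of length 2.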


Subjects
Gene Frequency , Genetic Models , Population Density , Asian People/genetics , Black People/genetics , Population Genetics , Human Genome , Human Genetics/methods , Human Genetics/statistics & numerical data , Human Genome Project , Humans , Software , White People/genetics
11.
Nature ; 499(7459): 471-5, 2013 Jul 25.
Article in English | MEDLINE | ID: mdl-23823723

ABSTRACT

Most great ape genetic variation remains uncharacterized; however, its study is critical for understanding population history, recombination, selection and susceptibility to disease. Here we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report 88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/western and central/eastern populations. We find extensive inbreeding in almost all wild populations, with eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at, or close to, genes in almost all species. We discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalogue of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.


Subjects
Genetic Variation , Hominidae/genetics , Africa , Animals , Wild Animals/genetics , Zoo Animals/genetics , Southeast Asia , Molecular Evolution , Gene Flow/genetics , Population Genetics , Genome/genetics , Gorilla gorilla/classification , Gorilla gorilla/genetics , Hominidae/classification , Humans , Inbreeding , Pan paniscus/classification , Pan paniscus/genetics , Pan troglodytes/classification , Pan troglodytes/genetics , Phylogeny , Single Nucleotide Polymorphism/genetics , Population Density
12.
J Math Biol ; 78(6): 1727-1769, 2019 05.
Article in English | MEDLINE | ID: mdl-30734077

ABSTRACT

In population genetics, the Dirichlet (also called the Balding-Nichols) model has for 20 years been considered the key model to approximate the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is approximate because positive correlations among alleles cannot be accommodated under the Dirichlet model. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper attempts to address this problem by providing a general overview of how allele fraction data under the most common multi-allelic mutational structures should be modeled. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright-Fisher process with mutation, and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is only an exceptionally good approximation for the pure drift, Jukes-Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required and proposed for the other mutation processes, such as a Beta-Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa-Kishino-Yano and Tamura-Nei processes. Finally, a novel Hierarchical Beta approximation is developed, a Pyramidal Hierarchical Beta model, for the generalized time-reversible and single-step mutation processes.


Subjects
Alleles , Data Analysis , Population Genetics/methods , Genetic Models , Computer Simulation , Datasets as Topic , Humans , Mutation Rate
13.
BMC Bioinformatics ; 19(1): 147, 2018 04 19.
Article in English | MEDLINE | ID: mdl-29673314

ABSTRACT

BACKGROUND: Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration. RESULTS: To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and site-specific models. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than regional based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures. CONCLUSION: We find that our site-specific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.


Subjects
Human Genome , Genetic Models , Mutation Rate , Mutation/genetics , Neoplasms/genetics , Genetic Databases , Epigenomics , Humans , Single Nucleotide Polymorphism/genetics , Regression Analysis
14.
Syst Biol ; 66(1): e30-e46, 2017 01 01.
Article in English | MEDLINE | ID: mdl-28173553

ABSTRACT

The Wright-Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright-Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright-Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright-Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.


Subjects
Gene Frequency/genetics , Genetic Models , Molecular Evolution , Genetic Drift , Mutation
15.
Nature ; 486(7404): 527-31, 2012 Jun 28.
Article in English | MEDLINE | ID: mdl-22722832

ABSTRACT

Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.


Subjects
Molecular Evolution , Genetic Variation/genetics , Human Genome/genetics , Genome/genetics , Pan paniscus/genetics , Pan troglodytes/genetics , Animals , DNA Transposable Elements/genetics , Gene Duplication/genetics , Genotype , Humans , Molecular Sequence Data , Phenotype , Phylogeny , Species Specificity
16.
Nature ; 483(7388): 169-75, 2012 Mar 07.
Article in English | MEDLINE | ID: mdl-22398555

ABSTRACT

Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.


Subjects
Molecular Evolution , Genetic Speciation , Genome/genetics , Gorilla gorilla/genetics , Animals , Female , Gene Expression Regulation , Genetic Variation/genetics , Genomics , Humans , Macaca mulatta/genetics , Molecular Sequence Data , Pan troglodytes/genetics , Phylogeny , Pongo/genetics , Proteins/genetics , Sequence Alignment , Species Specificity , Genetic Transcription
17.
Proc Natl Acad Sci U S A ; 112(20): 6413-8, 2015 May 19.
Article in English | MEDLINE | ID: mdl-25941379

ABSTRACT

The unique inheritance pattern of the X chromosome exposes it to natural selection in a way that is different from that of the autosomes, potentially resulting in accelerated evolution. We perform a comparative analysis of X chromosome polymorphism in 10 great ape species, including humans. In most species, we identify striking megabase-wide regions, where nucleotide diversity is less than 20% of the chromosomal average. Such regions are found exclusively on the X chromosome. The regions overlap partially among species, suggesting that the underlying targets are partly shared among species. The regions have higher proportions of singleton SNPs, higher levels of population differentiation, and a higher nonsynonymous-to-synonymous substitution ratio than the rest of the X chromosome. We show that the extent to which diversity is reduced is incompatible with direct selection or the action of background selection and soft selective sweeps alone, and therefore, we suggest that very strong selective sweeps have independently targeted these specific regions in several species. The only genomic feature that we can identify as strongly associated with loss of diversity is the location of testis-expressed ampliconic genes, which also have reduced diversity around them. We hypothesize that these genes may be responsible for selective sweeps in the form of meiotic drive caused by an intragenomic conflict in male meiosis.


Subjects
Genetic Variation , Hominidae/genetics , Genetic Polymorphism , Genetic Selection/genetics , X Chromosome/genetics , Animals , Computational Biology , Genetic Databases , Population Genetics , Genetic Models , Species Specificity
18.
BMC Bioinformatics ; 18(1): 199, 2017 Mar 31.
Article in English | MEDLINE | ID: mdl-28359297

ABSTRACT

BACKGROUND: Factor graphs provide a flexible and general framework for specifying probability distributions. They can capture a range of popular and recent models for analysis of both genomics data as well as data from other scientific fields. Owing to the ever larger data sets encountered in genomics and the multiple-testing issues accompanying them, accurate significance evaluation is of great importance. We here address the problem of evaluating statistical significance of observations from factor graph models. RESULTS: Two novel numerical approximations for evaluation of statistical significance are presented. The first uses importance sampling; the second is based on a saddlepoint approximation. We develop algorithms to efficiently compute the approximations and compare them to naive sampling and the normal approximation. The individual merits of the methods are analysed both from a theoretical viewpoint and with simulations. A guideline for choosing between the normal approximation, saddlepoint approximation and importance sampling is also provided. Finally, the applicability of the methods is demonstrated with examples from cancer genomics, motif-analysis and phylogenetics. CONCLUSIONS: The applicability of saddlepoint approximation and importance sampling is demonstrated on known models in the factor graph framework. Using the two methods we can substantially improve computational cost without compromising accuracy. This contribution allows analyses of large datasets in the general factor graph framework.


Subjects
Algorithms , Computational Biology/methods , Theoretical Models , Amino Acid Sequence , CCCTC-Binding Factor , Genomics , Humans , MCF-7 Cells , Neoplasms/diagnosis , Neoplasms/genetics , Phylogeny , Probability , Protein Interaction Domains and Motifs , Repressor Proteins , Sequence Alignment
19.
Theor Popul Biol ; 114: 88-94, 2017 04.
Article in English | MEDLINE | ID: mdl-28041892

ABSTRACT

Recently, Burden and Tang (2016) provided an analytical expression for the stationary distribution of the multivariate neutral Wright-Fisher model with low mutation rates. In this paper we present a simple, alternative derivation that illustrates the approximation. Our proof is based on the discrete multivariate boundary mutation model which has three key ingredients. First, the decoupled Moran model is used to describe genetic drift. Second, low mutation rates are assumed by limiting mutations to monomorphic states. Third, the mutation rate matrix is separated into a time-reversible part and a flux part, as suggested by Burden and Tang (2016). An application of our result to data from several great apes reveals that the assumption of stationarity may be inadequate or that other evolutionary forces like selection or biased gene conversion are acting. Furthermore we find that the model with a reversible mutation rate matrix provides a reasonably good fit to the data compared to the one with a non-reversible mutation rate matrix.


Subjects
Biological Evolution , Gene Frequency , Genetic Drift , Mutation Rate , Population Genetics , Genetic Models , Mutation , Genetic Selection
20.
Theor Popul Biol ; 108: 36-50, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26612605

ABSTRACT

We consider the diffusion approximation of the multivariate Wright-Fisher process with mutation. Analytically tractable formulas for the first- and second-order moments of the allele frequency distribution are derived, and the moments are subsequently used to better understand key population genetics parameters and modeling frameworks. In particular we investigate the behavior of the expected homozygosity (the probability that two randomly sampled genes are identical) in the transient and stationary phases, and how appropriate the Dirichlet distribution is for modeling the allele frequency distribution at different evolutionary time scales. We find that the Dirichlet distribution is adequate for the pure drift model (no mutations allowed), but the distribution is not sufficiently flexible for more general mutation models. We suggest a new hierarchical Beta distribution for the allele frequencies in the Wright-Fisher process with a mutation model on the nucleotide level that distinguishes between transitions and transversions.


Subjects
Population Genetics , Genetic Models , Mutation , Gene Frequency , Humans