Search | BVS Regional Portal

1.

Flexible model-based non-negative matrix factorization with application to mutational signatures.

Laursen, Ragnhild; Maretty, Lasse; Hobolth, Asger.

Stat Appl Genet Mol Biol ; 23(1), 2024 Jan 01.

Article in English | MEDLINE | ID: mdl-38753402

ABSTRACT

Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework in a large simulation study where we compare it with state-of-the-art methods, and show results for three data sets of somatic mutation counts from patients with cancer in the breast, liver and urinary tract.
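As a point of reference for the factorization this abstract builds on, here is a minimal sketch of standard NMF with multiplicative updates under the Poisson (generalized Kullback-Leibler) objective. It is not the authors' EM procedure with parametrized signatures. The function names, the toy count matrix (4 patients by 6 mutation types, all entries positive) and the rank are assumptions made for illustration.

```python
import math
import random


def matmul(W, H):
    """Plain-Python matrix product of W (n x k) and H (k x m)."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]


def kl_divergence(V, M):
    """Generalized Kullback-Leibler divergence D(V || M), the Poisson NMF objective."""
    total = 0.0
    for v_row, m_row in zip(V, M):
        for v, m in zip(v_row, m_row):
            total += (v * math.log(v / m) if v > 0 else 0.0) - v + m
    return total


def nmf_kl(V, rank, n_iter=100, seed=0):
    """Factorize V ~ W H with multiplicative updates (hypothetical helper).

    Assumes V has strictly positive entries, so no factor collapses to zero.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    history = []
    for _ in range(n_iter):
        # Update H (signatures) with W (exposures) held fixed.
        M = matmul(W, H)
        R = [[V[i][j] / M[i][j] for j in range(m)] for i in range(n)]
        for k in range(rank):
            col_sum = sum(W[i][k] for i in range(n))
            for j in range(m):
                H[k][j] *= sum(W[i][k] * R[i][j] for i in range(n)) / col_sum
        # Update W with the new H held fixed.
        M = matmul(W, H)
        R = [[V[i][j] / M[i][j] for j in range(m)] for i in range(n)]
        for i in range(n):
            for k in range(rank):
                row_sum = sum(H[k][j] for j in range(m))
                W[i][k] *= sum(H[k][j] * R[i][j] for j in range(m)) / row_sum
        history.append(kl_divergence(V, matmul(W, H)))
    return W, H, history


# Toy counts: rows are patients, columns are mutation types (made-up numbers).
V = [
    [31, 11, 9, 16, 7, 7],
    [13, 25, 19, 8, 13, 5],
    [22, 18, 14, 12, 10, 6],
    [41, 12, 10, 21, 8, 9],
]
W, H, history = nmf_kl(V, rank=2)
print("KL after first and last iteration:", history[0], history[-1])
```

In the signature setting, rows of `H` play the role of signatures and rows of `W` the role of patient exposures; the parametrized models in the abstract constrain the signatures further, which this sketch does not attempt.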

Subject(s)

Algorithms, Mutation, Neoplasms, Humans, Neoplasms/genetics, Genetic Models, Computer Simulation, Statistical Models

2.

Phase-type distributions in mathematical population genetics: An emerging framework.

Hobolth, Asger; Rivas-González, Iker; Bladt, Mogens; Futschik, Andreas.

Theor Popul Biol ; 157: 14-32, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38460602

ABSTRACT

A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. 
Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
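To make the "matrix manipulations" concrete, here is a minimal sketch of the standard phase-type mean formula, E = alpha (-S)^(-1) r, applied to the Kingman coalescent. It assumes time is scaled so that each pair of lineages coalesces at rate 1, and the function names are hypothetical. It is not taken from the authors' R code. With reward 1 in every state the formula gives the expected time to the most recent common ancestor; with reward k in the state with k lineages it gives the expected total branch length.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def kingman_subintensity(n):
    """Sub-intensity matrix S over the transient states n, n-1, ..., 2 lineages."""
    size = n - 1
    S = [[0.0] * size for _ in range(size)]
    for idx in range(size):
        k = n - idx                      # number of lineages in this state
        rate = k * (k - 1) / 2           # coalescence rate with k lineages
        S[idx][idx] = -rate
        if idx + 1 < size:
            S[idx][idx + 1] = rate       # move to the state with k - 1 lineages
    return S


def expected_reward(S, alpha, reward):
    """Mean accumulated reward before absorption: alpha (-S)^(-1) reward."""
    neg_S = [[-v for v in row] for row in S]
    x = solve(neg_S, reward)
    return sum(a * xi for a, xi in zip(alpha, x))


n = 4
S = kingman_subintensity(n)
alpha = [1.0] + [0.0] * (n - 2)          # start with all n lineages
mean_tmrca = expected_reward(S, alpha, [1.0] * (n - 1))
mean_total_length = expected_reward(S, alpha, [float(n - i) for i in range(n - 1)])
print(mean_tmrca, mean_total_length)     # close to 2(1 - 1/n) and 2 * sum_{i<n} 1/i
```

Changing the reward vector is the only step needed to switch between tree height and tree length, which is the sense in which reward transformations make such quantities easy to compute.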

Subject(s)

Population Genetics, Markov Chains, Genetic Models, Humans

3.

TRAILS: Tree reconstruction of ancestry using incomplete lineage sorting.

Rivas-González, Iker; Schierup, Mikkel H; Wakeley, John; Hobolth, Asger.

PLoS Genet ; 20(2): e1010836, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38330138

ABSTRACT

Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.

Subject(s)

Genetic Speciation, Hominidae, Animals, Humans, Hominidae/genetics, Pan troglodytes/genetics, Phylogeny, Population Genetics, Genetic Models

4.

Maximum likelihood estimation and natural pairwise estimating equations are identical for three sequences and a symmetric 2-state substitution model.

Hobolth, Asger; Wiuf, Carsten.

Theor Popul Biol ; 156: 1-4, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38184209

ABSTRACT

Consider the problem of estimating the branch lengths in a symmetric 2-state substitution model with a known topology and a general, clock-like or star-shaped tree with three leaves. We show that the maximum likelihood estimates are analytically tractable and can be obtained from pairwise sequence comparisons. Furthermore, we demonstrate that this property does not generalize to larger state spaces, more complex models or larger trees. Our arguments are based on an enumeration of the free parameters of the model and the dimension of the minimal sufficient data vector. Our interest in this problem arose from discussions with our former colleague Freddy Bugge Christiansen.
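One way to picture estimation from pairwise sequence comparisons is sketched below. It assumes a symmetric 2-state model parametrized so that two sequences separated by a path of length d (expected substitutions per site) differ at a site with probability (1 - exp(-2d))/2, and a three-leaf tree whose branches a, b, c meet at a single internal node. The function names are hypothetical, and the sketch illustrates the pairwise estimating equations only; it does not reproduce the authors' proof that they coincide with maximum likelihood.

```python
import math


def p_differ(d):
    """Probability that two sequences differ at a site, given path length d."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def pairwise_distance(p):
    """Invert p_differ: estimated path length from a proportion of differing sites."""
    if not 0.0 <= p < 0.5:
        raise ValueError("the proportion of differing sites must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * p)


def three_leaf_branch_lengths(p12, p13, p23):
    """Branch lengths (a, b, c) leading to leaves 1, 2, 3 from pairwise proportions.

    Uses d12 = a + b, d13 = a + c, d23 = b + c and solves for a, b, c.
    """
    d12 = pairwise_distance(p12)
    d13 = pairwise_distance(p13)
    d23 = pairwise_distance(p23)
    a = 0.5 * (d12 + d13 - d23)
    b = 0.5 * (d12 + d23 - d13)
    c = 0.5 * (d13 + d23 - d12)
    return a, b, c


# Round trip on made-up branch lengths: exact proportions recover the inputs.
a, b, c = 0.1, 0.2, 0.3
estimates = three_leaf_branch_lengths(p_differ(a + b), p_differ(a + c), p_differ(b + c))
print(estimates)
```

With observed data the three proportions are simply the fractions of sites at which each pair of sequences differs; negative estimates can occur for noisy data and would need separate handling.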

Subject(s)

Molecular Evolution, Genetic Models, Likelihood Functions, Phylogeny

5.

Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization.

Pelizzola, Marta; Laursen, Ragnhild; Hobolth, Asger.

BMC Bioinformatics ; 24(1): 187, 2023 May 08.

Article in English | MEDLINE | ID: mdl-37158829

ABSTRACT

BACKGROUND: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. RESULTS: We propose a Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison where we show that state-of-the-art methods are highly overestimating the number of signatures when overdispersion is present. We apply our proposed analysis on a wide range of simulated data and on two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice. CONCLUSIONS: With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. 
Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS .

Subject(s)

Algorithms, Breast, Male, Humans, Mutation, Binomial Distribution, Computer Simulation

6.

Multivariate phase-type theory for the site frequency spectrum.

Hobolth, Asger; Bladt, Mogens; Andersen, Lars Nørvang.

J Math Biol ; 83(6-7): 63, 2021 11 16.

Article in English | MEDLINE | ID: mdl-34783900

ABSTRACT

Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package PhaseTypeR, and R code for the reproduction of our results is available as an accompanying vignette.
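The estimators named in this abstract are all weighted sums of the SFS entries. The following sketch computes Watterson's estimator, the average number of pairwise differences, and the numerator of Tajima's D from an unfolded SFS. The function names and the example spectrum are assumptions for illustration; the variance normalization of Tajima's D and the phase-type distribution theory of the paper are not reproduced here.

```python
def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1                        # sample size; sfs has n - 1 entries
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def nucleotide_diversity(sfs):
    """Average number of pairwise differences (Tajima's estimator)."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2
    # A site with i derived copies differs in i * (n - i) of the sample pairs.
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs


def tajima_d_numerator(sfs):
    """Difference of the two estimators; Tajima's D divides this by a standard deviation."""
    return nucleotide_diversity(sfs) - watterson_theta(sfs)


# Made-up unfolded SFS for a sample of n = 5 sequences.
sfs = [4, 2, 1, 1]
theta_w = watterson_theta(sfs)
pi = nucleotide_diversity(sfs)
print(theta_w, pi, tajima_d_numerator(sfs))
```

Each function is linear in `sfs`, so its distribution follows from the joint distribution of the SFS entries, which is the object the multivariate phase-type theory characterizes.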

Subject(s)

Genetic Models, Mutation Rate, Population Genetics, Likelihood Functions, Mutation

7.

Studying models of balancing selection using phase-type theory.

Zeng, Kai; Charlesworth, Brian; Hobolth, Asger.

Genetics ; 218(2), 2021 06 24.

Article in English | MEDLINE | ID: mdl-33871627

ABSTRACT

Balancing selection (BLS) is the evolutionary force that maintains high levels of genetic variability in many important genes. To further our understanding of its evolutionary significance, we analyze models with BLS acting on a biallelic locus: an equilibrium model with long-term BLS, a model with long-term BLS and recent changes in population size, and a model of recent BLS. Using phase-type theory, a mathematical tool for analyzing continuous time Markov chains with an absorbing state, we examine how BLS affects polymorphism patterns in linked neutral regions, as summarized by nucleotide diversity, the expected number of segregating sites, the site frequency spectrum, and the level of linkage disequilibrium (LD). Long-term BLS affects polymorphism patterns in a relatively small genomic neighborhood, and such selection targets are easier to detect when the equilibrium frequencies of the selected variants are close to 50%, or when there has been a population size reduction. For a new mutation subject to BLS, its initial increase in frequency in the population causes linked neutral regions to have reduced diversity, an excess of both high and low frequency derived variants, and elevated LD with the selected locus. These patterns are similar to those produced by selective sweeps, but the effects of recent BLS are weaker. Nonetheless, compared to selective sweeps, nonequilibrium polymorphism and LD patterns persist for a much longer period under recent BLS, which may increase the chance of detecting such selection targets. An R package for analyzing these models, among others (e.g., isolation with migration), is available.

Subject(s)

Population Genetics, Genetic Models, Genetic Selection, Animals, Molecular Evolution, Humans, Linkage Disequilibrium, Markov Chains, Mutation, Genetic Polymorphism

8.

Ancestral Population Genomics.

Dutheil, Julien Y; Hobolth, Asger.

Methods Mol Biol ; 1910: 555-589, 2019.

Article in English | MEDLINE | ID: mdl-31278677

ABSTRACT

Borrowing from both population genetics and phylogenetics, the field of population genomics emerged as full genomes of several closely related species became available. Provided we can properly model sequence evolution within populations undergoing speciation events, this resource enables us to estimate key population genetics parameters such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective forces. With the advent of resequencing technologies, genome-wide patterns of diversity in extant populations have now come to complement this picture, offering increasing power to study more recent genetic history. We discuss the basic models of genomes in populations, including speciation models for closely related species. A major point in our discussion is that only a few complete genomes contain much information about the whole population. The reason is that recombination unlinks genomic regions, and therefore a few genomes contain many segments with distinct histories. The challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of demography and selection. We survey modeling strategies for understanding genetic variation in ancestral populations and species. The underlying models build on the coalescent with recombination process and introduce further assumptions to scale the analyses to genomic data sets.

Subject(s)

Molecular Evolution, Population Genetics, Genome, Genomics, Animals, Gene Flow, Genetic Variation, Genomics/methods, Humans, Markov Chains, Genetic Models, Population Dynamics, Genetic Recombination, Genetic Selection

9.

Phase-type distributions in population genetics.

Hobolth, Asger; Siri-Jégousse, Arno; Bladt, Mogens.

Theor Popul Biol ; 127: 16-32, 2019 06.

Article in English | MEDLINE | ID: mdl-30822431

ABSTRACT

Probability modelling for DNA sequence evolution is well established and provides a rich framework for understanding genetic variation between samples of individuals from one or more populations. We show that both classical and more recent models for coalescence (with or without recombination) can be described in terms of the so-called phase-type theory, where complicated and tedious calculations are circumvented by the use of matrix manipulations. The application of phase-type theory in population genetics consists of describing the biological system as a Markov model by appropriately setting up a state space and calculating the corresponding intensity and reward matrices. Formulae of interest are then expressed in terms of these aforementioned matrices. We illustrate this procedure by a number of examples: (a) Calculating the mean, (co)variance and even higher order moments of the site frequency spectrum in multiple merger coalescent models, (b) Analysing a sample of DNA sequences from the Atlantic Cod using the Beta-coalescent, and (c) Determining the correlation of the number of segregating sites for multiple samples in the two-locus ancestral recombination graph. We believe that phase-type theory has great potential as a tool for analysing probability models in population genetics. The compact matrix notation is useful for clarification of current models, and in particular their formal manipulation and calculations, but also for further development or extensions.

Subject(s)

Population Genetics, Genetic Models, Algorithms, Humans, Markov Chains, Population Density, Genetic Recombination

10.

A general framework for moment-based analysis of genetic data.

Speed, Maria Simonsen; Balding, David Joseph; Hobolth, Asger.

J Math Biol ; 78(6): 1727-1769, 2019 05.

Article in English | MEDLINE | ID: mdl-30734077

ABSTRACT

In population genetics, the Dirichlet (also called the Balding-Nichols) model has for 20 years been considered the key model to approximate the distribution of allele fractions within populations in a multi-allelic setting. It has often been noted that the Dirichlet assumption is approximate because positive correlations among alleles cannot be accommodated under the Dirichlet model. However, the validity of the Dirichlet distribution has never been systematically investigated in a general framework. This paper attempts to address this problem by providing a general overview of how allele fraction data under the most common multi-allelic mutational structures should be modeled. The Dirichlet and alternative models are investigated by simulating allele fractions from a diffusion approximation of the multi-allelic Wright-Fisher process with mutation, and applying a moment-based analysis method. The study shows that the optimal modeling strategy for the distribution of allele fractions depends on the specific mutation process. The Dirichlet model is only an exceptionally good approximation for the pure drift, Jukes-Cantor and parent-independent mutation processes with small mutation rates. Alternative models are required and proposed for the other mutation processes, such as a Beta-Dirichlet model for the infinite alleles mutation process, and a Hierarchical Beta model for the Kimura, Hasegawa-Kishino-Yano and Tamura-Nei processes. Finally, a novel Hierarchical Beta approximation is developed, a Pyramidal Hierarchical Beta model, for the generalized time-reversible and single-step mutation processes.
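The remark that positive correlations cannot be accommodated follows directly from the second moments of the Dirichlet distribution. The block below states these standard moments; the symbols are assumptions introduced here, with allele fractions X = (X_1, ..., X_K), parameters alpha_i, and, for the Balding-Nichols form, mean frequencies pi_i and a differentiation parameter F.

```latex
% Moments of X ~ Dirichlet(\alpha_1, \dots, \alpha_K), with \alpha_0 = \sum_{k=1}^{K} \alpha_k
\begin{align*}
  \mathbb{E}[X_i] &= \frac{\alpha_i}{\alpha_0}, \\
  \operatorname{Var}(X_i) &= \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^{2}(\alpha_0 + 1)}, \\
  \operatorname{Cov}(X_i, X_j) &= -\frac{\alpha_i \alpha_j}{\alpha_0^{2}(\alpha_0 + 1)} < 0
  \qquad (i \neq j).
\end{align*}
% Balding-Nichols parametrization: \alpha_i = \pi_i (1 - F)/F, so \alpha_0 + 1 = 1/F
\begin{align*}
  \operatorname{Var}(X_i) = F\,\pi_i(1 - \pi_i), \qquad
  \operatorname{Cov}(X_i, X_j) = -F\,\pi_i \pi_j .
\end{align*}
```

Every pair of allele fractions is therefore negatively correlated under the Dirichlet model, whatever the parameter values, which is why mutation processes that induce positive correlations call for the alternative models discussed in the abstract.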

Subject(s)

Alleles, Data Analysis, Population Genetics/methods, Genetic Models, Computer Simulation, Datasets as Topic, Humans, Mutation Rate

11.

ncdDetect2: improved models of the site-specific mutation rate in cancer and driver detection with robust significance evaluation.

Juul, Malene; Madsen, Tobias; Guo, Qianyun; Bertl, Johanna; Hobolth, Asger; Kellis, Manolis; Pedersen, Jakob Skou.

Bioinformatics ; 35(2): 189-199, 2019 01 15.

Article in English | MEDLINE | ID: mdl-29945188

ABSTRACT

Motivation: Understanding the mutational processes that act during cancer development is a key topic of cancer biology. Nevertheless, much remains to be learned, as a complex interplay of processes with dependencies on a range of genomic features creates highly heterogeneous cancer genomes. Accurate driver detection relies on unbiased models of the mutation rate that also capture rate variation from uncharacterized sources. Results: Here, we analyse patterns of observed-to-expected mutation counts across 505 whole cancer genomes, and find that genomic features missing from our mutation-rate model likely operate on a megabase length scale. We extend our site-specific model of the mutation rate to include the additional variance from these sources, which leads to robust significance evaluation of candidate cancer drivers. We thus present ncdDetect v.2, with greatly improved cancer driver detection specificity. Finally, we show that ranking candidates by their posterior mean value of their effect sizes offers an equivalent and more computationally efficient alternative to ranking by their P-values. Availability and implementation: ncdDetect v.2 is implemented as an R-package and is freely available at http://github.com/TobiasMadsen/ncdDetect2. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Genetic Models, Mutation Rate, Neoplasms/genetics, Computational Biology, Genomics, Humans, Software

12.

Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments.

Nielsen, Morten Muhlig; Tataru, Paula; Madsen, Tobias; Hobolth, Asger; Pedersen, Jakob Skou.

Algorithms Mol Biol ; 13: 17, 2018.

Article in English | MEDLINE | ID: mdl-30555524

ABSTRACT

BACKGROUND: Motif analysis methods have long been central for studying the biological function of nucleotide sequences. Functional genomics experiments extend their potential. They typically generate sequence lists ranked by an experimentally acquired functional property such as gene expression or protein binding affinity. Current motif discovery tools suffer from limitations in searching large motif spaces, and thus more complex motifs may not be included. There is thus a need for motif analysis methods that are tailored for analyzing specific complex motifs motivated by biological questions and hypotheses rather than acting as a screen-based motif finding tool. METHODS: We present Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in ranked lists of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact p-values for motif observations in sequences. Biases in motif distributions across ranked sequence lists are evaluated using random walks, Brownian bridges, or modified rank-based statistics. A modular setup and fast analytic p-value evaluations make Regmex applicable to diverse and potentially large-scale motif analysis problems. RESULTS: We demonstrate use cases of combined motifs on simulated data and on expression data from microRNA transfection experiments. We confirm previously obtained results and demonstrate the usability of Regmex to test a specific hypothesis about the relative location of microRNA seed sites and U-rich motifs. We further compare the tool with an existing motif discovery tool and show increased sensitivity. CONCLUSIONS: Regmex is a useful and flexible tool to analyze motif hypotheses that relate to large data sets in functional genomics. The method is available as an R package (https://github.com/muhligs/regmex).

13.

Detecting archaic introgression using an unadmixed outgroup.

Skov, Laurits; Hui, Ruoyun; Shchur, Vladimir; Hobolth, Asger; Scally, Aylwyn; Schierup, Mikkel Heide; Durbin, Richard.

PLoS Genet ; 14(9): e1007641, 2018 09.

Article in English | MEDLINE | ID: mdl-30226838

ABSTRACT

Human populations outside of Africa have experienced at least two bouts of introgression from archaic humans, from Neanderthals and Denisovans. In Papuans there is prior evidence of both these introgressions. Here we present a new approach to detect segments of individual genomes of archaic origin without using an archaic reference genome. The approach is based on a hidden Markov model that identifies genomic regions with a high density of single nucleotide variants (SNVs) not seen in unadmixed populations. We show using simulations that this provides a powerful approach to identifying segments of archaic introgression with a low rate of false detection, given data from a suitable outgroup population is available, without the archaic introgression but containing a majority of the variation that arose since initial separation from the archaic lineage. Furthermore our approach is able to infer admixture proportions and the times both of admixture and of initial divergence between the human and archaic populations. We apply the model to detect archaic introgression in 89 Papuans and show how the identified segments can be assigned to likely Neanderthal or Denisovan origin. We report more Denisovan admixture than previous studies and find a shift in size distribution of fragments of Neanderthal and Denisovan origin that is compatible with a difference in admixture time. Furthermore, we identify small amounts of Denisova ancestry in South East Asians and South Asians.
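The decoding step of such a hidden Markov model can be illustrated with a small forward-backward sketch. This is a deliberately simplified stand-in, not the authors' model: it assumes two hidden states (0 for "human", 1 for "archaic"), a binary observation per genomic window (1 if the window carries a variant absent from the outgroup), and made-up transition and emission probabilities. All names and numbers are hypothetical.

```python
def posterior_decoding(obs, init, trans, emit):
    """Posterior state probabilities per position via scaled forward-backward."""
    n_states, T = len(init), len(obs)
    fwd, scales = [], []
    # Forward pass, rescaled at every position to avoid underflow.
    cur = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for t in range(T):
        if t > 0:
            cur = [
                emit[s][obs[t]] * sum(fwd[t - 1][r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        c = sum(cur)
        fwd.append([p / c for p in cur])
        scales.append(c)
    # Backward pass, using the same scaling constants.
    bwd = [[1.0] * n_states for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for r in range(n_states):
            bwd[t][r] = sum(
                trans[r][s] * emit[s][obs[t + 1]] * bwd[t + 1][s] for s in range(n_states)
            ) / scales[t + 1]
    # Combine and normalize to get P(state at t | all observations).
    post = []
    for t in range(T):
        g = [fwd[t][s] * bwd[t][s] for s in range(n_states)]
        z = sum(g)
        post.append([x / z for x in g])
    return post


obs = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]    # made-up windows: 1 = private variant seen
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]        # states tend to persist along the genome
emit = [[0.9, 0.1], [0.3, 0.7]]         # the archaic state emits 1 more often
posteriors = posterior_decoding(obs, init, trans, emit)
for t, p in enumerate(posteriors):
    print(t, round(p[1], 3))
```

Windows inside the run of private variants receive a higher posterior probability for the archaic state than the flanking windows, which is the signal a segment caller would threshold.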

Subject(s)

Human Genome/genetics, Hominidae/genetics, Genetic Hybridization/genetics, Neanderthals/genetics, Animals, Asian People/genetics, Black People/genetics, Fossils, Humans, Native Hawaiian or Other Pacific Islander/genetics, Phylogeny, White People/genetics

14.

Non-parametric estimation of population size changes from the site frequency spectrum.

Waltoft, Berit Lindum; Hobolth, Asger.

Stat Appl Genet Mol Biol ; 17(3), 2018 06 11.

Article in English | MEDLINE | ID: mdl-29886455

ABSTRACT

Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n - 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n - i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
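The definition of the SFS given in this abstract translates directly into a few lines of code. The sketch below builds the unfolded and folded spectra from a small haplotype matrix in which 1 marks the mutant (derived) base and 0 the ancestral base. The function names and the toy data are assumptions for illustration; this is not the CubSFS implementation.

```python
def unfolded_sfs(haplotypes):
    """Entry i - 1 counts the sites where the derived base appears i times."""
    n = len(haplotypes)                      # sample size (rows are sequences)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):            # iterate over columns (sites)
        i = sum(site)
        if 0 < i < n:                        # skip sites that are not segregating
            sfs[i - 1] += 1
    return sfs


def folded_sfs(sfs):
    """Entry j - 1 counts the sites whose minor allele appears j times."""
    n = len(sfs) + 1
    folded = [0] * (n // 2)
    for i, count in enumerate(sfs, start=1):
        folded[min(i, n - i) - 1] += count
    return folded


# Made-up sample of n = 4 sequences at 5 sites.
haplotypes = [
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
]
sfs = unfolded_sfs(haplotypes)
print(sfs, folded_sfs(sfs))
```

The folded spectrum is the one to use when the ancestral state is unknown, since it depends only on minor allele counts.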

Subject(s)

Gene Frequency, Genetic Models, Population Density, Asian People/genetics, Black People/genetics, Population Genetics, Human Genome, Human Genetics/methods, Human Genetics/statistics & numerical data, Human Genome Project, Humans, Software, White People/genetics

15.

A site specific model and analysis of the neutral somatic mutation rate in whole-genome cancer data.

Bertl, Johanna; Guo, Qianyun; Juul, Malene; Besenbacher, Søren; Nielsen, Morten Muhlig; Hornshøj, Henrik; Pedersen, Jakob Skou; Hobolth, Asger.

BMC Bioinformatics ; 19(1): 147, 2018 04 19.

Article in English | MEDLINE | ID: mdl-29673314

ABSTRACT

BACKGROUND: Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration. RESULTS: To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and site-specific models. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than regional based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures. CONCLUSION: We find that our site-specific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. 
Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.

Subject(s)

Human Genome, Genetic Models, Mutation Rate, Mutation/genetics, Neoplasms/genetics, Genetic Databases, Epigenomics, Humans, Single Nucleotide Polymorphism/genetics, Regression Analysis

16.

Pan-cancer screen for mutations in non-coding elements with conservation and cancer specificity reveals correlations with expression and survival.

Hornshøj, Henrik; Nielsen, Morten Muhlig; Sinnott-Armstrong, Nicholas A; Switnicki, Michal P; Juul, Malene; Madsen, Tobias; Sallari, Richard; Kellis, Manolis; Ørntoft, Torben; Hobolth, Asger; Pedersen, Jakob Skou.

NPJ Genom Med ; 3: 1, 2018.

Article in English | MEDLINE | ID: mdl-29354286

ABSTRACT

Cancer develops by accumulation of somatic driver mutations, which impact cellular function. Mutations in non-coding regulatory regions can now be studied genome-wide and further characterized by correlation with gene expression and clinical outcome to identify driver candidates. Using a new two-stage procedure, called ncDriver, we first screened 507 ICGC whole-genomes from 10 cancer types for non-coding elements, in which mutations are both recurrent and have elevated conservation or cancer specificity. This identified 160 significant non-coding elements, including the TERT promoter, a well-known non-coding driver element, as well as elements associated with known cancer genes and regulatory genes (e.g., PAX5, TOX3, PCF11, MAPRE3). However, in some significant elements, mutations appear to stem from localized mutational processes rather than recurrent positive selection. To further characterize the driver potential of the identified elements and shortlist candidates, we identified elements where presence of mutations correlated significantly with expression levels (e.g., TERT and CDH10) and survival (e.g., CDH9 and CDH10) in an independent set of 505 TCGA whole-genome samples. In a larger pan-cancer set of 4128 TCGA exomes with expression profiling, we identified mutational correlation with expression for additional elements (e.g., near GATA3, CDC6, ZNF217, and CTCF transcription factor binding sites). Survival analysis further pointed to MIR122, a known marker of poor prognosis in liver cancer. In conclusion, the screen for significant mutation patterns coupled with correlative mutational analysis identified new individual driver candidates and suggests that some non-coding mutations recurrently affect expression and play a role in cancer development.

17.

Non-coding cancer driver candidates identified with a sample- and position-specific model of the somatic mutation rate.

Juul, Malene; Bertl, Johanna; Guo, Qianyun; Nielsen, Morten Muhlig; Switnicki, Michal; Hornshøj, Henrik; Madsen, Tobias; Hobolth, Asger; Pedersen, Jakob Skou.

Elife ; 6, 2017 03 31.

Article in English | MEDLINE | ID: mdl-28362259

ABSTRACT

Non-coding mutations may drive cancer development. Statistical detection of non-coding driver regions is challenged by a varying mutation rate and uncertainty of functional impact. Here, we develop a statistically founded non-coding driver-detection method, ncdDetect, which includes sample-specific mutational signatures, long-range mutation rate variation, and position-specific impact measures. Using ncdDetect, we screened non-coding regulatory regions of protein-coding genes across a pan-cancer set of whole-genomes (n = 505); the screen top-ranked known drivers and identified new candidates. For individual candidates, presence of non-coding mutations associates with altered expression or decreased patient survival across an independent pan-cancer sample set (n = 5454). This includes an antigen-presenting gene (CD1A), where 5'UTR mutations correlate significantly with decreased survival in melanoma. Additionally, mutations in a base-excision-repair gene (SMUG1) correlate with a C-to-T mutational signature. Overall, we find that a rich model of mutational heterogeneity facilitates non-coding driver identification, and integrative analysis points to candidates of potential clinical relevance.

Subject(s)

Carcinogenesis , Mutation Rate , Mutation , Neoplasms/pathology , Neoplasms/physiopathology , Biostatistics/methods , Gene Expression Profiling , Humans , Survival Analysis

18.

Significance evaluation in factor graphs.

Madsen, Tobias; Hobolth, Asger; Jensen, Jens Ledet; Pedersen, Jakob Skou.

BMC Bioinformatics ; 18(1): 199, 2017 Mar 31.

Article in English | MEDLINE | ID: mdl-28359297

ABSTRACT

BACKGROUND: Factor graphs provide a flexible and general framework for specifying probability distributions. They can capture a range of popular and recent models for analysis of both genomics data and data from other scientific fields. Owing to the ever larger data sets encountered in genomics and the multiple-testing issues accompanying them, accurate significance evaluation is of great importance. We here address the problem of evaluating statistical significance of observations from factor graph models. RESULTS: Two novel numerical approximations for evaluation of statistical significance are presented: the first uses importance sampling, and the second is based on a saddlepoint approximation. We develop algorithms to efficiently compute the approximations and compare them to naive sampling and the normal approximation. The individual merits of the methods are analysed both from a theoretical viewpoint and with simulations. A guideline for choosing between the normal approximation, saddlepoint approximation, and importance sampling is also provided. Finally, the applicability of the methods is demonstrated with examples from cancer genomics, motif analysis, and phylogenetics. CONCLUSIONS: The applicability of saddlepoint approximation and importance sampling is demonstrated on known models in the factor graph framework. Using the two methods we can substantially reduce computational cost without compromising accuracy. This contribution allows analyses of large datasets in the general factor graph framework.

Subject(s)

Algorithms , Computational Biology/methods , Theoretical Models , Amino Acid Sequence , CCCTC-Binding Factor , Genomics , Humans , MCF-7 Cells , Neoplasms/diagnosis , Neoplasms/genetics , Phylogeny , Probability , Protein Interaction Domains and Motifs , Repressor Proteins , Sequence Alignment

19.

Statistical Inference in the Wright-Fisher Model Using Allele Frequency Data.

Tataru, Paula; Simonsen, Maria; Bataillon, Thomas; Hobolth, Asger.

Syst Biol ; 66(1): e30-e46, 2017 01 01.

Article in English | MEDLINE | ID: mdl-28173553

ABSTRACT

The Wright-Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright-Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright-Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright-Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.

Subject(s)

Gene Frequency/genetics , Genetic Models , Molecular Evolution , Genetic Drift , Mutation

20.

An alternative derivation of the stationary distribution of the multivariate neutral Wright-Fisher model for low mutation rates with a view to mutation rate estimation from site frequency data.

Schrempf, Dominik; Hobolth, Asger.

Theor Popul Biol ; 114: 88-94, 2017 04.

Article in English | MEDLINE | ID: mdl-28041892

ABSTRACT

Recently, Burden and Tang (2016) provided an analytical expression for the stationary distribution of the multivariate neutral Wright-Fisher model with low mutation rates. In this paper we present a simple, alternative derivation that illustrates the approximation. Our proof is based on the discrete multivariate boundary mutation model, which has three key ingredients. First, the decoupled Moran model is used to describe genetic drift. Second, low mutation rates are assumed by limiting mutations to monomorphic states. Third, the mutation rate matrix is separated into a time-reversible part and a flux part, as suggested by Burden and Tang (2016). An application of our result to data from several great apes reveals that the assumption of stationarity may be inadequate or that other evolutionary forces like selection or biased gene conversion are acting. Furthermore, we find that the model with a reversible mutation rate matrix provides a reasonably good fit to the data compared to the one with a non-reversible mutation rate matrix.

Subject(s)

Biological Evolution , Gene Frequency , Genetic Drift , Mutation Rate , Population Genetics , Genetic Models , Mutation , Genetic Selection
