ABSTRACT
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, $\ell_{0}$ segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher-powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/.
ABSTRACT
Calcium imaging data promises to transform the field of neuroscience by making it possible to record from large populations of neurons simultaneously. However, determining the exact moment in time at which a neuron spikes, from a calcium imaging data set, amounts to a non-trivial deconvolution problem which is of critical importance for downstream analyses. While a number of formulations have been proposed for this task in the recent literature, in this article, we focus on a formulation recently proposed in Jewell and Witten (2018. Exact spike train inference via $\ell_{0}$ optimization. The Annals of Applied Statistics 12(4), 2457-2482) that can accurately estimate not just the spike rate, but also the specific times at which the neuron spikes. We develop a much faster algorithm that can be used to deconvolve a fluorescence trace of 100 000 timesteps in less than a second. Furthermore, we present a modification to this algorithm that precludes the possibility of a "negative spike". We demonstrate the performance of this algorithm for spike deconvolution on calcium imaging datasets that were recently released as part of the $\texttt{spikefinder}$ challenge (http://spikefinder.codeneuro.org/). The algorithm presented in this article was used in the Allen Institute for Brain Science's "platform paper" to decode neural activity from the Allen Brain Observatory; this is the main scientific paper in which their data resource is presented. Our $\texttt{C++}$ implementation, along with $\texttt{R}$ and $\texttt{python}$ wrappers, is publicly available. $\texttt{R}$ code is available on $\texttt{CRAN}$ and $\texttt{Github}$, and $\texttt{python}$ wrappers are available on $\texttt{Github}$; see https://github.com/jewellsean/FastLZeroSpikeInference.
Subjects
Calcium; Neurons; Algorithms; Brain/diagnostic imaging; Diagnostic Imaging; Humans
ABSTRACT
Recently, common variants on human chromosome 8q24 were found to be associated with prostate cancer risk. While conducting a genome-wide association study in the Cancer Genetic Markers of Susceptibility project with 550,000 SNPs in a nested case-control study (1,172 cases and 1,157 controls of European origin), we identified a new association at 8q24 with an independent effect on prostate cancer susceptibility. The most significant signal is 70 kb centromeric to the previously reported SNP, rs1447295, but shows little evidence of linkage disequilibrium with it. A combined analysis with four additional studies (total: 4,296 cases and 4,299 controls) confirms association with prostate cancer for rs6983267 in the centromeric locus (P = $9.42 \times 10^{-13}$; heterozygote odds ratio (OR): 1.26, 95% confidence interval (c.i.): 1.13-1.41; homozygote OR: 1.58, 95% c.i.: 1.40-1.78). Each SNP remained significant in a joint analysis after adjusting for the other (rs1447295 P = $1.41 \times 10^{-11}$; rs6983267 P = $6.62 \times 10^{-10}$). These observations, combined with compelling evidence for a recombination hotspot between the two markers, indicate the presence of at least two independent loci within 8q24 that contribute to prostate cancer in men of European ancestry. We estimate that the population attributable risk of the new locus, marked by rs6983267, is higher than that of the locus marked by rs1447295 (21% versus 9%).
Subjects
Chromosomes, Human, Pair 8/genetics; Genetic Predisposition to Disease/genetics; Genetic Variation; Prostatic Neoplasms/genetics; Black or African American; Base Sequence; Ethnicity/genetics; Gene Frequency; Genomics/methods; Genotype; Haplotypes/genetics; Humans; Male; Molecular Sequence Data; Odds Ratio; Polymorphism, Single Nucleotide; Risk Factors; United States; White People
ABSTRACT
A central statistical goal is to choose between alternative explanatory models of data. In many modern applications, such as population genetics, it is not possible to apply standard methods based on evaluating the likelihood functions of the models, as these are numerically intractable. Approximate Bayesian computation (ABC) is a commonly used alternative for such situations. ABC simulates data $x$ for many parameter values under each model and compares each simulated data set to the observed data $x_{\mathrm{obs}}$. More weight is placed on models under which $S(x)$ is close to $S(x_{\mathrm{obs}})$, where $S$ maps data to a vector of summary statistics. Previous work has shown that the choice of $S$ is crucial to the efficiency and accuracy of ABC. This paper provides a method to select good summary statistics for model choice. It uses a preliminary step, simulating many $x$ values from all models and fitting regressions to these with the model as response. The resulting model weight estimators are used as $S$ in an ABC analysis. Theoretical results are given to justify this as approximating low dimensional sufficient statistics. A substantive application is presented: choosing between competing coalescent models of demographic growth for Campylobacter jejuni in New Zealand using multi-locus sequence typing data.
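The weighting step described above can be sketched as a plain rejection-ABC routine for model choice. This is only an illustration under stated assumptions: it takes the summary statistic as a user-supplied function rather than building it from the regression step the abstract proposes, it assumes a uniform prior over models, and the toy Gaussian simulators, function names, and tuning values (`n_sims`, `keep_frac`) are all hypothetical.

```python
import numpy as np


def abc_model_choice(x_obs, simulators, summary, n_sims=2000, keep_frac=0.05, seed=0):
    """Estimate posterior model weights by rejection ABC (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    s_obs = np.atleast_1d(summary(x_obs))
    n_models = len(simulators)
    # Assumption: uniform prior over the candidate models.
    models = rng.integers(n_models, size=n_sims)
    dists = np.empty(n_sims)
    for i, m in enumerate(models):
        # Each simulator draws its own parameters from its prior, then data.
        x = simulators[m](rng, len(x_obs))
        dists[i] = np.linalg.norm(np.atleast_1d(summary(x)) - s_obs)
    # Keep the simulations whose summaries are closest to the observed summary.
    n_keep = max(1, int(keep_frac * n_sims))
    kept = models[np.argsort(dists)[:n_keep]]
    # The share of kept simulations per model approximates its posterior weight.
    return np.bincount(kept, minlength=n_models) / n_keep


def sim_low(rng, n):
    # Hypothetical model 0: Gaussian data with mean drawn from Uniform(-1, 1).
    return rng.normal(rng.uniform(-1.0, 1.0), 1.0, size=n)


def sim_high(rng, n):
    # Hypothetical model 1: Gaussian data with mean drawn from Uniform(4, 6).
    return rng.normal(rng.uniform(4.0, 6.0), 1.0, size=n)
```

Calling `abc_model_choice(x_obs, [sim_low, sim_high], np.mean)` returns one weight per model. With a poorly chosen `summary` the weights can be badly misleading, which is the problem the abstract's method is meant to address.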
Subjects
Computer Simulation; Models, Genetic; Algorithms; Bayes Theorem; Campylobacter jejuni/genetics; Genes, Bacterial; Likelihood Functions; Multilocus Sequence Typing
ABSTRACT
Widely used models in genetics include the Wright-Fisher diffusion and its moment dual, Kingman's coalescent. Each has a multilocus extension but under neither extension is the sampling distribution available in closed-form, and their computation is extremely difficult. In this paper we derive two new multilocus population genetic models, one a diffusion and the other a coalescent process, which are much simpler than the standard models, but which capture their key properties for large recombination rates. The diffusion model is based on a central limit theorem for density dependent population processes, and we show that the sampling distribution is a linear combination of moments of Gaussian distributions and hence available in closed-form. The coalescent process is based on a probabilistic coupling of the ancestral recombination graph to a simpler genealogical process which exposes the leading dynamics of the former. We further demonstrate that when we consider the sampling distribution as an asymptotic expansion in inverse powers of the recombination parameter, the sampling distributions of the new models agree with the standard ones up to the first two orders.
ABSTRACT
We consider inference for the reaction rates in discretely observed networks such as those found in models for systems biology, population ecology, and epidemics. Most such networks are neither slow enough nor small enough for inference via the true state-dependent Markov jump process to be feasible. Typically, inference is conducted by approximating the dynamics through an ordinary differential equation (ODE) or a stochastic differential equation (SDE). The former ignores the stochasticity in the true model and can lead to inaccurate inferences. The latter is more accurate but is harder to implement as the transition density of the SDE model is generally unknown. The linear noise approximation (LNA) arises from a first-order Taylor expansion of the approximating SDE about a deterministic solution and can be viewed as a compromise between the ODE and SDE models. It is a stochastic model, but discrete time transition probabilities for the LNA are available through the solution of a series of ordinary differential equations. We describe how a restarting LNA can be efficiently used to perform inference for a general class of reaction networks; evaluate the accuracy of such an approach; and show how and when this approach is either statistically or computationally more efficient than ODE or SDE methods. We apply the LNA to analyze Google Flu Trends data from the North and South Islands of New Zealand, and are able to obtain more accurate short-term forecasts of new flu cases than another recently proposed method, although at a greater computational cost.
Subjects
Biometry/methods; Models, Statistical; Computer Simulation; Ecology/statistics & numerical data; Epidemics/statistics & numerical data; Epidemiologic Methods; Gene Regulatory Networks; Humans; Influenza, Human/epidemiology; Linear Models; Stochastic Processes; Systems Biology/statistics & numerical data
ABSTRACT
Single locus variants (SLVs) are bacterial sequence types that differ at only one of the seven canonical multilocus sequence typing (MLST) loci. Estimating the relative roles of recombination and point mutation in the generation of new alleles that lead to SLVs is helpful in understanding how organisms evolve. The relative rates of recombination and mutation for Campylobacter jejuni and Campylobacter coli were estimated at seven different housekeeping loci from publicly available MLST data. The probability of recombination generating a new allele that leads to an SLV is estimated to be roughly seven times that of mutation for C. jejuni, but for C. coli recombination and mutation were estimated to have a similar contribution to the generation of SLVs. The majority of nucleotide differences (98% for C. jejuni and 85% for C. coli) between strains that make up an SLV are attributable to recombination. These estimates are much larger than estimates of the relative rate of recombination to mutation calculated from more distantly related isolates using MLST data. One explanation for this is that purifying selection plays an important role in the evolution of Campylobacter. A simulation study was performed to test the performance of our method under a range of biologically realistic parameters. We found that our method performed well when the recombination tract length was longer than 3 kb. For situations in which recombination may occur with shorter tract lengths, our estimates are likely to underestimate the ratio of recombination to mutation, and the importance of recombination for creating diversity in closely related isolates. A parametric bootstrap method was applied to calculate the uncertainty of these estimates.
Subjects
Campylobacter coli/genetics; Campylobacter jejuni/genetics; Genetic Loci/genetics; Genetic Variation; Point Mutation/genetics; Recombination, Genetic; Alleles; Campylobacter coli/classification; Campylobacter jejuni/classification; Databases, Genetic; Multilocus Sequence Typing; Nucleotides/genetics
ABSTRACT
Campylobacter jejuni is the leading cause of bacterial gastro-enteritis in the developed world. It is thought to infect 2-3 million people a year in the US alone, at a cost to the economy in excess of US $4 billion. C. jejuni is a widespread zoonotic pathogen that is carried by animals farmed for meat and poultry. A connection with contaminated food is recognized, but C. jejuni is also commonly found in wild animals and water sources. Phylogenetic studies have suggested that genotypes pathogenic to humans bear greatest resemblance to non-livestock isolates. Moreover, seasonal variation in campylobacteriosis bears the hallmarks of water-borne disease, and certain outbreaks have been attributed to contamination of drinking water. As a result, the relative importance of these reservoirs to human disease is controversial. We use multilocus sequence typing to genotype 1,231 cases of C. jejuni isolated from patients in Lancashire, England. By modeling the DNA sequence evolution and zoonotic transmission of C. jejuni between host species and the environment, we assign human cases probabilistically to source populations. Our novel population genetics approach reveals that the vast majority (97%) of sporadic disease can be attributed to animals farmed for meat and poultry. Chicken and cattle are the principal sources of C. jejuni pathogenic to humans, whereas wild animal and environmental sources are responsible for just 3% of disease. Our results imply that the primary transmission route is through the food chain, and suggest that incidence could be dramatically reduced by enhanced on-farm biosecurity or preventing food-borne transmission.
Subjects
Animals, Wild/microbiology; Campylobacter Infections/transmission; Campylobacter jejuni/isolation & purification; Meat/microbiology; Water Microbiology; Animals; Bacterial Typing Techniques; Biodiversity; Birds; Campylobacter Infections/epidemiology; Campylobacter Infections/microbiology; Campylobacter jejuni/classification; Campylobacter jejuni/genetics; Cattle; Chickens; Disease Reservoirs/microbiology; England/epidemiology; Humans; Rabbits; Sheep; Swine
ABSTRACT
Responsible for the majority of bacterial gastroenteritis in the developed world, Campylobacter jejuni is a pervasive pathogen of humans and animals, but its evolution is obscure. In this paper, we exploit contemporary genetic diversity and empirical evidence to piece together the evolutionary history of C. jejuni and quantify its evolutionary potential. Our combined population genetics-phylogenetics approach reveals a surprising picture. Campylobacter jejuni is a rapidly evolving species, subject to intense purifying selection that purges 60% of novel variation, but possessing a massive evolutionary potential. The low mutation rate is offset by a large effective population size so that a mutation at any site can occur somewhere in the population within the space of a week. Recombination has a fundamental role, generating diversity at twice the rate of de novo mutation, and facilitating gene flow between C. jejuni and its sister species Campylobacter coli. We attempt to calibrate the rate of molecular evolution in C. jejuni based solely on within-species variation. The rates we obtain are up to 1,000 times faster than conventional estimates, placing the C. jejuni-C. coli split at the time of the Neolithic revolution. We weigh the plausibility of such recent bacterial evolution against alternative explanations and discuss the evidence required to settle the issue.
Subjects
Campylobacter jejuni/genetics; Evolution, Molecular; Campylobacter Infections/microbiology; Campylobacter coli/genetics; Campylobacter jejuni/classification; England; Genetic Drift; Genetic Speciation; Humans; Mutation; Recombination, Genetic; Selection, Genetic
ABSTRACT
We look at how to choose genetic distance so as to maximize the power of detecting spatial structure. We answer this question through analyzing two population genetic models that allow for a spatially structured population in a continuous habitat. These models, like most that incorporate spatial structure, can be characterized by a separation of timescales: the history of the sample can be split into a scattering and a collecting phase, and it is only during the scattering phase that the spatial locations of the sample affect the coalescence times. Our results suggest that the optimal choice of genetic distance is based upon splitting a DNA sequence into segments and counting the number of segments at which two sequences differ. The size of these segments depends on the length of the scattering phase for the population genetic model.
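The segment-counting distance described above can be illustrated with a short Python sketch. The function name and the fixed, non-overlapping segmentation are assumptions made for this example; the abstract only states that the appropriate segment size depends on the length of the scattering phase, so `segment_length` is left as a free parameter here.

```python
def segment_distance(seq_a, seq_b, segment_length):
    """Count the segments at which two aligned sequences differ (illustrative)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    if segment_length < 1:
        raise ValueError("segment_length must be at least 1")
    differing = 0
    # Walk over consecutive, non-overlapping segments of the alignment.
    for start in range(0, len(seq_a), segment_length):
        end = start + segment_length
        # A segment counts once, however many sites inside it differ.
        if seq_a[start:end] != seq_b[start:end]:
            differing += 1
    return differing
```

With `segment_length=1` this reduces to the number of differing sites, and with a segment as long as the whole sequence it only records whether the two sequences are identical.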
Subjects
Campylobacter jejuni/genetics; Chromosomes, Bacterial/genetics; Genetics, Population; Genome, Bacterial/genetics; Models, Genetic; Campylobacter jejuni/classification; Data Interpretation, Statistical; Evolution, Molecular; Geography
ABSTRACT
We consider inference for demographic models and parameters based upon postprocessing the output of an MCMC method that generates samples of genealogical trees (from the posterior distribution for a specific prior distribution of the genealogy). This approach has the advantage of taking account of the uncertainty in the inference for the tree when making inferences about the demographic model and can be computationally efficient in terms of reanalyzing data under a wide variety of models. We consider a (simulation-consistent) estimate of the likelihood for variable population size models, which uses importance sampling, and propose two new approximate likelihoods, one for migration models and one for continuous spatial models.
Subjects
Evolution, Molecular; Genes/physiology; Genetics, Population; Models, Genetic; Models, Statistical; Pedigree; Algorithms; Animals; Bayes Theorem; DNA/genetics; Data Interpretation, Statistical; Emigration and Immigration; Genetic Variation; Humans; Likelihood Functions; Markov Chains; Monte Carlo Method; Software
ABSTRACT
MOTIVATION: There is much local variation in recombination rates across the human genome, with the majority of recombination occurring in recombination hotspots: short regions of approximately 2 kb in length that have much higher recombination rates than neighbouring regions. Knowledge of this local variation is important, e.g. in the design and analysis of association studies for disease genes. Population genetic data, such as that generated by the HapMap project, can be used to infer the location of these hotspots. We present a new, efficient and powerful method for detecting recombination hotspots from population data. RESULTS: We compare our method with four current methods for detecting hotspots. It is orders of magnitude quicker, and has greater power, than two related approaches. It appears to be more powerful than HotspotFisher, though less accurate at inferring the precise positions of the hotspot. It was also more powerful than LDhot in some situations: particularly for weaker hotspots (10-40 times the background rate) when SNP density is lower (< 1/kb). AVAILABILITY: Program, data sets, and full details of results are available at: http://www.maths.lancs.ac.uk/~fearnhea/Hotspot.
Subjects
Algorithms; Chromosome Mapping/methods; Databases, Genetic; Genetics, Population; Recombination, Genetic/genetics; Sequence Analysis, DNA/methods; Software; Base Sequence; Molecular Sequence Data; Polymorphism, Genetic
ABSTRACT
We show how the idea of monotone coupling from the past can produce simple algorithms for simulating samples at a nonneutral locus under a range of demographic models. We specifically consider a biallelic locus and either a general variable population size model or a general migration model for population subdivision. We investigate the effect of demography on the efficacy of selection and the effect of selection on genetic divergence between populations.
Subjects
Computer Simulation; Models, Genetic; Algorithms; Alleles; Animals; Genetic Variation; Markov Chains; Population/genetics; Population Density; Selection, Genetic
ABSTRACT
In this paper we build on an approach proposed by Zou et al. (2014) for nonparametric changepoint detection. This approach defines the best segmentation for a data set as the one which minimises a penalised cost function, with the cost function defined in terms of minus a non-parametric log-likelihood for data within each segment. Minimising this cost function is possible using dynamic programming, but their algorithm had a computational cost that is cubic in the length of the data set. To speed up computation, Zou et al. (2014) resorted to a screening procedure which means that the estimated segmentation is no longer guaranteed to be the global minimum of the cost function. We show that the screening procedure adversely affects the accuracy of the changepoint detection method, and show how a faster dynamic programming algorithm, pruned exact linear time (PELT) (Killick et al. 2012), can be used to find the optimal segmentation with a computational cost that can be close to linear in the amount of data. PELT requires a penalty to avoid under- or over-fitting the model, and a poorly chosen penalty can have a detrimental effect on the quality of the detected changepoints. To overcome this issue we use a relatively new method, changepoints over a range of penalties (Haynes et al. 2016), which finds all of the optimal segmentations for multiple penalty values over a continuous range. We apply our method to detect changes in heart-rate during physical activity.
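As a rough illustration of the penalised-cost idea above, the following Python sketch finds the segmentation that minimises the total segment cost plus a penalty per changepoint, using dynamic programming with PELT-style pruning. It is not the authors' implementation: for brevity it swaps the nonparametric log-likelihood cost for a squared-error (change-in-mean) cost, and the names `pelt_mean`, `cost`, and `penalty` are assumptions made for this example.

```python
import numpy as np


def pelt_mean(y, penalty):
    """Return changepoint positions minimising squared-error cost + penalty."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give the cost of any segment y[a:b] in constant time.
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(a, b):
        # Residual sum of squares of y[a:b] around its own mean (assumed cost).
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    best = np.full(n + 1, np.inf)   # best[t]: optimal penalised cost of y[:t]
    best[0] = -penalty              # so the first segment carries no penalty
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]                # start points of the final segment still in play
    for t in range(1, n + 1):
        unpenalised = [best[s] + cost(s, t) for s in candidates]
        k = int(np.argmin(unpenalised))
        best[t] = unpenalised[k] + penalty
        last[t] = candidates[k]
        # Pruning: a start point s with best[s] + cost(s, t) > best[t]
        # can never be optimal for any later t, so it is dropped.
        candidates = [s for s, v in zip(candidates, unpenalised) if v <= best[t]]
        candidates.append(t)
    # Backtrack through the stored start points to recover the changepoints.
    changepoints = []
    t = n
    while last[t] > 0:
        t = int(last[t])
        changepoints.append(t)
    return sorted(changepoints)
```

A returned position `k` means that a new segment starts at index `k`. Larger values of `penalty` give fewer changepoints, which is why exploring a range of penalties, as the abstract describes, can be preferable to fixing a single value.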
ABSTRACT
Many common approaches to detecting changepoints, for example based on statistical criteria such as penalised likelihood or minimum description length, can be formulated in terms of minimising a cost over segmentations. We focus on a class of dynamic programming algorithms that can solve the resulting minimisation problem exactly, and thus find the optimal segmentation under the given statistical criteria. The standard implementations of these dynamic programming methods have a computational cost that scales at least quadratically in the length of the time series. Recently, pruning ideas have been suggested that can speed up the dynamic programming algorithms, whilst still being guaranteed to be optimal, in that they find the true minimum of the cost function. Here we extend these pruning methods, and introduce two new algorithms for segmenting data: FPOP and SNIP. Empirical results show that FPOP is substantially faster than existing dynamic programming methods, and unlike the existing methods its computational efficiency is robust to the number of changepoints in the data. We evaluate the method for detecting copy number variations and observe that FPOP has a computational cost that is even competitive with that of binary segmentation, but can give much more accurate segmentations.
ABSTRACT
We develop a method for maximum-likelihood estimation of coalescence times in genealogical trees, based on population genetics data. For this purpose, a Viterbi-type algorithm is constructed to maximize the joint likelihood of the coalescence times. Marginal confidence intervals for the coalescence times based on the profile likelihoods are also computed. Our method of finding maximum-likelihood estimates and calculating confidence intervals appears to be more accurate than alternative numerical maximization methods, and maximum-likelihood inference appears to be more accurate than other existing model-free approaches to estimating coalescence times. We demonstrate the method on two different data sets: human Y chromosome DNA data and fungus DNA data.
Subjects
Algorithms; Chromosomes, Human, Y/genetics; Classification/methods; Evolution, Molecular; Models, Genetic; Phylogeny; Ascomycota/genetics; Computer Simulation; Genetics, Population; Humans; Likelihood Functions; Male; Time Factors
ABSTRACT
We have performed simulations to assess the performance of three population genetics approximate-likelihood methods in estimating the population-scaled recombination rate from sequence data. We measured performance in two ways: accuracy when the sequence data were simulated according to the (simplistic) standard model underlying the methods and robustness to violations of many different aspects of the standard model. Although we found some differences between the methods, performance tended to be similar for all three methods. Despite the fact that the methods are not robust to violations of the underlying model, our simulations indicate that patterns of relative recombination rates should be inferred reasonably well even if the standard model does not hold. In addition, we assess various techniques for improving the performance of approximate-likelihood methods. In particular we find that the composite-likelihood method of Hudson (2001) can be improved by including log-likelihood contributions only for pairs of sites that are separated by some prespecified distance.
Subjects
Genetics, Population; Models, Genetic; Recombination, Genetic; Computer Simulation; Likelihood Functions; Mutation/genetics; Population Density
ABSTRACT
Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, $4N_{e}r$, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV1 and HIV2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.
Subjects
Evolution, Molecular; Recombination, Genetic; Sequence Analysis, DNA/methods; Animals; Humans; Likelihood Functions; Models, Genetic; Mutation
ABSTRACT
There has been considerable recent interest in understanding the way in which recombination rates vary over small physical distances, and the extent of recombination hotspots, in various genomes. Here we adapt, apply, and assess the power of recently developed coalescent-based approaches to estimating recombination rates from sequence polymorphism data. We apply full-likelihood estimation to study rate variation in and around a well-characterized recombination hotspot in humans, in the beta-globin gene cluster, and show that it provides similar estimates, consistent with those from sperm studies, from two populations deliberately chosen to have different demographic and selectional histories. We also demonstrate how approximate-likelihood methods can be used to detect local recombination hotspots from genomic-scale SNP data. In a simulation study based on 80 100-kb regions, these methods detect 43 out of 60 hotspots (ranging from 1 to 2 kb in size), with only two false positives out of 2000 subregions that were tested for the presence of a hotspot. Our study suggests that new computational tools for sophisticated analysis of population diversity data are valuable for hotspot detection and fine-scale mapping of local recombination rates.
Subjects
Genetic Variation; Recombination, Genetic; DNA/genetics; Genome; Globins/genetics; Haplotypes; Humans; Likelihood Functions; Models, Genetic; Models, Statistical
ABSTRACT
A repeated cross-sectional study was conducted to determine the prevalence of Campylobacter spp. and the population structure of C. jejuni in European starlings and ducks cohabiting multiple public access sites in an urban area of New Zealand. The country's geographical isolation and relatively recent history of introduction of wild bird species, including the European starling and mallard duck, create an ideal setting to explore the impact of geographical separation on the population biology of C. jejuni, as well as potential public health implications. A total of 716 starling and 720 duck fecal samples were collected and screened for C. jejuni over a 12 month period. This study combined molecular genotyping, population genetics and epidemiological modeling and revealed: (i) higher Campylobacter spp. isolation in starlings (46%) compared with ducks (30%), but similar isolation of C. jejuni in ducks (23%) and starlings (21%), (ii) significant associations between the isolation of Campylobacter spp. and host species, sampling location and time of year using logistic regression, (iii) evidence of population differentiation, as indicated by $F_{ST}$, and host-genotype association, with clonal complexes CC ST-177 and CC ST-682 associated with starlings, and clonal complexes CC ST-1034, CC ST-692, and CC ST-1332 associated with ducks, and (iv) greater genetic diversity and genotype richness in ducks compared with starlings. These findings provide evidence that host-associated genotypes, such as the starling-associated ST-177 and ST-682, represent lineages that were introduced with the host species in the 19th century. The isolation of sequence types associated with human disease in New Zealand indicates that wild ducks and starlings need to be considered as a potential public health risk, particularly in urban areas.