ABSTRACT
In genetic association studies, haplotype data provide more refined information than data on separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may report only pooled results. Methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation fails because the normal covariance matrix is near-singular. As an alternative to approximate methods, in this paper we propose two exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, where the pooled results are considered integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods for synthetic data and for haplotype data from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time series of pooled genetic data, as a proof of concept of how our methods are relevant to more complex hierarchical settings, such as spatiotemporal models.
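To make the phrase "integer combinations of latent, unobserved haplotype counts" concrete, the following is a minimal sketch, not the paper's implementation. It assumes two biallelic markers and a pool that reports only the variant allele count at each marker. The matrix `A`, the function `compatible_latent_counts`, the haplotype ordering, and the frequencies `p` are all illustrative assumptions. The enumeration is brute force and is meant only to show that several latent count vectors can map to the same pooled observation.

```python
import itertools

import numpy as np

# Haplotypes over two biallelic markers, in the (assumed) order 00, 01, 10, 11.
haplotypes = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Configuration matrix A: row m sums the counts of haplotypes that carry
# the variant allele at marker m, so the pooled observation is y = A @ z.
A = haplotypes.T  # shape (2 markers, 4 haplotypes)


def compatible_latent_counts(y, n):
    """Enumerate all haplotype count vectors z with sum(z) == n and A @ z == y."""
    solutions = []
    for z in itertools.product(range(n + 1), repeat=A.shape[1]):
        if sum(z) == n and np.array_equal(A @ np.array(z), y):
            solutions.append(z)
    return solutions


# Simulate one pool: latent counts are multinomial, only y is observed.
rng = np.random.default_rng(0)
p = np.array([0.4, 0.1, 0.2, 0.3])  # hypothetical haplotype frequencies
n = 4                               # pool size (number of haplotypes pooled)
z = rng.multinomial(n, p)           # latent, unobserved
y = A @ z                           # observed pooled allele counts

# Every latent vector consistent with the observed pooled counts.
candidates = compatible_latent_counts(y, n)
```

Brute-force enumeration grows quickly with pool size and the number of haplotypes, which is why it is used here only for illustration; the abstract's scalable approach samples latent counts instead of enumerating them.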
ABSTRACT
The emergence and spread of drug-resistant Plasmodium falciparum parasites have hindered efforts to eliminate malaria. Monitoring the spread of drug resistance is vital, as drug resistance can lead to widespread treatment failure. We develop a Bayesian model to produce spatio-temporal maps that depict the spread of drug resistance, and apply our methods to the antimalarial sulfadoxine-pyrimethamine. From genetic count data, we infer the prevalences over space and time of various malaria parasite haplotypes associated with drug resistance. Previous work has focused on inferring the prevalence of individual molecular markers. In reality, combinations of mutations at multiple markers confer varying degrees of drug resistance to the parasite, indicating that multiple markers should be modelled together. However, the reporting of genetic count data is often inconsistent: some studies report haplotype counts, whereas others report mutation counts for each marker separately. In response, we introduce a latent multinomial Gaussian process model to handle partially reported spatio-temporal count data. As drug-resistant mutations are often used as a proxy for treatment efficacy, point estimates from our spatio-temporal maps can help inform antimalarial drug policies, whereas the uncertainties from our maps can help optimize sampling strategies for future monitoring of drug resistance.