ABSTRACT
Knowledge of the time of HIV-1 infection and the multiplicity of viruses that establish HIV-1 infection is crucial for the in-depth analysis of clinical prevention efficacy trial outcomes. Better estimation methods would improve the ability to characterize immunological and genetic sequence correlates of efficacy within preventive efficacy trials of HIV-1 vaccines and monoclonal antibodies. We developed new methods for infection timing and multiplicity estimation using maximum likelihood estimators that shift and scale (calibrate) estimates by fitting true infection times and founder virus multiplicities to a linear regression model with independent variables defined by data on HIV-1 sequences, viral load, diagnostics, and sequence alignment statistics. Using Poisson models of measured mutation counts and phylogenetic trees, we analyzed longitudinal HIV-1 sequence data together with diagnostic and viral load data from the RV217 and CAPRISA 002 acute HIV-1 infection cohort studies. We used leave-one-out cross validation to evaluate the prediction error of these calibrated estimators versus that of existing estimators and found that both infection time and founder multiplicity can be estimated with improved accuracy and precision by calibration. Calibration considerably improved all estimators of time since HIV-1 infection, in terms of reducing bias to near zero and reducing root mean squared error (RMSE) to 5-10 days for sequences collected 1-2 months after infection. The calibration of multiplicity assessments yielded strong improvements with accurate predictions (ROC-AUC above 0.85) in all cases. These results have not yet been validated on external data, and the best-fitting models are likely to be less robust than simpler models to variation in sequencing conditions. For all evaluated models, these results demonstrate the value of calibration for improved estimation of founder multiplicity and of time since HIV-1 infection.
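The calibration idea described above (regress known true values on an existing estimator's output, then judge the result by leave-one-out cross-validation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the single predictor, the size of the bias, and the function name `loocv_calibrated_error` are assumptions made for the example.

```python
import numpy as np

def loocv_calibrated_error(X, y):
    """Leave-one-out RMSE and bias of a linear calibration y ~ intercept + X @ beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept = shift, slopes = scale
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # hold out observation i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        preds[i] = design[i] @ beta            # predict the held-out case
    errors = preds - y
    return float(np.sqrt(np.mean(errors ** 2))), float(np.mean(errors))

# Synthetic stand-in data (assumption): a raw estimator that is biased and noisy.
rng = np.random.default_rng(0)
true_days = rng.uniform(30, 60, size=40)
raw_estimate = 1.4 * true_days + 8 + rng.normal(0, 3, size=40)

raw_rmse = float(np.sqrt(np.mean((raw_estimate - true_days) ** 2)))
cal_rmse, cal_bias = loocv_calibrated_error(raw_estimate.reshape(-1, 1), true_days)
print(f"raw RMSE={raw_rmse:.1f}  calibrated RMSE={cal_rmse:.1f}  bias={cal_bias:.2f}")
```

Because each prediction is made by a model that never saw the held-out case, the reported error is an honest estimate of how the calibrated estimator would behave on a new participant from the same cohort.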
Subjects
AIDS Vaccines, HIV Infections/prevention & control, HIV-1/genetics, Statistical Models, Molecular Evolution, Genetic Variation, HIV Infections/virology, Humans, Mutation, Phylogeny, Sequence Analysis, Time Factors, Viral Load
ABSTRACT
Increased interest in the immune system's involvement in pathophysiological phenomena, coupled with decreased DNA sequencing costs, has led to an explosion of antibody and T cell receptor sequencing data collectively termed "adaptive immune receptor repertoire sequencing" (AIRR-seq or Rep-Seq). The AIRR Community has been actively working to standardize protocols, metadata, formats, APIs, and other guidelines to promote open and reproducible studies of the immune repertoire. In this paper, we describe the work of the AIRR Community's Data Representation Working Group to develop standardized data representations for storing and sharing annotated antibody and T cell receptor data. Our file format emphasizes ease of use, accessibility, scalability to large data sets, and a commitment to open and transparent science. It is a tab-delimited format with a specific schema. Several popular repertoire analysis tools and data repositories already use this AIRR-seq data format. We hope that others will follow suit in the interest of promoting interoperable standards.
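Since the format is tab-delimited with a fixed schema, such files can be read with a standard library parser. The sketch below is illustrative only: the column names follow my understanding of the AIRR Rearrangement schema and the `T`/`F` boolean encoding is assumed, so both should be checked against the published specification. The two records are invented toy data.

```python
import csv
import io

# Invented toy records; column names are assumed from the AIRR Rearrangement schema.
TOY_TSV = (
    "sequence_id\tv_call\tj_call\tjunction_aa\tproductive\n"
    "seq1\tIGHV1-2*02\tIGHJ4*02\tCARDYW\tT\n"
    "seq2\tIGHV3-23*01\tIGHJ6*02\tCAKGGW\tF\n"
)

def read_rearrangements(handle):
    """Parse tab-delimited rows into dicts, decoding the assumed T/F boolean."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["productive"] = row["productive"] == "T"
        rows.append(row)
    return rows

rows = read_rearrangements(io.StringIO(TOY_TSV))
productive_ids = [r["sequence_id"] for r in rows if r["productive"]]
print(productive_ids)
```

A plain tab-delimited layout like this is what lets the same file be opened in a spreadsheet, streamed line by line for large data sets, or loaded by any language's CSV reader.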
Subjects
Antibodies/genetics, Base Sequence, Database Management Systems, Information Dissemination/methods, T-Cell Antigen Receptors/genetics, Adaptive Immunity/genetics, Genetic Databases, Datasets as Topic, High-Throughput Nucleotide Sequencing/economics, Humans, Immunologic Receptors/genetics, Research Design
ABSTRACT
Modern infectious disease outbreak surveillance produces continuous streams of sequence data that require phylogenetic analysis as the data arrive. Current software packages for Bayesian phylogenetic inference are unable to quickly incorporate new sequences as they become available, making them less useful for dynamically unfolding evolutionary stories. This limitation can be addressed by applying a class of Bayesian statistical inference algorithms called sequential Monte Carlo (SMC) to conduct online inference, wherein new data can be continuously incorporated to update the estimate of the posterior probability distribution. In this article, we describe and evaluate several different online phylogenetic sequential Monte Carlo (OPSMC) algorithms. We show that proposing new phylogenies with a density similar to the Bayesian prior performs poorly, and we develop "guided" proposals that better match the proposal density to the posterior. Furthermore, we show that the simplest guided proposals can exhibit pathological behavior in some situations, leading to poor results, and that this can be resolved by heating the proposal density. The results demonstrate that, relative to the widely used MCMC-based algorithm implemented in MrBayes, the total time required to compute a series of phylogenetic posteriors as sequences arrive can be significantly reduced by the use of OPSMC, without incurring a significant loss in accuracy.
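The online-updating idea behind SMC can be shown on a toy problem that is much simpler than a phylogeny. The sketch below estimates the mean of a Gaussian from observations that arrive one at a time. Each arrival reweights the existing particles by the new observation's likelihood, and the particles are resampled when the weights become too uneven. It is not the OPSMC algorithm: there are no trees, no guided proposals, and no heating, and the names `smc_update` and `weighted_mean` are hypothetical.

```python
import numpy as np

def smc_update(particles, log_weights, new_obs, sigma, rng):
    """Fold one new observation into a weighted particle approximation."""
    # Reweight: multiply each weight by the likelihood of the new observation.
    log_weights = log_weights - 0.5 * ((new_obs - particles) / sigma) ** 2
    log_weights = log_weights - log_weights.max()  # guard against underflow
    w = np.exp(log_weights)
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    # Resample when fewer than half the particles are effectively contributing.
    if ess < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = particles[idx]
        log_weights = np.zeros(len(particles))
    return particles, log_weights

def weighted_mean(particles, log_weights):
    w = np.exp(log_weights - log_weights.max())
    return float(np.sum(w * particles) / np.sum(w))

rng = np.random.default_rng(1)
prior_sd, sigma = 5.0, 1.0
particles = rng.normal(0.0, prior_sd, size=20000)  # draws from the prior
log_weights = np.zeros(len(particles))
observations = rng.normal(2.0, sigma, size=20)

for obs in observations:                            # data arrive one at a time
    particles, log_weights = smc_update(particles, log_weights, obs, sigma, rng)

smc_mean = weighted_mean(particles, log_weights)
# Closed-form posterior mean for a zero-mean normal prior and normal likelihood.
exact_mean = float(observations.sum() / (len(observations) + sigma**2 / prior_sd**2))
print(f"SMC posterior mean={smc_mean:.3f}  exact={exact_mean:.3f}")
```

In this toy the particles are drawn from the prior and never moved, which is workable only because the parameter is one-dimensional. That is the prior-like proposal strategy the abstract reports as performing poorly for phylogenies.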
Subjects
Classification/methods, Biological Models, Phylogeny, Algorithms, Bayes Theorem, Internet, Monte Carlo Method
ABSTRACT
Phylogenetics has seen a steady increase in data set size and substitution model complexity, which require increasing amounts of computational power to compute likelihoods. This motivates strategies to approximate the likelihood functions for branch length optimization and Bayesian sampling. In this article, we develop an approximation to the 1D likelihood function as parametrized by a single branch length. Our method uses a four-parameter surrogate function abstracted from the simplest phylogenetic likelihood function, the binary symmetric model. We show that it offers a surrogate that can be fit over a variety of branch lengths, that it is applicable to a wide variety of models and trees, and that it can be used effectively as a proposal mechanism for Bayesian sampling. The method is implemented as a stand-alone open-source C library for calling from phylogenetics algorithms; it has proven essential for good performance of our online phylogenetic algorithm sts.
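The abstract does not state the surrogate's functional form. One plausible four-parameter form, assumed here for illustration, takes the two-state symmetric log-likelihood with c unchanged sites and m changed sites and adds a rate r and a branch-length offset b. The parameter names and the closed-form maximizer below come from that assumption, not from the paper.

```python
import numpy as np

def surrogate_loglik(t, c, m, r, b):
    """Assumed surrogate: c*log((1+x)/2) + m*log((1-x)/2) with x = exp(-r*(t+b))."""
    x = np.exp(-r * (np.asarray(t, dtype=float) + b))
    return c * np.log((1.0 + x) / 2.0) + m * np.log((1.0 - x) / 2.0)

def surrogate_argmax(c, m, r, b):
    """Branch length maximizing the surrogate; setting df/dx = 0 gives x = (c-m)/(c+m)."""
    if c <= m:
        raise ValueError("the surrogate has no finite maximum unless c > m")
    return float(-np.log((c - m) / (c + m)) / r - b)

# Illustrative parameter values (assumptions, not fitted to any data set).
c, m, r, b = 90.0, 10.0, 1.0, 0.0
t_closed = surrogate_argmax(c, m, r, b)

# Cross-check the closed form against a brute-force grid search.
grid = np.linspace(1e-4, 2.0, 20001)
t_grid = float(grid[np.argmax(surrogate_loglik(grid, c, m, r, b))])
print(f"closed form={t_closed:.4f}  grid search={t_grid:.4f}")
```

A surrogate of this kind is cheap to evaluate and has an analytic maximum, which is what makes it attractive both for branch length optimization and as a proposal density in place of repeated full likelihood computations.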
Subjects
Likelihood Functions, Phylogeny, DNA Sequence Analysis/methods, Algorithms, Bayes Theorem, Molecular Evolution, Markov Chains, Genetic Models, Monte Carlo Method, DNA Sequence Analysis/statistics & numerical data
ABSTRACT
Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method which can update an existing posterior with new sequences. Here, we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo. We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next, we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.
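The effective sample size referred to above is commonly computed from the particle weights as (sum of w)^2 / (sum of w^2). The abstract does not say which definition the paper uses, so this standard form is an assumption, and the helper name is hypothetical. A minimal sketch working from log-weights for numerical stability:

```python
import numpy as np

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum(w^2), computed from unnormalized log-weights."""
    lw = np.asarray(log_weights, dtype=float)
    w = np.exp(lw - lw.max())  # subtracting the max leaves the ratio unchanged
    return float(w.sum() ** 2 / np.sum(w ** 2))

ess_uniform = effective_sample_size(np.zeros(100))                 # equal weights
ess_degenerate = effective_sample_size([0.0, -1000.0, -1000.0])    # one dominant particle
print(ess_uniform, ess_degenerate)
```

With equal weights the ESS equals the number of particles, and when a single particle carries essentially all of the weight it falls to about one, so a lower bound that grows linearly in the particle count rules out that kind of collapse.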