Search | VHL Regional Portal

Long-read amplicon denoising.

Kumar, Venkatesh; Vollbrecht, Thomas; Chernyshev, Mark; Mohan, Sanjay; Hanst, Brian; Bavafa, Nicholas; Lorenzo, Antonia; Kumar, Nikesh; Ketteringham, Robert; Eren, Kemal; Golden, Michael; Oliveira, Michelli F; Murrell, Ben.

Nucleic Acids Res ; 47(18): e104, 2019 Oct 10.

Article in English | MEDLINE | ID: mdl-31418021

ABSTRACT

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called 'amplicon denoising', this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.
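To make the denoising problem stated in this abstract concrete, here is a minimal Python sketch of a generic abundance-based scheme. It is not FAD or RAD, whose algorithms the abstract does not describe. The function name `toy_denoise` and the parameters `max_dist` and `min_ratio` are hypothetical, and the sketch assumes equal-length reads with substitution errors only, so it ignores the indel errors that make real long-read data hard.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_denoise(reads, max_dist=1, min_ratio=10.0):
    """Toy abundance-based denoiser (illustrative assumption, not FAD/RAD).

    A read sequence is treated as an error copy of an already accepted
    template when it is within `max_dist` substitutions of that template
    and the template is at least `min_ratio` times more abundant.
    Returns a dict mapping each inferred template to its frequency.
    """
    counts = Counter(reads)
    # Visit sequences from most to least abundant; ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    templates = {}  # template sequence -> accumulated read count
    for seq, n in ordered:
        parent = None
        for t in templates:
            close = len(t) == len(seq) and hamming(t, seq) <= max_dist
            if close and counts[t] >= min_ratio * n:
                parent = t
                break
        if parent is None:
            templates[seq] = n       # abundant enough to stand on its own
        else:
            templates[parent] += n   # absorb the likely error read
    total = sum(templates.values())
    return {s: c / total for s, c in templates.items()}


# Two true templates differing by one nucleotide, plus rare error reads.
reads = (["ACGTACGT"] * 60 + ["ACGTACGA"] * 36
         + ["ACGAACGT"] * 3 + ["TCGTACGA"] * 1)
frequencies = toy_denoise(reads)
print(frequencies)
```

In this toy input the two abundant sequences differ at a single position, yet both are kept as templates because neither is ten times more abundant than the other, while the rare variants are folded into their nearest abundant neighbour.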

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Metagenomics , RNA, Ribosomal, 16S/genetics , Viruses/genetics , Algorithms , Cell Surface Display Techniques/methods , HIV/genetics , Phylogeny , Sequence Alignment , Single-Chain Antibodies/genetics , Software

Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons.

Eren, Kemal; Weaver, Steven; Ketteringham, Robert; Valentyn, Morné; Laird Smith, Melissa; Kumar, Venkatesh; Mohan, Sanjay; Kosakovsky Pond, Sergei L; Murrell, Ben.

PLoS Comput Biol ; 14(12): e1006498, 2018 Dec.

Article in English | MEDLINE | ID: mdl-30543621

ABSTRACT

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster), and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.
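The idea of collapsing noisy reads into a consensus sequence can be illustrated with a short Python sketch. This is a plain column-wise majority vote over reads that are assumed to be already aligned to equal length; it is not the FLEA pipeline's actual HQCS procedure, which the abstract does not detail. The function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over pre-aligned, equal-length reads.

    Illustrative assumption only: columns whose most common symbol is
    the gap character are dropped from the consensus.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Four aligned reads carrying one insertion error and two substitution errors.
reads = ["ACG-TA", "ACGGTA", "ACG-TT", "TCG-TA"]
print(majority_consensus(reads))  # prints ACGTA
```

Each error here occurs in only one read, so the majority symbol in every column recovers the underlying sequence; real data would need read clustering and alignment before a step like this.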

Subject(s)

Sequence Alignment/methods , Sequence Analysis, DNA/methods , Viruses/genetics , High-Throughput Nucleotide Sequencing/methods , Phylogeny , Software

Non-negative matrix factorization for learning alignment-specific models of protein evolution.

Murrell, Ben; Weighill, Thomas; Buys, Jan; Ketteringham, Robert; Moola, Sasha; Benade, Gerdus; du Buisson, Lise; Kaliski, Daniel; Hands, Tristan; Scheffler, Konrad.

PLoS One ; 6(12): e28898, 2011.

Article in English | MEDLINE | ID: mdl-22216138

ABSTRACT

Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
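The "weighted sum of the basis matrices" step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' estimation procedure: the basis matrices are taken as already learned and fixed, matrices are flattened to lists, and the weights are fitted with the standard multiplicative update for non-negative least squares, whereas the abstract does not say which objective the paper optimizes. The names `fit_weights` and `weighted_sum` are hypothetical.

```python
def weighted_sum(basis, weights):
    """Combine flattened basis matrices into one model: sum_k w_k * B_k."""
    size = len(basis[0])
    return [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(size)]


def fit_weights(basis, target, iters=500, eps=1e-12):
    """Fit non-negative weights so that weighted_sum(basis, w) ~ target.

    Uses the multiplicative update for a least-squares objective
    (an assumption made for this sketch):
        w_k <- w_k * (B_k . target) / (B_k . current_model)
    All inputs must be non-negative, so the weights stay non-negative.
    """
    weights = [1.0] * len(basis)
    for _ in range(iters):
        model = weighted_sum(basis, weights)
        weights = [
            w * sum(b * t for b, t in zip(bk, target))
            / (sum(b * m for b, m in zip(bk, model)) + eps)
            for w, bk in zip(weights, basis)
        ]
    return weights


# Two toy 2x2 basis matrices (flattened) and a target built from known weights.
basis = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 1.0]]
target = weighted_sum(basis, [2.0, 0.5])
fitted = fit_weights(basis, target)
print([round(w, 4) for w in fitted])
```

Only one weight per basis matrix is estimated for the alignment of interest, which is why such a model needs far fewer parameters than a full specialist matrix.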

Subject(s)

Biological Evolution , Models, Theoretical , Proteins/physiology , Phylogeny
