ABSTRACT
We present ViPRA-Haplo, a de novo strain-specific assembly workflow for reconstructing viral haplotypes in a viral population from paired-end next generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) generates a subset of paths from a De Bruijn graph of reads using the pairing information of reads. The paths generated by ViPRA are an over-estimation of the true contigs. We propose two refinement methods to obtain an optimal set of contigs representing viral haplotypes. The first method clusters paths reconstructed by ViPRA using VSEARCH (Deorowicz et al. 2015) based on sequence similarity, while the second method, MLEHaplo, generates a maximum likelihood estimate of viral populations. We evaluated our pipeline on both simulated and real viral quasispecies data from HIV, and on real data from SARS-CoV-2. Experimental results show that ViPRA-Haplo, although it still overestimates the number of true contigs, outperforms the existing tool, PEHaplo, providing up to 9% better genome coverage on real HIV data. In addition, ViPRA-Haplo retains higher diversity of the viral population, as demonstrated by a higher percentage of contigs shorter than 1000 base pairs (bps) that also contain k-mers with counts below 100 (representing rarer sequences); such contigs are absent from the PEHaplo output. For SARS-CoV-2 sequencing data, ViPRA-Haplo reconstructs contigs that cover more than 90% of the reference genome, and we were able to validate known SARS-CoV-2 strains in the sequencing data.
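As a rough illustration of the graph that path reconstruction starts from, the following minimal sketch builds a De Bruijn graph from reads and spells out the sequences of its simple paths. It is not the ViPRA algorithm: it ignores the read-pairing information that ViPRA uses to select paths, and the function names `build_de_bruijn` and `spell_paths` are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def spell_paths(graph, start):
    """Return the sequences spelled by all maximal simple paths from `start`.

    Assumption for illustration: a path ends when no unvisited successor
    remains. Read pairing, which ViPRA relies on, is not modelled here.
    """
    sequences = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = [n for n in sorted(graph.get(node, ())) if n not in path]
        if not successors:
            # Spell the path: first node plus the last base of each later node.
            sequences.append(path[0] + "".join(p[-1] for p in path[1:]))
            continue
        for nxt in successors:
            stack.append((nxt, path + [nxt]))
    return sorted(sequences)
```

For two reads that differ at one position, for example `ACGTA` and `ACGGA` with `k = 3`, the graph branches at the node `CG`, and each branch spells one candidate haplotype. Exhaustive enumeration like this grows quickly with the number of branches, which is one reason to restrict the path set and to refine the reconstructed paths afterwards.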
Subjects
Algorithms , Genome, Viral , High-Throughput Nucleotide Sequencing , SARS-CoV-2 , High-Throughput Nucleotide Sequencing/methods , SARS-CoV-2/genetics , Genome, Viral/genetics , Humans , Haplotypes/genetics , COVID-19/virology , HIV/genetics , Computational Biology/methods
ABSTRACT
All vertebrate genomes have been colonized by retroviruses along their evolutionary trajectory. Although endogenous retroviruses (ERVs) can contribute important physiological functions to contemporary hosts, such benefits are attributed to long-term coevolution of ERV and host because germline infections are rare and expansion is slow, and because the host effectively silences them. The genomes of several outbred species including mule deer (Odocoileus hemionus) are currently being colonized by ERVs, which provides an opportunity to study ERV dynamics at a time when few are fixed. We previously established the locus-specific distribution of cervid ERV (CrERV) in populations of mule deer. In this study, we determine the molecular evolutionary processes acting on CrERV at each locus in the context of phylogenetic origin, genome location, and population prevalence. A mule deer genome was de novo assembled from short- and long-insert mate pair reads and CrERV sequence generated at each locus. We report that CrERV composition and diversity have recently measurably increased by horizontal acquisition of a new retrovirus lineage. This new lineage has further expanded CrERV burden and CrERV genomic diversity by activating and recombining with existing CrERV. Resulting interlineage recombinants then endogenize and subsequently expand. CrERV loci are significantly closer to genes than expected if integration were random and gene proximity might explain the recent expansion of one recombinant CrERV lineage. Thus, in mule deer, retroviral colonization is a dynamic period in the molecular evolution of CrERV that also provides a burst of genomic diversity to the host population.
Subjects
Deer , Endogenous Retroviruses , Animals , Biological Evolution , Deer/genetics , Endogenous Retroviruses/genetics , Evolution, Molecular , Phylogeny , Recombination, Genetic
ABSTRACT
Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic; not all individuals have a retrovirus at a specific genomic location. It is possible that HERV-Ks contribute to human disease because people differ in both number and genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers and in auto-immune and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated, in part because the population prevalence of HERV-K provirus at each polymorphic site is lacking and because it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed length genomic sequences (k-mers) from whole genome sequence data that match a reference set of k-mers unique to each HERV-K locus, and applies mixture model-based clustering of these values to account for low depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the prevalence of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the proportion of the KGP populations with any combination of polymorphic HERV-K proviruses. Further, because HERV-K is insertionally polymorphic, the genome burden of known polymorphic HERV-K is variable in humans; this burden is lowest in East Asian (EAS) individuals. Our study identifies population-specific sequence variation for HERV-K proviruses at several loci. We expect these resources will advance research on HERV-K contributions to human diseases.
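The k-mer matching step described above can be sketched as follows. This is an illustration only, under the assumption that the per-locus value is the fraction of locus-specific reference k-mers observed in the reads; the function names are hypothetical, and the mixture model-based clustering that the abstract applies to these values is not shown.

```python
def kmers(sequence, k):
    """Return the set of all k-mers contained in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def locus_kmer_proportion(reads, locus_kmers, k):
    """Fraction of a locus's unique reference k-mers that appear in the reads.

    `locus_kmers` is assumed to be a precomputed set of k-mers that occur
    only at this locus in the reference.
    """
    if not locus_kmers:
        raise ValueError("locus_kmers must not be empty")
    observed = set()
    for read in reads:
        observed |= kmers(read, k)
    return len(observed & locus_kmers) / len(locus_kmers)
```

A value near 1 would suggest that the provirus is present at the locus and a value near 0 that it is absent. At low sequencing depth intermediate values are common, which is presumably why the values are clustered rather than compared against a fixed cutoff.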
Subjects
Endogenous Retroviruses/genetics , Genetics, Population/methods , Genomics/methods , Proviruses/genetics , Racial Groups/genetics , Algorithms , Genome, Human/genetics , Genome, Viral/genetics , Humans , Molecular Epidemiology , Software
ABSTRACT
SUMMARY: MicroRNAs (miRNAs) function as master regulators of gene expression. Recent studies demonstrate that miRNA isoforms (isomiRs) play a unique role in cancer development. Here, we present QuagmiR, the first cloud-based tool to analyze isomiRs from next generation sequencing data. Using a novel and flexible searching algorithm designed for the detection and annotation of heterogeneous isomiRs, it permits extensive customization of the query process and reference databases to meet the user's diverse research needs. AVAILABILITY AND IMPLEMENTATION: QuagmiR is written in Python and can be obtained freely from GitHub (https://github.com/Gu-Lab-RBL-NCI/QuagmiR). QuagmiR can be run from the command line on local machines, as well as on high-performance servers. A web-accessible version of the tool has also been made available for use by academic researchers through the National Cancer Institute-funded Seven Bridges Cancer Genomics Cloud (https://cancergenomicscloud.org). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Cloud Computing , Data Science , Genomics , High-Throughput Nucleotide Sequencing , MicroRNAs , Software
ABSTRACT
We performed an extensive immunogenomic analysis of more than 10,000 tumors comprising 33 diverse cancer types by utilizing data compiled by TCGA. Across cancer types, we identified six immune subtypes (wound healing, IFN-γ dominant, inflammatory, lymphocyte depleted, immunologically quiet, and TGF-β dominant) characterized by differences in macrophage or lymphocyte signatures, Th1:Th2 cell ratio, extent of intratumoral heterogeneity, aneuploidy, extent of neoantigen load, overall cell proliferation, expression of immunomodulatory genes, and prognosis. Specific driver mutations correlated with lower (CTNNB1, NRAS, or IDH1) or higher (BRAF, TP53, or CASP8) leukocyte levels across all cancers. Multiple control modalities of the intracellular and extracellular networks (transcription, microRNAs, copy number, and epigenetic processes) were involved in tumor-immune cell interactions, both across and within immune subtypes. Our immunogenomics pipeline to characterize these heterogeneous tumors and the resulting data are intended to serve as a resource for future targeted studies to further advance the field.
Subjects
Genomics/methods , Neoplasms , Adolescent , Adult , Aged , Aged, 80 and over , Child , Female , Humans , Interferon-gamma/genetics , Interferon-gamma/immunology , Macrophages/immunology , Male , Middle Aged , Neoplasms/classification , Neoplasms/genetics , Neoplasms/immunology , Prognosis , Th1-Th2 Balance/physiology , Transforming Growth Factor beta/genetics , Transforming Growth Factor beta/immunology , Wound Healing/genetics , Wound Healing/immunology , Young Adult
ABSTRACT
Next-generation sequencing has produced petabytes of data, but accessing and analyzing these data remain challenging. Traditionally, researchers investigating public datasets like The Cancer Genome Atlas (TCGA) would download the data to a high-performance cluster, which could take several weeks even with a highly optimized network connection. The National Cancer Institute (NCI) initiated the Cancer Genomics Cloud Pilots program to provide researchers with the resources to process data with cloud computational resources. We present protocols using one of these Cloud Pilots, the Seven Bridges Cancer Genomics Cloud (CGC), to find and query public datasets, bring your own data to the CGC, analyze data using standard or custom workflows, and benchmark tools for accuracy with interactive analysis features. These protocols demonstrate that the CGC is a data-analysis ecosystem that fully empowers researchers with a variety of areas of expertise and interests to collaborate in the analysis of petabytes of data. © 2017 by John Wiley & Sons, Inc.
Subjects
Databases, Genetic/statistics & numerical data , Neoplasms/genetics , Cloud Computing , Computational Biology , Data Interpretation, Statistical , Genomics , High-Throughput Nucleotide Sequencing , Humans , Metadata , Pilot Projects
ABSTRACT
The Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org) enables researchers to rapidly access and collaborate on massive public cancer genomic datasets, including The Cancer Genome Atlas. It provides secure on-demand access to data, analysis tools, and computing resources. Researchers from diverse backgrounds can easily visualize, query, and explore cancer genomic datasets visually or programmatically. Data of interest can be immediately analyzed in the cloud using more than 200 preinstalled, curated bioinformatics tools and workflows. Researchers can also extend the functionality of the platform by adding their own data and tools via an intuitive software development kit. By colocalizing these resources in the cloud, the CGC enables scalable, reproducible analyses. Researchers worldwide can use the CGC to investigate key questions in cancer genomics. Cancer Res; 77(21); e3-6. ©2017 AACR.
Subjects
Computational Biology , Genomics , Neoplasms/genetics , Genome, Human , Humans , Internet , Research , Software
ABSTRACT
We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
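To make the idea of "counts of k-mers of varying lengths" concrete, the sketch below builds one feature vector per k-mer from count tables at several sizes. The specific feature design (counts of the k-mer's prefixes at each smaller size) is an assumption made for illustration and is not taken from MultiRes; the function names are hypothetical, and training the classifier itself is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multires_features(reads, sizes):
    """Build a count vector for each k-mer of the largest size.

    Illustrative assumption: the vector holds, for each size in ascending
    order, the count of the k-mer's prefix of that length.
    """
    sizes = sorted(sizes)
    tables = {k: count_kmers(reads, k) for k in sizes}
    largest = sizes[-1]
    return {
        kmer: [tables[k][kmer[:k]] for k in sizes]
        for kmer in tables[largest]
    }
```

With reads `ACGT` and `ACGA` and sizes 2, 3 and 4, the 4-mer `ACGT` gets the vector `[2, 2, 1]`: its shorter prefixes are shared by both reads, while the full 4-mer is seen only once. Vectors of this kind, paired with labels from simulated data, could then be passed to any standard random forest implementation.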
ABSTRACT
Metagenomics involves the analysis of genomes of microorganisms sampled directly from their environment. Next Generation Sequencing allows high-throughput sampling of small segments from genomes in the metagenome to generate reads. To study the properties and relationships of the microorganisms present, clustering can be performed based on the inherent composition of the sampled reads for unknown species. We propose a two-dimensional lattice-based probabilistic model for clustering metagenomic datasets. The occurrence of a species in the metagenome is estimated using a lattice of probabilistic distributions over small-sized genomic sequences. The two dimensions denote distributions for different sizes and groups of words, respectively. The lattice structure allows for additional support for a node from its neighbors when the probabilistic support for the species using the parameters of the current node is deemed insufficient. We also show convergence for our algorithm. We test our algorithm on simulated metagenomic data containing bacterial species and observe more than 85% precision. We also evaluate our algorithm on an in vitro-simulated bacterial metagenome and on human patient data, and show better clustering than other algorithms even for short reads and varied abundance. The software and datasets can be downloaded from https://github.com/lattclus/lattice-metage.
ABSTRACT
BACKGROUND: Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3' half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology. RESULTS: Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies, and it reproduces empirically observed distributions quite effectively. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dually and singly infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or by the interaction of infection status and tissue. Altogether, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads).
We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy. CONCLUSIONS: Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.
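For readers unfamiliar with the quantities involved, the sketch below computes an uncorrected minor allele frequency at a site and applies the simple thresholding approach that the abstract lists as a comparator. It does not implement the exponential-normal convolution correction proposed in the study. The definition used here (all non-major bases divided by total depth) and both function names are assumptions made for illustration.

```python
def minor_allele_frequency(base_counts):
    """Fraction of reads at a site that do not carry the most common base.

    `base_counts` maps each base to its read count at the site. A site
    with zero coverage is assigned a frequency of 0.0.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    return (total - max(base_counts.values())) / total


def threshold_mafs(frequencies, cutoff):
    """Naive error handling: set frequencies below `cutoff` to zero."""
    return [f if f >= cutoff else 0.0 for f in frequencies]
```

A site with counts `{"A": 90, "G": 8, "C": 2, "T": 0}` has a minor allele frequency of 0.1. Thresholding discards every value below the cutoff regardless of whether it reflects a sequencing error or a genuine rare variant, which may help explain why the abstract reports weaker effects with that approach.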
Subjects
Cat Diseases/immunology , Data Interpretation, Statistical , Feline Acquired Immunodeficiency Syndrome/immunology , High-Throughput Nucleotide Sequencing/methods , Immunodeficiency Virus, Feline/classification , Immunodeficiency Virus, Feline/genetics , Models, Statistical , Algorithms , Animals , Cat Diseases/genetics , Cat Diseases/transmission , Cat Diseases/virology , Cats , DNA, Viral/genetics , Feline Acquired Immunodeficiency Syndrome/genetics , Feline Acquired Immunodeficiency Syndrome/virology , Immunodeficiency Virus, Feline/pathogenicity
ABSTRACT
High genetic variability in viral populations plays an important role in disease progression, pathogenesis, and drug resistance. The last few years have seen significant progress in the development of methods for reconstruction of viral populations using data from next-generation sequencing technologies. These methods identify the differences between individual haplotypes by mapping the short reads to a reference genome. Much less has been published about resolving the population structure when a reference genome is lacking or is not well-defined, which severely limits the application of these new technologies to resolve virus population structure. We describe a computational framework, called Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised, so the number of haplotypes does not have to be specified in advance; and (iii) it identifies the polymorphic sites that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population. The method was evaluated on simulated reads with sequencing errors and 454 pyrosequencing reads from HIV samples. Our method clustered a high percentage of haplotypes with low false-positive rates, even at low genetic diversity.