Results 1 - 12 of 12
1.
Article in English | MEDLINE | ID: mdl-38451771

ABSTRACT

We present ViPRA-Haplo, a de novo strain-specific assembly workflow for reconstructing viral haplotypes in a viral population from paired-end next generation sequencing (NGS) data. The proposed Viral Path Reconstruction Algorithm (ViPRA) generates a subset of paths from a De Bruijn graph of reads using the pairing information of reads. The paths generated by ViPRA are an over-estimation of the true contigs. We propose two refinement methods to obtain an optimal set of contigs representing viral haplotypes. The first method clusters paths reconstructed by ViPRA using VSEARCH (Deorowicz et al. 2015) based on sequence similarity, while the second method, MLEHaplo, generates a maximum likelihood estimate of viral populations. We evaluated our pipeline on both simulated and real viral quasispecies data from HIV (and real data from SARS-CoV-2). Experimental results show that ViPRA-Haplo, although it still overestimates the number of true contigs, outperforms the existing tool, PEHaplo, providing up to 9% better genome coverage on HIV real data. In addition, ViPRA-Haplo retains higher diversity of the viral population, as demonstrated by the presence of a higher percentage of contigs shorter than 1000 base pairs (bps), which also contain k-mers with counts less than 100 (representing rarer sequences) that are absent in PEHaplo. For SARS-CoV-2 sequencing data, ViPRA-Haplo reconstructs contigs that cover more than 90% of the reference genome, and we were able to validate known SARS-CoV-2 strains in the sequencing data.


Subject(s)
Algorithms , Genome, Viral , High-Throughput Nucleotide Sequencing , SARS-CoV-2 , High-Throughput Nucleotide Sequencing/methods , SARS-CoV-2/genetics , Genome, Viral/genetics , Humans , Haplotypes/genetics , COVID-19/virology , HIV/genetics , Computational Biology/methods
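
The De Bruijn graph of reads mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the ViPRA implementation: the function name `build_de_bruijn`, the choice of k, and the toy reads are assumptions for illustration, and the sketch ignores the paired-end information that ViPRA uses to select paths.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a mapping node -> {successor node -> number of times the
    corresponding k-mer was seen in the reads}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

# Toy reads (hypothetical data) that overlap by five bases.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, dict(successors))
```

Edge counts such as these are one way to see why rare sequences matter: a path supported by few k-mer occurrences may be either a sequencing error or a low-frequency haplotype.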
2.
PLoS Comput Biol ; 15(3): e1006564, 2019 03.
Article in English | MEDLINE | ID: mdl-30921327

ABSTRACT

Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic; not all individuals have a retrovirus at a specific genomic location. It is possible that HERV-Ks contribute to human disease because people differ in both number and genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers, auto-immune, and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated in part because population prevalence of HERV-K provirus at each polymorphic site is lacking and it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed-length genomic sequences (k-mers) from whole genome sequence data matching a reference set of k-mers unique to each HERV-K locus and applies mixture model-based clustering of these values to account for low depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the prevalence of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the proportion of the KGP populations with any combination of polymorphic HERV-K provirus. Further, because HERV-K is insertionally polymorphic, the genome burden of known polymorphic HERV-K is variable in humans; this burden is lowest in East Asian (EAS) individuals. Our study identifies population-specific sequence variation for HERV-K proviruses at several loci. We expect these resources will advance research on HERV-K contributions to human diseases.


Subject(s)
Endogenous Retroviruses/genetics , Genetics, Population/methods , Genomics/methods , Proviruses/genetics , Racial Groups/genetics , Algorithms , Genome, Human/genetics , Genome, Viral/genetics , Humans , Molecular Epidemiology , Software
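
The k-mer matching step described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' software: the helper names `kmers` and `locus_kmer_fraction`, the value of k, and the toy sequences are assumptions, and the mixture-model clustering applied to the resulting proportions is not shown.

```python
def kmers(seq, k):
    """Return the set of all k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def locus_kmer_fraction(reads, locus_kmers, k):
    """Fraction of the locus-specific reference k-mers seen in at least one read."""
    if not locus_kmers:
        return 0.0
    observed = set()
    for read in reads:
        # Keep only read k-mers that belong to the locus reference set.
        observed |= kmers(read, k) & locus_kmers
    return len(observed) / len(locus_kmers)

# Hypothetical reference set of k-mers unique to one locus, and two toy reads.
locus_kmers = {"ACGT", "CGTA", "TTTT", "GGGG"}
reads = ["ACGTA", "CCCC"]
print(locus_kmer_fraction(reads, locus_kmers, k=4))  # 2 of 4 reference k-mers found
```

A fraction near 1 would suggest the provirus is present at the locus and a fraction near 0 that it is absent; with low-depth data the values fall in between, which is the situation the abstract says the clustering step is meant to handle.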
3.
Comput Struct Biotechnol J ; 15: 388-395, 2017.
Article in English | MEDLINE | ID: mdl-28819548

ABSTRACT

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method utilizes counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times fewer false positive k-mer predictions compared to other methods, which is essential for accurate estimation of viral population diversity and de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants compared to other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
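
The idea of describing a k-mer by counts at several k-mer lengths can be sketched in Python as below. This is a hypothetical feature construction, not the MultiRes code: the functions `count_kmers` and `multires_features`, the use of prefixes of the k-mer, and the chosen sizes are all assumptions, and the random forest training step is omitted.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def multires_features(kmer, reads, sizes):
    """Feature vector: the read count of the k-mer's prefix at each size.

    Sizes longer than the k-mer itself are skipped.
    """
    tables = {s: count_kmers(reads, s) for s in sizes}
    return [tables[s][kmer[:s]] for s in sizes if s <= len(kmer)]

# Toy read (hypothetical data); "ACGT" occurs once, its shorter prefixes twice.
reads = ["ACGTACGA"]
print(multires_features("ACGT", reads, sizes=[2, 3, 4]))
```

Vectors like this, labeled as erroneous or true for a training set, are the kind of input a random forest classifier could be trained on.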

4.
Article in English | MEDLINE | ID: mdl-27168602

ABSTRACT

Metagenomics involves the analysis of genomes of microorganisms sampled directly from their environment. Next Generation Sequencing allows a high-throughput sampling of small segments from genomes in the metagenome to generate reads. To study the properties and relationships of the microorganisms present, clustering can be performed based on the inherent composition of the sampled reads for unknown species. We propose a two-dimensional lattice-based probabilistic model for clustering metagenomic datasets. The occurrence of a species in the metagenome is estimated using a lattice of probabilistic distributions over small sized genomic sequences. The two dimensions denote distributions for different sizes and groups of words respectively. The lattice structure allows for additional support for a node from its neighbors when the probabilistic support for the species using the parameters of the current node is deemed insufficient. We also show convergence for our algorithm. We test our algorithm on simulated metagenomic data containing bacterial species and observe more than 85% precision. We also evaluate our algorithm on an in vitro-simulated bacterial metagenome and on human patient data, and show a better clustering than other algorithms even for short reads and varied abundance. The software and datasets can be downloaded from https://github.com/lattclus/lattice-metage.

6.
J Comput Biol ; 20(6): 453-63, 2013 Jun.
Article in English | MEDLINE | ID: mdl-23718149

ABSTRACT

High genetic variability in viral populations plays an important role in disease progression, pathogenesis, and drug resistance. The last few years have seen significant progress in the development of methods for reconstruction of viral populations using data from next-generation sequencing technologies. These methods identify the differences between individual haplotypes by mapping the short reads to a reference genome. Much less has been published about resolving the population structure when a reference genome is lacking or is not well-defined, which severely limits the application of these new technologies to resolve virus population structure. We describe a computational framework, called Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised, so the number of haplotypes does not have to be specified in advance; and (iii) it identifies the polymorphic sites that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population. The method was evaluated on simulated reads with sequencing errors and 454 pyrosequencing reads from HIV samples. Our method clustered a high percentage of haplotypes with low false-positive rates, even at low genetic diversity.


Subject(s)
Gene Frequency , Genome, Viral , Mutation , Sequence Analysis, DNA/methods , Genetic Variation , Haplotypes
7.
J Biomed Biotechnol ; 2012: 153647, 2012.
Article in English | MEDLINE | ID: mdl-22577288

ABSTRACT

A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length and relative abundances of source genomes in the microbial community. In this paper, we formulate an unsupervised naive Bayes multispecies, multidimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer words that are rare. We employ either a mixture of Gaussians or mixture of Poissons to model reads within each bin. Further, we handle the high-dimensionality and sparsity associated with the data, by grouping the set of words comprising the reads, resulting in a two-way mixture model. Finally, we demonstrate the accuracy and applicability of this method on simulated and real metagenomes. Our method can accurately cluster reads as short as 100 bps and is robust to varying abundances, divergences and read lengths.


Subject(s)
Metagenomics/methods , Models, Genetic , Algorithms , Cluster Analysis , Computer Simulation , Databases, Genetic , Genome, Bacterial
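
The Poisson part of the naive Bayes mixture described in the abstract above can be illustrated with a short Python sketch. It is not the authors' model: the function names, the fixed per-bin word rates and mixing weights, and the toy counts are assumptions, and the parameter estimation, the Gaussian component for short words, and the word grouping are not shown.

```python
import math

def poisson_log_pmf(count, rate):
    """Log of the Poisson probability of observing `count` given `rate` > 0."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def assign_bin(word_counts, bin_rates, bin_weights):
    """Index of the bin (species) with the highest posterior score.

    Word counts are treated as independent given the bin (naive Bayes),
    each following a Poisson distribution with a bin-specific rate.
    """
    scores = []
    for rates, weight in zip(bin_rates, bin_weights):
        log_lik = sum(poisson_log_pmf(c, r) for c, r in zip(word_counts, rates))
        scores.append(math.log(weight) + log_lik)
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical bins with opposite preferences for two words.
bin_rates = [[5.0, 1.0], [1.0, 5.0]]
bin_weights = [0.5, 0.5]
print(assign_bin([6, 0], bin_rates, bin_weights))  # read rich in the first word
print(assign_bin([0, 4], bin_rates, bin_weights))  # read rich in the second word
```

In a full mixture model the rates and weights would be learned from the reads, for example with expectation-maximization, rather than fixed in advance as they are here.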
8.
Chromosoma ; 117(6): 553-67, 2008 Dec.
Article in English | MEDLINE | ID: mdl-18600338

ABSTRACT

To study when and where active genes replicated in early S phase are transcribed, a series of pulse-chase experiments are performed to label replicating chromatin domains (RS) in early S phase and subsequently transcription sites (TS) after chase periods of 0 to 24 h. Surprisingly, transcription activity throughout these chase periods did not show significant colocalization with early RS chromatin domains. Application of novel image segmentation and proximity algorithms, however, revealed close proximity of TS with the labeled chromatin domains independent of chase time. In addition, RNA polymerase II was highly proximal and showed significant colocalization with both TS and the chromatin domains. Based on these findings, we propose that chromatin activated for transcription dynamically unfolds or "loops out" of early RS chromatin domains where it can interact with RNA polymerase II and other components of the transcriptional machinery. Our results further suggest that the early RS chromatin domains are transcribing genes throughout the cell cycle and that multiple chromatin domains are organized around the same transcription factory.


Subject(s)
Cell Nucleus/metabolism , DNA Replication , Transcription, Genetic , Chromatin/metabolism , Chromosome Positioning , HeLa Cells , Humans , Image Processing, Computer-Assisted , RNA Polymerase II/metabolism , S Phase , Time Factors
9.
Conf Proc IEEE Eng Med Biol Soc ; 2006: 3057-61, 2006.
Article in English | MEDLINE | ID: mdl-17946542

ABSTRACT

This paper presents our comparative study of the application of intensity based similarity measures to the problem of matching genomic structures in microscopic images of living cells. As part of our ongoing research, we present here for the first time evidence from experiments and simulations that show the benefit of using an iterative matching algorithm guided by an intensity based similarity measure. Our experimental results are compared against a gold standard and suggest the measures that work best in the presence of fluorescent decay and other problems inherent to time-lapse microscopy. This makes our approach widely applicable in the study of the dynamics of living cells with time-lapse microscopic imaging.


Subject(s)
Cellular Structures/metabolism , Cellular Structures/ultrastructure , Image Processing, Computer-Assisted/statistics & numerical data , Algorithms , Biomedical Engineering , Genomics , Information Theory , Microscopy, Fluorescence
11.
Bioinformatics ; 21(4): 423-9, 2005 Feb 15.
Article in English | MEDLINE | ID: mdl-15608052

ABSTRACT

MOTIVATION: Genome sequencing projects and high-throughput technologies like DNA and protein arrays have resulted in a very large amount of information-rich data. Microarray experimental data are a valuable, but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, when combined with gene expression analysis can increase the statistical significance of the finding. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data. RESULTS: The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method, but allows the user to specify a priori the influence of each data type on the final clustering using probabilities. AVAILABILITY: Software code is available by request from the first author. CONTACT: jkasturi@cse.psu.edu.


Subject(s)
Algorithms , Artificial Intelligence , Databases, Genetic , Gene Expression Profiling/methods , Information Storage and Retrieval/methods , Models, Genetic , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Cluster Analysis , Database Management Systems , Genome, Fungal , Genomics/methods , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
12.
Bioinformatics ; 19(4): 449-58, 2003 Mar 01.
Article in English | MEDLINE | ID: mdl-12611799

ABSTRACT

MOTIVATION: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data. RESULTS: The Kullback-Leibler divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined. AVAILABILITY: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).


Subject(s)
Algorithms , Gene Expression Profiling/methods , Gene Expression Regulation/physiology , Oligonucleotide Array Sequence Analysis/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Cell Cycle/genetics , Cluster Analysis , Fibroblasts/physiology , Gene Expression Regulation/genetics , Humans , Models, Genetic , Models, Statistical , Pattern Recognition, Automated , Software , Time Factors , Yeasts/cytology , Yeasts/genetics
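
The Kullback-Leibler divergence named in the abstract above is D(p || q) = sum_i p_i log(p_i / q_i) for two discrete distributions p and q. A minimal Python sketch follows. The helper names and the toy expression profiles are assumptions, and treating a normalized expression profile as a distribution is one plausible reading of the abstract, not a description of the authors' ANSI C and Matlab code. The symmetrized form is included only as a common way to get a symmetric dissimilarity; the abstract does not say whether it was used.

```python
import math

def normalize(profile):
    """Scale a non-negative expression profile so that it sums to 1."""
    total = sum(profile)
    return [x / total for x in profile]

def kl_divergence(p, q):
    """D(p || q) in nats; terms with p_i == 0 contribute 0.

    Requires q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrized divergence, D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Two hypothetical time-series expression profiles for a pair of genes.
gene_a = normalize([2.0, 2.0])
gene_b = normalize([1.0, 3.0])
print(kl_divergence(gene_a, gene_b))
print(symmetric_kl(gene_a, gene_b))
```

A dissimilarity of this kind can replace the Euclidean distance when choosing the best-matching unit in a self-organizing map, which is the combination the abstract describes.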