ABSTRACT
Motivation: De Bruijn graphs are a common assembly data structure for sequencing datasets. With advances in sequencing technologies, however, assembling high-coverage datasets has become a computational challenge. Read normalization, which removes redundancy in datasets, is widely applied to reduce resource requirements. Current normalization algorithms, though efficient, provide no guarantee of preserving important k-mers that form connections between regions in the graph. Results: Here, normalization is phrased as a set multi-cover problem on reads and a heuristic algorithm, Optimized Read Normalization Algorithm (ORNA), is proposed. ORNA normalizes to the minimum number of reads required to retain all k-mers and their relative k-mer abundances from the original dataset. Hence, all connections from the original graph are preserved. ORNA was tested on various RNA-seq datasets with different coverage values. It was compared to current normalization algorithms and was found to perform better. Normalizing error-corrected data allows for more accurate assemblies compared to the normalized uncorrected dataset. Further, an application is proposed in which multiple datasets are combined and normalized to predict novel transcripts that would have been missed otherwise. Finally, ORNA is a general-purpose normalization algorithm that is fast and significantly reduces datasets, with a loss of assembly quality between 1% and 30% depending on reduction stringency. Availability and implementation: ORNA is available at https://github.com/SchulzLab/ORNA. Supplementary information: Supplementary data are available at Bioinformatics online.
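The set multi-cover idea can be illustrated with a minimal sketch. It assumes that each k-mer must be covered a number of times equal to a log-scaled version of its abundance in the original dataset, and that reads are scanned once in input order. The function names, the `base` parameter, and the exact threshold formula are illustrative assumptions and are not taken from the ORNA source code.

```python
import math
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k, base=1.7):
    """Greedy multi-cover sketch: keep a read only if it still helps cover some k-mer."""
    # Abundance of every k-mer in the original dataset.
    abundance = Counter(km for read in reads for km in kmers(read, k))
    # Assumed requirement: cover each k-mer about log_base(abundance) times, at least once.
    required = {km: max(1, math.ceil(math.log(count, base)))
                for km, count in abundance.items()}

    covered = Counter()  # how often each k-mer occurs in the retained reads
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if any(covered[km] < required[km] for km in read_kmers):
            kept.append(read)
            covered.update(read_kmers)
    return kept


reads = ["ACGTAC"] * 10 + ["TTGCAA"]
reduced = normalize_reads(reads, k=3)
print(len(reads), "->", len(reduced))
```

Because every k-mer has a requirement of at least one, no k-mer of the input disappears from the reduced set, while highly redundant reads are dropped once their k-mers are sufficiently covered.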
Subjects
Algorithms , Computer Simulation , Sequence Analysis, RNA , Computational Biology
ABSTRACT
Across kingdoms, RNA interference (RNAi) has been shown to control gene expression at the transcriptional or the post-transcriptional level. Here, we describe a mechanism which involves both aspects: truncated transgenes, which fail to produce intact mRNA, induce siRNA accumulation and silencing of homologous loci in trans in the ciliate Paramecium. We show that silencing is achieved by co-transcriptional silencing, associated with repressive histone marks at the endogenous gene. This is accompanied by secondary siRNA accumulation, strictly limited to the open reading frame of the remote locus. Our data show that in this mechanism, heterochromatic marks depend on a variety of RNAi components. These include RDR3 and PTIWI14 as well as a second set of components, which are also involved in post-transcriptional silencing: RDR2, PTIWI13, DCR1 and CID2. Our data indicate differential processing of nascent unspliced and long, spliced transcripts, thus suggesting a hitherto unrecognized functional interaction between post-transcriptional and co-transcriptional RNAi. Both sets of RNAi components are required for efficient trans-acting RNAi at the chromatin level, and our data indicate that similar mechanisms contribute to genome-wide regulation of gene expression by epigenetic mechanisms.
Subjects
Heterochromatin/metabolism , Paramecium/genetics , Protozoan Proteins/genetics , RNA Interference , RNA, Double-Stranded/genetics , Transgenes , Chromatin Assembly and Disassembly , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Escherichia coli/genetics , Escherichia coli/metabolism , Gene Expression Profiling , Gene Expression Regulation , Gene Ontology , Heterochromatin/chemistry , Molecular Sequence Annotation , Paramecium/metabolism , Plasmids/chemistry , Plasmids/metabolism , Polynucleotide Adenylyltransferase/genetics , Polynucleotide Adenylyltransferase/metabolism , Protozoan Proteins/antagonists & inhibitors , Protozoan Proteins/metabolism , RNA, Double-Stranded/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism , RNA, Small Interfering/genetics , RNA, Small Interfering/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
MOTIVATION: De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer or meta transcriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. The quality of the assemblies produced by such assemblers is highly influenced by the exact word length k. As such, no single kmer value leads to optimal results. Instead, DBGs over different kmer values are built and the assemblies are merged to improve sensitivity. However, no studies have thoroughly investigated the problem of automatically learning at which kmer value to stop the assembly. Instead, a suboptimal selection of kmer values is often used in practice. RESULTS: Here we investigate the contribution of a single kmer value in a multi-kmer based assembly approach. We find that a comparative clustering of related assemblies can be used to estimate the importance of an additional kmer assembly. Using a model-fit-based algorithm, we predict the kmer value at which no further assemblies are necessary. Our approach is tested with different de novo assemblers for datasets with different coverage values and read lengths. Further, we suggest a simple post-processing step that significantly improves the quality of multi-kmer assemblies. CONCLUSION: We provide an automatic method for limiting the number of kmer values without a significant loss in assembly quality but with savings in assembly time. This is a step forward to making multi-kmer methods more reliable and easier to use. AVAILABILITY AND IMPLEMENTATION: A general implementation of our approach can be found at https://github.com/SchulzLab/KREATION. Supplementary information: Supplementary data are available at Bioinformatics online. CONTACT: mschulz@mmci.uni-saarland.de.
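The stopping idea can be illustrated with a small sketch. It assumes that each additional kmer assembly is summarized by the number of new clusters it contributes (a hypothetical input), fits an exponential decay to these counts by least squares on the log scale, and reports the first candidate kmer value whose fitted contribution falls below a chosen fraction of the first one. The model form, the `min_fraction` parameter, and all function names are illustrative assumptions and do not reproduce KREATION's actual procedure.

```python
import math


def fit_exponential_decay(xs, ys):
    """Least-squares fit of log(y) = a + b * x; returns (a, b). Requires every y > 0."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_x
    return intercept, slope


def predict_stop_k(k_values, new_clusters, candidate_ks, min_fraction=0.05):
    """Return the first candidate k whose fitted contribution is below the cutoff."""
    a, b = fit_exponential_decay(k_values, new_clusters)
    first = math.exp(a + b * k_values[0])  # fitted contribution of the first assembly
    for k in candidate_ks:
        if math.exp(a + b * k) < min_fraction * first:
            return k
    return None  # no candidate falls below the cutoff


# Hypothetical counts of new clusters contributed by assemblies at k = 21, 31, 41.
stop_k = predict_stop_k([21, 31, 41], [1000, 500, 250], [51, 61, 71, 81])
print(stop_k)
```

In this toy input the contribution halves with every step of 10 in k, so the fitted curve drops below 5% of the first contribution at k = 71.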
Subjects
Transcriptome , Algorithms , Cluster Analysis , High-Throughput Nucleotide Sequencing
ABSTRACT
A genome-scale metabolic model provides an overview of an organism's metabolic capability. These genome-specific metabolic reconstructions are based on the identification of gene to protein to reaction (GPR) associations and, in turn, on homology with annotated genes from other organisms. Cyanobacteria are photosynthetic prokaryotes which have diverged appreciably from their nonphotosynthetic counterparts. They also show significant evolutionary divergence from plants, which are well studied for their photosynthetic apparatus. We argue that context-specific sequence and domain similarity can add to the repertoire of GPR associations and significantly expand our view of the metabolic capability of cyanobacteria. We took an approach that combines the results of context-specific sequence-to-sequence similarity searches with those of sequence-to-profile searches. We employ PSI-BLAST for the former, and CDD, Pfam, and COG for the latter. An optimization algorithm was devised to arrive at a weighting scheme that combines the different lines of evidence, with KEGG-annotated GPRs as training data. We present the algorithm in the form of the software "Systematic, Homology-based Automated Re-annotation for Prokaryotes (SHARP)." We predicted 3,781 new GPR associations for the 10 prokaryotes considered, of which eight are cyanobacterial species. These new GPR associations fall in several metabolic pathways and were used to annotate 7,718 gaps in the metabolic network. These new annotations led to the discovery of several pathways that may be active, thereby providing new directions for metabolic engineering of these species for the production of useful products. A metabolic model developed on such a reconstructed network is likely to give better phenotypic predictions.
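The idea of learning a weighting scheme over several evidence sources can be sketched as follows. This toy example assumes that each source yields a score in [0, 1] for a candidate GPR association, combines the scores by a weighted sum, and picks the weights by an exhaustive search over a coarse grid that maximizes accuracy on labeled training pairs. The source keys, the grid search, and the `cutoff` value are illustrative assumptions and do not represent the optimization algorithm used in SHARP.

```python
from itertools import product

# Hypothetical keys for the four evidence sources named in the abstract.
SOURCES = ("psiblast", "cdd", "pfam", "cog")


def combined_score(evidence, weights):
    """Weighted sum of per-source scores; a missing source counts as 0."""
    return sum(weights[src] * evidence.get(src, 0.0) for src in weights)


def accuracy(weights, training, cutoff=0.5):
    """Fraction of training examples whose thresholded score matches the label."""
    correct = sum((combined_score(ev, weights) >= cutoff) == label
                  for ev, label in training)
    return correct / len(training)


def grid_search_weights(training, sources=SOURCES, steps=4, cutoff=0.5):
    """Try every weight vector on a grid of step 1/steps whose entries sum to 1."""
    best_weights, best_acc = None, -1.0
    for combo in product(range(steps + 1), repeat=len(sources)):
        if sum(combo) != steps:
            continue
        weights = {src: c / steps for src, c in zip(sources, combo)}
        acc = accuracy(weights, training, cutoff)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Made-up training data: (evidence scores, is the association annotated?).
training = [
    ({"psiblast": 0.9, "pfam": 0.1}, False),
    ({"psiblast": 0.2, "pfam": 0.9}, True),
    ({"psiblast": 0.8, "pfam": 0.8}, True),
    ({"psiblast": 0.1, "pfam": 0.2}, False),
]
weights, acc = grid_search_weights(training)
print(weights, acc)
```

A grid search is used here only because it is easy to read; any optimizer that scores weight vectors against the training annotations would fit the same outline.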
Subjects
Cyanobacteria/genetics , Genome, Bacterial , Metabolic Networks and Pathways , Molecular Sequence Annotation , Cyanobacteria/metabolism
ABSTRACT
Specialized de novo assemblers for diverse data types have been developed and are in widespread use for the analysis of single-cell genomics, metagenomics and RNA-seq data. However, assembly of large sequencing datasets produced by modern technologies is challenging and computationally intensive. In-silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings in assembly pipelines. Previously, we presented a set multi-cover optimization-based approach, ORNA, where reads are reduced without losing important k-mer connectivity information, as used in assembly graphs. Here we propose extensions to ORNA, named ORNA-Q and ORNA-K, which consider a weighted set multi-cover optimization formulation for the in-silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or k-mer abundances of reads (ORNA-K) to improve normalization further. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble more or equally many full-length transcripts compared to other normalization methods at similar or higher read reduction values. The algorithm is implemented in the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
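One simple way to let quality scores influence normalization is sketched below. It assumes that reads are ranked by their mean Phred quality (decoded from FASTQ quality strings with offset 33) and then passed through a greedy multi-cover filter with a log-scaled coverage requirement per k-mer. The ranking rule, the `base` parameter, and all function names are illustrative assumptions and are not the heuristic implemented in ORNA-Q.

```python
import math
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (0.0 for an empty string)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def normalize_by_quality(records, k, base=1.7):
    """records: list of (sequence, quality_string); returns the retained records."""
    abundance = Counter(km for seq, _ in records for km in kmers(seq, k))
    # Assumed requirement: cover each k-mer about log_base(abundance) times, at least once.
    required = {km: max(1, math.ceil(math.log(count, base)))
                for km, count in abundance.items()}

    # Visit high-quality reads first so they are preferred when k-mers are redundant.
    ordered = sorted(records, key=lambda rec: mean_phred(rec[1]), reverse=True)

    covered = Counter()
    kept = []
    for seq, qual in ordered:
        read_kmers = kmers(seq, k)
        if any(covered[km] < required[km] for km in read_kmers):
            kept.append((seq, qual))
            covered.update(read_kmers)
    return kept


records = [("ACGTAC", "!!!!!!"), ("ACGTAC", "IIIIII"), ("ACGTAC", "555555")]
print(normalize_by_quality(records, k=3, base=10))
```

Replacing the sort key with a score derived from the k-mer abundances of each read would give an abundance-driven variant in the spirit of ORNA-K, again only as an illustration.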