1.
Nat Methods ; 2024 Jun 07.
Article En | MEDLINE | ID: mdl-38849569

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

2.
bioRxiv ; 2023 Jul 27.
Article En | MEDLINE | ID: mdl-37546854

The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

3.
Cell Death Dis ; 14(1): 19, 2023 01 12.
Article En | MEDLINE | ID: mdl-36635266

The abnormal tumor microenvironment (TME) often dictates the therapeutic response of cancer to chemo- and immuno-therapy. Aberrant expression of pericentromeric satellite repeats has been reported for epithelial cancers, including lung cancer. However, the transcription of tandemly repetitive elements in stromal cells of the TME has been unappreciated, limiting the optimal use of satellite transcripts as biomarkers or anti-cancer targets. We found that pericentromeric satellite DNA (satDNA) in mouse and human lung adenocarcinoma was transcribed in cancer-associated fibroblasts (CAFs). In vivo, lung fibroblasts expressed pericentromeric satellite repeats HS2/HS3 specifically in tumors. In vitro, transcription of satDNA was induced in lung fibroblasts in response to TGFβ, IL1α, matrix stiffness, direct contact with tumor cells and treatment with chemotherapeutic drugs. Single-cell transcriptome analysis of human lung adenocarcinoma confirmed that CAFs were the cell type with the highest number of satellite transcripts. Human HS2/HS3 pericentromeric transcripts were detected in the nucleus, in the cytoplasm and extracellularly, and co-localized with extracellular vesicles in situ in human biopsies and activated fibroblasts in vitro. The transcripts were transmitted into recipient cells and entered their nuclei. Knock-down of satellite transcripts in human lung fibroblasts attenuated cellular senescence and blocked the formation of an inflammatory CAF phenotype, which resulted in the inhibition of their pro-tumorigenic functions. In sum, our data suggest that satellite long non-coding (lnc) RNAs are induced in CAFs, regulate expression of inflammatory genes and can be secreted from the cells, which might represent a new element of cell-cell communication in the TME.


Adenocarcinoma , Cancer-Associated Fibroblasts , Lung Neoplasms , RNA, Long Noncoding , Humans , Animals , Mice , Cancer-Associated Fibroblasts/metabolism , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , Fibroblasts/metabolism , DNA, Satellite , Lung Neoplasms/pathology , Adenocarcinoma/genetics , Lung , Carcinogenesis/genetics , Tumor Microenvironment/genetics
4.
Int J Mol Sci ; 24(2)2023 Jan 11.
Article En | MEDLINE | ID: mdl-36674941

Developing protocols for the differentiation of human pluripotent stem cells into dopamine neurons is important for the development of cell replacement therapy for Parkinson's disease. A number of protocols have already been developed; however, their efficiency and specificity can still be improved. Investigating the role of signaling cascades that are important for neurogenesis can help to solve this problem and provide a deeper understanding of their role in neuronal development. Notch signaling plays an essential role in the development and maintenance of the central nervous system after birth. In our study, we analyzed the effect of Notch activation and inhibition at the early stages of differentiation of human induced pluripotent stem cells into dopaminergic neurons. We found that, during the first seven days of differentiation, the cells were not sensitive to Notch inhibition. In contrast, activation of Notch signaling during the same period led to significant changes and was associated with an increase in the expression of genes specific for caudal parts of the brain, a decrease in the expression of genes specific for the forebrain, and a decrease in the expression of genes important for the formation of axons and dendrites and of microtubule-stabilizing proteins.


Induced Pluripotent Stem Cells , Pluripotent Stem Cells , Humans , Dopaminergic Neurons/metabolism , Induced Pluripotent Stem Cells/metabolism , Cell Differentiation , Pluripotent Stem Cells/metabolism , Signal Transduction , Receptors, Notch/metabolism
5.
Nat Biotechnol ; 41(7): 915-918, 2023 Jul.
Article En | MEDLINE | ID: mdl-36593406

Annotating newly sequenced genomes and determining alternative isoforms from long-read RNA data are complex and incompletely solved problems. Here we present IsoQuant, a computational tool using intron graphs that accurately reconstructs transcripts both with and without reference genome annotation. For novel transcript discovery, IsoQuant reduces the false-positive rate fivefold and 2.5-fold for Oxford Nanopore reference-based or reference-free mode, respectively. IsoQuant also improves performance for Pacific Biosciences data.


High-Throughput Nucleotide Sequencing , RNA , Protein Isoforms/genetics , Sequence Analysis, RNA , Genome , Sequence Analysis, DNA
6.
Front Microbiol ; 13: 981458, 2022.
Article En | MEDLINE | ID: mdl-36386613

While metagenome sequencing may provide insights into the genome sequences and composition of microbial communities, metatranscriptome analysis can be useful for studying the functional activity of a microbiome. RNA-Seq data make it possible to determine which genes are active in the community and how their expression levels depend on external conditions. Although the field of metatranscriptomics is relatively young, the number of projects related to metatranscriptome analysis increases every year and the scope of its applications expands. However, several problems complicate metatranscriptome analysis: the complexity of microbial communities, the wide dynamic range of transcriptome expression and, importantly, the lack of high-quality computational methods for assembling meta-RNA sequencing data. These factors degrade the contiguity and completeness of metatranscriptome assemblies and therefore affect downstream analysis. Here we present MetaGT, a pipeline for de novo assembly of metatranscriptomes, which is based on the idea of combining metatranscriptomic and metagenomic data sequenced from the same sample. MetaGT assembles metatranscriptomic contigs and fills in missing regions based on their alignments to the metagenome assembly. This approach overcomes the described complexities, yields complete RNA sequences, and additionally estimates their abundances. Using various publicly available real and simulated datasets, we demonstrate that MetaGT yields significant improvement in coverage and completeness of metatranscriptome assemblies compared to existing methods that do not exploit metagenomic data. The pipeline is implemented in NextFlow and is freely available from https://github.com/ablab/metaGT.

7.
Nat Biotechnol ; 40(7): 1082-1092, 2022 07.
Article En | MEDLINE | ID: mdl-35256815

Single-nuclei RNA sequencing characterizes cell types at the gene level. However, compared to single-cell approaches, many single-nuclei cDNAs are purely intronic, lack barcodes and hinder the study of isoforms. Here we present single-nuclei isoform RNA sequencing (SnISOr-Seq). Using microfluidics, PCR-based artifact removal, target enrichment and long-read sequencing, SnISOr-Seq increased barcoded, exon-spanning long reads 7.5-fold compared to naive long-read single-nuclei sequencing. We applied SnISOr-Seq to adult human frontal cortex and found that exons associated with autism exhibit coordinated and highly cell-type-specific inclusion. We found two distinct combination patterns: those distinguishing neural cell types, enriched in TSS-exon, exon-polyadenylation-site and non-adjacent exon pairs, and those with multiple configurations within one cell type, enriched in adjacent exon pairs. Finally, we observed that human-specific exons are almost as tightly coordinated as conserved exons, implying that coordination can be rapidly established during evolution. SnISOr-Seq enables cell-type-specific long-read isoform analysis in human brain and in any frozen or hard-to-dissociate sample.


Brain , RNA , Alternative Splicing/genetics , Brain/metabolism , Exons/genetics , Humans , Protein Isoforms/genetics , RNA/genetics , Sequence Analysis, RNA
8.
Genome Res ; 32(4): 726-737, 2022 04.
Article En | MEDLINE | ID: mdl-35301264

Long-read transcriptomics requires understanding error sources inherent to technologies. Current approaches cannot compare methods for an individual RNA molecule. Here, we present a novel platform-comparison method that combines barcoding strategies and long-read sequencing to sequence cDNA copies representing an individual RNA molecule on both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). We compare these long-read pairs in terms of sequence content and isoform patterns. Although individual read pairs show high similarity, we find differences in (1) aligned length, (2) transcription start site (TSS), (3) polyadenylation site (poly(A)-site) assignment, and (4) exon-intron structures. Overall, 25% of read pairs disagree on either TSS, poly(A)-site, or splice site. Intron-chain disagreement typically arises from alignment errors of microexons and complicated splice sites. Our single-molecule technology comparison reveals that inconsistencies are often caused by sequencing error-induced inaccurate ONT alignments, especially to downstream GUNNGU donor motifs. However, annotation-disagreeing upstream shifts in NAGNAG acceptors in ONT are often confirmed by PacBio and are thus likely real. In both barcoded and nonbarcoded ONT reads, we find that intron number and proximity of GU/AGs better predict inconsistencies with the annotation than read quality alone. We summarize these findings in an annotation-based algorithm for spliced alignment correction that improves subsequent transcript construction with ONT reads.


Nanopores , DNA, Complementary , High-Throughput Nucleotide Sequencing/methods , RNA , Sequence Analysis, DNA/methods , Technology
9.
Int J Mol Sci ; 22(16)2021 Aug 16.
Article En | MEDLINE | ID: mdl-34445502

Trace amine-associated receptors (TAAR) recognize organic compounds, including primary, secondary, and tertiary amines. The TAAR5 receptor is known to be involved in the olfactory sensing of innate socially relevant odors encoded by volatile amines. However, emerging data point to the involvement of TAAR5 in brain functions, particularly in the emotional behaviors mediated by the limbic system which suggests its potential contribution to the pathogenesis of neuropsychiatric diseases. TAAR5 expression was explored in datasets available in the Gene Expression Omnibus, Allen Brain Atlas, and Human Protein Atlas databases. Transcriptomic data demonstrate ubiquitous low TAAR5 expression in the cortical and limbic brain areas, the amygdala and the hippocampus, the nucleus accumbens, the thalamus, the hypothalamus, the basal ganglia, the cerebellum, the substantia nigra, and the white matter. Altered TAAR5 expression is identified in Down syndrome, major depressive disorder, or HIV-associated encephalitis. Taken together, these data indicate that TAAR5 in humans is expressed not only in the olfactory system but also in certain brain structures, including the limbic regions receiving olfactory input and involved in critical brain functions. Thus, TAAR5 can potentially be involved in the pathogenesis of brain disorders and represents a valuable novel target for neuropsychopharmacology.


Brain/metabolism , Depressive Disorder, Major/genetics , Down Syndrome/genetics , Down-Regulation , Encephalitis, Viral/genetics , HIV Infections/complications , Receptors, G-Protein-Coupled/genetics , Databases, Genetic , Encephalitis, Viral/etiology , Gene Expression Profiling , Gene Expression Regulation , HIV Infections/genetics , Humans , Oligonucleotide Array Sequence Analysis , Sequence Analysis, RNA , Tissue Distribution
10.
Sci Rep ; 10(1): 19981, 2020 11 17.
Article En | MEDLINE | ID: mdl-33203921

Stress-related neuropsychiatric disorders are widespread, debilitating and often treatment-resistant illnesses that represent an urgent unmet biomedical problem. Animal models of these disorders are widely used to study stress pathogenesis. A more recent and historically less utilized model organism, the zebrafish (Danio rerio), is a valuable tool in stress neuroscience research. Utilizing the 5-week chronic unpredictable stress (CUS) model, here we examined brain transcriptomic profiles and complex dynamic behavioral stress responses, as well as neurochemical alterations, in adult zebrafish and their correction by chronic treatment with the antidepressant fluoxetine. Overall, CUS induced complex neurochemical and behavioral alterations in zebrafish, including stable anxiety-like behaviors and serotonin metabolism deficits. Chronic fluoxetine (0.1 mg/L for 11 days) rescued most of the observed behavioral and neurochemical responses. Finally, whole-genome brain transcriptomic analyses revealed altered expression of various CNS genes (partially rescued by chronic fluoxetine), including inflammation-, ubiquitin- and arrestin-related genes. Collectively, these findings support zebrafish as a valuable translational tool for studying stress-related pathogenesis: their anxiety and serotonergic deficits parallel rodent and clinical studies, and the genomic analyses implicate neuroinflammation, structural neuronal remodeling and arrestin/ubiquitin pathways in both stress pathogenesis and its potential therapy.


Behavior, Animal/physiology , Stress, Psychological/physiopathology , Transcriptome/physiology , Zebrafish/physiology , Animals , Antidepressive Agents/pharmacology , Anxiety/drug therapy , Anxiety/physiopathology , Behavior, Animal/drug effects , Brain/drug effects , Brain/physiopathology , Disease Models, Animal , Female , Fluoxetine/pharmacology , Male , Stress, Psychological/drug therapy , Transcriptome/drug effects
11.
Proc Natl Acad Sci U S A ; 117(44): 27300-27306, 2020 11 03.
Article En | MEDLINE | ID: mdl-33087570

Conventional "bulk" PCR often yields inefficient and nonuniform amplification of complex templates in DNA libraries, introducing unwanted biases. Amplification of single DNA molecules encapsulated in a myriad of emulsion droplets (emulsion PCR, ePCR) mitigates this problem. Different ePCR regimes were experimentally analyzed to identify the most robust techniques for enhanced amplification of DNA libraries. A phenomenological mathematical model that forms an essential basis for optimal use of ePCR for library amplification was developed. A detailed description by high-throughput sequencing of amplified DNA-encoded libraries highlights the principal advantages of ePCR over bulk PCR. ePCR outperforms PCR, reduces gross DNA errors, and provides a more uniform distribution of the amplified sequences. The quasi single-molecule amplification achieved via ePCR represents the fundamental requirement for complex DNA templates that are prone to diversity degeneration and provides a way to preserve the quality of DNA libraries.


Emulsions/chemistry , High-Throughput Nucleotide Sequencing/methods , Polymerase Chain Reaction/methods , DNA/genetics , DNA Primers/genetics , Gene Library , Genome/genetics , Humans , Models, Theoretical , Nucleic Acid Amplification Techniques/methods , Templates, Genetic
12.
Plants (Basel) ; 9(9)2020 Sep 18.
Article En | MEDLINE | ID: mdl-32961840

The association among environmental cues, ethylene response, ABA signaling, and reactive oxygen species (ROS) homeostasis in the process of seed dormancy release is now well established in many species. Alternating temperatures are recognized as one of the main environmental signals determining dormancy release, but their underlying mechanisms are scarcely known. Dry after-ripened wild cardoon achenes germinated poorly at a constant temperature of 20, 15, or 10 °C, whereas germination was stimulated by 80% at alternating temperatures of 20/10 °C. Using an RNA-Seq approach, we identified 23,640 and annotated 14,078 gene transcripts expressed in dry achenes and achenes exposed to constant or alternating temperatures. Transcriptional patterns identified in the dry condition included seed reserve and response to dehydration stress genes (i.e., HSPs, peroxidases, and LEAs). At a constant temperature, we observed an upregulation of ABA biosynthesis genes (i.e., NCED9), ABA-responsive genes (i.e., ABI5 and TAP), as well as other genes previously related to physiological dormancy and inhibition of germination. However, the alternating temperatures were associated with the upregulation of ethylene metabolism (i.e., ACO1, 4, and ACS10) and signaling (i.e., EXPs) genes and ROS homeostasis regulator genes (i.e., RBOH and CAT). Accordingly, ethylene production was twice as high at alternating as at constant temperatures. The presence of ethylene or ROS synthesis and signaling inhibitors in the germination medium significantly, but not completely, reduced germination at 20/10 °C. Conversely, the presence of methyl viologen and salicylhydroxamic acid (SHAM), a peroxidase inhibitor, partially increased germination at constant temperature. Taken together, the present study provides the first insights into the gene expression patterns and physiological response associated with dormancy release at alternating temperatures in wild cardoon (Cynara cardunculus var. sylvestris).

13.
BMC Genomics ; 21(1): 317, 2020 Aug 21.
Article En | MEDLINE | ID: mdl-32819282

BACKGROUND: The investigation of transcriptome profiles using short reads in non-model organisms, which lack well-annotated genomes, is limited by partial gene reconstruction and isoform detection. In contrast, long-read sequencing techniques have shown their potential to generate complete transcript assemblies even when a reference genome is lacking. Cynara cardunculus var. altilis (DC) (cultivated cardoon) is a perennial hardy crop adapted to dry environments with many industrial and nutraceutical applications due to the richness of secondary metabolites mostly produced in flower heads. The investigation of this species benefited from the recent release of a draft genome, but the transcriptome profile during capitulum formation remains unexplored. In the present study we show a transcriptome analysis of vegetative and inflorescence organs of cultivated cardoon through a novel hybrid RNA-seq assembly approach utilizing both long and short RNA-seq reads. RESULTS: The inclusion of a single Nanopore flow-cell output in a hybrid sequencing approach yielded an increase of 15% in complete assembled genes and 18% in transcript isoforms with respect to short reads alone. Among 25,463 assembled unigenes, we identified 578 new genes and updated 13,039 gene models, 11,169 of which were alternatively spliced isoforms. During capitulum development, 3424 genes were differentially expressed and approximately two-thirds were identified as transcription factors including bHLH, MYB, NAC, C2H2 and MADS-box, which were highly expressed especially after capitulum opening. We also show the expression dynamics of key genes involved in the production of the valuable secondary metabolites in which the capitulum is rich, such as phenylpropanoids, flavonoids and sesquiterpene lactones. Most of their biosynthetic genes were strongly transcribed in the flower heads, with alternative isoforms exhibiting differential expression levels across the tissues.
CONCLUSIONS: This novel hybrid sequencing approach improved the transcriptome assembly, updated more than half of the annotated genes and identified many novel genes and different alternatively spliced isoforms. This study provides new insights into the flowering cycle in an Asteraceae plant, a valuable resource for plant biology and breeding in Cynara and an effective method for improving gene annotation.


Cynara , Transcriptome , Cynara/genetics , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Plant Breeding
14.
BMC Bioinformatics ; 21(Suppl 12): 302, 2020 Jul 24.
Article En | MEDLINE | ID: mdl-32703149

BACKGROUND: De novo RNA-Seq assembly is a powerful method for analysing transcriptomes when the reference genome is not available or poorly annotated. However, due to the short length of Illumina reads it is usually impossible to reconstruct complete sequences of complex genes and alternative isoforms. The recently emerged possibility of generating long RNA reads, such as those from PacBio and Oxford Nanopore, may dramatically improve the assembly quality, and thus the subsequent analysis. While reference-based tools for analysing long RNA reads were recently developed, there is no established pipeline for de novo assembly of such data. RESULTS: In this work we present a novel method that produces high-quality de novo transcriptome assemblies by combining the accuracy and reliability of short reads with exon structure information derived from long error-prone reads. The algorithm is designed by incorporating the existing hybridSPAdes approach into the rnaSPAdes pipeline and adapting it for transcriptomic data. CONCLUSION: To evaluate the benefit of using long RNA reads we selected several datasets containing both Illumina and Iso-seq or Oxford Nanopore Technologies (ONT) reads. Using existing quality assessment software, we show that hybrid assemblies performed with rnaSPAdes contain more full-length genes and alternative isoforms compared to the case when only short-read data is used.


Algorithms , Transcriptome/genetics , Databases, Genetic , Humans , MCF-7 Cells , Nanopores , RNA-Seq , Reproducibility of Results
15.
Gigascience ; 8(9)2019 09 01.
Article En | MEDLINE | ID: mdl-31494669

BACKGROUND: The possibility of generating large RNA-sequencing datasets has led to the development of various reference-based and de novo transcriptome assemblers with their own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to organisms with finished and well-annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing, and paralogous genes. RESULTS: Herein we describe the novel transcriptome assembler rnaSPAdes, which has been developed on top of the SPAdes genome assembler and explores computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-sequencing datasets, and briefly highlight strong and weak points of different assemblers. CONCLUSIONS: Based on the comparison performed between different assembly methods, we infer that no single assembler leads across all quality metrics and all datasets used. However, rnaSPAdes typically outperforms other assemblers in such important properties as the number of assembled genes and isoforms, and at the same time has higher accuracy statistics on average compared to its closest competitors.


Algorithms , RNA-Seq , Transcriptome , Animals , Arabidopsis/genetics , Caenorhabditis elegans/genetics , Humans , Mice , Zea mays/genetics
16.
Bioinformatics ; 35(13): 2303-2305, 2019 07 01.
Article En | MEDLINE | ID: mdl-30475983

SUMMARY: Scaffolding is an important step in every genome assembly pipeline; it orders contigs into longer sequences using various types of linkage information, such as mate-pair libraries and long reads. In this work, we operate with the notion of a scaffold graph: a graph whose vertices correspond to the assembled contigs and whose edges represent connections between them. We present a software package called Scaffold Graph ToolKit (SGTK) that constructs and visualizes scaffold graphs using different kinds of sequencing data. We show that the scaffold graph is useful for analyzing and assessing genome assemblies, and demonstrate several use cases that can be helpful for both assembly software developers and their users. AVAILABILITY AND IMPLEMENTATION: SGTK is implemented in C++, Python and JavaScript and is freely available at https://github.com/olga24912/SGTK. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Software , Sequence Analysis, DNA
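
The scaffold graph defined in the abstract above (contigs as vertices, linkage evidence as edges) can be illustrated with a minimal Python sketch. This is not SGTK's implementation or API: the function names `build_scaffold_graph` and `greedy_path`, the `min_support` threshold, and the toy contig names are all hypothetical, chosen only to show the data structure.

```python
from collections import defaultdict


def build_scaffold_graph(links, min_support=2):
    """Build a toy scaffold graph from pairwise contig links.

    links: iterable of (contig_a, contig_b) pairs, one per piece of linkage
    evidence (for example, a read pair or long read joining the two contigs).
    Returns {contig: {neighbor: support}}, keeping only edges backed by at
    least min_support pieces of evidence (a hypothetical filtering rule).
    """
    support = defaultdict(int)
    for a, b in links:
        if a != b:  # ignore evidence that stays within a single contig
            support[(a, b)] += 1

    graph = defaultdict(dict)
    for (a, b), count in support.items():
        if count >= min_support:
            graph[a][b] = count
    return dict(graph)


def greedy_path(graph, start):
    """Follow the best-supported unvisited neighbor until none remains."""
    path, seen = [start], {start}
    while True:
        candidates = {
            n: s for n, s in graph.get(path[-1], {}).items() if n not in seen
        }
        if not candidates:
            return path
        best = max(candidates, key=candidates.get)
        path.append(best)
        seen.add(best)


# Toy evidence: c1->c2 twice, c2->c3 three times, c1->c3 once (filtered out).
links = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"),
         ("c2", "c3"), ("c2", "c3"), ("c1", "c3")]
graph = build_scaffold_graph(links)
print(greedy_path(graph, "c1"))
```

Storing the support count on each edge is what lets weakly supported connections be filtered or inspected, which is the kind of assessment the abstract describes; real scaffolders also track orientation and gap estimates, which this sketch omits.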
17.
Bioinformatics ; 32(14): 2210-2, 2016 07 15.
Article En | MEDLINE | ID: mdl-27153654

The ability to generate large RNA-Seq datasets has created a demand for both de novo and reference-based transcriptome assemblers. However, while many transcriptome assemblers are now available, there is still no unified quality assessment tool for RNA-Seq assemblies. We present rnaQUAST, a tool for evaluating RNA-Seq assembly quality and benchmarking transcriptome assemblers using a reference genome and gene database. rnaQUAST calculates various metrics that demonstrate completeness and correctness levels of the assembled transcripts, and outputs them in a user-friendly report. AVAILABILITY AND IMPLEMENTATION: rnaQUAST is implemented in Python and is freely available at http://bioinf.spbau.ru/en/rnaquast CONTACT: ap@bioinf.spbau.ru SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Computational Biology/methods , Sequence Analysis, RNA , Software , Transcriptome
18.
Bioinformatics ; 31(20): 3262-8, 2015 Oct 15.
Article En | MEDLINE | ID: mdl-26040456

MOTIVATION: Advances in Next-Generation Sequencing technologies and sample preparation recently enabled the generation of high-quality jumping libraries that have the potential to significantly improve short-read assemblies. However, assembly algorithms have to catch up with experimental innovations to benefit from them and to produce high-quality assemblies. RESULTS: We present a new algorithm that extends the recently described exSPAnder universal repeat resolution approach to enable its application to several challenging data types, including jumping libraries generated by the recently developed Illumina Nextera Mate Pair protocol. We demonstrate that, with these improvements, bacterial genomes can often be assembled in a few contigs using only a single Nextera Mate Pair library of short reads. AVAILABILITY AND IMPLEMENTATION: The described algorithms are implemented in C++ as a part of the SPAdes genome assembler, which is freely available at bioinf.spbau.ru/en/spades. CONTACT: ap@bioinf.spbau.ru SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Algorithms , Gene Library , Genomics/methods , Genome, Bacterial , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods
19.
Bioinformatics ; 30(12): i293-301, 2014 Jun 15.
Article En | MEDLINE | ID: mdl-24931996

Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is further amplified in recently emerged single-cell sequencing projects. While various NGS assemblers can use information from several libraries of read-pairs, most of them were originally developed for a single library and do not fully benefit from multiple libraries. Moreover, most assemblers assume uniform read coverage, a condition that does not hold for single-cell projects, where utilization of read-pairs is even more challenging. We have developed the exSPAnder algorithm, which accurately resolves repeats in the case of both single and multiple libraries of read-pairs in both standard and single-cell assembly projects. AVAILABILITY AND IMPLEMENTATION: http://bioinf.spbau.ru/en/spades


Algorithms , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Actinomycetales/genetics , DNA/chemistry , Gene Library , Genome, Bacterial , Humans , Repetitive Sequences, Nucleic Acid , Staphylococcus aureus/genetics
20.
J Comput Biol ; 20(10): 714-37, 2013 Oct.
Article En | MEDLINE | ID: mdl-24093227

Recent advances in single-cell genomics provide an alternative to largely gene-centric metagenomics studies, enabling whole-genome sequencing of uncultivated bacteria. However, single-cell assembly projects are challenging due to (i) the highly nonuniform read coverage and (ii) a greatly elevated number of chimeric reads and read pairs. While recently developed single-cell assemblers have addressed the former challenge, methods for assembling highly chimeric reads remain poorly explored. We present algorithms for identifying chimeric edges and resolving complex bulges in de Bruijn graphs, which significantly improve single-cell assemblies. We further describe applications of the single-cell assembler SPAdes to a new approach for capturing and sequencing "microbial dark matter" that forms small pools of randomly selected single cells (called a mini-metagenome) and further sequences all genomes from the mini-metagenome at once. On single-cell bacterial datasets, SPAdes improves on the recently developed E+V-SC and IDBA-UD assemblers specifically designed for single-cell sequencing. For standard (cultivated monostrain) datasets, SPAdes also improves on A5, ABySS, CLC, EULER-SR, Ray, SOAPdenovo, and Velvet. Thus, recently developed single-cell assemblers not only enable single-cell sequencing, but also improve on conventional assemblers on their own turf. SPAdes is available for free online download under a GPLv2 license.


Contig Mapping/methods , DNA, Bacterial/genetics , DNA, Concatenated/genetics , Algorithms , Base Composition , Computational Biology , Escherichia coli/genetics , Gene Library , Genome, Bacterial , High-Throughput Nucleotide Sequencing , Nucleic Acid Amplification Techniques , Pedobacter/genetics , Prochlorococcus/genetics , Sequence Analysis, DNA , Single-Cell Analysis
...