ABSTRACT
Recent advances in long fragment read (LFR; also known as linked-read or read-cloud) technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case of stLFR and 10X Genomics Chromium reads, the long fragments of a genome are covered sparsely by reads in each barcode, and most barcodes are contained in multiple long fragments from different regions, which results in inefficient assembly when using long-range information. Thus, methods to address these shortcomings are vital for capitalizing on the additional information obtained using these technologies. We therefore designed IterCluster, a novel, alignment-free clustering algorithm that can cluster barcodes from the same target region of a genome, using k-mer frequency-based features and a Markov Cluster (MCL) approach to identify enough reads in a target region of a genome to ensure sufficient target genome sequence depth. The IterCluster method was validated using BGI stLFR and 10X Genomics Chromium read datasets. IterCluster had a higher precision and recall rate on BGI stLFR data than on 10X Genomics Chromium read data. In addition, we demonstrated how IterCluster improves de novo assembly results when using a divide-and-conquer strategy on a human genome data set (scaffold/contig N50 = 13.2 kbp/7.1 kbp vs. 17.1 kbp/11.9 kbp before and after IterCluster, respectively). IterCluster provides a new way of determining LFR barcode enrichment and a novel approach for de novo assembly using LFR data. IterCluster is open source and available at https://github.com/JianCong-WENG/IterCluster.
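To illustrate what "k-mer frequency-based features" can mean for barcode clustering, the following is a minimal sketch that builds a k-mer count profile per barcode and compares two profiles by cosine similarity. It is not IterCluster's actual feature set or scoring; the function names, the choice of cosine similarity, the value of k, and the toy reads are all assumptions made for illustration. In a full pipeline, a pairwise similarity matrix of this kind could serve as input to a graph clustering step such as MCL.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer across a list of read strings (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (invented): bc1 and bc2 sample the same region, bc3 does not.
barcode_reads = {
    "bc1": ["ACGTACGTGA", "CGTGATTACA"],
    "bc2": ["GTACGTGATT", "TGATTACAGG"],
    "bc3": ["TTTTGGGGCC", "GGGCCCCAAA"],
}
profiles = {bc: kmer_counts(reads, k=5) for bc, reads in barcode_reads.items()}
sim_12 = cosine_similarity(profiles["bc1"], profiles["bc2"])
sim_13 = cosine_similarity(profiles["bc1"], profiles["bc3"])
```

With these toy reads, the two barcodes drawn from the same region share most of their 5-mers and score much higher than the unrelated pair, which is the signal an alignment-free method relies on.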
ABSTRACT
Here, we describe single-tube long fragment read (stLFR), a technology that enables sequencing of data from long DNA molecules using economical second-generation sequencing technology. It is based on adding the same barcode sequence to subfragments of the original long DNA molecule (DNA cobarcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically nonredundant cobarcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique cobarcoding of more than 8 million 20- to 300-kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phase block lengths up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
Subjects
High-Throughput Nucleotide Sequencing/methods, Whole Genome Sequencing/methods, Cost-Benefit Analysis, Diploidy, Gene Library, Human Genome, Genomics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/economics, Humans, Whole Genome Sequencing/economics
ABSTRACT
The vast abundance of terpene natural products in nature is due to enzymes known as terpene synthases (TPSs) that convert acyclic prenyl diphosphate precursors into a multitude of cyclic and acyclic carbon skeletons. Yet the evolution of TPSs is not well understood at higher levels of classification. Microbial TPSs from bacteria and fungi are only distantly related to typical plant TPSs, whereas genes similar to microbial TPS genes have been recently identified in the lycophyte Selaginella moellendorffii. The goal of this study was to investigate the distribution, evolution, and biochemical functions of microbial terpene synthase-like (MTPSL) genes in other plants. By analyzing the transcriptomes of 1,103 plant species ranging from green algae to flowering plants, putative MTPSL genes were identified predominantly from nonseed plants, including liverworts, mosses, hornworts, lycophytes, and monilophytes. Directed searching for MTPSL genes in the sequenced genomes of a wide range of seed plants confirmed their general absence in this group. Among themselves, MTPSL proteins from nonseed plants form four major groups, with two of these more closely related to bacterial TPSs and the other two to fungal TPSs. Two of the four groups contain a canonical aspartate-rich "DDxxD" motif. The third group has a "DDxxxD" motif, and the fourth group has only the first two "DD" conserved in this motif. Upon heterologous expression, representative members from each of the four groups displayed diverse catalytic functions as monoterpene and sesquiterpene synthases, suggesting these are important for terpene formation in nonseed plants.
Subjects
Alkyl and Aryl Transferases/genetics, Molecular Evolution, Phylogeny, Transcriptome/genetics, Chlorophyta/genetics, Chromosome Mapping, Embryophyta/genetics, Gene Expression Profiling, Plant Proteins/genetics, Seeds/genetics
ABSTRACT
BACKGROUND: The ATP-binding cassette (ABC) transporter gene superfamily is ubiquitous among extant organisms and prominently represented in plants. ABC transporters act to transport compounds across cellular membranes and are involved in a diverse range of biological processes. Thus, the applicability to biotechnology is vast, including cancer resistance in humans, drug resistance among vertebrates, and herbicide and other xenobiotic resistance in plants. In addition, plants appear to harbor the highest diversity of ABC transporter genes compared with any other group of organisms. This study applied transcriptome analysis to survey the kingdom-wide ABC transporter diversity in plants and suggest biotechnology applications of this diversity. RESULTS: We utilized sequence similarity-based informatics techniques to infer the identity of ABC transporter gene candidates from 1295 phylogenetically diverse plant transcriptomes. A total of 97,149 putative ABC transporter gene members were identified, of which approximately 25% were full-length; each RNA-Seq library (plant sample) had 88 ± 30 gene members. As expected, simpler organisms, such as algae, had fewer unique members than vascular land plants. Differences were also noted in the richness of certain ABC transporter subfamilies. Land plants had more unique ABCB, ABCC, and ABCG transporter gene members on average (p < 0.005), and green algae, red algae, and bryophytes had significantly more ABCF transporter gene members (p < 0.005). Ferns had significantly fewer ABCA transporter gene members than all other plant groups (p < 0.005). CONCLUSIONS: We present a transcriptomic overview of ABC transporter gene members across all major plant groups. An increase in the number of gene family members present in the ABCB, ABCC, and ABCD transporter subfamilies may indicate an expansion of the ABC transporter superfamily among green land plants, which include all crop species. The striking difference in the number of ABCA subfamily transporter gene members between ferns and other plant taxa is surprising and merits further investigation. We discuss the potential exploitation of ABC transporters in plant biotechnology, with an emphasis on crops.
Subjects
ATP-Binding Cassette Transporters/genetics, Plant Genes/genetics, Genetic Variation/genetics, Plant Genome/genetics, Plant Proteins/genetics, Plants/genetics, Biotechnology/trends, Chromosome Mapping/methods, Data Mining/methods, Protein Databases, Species Specificity
ABSTRACT
[This corrects the article DOI: 10.1186/2047-217X-1-18.].
ABSTRACT
The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
Subjects
Human Genome, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, Precision Medicine, Asian People/genetics, Base Sequence, Chromosome Mapping, Diploidy, Genotype, Humans, Phenotype, Single Nucleotide Polymorphism, DNA Sequence Analysis
ABSTRACT
Many phylogenomic studies based on transcriptomes have been limited to "single-copy" genes due to methodological challenges in homology and orthology inferences. Only a relatively small number of studies have explored analyses beyond reconstructing species relationships. We sampled 69 transcriptomes in the hyperdiverse plant clade Caryophyllales and 27 outgroups from annotated genomes across eudicots. Using a combined similarity- and phylogenetic tree-based approach, we recovered 10,960 homolog groups, where each was represented by at least eight ingroup taxa. By decomposing these homolog trees, and taking gene duplications into account, we obtained 17,273 ortholog groups, where each was represented by at least ten ingroup taxa. We reconstructed the species phylogeny using a 1,122-gene data set with a gene occupancy of 92.1%. From the homolog trees, we found that both synonymous and nonsynonymous substitution rates in herbaceous lineages are up to three times as fast as in their woody relatives. This is the first time such a pattern has been shown across thousands of nuclear genes with dense taxon sampling. We also pinpointed regions of the Caryophyllales tree that were characterized by relatively high frequencies of gene duplication, including three previously unrecognized whole-genome duplications. By further combining information from homolog tree topology and synonymous distance between paralog pairs, phylogenetic locations for 13 putative genome duplication events were identified. Genes that experienced the greatest gene family expansion were concentrated among those involved in signal transduction and oxidoreduction, including a cytochrome P450 gene that encodes a key enzyme in the betalain synthesis pathway. Our study demonstrates a new approach for functional phylogenomic analysis in nonmodel species that is based on homolog groups in addition to inferred ortholog groups.
Subjects
Caryophyllaceae/genetics, Molecular Evolution, Gene Duplication/physiology, Plant Genome/physiology, Phylogeny, Transcriptome/physiology, Caryophyllaceae/classification, High-Throughput Nucleotide Sequencing
ABSTRACT
MOTIVATION: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining a large number of gene sequences from an organism with no reference genome. Owing to the rapid increase in throughputs and decrease in costs of next-generation sequencing, RNA-Seq in particular has become the method of choice. However, the very short reads (e.g. 2 × 90 bp paired ends) from next-generation sequencing make de novo assembly to recover complete or full-length transcript sequences an algorithmic challenge. RESULTS: Here, we present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. We evaluated its performance on transcriptome datasets from rice and mouse. Using as our benchmarks the known transcripts from these well-annotated genomes (sequenced a decade ago), we assessed how SOAPdenovo-Trans and two other popular transcriptome assemblers handled such practical issues as alternative splicing and variable expression levels. Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy and faster execution. AVAILABILITY AND IMPLEMENTATION: Source code and user manual are available at http://sourceforge.net/projects/soapdenovotrans/.
Subjects
Algorithms, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing/methods, RNA Sequence Analysis/methods, Alternative Splicing, Animals, Genomics/methods, Mice, Oryza/genetics
ABSTRACT
Optogenetic tools enable examination of how specific cell types contribute to brain circuit functions. A long-standing question is whether it is possible to independently activate two distinct neural populations in mammalian brain tissue. Such a capability would enable the study of how different synapses or pathways interact to encode information in the brain. Here we describe two channelrhodopsins, Chronos and Chrimson, discovered through sequencing and physiological characterization of opsins from over 100 species of alga. Chrimson's excitation spectrum is red shifted by 45 nm relative to previous channelrhodopsins and can enable experiments in which red light is preferred. We show minimal visual system-mediated behavioral interference when using Chrimson in neurobehavioral studies in Drosophila melanogaster. Chronos has faster kinetics than previous channelrhodopsins yet is effectively more light sensitive. Together these two reagents enable two-color activation of neural spiking and downstream synaptic transmission in independent neural populations without detectable cross-talk in mouse brain slice.
Subjects
Drosophila Proteins/metabolism, Drosophila melanogaster/physiology, Light, Neurons/physiology, Animals, Drosophila Proteins/genetics, Molecular Sequence Data, Optogenetics, Rhodopsin/genetics, Rhodopsin/metabolism
ABSTRACT
MOTIVATION: The boost of next-generation sequencing technologies provides us with an unprecedented opportunity for elucidating genetic mysteries, yet the short-read length hinders us from better assembling the genome from scratch. New protocols now exist that can generate overlapping pair-end reads. By joining the 3' ends of each read pair, one is able to construct longer reads for assembling. However, effectively joining two overlapped pair-end reads remains a challenging task. RESULT: In this article, we present an efficient tool called Connecting Overlapped Pair-End (COPE) reads, to connect overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads. AVAILABILITY AND IMPLEMENTATION: COPE is implemented in C++ and is freely available as open-source code at ftp://ftp.genomics.org.cn/pub/cope. CONTACT: twlam@cs.hku.hk or luoruibang@genomics.org.cn
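The idea of joining the 3' ends of an overlapping read pair can be sketched as follows. This is a deliberately naive suffix-prefix overlap search with a mismatch tolerance, not COPE's k-mer frequency-based method; the function names, the `min_overlap` and `max_mismatch_rate` parameters, and the example fragment are assumptions made for illustration only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def join_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a read pair whose 3' ends overlap.

    read2 is reverse-complemented so both reads are on the same strand,
    then candidate overlaps are tried from longest to shortest. Returns the
    merged sequence, or None if no overlap meets the mismatch threshold.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(1 for x, y in zip(tail, head) if x != y)
        if mismatches <= max_mismatch_rate * olen:
            # Keep read1 in full and append the non-overlapping part of the mate.
            return read1 + mate[olen:]
    return None


# Invented 36-bp fragment sequenced from both ends with 25-bp reads,
# so the two reads overlap by 14 bp.
fragment = "ACGTTGCATGCCGATAGGCTAACGTTAGCCATGGAC"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
merged = join_pair(read1, read2)
```

A greedy search like this picks the longest acceptable overlap and can be misled by repeats or sequencing errors, which is the kind of ambiguity that motivates using additional evidence such as k-mer frequencies.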
Subjects
Algorithms, Arabidopsis/genetics, Chromosome Mapping, Genomics/methods, DNA Sequence Analysis/methods, Contig Mapping
ABSTRACT
The Pacific oyster Crassostrea gigas belongs to one of the most species-rich but genomically poorly explored phyla, the Mollusca. Here we report the sequencing and assembly of the oyster genome using short reads and a fosmid-pooling strategy, along with transcriptomes of development and stress response and the proteome of the shell. The oyster genome is highly polymorphic and rich in repetitive sequences, with some transposable elements still actively shaping variation. Transcriptome studies reveal an extensive set of genes responding to environmental stress. The expansion of genes coding for heat shock protein 70 and inhibitors of apoptosis is probably central to the oyster's adaptation to sessile life in the highly stressful intertidal zone. Our analyses also show that shell formation in molluscs is more complex than currently understood and involves extensive participation of cells and their exosomes. The oyster genome sequence fills a void in our understanding of the Lophotrochozoa.
Subjects
Physiological Adaptation/genetics, Animal Shells/growth & development, Crassostrea/genetics, Genome/genetics, Physiological Stress/physiology, Animal Shells/chemistry, Animals, Apoptosis Regulatory Proteins/genetics, DNA Transposable Elements/genetics, Molecular Evolution, Female, Developmental Gene Expression Regulation/genetics, Homeobox Genes/genetics, Genomics, HSP70 Heat-Shock Proteins/genetics, Humans, Larva/genetics, Larva/growth & development, Mass Spectrometry, Molecular Sequence Annotation, Molecular Sequence Data, Genetic Polymorphism/genetics, Repetitive Nucleic Acid Sequences/genetics, DNA Sequence Analysis, Physiological Stress/genetics, Transcriptome/genetics
ABSTRACT
BACKGROUND: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. FINDINGS: To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes. CONCLUSIONS: Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower at the point of largest memory consumption.
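Several of these abstracts report contig and scaffold N50 values. As a reminder of what that statistic measures, the sketch below computes N50 under its common definition: the length L such that sequences of length at least L together cover at least half of the total assembled bases. The function name and the example lengths are invented for illustration and do not come from any of the assemblers mentioned.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half is the N50.
        if running >= half_total:
            return length


# Invented contig lengths in bp: total is 54, so half is 27,
# and 10 + 9 + 8 = 27 reaches it at the contig of length 8.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends only on the length distribution, a higher value indicates greater contiguity but says nothing about correctness, which is why the abstracts report it alongside accuracy and coverage.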
ABSTRACT
Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
Subjects
Genome/physiology, Genomics/methods, DNA Sequence Analysis/methods
ABSTRACT
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, approximately 150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.