Results 1 - 20 of 38
1.
Cell ; 151(3): 547-58, 2012 Oct 26.
Article in English | MEDLINE | ID: mdl-23101625

ABSTRACT

Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.


Subject(s)
Cell-Penetrating Peptides/metabolism , Cellular Reprogramming , Immunity, Innate , Induced Pluripotent Stem Cells/metabolism , Signal Transduction , Cell Line , Fibroblasts/metabolism , Humans , Inflammation/metabolism , Kruppel-Like Factor 4 , NF-kappa B/metabolism , Octamer Transcription Factor-3/metabolism , Retroviridae/metabolism , Toll-Like Receptor 3/metabolism
2.
Bioinformatics ; 38(23): 5245-5252, 2022 11 30.
Article in English | MEDLINE | ID: mdl-36250792

ABSTRACT

MOTIVATION: The clustered regularly interspaced short palindromic repeats (CRISPR)-based genetic perturbation screen is a powerful tool to probe gene function. However, experimental noise, especially for lowly expressed genes, needs to be accounted for to maintain proper control of the false positive rate. METHODS: We develop a statistical method, named CRISPR screen with Expression Data Analysis (CEDA), to integrate gene expression profiles and CRISPR screen data for identifying essential genes. CEDA stratifies genes based on expression level and adopts a three-component mixture model for the log-fold change of single-guide RNAs (sgRNAs). An empirical Bayesian prior and the expectation-maximization algorithm are used for parameter estimation and false discovery rate inference. RESULTS: Taking advantage of gene expression data, CEDA identifies essential genes with higher expression. Compared to existing methods, CEDA shows comparable reliability but higher sensitivity in detecting essential genes with moderate sgRNA fold change. Therefore, using the same CRISPR data, CEDA generates an additional hit gene list. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats , Genes, Essential , Bayes Theorem , CRISPR-Cas Systems , Gene Expression , Reproducibility of Results , RNA, Small Untranslated/genetics
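
Note on entry 2: the abstract describes a three-component mixture model for sgRNA log-fold changes, fitted by expectation-maximization. The sketch below shows only that general idea on simulated numbers, using plain Gaussian components. It is not the CEDA implementation: the expression-level stratification and the empirical Bayesian prior are left out, and the function names, the simulated means and the initialisation are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate_log_fold_changes(seed=0):
    """Toy data (assumed values): depleted, unchanged and enriched sgRNAs."""
    rng = random.Random(seed)
    depleted = [rng.gauss(-3.0, 0.5) for _ in range(200)]
    unchanged = [rng.gauss(0.0, 0.5) for _ in range(600)]
    enriched = [rng.gauss(3.0, 0.5) for _ in range(200)]
    return depleted + unchanged + enriched


def fit_three_component_mixture(values, n_iter=100):
    """Fit a 1-D Gaussian mixture with three components by EM."""
    ordered = sorted(values)
    n = len(ordered)
    # Start the means at the 10th, 50th and 90th percentiles.
    means = [ordered[n // 10], ordered[n // 2], ordered[(9 * n) // 10]]
    spread = (ordered[-1] - ordered[0]) / 6.0 or 1.0
    sds = [spread] * 3
    weights = [1.0 / 3.0] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(3):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
    return weights, means, sds, resp
```

In this toy setting, `resp[i][0]` is the posterior probability that value `i` belongs to the lowest (depleted) component, and `1 - resp[i][0]` could serve as a rough local false discovery rate. How CEDA itself derives its false discovery rate is not specified in the abstract.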
3.
Genome Res ; 29(8): 1329-1342, 2019 08.
Article in English | MEDLINE | ID: mdl-31201211

ABSTRACT

Genome-wide chromatin accessibility and nucleosome occupancy profiles have been widely investigated, while the long-range dynamics remain poorly studied at the single-cell level. Here, we present a new experimental approach, methyltransferase treatment followed by single-molecule long-read sequencing (MeSMLR-seq), for long-range mapping of nucleosomes and chromatin accessibility at single DNA molecules and thus achieve comprehensive-coverage characterization of the corresponding heterogeneity. MeSMLR-seq offers direct measurements of both nucleosome-occupied and nucleosome-evicted regions on a single DNA molecule, which is challenging for many existing methods. We applied MeSMLR-seq to haploid yeast, where single DNA molecules represent single cells, and thus we could investigate the combinatorics of many (up to 356) nucleosomes at long range in single cells. We illustrated the differential organization principles of nucleosomes surrounding the transcription start site for silent and actively transcribed genes, at the single-cell level and in the long-range scale. The heterogeneous patterns of chromatin status spanning multiple genes were phased. Together with single-cell RNA-seq data, we quantitatively revealed how chromatin accessibility correlated with gene transcription positively in a highly heterogeneous scenario. Moreover, we quantified the openness of promoters and investigated the coupled chromatin changes of adjacent genes at single DNA molecules during transcription reprogramming. In addition, we revealed the coupled changes of chromatin accessibility for two neighboring glucose transporter genes in response to changes in glucose concentration.


Subject(s)
Euchromatin/metabolism , Gene Expression Regulation, Fungal , Histones/genetics , Saccharomyces cerevisiae/genetics , Transcription, Genetic , Chromosome Mapping , DNA, Fungal/genetics , DNA, Fungal/metabolism , Euchromatin/chemistry , Glucose/metabolism , Glucose Transport Proteins, Facilitative/genetics , Glucose Transport Proteins, Facilitative/metabolism , High-Throughput Nucleotide Sequencing , Histones/metabolism , Methyltransferases/chemistry , Monosaccharide Transport Proteins/genetics , Monosaccharide Transport Proteins/metabolism , Nucleosomes/chemistry , Nucleosomes/metabolism , Promoter Regions, Genetic , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism , Single-Cell Analysis/methods , Transcription Initiation Site
4.
Bioinformatics ; 37(Suppl_1): i477-i483, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252938

ABSTRACT

MOTIVATION: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base-calling procedure before using aligners to map partial reads or work well only on small genomes. RESULTS: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics. We implemented the method as a tool Sigmap. Then we evaluated it on both simulated and real data and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields comparable performance on mapping yeast simulated raw signals, and better mapping accuracy on mapping yeast real raw signals with a 4.4× speedup. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 versus 0.8660). AVAILABILITY AND IMPLEMENTATION: Sigmap code is accessible at https://github.com/haowenz/sigmap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Nanopores , Algorithms , Genome , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software
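
Note on entry 4: the abstract describes converting the reference genome to signals and indexing it with k-d trees so that raw read signals can be mapped without base calling. The sketch below illustrates that idea only. The per-base current levels in `TOY_LEVEL` are invented (real pore models assign levels to k-mers), the window size `dim` is arbitrary, and Sigmap's seed selection and chaining steps are not reproduced; all function names are hypothetical.

```python
# Invented current levels per base, for illustration only.
TOY_LEVEL = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def reference_to_signal(seq):
    """Convert a reference sequence to an expected signal (toy model)."""
    return [TOY_LEVEL[base] for base in seq]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, reference_position) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (squared_distance, reference_position) of the nearest point."""
    if node is None:
        return best
    vec, pos = node["point"]
    d = sq_dist(vec, query)
    if best is None or d < best[0]:
        best = (d, pos)
    axis = node["axis"]
    diff = query[axis] - vec[axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


def index_reference(seq, dim=6):
    """Index every window of `dim` consecutive signal values of the reference."""
    signal = reference_to_signal(seq)
    points = [(tuple(signal[i:i + dim]), i)
              for i in range(len(signal) - dim + 1)]
    return build_kdtree(points)
```

A noisy window of read signal can then be looked up with `nearest(index_reference(reference), window)`, which returns the reference offset of the closest indexed window. Real raw signals would first need normalization and event detection, which this sketch skips.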
5.
Proc Natl Acad Sci U S A ; 116(35): 17470-17479, 2019 08 27.
Article in English | MEDLINE | ID: mdl-31395738

ABSTRACT

The most frequently mutated protein in human cancer is p53, a transcription factor (TF) that regulates myriad genes instrumental in diverse cellular outcomes including growth arrest and cell death. Cell context-dependent p53 modulation is critical for this life-or-death balance, yet remains incompletely understood. Here we identify sequence signatures enriched in genomic p53-binding sites modulated by the transcription cofactor iASPP. Moreover, our p53-iASPP crystal structure reveals that iASPP displaces the p53 L1 loop (which mediates sequence-specific interactions with the signature-corresponding base) without perturbing other DNA-recognizing modules of the p53 DNA-binding domain. A TF commonly uses multiple structural modules to recognize its cognate DNA, and thus this mechanism of a cofactor fine-tuning TF-DNA interactions through targeting a particular module is likely widespread. Previously, all tumor suppressors and oncoproteins that associate with the p53 DNA-binding domain, except the oncogenic E6 from human papillomaviruses (HPVs), structurally cluster at the DNA-binding site of p53, complicating drug design. By contrast, iASPP inhibits p53 through a distinct surface overlapping the E6 footprint, opening prospects for p53-targeting precision medicine to improve cancer therapy.


Subject(s)
DNA/genetics , DNA/metabolism , Intracellular Signaling Peptides and Proteins/metabolism , Repressor Proteins/metabolism , Response Elements , Tumor Suppressor Protein p53/metabolism , Base Sequence , Binding Sites , Cell Line, Tumor , DNA/chemistry , Gene Expression Profiling , Humans , Intracellular Signaling Peptides and Proteins/chemistry , Models, Molecular , Nucleotide Motifs , Oncogene Proteins, Viral/chemistry , Oncogene Proteins, Viral/metabolism , Protein Binding , Protein Conformation , Repressor Proteins/chemistry , Structure-Activity Relationship , Tumor Suppressor Protein p53/chemistry
6.
Brief Bioinform ; 20(6): 2306-2315, 2019 11 27.
Article in English | MEDLINE | ID: mdl-30239581

ABSTRACT

Intra-tumor heterogeneity is associated with cancer progression and therapeutic resistance, for example in breast cancer. While existing methods for studying tumor heterogeneity analyze only variant allele frequency (VAF), the genotype of a variant is also informative for inferring subclones, which can be detected by long reads or paired-end reads. We developed GenoClone to integrate VAF with the genotype of variants; it showed superior performance in inferring the number of subclones, estimating the fractions of subclones and identifying the somatic single-nucleotide variant composition of subclones. When GenoClone was applied to 389 TCGA breast cancer samples, it revealed extensive intra-tumor heterogeneity. We further found that a few somatic mutations were relevant to the late stage of tumor evolution, including the ones at the oncogene PIK3CA and the tumor suppressor gene TP53. Moreover, 52 subclones that were identified from 167 samples shared high similarity of somatic mutations, which were clustered into three groups with the sizes of 24, 14 and 14. This is helpful for understanding the development of breast cancer in certain subgroups of people and for drug development at the population level. Furthermore, GenoClone also identified the tumor heterogeneity in different aliquots of the same samples. The implementation of GenoClone is available at http://www.healthcare.uiowa.edu/labs/au/GenoClone/.


Subject(s)
Breast Neoplasms/pathology , Genetic Linkage , Germ-Line Mutation , Breast Neoplasms/genetics , Class I Phosphatidylinositol 3-Kinases/genetics , Female , Genotype , Humans , Monte Carlo Method , Polymorphism, Single Nucleotide , Tumor Suppressor Protein p53/genetics
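
Note on entry 6: the abstract contrasts GenoClone with methods that use only variant allele frequency (VAF). As a reminder of what that quantity is, the sketch below computes VAF from read counts and converts it to a naive estimate of the fraction of cells carrying the variant. The conversion assumes a heterozygous variant in a copy-neutral region of a pure tumor sample; these assumptions and the function names are illustrative and are not taken from GenoClone.

```python
def variant_allele_frequency(alt_reads, ref_reads):
    """Fraction of reads at a site that support the variant allele."""
    depth = alt_reads + ref_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def naive_cell_fraction(vaf):
    """Rough fraction of cells carrying the variant.

    Assumes a heterozygous variant, copy-neutral locus and a pure tumor
    sample, so each carrying cell contributes one variant allele out of two.
    """
    return min(1.0, 2.0 * vaf)
```

For example, 15 variant reads and 45 reference reads give a VAF of 0.25, which under these assumptions corresponds to about half of the cells carrying the variant.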
7.
Bioinformatics ; 34(13): 2168-2176, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29905763

ABSTRACT

Motivation: In the past years, the long read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been demonstrated to substantially improve the quality of genome assembly and transcriptome characterization. Compared to the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. That is, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is in high demand. Results: Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads. Using the GM12878 human data as a gold standard, we demonstrated that IDP-denovo had superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered a number of novel genes and novel isoforms that were not reported by the existing annotation library. These results reveal the high diversity of gene isoforms in D. officinale, which was not reported in the existing annotation library. Availability and implementation: The dataset of Dendrobium officinale used/analyzed during the current study has been deposited in SRA, with accession code SRP094520. IDP-denovo is available for download at www.healthcare.uiowa.edu/labs/au/IDP-denovo/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing , Gene Expression Profiling/methods , Gene Library , Dendrobium/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, RNA/methods
8.
Nucleic Acids Res ; 45(5): e32, 2017 03 17.
Article in English | MEDLINE | ID: mdl-27899656

ABSTRACT

Allele-specific expression (ASE) is a fundamental problem in studying gene regulation and diploid transcriptome profiles, with two key challenges: (i) haplotyping and (ii) estimation of ASE at the gene isoform level. Existing ASE analysis methods are limited by a dependence on haplotyping from laborious experiments or extra genome/family trio data. In addition, there is a lack of methods for gene isoform level ASE analysis. We developed a tool, IDP-ASE, for full ASE analysis. By innovative integration of Third Generation Sequencing (TGS) long reads with Second Generation Sequencing (SGS) short reads, the accuracy of haplotyping and ASE quantification at the gene and gene isoform level was greatly improved, as demonstrated by the gold standard GM12878 data and semi-simulation data. In addition to methodology development, applications of IDP-ASE to human embryonic stem cells and breast cancer cells indicate that the imbalance of ASE and non-uniformity of gene isoform ASE are widespread, including in tumorigenesis-relevant genes and pluripotency markers. These results show that gene isoform expression and allele-specific expression cooperate to provide high diversity and complexity of gene regulation and expression, highlighting the importance of studying ASE at the gene isoform level. Our study provides a robust bioinformatics solution to understand ASE using RNA sequencing data only.


Subject(s)
Alleles , Haplotypes , High-Throughput Nucleotide Sequencing/methods , RNA Isoforms/genetics , RNA, Messenger/genetics , Transcriptome , Gene Expression Regulation , Human Embryonic Stem Cells/cytology , Human Embryonic Stem Cells/metabolism , Humans , MCF-7 Cells , RNA Isoforms/metabolism , RNA, Messenger/metabolism , Sequence Analysis, RNA
9.
Nucleic Acids Res ; 43(18): e116, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26040699

ABSTRACT

We developed an innovative hybrid sequencing approach, IDP-fusion, to detect fusion genes, determine fusion sites and identify and quantify fusion isoforms. IDP-fusion is the first method to study gene fusion events by integrating Third Generation Sequencing long reads and Second Generation Sequencing short reads. We applied IDP-fusion to PacBio data and Illumina data from the MCF-7 breast cancer cells. Compared with the existing tools, IDP-fusion detects fusion genes at higher precision and a very low false positive rate. The results show that IDP-fusion will be useful for unraveling the complexity of multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes.


Subject(s)
Carcinogenesis/genetics , Gene Expression Profiling , Gene Fusion , High-Throughput Nucleotide Sequencing/methods , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Female , Humans , MCF-7 Cells , Protein Isoforms/genetics , Protein Isoforms/metabolism , Sequence Alignment
10.
Plant J ; 82(6): 951-961, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25912611

ABSTRACT

Danshen, Salvia miltiorrhiza Bunge, is one of the most widely used herbs in traditional Chinese medicine, wherein its rhizome/roots are particularly valued. The corresponding bioactive components include the tanshinone diterpenoids, the biosynthesis of which is a subject of considerable interest. Previous investigations of the S. miltiorrhiza transcriptome have relied on short-read next-generation sequencing (NGS) technology, and the vast majority of the resulting isotigs do not represent full-length cDNA sequences. Moreover, these efforts have been targeted at either whole plants or hairy root cultures. Here, we demonstrate that the tanshinone pigments are produced and accumulate in the root periderm, and apply a combination of NGS and single-molecule real-time (SMRT) sequencing to various root tissues, particularly including the periderm, to provide a more complete view of the S. miltiorrhiza transcriptome, with further insight into tanshinone biosynthesis as well. In addition, the use of SMRT long-read sequencing offered the ability to examine alternative splicing, which was found to occur in approximately 40% of the detected gene loci, including several involved in isoprenoid/terpenoid metabolism.


Subject(s)
Abietanes/biosynthesis , Alternative Splicing , Plant Roots/genetics , Salvia miltiorrhiza/genetics , Abietanes/metabolism , Gene Expression Profiling/methods , Gene Expression Regulation, Plant , High-Throughput Nucleotide Sequencing/methods , Plant Proteins/genetics , Plant Proteins/metabolism , Plant Roots/metabolism , Salvia miltiorrhiza/metabolism , Sequence Analysis, DNA/methods , Transcriptome
11.
Genome Res ; 23(1): 201-16, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22960373

ABSTRACT

The Xenopus embryo has provided key insights into fate specification, the cell cycle, and other fundamental developmental and cellular processes, yet a comprehensive understanding of its transcriptome is lacking. Here, we used paired end RNA sequencing (RNA-seq) to explore the transcriptome of Xenopus tropicalis in 23 distinct developmental stages. We determined expression levels of all genes annotated in RefSeq and Ensembl and showed for the first time on a genome-wide scale that, despite a general state of transcriptional silence in the earliest stages of development, approximately 150 genes are transcribed prior to the midblastula transition. In addition, our splicing analysis uncovered more than 10,000 novel splice junctions at each stage and revealed that many known genes have additional unannotated isoforms. Furthermore, we used Cufflinks to reconstruct transcripts from our RNA-seq data and found that ∼13.5% of the final contigs are derived from novel transcribed regions, both within introns and in intergenic regions. We then developed a filtering pipeline to separate protein-coding transcripts from noncoding RNAs and identified a confident set of 6686 noncoding transcripts in 3859 genomic loci. Since the current reference genome, XenTro3, consists of hundreds of scaffolds instead of full chromosomes, we also performed de novo reconstruction of the transcriptome using Trinity and uncovered hundreds of transcripts that are missing from the genome. Collectively, our data will not only aid in completing the assembly of the Xenopus tropicalis genome but will also serve as a valuable resource for gene discovery and for unraveling the fundamental mechanisms of vertebrate embryogenesis.


Subject(s)
Gene Expression Regulation, Developmental , Sequence Analysis, RNA , Transcriptome , Xenopus/genetics , Animals , Ecthyma, Contagious , Embryo, Nonmammalian/metabolism , Introns , Larva/genetics , Larva/metabolism , Physical Chromosome Mapping , RNA Splicing , RNA, Untranslated , Sequence Alignment , Xenopus/growth & development
12.
Proc Natl Acad Sci U S A ; 110(50): E4821-30, 2013 Dec 10.
Article in English | MEDLINE | ID: mdl-24282307

ABSTRACT

Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.


Subject(s)
Alternative Splicing/genetics , Embryonic Stem Cells/metabolism , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Protein Isoforms/genetics , Transcriptome/genetics , Embryonic Stem Cells/chemistry , Humans , Male
13.
Mol Syst Biol ; 9: 632, 2013.
Article in English | MEDLINE | ID: mdl-23295861

ABSTRACT

Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.


Subject(s)
Blastocyst/physiology , DNA-Binding Proteins/genetics , Embryonic Stem Cells/physiology , Gene Regulatory Networks , Homeodomain Proteins/genetics , Octamer Transcription Factor-3/genetics , Transcription Factors/genetics , Animals , Blastocyst/metabolism , DNA (Cytosine-5-)-Methyltransferases/genetics , DNA (Cytosine-5-)-Methyltransferases/metabolism , DNA-Binding Proteins/metabolism , Embryo Culture Techniques , Embryo, Mammalian/metabolism , Embryonic Development , Female , Gene Expression Profiling , Gene Expression Regulation, Developmental , Gene Knockdown Techniques , Homeodomain Proteins/metabolism , Male , Mice , Mice, Inbred C57BL , Mice, Inbred DBA , MicroRNAs/genetics , Nanog Homeobox Protein , Octamer Transcription Factor-3/metabolism , Oligonucleotide Array Sequence Analysis , Transcription Factors/metabolism , DNA Methyltransferase 3B
14.
Nat Biotechnol ; 42(4): 591-596, 2024 Apr.
Article in English | MEDLINE | ID: mdl-37349523

ABSTRACT

Current N6-methyladenosine (m6A) mapping methods need large amounts of RNA or are limited to cultured cells. Through optimized sample recovery and signal-to-noise ratio, we developed picogram-scale m6A RNA immunoprecipitation and sequencing (picoMeRIP-seq) for studying m6A in vivo in single cells and scarce cell types using standard laboratory equipment. We benchmark m6A mapping on titrations of poly(A) RNA and embryonic stem cells and in single zebrafish zygotes, mouse oocytes and embryos.


Subject(s)
RNA , Zebrafish , Animals , Mice , Zebrafish/genetics , Zebrafish/metabolism , RNA/genetics , RNA, Messenger/genetics , Embryonic Stem Cells , Cells, Cultured
15.
Nat Struct Mol Biol ; 30(5): 703-709, 2023 05.
Article in English | MEDLINE | ID: mdl-37081317

ABSTRACT

Despite the significance of N6-methyladenosine (m6A) in gene regulation, the requirement for large amounts of RNA has hindered m6A profiling in mammalian early embryos. Here we apply low-input methyl RNA immunoprecipitation and sequencing to map m6A in mouse oocytes and preimplantation embryos. We define the landscape of m6A during the maternal-to-zygotic transition, including stage-specifically expressed transcription factors essential for cell fate determination. Both the maternally inherited transcripts to be degraded post fertilization and the zygotically activated genes during zygotic genome activation are widely marked by m6A. In contrast to m6A-marked zygotically activated genes, m6A-marked maternally inherited transcripts have a higher tendency to be targeted by microRNAs. Moreover, RNAs derived from retrotransposons, such as MTA that is maternally expressed and MERVL that is transcriptionally activated at the two-cell stage, are largely marked by m6A. Our results provide a foundation for future studies exploring the regulatory roles of m6A in mammalian early embryonic development.


Subject(s)
Gene Expression Regulation, Developmental , MicroRNAs , Animals , Mice , Blastocyst , Oocytes/metabolism , Embryonic Development/genetics , Zygote , MicroRNAs/metabolism , Mammals/genetics
16.
Nucleic Acids Res ; 38(14): 4570-8, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20371516

ABSTRACT

Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50-100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of the predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low level. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.


Subject(s)
Alternative Splicing , RNA Splice Sites , Sequence Analysis, RNA , Software , Algorithms , Computational Biology/methods , Humans , Polymerase Chain Reaction
17.
Nat Biotechnol ; 39(11): 1348-1365, 2021 11.
Article in English | MEDLINE | ID: mdl-34750572

ABSTRACT

Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.


Subject(s)
Nanopore Sequencing , Nanopores , Computational Biology , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Technology
18.
Nat Commun ; 12(1): 1361, 2021 03 01.
Article in English | MEDLINE | ID: mdl-33649327

ABSTRACT

Sperm contributes diverse RNAs to the zygote. While sperm small RNAs have been shown to impact offspring phenotypes, our knowledge of the sperm transcriptome, especially the composition of long RNAs, has been limited by the lack of sensitive, high-throughput experimental techniques that can distinguish intact RNAs from fragmented RNAs, known to abound in sperm. Here, we integrate single-molecule long-read sequencing with short-read sequencing to detect sperm intact RNAs (spiRNAs). We identify 3440 spiRNA species in mice and 4100 in humans. The spiRNA profile consists of both mRNAs and long non-coding RNAs, is evolutionarily conserved between mice and humans, and displays an enrichment in mRNAs encoding for ribosome. In sum, we characterize the landscape of intact long RNAs in sperm, paving the way for future studies on their biogenesis and functions. Our experimental and bioinformatics approaches can be applied to other tissues and organisms to detect intact transcripts.


Subject(s)
Conserved Sequence/genetics , High-Throughput Nucleotide Sequencing/methods , RNA/genetics , Single Molecule Imaging , Spermatozoa/metabolism , Animals , Evolution, Molecular , Gene Ontology , Humans , Male , Mice, Inbred C57BL , RNA/metabolism , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism , Ribosomes/metabolism , Testis/metabolism , Transcriptome/genetics
19.
Genome Biol ; 21(1): 14, 2020 01 17.
Article in English | MEDLINE | ID: mdl-31952552

ABSTRACT

The error-prone third-generation sequencing (TGS) long reads can be corrected by the high-quality second-generation sequencing (SGS) short reads, which is referred to as hybrid error correction. We here investigate the influences of the principal algorithmic factors of two major types of hybrid error correction methods by mathematical modeling and analysis on both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that the original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Alignment , Algorithms
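
Note on entry 19: hybrid error correction uses accurate short reads to fix errors in long reads. The sketch below is a toy version of an align-then-vote approach, in which short reads already placed on the long read vote on each base. It assumes gap-free alignments given as `(offset, sequence)` pairs and only handles substitutions; the function name and the `min_votes` threshold are hypothetical, and the abstract does not say whether this corresponds to either of the two method types it analyzes.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads, min_votes=2):
    """Replace long-read bases with the short-read majority base.

    aligned_short_reads: list of (offset, sequence) pairs, assumed to be
    gap-free alignments of short reads onto the long read.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1
    corrected = []
    for original, counter in zip(long_read, votes):
        if counter:
            base, count = counter.most_common(1)[0]
            # Keep the original base unless enough short reads agree.
            corrected.append(base if count >= min_votes else original)
        else:
            corrected.append(original)
    return "".join(corrected)
```

Positions covered by fewer than `min_votes` agreeing short reads are left unchanged, so the result depends on short-read coverage as well as on the error rate of the long read.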
20.
Genome Biol ; 20(1): 26, 2019 02 04.
Article in English | MEDLINE | ID: mdl-30717772

ABSTRACT

BACKGROUND: Third-generation sequencing technologies have advanced the progress of the biological research by generating reads that are substantially longer than second-generation sequencing technologies. However, their notorious high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods. RESULTS: Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. CONCLUSIONS: Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals.


Subject(s)
Genomics/methods , Sequence Analysis, DNA , Software/statistics & numerical data , Animals , Arabidopsis , Drosophila melanogaster , Escherichia coli , Saccharomyces cerevisiae , Scientific Experimental Error , Sequence Alignment