Search | BVS Violence and Health

1.

Improved sequence mapping using a complete reference genome and lift-over.

Chen, Nae-Chyun; Paulin, Luis F; Sedlazeck, Fritz J; Koren, Sergey; Phillippy, Adam M; Langmead, Ben.

Nat Methods ; 21(1): 41-49, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38036856

ABSTRACT

Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, Genome Reference Consortium (GRC)h38) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are lower quality.
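The lift-over idea in this abstract can be pictured as a lookup in a table of aligned blocks between two assemblies. The following is a minimal sketch under stated assumptions, not levioSAM2's implementation: the block coordinates, the `BLOCKS` table, and the function name `lift_position` are all hypothetical, and real whole-genome maps also carry chromosome names and strand.

```python
from bisect import bisect_right

# Hypothetical aligned blocks between a source and a target assembly:
# (source_start, source_end, target_start), 0-based, half-open, sorted by source_start.
BLOCKS = [(0, 100, 0), (100, 250, 130), (300, 400, 280)]


def lift_position(pos, blocks=BLOCKS):
    """Map a source coordinate to the target assembly, or return None if unmapped."""
    starts = [block[0] for block in blocks]
    # Find the last block whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    src_start, src_end, dst_start = blocks[i]
    if pos >= src_end:
        # pos falls in a gap between aligned blocks.
        return None
    return dst_start + (pos - src_start)


if __name__ == "__main__":
    for p in (50, 120, 270, 300):
        print(p, "->", lift_position(p))
```

Positions that fall between blocks return `None` here; how a real tool treats such positions (for example, by re-aligning the read) is outside this sketch.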

Subjects

Genome , Genomics , Sequence Analysis, DNA/methods , Genomics/methods , Chromosome Mapping , High-Throughput Nucleotide Sequencing

2.

Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.

Baker, Daniel N; Langmead, Ben.

Genome Res ; 33(7): 1218-1227, 2023 07.

Article in English | MEDLINE | ID: mdl-37414575

ABSTRACT

A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. Here, we describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL) but discards use of leading zero count in favor of a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. It achieves superior similarity estimates for the Jaccard coefficient and average nucleotide identity compared with the original Dashing, but in much less time while using the same-sized sketch. Dashing 2 is a free, open source software.
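To make the notion of a genomic sketch concrete, the following is a minimal sketch of a bottom-s MinHash over k-mers with a Jaccard estimate. It is a swapped-in, simpler technique: it is not SetSketch, HyperLogLog, or ProbMinHash, and it ignores k-mer multiplicities. The function names, the choice of BLAKE2b as the hash, and the default `k` and `size` values are assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=4, size=16):
    """Keep the `size` smallest hash values of the sequence's k-mers."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    a = bottom_sketch("ACGTACGTGGCA")
    b = bottom_sketch("ACGTACGTTTGA")
    print(jaccard_estimate(a, b))
```

With inputs this short the sketch holds every k-mer hash, so the value is exact; it becomes an estimate only when the k-mer set is larger than `size`.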

Subjects

Genomics , Software , Genomics/methods , Genome , Nucleotides , Algorithms , Sequence Analysis, DNA/methods

3.

Efficient taxa identification using a pangenome index.

Ahmed, Omar; Rossi, Massimiliano; Boucher, Christina; Langmead, Ben.

Genome Res ; 33(7): 1069-1077, 2023 07.

Article in English | MEDLINE | ID: mdl-37258301

ABSTRACT

Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
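The quantity r in this abstract, the number of Burrows-Wheeler runs, can be illustrated on a toy string. The following is a minimal sketch that builds the Burrows-Wheeler transform by sorting rotations and counts maximal runs of equal characters. It is not the r-index or the document-listing structure described above; the function names are hypothetical, and the rotation sort is only practical for very short inputs.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, which must end in a unique '$' sentinel."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of identical characters in s."""
    if not s:
        return 0
    return 1 + sum(1 for prev, curr in zip(s, s[1:]) if prev != curr)


if __name__ == "__main__":
    transformed = bwt("banana$")
    print(transformed, count_runs(transformed))
```

Repetitive collections, such as many genomes of one species, tend to produce long runs in the transform, which is why an index whose size depends on r rather than on total length can stay small.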

Subjects

Algorithms , Bacteria , Sequence Analysis , Bacteria/genetics

4.

Sigmoni: classification of nanopore signal with a compressed pangenome index.

Shivakumar, Vikram S; Ahmed, Omar Y; Kovaka, Sam; Zakeri, Mohsen; Langmead, Ben.

Bioinformatics ; 40(Suppl 1): i287-i296, 2024 06 28.

Article in English | MEDLINE | ID: mdl-38940135

ABSTRACT

SUMMARY: Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications. AVAILABILITY AND IMPLEMENTATION: Sigmoni is implemented in Python, and is available open-source at https://github.com/vshiv18/sigmoni.
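The quantization step mentioned in this abstract, mapping signal values to a discrete alphabet of picoamp ranges, can be sketched with equal-width bins. This is an illustration under assumptions, not Sigmoni's code: the range limits (60 to 120 pA), the number of bins, the letter alphabet, and the function name `quantize` are all hypothetical, and the actual tool may normalize or bin the signal differently.

```python
def quantize(signal, lo=60.0, hi=120.0, n_bins=6):
    """Map each current value (in pA) to a letter naming its equal-width bin."""
    width = (hi - lo) / n_bins
    letters = []
    for value in signal:
        index = int((value - lo) // width)
        # Clamp values outside [lo, hi) into the first or last bin.
        index = max(0, min(n_bins - 1, index))
        letters.append(chr(ord("A") + index))
    return "".join(letters)


if __name__ == "__main__":
    print(quantize([60.0, 75.0, 119.9, 130.0, 50.0]))
```

Once the signal is a string over a small alphabet, it can be matched against a text index in the same way as a nucleotide sequence.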

Subjects

Algorithms , Humans , Nanopore Sequencing/methods , Software , Nanopores , Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods

5.

Cloud computing for genomic data analysis and collaboration.

Langmead, Ben; Nellore, Abhinav.

Nat Rev Genet ; 19(4): 208-219, 2018 04.

Article in English | MEDLINE | ID: mdl-29379135

ABSTRACT

Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.

Subjects

Cloud Computing , Genomics , High-Throughput Nucleotide Sequencing , Internet , Computational Biology , Humans

6.

Cloud computing for genomic data analysis and collaboration.

Langmead, Ben; Nellore, Abhinav.

Nat Rev Genet ; 19(5): 325, 2018 05.

Article in English | MEDLINE | ID: mdl-29430012

ABSTRACT

This corrects the article DOI: 10.1038/nrg.2017.113.

7.

Recounting the FANTOM CAGE-Associated Transcriptome.

Imada, Eddie Luidy; Sanchez, Diego Fernando; Collado-Torres, Leonardo; Wilks, Christopher; Matam, Tejasvi; Dinalankara, Wikum; Stupnikov, Aleksey; Lobo-Pereira, Francisco; Yip, Chi-Wai; Yasuzawa, Kayoko; Kondo, Naoto; Itoh, Masayoshi; Suzuki, Harukazu; Kasukawa, Takeya; Hon, Chung-Chau; de Hoon, Michiel J L; Shin, Jay W; Carninci, Piero; Jaffe, Andrew E; Leek, Jeffrey T; Favorov, Alexander; Franco, Gloria R; Langmead, Ben; Marchionni, Luigi.

Genome Res ; 30(7): 1073-1081, 2020 07.

Article in English | MEDLINE | ID: mdl-32079618

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.

Subjects

Transcriptome , Databases, Genetic , Enhancer Elements, Genetic , Gene Expression Profiling , Genome, Human , Humans , Neoplasms/genetics , Organ Specificity , Prognosis , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism

8.

A scalable and unbiased discordance metric with $H_{+}$.

Dyjack, Nathan; Baker, Daniel N; Braverman, Vladimir; Langmead, Ben; Hicks, Stephanie C.

Biostatistics ; 2022 Sep 05.

Article in English | MEDLINE | ID: mdl-36063544

ABSTRACT

A standard unsupervised analysis is to cluster observations into discrete groups using a dissimilarity measure, such as Euclidean distance. If there does not exist a ground-truth label for each observation necessary for external validity metrics, then internal validity metrics, such as the tightness or separation of the clusters, are often used. However, the interpretation of these internal metrics can be problematic when using different dissimilarity measures as they have different magnitudes and ranges of values that they span. To address this problem, previous work introduced the "scale-agnostic" $G_{+}$ discordance metric; however, this internal metric is slow to calculate for large data. Furthermore, in the setting of unsupervised clustering with $k$ groups, we show that $G_{+}$ varies as a function of the proportion of observations assigned to each of the groups (or clusters), referred to as the group balance, which is an undesirable property. To address this problem, we propose a modification of $G_{+}$, referred to as $H_{+}$, and demonstrate that $H_{+}$ does not vary as a function of group balance using a simulation study and with public single-cell RNA-sequencing data. Finally, we provide scalable approaches to estimate $H_{+}$, which are available in the $\mathtt{fasthplus}$ R package.

9.

Two-stage linked component analysis for joint decomposition of multiple biologically related data sets.

Chen, Huan; Caffo, Brian; Stein-O'Brien, Genevieve; Liu, Jinrui; Langmead, Ben; Colantuoni, Carlo; Xiao, Luo.

Biostatistics ; 23(4): 1200-1217, 2022 10 14.

Article in English | MEDLINE | ID: mdl-35358296

ABSTRACT

Integrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high throughput biological data being generated. In particular such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets.

Subjects

Transcriptome , Computer Simulation , Humans

10.

LevioSAM: fast lift-over of variant-aware reference alignments.

Mun, Taher; Chen, Nae-Chyun; Langmead, Ben.

Bioinformatics ; 37(22): 4243-4245, 2021 11 18.

Article in English | MEDLINE | ID: mdl-34037690

ABSTRACT

MOTIVATION: As more population genetics datasets and population-specific references become available, the task of translating ('lifting') read alignments from one reference coordinate system to another is becoming more common. Existing tools generally require a chain file, whereas VCF files are the more common way to represent variation. Existing tools also do not make effective use of threads, creating a post-alignment bottleneck. RESULTS: LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads. When run downstream of a read aligner, levioSAM is more than 7 times faster than an aligner when both are run with 16 threads. AVAILABILITY AND IMPLEMENTATION: Software Package: https://github.com/alshai/levioSAM, Experiments: https://github.com/langmead-lab/levioSAM-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
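Lifting a coordinate with variant information, as opposed to a chain file, amounts to accumulating the length changes of the insertions and deletions that lie upstream of the position. The following is a minimal sketch of that arithmetic, assuming a sorted, non-overlapping list of hypothetical `(ref_pos, ref_len, alt_len)` tuples rather than a parsed VCF. It is not levioSAM's implementation, it does not use succinct data structures, and it does not handle positions that fall inside a variant.

```python
def lift_with_variants(pos, variants):
    """Shift a 0-based reference position by the indels that end at or before it.

    variants: sorted, non-overlapping (ref_pos, ref_len, alt_len) tuples.
    """
    offset = 0
    for ref_pos, ref_len, alt_len in variants:
        if ref_pos + ref_len <= pos:
            # The whole variant lies upstream, so its length change applies.
            offset += alt_len - ref_len
        else:
            break
    return pos + offset


if __name__ == "__main__":
    # Hypothetical variants: a 2-base insertion at 10 and a 3-base deletion at 20.
    example = [(10, 1, 3), (20, 4, 1)]
    for p in (5, 15, 30):
        print(p, "->", lift_with_variants(p, example))
```

A linear scan is enough for a toy list; for genome-scale variant sets a precomputed prefix sum with binary search would be the natural next step.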

Subjects

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA

11.

Megadepth: efficient coverage quantification for BigWigs and BAMs.

Wilks, Christopher; Ahmed, Omar; Baker, Daniel N; Zhang, David; Collado-Torres, Leonardo; Langmead, Ben.

Bioinformatics ; 37(18): 3014-3016, 2021 09 29.

Article in English | MEDLINE | ID: mdl-33693500

ABSTRACT

MOTIVATION: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. RESULTS: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19 000 GTExV8 BigWig files in approximately 1 h using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. AVAILABILITY AND IMPLEMENTATION: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/megadepth. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
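Summarizing coverage within many intervals, the task this abstract describes, reduces to range sums over a per-base coverage vector. The following is a minimal sketch using a prefix-sum array so that each interval costs constant time. It works on an in-memory list, not on BigWig or BAM files, and the function name `interval_sums` is hypothetical; it is not Megadepth's algorithm.

```python
from itertools import accumulate


def interval_sums(coverage, intervals):
    """Sum per-base coverage over 0-based, half-open (start, end) intervals."""
    # prefix[i] holds the total coverage of bases 0..i-1.
    prefix = [0] + list(accumulate(coverage))
    return [prefix[end] - prefix[start] for start, end in intervals]


if __name__ == "__main__":
    coverage = [0, 2, 3, 3, 1, 0]
    print(interval_sums(coverage, [(1, 4), (4, 6)]))
```

Dividing each sum by the interval length gives mean coverage, another common per-interval summary.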

Subjects

Genome , Genomics , Software , Molecular Sequence Annotation

12.

Vargas: heuristic-free alignment for assessing linear and graph read aligners.

Darby, Charlotte A; Gaddipati, Ravi; Schatz, Michael C; Langmead, Ben.

Bioinformatics ; 36(12): 3712-3718, 2020 06 01.

Article in English | MEDLINE | ID: mdl-32321164

ABSTRACT

MOTIVATION: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. RESULTS: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these 'gold standard' Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-maximal exact match and vg to align more reads correctly. AVAILABILITY AND IMPLEMENTATION: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
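Heuristic-free alignment means filling the full dynamic-programming matrix, so that the best score is guaranteed rather than approximated. The following is a minimal sketch of local (Smith-Waterman) scoring against a linear reference. It is a simplification, not Vargas: it uses a linear gap penalty instead of affine gaps, has no quality-scaled mismatches, no graph support, and no SIMD, and the scoring values and function name are assumptions chosen for illustration.

```python
def local_alignment_score(read, ref, match=2, mismatch=-3, gap=-4):
    """Best local alignment score of read against ref using full dynamic programming."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            # Each cell is one "cell update": the best of restarting,
            # extending diagonally, or opening a gap in either sequence.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "TTACGTTT"))
```

The nested loop touches every cell of the read-by-reference matrix, which is why throughput for this kind of aligner is reported in cell updates per second.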

Subjects

Heuristics , High-Throughput Nucleotide Sequencing , Algorithms , Genomics , Sequence Analysis, DNA , Software

13.

Analyzing whole genome bisulfite sequencing data from highly divergent genotypes.

Wulfridge, Phillip; Langmead, Ben; Feinberg, Andrew P; Hansen, Kasper D.

Nucleic Acids Res ; 47(19): e117, 2019 11 04.

Article in English | MEDLINE | ID: mdl-31392989

ABSTRACT

In the study of DNA methylation, genetic variation between species, strains or individuals can result in CpG sites that are exclusive to a subset of samples, and insertions and deletions can rearrange the spatial distribution of CpGs. How to account for this variation in an analysis of the interplay between sequence variation and DNA methylation is not well understood, especially when the number of CpG differences between samples is large. Here, we use whole-genome bisulfite sequencing data on two highly divergent mouse strains to study this problem. We show that alignment to personal genomes is necessary for valid methylation quantification. We introduce a method for including strain-specific CpGs in differential analysis, and show that this increases power. We apply our method to a human normal-cancer dataset, and show this improves accuracy and power, illustrating the broad applicability of our approach. Our method uses smoothing to impute methylation levels at strain-specific sites, thereby allowing strain-specific CpGs to contribute to the analysis, while accounting for differences in the spatial occurrences of CpGs. Our results have implications for joint analysis of genetic variation and DNA methylation using bisulfite-converted DNA, and unlock the use of personal genomes for addressing this question.

Subjects

Genetic Variation/genetics , High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Animals , CpG Islands/genetics , DNA Methylation/genetics , Epigenesis, Genetic , Genome, Human/genetics , Genotype , Humans , Mice , Sequence Analysis, DNA

14.

Scaling read aligners to hundreds of threads on general-purpose processors.

Langmead, Ben; Wilks, Christopher; Antonescu, Valentin; Charles, Rone.

Bioinformatics ; 35(3): 421-432, 2019 02 01.

Article in English | MEDLINE | ID: mdl-30020410

ABSTRACT

Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling. Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie: http://bowtie-bio.sourceforge.net. Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2. HISAT: http://www.ccb.jhu.edu/software/hisat. Supplementary information: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Genomics , Software , Computer Systems

15.

Integrated Transcriptomic and Proteomic Analysis of Primary Human Umbilical Vein Endothelial Cells.

Madugundu, Anil K; Na, Chan Hyun; Nirujogi, Raja Sekhar; Renuse, Santosh; Kim, Kwang Pyo; Burns, Kathleen H; Wilks, Christopher; Langmead, Ben; Ellis, Shannon E; Collado-Torres, Leonardo; Halushka, Marc K; Kim, Min-Sik; Pandey, Akhilesh.

Proteomics ; 19(15): e1800315, 2019 08.

Article in English | MEDLINE | ID: mdl-30983154

ABSTRACT

Understanding the molecular profile of every human cell type is essential for understanding its role in normal physiology and disease. Technological advancements in DNA sequencing, mass spectrometry, and computational methods allow us to carry out multiomics analyses, although such approaches are not routine yet. Human umbilical vein endothelial cells (HUVECs) are a widely used model system to study pathological and physiological processes associated with the cardiovascular system. In this study, next-generation sequencing and high-resolution mass spectrometry are employed to profile the transcriptome and proteome of primary HUVECs. Analysis of 145 million paired-end reads from next-generation sequencing confirmed expression of 12 186 protein-coding genes (FPKM ≥0.1), 439 novel long non-coding RNAs, and revealed 6089 novel isoforms that were not annotated in GENCODE. Proteomics analysis identifies 6477 proteins including confirmation of N-termini for 1091 proteins, isoforms for 149 proteins, and 1034 phosphosites. A database search to specifically identify other post-translational modifications provides evidence for a number of modification sites on 117 proteins which include ubiquitylation, lysine acetylation, and mono-, di- and tri-methylation events. Evidence for 11 "missing proteins," which are proteins for which there was insufficient or no protein level evidence, is provided. Peptides supporting missing protein and novel events are validated by comparison of MS/MS fragmentation patterns with synthetic peptides. Finally, 245 variant peptides derived from 207 expressed proteins in addition to alternate translational start sites for seven proteins and evidence for novel proteoforms for five proteins resulting from alternative splicing are identified. Overall, it is believed that the integrated approach employed in this study is widely applicable to study any primary cell type for deeper molecular characterization.

Subjects

Proteomics/methods , Transcriptome/genetics , Alternative Splicing/genetics , Human Umbilical Vein Endothelial Cells , Humans

16.

Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples.

Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav; Langmead, Ben.

Bioinformatics ; 34(1): 114-116, 2018 01 01.

Article in English | MEDLINE | ID: mdl-28968689

ABSTRACT

Motivation: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results: Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation: Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Subjects

High-Throughput Nucleotide Sequencing/methods , RNA Splicing , Sequence Analysis, RNA/methods , Software , Exons , Humans

17.

Flexible expressed region analysis for RNA-seq with derfinder.

Collado-Torres, Leonardo; Nellore, Abhinav; Frazee, Alyssa C; Wilks, Christopher; Love, Michael I; Langmead, Ben; Irizarry, Rafael A; Leek, Jeffrey T; Jaffe, Andrew E.

Nucleic Acids Res ; 45(2): e9, 2017 01 25.

Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
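The expressed-region idea, finding contiguous stretches of the genome whose coverage exceeds a cutoff without consulting any annotation, can be sketched in a few lines. The following is an illustration under assumptions: it is written in Python rather than R, it operates on a single hypothetical coverage vector instead of a mean across samples, and it is not the derfinder package's code or its bump-hunting statistic.

```python
def expressed_regions(coverage, cutoff):
    """Return 0-based, half-open (start, end) runs where coverage > cutoff."""
    regions = []
    start = None
    for i, value in enumerate(coverage):
        if value > cutoff and start is None:
            start = i  # a region opens here
        elif value <= cutoff and start is not None:
            regions.append((start, i))  # the region closed at the previous base
            start = None
    if start is not None:
        # The last region runs to the end of the vector.
        regions.append((start, len(coverage)))
    return regions


if __name__ == "__main__":
    print(expressed_regions([0, 5, 6, 0, 0, 7, 8, 9], cutoff=4))
```

Regions found this way can then be compared between groups of samples, which is what makes the approach independent of existing gene models.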

Subjects

Gene Expression Profiling/methods , Software , Gene Expression Regulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome , Web Browser

18.

Widespread splicing of repetitive element loci into coding regions of gene transcripts.

Darby, Miranda M; Leek, Jeffrey T; Langmead, Ben; Yolken, Robert H; Sabunciyan, Sarven.

Hum Mol Genet ; 25(22): 4962-4982, 2016 11 15.

Article in English | MEDLINE | ID: mdl-28171598

ABSTRACT

We performed a thorough characterization of expressed repetitive element loci (RE) in the human orbitofrontal cortex (OFC) using directional RNA sequencing data. Considering only sequencing reads that map uniquely onto the human genome, we discovered that the overwhelming majority of intronic and exonic RE are expressed in the same orientation as the gene in which they reside. Our mapping approach enabled the identification of novel differentially expressed RE transcripts between the OFC and peripheral blood lymphocytes. Further analysis revealed that RE are extensively spliced into coding regions of gene transcripts yielding thousands of novel mRNA variants with altered coding potential. Lower frequency splicing of RE into untranslated regions of gene transcripts was also observed. The same pattern of RE splicing in the brain was also detected for Drosophila, zebrafish, mouse, rat, dog and rabbit. RE splicing occurs largely at canonical GT-AG splice junctions with LINE and SINE elements forming the most RE splice junctions in the human OFC. This type of splicing usually gives rise to a minor splice variant of the endogenous gene and in silico analysis suggests that RE splicing has the potential to introduce novel open reading frames. Reanalysis of previously published sequencing data performed in the mouse cerebellum revealed that thousands of RE splice variants are associated with translating ribosomes. Our results demonstrate that RE expression is more complex than previously envisioned and raise the possibility that RE splicing might generate functional protein isoforms.

Subjects

Interspersed Repetitive Sequences/genetics , RNA Splice Sites/genetics , RNA Splicing/genetics , Alternative Splicing/genetics , Animals , Base Sequence , Brain/metabolism , DNA/genetics , Exons , Gene Expression Profiling/methods , Genome/genetics , Humans , Introns , Open Reading Frames/genetics , Prefrontal Cortex/metabolism , Protein Isoforms/genetics , RNA, Messenger/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA , Untranslated Regions/genetics

19.

Alignment of Next-Generation Sequencing Reads.

Reinert, Knut; Langmead, Ben; Weese, David; Evers, Dirk J.

Annu Rev Genomics Hum Genet ; 16: 133-51, 2015.

Article in English | MEDLINE | ID: mdl-25939052

ABSTRACT

High-throughput DNA sequencing has considerably changed the possibilities for conducting biomedical research by measuring billions of short DNA or RNA fragments. A central computational problem, and for many applications a first step, consists of determining where the fragments came from in the original genome. In this article, we review the main techniques for generating the fragments, the main applications, and the main algorithmic ideas for computing a solution to the read alignment problem. In addition, we describe pitfalls and difficulties connected to determining the correct positions of reads.

Subjects

Algorithms , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Genome , Polyploidy , Repetitive Sequences, Nucleic Acid , Software

20.

HISAT: a fast spliced aligner with low memory requirements.

Kim, Daehwan; Langmead, Ben; Salzberg, Steven L.

Nat Methods ; 12(4): 357-60, 2015 Apr.

Article in English | MEDLINE | ID: mdl-25751142

ABSTRACT

HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ∼64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.
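The FM index named in this abstract answers "how many times does this pattern occur?" by backward search over the Burrows-Wheeler transform. The following is a minimal, uncompressed sketch of that query on a toy string. It is not HISAT's hierarchical index: there is a single index, no local indexes, no spliced alignment, and the full occurrence table stored here would be far too large for a genome. The function names `build_fm` and `count` are hypothetical.

```python
def build_fm(text):
    """Build a toy FM index (C table, occurrence table, length) for text."""
    text += "$"  # unique sentinel, smaller than any letter
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for a suffix is the one preceding it (wrapping for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(bwt)


def count(pattern, fm):
    """Count occurrences of pattern using backward search."""
    first, occ, n = fm
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    fm = build_fm("banana")
    print(count("ana", fm), count("nab", fm))
```

Each pattern character costs a constant number of table lookups, so query time depends on the pattern length rather than on the size of the indexed text.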

Subjects

Sequence Alignment/methods , Sequence Analysis, DNA/instrumentation , Humans , Limit of Detection , Pseudogenes/genetics , Sequence Analysis, RNA
