RESUMO
Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.
Assuntos
Transcriptoma , Bases de Dados Genéticas , Elementos Facilitadores Genéticos , Perfilação da Expressão Gênica , Genoma Humano , Humanos , Neoplasias/genética , Especificidade de Órgãos , Prognóstico , RNA Longo não Codificante/genética , RNA Longo não Codificante/metabolismo , RNA Mensageiro/metabolismoRESUMO
OBJECTIVES: To perform a systematic review comparing the diagnostic accuracy of MRI vs. CT for assessing pancreatic ductal adenocarcinoma (PDAC) vascular invasion. METHODS: MEDLINE, EMBASE, Cochrane Central, and Scopus were searched until December 2021 for diagnostic accuracy studies comparing MRI vs. CT to evaluate vascular invasion of pathologically confirmed PDAC in the same patients. Findings on resection or exploratory laparotomy were the preferred reference standard. Data extraction, risk of bias, and applicability assessment were performed by two authors using the Quality Assessment of Diagnostic Accuracy Studies-Comparative Tool. Bivariate random-effects meta-analysis and meta-regression were performed with 95% confidence intervals (95% CI). RESULTS: Three studies were included assessing 474 vessels without vascular invasion and 65 with vascular invasion in 107 patients. All patients were imaged using MRI at ≥ 1.5 T and a pancreatic protocol CT. No difference was shown between MRI and CT for diagnosing PDAC vascular invasion: MRI/CT sensitivity (95% CI) were 71% (47-87%)/74% (56-86%), and specificity were 97% (94-99%)/97% (94-98%). Sources of bias included selection bias from only a subset of CT patients undergoing MRI and verification bias from patients with unresectable disease not confirmed on surgery. No patients received neoadjuvant therapy prior to staging. CONCLUSIONS: Based on limited data, no difference was observed between MRI and pancreatic protocol CT for PDAC vascular invasion assessment. MRI may be an adequate substitute for pancreatic protocol CT in some patients, particularly those who have already had a single-phase CT. Larger and more recent cohort studies at low risk of bias, including patients who have received neoadjuvant therapy, are needed. CLINICAL RELEVANCE STATEMENT: Abdominal MRI performed similarly to pancreatic protocol CT at assessing pancreatic ductal adenocarcinoma vascular invasion, suggesting local staging is adequate in some patients using MRI. More data are needed using larger, more recent cohorts including patients with neoadjuvant treatment. KEY POINTS: ⢠Based on limited data, no difference was found between MRI and pancreatic protocol CT sensitivity and specificity for diagnosing PDAC vascular invasion (p = 0.81, 0.73 respectively). ⢠Risk of bias could be reduced in future PDAC MRI vs CT comparative diagnostic test accuracy research by ensuring all enrolled patients undergo both imaging modalities being compared in random order and regardless of the findings on either modality. ⢠More studies are needed that directly compare the diagnostic performance of MRI and CT for PDAC staging after neoadjuvant therapy.
Assuntos
Adenocarcinoma , Carcinoma Ductal Pancreático , Neoplasias Pancreáticas , Humanos , Neoplasias Pancreáticas/patologia , Adenocarcinoma/diagnóstico por imagem , Tomografia Computadorizada por Raios X/métodos , Imageamento por Ressonância Magnética , Carcinoma Ductal Pancreático/diagnóstico por imagem , Sensibilidade e Especificidade , Testes Diagnósticos de Rotina , Neoplasias PancreáticasRESUMO
MOTIVATION: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. RESULTS: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19 000 GTExV8 BigWig files in approximately 1 h using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. AVAILABILITY AND IMPLEMENTATION: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/megadepth. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Genoma , Genômica , Software , Anotação de Sequência MolecularRESUMO
Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling. Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie: http://bowtie-bio.sourceforge.net. Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2. HISAT: http://www.ccb.jhu.edu/software/hisat. Supplementary information: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Genômica , Software , Sistemas ComputacionaisRESUMO
Understanding the molecular profile of every human cell type is essential for understanding its role in normal physiology and disease. Technological advancements in DNA sequencing, mass spectrometry, and computational methods allow us to carry out multiomics analyses although such approaches are not routine yet. Human umbilical vein endothelial cells (HUVECs) are a widely used model system to study pathological and physiological processes associated with the cardiovascular system. In this study, next-generation sequencing and high-resolution mass spectrometry to profile the transcriptome and proteome of primary HUVECs is employed. Analysis of 145 million paired-end reads from next-generation sequencing confirmed expression of 12 186 protein-coding genes (FPKM ≥0.1), 439 novel long non-coding RNAs, and revealed 6089 novel isoforms that were not annotated in GENCODE. Proteomics analysis identifies 6477 proteins including confirmation of N-termini for 1091 proteins, isoforms for 149 proteins, and 1034 phosphosites. A database search to specifically identify other post-translational modifications provide evidence for a number of modification sites on 117 proteins which include ubiquitylation, lysine acetylation, and mono-, di- and tri-methylation events. Evidence for 11 "missing proteins," which are proteins for which there was insufficient or no protein level evidence, is provided. Peptides supporting missing protein and novel events are validated by comparison of MS/MS fragmentation patterns with synthetic peptides. Finally, 245 variant peptides derived from 207 expressed proteins in addition to alternate translational start sites for seven proteins and evidence for novel proteoforms for five proteins resulting from alternative splicing are identified. Overall, it is believed that the integrated approach employed in this study is widely applicable to study any primary cell type for deeper molecular characterization.
Assuntos
Proteômica/métodos , Transcriptoma/genética , Processamento Alternativo/genética , Células Endoteliais da Veia Umbilical Humana , HumanosRESUMO
Motivation: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results: Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation: Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Splicing de RNA , Análise de Sequência de RNA/métodos , Software , Éxons , HumanosRESUMO
Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly.We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and is nearly as powerful as feature-level methods when the annotation is complete.derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
Assuntos
Perfilação da Expressão Gênica/métodos , Software , Regulação da Expressão Gênica , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Anotação de Sequência Molecular , Especificidade de Órgãos/genética , Transcriptoma , NavegadorRESUMO
MOTIVATION: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. RESULTS: We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. Rail-RNA outputs alignments in SAM/BAM format; but it also outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in samples in which a given feature is found. Supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is open-source software available at http://rail.bio. CONTACTS: anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Splicing de RNA , Alinhamento de Sequência/métodos , Análise de Sequência de RNA/métodos , Software , Éxons , Perfilação da Expressão GênicaRESUMO
MOTIVATION: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. RESULTS: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is available from http://rail.bio Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/ CONTACTS: : anellore@gmail.com or langmea@cs.jhu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Biologia Computacional , Bases de Dados Genéticas , Software , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , RNA , Reprodutibilidade dos TestesRESUMO
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.
Assuntos
Arabidopsis/genética , Bases de Dados Genéticas , Genes de Plantas , Anotação de Sequência Molecular , Arabidopsis/metabolismo , Proteínas de Arabidopsis/genética , Genoma de Planta , SoftwareRESUMO
Alternative polyadenylation (APA) is an important post-transcriptional mechanism that has major implications in biological processes and diseases. Although specialized sequencing methods for polyadenylation exist, availability of these data are limited compared to RNA-sequencing data. We developed REPAC, a framework for the analysis of APA from RNA-sequencing data. Using REPAC, we investigate the landscape of APA caused by activation of B cells. We also show that REPAC is faster than alternative methods by at least 7-fold and that it scales well to hundreds of samples. Overall, the REPAC method offers an accurate, easy, and convenient solution for the exploration of APA.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Poliadenilação , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Regiões 3' não Traduzidas , RNA Mensageiro , Análise de Sequência de RNA/métodosRESUMO
We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio .
Assuntos
Splicing de RNA , RNA-Seq/métodos , RNA/genética , Animais , Sequência de Bases , Biologia Computacional/métodos , Éxons , Regulação da Expressão Gênica , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Camundongos , Análise de Sequência de RNA/métodos , SoftwareRESUMO
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is the model organism database for the fully sequenced and intensively studied model plant Arabidopsis thaliana. Data in TAIR is derived in large part from manual curation of the Arabidopsis research literature and direct submissions from the research community. New developments at TAIR include the addition of the GBrowse genome viewer to the TAIR site, a redesigned home page, navigation structure and portal pages to make the site more intuitive and easier to use, the launch of several TAIR web services and a new genome annotation release (TAIR7) in April 2007. A combination of manual and computational methods were used to generate this release, which contains 27,029 protein-coding genes, 3889 pseudogenes or transposable elements and 1123 ncRNAs (32,041 genes in all, 37,019 gene models). A total of 681 new genes and 1002 new splice variants were added. Overall, 10,098 loci (one-third of all loci from the previous TAIR6 release) were updated for the TAIR7 release.
Assuntos
Proteínas de Arabidopsis/genética , Arabidopsis/genética , Bases de Dados Genéticas , Processamento Alternativo , Genes de Plantas , Genoma de Planta , Genômica , Internet , RNA não Traduzido/genética , Interface Usuário-Computador , Vocabulário ControladoRESUMO
Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here, we present ASCOT, a resource that uses annotation-free methods to rapidly analyze and visualize splice variants across tens of thousands of bulk and single-cell data sets in the public archive. To demonstrate the utility of ASCOT, we identify novel cell type-specific alternative exons across the nervous system and leverage ENCODE and GTEx data sets to study the unique splicing of photoreceptors. We find that PTBP1 knockdown and MSI1 and PCBP2 overexpression are sufficient to activate many photoreceptor-specific exons in HepG2 liver cancer cells. This work demonstrates how large-scale analysis of public RNA-Seq data sets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events.
Assuntos
Processamento Alternativo/genética , Biologia Computacional/métodos , Análise de Dados , Células Fotorreceptoras/citologia , Sítios de Splice de RNA/genética , Animais , Linhagem Celular Tumoral , Expressão Gênica/genética , Células Hep G2 , Ribonucleoproteínas Nucleares Heterogêneas/genética , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Neoplasias Hepáticas/genética , Camundongos , Proteínas do Tecido Nervoso/biossíntese , Proteínas do Tecido Nervoso/genética , Neurônios/citologia , Proteína de Ligação a Regiões Ricas em Polipirimidinas/genética , Proteínas de Ligação a RNA/biossíntese , Proteínas de Ligação a RNA/genética , Retina/citologia , Análise de Sequência de RNA/métodosRESUMO
The Cancer Genomics Hub (CGHub) is the online repository of the sequencing programs of the National Cancer Institute (NCI), including The Cancer Genomics Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects, with data from 25 different types of cancer. The CGHub currently contains >1.4 PB of data, has grown at an average rate of 50 TB a month and serves >100 TB per week. The architecture of CGHub is designed to support bulk searching and downloading through a Web-accessible application programming interface, enforce patient genome confidentiality in data storage and transmission and optimize for efficiency in access and transfer. In this article, we describe the design of these three components, present performance results for our transfer protocol, GeneTorrent, and finally report on the growth of the system in terms of data stored and transferred, including estimated limits on the current architecture. Our experienced-based estimates suggest that centralizing storage and computational resources is more efficient than wide distribution across many satellite labs. Database URL: https://cghub.ucsc.edu.