Search | VHL Regional Portal

1.

Detecting differential transcript usage in complex diseases with SPIT.

Erdogdu, Beril; Varabyou, Ales; Hicks, Stephanie C; Salzberg, Steven L; Pertea, Mihaela.

Cell Rep Methods ; 4(3): 100736, 2024 Mar 25.

Article in English | MEDLINE | ID: mdl-38508189

ABSTRACT

Differential transcript usage (DTU) plays a crucial role in determining how gene expression differs among cells, tissues, and developmental stages, contributing to the complexity and diversity of biological systems. In abnormal cells, it can also lead to deficiencies in protein function and underpin disease pathogenesis. Analyzing DTU via RNA sequencing (RNA-seq) data is vital, but the genetic heterogeneity in populations with complex diseases presents an intricate challenge due to diverse causal events and undetermined subtypes. Although the majority of common diseases in humans are categorized as complex, state-of-the-art DTU analysis methods often overlook this heterogeneity in their models. We therefore developed SPIT, a statistical tool that identifies predominant subgroups in transcript usage within a population along with their distinctive sets of DTU events. This study provides comprehensive assessments of SPIT's methodology and applies it to analyze brain samples from individuals with schizophrenia, revealing previously unreported DTU events in six candidate genes.

Subject(s)

Gene Expression Profiling , RNA , Humans , Gene Expression Profiling/methods , Sequence Analysis, RNA
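
Illustrative note on record 1: the abstract refers to transcript usage without defining it. A minimal sketch, assuming the common definition of usage as each transcript's share of its gene's total expression, is given below. This is not SPIT's statistical method, and the function name, transcript IDs, and counts are hypothetical.

```python
from collections import defaultdict


def transcript_usage(counts, tx_to_gene):
    """Return each transcript's share of its gene's total count."""
    gene_totals = defaultdict(float)
    for tx, count in counts.items():
        gene_totals[tx_to_gene[tx]] += count
    usage = {}
    for tx, count in counts.items():
        total = gene_totals[tx_to_gene[tx]]
        # Genes with no reads get a usage of 0.0 rather than a division error.
        usage[tx] = count / total if total > 0 else 0.0
    return usage


# Hypothetical counts for two transcripts of geneA and one of geneB.
tx_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
control = transcript_usage({"tx1": 80, "tx2": 20, "tx3": 50}, tx_to_gene)
case = transcript_usage({"tx1": 30, "tx2": 70, "tx3": 55}, tx_to_gene)

# A large change in usage between groups is the kind of signal a DTU test evaluates.
shift = {tx: case[tx] - control[tx] for tx in tx_to_gene}
```

In this toy input, geneA switches its dominant transcript between groups while geneB, which has a single transcript, shows no change in usage even though its raw count differs.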

2.

Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage.

Varabyou, Ales; Erdogdu, Beril; Salzberg, Steven L; Pertea, Mihaela.

Nat Comput Sci ; 3(8): 700-708, 2023 Aug.

Article in English | MEDLINE | ID: mdl-38098813

ABSTRACT

ORFanage is a system designed to assign open reading frames (ORFs) to known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA sequencing experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.
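
Illustrative note on record 2: to make the task of assigning an ORF to a transcript concrete, the sketch below scans a transcript sequence for its longest forward-strand ORF. It is a naive baseline under stated assumptions, not ORFanage's approach, which according to the abstract chooses ORFs by maximizing similarity to annotated proteins. The function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand, or None."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # end is exclusive and includes the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


coords = longest_orf("CCATGAAATGGTAACC")
```

A length-based choice like this ignores the reference annotation entirely, which is the gap a similarity-guided method is meant to close.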

3.

The status of the human gene catalogue.

Amaral, Paulo; Carbonell-Sala, Silvia; De La Vega, Francisco M; Faial, Tiago; Frankish, Adam; Gingeras, Thomas; Guigo, Roderic; Harrow, Jennifer L; Hatzigeorgiou, Artemis G; Johnson, Rory; Murphy, Terence D; Pertea, Mihaela; Pruitt, Kim D; Pujar, Shashikant; Takahashi, Hazuki; Ulitsky, Igor; Varabyou, Ales; Wells, Christine A; Yandell, Mark; Carninci, Piero; Salzberg, Steven L.

Nature ; 622(7981): 41-47, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.

Subject(s)

Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics

4.

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure.

Varabyou, Ales; Sommer, Markus J; Erdogdu, Beril; Shinder, Ida; Minkin, Ilia; Chao, Kuan-Hao; Park, Sukhwan; Heinz, Jakob; Pockrandt, Christopher; Shumate, Alaina; Rincon, Natalia; Puiu, Daniela; Steinegger, Martin; Salzberg, Steven L; Pertea, Mihaela.

Genome Biol ; 24(1): 249, 2023 Oct 30.

Article in English | MEDLINE | ID: mdl-37904256

ABSTRACT

CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess.

Subject(s)

Genome, Human , Proteins , Humans , Phylogeny , Proteins/genetics , Algorithms , Software , Molecular Sequence Annotation

5.

Detecting differential transcript usage in complex diseases with SPIT.

Erdogdu, Beril; Varabyou, Ales; Hicks, Stephanie C; Salzberg, Steven L; Pertea, Mihaela.

bioRxiv ; 2023 Jul 10.

Article in English | MEDLINE | ID: mdl-37503064

ABSTRACT

Differential transcript usage (DTU) plays a crucial role in determining how gene expression differs among cells, tissues, and different developmental stages, thereby contributing to the complexity and diversity of biological systems. In abnormal cells, it can also lead to deficiencies in protein function, potentially leading to pathogenesis of diseases. Detecting such events for single-gene genetic traits is relatively uncomplicated; however, the heterogeneity of populations with complex diseases presents an intricate challenge due to the presence of diverse causal events and undetermined subtypes. SPIT is the first statistical tool that quantifies the heterogeneity in transcript usage within a population and identifies predominant subgroups along with their distinctive sets of DTU events. We provide comprehensive assessments of SPIT's methodology in both single-gene and complex traits and report the results of applying SPIT to analyze brain samples from individuals with schizophrenia. Our analysis reveals previously unreported DTU events in six candidate genes.

6.

Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage.

Varabyou, Ales; Erdogdu, Beril; Salzberg, Steven L; Pertea, Mihaela.

bioRxiv ; 2023 Mar 25.

Article in English | MEDLINE | ID: mdl-36993373

ABSTRACT

ORFanage is a system designed to assign open reading frames (ORFs) to both known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA sequencing (RNA-seq) experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the RefSeq and GENCODE human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.

7.

The status of the human gene catalogue.

Amaral, Paulo; Carbonell-Sala, Silvia; De La Vega, Francisco M; Faial, Tiago; Frankish, Adam; Gingeras, Thomas; Guigo, Roderic; Harrow, Jennifer L; Hatzigeorgiou, Artemis G; Johnson, Rory; Murphy, Terence D; Pertea, Mihaela; Pruitt, Kim D; Pujar, Shashikant; Takahashi, Hazuki; Ulitsky, Igor; Varabyou, Ales; Wells, Christine A; Yandell, Mark; Carninci, Piero; Salzberg, Steven L.

ArXiv ; 2023 Mar 24.

Article in English | MEDLINE | ID: mdl-36994150

ABSTRACT

Scientists have been trying to identify all of the genes in the human genome since the initial draft of the genome was published in 2001. Over the intervening years, much progress has been made in identifying protein-coding genes, and the estimated number has shrunk to fewer than 20,000, although the number of distinct protein-coding isoforms has expanded dramatically. The invention of high-throughput RNA sequencing and other technological breakthroughs have led to an explosion in the number of reported non-coding RNA genes, although most of them do not yet have any known function. A combination of recent advances offers a path forward to identifying these functions and towards eventually completing the human gene catalogue. However, much work remains to be done before we have a universal annotation standard that includes all medically significant genes, maintains their relationships with different reference genomes, and describes clinically relevant genetic variants.

8.

Structure-guided isoform identification for the human transcriptome.

Sommer, Markus J; Cha, Sooyoung; Varabyou, Ales; Rincon, Natalia; Park, Sukhwan; Minkin, Ilia; Pertea, Mihaela; Steinegger, Martin; Salzberg, Steven L.

Elife ; 11, 2022 Dec 15.

Article in English | MEDLINE | ID: mdl-36519529

ABSTRACT

Recently developed methods to predict three-dimensional protein structure with high accuracy have opened new avenues for genome and proteome research. We explore a new hypothesis in genome annotation, namely whether computationally predicted structures can help to identify which of multiple possible gene isoforms represents a functional protein product. Guided by protein structure predictions, we evaluated over 230,000 isoforms of human protein-coding genes assembled from over 10,000 RNA sequencing experiments across many human tissues. From this set of assembled transcripts, we identified hundreds of isoforms with more confidently predicted structure and potentially superior function in comparison to canonical isoforms in the latest human gene database. We illustrate our new method with examples where structure provides a guide to function in combination with expression and evolutionary evidence. Additionally, we provide the complete set of structures as a resource to better understand the function of human genes and their isoforms. These results demonstrate the promise of protein structure prediction as a genome annotation tool, allowing us to refine even the most highly curated catalog of human proteins. More generally we demonstrate a practical, structure-guided approach that can be used to enhance the annotation of any genome.

Subject(s)

Genome , Transcriptome , Humans , Molecular Sequence Annotation , Protein Isoforms/genetics , Sequence Analysis, RNA

9.

TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets.

Varabyou, Ales; Pertea, Geo; Pockrandt, Christopher; Pertea, Mihaela.

Bioinformatics ; 37(20): 3650-3651, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33964128

ABSTRACT

SUMMARY: Although the ability to programmatically summarize and visually inspect sequencing data is an integral part of genome analysis, currently available methods are not capable of handling large numbers of samples. In particular, making a visual comparison of transcriptional landscapes between two sets of thousands of RNA-seq samples is limited by available computational resources, which can be overwhelmed due to the sheer size of the data. In this work, we present TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that enables quick visual and computational inspection. TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input. AVAILABILITY AND IMPLEMENTATION: TieBrush is provided as a C++ package under the MIT License. Precompiled binaries, source code and example data are available on GitHub (https://github.com/alevar/tiebrush). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
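
Illustrative note on record 9: the abstract does not describe how reads are aggregated. One simple way to picture aggregation across samples is to merge identical alignments and keep a count for each, as sketched below. This is an assumption made for illustration, not TieBrush's implementation (which the abstract describes as a C++ package). The tuple layout and function name are hypothetical.

```python
from collections import Counter


def collapse_alignments(samples):
    """Merge identical alignments across samples, keeping a count for each."""
    merged = Counter()
    for alignments in samples:
        for chrom, pos, strand, cigar in alignments:
            # Alignments are treated as identical when all four fields match.
            merged[(chrom, pos, strand, cigar)] += 1
    return sorted(merged.items())


# Two hypothetical samples, each a list of (chrom, pos, strand, CIGAR) tuples.
samples = [
    [("chr1", 100, "+", "50M"), ("chr1", 100, "+", "50M")],
    [("chr1", 100, "+", "50M"), ("chr2", 5, "-", "20M30N30M")],
]
summary = collapse_alignments(samples)
```

The output has one entry per distinct alignment, so its size grows with the number of distinct alignments rather than with the total number of reads.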

10.

Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie.

Varabyou, Ales; Pockrandt, Christopher; Salzberg, Steven L; Pertea, Mihaela.

Genetics ; 218(3), 2021 Jul 14.

Article in English | MEDLINE | ID: mdl-33983397

ABSTRACT

The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, in the case of SARS-CoV-2, the low divergence of near-identical genomes sequenced over a short period of time makes conventional analysis infeasible. Using a novel method, we identified 225 anomalous SARS-CoV-2 genomes of likely recombinant origins out of the first 87,695 genomes to be released, several of which have persisted in the population. Bolotie is specifically designed to perform a rapid search for inter-clade recombination events over extremely large datasets, facilitating analysis of novel isolates in seconds. In cases where raw sequencing data were available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. The Bolotie software and other data from our study are available at https://github.com/salzberg-lab/bolotie.

Subject(s)

SARS-CoV-2 , Genome, Viral , Phylogeny , Recombination, Genetic , Software

11.

Identification of microbial agents in tissue specimens of ocular and periocular sarcoidosis using a metagenomics approach.

Shifera, Amde Selassie; Pockrandt, Christopher; Rincon, Natalia; Ge, Yuchen; Lu, Jennifer; Varabyou, Ales; Jedlicka, Anne E; Sun, Karen; Scott, Alan L; Eberhart, Charles; Thorne, Jennifer E; Salzberg, Steven L.

F1000Res ; 10: 820, 2021.

Article in English | MEDLINE | ID: mdl-36212901

ABSTRACT

Background: Metagenomic sequencing has the potential to identify a wide range of pathogens in human tissue samples. Sarcoidosis is a complex disorder whose etiology remains unknown and for which a variety of infectious causes have been hypothesized. We sought to conduct metagenomic sequencing on cases of ocular and periocular sarcoidosis, none of them with previously identified infectious causes. Methods: Archival tissue specimens of 16 subjects with biopsies of ocular and periocular tissues that were positive for non-caseating granulomas were used as cases. Four archival tissue specimens that did not demonstrate non-caseating granulomas were also included as controls. Genomic DNA was extracted from tissue sections. DNA libraries were generated from the extracted genomic DNA and the libraries underwent next-generation sequencing. Results: We generated between 4.8 and 20.7 million reads for each of the 16 cases plus four control samples. For eight of the cases, we identified microbial pathogens that were present well above the background, with one potential pathogen identified for seven of the cases and two possible pathogens for one of the cases. Five of the eight cases were associated with bacteria (Campylobacter concisus, Neisseria elongata, Streptococcus salivarius, Pseudopropionibacterium propionicum, and Paracoccus yeei), two cases with fungi (Exophiala oligosperma, Lomentospora prolificans and Aspergillus versicolor) and one case with a virus (Mupapillomavirus 1). Interestingly, four of the five bacterial species are also part of the human oral microbiome. Conclusions: Using metagenomic sequencing, we identified possible infectious causes in half of the ocular and periocular sarcoidosis cases analyzed. Our findings support the proposition that sarcoidosis could be an etiologically heterogeneous disease. Because these are previously banked samples, direct follow-up in the respective patients is impossible, but these results suggest that sequencing may be a valuable tool in better understanding the etiopathogenesis of sarcoidosis and in diagnosing and treating this disease.

Subject(s)

Microbiota , Sarcoidosis , Bacteria/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Metagenome , Metagenomics/methods , Microbiota/genetics , Sarcoidosis/diagnosis , Sarcoidosis/genetics

12.

Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments.

Varabyou, Ales; Salzberg, Steven L; Pertea, Mihaela.

Genome Res ; 31(2): 301-308, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33361112

ABSTRACT

RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.

13.

Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie.

Varabyou, Ales; Pockrandt, Christopher; Salzberg, Steven L; Pertea, Mihaela.

bioRxiv ; 2020 Sep 21.

Article in English | MEDLINE | ID: mdl-32995774

ABSTRACT

The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, previous methods for detecting recombination and reassortment events cannot handle the computational requirements of analyzing tens of thousands of genomes, a scenario that has now emerged in the effort to track the spread of the SARS-CoV-2 virus. Furthermore, the low divergence of near-identical genomes sequenced in short periods of time presents a statistical challenge not addressed by available methods. In this work we present Bolotie, an efficient method designed to detect recombination and reassortment events between clades of viral genomes. We applied our method to a large collection of SARS-CoV-2 genomes and discovered hundreds of isolates that are likely of a recombinant origin. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. Our findings further show that several recombinants appear to have persisted in the population.

14.

Single-cell transcriptional landscapes reveal HIV-1-driven aberrant host gene transcription as a potential therapeutic target.

Liu, Runxia; Yeh, Yang-Hui Jimmy; Varabyou, Ales; Collora, Jack A; Sherrill-Mix, Scott; Talbot, C Conover; Mehta, Sameet; Albrecht, Kristen; Hao, Haiping; Zhang, Hao; Pollack, Ross A; Beg, Subul A; Calvi, Rachela M; Hu, Jianfei; Durand, Christine M; Ambinder, Richard F; Hoh, Rebecca; Deeks, Steven G; Chiarella, Jennifer; Spudich, Serena; Douek, Daniel C; Bushman, Frederic D; Pertea, Mihaela; Ho, Ya-Chi.

Sci Transl Med ; 12(543), 2020 May 13.

Article in English | MEDLINE | ID: mdl-32404504

ABSTRACT

Understanding HIV-1-host interactions can identify the cellular environment supporting HIV-1 reactivation and mechanisms of clonal expansion. We developed HIV-1 SortSeq to isolate rare HIV-1-infected cells from virally suppressed, HIV-1-infected individuals upon early latency reversal. Single-cell transcriptome analysis of HIV-1 SortSeq+ cells revealed enrichment of nonsense-mediated RNA decay and viral transcription pathways. HIV-1 SortSeq+ cells up-regulated cellular factors that can support HIV-1 transcription (IMPDH1 and JAK1) or promote cellular survival (IL2 and IKBKB). HIV-1-host RNA landscape analysis at the integration site revealed that HIV-1 drives high aberrant host gene transcription downstream, but not upstream, of the integration site through HIV-1-to-host aberrant splicing, in which HIV-1 RNA splices into the host RNA and aberrantly drives host RNA transcription. HIV-1-induced aberrant transcription was driven by the HIV-1 promoter as shown by CRISPR-dCas9-mediated HIV-1-specific activation and could be suppressed by CRISPR-dCas9-mediated inhibition of HIV-1 5' long terminal repeat. Overall, we identified cellular factors supporting HIV-1 reactivation and HIV-1-driven aberrant host gene transcription as potential therapeutic targets to disrupt HIV-1 persistence.

Subject(s)

HIV Infections , HIV-1 , Gene Expression Regulation, Viral , HIV Infections/drug therapy , HIV Infections/genetics , HIV-1/genetics , Humans , Transcription, Genetic , Virus Activation , Virus Latency

15.

CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise.

Pertea, Mihaela; Shumate, Alaina; Pertea, Geo; Varabyou, Ales; Breitwieser, Florian P; Chang, Yu-Chi; Madugundu, Anil K; Pandey, Akhilesh; Salzberg, Steven L.

Genome Biol ; 19(1): 208, 2018 Nov 28.

Article in English | MEDLINE | ID: mdl-30486838

ABSTRACT

We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.

Subject(s)

Databases, Genetic , Sequence Analysis, RNA , Transcription, Genetic , Amino Acid Sequence , Animals , Female , Humans , Introns , Male
