ABSTRACT
BACKGROUND: Alternative splicing is a pivotal mechanism of post-transcriptional modification that contributes to transcriptome plasticity and proteome diversity in metazoan cells. Although many splicing regulatory mechanisms around exon/intron regions are known, the relationship between promoter-bound transcription factors and downstream alternative splicing remains largely unexplored. RESULTS: In this study, we present computational approaches to unravel the regulatory relationship between promoter-bound transcription factor binding sites (TFBSs) and splicing patterns. We curated a fine dataset that includes DNase I hypersensitive site sequencing and transcriptomes across fifteen human tissues from ENCODE. Specifically, we proposed different representations of TF binding context and splicing patterns to examine the associations between the promoter and downstream splicing events. While machine learning models demonstrated potential in predicting splicing patterns based on TFBS occupancies, careful examination with different cross-validation methods revealed limitations in generalizing predictions of the splicing forms of singleton genes across diverse tissues. We further investigated the association between alterations in individual TFBSs at promoters and shifts in exon splicing efficiency. Our results demonstrate that convolutional neural network (CNN) models, trained on TF binding changes in the promoters, can predict changes in splicing patterns. Furthermore, a systematic in silico substitution analysis of the CNN models highlighted several potential splicing regulators. Notably, through empirical validation with K562 CTCFL shRNA knock-down data, we showed a significant role of CTCFL in splicing regulation.
CONCLUSION: In conclusion, our findings highlight the potential role of promoter-bound TFBSs in influencing the regulation of downstream splicing patterns and provide insights for discovering alternative splicing regulations.
Subjects
Alternative Splicing , Deep Learning , Promoter Regions, Genetic , Transcription Factors , Humans , Binding Sites , Transcription Factors/metabolism , Transcription Factors/genetics , Computational Biology/methods , Exons/genetics
ABSTRACT
BACKGROUND: Alternative splicing (AS) increases the diversity of the transcriptome and can fine-tune gene function, so understanding the regulation of AS is vital. AS can be regulated by many different cis-regulatory elements, such as enhancers. Enhancers have been experimentally shown to regulate AS in some genes. However, there is a lack of genome-wide studies on the association between enhancers and AS (enhancer-AS association). To bridge the gap, we developed an integrative analysis on a genome-wide scale to identify enhancer-AS associations in human and mouse. RESULT: We collected enhancer datasets covering 28 human and 24 mouse tissues and cell lines, together with RNA-seq datasets paired with the selected tissues. By combining data integration and statistical analysis, we identified 3,242 human and 7,716 mouse genes that have significant enhancer-AS associations in at least one tissue. On average, for each gene, about 6% of enhancers in human (5% in mouse) are associated with AS changes, and for each enhancer, approximately one gene is identified as having an enhancer-AS association in both human and mouse. We found that 52% of the significant enhancer-AS associations in human (34% in mouse) involve the co-existence of homologous genes and homologous enhancers. We further constructed a user-friendly platform, named Visualization of Enhancer-associated Alternative Splicing (VEnAS, http://venas.iis.sinica.edu.tw/ ), to provide the genomic architecture, an intuitive association plot, and a contingency table for each significant enhancer-AS association. CONCLUSION: This study provides the first genome-wide identification of enhancer-AS associations in human and mouse. The results suggest that a notable portion of enhancers play roles in AS regulation. The analyzed results and the proposed platform VEnAS should further the understanding of how enhancers regulate alternative splicing.
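As an illustration of how a contingency table can be used to test an enhancer-AS association, the following is a minimal sketch only. The abstract does not state which statistical test was applied, so Fisher's exact test is an assumed stand-in; the function name `fisher_exact_two_sided` and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical): rows = enhancer active / inactive,
    columns = exon included / skipped, cells = number of samples.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts for one enhancer-exon pair across 30 samples.
p_value = fisher_exact_two_sided(12, 3, 4, 11)
print(f"p = {p_value:.4g}")
```

In a genome-wide setting, p-values from many such tables would normally be corrected for multiple testing before an association is called significant.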
Subjects
Alternative Splicing , Enhancer Elements, Genetic , Animals , Genome-Wide Association Study , Genomics/methods , Humans , Mice , RNA-Seq
ABSTRACT
SUMMARY: In higher eukaryotes, the generation of transcript isoforms from a single gene through alternative splicing (AS) and alternative transcription (AT) mechanisms increases functional and regulatory diversity. Annotating these alternative transcript events is essential for genomic studies. However, no existing tool generates comprehensive annotations of all these alternative transcript events, including both AS and AT events. In the present study, we develop CATANA, which encodes exon usage patterns based on a flattened gene model, to identify ten types of AS and AT events. We demonstrate the power and versatility of CATANA by showing greater depth of annotation of alternative transcript events according to either genome annotation or RNA-seq data. AVAILABILITY AND IMPLEMENTATION: CATANA is available at https://github.com/shiauck/CATANA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
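The idea of encoding exon usage on a flattened gene model can be sketched generically as follows. This is not CATANA's actual implementation or encoding, which the abstract does not describe; the functions `flatten_exons` and `encode_usage`, the half-open coordinate convention, and the example transcripts are all assumptions made for illustration.

```python
def flatten_exons(transcripts):
    """Split all exons of a gene into disjoint bins at every exon boundary.

    `transcripts` maps a transcript name to a list of (start, end) exons,
    treated here as half-open intervals [start, end).
    """
    bounds = sorted({pos for exons in transcripts.values()
                     for start, end in exons for pos in (start, end)})
    bins = []
    for b0, b1 in zip(bounds, bounds[1:]):
        # Keep a bin only if at least one exon of some transcript covers it.
        if any(s <= b0 and b1 <= e
               for exons in transcripts.values() for s, e in exons):
            bins.append((b0, b1))
    return bins


def encode_usage(transcripts, bins):
    """Encode each transcript as a 0/1 string: one character per flattened bin."""
    return {
        name: "".join(
            "1" if any(s <= b0 and b1 <= e for s, e in exons) else "0"
            for b0, b1 in bins
        )
        for name, exons in transcripts.items()
    }


# Hypothetical gene with three transcripts.
gene = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (500, 600)],              # skips the middle exon
    "tx3": [(100, 250), (500, 600)],              # alternative 5' splice site
}
bins = flatten_exons(gene)
for name, pattern in encode_usage(gene, bins).items():
    print(name, pattern)
```

Comparing the usage strings of two transcripts bin by bin is one way differences such as a skipped exon or a shifted splice site could then be classified into event types.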
Subjects
Alternative Splicing , Software , Transcription, Genetic , Exons , Genome , Sequence Analysis, RNA
ABSTRACT
Lung cancer is the most common cause of cancer-related mortality, with more than 1.4 million deaths per year worldwide. To search for significant somatic alterations in lung cancer, we analyzed, integrated and manually curated various data sets and literature to present an integrated genomic database of non-small cell lung cancer (IGDB.NSCLC, http://igdb.nsclc.ibms.sinica.edu.tw). We collected data sets derived from hundreds of human NSCLC samples (lung adenocarcinomas and/or squamous cell carcinomas) to illustrate genomic alterations [chromosomal regions with copy number alterations (CNAs), gain/loss and loss of heterozygosity], aberrantly expressed genes and microRNAs, somatic mutations, and experimental evidence and clinical information on alterations retrieved from the literature. IGDB.NSCLC provides user-friendly interfaces and search functions to display multiple layers of evidence, with particular emphasis on concordant alterations of CNAs with co-localized altered gene expression, aberrant microRNA expression, somatic mutations or genes with associated clinicopathological features. These significant concordant alterations in NSCLC are presented graphically or in tables to help prioritize putative cancer targets for pathological and mechanistic studies of lung tumorigenesis and for developing new strategies in clinical interventions.
Subjects
Carcinoma, Non-Small-Cell Lung/genetics , Databases, Genetic , Lung Neoplasms/genetics , Carcinoma, Non-Small-Cell Lung/metabolism , Gene Expression Profiling , Genes, Neoplasm , Genetic Variation , Genomics , Humans , Lung Neoplasms/metabolism , MicroRNAs/metabolism , Mutation , Systems Integration
ABSTRACT
Cell line identification is emerging as an essential method for every cell line user in the research community to avoid using misidentified cell lines for experiments and publications. IGRhCellID (http://igrcid.ibms.sinica.edu.tw) is designed to integrate eight cell identification methods, including seven methods (STR profile, gender, immunotypes, karyotype, isoenzyme profile, TP53 mutation and mutations of cancer genes) available in various public databases and our method of profiling genome alterations of human cell lines. With data validation of 11 small deleted genes in human cancer cell lines, profiles of genomic alterations further allow users to search for human cell lines with a deleted gene to serve as indigenous knock-out cell models (such as SMAD4 in gene view), with an amplified gene to serve as cell models for testing therapeutic efficacy (such as ERBB2 in gene view) and with overlapping aberrant chromosomal loci for revealing common cancer genes (such as 9p21.3 homozygous deletion with co-deleted CDKN2A, CDKN2B and MTAP in chromosome view). IGRhCellID provides not only available methods for cell identification to help eradicate concerns about using misidentified cells but also designated genetic features of human cell lines for experiments.
Subjects
Cell Line , Databases, Factual , Genomics , Cell Line, Tumor , Genes , Genetic Loci , Humans
ABSTRACT
Single-cell nanopore sequencing of full-length mRNAs transforms single-cell multi-omics studies. However, challenges include high sequencing error rates and dependence on short reads and/or barcode whitelists. To address these, we develop scNanoGPS to calculate same-cell genotypes (mutations) and phenotypes (gene/isoform expressions) without short-read or whitelist guidance. We apply scNanoGPS to 23,587 long-read transcriptomes from 4 tumors and 2 cell lines. Standalone, scNanoGPS deconvolutes error-prone long reads into single cells and single molecules, and simultaneously accesses both phenotypes and genotypes of individual cells. Our analyses reveal that tumor and stroma/immune cells express distinct combinations of isoforms (DCIs). In a kidney tumor, we identify 924 DCI genes involved in cell-type-specific functions, such as PDE10A in tumor cells and CCL3 in lymphocytes. Transcriptome-wide mutation analyses identify many cell-type-specific mutations, including VEGFA mutations in tumor cells and HLA-A mutations in immune cells, highlighting the critical roles of different mutant populations in tumors. Together, scNanoGPS facilitates applications of single-cell long-read sequencing technologies.
Subjects
Carcinoma, Intraductal, Noninfiltrating , Kidney Neoplasms , Humans , Genotype , High-Throughput Nucleotide Sequencing , Phenotype , Phosphoric Diester Hydrolases
ABSTRACT
Single-cell nanopore sequencing of full-length mRNAs (scNanoRNAseq) is transforming single-cell multi-omics studies. However, challenges include computational complexity and dependence on short-read curation. To address this, we developed a comprehensive toolkit, scNanoGPS, to calculate same-cell genotypes-phenotypes without short-read guidance. We applied scNanoGPS to 23,587 long-read transcriptomes from 4 tumors and 2 cell lines. Standalone, scNanoGPS accurately deconvoluted error-prone long reads into single cells and single molecules. Further, scNanoGPS simultaneously accessed both phenotypes (expressions/isoforms) and genotypes (mutations) of individual cells. Our analyses revealed that tumor and stroma/immune cells often expressed significantly distinct combinations of isoforms (DCIs). In a kidney tumor, we identified 924 genes with DCIs involved in cell-type-specific functions, such as PDE10A in tumor cells and CCL3 in lymphocytes. Moreover, transcriptome-wide mutation analyses identified many cell-type-specific mutations, including VEGFA mutations in tumor cells and HLA-A mutations in immune cells, highlighting critical roles of different populations in tumors. Together, scNanoGPS facilitates applications of single-cell long-read sequencing.
ABSTRACT
The deadliest anaplastic thyroid cancer (ATC) often transforms from indolent differentiated thyroid cancer (DTC); however, the complex intratumor transformation process is poorly understood. We investigated an anaplastic transformation model by dissecting both cell lineage and cell fate transitions using single-cell transcriptomic and genetic alteration data from patients with different subtypes of thyroid cancer. The resulting spectrum of ATC transformation included stress-responsive DTC cells, inflammatory ATC cells (iATCs), and mitotic-defective ATC cells, and extended all the way to mesenchymal ATC cells (mATCs). Furthermore, our analysis identified 2 important milestones: (a) a diploid stage, in which iATCs were diploid with inflammatory phenotypes, and (b) an aneuploid stage, in which mATCs gained aneuploid genomes and mesenchymal phenotypes, producing excessive amounts of collagen and collagen-interacting receptors. In parallel, cancer-associated fibroblasts showed strong interactions among mesenchymal cell types, macrophages shifted from M1 to M2 states, and T cells reprogrammed from cytotoxic to exhausted states, highlighting new therapeutic opportunities for the treatment of ATC.
Subjects
Thyroid Carcinoma, Anaplastic , Thyroid Neoplasms , Humans , Transcriptome , Thyroid Neoplasms/genetics , Thyroid Neoplasms/metabolism , Thyroid Carcinoma, Anaplastic/genetics , Gene Expression Profiling , Aneuploidy , Cell Line, Tumor
ABSTRACT
MicroRNAs have been found in various organisms and play essential roles in gene expression regulation of many critical cellular processes. Large-scale computational prediction of miRNAs has been conducted for many organisms using known genomic sequences; however, there has been no such effort for the thousands of known viral genomes. Some viruses utilize existing host cellular pathways for their own benefit. Furthermore, viruses are capable of encoding miRNAs and using them to repress host genes. Thus, identifying potential miRNAs in all viral genomes would be valuable to virologists who study virus-host interactions. Based on our previously reported hairpin secondary structure and feature selection filters, we have examined the 2266 available viral genome sequences for putative miRNA hairpins and identified 33 691 hairpin candidates in 1491 genomes. Evaluation of the system performance indicated that our discovery pipeline exhibited 84.4% sensitivity. We established an interface for users to query the predicted viral miRNA hairpins based on taxonomic classification, and a host target gene prediction service based on the RNAhybrid program and the 3'-UTR gene sequences of human, mouse, rat, zebrafish, rice and Arabidopsis. The viral miRNA prediction database (Vir-Mir) can be accessed via http://alk.ibms.sinica.edu.tw.