ABSTRACT
SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified. AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
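The three-frame translation mentioned in the summary can be illustrated with a minimal sketch. This is not the pypgatk implementation: the function names are hypothetical, only the forward strand is translated, the standard genetic code is assumed, and codons containing ambiguous bases are rendered as "X".

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a DNA string codon by codon; '*' marks stop codons."""
    seq = seq.upper()
    # Unknown codons (e.g. containing N) become 'X'; a trailing partial codon is dropped.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames (offsets 0, 1, 2)."""
    return [translate(seq[offset:]) for offset in range(3)]


frames = three_frame_translation("ATGGCCTAA")
```

For the toy transcript `ATGGCCTAA`, the three frames read `ATG GCC TAA`, `TGG CCT` and `GGC CTA`. A real workflow would additionally split each frame at stop codons and filter the resulting fragments by length before adding them to a search database.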
Subject(s)
Proteogenomics , Humans , Peptides/genetics , Software , Algorithms , Proteins
ABSTRACT
Gene transcription is regulated mainly by transcription factors (TFs). ENCODE and Roadmap Epigenomics provide global binding profiles of TFs, which can be used to identify regulatory regions. To this end we implemented a method to systematically construct cell-type and species-specific maps of regulatory regions and TF-TF interactions. We illustrated the approach by developing maps for five human cell-lines and two other species. We detected ~144k putative regulatory regions among the human cell-lines, with the majority of them being ~300 bp. We found ~20k putative regulatory elements in the ENCODE heterochromatic domains suggesting a large regulatory potential in the regions presumed transcriptionally silent. Among the most significant TF interactions identified in the heterochromatic regions were CTCF and the cohesin complex, which is in agreement with previous reports. Finally, we investigated the enrichment of the obtained putative regulatory regions in the 3D chromatin domains. More than 90% of the regions were discovered in the 3D contacting domains. We found a significant enrichment of GWAS SNPs in the putative regulatory regions. These significant enrichments provide evidence that the regulatory regions play a crucial role in the genomic structural stability. Additionally, we generated maps of putative regulatory regions for prostate and colorectal cancer human cell-lines.
Subject(s)
Genomics , Regulatory Sequences, Nucleic Acid , Binding Sites , Cell Line , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation , Chromosome Mapping , Computational Biology/methods , Genome, Human , Genome-Wide Association Study , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans , Molecular Sequence Annotation , Polymorphism, Single Nucleotide , Protein Binding , Protein Interaction Mapping , Protein Interaction Maps , Transcription Factors/metabolism
ABSTRACT
Somatic mutations drive cancer and there are established ways to study those in coding sequences. It has been shown that some regulatory mutations are over-represented in cancer. We develop a new strategy to find putative regulatory mutations based on experimentally established motifs for transcription factors (TFs). In total, we find 1,552 candidate regulatory mutations predicted to significantly reduce binding affinity of many TFs in hepatocellular carcinoma and affecting binding of CTCF also in esophagus, gastric, and pancreatic cancers. Near mutated motifs, there is a significant enrichment of (1) genes mutated in cancer, (2) tumor-suppressor genes, (3) genes in KEGG cancer pathways, and (4) sets of genes previously associated to cancer. Experimental and functional validations support the findings. The strategy can be applied to identify regulatory mutations in any cell type with established TF motifs and will aid identifications of genes contributing to cancer.
Subject(s)
Carcinoma, Hepatocellular/genetics , Liver Neoplasms/genetics , Mutation , Transcription Factors/genetics , Binding Sites , Databases, Genetic , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Genetic Predisposition to Disease , Hep G2 Cells , Humans , Protein Binding , Sequence Analysis, DNA , Transcription Factors/metabolism
ABSTRACT
BACKGROUND: Finding peaks in ChIP-seq is an important process in biological inference. In some cases, such as positioning nucleosomes with specific histone modifications or finding transcription factor binding specificities, the precision of the detected peak plays a significant role. There are several applications for finding peaks (called peak finders) based on different algorithms (e.g. MACS, Erange and HPeak). Benchmark studies have shown that the existing peak finders identify different peaks for the same dataset and it is not known which one is the most accurate. We present the first meta-server called Peak Finder MetaServer (PFMS) that collects results from several peak finders and produces consensus peaks. Our application accepts three standard ChIP-seq data formats: BED, BAM, and SAM. RESULTS: Sensitivity and specificity of seven widely used peak finders were examined. For the experiments we used three previously studied Transcription Factors (TF) ChIP-seq datasets and identified three of the selected peak finders that returned results with high specificity and very good sensitivity compared to the remaining four. We also ran PFMS using the three selected peak finders on the same TF datasets and achieved higher specificity and sensitivity than the peak finders individually. CONCLUSIONS: We show that combining outputs from up to seven peak finders yields better results than individual peak finders. In addition, three of the seven peak finders outperform the remaining four, and running PFMS with these three returns even more accurate results. Another added value of PFMS is a separate report of the peaks returned by each of the included peak finders.
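The abstract does not state how PFMS derives its consensus peaks, so the following is only one plausible illustration of combining peak-finder outputs, not the PFMS algorithm. It assumes a simple voting rule: keep every region covered by peaks from at least `min_support` peak finders. The function name, the `(chrom, start, end)` tuple layout with BED-style half-open coordinates, and the assumption that peaks within one finder's output do not overlap each other are all hypothetical.

```python
from collections import defaultdict


def consensus_peaks(peak_sets, min_support=2):
    """Return regions covered by peaks from at least `min_support` peak finders.

    peak_sets: one list of (chrom, start, end) tuples per peak finder,
    half-open coordinates, non-overlapping within each finder (assumed).
    """
    events = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            events[chrom].append((start, 1))   # a peak opens here
            events[chrom].append((end, -1))    # a peak closes here

    consensus = []
    for chrom in sorted(events):
        depth, region_start = 0, None
        # Sorting puts closings (-1) before openings (+1) at the same position,
        # so peaks that merely touch are not counted as overlapping.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if delta == 1 and depth == min_support:
                region_start = pos             # support threshold just reached
            elif delta == -1 and depth == min_support - 1:
                consensus.append((chrom, region_start, pos))
                region_start = None            # support dropped below threshold
    return consensus


finder_a = [("chr1", 100, 200), ("chr1", 500, 600)]
finder_b = [("chr1", 150, 250)]
finder_c = [("chr1", 180, 220), ("chr1", 550, 650)]
two_of_three = consensus_peaks([finder_a, finder_b, finder_c], min_support=2)
```

Raising `min_support` trades sensitivity for specificity: with the toy input above, requiring all three finders leaves only the short region where the three peaks overlap.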
Subject(s)
Chromatin Immunoprecipitation , Computational Biology/methods , Databases, Genetic , Sequence Analysis, DNA/methods , Software , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Glioblastoma (GBM) cancer stem cells (GSCs) contribute to GBM's origin, recurrence, and resistance to treatment. However, the understanding of how mRNA expression patterns of GBM subtypes are reflected at global proteome level in GSCs is limited. To characterize protein expression in GSCs, we performed in-depth proteogenomic analysis of patient-derived GSCs by RNA-sequencing and mass-spectrometry. We quantified > 10 000 proteins in two independent GSC panels and propose a GSC-associated proteomic signature characterizing two distinct phenotypic conditions; one defined by proteins upregulated in proneural and classical GSCs (GPC-like), and another by proteins upregulated in mesenchymal GSCs (GM-like). The GM-like protein set in GBM tissue was associated with necrosis, recurrence, and worse overall survival. Through proteogenomics, we discovered 252 non-canonical peptides in the GSCs, i.e., protein sequences that are variant or derive from genome regions previously considered non-protein-coding, including variants of the heterogeneous ribonucleoproteins implicated in RNA splicing. In summary, GSCs express two protein sets that have an inverse association with clinical outcomes in GBM. The discovery of non-canonical protein sequences questions existing gene models and pinpoints new protein targets for research in GBM.
Subject(s)
Brain Neoplasms , Glioblastoma , Humans , Glioblastoma/genetics , Glioblastoma/metabolism , Proteomics , Brain Neoplasms/metabolism , Neoplastic Stem Cells/metabolism , Cell Line, Tumor
ABSTRACT
Despite improvement of current treatment strategies and novel targeted drugs, relapse and treatment resistance largely determine the outcome for acute myeloid leukemia (AML) patients. To identify the underlying molecular characteristics, numerous studies have been aimed to decipher the genomic- and transcriptomic landscape of AML. Nevertheless, further molecular changes allowing malignant cells to escape treatment remain to be elucidated. Mass spectrometry is a powerful tool enabling detailed insights into proteomic changes that could explain AML relapse and resistance. Here, we investigated AML samples from 47 adult and 22 pediatric patients at serial time-points during disease progression using mass spectrometry-based in-depth proteomics. We show that the proteomic profile at relapse is enriched for mitochondrial ribosomal proteins and subunits of the respiratory chain complex, indicative of reprogrammed energy metabolism from diagnosis to relapse. Further, higher levels of granzymes and lower levels of the anti-inflammatory protein CR1/CD35 suggest an inflammatory signature promoting disease progression. Finally, through a proteogenomic approach, we detected novel peptides, which present a promising repertoire in the search for biomarkers and tumor-specific druggable targets. Altogether, this study highlights the importance of proteomic studies in holistic approaches to improve treatment and survival of AML patients.
Subject(s)
Leukemia, Myeloid, Acute , Proteogenomics , Humans , Child , Adult , Proteomics/methods , Leukemia, Myeloid, Acute/drug therapy , Leukemia, Myeloid, Acute/genetics , Leukemia, Myeloid, Acute/pathology , Recurrence , Disease Progression
ABSTRACT
In a cancer genome, the noncoding sequence contains the vast majority of somatic mutations. While very few are expected to be cancer drivers, those affecting regulatory elements have the potential to have downstream effects on gene regulation that may contribute to cancer progression. To prioritize regulatory mutations, we screened somatic mutations in the Pan-Cancer Analysis of Whole Genomes cohort of 2,515 cancer genomes on individual bases to assess their potential regulatory roles in their respective cancer types. We found a highly significant enrichment of regulatory mutations associated with the deamination signature overlapping a CpG site in the CCAAT/Enhancer Binding Protein β recognition sites in many cancer types. Overall, 5,749 mutated regulatory elements were identified in 1,844 tumor samples from 39 cohorts containing 11,962 candidate regulatory mutations. Our analysis indicated 20 or more regulatory mutations in 5.5% of the samples, and an overall average of six per tumor. Several recurrent elements were identified, and major cancer-related pathways were significantly enriched for genes nearby the mutated regulatory elements. Our results provide a detailed view of the role of regulatory elements in cancer genomes.
Subject(s)
Computational Biology , Genomics , Molecular Sequence Annotation , Mutation , Neoplasms/genetics , Untranslated Regions , Binding Sites , Biomarkers, Tumor , Computational Biology/methods , Disease Susceptibility , Gene Expression Regulation, Neoplastic , Genetic Predisposition to Disease , Genomics/methods , Humans , Mutation Rate , Neoplasms/metabolism , Nucleotide Motifs , Protein Binding , Regulatory Sequences, Nucleic Acid , Signal Transduction , Transcription Factors/metabolism
ABSTRACT
Knowledge of clinically targetable tumor antigens is becoming vital for broader design and utility of therapeutic cancer vaccines. This information is obtained reliably by directly interrogating the MHC-I presented peptide ligands, the immunopeptidome, with state-of-the-art mass spectrometry. Our manuscript describes direct identification of novel tumor antigens for an aggressive triple-negative breast cancer model. Immunopeptidome profiling revealed 2481 unique antigens, among them a novel ERV antigen originating from an endogenous retrovirus element. The clinical benefit and tumor control potential of the identified tumor antigens and ERV antigen were studied in a preclinical model using two vaccine platforms and therapeutic settings. Prominent control of established tumors was achieved using an oncolytic adenovirus platform designed for flexible and specific tumor targeting, namely PeptiCRAd. Our study presents a pipeline integrating immunopeptidome analysis-driven antigen discovery with a therapeutic cancer vaccine platform for improved personalized oncolytic immunotherapy.
ABSTRACT
Despite major advancements in lung cancer treatment, long-term survival is still rare, and a deeper understanding of molecular phenotypes would allow the identification of specific cancer dependencies and immune evasion mechanisms. Here we performed in-depth mass spectrometry (MS)-based proteogenomic analysis of 141 tumors representing all major histologies of non-small cell lung cancer (NSCLC). We identified six distinct proteome subtypes with striking differences in immune cell composition and subtype-specific expression of immune checkpoints. Unexpectedly, high neoantigen burden was linked to global hypomethylation and complex neoantigens mapped to genomic regions, such as endogenous retroviral elements and introns, in immune-cold subtypes. Further, we linked immune evasion with LAG3 via STK11 mutation-dependent HNF1A activation and FGL1 expression. Finally, we develop a data-independent acquisition MS-based NSCLC subtype classification method, validate it in an independent cohort of 208 NSCLC cases and demonstrate its clinical utility by analyzing an additional cohort of 84 late-stage NSCLC biopsy samples.
Subject(s)
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Proteogenomics , Carcinoma, Non-Small-Cell Lung/genetics , Fibrinogen/therapeutic use , Genomics/methods , Humans , Immune Evasion/genetics , Lung Neoplasms/genetics
ABSTRACT
Several Genome Wide Association Studies (GWAS) have reported variants associated to immune diseases. However, the identified variants are rarely the drivers of the associations and the molecular mechanisms behind the genetic contributions remain poorly understood. ChIP-seq data for TFs and histone modifications provide snapshots of protein-DNA interactions allowing the identification of heterozygous SNPs showing significant allele specific signals (AS-SNPs). AS-SNPs can change a TF binding site resulting in altered gene regulation and are primary candidates to explain associations observed in GWAS and expression studies. We identified 17,293 unique AS-SNPs across 7 lymphoblastoid cell lines. In this set of cell lines we interrogated 85% of common genetic variants in the population for potential regulatory effect and we identified 237 AS-SNPs associated to immune GWAS traits and 714 to gene expression in B cells. To elucidate possible regulatory mechanisms we integrated long-range 3D interactions data to identify putative target genes and motif predictions to identify TFs whose binding may be affected by AS-SNPs yielding a collection of 173 AS-SNPs associated to gene expression and 60 to B cell related traits. We present a systems strategy to find functional gene regulatory variants, the TFs that bind differentially between alleles and novel strategies to detect the regulated genes.
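The abstract does not specify the statistical test used to call AS-SNPs. A common way to assess allele-specific signal at a heterozygous SNP is an exact two-sided binomial test of the reference versus alternative read counts against an expected 50:50 ratio; the sketch below assumes that approach. The function name is hypothetical, and reference-mapping bias and multiple-testing correction, which a real analysis would need to handle, are ignored.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for H0: both alleles are equally represented."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no coverage, no evidence of imbalance
    k = min(ref_reads, alt_reads)
    # Lower tail P(X <= k) for X ~ Binomial(n, 0.5); by symmetry the two-sided
    # p-value is twice this tail, capped at 1 for balanced counts.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_skewed = allelic_imbalance_pvalue(10, 0)    # all ten reads carry one allele
p_balanced = allelic_imbalance_pvalue(5, 5)   # perfectly balanced reads
```

Because the null proportion is 0.5, the distribution is symmetric and doubling the smaller tail gives the exact two-sided p-value; a different null proportion (for example, to model mapping bias) would require summing both tails explicitly.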