Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.

Umer, Husen M; Audain, Enrique; Zhu, Yafeng; Pfeuffer, Julianus; Sachsenberg, Timo; Lehtiö, Janne; Branca, Rui M; Perez-Riverol, Yasset

Affiliations

Umer HM; Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institutet, Stockholm 17165, Sweden.
Audain E; Department of Congenital Heart Disease and Pediatric Cardiology, Universitätsklinikum Schleswig-Holstein Kiel, Kiel 24105, Germany.
Zhu Y; Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-sen University, Guangzhou 510120, China.
Pfeuffer J; Algorithmic Bioinformatics, Freie Universität Berlin, Berlin 14195, Germany.
Sachsenberg T; Visualization and Data Analysis, Zuse Institute Berlin, Berlin 14195, Germany.
Lehtiö J; Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany.
Branca RM; Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institutet, Stockholm 17165, Sweden.
Perez-Riverol Y; Department of Oncology-Pathology, Science for Life Laboratory, Karolinska Institutet, Stockholm 17165, Sweden.

Bioinformatics; 38(5): 1470-1472, 2022-02-07.

Article in English | MEDLINE | ID: mdl-34904638

ABSTRACT

SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also include exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD and mutations detected by sequencing patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation with the DecoyPyrat algorithm. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified.

AVAILABILITY AND IMPLEMENTATION: The software is freely available. pypgatk: https://github.com/bigbio/py-pgatk/ and pgdb: https://nf-co.re/pgdb.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
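The three-frame translation mentioned in the summary can be illustrated with a minimal Python sketch. This is not the pypgatk implementation: the function names `translate` and `three_frame_translation`, the `min_length` parameter and the example sequence are hypothetical, and the sketch assumes the standard genetic code and forward-strand translation of an already spliced transcript sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard nuclear code applies to the transcripts of interest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate a nucleotide sequence in one forward frame (0, 1 or 2).

    Stop codons become '*'; codons with ambiguous bases become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq, min_length=1):
    """Return, for each forward frame, the stop-free protein segments.

    Each frame's translation is split at stop codons, and segments shorter
    than min_length residues are discarded.
    """
    segments = {}
    for frame in range(3):
        protein = translate(seq, frame)
        segments[frame] = [p for p in protein.split("*") if len(p) >= min_length]
    return segments


if __name__ == "__main__":
    # Hypothetical 12-nucleotide transcript fragment, for illustration only.
    for frame, peptides in three_frame_translation("ATGGCCTAAGGT").items():
        print(frame, peptides)
```

Splitting each frame at stop codons yields candidate open reading frames that could be written out as FASTA entries for a search database; a real workflow would also need to handle transcript coordinates, identifiers and a sensible minimum length.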

Subject(s)

Proteogenomics; Humans; Peptides/genetics; Software; Algorithms; Proteins

Full text: 1; Collection: 01-international; Database: MEDLINE; Main subject: Proteogenomics; Type of study: Diagnostic_studies; Limits: Humans; Language: En; Journal: Bioinformatics; Journal subject: MEDICAL INFORMATICS; Year: 2022; Document type: Article; Country of affiliation: Sweden
