Proteogenomic database construction driven from large scale RNA-seq data.

Woo, Sunghee; Cha, Seong Won; Merrihew, Gennifer; He, Yupeng; Castellana, Natalie; Guest, Clark; MacCoss, Michael; Bafna, Vineet

Affiliation

Woo S; Department of Electrical and Computing Engineering, Department of Bioinformatics and Systems Biology, and Department of Computer Science, University of California, San Diego, La Jolla, California 92093, United States.

J Proteome Res; 13(1): 21-8, 2014 Jan 03.

Article in English | MEDLINE | ID: mdl-23802565

ABSTRACT

The advent of inexpensive RNA-seq and other deep sequencing technologies for RNA promises to radically improve genomic annotation, providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, the integration of large amounts of redundant RNA-seq data and mass spectrometry data poses a challenging problem. Our paper addresses this by constructing a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to 410 MB of splice graph database written in FASTA format. This corresponds to a roughly 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study on the custom data set with a completely automated pipeline and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse strands, and 42 translated UTRs. Our results highlight the usefulness of integrating transcriptomic and proteomic data for improved genome annotations.
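The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is only a sketch for the reader, not part of the authors' pipeline; the variable names are illustrative, and it assumes decimal units (1 GB = 1000 MB), which the abstract does not specify.

```python
# Figures quoted in the abstract.
sam_size_gb = 496.2    # aligned RNA-seq SAM files
fasta_size_mb = 410.0  # splice graph database in FASTA format

# Assumption: decimal units (1 GB = 1000 MB); the abstract does not say.
MB_PER_GB = 1000
compression_ratio = sam_size_gb * MB_PER_GB / fasta_size_mb

# Novel event counts by category, as listed in the abstract.
novel_events = {
    "novel genes": 215,
    "novel exons": 808,
    "alternative splicings": 12,
    "gene-boundary corrections": 618,
    "exon-boundary changes": 245,
    "frame shifts": 938,
    "reverse strands": 1166,
    "translated UTRs": 42,
}
total_events = sum(novel_events.values())

print(f"compression ratio: about {compression_ratio:.0f}x")
print(f"total novel events: {total_events}")
```

Under that unit assumption the ratio comes out somewhat above 1000×, consistent with the rounded figure in the abstract, and the per-category counts sum to the reported total of 4044.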

Subject(s)

Caenorhabditis elegans/metabolism; Databases, Genetic; Databases, Protein; Genome; Proteome; Sequence Analysis, RNA; Amino Acid Sequence; Animals; Automation; Caenorhabditis elegans/genetics; Helminth Proteins/chemistry; Helminth Proteins/genetics; Helminth Proteins/metabolism; Molecular Sequence Data

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Genome / Sequence Analysis, RNA / Caenorhabditis elegans / Proteome / Databases, Genetic / Databases, Protein | Limits: Animals | Language: En | Journal: J Proteome Res | Journal subject: BIOCHEMISTRY | Year: 2014 | Document type: Article | Affiliation country: United States
