Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome.

Philippe, Nicolas; Bou Samra, Elias; Boureux, Anthony; Mancheron, Alban; Rufflé, Florence; Bai, Qiang; De Vos, John; Rivals, Eric; Commes, Thérèse

Affiliation

Philippe N; Transcriptomics, bioinformatics and myeloid leukaemia, INSERM, U1040, Institute for Research in Biotherapy, Montpellier F-34197, France, Université Montpellier 2, Montpellier, France, Institut de Biologie Computationnelle, Maison de la modélisation, Université Montpellier 2, France, LIRMM, MAB, CNRS UMR 5506, Université Montpellier 2, Montpellier, France and Genomic instability of pluripotent stem cells, INSERM, U1040, Institute for Research in Biotherapy, Montpellier F-34197, France.

Nucleic Acids Res ; 42(5): 2820-32, 2014 Mar.

Article in English | MEDLINE | ID: mdl-24357408

ABSTRACT

Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as 'TranscriRef'). We then annotated 750,000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ~34,000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
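The tag categorization step described in the abstract can be illustrated with a minimal Python sketch. It is not the DigitagCT implementation. The `Gene` and `Tag` types, the `classify_tag` function, the reduction of each tag to a single genomic position, and the priority order (exon, then intron, then antisense, then intergenic) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical protein-coding gene annotation (inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    strand: str                       # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class Tag:
    """Hypothetical uniquely mapped DGE tag, reduced to one position."""
    chrom: str
    pos: int
    strand: str


def classify_tag(tag: Tag, genes: List[Gene]) -> str:
    """Assign a tag to one of four categories relative to annotated genes.

    Assumed priority: protein-coding (exonic, same strand), intronic
    (within a gene, same strand), antisense (within a gene, opposite
    strand), otherwise intergenic.
    """
    # Genes whose span contains the tag position, on either strand.
    overlapping = [
        g for g in genes
        if g.chrom == tag.chrom and g.start <= tag.pos <= g.end
    ]
    same_strand = [g for g in overlapping if g.strand == tag.strand]

    for gene in same_strand:
        if any(start <= tag.pos <= end for start, end in gene.exons):
            return "protein-coding"
    if same_strand:
        return "intronic"
    if overlapping:
        return "antisense"
    return "intergenic"


# Toy annotation: one gene on the plus strand with two exons.
genes = [Gene("chr1", 100, 500, "+", ((100, 200), (400, 500)))]
categories = [
    classify_tag(Tag("chr1", 150, "+"), genes),
    classify_tag(Tag("chr1", 300, "+"), genes),
    classify_tag(Tag("chr1", 300, "-"), genes),
    classify_tag(Tag("chr1", 1000, "+"), genes),
]
```

A real pipeline would work on full tag intervals and an indexed annotation (for example an interval tree) rather than a linear scan over genes.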

Subject(s)

Gene Expression Profiling/methods; Genome, Human; RNA, Untranslated/analysis; Sequence Analysis, RNA/methods; Cell Line; Humans; Molecular Sequence Annotation; Poly A/analysis; Software; Transcription, Genetic

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Genome, Human / Sequence Analysis, RNA / Gene Expression Profiling / RNA, Untranslated Limits: Humans Language: English Journal: Nucleic Acids Res Year: 2014 Document type: Article Affiliation country: France
