Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data.

Paulson, Joseph N; Chen, Cho-Yi; Lopes-Ramos, Camila M; Kuijjer, Marieke L; Platig, John; Sonawane, Abhijeet R; Fagny, Maud; Glass, Kimberly; Quackenbush, John

Affiliation

Paulson JN; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.
Chen CY; Department of Biostatistics, Harvard School of Public Health, Boston, MA, 02215, USA.
Lopes-Ramos CM; Present address: Genentech, Department of Biostatistics, Product Development, 1 DNA Way, South San Francisco, CA, 94080, USA.
Kuijjer ML; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.
Platig J; Department of Biostatistics, Harvard School of Public Health, Boston, MA, 02215, USA.
Sonawane AR; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.
Fagny M; Department of Biostatistics, Harvard School of Public Health, Boston, MA, 02215, USA.
Glass K; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.
Quackenbush J; Department of Biostatistics, Harvard School of Public Health, Boston, MA, 02215, USA.

BMC Bioinformatics ; 18(1): 437, 2017 Oct 03.

Article in English | MEDLINE | ID: mdl-28974199

ABSTRACT

BACKGROUND:

Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but they also provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data, which are critical first steps for any subsequent analysis.

RESULTS:

We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project.
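To make the idea of group-aware filtering and normalization concrete, the following is a minimal illustrative sketch in plain Python. It is not the YARN implementation or its API: the function names `filter_sparse_genes` and `quantile_normalize_within_groups`, the `min_count` threshold, and the rule "expressed in at least half as many samples as the smallest group contains" are all assumptions chosen for illustration. The sketch shows why a group-aware threshold retains genes expressed in only one tissue, and how normalizing within each group avoids forcing different tissues onto one shared distribution.

```python
from collections import defaultdict


def filter_sparse_genes(counts, groups, min_count=1):
    """Keep genes expressed in enough samples to cover a small group.

    counts: dict mapping gene -> list of raw counts, one per sample.
    groups: list of group (e.g. tissue) labels, one per sample.
    Assumed rule (hypothetical): keep a gene if at least
    (smallest group size // 2, minimum 1) samples reach min_count.
    """
    sizes = defaultdict(int)
    for label in groups:
        sizes[label] += 1
    min_samples = max(1, min(sizes.values()) // 2)
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_count for v in values) >= min_samples
    }


def quantile_normalize_within_groups(counts, groups):
    """Quantile-normalize samples separately inside each group.

    Ties are broken by gene order rather than averaged, which is a
    simplification made to keep the sketch short.
    """
    genes = list(counts)
    out = {gene: [0.0] * len(groups) for gene in genes}
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    for idxs in by_group.values():
        # Reference distribution: mean of the sorted values at each rank.
        sorted_cols = [sorted(counts[g][i] for g in genes) for i in idxs]
        ref = [
            sum(col[r] for col in sorted_cols) / len(idxs)
            for r in range(len(genes))
        ]
        for i in idxs:
            order = sorted(range(len(genes)), key=lambda k: counts[genes[k]][i])
            for rank, k in enumerate(order):
                out[genes[k]][i] = ref[rank]
    return out


# Toy data: two hypothetical tissues with two samples each.
groups = ["liver", "liver", "lung", "lung"]
counts = {
    "A": [10, 12, 0, 0],  # expressed only in liver
    "B": [0, 0, 0, 1],    # sparse, detected in one lung sample
    "C": [5, 3, 8, 6],    # expressed everywhere
    "D": [0, 0, 0, 0],    # never detected
}
filtered = filter_sparse_genes(counts, groups)
normalized = quantile_normalize_within_groups(filtered, groups)
```

In this toy example the never-detected gene is dropped while the liver-specific gene is kept, which a filter requiring expression across most of the samples would have removed.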

CONCLUSIONS:

An R package instantiating YARN is available at http://bioconductor.org/packages/yarn .

Subject(s)

Databases, Genetic; Organ Specificity/genetics; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/standards; Gene Expression Profiling; Gene Expression Regulation; Humans; Molecular Sequence Annotation; Principal Component Analysis; Quality Control; Reference Standards; Sample Size; Software

Keywords

Filtering; GTEx; Normalization; Preprocessing; Quality control; RNA-Seq

Full text: 1 Databases: MEDLINE Main subject: Organ Specificity / Sequence Analysis, RNA / Databases, Genetic Limits: Humans Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year: 2017 Document type: Article Affiliation country: United States