Error, noise and bias in de novo transcriptome assemblies.

Freedman, Adam H; Clamp, Michele; Sackton, Timothy B

Affiliation

Freedman AH; Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA.
Clamp M; Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA.
Sackton TB; Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA.

Mol Ecol Resour; 21(1): 18-29, 2021 Jan.

Article in English | MEDLINE | ID: mdl-32180366

ABSTRACT

De novo transcriptome assembly is a powerful tool, and has been widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.

Subject(s)

Bias; Gene Expression Profiling; Transcriptome; Algorithms

Keywords

adaptation; bioinformatics/phyloinformatics; genomics/proteomics; population genetics; transcriptomics

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Bias / Gene Expression Profiling / Transcriptome | Language: En | Journal: Mol Ecol Resour | Year: 2021 | Document type: Article | Country of affiliation: United States
