findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies.
Bioinformatics; 34(4): 550-557, 2018 Feb 15.
Article in En | MEDLINE | ID: mdl-29444236
ABSTRACT
Motivation: Analyzing k-mer frequencies in whole-genome sequencing data is becoming a common method for estimating genome size (GS). However, the accuracy of the method remains uninvestigated, in particular whether it can capture intra-species GS variation.
Results: We present findGSE, which fits skew normal distributions to k-mer frequencies to estimate GS. findGSE outperformed existing tools in an extensive simulation study. When estimating the GSs of 89 Arabidopsis thaliana accessions, findGSE showed the highest capability in capturing GS variation. In an application to 71 female and 71 male human individuals, findGSE delivered an average haploid human GS of 3039 Mb, and female genomes were on average 41 Mb larger than male genomes, in astonishing agreement with the size difference between the X and Y chromosomes. Further analysis showed that human GS variation is linked to geographical patterns and differs significantly between populations, which can be explained by variable abundances of LINE-1 retrotransposons.
Availability and implementation: The findGSE R package is freely available at https://github.com/schneebergerlab/findGSE and is supported on Linux and Mac systems.
Contact: schneeberger@mpipz.mpg.de.
Supplementary information: Supplementary data are available at Bioinformatics online.
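To illustrate the general idea behind k-mer based GS estimation, here is a minimal Python sketch. It is not the findGSE method: findGSE fits skew normal distributions to the k-mer frequency histogram, whereas this sketch uses the simpler peak-based rule (total k-mer count divided by the depth of the main peak) that such fitting refines. The function names and the min_depth cutoff for discarding likely sequencing-error k-mers are assumptions made for illustration only.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers observed exactly `depth` times}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Peak-based GS estimate (illustrative only, not findGSE's model fit).

    hist      -- mapping of k-mer depth to number of distinct k-mers
    min_depth -- hypothetical cutoff; k-mers below it are treated as errors
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers approximates the k-mer coverage.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences, excluding the presumed error k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: an error spike at depth 1, a main peak at depth 10,
# and a few two-copy repeat k-mers at depth 20.
toy_hist = {1: 500, 9: 10, 10: 100, 11: 10, 20: 5}
toy_estimate = estimate_genome_size(toy_hist)
```

Dividing by the peak depth counts a k-mer seen at twice the peak depth as two genomic positions, which is why repeats still contribute to the estimate. A model fit, as used by findGSE, replaces the single peak value with a fitted distribution.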
Database:
MEDLINE
Main subject:
Software / Genome, Human / Sequence Analysis, DNA / Genome, Plant / Genome Size
Limits:
Female / Humans / Male
Language:
En
Journal:
Bioinformatics
Journal subject:
MEDICAL INFORMATICS
Year:
2018
Type:
Article
Affiliation country:
Germany