Kullback Leibler divergence in complete bacterial and phage genomes.

Akhter, Sajia; Aziz, Ramy K; Kashef, Mona T; Ibrahim, Eslam S; Bailey, Barbara; Edwards, Robert A

Affiliation

Akhter S; Computational Science Research Center, San Diego State University, San Diego, CA, USA.
Aziz RK; Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt.
Kashef MT; Department of Computer Science, San Diego State University, San Diego, CA, United States of America.
Ibrahim ES; Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt.
Bailey B; Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt.
Edwards RA; Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA.

PeerJ; 5: e4026, 2017.

Article in En | MEDLINE | ID: mdl-29204318

ABSTRACT

The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback-Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
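The metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy protein sequences, the use of an unweighted per-genome mean as the reference distribution, and the base-2 logarithm are all assumptions made for illustration. For a genome with amino acid frequencies p and a mean profile q, the divergence is taken here as the sum over the 20 standard amino acids of p_i * log2(p_i / q_i).

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(proteins):
    """Relative frequency of each standard amino acid across all
    protein sequences of one genome (hypothetical helper)."""
    counts = Counter()
    for seq in proteins:
        # Ignore stop symbols, gaps, and ambiguous residues such as X.
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_frequencies(profiles):
    """Unweighted mean profile over genomes (an assumption; weighting
    by genome size would be another reasonable choice)."""
    n = len(profiles)
    return {aa: sum(p[aa] for p in profiles) / n for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D_KL(p || q) in bits. Terms with p_i == 0 contribute 0.
    Assumes q_i > 0 wherever p_i > 0, which holds when q is a mean
    that includes p."""
    return sum(
        p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0
    )


# Toy data, invented for illustration only.
genomes = {
    "genome_a": ["MKKLLAVEG", "MAAKDEL"],
    "genome_b": ["MGGGPWRRA", "MSTNQ"],
}
profiles = {name: amino_acid_frequencies(seqs) for name, seqs in genomes.items()}
mean_profile = mean_frequencies(list(profiles.values()))
for name, profile in profiles.items():
    print(name, round(kl_divergence(profile, mean_profile), 4))
```

A larger value indicates a more skewed amino acid utilization profile relative to the mean. A divergence of zero means the genome's composition matches the mean exactly.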

Keywords

Genometrics; Genomics; Information theory; Metagenomics

Full text: 1 | Database: MEDLINE | Language: En | Publication year: 2017 | Document type: Article | Country of affiliation: United States
