BacTag - a pipeline for fast and accurate gene and allele typing in bacterial sequencing data based on database preprocessing.

Khachatryan, Lusine; Kraakman, Margriet E M; Bernards, Alexandra T; Laros, Jeroen F J

Affiliation

Khachatryan L; Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands. l.khachatryan@lumc.nl.
Kraakman MEM; Department of Medical Microbiology, Leiden University Medical Center, Leiden, The Netherlands.
Bernards AT; Department of Medical Microbiology, Leiden University Medical Center, Leiden, The Netherlands.
Laros JFJ; Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands.

BMC Genomics; 20(1): 338, 2019 May 06.

Article in English | MEDLINE | ID: mdl-31060512

ABSTRACT

BACKGROUND:

Bacteria carry a wide array of genes, some of which have multiple alleles. These different alleles are often responsible for distinct types of virulence and can determine the classification at the subspecies level (e.g., housekeeping genes for Multi Locus Sequence Typing, MLST). Therefore, it is important to rapidly detect not only the gene of interest, but also the relevant allele. Current sequencing-based methods are limited to mapping reads to each of the known allele references, which is a time-consuming procedure.

RESULTS:

To address this limitation, we developed BacTag - a pipeline that rapidly and accurately detects which genes are present in a sequencing dataset and reports the allele of each identified gene. We exploit the fact that different alleles of the same gene are highly similar. Instead of mapping the reads to each of the allele reference sequences, we preprocess the database prior to the analysis, which makes the subsequent gene and allele identification efficient. During the preprocessing, we determine a representative reference sequence for each gene and store the differences between all alleles and this chosen reference. During the analysis, we assess whether the gene is present in the sequencing data by mapping the reads to this reference sequence; if the gene is found, we compare the variants to those in the preprocessed database. This allows us to detect which specific allele is present in the sequencing data. Our pipeline was successfully tested on artificial WGS E. coli, S. pseudintermedius, P. gingivalis, M. bovis, Borrelia spp. and Streptomyces spp. data and on real WGS E. coli and K. pneumoniae data, reporting the alleles of MLST housekeeping genes.
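The preprocessing idea described above can be illustrated with a minimal sketch. This is not the BacTag implementation: it assumes that all alleles of a gene have the same length and differ only by substitutions, it takes the reference allele as a caller-supplied choice (the abstract does not state how the representative sequence is selected), and it assumes the sample's variants against that reference have already been called by some upstream mapping step. All names (`diff_to_reference`, `preprocess`, `call_allele`, the toy `adk_*` sequences) are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# A variant is (0-based position, alternative base) relative to the reference.
Variant = Tuple[int, str]


def diff_to_reference(reference: str, allele: str) -> FrozenSet[Variant]:
    """Return the substitutions that turn `reference` into `allele`."""
    if len(reference) != len(allele):
        # Assumption of this sketch: no insertions or deletions.
        raise ValueError("this sketch only handles equal-length alleles")
    return frozenset(
        (pos, alt)
        for pos, (ref, alt) in enumerate(zip(reference, allele))
        if ref != alt
    )


def preprocess(
    alleles: Dict[str, str], reference_id: str
) -> Dict[FrozenSet[Variant], str]:
    """Done once per database: index every allele by its variant set."""
    reference = alleles[reference_id]
    return {
        diff_to_reference(reference, sequence): name
        for name, sequence in alleles.items()
    }


def call_allele(
    index: Dict[FrozenSet[Variant], str], observed: Iterable[Variant]
) -> Optional[str]:
    """Done per sample: look up the observed variants; None if unknown."""
    return index.get(frozenset(observed))


# Toy database with three made-up alleles of one gene.
toy_alleles = {
    "adk_1": "ACGTACGT",
    "adk_2": "ACGTACGA",
    "adk_3": "ATGTACGA",
}
toy_index = preprocess(toy_alleles, reference_id="adk_1")
```

With this layout, reads would be mapped once against the single reference, and the resulting variant set is matched with one dictionary lookup instead of one mapping run per known allele. An empty variant set corresponds to the reference allele itself, and a variant set absent from the index suggests an allele not present in the database.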

CONCLUSIONS:

We developed a new pipeline for fast and accurate gene and allele recognition based on database preprocessing and parallel computing; it performed better than or comparably to current popular tools. We believe that our approach can be useful for a wide range of projects, including bacterial subspecies classification, clinical diagnostics of bacterial infections, and epidemiological studies.

Subjects

Bacteria/classification; Bacteria/genetics; High-Throughput Nucleotide Sequencing/methods; Molecular Typing/methods; Sequence Analysis, DNA/methods; Alleles; Databases, Genetic; Genes, Bacterial; Genome, Bacterial

Keywords

Allele typing; Database preprocessing; Multi-locus sequence typing; Next-generation sequencing


Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Bacteria / Sequence Analysis, DNA / Molecular Typing / High-Throughput Nucleotide Sequencing | Language: English | Journal: BMC Genomics | Journal subject: Genetics | Year of publication: 2019 | Document type: Article | Country of affiliation: Netherlands
