Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples.

Pettengill, James B; Pightling, Arthur W; Baugher, Joseph D; Rand, Hugh; Strain, Errol

Affiliation

Pettengill JB; Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, 5001 Campus Drive, College Park, MD 20740, United States of America.
Pightling AW; Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, 5001 Campus Drive, College Park, MD 20740, United States of America.
Baugher JD; Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, 5001 Campus Drive, College Park, MD 20740, United States of America.
Rand H; Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, 5001 Campus Drive, College Park, MD 20740, United States of America.
Strain E; Biostatistics and Bioinformatics Staff, Center for Food Safety and Applied Nutrition, Food and Drug Administration, 5001 Campus Drive, College Park, MD 20740, United States of America.

PLoS One ; 11(11): e0166162, 2016.

Article in En | MEDLINE | ID: mdl-27832109

ABSTRACT

The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and accurately determine the genetic distances among a set of samples. Doing so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). In empirical data (whole-genome sequence data from 18,997 Salmonella isolates), there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to capture the distance between samples less accurately than distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to compute them may be challenging when analyzing large databases.
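
As a rough illustration of the k-mer profile distances named above, the following Python sketch computes Jaccard, Euclidean, and Manhattan distances between two sequences. It is not the authors' pipeline: the function names, the toy sequences, and the choice of k are assumptions made for this example. The `mash_distance` helper applies the published Mash formula D = -(1/k) ln(2j / (1 + j)) to a Jaccard index j that is supplied directly, whereas Mash itself estimates j from MinHash sketches.

```python
from collections import Counter
from math import log, sqrt


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on k-mer presence/absence."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between k-mer count vectors."""
    keys = set(a) | set(b)
    # Counter returns 0 for k-mers absent from a profile.
    return sqrt(sum((a[key] - b[key]) ** 2 for key in keys))


def manhattan_distance(a, b):
    """Sum of absolute differences between k-mer counts."""
    keys = set(a) | set(b)
    return sum(abs(a[key] - b[key]) for key in keys)


def mash_distance(jaccard_index, k):
    """Mash distance from a Jaccard index j: -(1/k) * ln(2j / (1 + j))."""
    if jaccard_index <= 0.0:
        return 1.0  # no shared k-mers: distance is capped at 1
    return -log(2.0 * jaccard_index / (1.0 + jaccard_index)) / k


if __name__ == "__main__":
    # Toy sequences differing at one site; real analyses use much larger k.
    pa = kmer_profile("ACGTACGT", 3)
    pb = kmer_profile("ACGTTCGT", 3)
    print(jaccard_distance(pa, pb))
    print(euclidean_distance(pa, pb))
    print(manhattan_distance(pa, pb))
```

In this sketch a k-mer missing from one profile counts toward the distance in the same way as a true sequence difference, which is one way to read the abstract's remark that k-mer profiles treat absent data as informative.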

Subject(s)

Genome, Bacterial/genetics; Multilocus Sequence Typing/methods; Salmonella/genetics; Sequence Analysis, DNA/methods; Animals; Computational Biology/methods; Humans; Phylogeny; Reproducibility of Results; Salmonella/classification; Salmonella/physiology; Salmonella Infections/microbiology; Species Specificity; Time Factors

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Salmonella / Genome, Bacterial / Sequence Analysis, DNA / Multilocus Sequence Typing | Type of study: Diagnostic_studies | Limits: Animals / Humans | Language: En | Journal: PLoS One | Journal subject: Science / Medicine | Year: 2016 | Document type: Article | Affiliation country: United States
