Phylogenetic double placement of mixed samples.

Balaban, Metin; Mirarab, Siavash

Affiliation

Balaban M; Bioinformatics and Systems Biology Department, University of California San Diego, San Diego, CA 92093, USA.
Mirarab S; Electrical and Computer Engineering Department, University of California San Diego, San Diego, CA 92093, USA.

Bioinformatics; 36(Suppl_1): i335-i343, 2020-07-01.

Article in English | MEDLINE | ID: mdl-32657414

ABSTRACT

MOTIVATION:

Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has received surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at coverage too low for assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction.

RESULTS:

We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between pairs of samples, each represented as a k-mer set. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice.

AVAILABILITY AND IMPLEMENTATION:

The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
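The Jaccard index on k-mer sets that the model in the results is built on can be illustrated with a minimal Python sketch. This is not the MISA implementation: the helper names `kmers` and `jaccard`, the toy sequences, the choice k = 4, and the simplification that a mixed sample's k-mer set is the union of its constituents' k-mer sets are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy sequences, invented purely for illustration (not real genomes).
reference = "ACGTACGTTGCA"
organism_a = "ACGTACGTTGCA"  # identical to the reference
organism_b = "TTTTGGGGCCCC"  # shares no 4-mers with the reference

k = 4  # hypothetical k-mer length; real analyses use much larger k
ref_kmers = kmers(reference, k)
a_kmers = kmers(organism_a, k)
b_kmers = kmers(organism_b, k)

# Assumption: the mixed sample contains every k-mer of both constituents.
mixed_kmers = a_kmers | b_kmers

j_a = jaccard(a_kmers, ref_kmers)
j_b = jaccard(b_kmers, ref_kmers)
j_mixed = jaccard(mixed_kmers, ref_kmers)
print(j_a, j_b, j_mixed)
```

In this toy case the mixed sample's Jaccard index against the reference falls between those of its two constituents, which shows why a mixture's similarity to a reference cannot be read as the similarity of either organism alone and has to be decomposed.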

Subject(s)

Algorithms; Software; Biological Evolution; Genome; High-Throughput Nucleotide Sequencing; Phylogeny; Sequence Analysis, DNA

Full text: 1 Database: MEDLINE Main subject: Algorithms / Software Language: En Year: 2020 Document type: Article