Search | Biblioteca Virtual em Saúde (Virtual Health Library)

A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes.

Gkanogiannis, Anestis; Gazut, Stéphane; Salanoubat, Marcel; Kanj, Sawsan; Brüls, Thomas.

BMC Bioinformatics; 17(1): 311, 2016 Aug 19.

Article in English | MEDLINE | ID: mdl-27542753

ABSTRACT

BACKGROUND: Metagenomics holds great promise for deepening our knowledge of key bacterially driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. To alleviate the constraints that assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies relying on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new "assembly-free" binning protocol.

RESULTS: We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads, and demonstrate: i) significant runtime performance gains of the developed modules over state-of-the-art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps; ii) its relevance for clustering unassembled reads from high-complexity samples (e.g., harboring 700 distinct genomes); iii) its relevance to experimental setups involving multiple samples, through a use case consisting of the de novo identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), against a background of closely related strains and in the absence of reference genomes; iv) its ability to correctly identify clusters of sequences from the E. coli O104:H4 genome as the most strongly correlated with the infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes.

CONCLUSIONS: We present a set of sequence clustering ("binning") modules and their application to biomarker (e.g., genomes of pathogenic organisms) discovery from large synthetic and real metagenomics datasets. Although the modules were initially designed for the "assembly-free" analysis of individual metagenomic samples, we demonstrate their extension to setups involving multiple samples via the "alignment-free" d2S statistic to relate clusters across samples, and illustrate how the clustering modules can otherwise be leveraged for de novo "pre-assembly" tasks by segregating sequences into biologically meaningful partitions.
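The abstract names the d2S statistic but gives neither its formula nor the parameters used. As a rough illustration only, the sketch below follows the form in which d2S is commonly defined in the alignment-free comparison literature: k-mer counts are centered by their expected values and the normalized correlation is turned into a dissimilarity in [0, 1]. The i.i.d. nucleotide background model, the default k = 3, the handling of the degenerate case, and all function names are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt


def centered_kmer_counts(seq, k):
    """Return k-mer counts minus their expectation under an i.i.d. base model.

    Assumption: the background model is estimated from the base
    frequencies of the sequence itself.
    """
    n_words = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n_words))
    centered = {}
    for word in map("".join, product("ACGT", repeat=k)):
        p_word = 1.0
        for base in word:
            p_word *= base_freq[base]
        centered[word] = counts.get(word, 0) - n_words * p_word
    return centered


def d2s_dissimilarity(seq_x, seq_y, k=3):
    """d2S-style dissimilarity: 0 for identical profiles, at most 1."""
    cx = centered_kmer_counts(seq_x, k)
    cy = centered_kmer_counts(seq_y, k)
    num = sx = sy = 0.0
    for word in cx:
        a, b = cx[word], cy[word]
        norm = sqrt(a * a + b * b)
        if norm == 0.0:
            continue  # word carries no signal in either sequence
        num += a * b / norm
        sx += a * a / norm
        sy += b * b / norm
    if sx == 0.0 or sy == 0.0:
        return 0.5  # arbitrary choice for this sketch: treat as uncorrelated
    return 0.5 * (1.0 - num / (sqrt(sx) * sqrt(sy)))


if __name__ == "__main__":
    x = "ACGTACGTTGCAAGCTTAGC"
    y = "TTTTGGGGCCCCAAAATTGG"
    print(d2s_dissimilarity(x, x), d2s_dissimilarity(x, y))
```

In a multi-sample setting, each cluster of reads would presumably be summarized by a pooled k-mer profile and clusters from different samples compared pairwise with such a measure. How the authors pool reads and threshold the resulting values is not stated in the abstract.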

Subjects

Algorithms; Biomarkers/chemistry; Metagenome; Metagenomics; Microbiota/genetics; Datasets as Topic; Humans

Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework.

Kanj, Sawsan; Brüls, Thomas; Gazut, Stéphane.

J Comput Biol; 25(2): 236-250, 2018 Feb.

Article in English | MEDLINE | ID: mdl-28953425

ABSTRACT

We present a new algorithm to cluster high-dimensional sequence data and its application to the field of metagenomics, which aims at reconstructing individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, for example, using the shared nearest neighbors (SNN) rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new approach based on combining the SNN rule with the concept of locality sensitive hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the SNN rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
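The steps described in this abstract (hash the data into buckets, apply the SNN rule inside each bucket, link elements that share enough neighbors, grow clusters from the links) can be sketched as follows. This is a minimal illustration, not the published LSH-SNN implementation: the random-hyperplane hash family, the mutual-neighbor requirement, the union-find bookkeeping, and the parameter names `n_planes`, `k` and `min_shared` are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def lsh_buckets(points, n_planes, rng):
    """Group point indices by the sign pattern of random projections."""
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(sum(w * x for w, x in zip(plane, point)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return list(buckets.values())


def knn_sets(points, bucket, k):
    """Brute-force k nearest neighbors, restricted to one bucket."""
    def sq_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    neighbors = {}
    for i in bucket:
        ranked = sorted((j for j in bucket if j != i), key=lambda j: sq_dist(i, j))
        neighbors[i] = set(ranked[:k])
    return neighbors


def lsh_snn(points, n_planes=2, k=3, min_shared=2, seed=0):
    """Return one cluster label per point (labels are arbitrary root ids)."""
    rng = random.Random(seed)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for bucket in lsh_buckets(points, n_planes, rng):
        nn = knn_sets(points, bucket, k)
        for i in bucket:
            for j in nn[i]:
                # Link mutual neighbors that share enough other neighbors.
                if i in nn[j] and len(nn[i] & nn[j]) >= min_shared:
                    parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    # Two groups lying on different rays from the origin, so a hyperplane
    # through the origin can never split a group across buckets.
    group_a = [(1.0, 0.0), (1.1, 0.0), (1.2, 0.0), (1.3, 0.0)]
    group_b = [(0.0, 1.0), (0.0, 1.1), (0.0, 1.2), (0.0, 1.3)]
    labels = lsh_snn(group_a + group_b)
    print(len(set(labels)))
```

Because neighbors are only searched within a bucket, the quadratic neighbor search runs on small subsets rather than on the full dataset, which is the scaling argument the abstract makes. A single hashing round can also separate true neighbors into different buckets, so a practical version would likely repeat the hashing several times and merge the resulting links; the abstract does not say how the authors handle this.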

Subjects

Genomics/methods; Metagenome; Sequence Analysis, DNA/methods; Cluster Analysis; Genome, Bacterial
