KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate.

Bai, Xin; Ren, Jie; Fan, Yingying; Sun, Fengzhu

Affiliation

Bai X; Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA.
Ren J; Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA.
Fan Y; Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA.
Sun F; Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA.

Bioinformatics; 37(6): 759-766, 2021 May 5.

Article in English | MEDLINE | ID: mdl-33119059

ABSTRACT

MOTIVATION:

The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses and plasmids. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all k-mers as features for prediction, without selecting the k-mers relevant to each group of sequences, i.e. the unique nucleotide patterns that carry biological significance.
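As an illustration of the k-mer frequency features mentioned above, the following is a minimal sketch, not the authors' implementation. The function name `kmer_frequencies`, the restriction to the ACGT alphabet and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies of seq over the ACGT alphabet.

    Hypothetical helper for illustration only; the abstract does not
    specify how KIMI computes its k-mer features.
    """
    seq = seq.upper()
    # Enumerate all 4**k possible k-mers so every sequence maps to the
    # same fixed-length feature vector.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
            total += 1
    # Normalize counts to frequencies; an empty sequence gives all zeros.
    return {w: (c / total if total else 0.0) for w, c in counts.items()}
```

Each sequence then becomes one row of a feature matrix with 4^k columns, which is the kind of input a feature selection step would operate on.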

RESULTS:

To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework for sequence motif discovery at an arbitrary target FDR level, so that reproducibility can be theoretically guaranteed. KIMI is based on model-X knockoffs, regarded as the state-of-the-art statistical method for FDR control. Simulation studies show that KIMI simultaneously controls the FDR and yields high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and apply KIMI to a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI.

AVAILABILITY AND IMPLEMENTATION:

Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
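The FDR-controlled selection step referred to in the results can be illustrated with the generic knockoff+ thresholding rule from the knockoff filter literature. This is a sketch under stated assumptions, not KIMI's code: the abstract does not say which knockoff statistics KIMI uses, so the input `W` (one statistic per k-mer, large and positive when the real feature looks more important than its knockoff copy) and the function name `knockoff_plus_select` are hypothetical.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    W: array of knockoff statistics, one per feature (assumed given).
    Returns the indices of the selected features.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics estimate the number of false positives;
        # the "+1" is what makes this the knockoff+ variant.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int)
```

With only a handful of features, a small target level can be unreachable because the estimated false discovery proportion is never below 1 divided by the number of selected features; in that case the function returns an empty selection.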

Subjects

Metagenomics; Microbiota; Algorithms; Computer Simulation; Metagenome; Reproducibility of Results; Sequence Analysis, DNA


Full text: 1 | Database: MEDLINE | Main subject: Metagenomics / Microbiota | Type of study: Diagnostic_studies / Prognostic_studies | Language: En | Journal: Bioinformatics | Journal subject: MEDICAL INFORMATICS | Year of publication: 2021 | Document type: Article | Country of affiliation: United States
