ABSTRACT
BACKGROUND: Many metagenomic studies have linked imbalances in microbial abundance profiles to a wide range of diseases. These studies suggest using microbial abundance profiles as potential markers for metagenomic-associated conditions. Because biomarkers are central to understanding disease progression and developing possible therapies, various computational tools have been proposed for metagenomic biomarker detection. However, most existing tools require prior scripting knowledge and lack user-friendly interfaces, so installing, configuring, and running them takes considerable time and effort. Moreover, no all-in-one solution is available for running and comparing several metagenomic biomarker detection tools simultaneously. In addition, most of these tools only present the suggested biomarkers without any statistical evaluation of their quality. RESULTS: To overcome these limitations, this work presents MetaAnalyst, a software package with a simple graphical user interface (GUI) that (i) automates the installation and configuration of 28 state-of-the-art tools, (ii) supports flexible study design so that a dataset can be studied smoothly under different scenarios, (iii) runs and evaluates several algorithms simultaneously, (iv) supports different input formats and provides the user with several preprocessing capabilities, (v) provides a variety of metrics to evaluate the quality of the suggested markers, and (vi) presents the outcomes as publication-quality plots with various formatting capabilities, as well as Excel sheets. CONCLUSIONS: The utility of this tool has been verified by studying a metagenomic dataset under four scenarios. The MetaAnalyst executable and its user manual are available at https://github.com/mshawaqfeh/MetaAnalyst .
Subject(s)
Algorithms , Software , Humans , Metagenomics , Biomarkers , Phenotype
ABSTRACT
BACKGROUND: Analysis of variance heterogeneity in genome-wide association studies (vGWAS) is an emerging approach for detecting genetic loci involved in gene-gene and gene-environment interactions. vGWAS analysis detects variability in phenotype values across genotypes, as opposed to typical GWAS analysis, which detects variations in the mean phenotype value. RESULTS: A handful of vGWAS analysis methods have recently been introduced in the literature. However, very little work has been done to evaluate these methods. To enable the development of better vGWAS analysis methods, this work presents the first quantitative vGWAS simulation procedure. To that end, we describe the mathematical framework and algorithm for generating quantitative vGWAS phenotype data from genotype profiles. Our simulation model accounts for both haploid and diploid genotypes under different modes of dominance. Our model is also able to simulate any number of genetic loci causing mean and variance heterogeneity. CONCLUSIONS: We demonstrate the utility of our simulation procedure by generating a variety of genetic loci types to evaluate common GWAS and vGWAS analysis methods. The results of this evaluation highlight the challenges current tools face in detecting GWAS and vGWAS loci.
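The distinction between a mean effect and a variance effect can be illustrated with a minimal sketch. This is not the simulation procedure of the paper: the additive diploid coding (0/1/2 minor-allele counts), the log-linear variance model, and the names `simulate_vgwas_phenotype`, `mean_effect`, and `var_effect` are assumptions made here for illustration only.

```python
import numpy as np

def simulate_vgwas_phenotype(genotypes, mean_effect, var_effect, base_sd=1.0, seed=None):
    """Draw one quantitative phenotype per individual.

    Assumed model (illustrative, not taken from the paper):
      mean     = mean_effect * g
      variance = base_sd**2 * exp(var_effect * g)
    where g is the minor-allele count (0, 1 or 2) at a single locus.
    """
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes, dtype=float)
    mean = mean_effect * g
    sd = base_sd * np.exp(var_effect * g / 2.0)
    return mean + sd * rng.standard_normal(g.shape)

# Diploid genotypes at one locus, drawn under Hardy-Weinberg proportions.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=20000)

# A pure variance locus: no shift in the mean, variance grows with allele count.
y = simulate_vgwas_phenotype(genotypes, mean_effect=0.0, var_effect=0.8, seed=1)

group_mean = {k: y[genotypes == k].mean() for k in (0, 1, 2)}
group_var = {k: y[genotypes == k].var() for k in (0, 1, 2)}
```

Under these assumptions a test on group means would see little signal at this locus, while the per-genotype variances in `group_var` differ clearly, which is the kind of locus a vGWAS method is meant to detect.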
Subject(s)
Computer Simulation , Genome-Wide Association Study , Algorithms , Diploidy , Genetic Loci , Genotype , Humans , Linkage Disequilibrium/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics
ABSTRACT
BACKGROUND: Biomarker detection is a major means of translating biological data into clinical applications. Owing to recent advances in high-throughput sequencing technologies, a growing number of metagenomics studies have suggested dysbiosis in microbial communities as a potential biomarker for certain diseases. The reproducibility of results drawn from metagenomic data is crucial for clinical applications and for preventing incorrect biological conclusions. Variability in sample size and in the subjects participating in the experiments induces diversity, which may drastically change the outcome of biomarker detection algorithms. Therefore, a robust biomarker detection algorithm is needed that ensures consistent results irrespective of the natural diversity present in the samples. RESULTS: Toward this end, this paper proposes a novel Regularized Low Rank-Sparse Decomposition (RegLRSD) algorithm. RegLRSD models the bacterial abundance data as a superposition of a sparse matrix and a low-rank matrix, which account for the differentially and non-differentially abundant microbes, respectively. Hence, the biomarker detection problem is cast as a matrix decomposition problem. To yield more consistent and solid biological conclusions, RegLRSD incorporates the prior knowledge that the irrelevant microbes do not exhibit significant variation between samples belonging to different phenotypes. Moreover, an efficient algorithm to extract the sparse matrix is proposed. Comprehensive comparisons of RegLRSD with state-of-the-art algorithms on three realistic datasets are presented. The results demonstrate that RegLRSD consistently outperforms the other algorithms in terms of reproducibility and provides a marker list with high classification accuracy.
CONCLUSIONS: The proposed RegLRSD algorithm for biomarker detection provides high reproducibility and classification accuracy regardless of the dataset complexity and the number of selected biomarkers. This makes RegLRSD a reliable and powerful tool for identifying potential metagenomic biomarkers.
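The idea of splitting an abundance matrix into a low-rank part and a sparse part can be sketched with plain principal component pursuit, solved by a fixed-step augmented Lagrangian iteration. This is a generic decomposition, not RegLRSD itself: it has no phenotype-aware regularizer, and the function names, the default weights `lam` and `mu`, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def shrink(X, tau):
    """Entry-wise soft thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def low_rank_sparse(M, max_iter=1000, tol=1e-7):
    """Split M into L (low rank) + S (sparse) by alternating the two shrinkage steps."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))         # common default weight on the sparse term
    mu = m * n / (4.0 * np.abs(M).sum())   # common default penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                   # Lagrange multipliers for M = L + S
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic check: a rank-3 background plus 5% large, randomly placed deviations.
rng = np.random.default_rng(0)
n_taxa, n_samples, rank = 100, 80, 3
L0 = rng.standard_normal((n_taxa, rank)) @ rng.standard_normal((rank, n_samples))
S0 = np.zeros((n_taxa, n_samples))
mask = rng.random((n_taxa, n_samples)) < 0.05
S0[mask] = rng.choice([-1.0, 1.0], size=mask.sum()) * rng.uniform(5, 10, size=mask.sum())

L, S = low_rank_sparse(L0 + S0)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
```

In a biomarker setting, one plausible way to rank taxa would be by the row-wise energy of `S`, for example `np.linalg.norm(S, axis=1)`; that ranking rule is likewise an assumption of this sketch.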
Subject(s)
Algorithms , Biomarkers/analysis , Metagenomics/methods , Animals , Biomarkers/metabolism , Colitis, Ulcerative/diagnosis , Colitis, Ulcerative/metabolism , Dogs , Exocrine Pancreatic Insufficiency/diagnosis , Exocrine Pancreatic Insufficiency/metabolism , High-Throughput Nucleotide Sequencing , Inflammatory Bowel Diseases/diagnosis , Inflammatory Bowel Diseases/metabolism , Mice , Reproducibility of Results
ABSTRACT
BACKGROUND: Inferring microbial interaction networks (MINs) and modeling their dynamics are critical for understanding the mechanisms of the bacterial ecosystem and for designing antibiotic and/or probiotic therapies. Recently, several approaches were proposed to infer MINs using the generalized Lotka-Volterra (gLV) model. A main drawback of these approaches is that they consider only measurement noise and ignore uncertainties in the underlying dynamics. Furthermore, MIN inference is characterized by a limited number of observations and by nonlinearity in the regulatory mechanisms. Therefore, novel estimation techniques are needed to address these challenges. RESULTS: This work proposes SgLV-EKF: a stochastic gLV model that adopts the extended Kalman filter (EKF) algorithm to model the MIN dynamics. In particular, SgLV-EKF employs a stochastic model of the MIN, adding a noise term to the dynamical model to compensate for modeling uncertainties. This stochastic model is more realistic than the conventional gLV model, which assumes that the MIN dynamics are perfectly governed by the gLV equations. After specifying the stochastic model structure, we propose the EKF to estimate the MIN. SgLV-EKF was compared with two similarity-based algorithms, one algorithm from the integral-based family, and two regression-based algorithms, in terms of performance on two synthetic data-sets and two real data-sets. The first synthetic data-set models the randomness in the measurement data, whereas the second incorporates uncertainties in the underlying dynamics. The real data-sets are provided by a recent study pertaining to an antibiotic-mediated Clostridium difficile infection. The experimental results demonstrate that SgLV-EKF outperforms the alternative methods in terms of robustness to measurement noise, modeling errors, and tracking the dynamics of the MIN.
CONCLUSIONS: Performance analysis demonstrates that the proposed SgLV-EKF algorithm represents a powerful and reliable tool to infer MINs and track their dynamics.
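A stochastic gLV model with an EKF can be sketched as follows. This is a simplified illustration, not the SgLV-EKF algorithm: it assumes an Euler discretization, additive Gaussian process and measurement noise, direct observation of all abundances, and known growth rates `r` and interaction matrix `A`, so the filter only tracks abundances rather than inferring the network. All names and parameter values are hypothetical.

```python
import numpy as np

def glv_step(x, r, A, dt):
    """One Euler step of dx_i/dt = x_i * (r_i + sum_j A_ij x_j)."""
    return x + dt * x * (r + A @ x)

def glv_jacobian(x, r, A, dt):
    """Jacobian of glv_step with respect to x, used to linearize the EKF."""
    return np.eye(x.size) + dt * (np.diag(r + A @ x) + np.diag(x) @ A)

def ekf_step(x_est, P, y, r, A, dt, Q, R):
    """One predict/update cycle; measurement model assumed to be y = x + noise."""
    J = glv_jacobian(x_est, r, A, dt)
    x_pred = glv_step(x_est, r, A, dt)
    P_pred = J @ P @ J.T + Q
    K = P_pred @ np.linalg.inv(P_pred + R)   # Kalman gain with H = I
    x_new = x_pred + K @ (y - x_pred)
    P_new = (np.eye(x_est.size) - K) @ P_pred
    return x_new, P_new

# Two-species example with illustrative parameters.
rng = np.random.default_rng(0)
r = np.array([1.0, 0.8])
A = np.array([[-1.0, -0.3], [-0.2, -1.0]])
dt, steps = 0.1, 300
Q = 1e-4 * np.eye(2)   # process noise covariance (uncertainty in the dynamics)
R = 1e-2 * np.eye(2)   # measurement noise covariance

x = np.array([0.2, 0.2])
truth, meas = [], []
for _ in range(steps):
    x = glv_step(x, r, A, dt) + rng.multivariate_normal(np.zeros(2), Q)
    truth.append(x)
    meas.append(x + rng.multivariate_normal(np.zeros(2), R))
truth, meas = np.array(truth), np.array(meas)

# Initialize the filter at the first measurement and run it over the rest.
x_est, P = meas[0].copy(), R.copy()
est = [x_est]
for y in meas[1:]:
    x_est, P = ekf_step(x_est, P, y, r, A, dt, Q, R)
    est.append(x_est)
est = np.array(est)

rmse_meas = np.sqrt(np.mean((meas - truth) ** 2))
rmse_ekf = np.sqrt(np.mean((est - truth) ** 2))
```

Estimating the interactions themselves would require treating the entries of `A` (and `r`) as unknowns, for instance by appending them to the state vector; that extension is left out of this sketch.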
Subject(s)
Algorithms , Metagenomics/methods , Microbial Interactions , Models, Theoretical
ABSTRACT
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray-based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains challenging. This paper proposes a novel method for identifying recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrate the ability of our method to accurately separate recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
Subject(s)
Computational Biology/methods , DNA Copy Number Variations/genetics , Comparative Genomic Hybridization , Databases, Genetic , Humans , Models, Genetic
ABSTRACT
BACKGROUND: Metronidazole has a substantial impact on the gut microbiome. However, the recovery of the microbiome after discontinuation of administration, and the metabolic consequences of such alterations have not been investigated to date. OBJECTIVES: To describe the impact of 14-day metronidazole administration, alone or in combination with a hydrolyzed protein diet, on fecal microbiome, metabolome, bile acids (BAs), and lactate production, and on serum metabolome in healthy dogs. ANIMALS: Twenty-four healthy pet dogs. METHODS: Prospective, nonrandomized controlled study. Dogs fed various commercial diets were divided into 3 groups: control group (no intervention, G1); group receiving hydrolyzed protein diet, followed by metronidazole administration (G2); and group receiving metronidazole only (G3). Microbiome composition was evaluated with sequencing of 16S rRNA genes and quantitative polymerase chain reaction (qPCR)-based dysbiosis index. Untargeted metabolomics analysis of fecal and serum samples was performed, followed by targeted assays for fecal BAs and lactate. RESULTS: No changes were observed in G1, or in G2 during the diet change. Metronidazole significantly changed microbiome composition in G2 and G3, including decreases in richness (P < .001) and in key bacteria such as Fusobacteria (q < .001) that did not fully resolve 4 weeks after metronidazole discontinuation. Fecal dysbiosis index was significantly increased (P < .001). Those changes were accompanied by increased fecal total lactate (P < .001), and decreased secondary BAs deoxycholic acid and lithocholic acid (P < .001). CONCLUSION AND CLINICAL IMPORTANCE: Our results indicate a minimum 4-week effect of metronidazole on fecal microbiome and metabolome, supporting a cautious approach to prescription of metronidazole in dogs.
Subject(s)
Metabolome , Microbiota , Animals , Dogs , Feces , Metronidazole/pharmacology , Prospective Studies , RNA, Ribosomal, 16S/genetics
ABSTRACT
BACKGROUND: Recent developments in high-throughput sequencing technologies allow the characterization of the microbial communities inhabiting our world. Various metagenomic studies have suggested using microbial taxa as potential biomarkers for certain diseases. In practice, the number of available samples varies from experiment to experiment. Therefore, a robust biomarker detection algorithm is needed to provide a set of potential markers irrespective of the number of available samples. Consistent performance is essential to derive solid biological conclusions and to transfer these findings into clinical applications. Surprisingly, the consistency of a metagenomic biomarker detection algorithm with respect to the variation in the experiment size has not been addressed by the current state-of-the-art algorithms. RESULTS: We propose a consistency-classification framework that enables the assessment of consistency and classification performance of a biomarker discovery algorithm. This evaluation protocol is based on random resampling to mimic the variation in the experiment size. Moreover, we model the metagenomic data matrix as a superposition of two matrices. The first matrix is a low-rank matrix that models the abundance levels of the irrelevant bacteria. The second matrix is a sparse matrix that captures the abundance levels of the bacteria that are differentially abundant between different phenotypes. Then, we propose a novel Robust Principal Component Analysis (RPCA) based biomarker discovery algorithm to recover the sparse matrix. RPCA belongs to the class of multivariate feature selection methods, which treat the features collectively rather than individually. This provides the proposed algorithm with an inherent ability to handle the complex microbial interactions. Comprehensive comparisons of RPCA with state-of-the-art algorithms on two realistic datasets are conducted.
Results show that RPCA consistently outperforms the other algorithms in terms of classification accuracy and reproducibility. CONCLUSIONS: The RPCA-based biomarker detection algorithm provides high reproducibility irrespective of the complexity of the dataset or the number of selected biomarkers. Also, RPCA selects biomarkers with high discriminative accuracy. Thus, RPCA is a consistent and accurate tool for selecting taxonomic biomarkers for different microbial populations. REVIEWERS: This article was reviewed by Masanori Arita and Zoltan Gaspari.
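The resampling idea behind a consistency evaluation can be sketched as follows: draw random subsets of the samples, run a marker selector on each subset, and measure how much the selected marker sets overlap. This is only a sketch of the consistency half of such a framework; the average pairwise Jaccard index, the 70% subsampling fraction, and the stand-in selector `select_markers` (a simple standardized mean difference, not RPCA) are assumptions made here.

```python
import numpy as np
from itertools import combinations

def select_markers(X, labels, k):
    """Hypothetical stand-in selector: top-k features by standardized mean difference."""
    a, b = X[labels == 0], X[labels == 1]
    pooled_sd = np.sqrt((a.var(axis=0) + b.var(axis=0)) / 2.0) + 1e-12
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
    return set(np.argsort(score)[::-1][:k].tolist())

def consistency(X, labels, k, n_rounds=20, frac=0.7, seed=0):
    """Average pairwise Jaccard overlap of marker sets chosen on random subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    marker_sets = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        marker_sets.append(select_markers(X[idx], labels[idx], k))
    overlaps = [len(s & t) / len(s | t) for s, t in combinations(marker_sets, 2)]
    return float(np.mean(overlaps))

# Synthetic data: 60 samples x 50 features; the first 5 features differ between groups.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 30)
X = rng.standard_normal((60, 50))
X[labels == 1, :5] += 4.0

score = consistency(X, labels, k=5)
```

A score near 1 means the same markers are selected regardless of which samples happen to be included. With strongly unbalanced groups, a stratified subsample would be the safer choice, since a plain random draw could leave one phenotype with very few samples.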
Subject(s)
Genetic Markers , Metagenomics/methods , Microbiota/genetics , Algorithms , Animals , Bacteria/classification , Bacteria/genetics , Colitis, Ulcerative/diagnosis , Colitis, Ulcerative/microbiology , Dog Diseases/diagnosis , Dog Diseases/microbiology , Dogs , Inflammatory Bowel Diseases/diagnosis , Inflammatory Bowel Diseases/microbiology , Inflammatory Bowel Diseases/veterinary , Mice , Models, Genetic , Principal Component Analysis/methods , Reproducibility of Results
ABSTRACT
In systems biology, the regulation of gene expression involves a complex network of regulators. Transcription factors (TFs) represent an important component of this network: they are proteins that control which genes are turned on or off in the genome by binding to specific DNA sequences. Transcription regulatory networks (TRNs) describe gene expression as a function of regulatory inputs specified by interactions between proteins and DNA. A complete understanding of TRNs helps to predict a variety of biological processes and to diagnose, characterize, and eventually develop more efficient therapies. Recent advances in high-throughput biological technologies, such as DNA microarrays and next-generation sequencing (NGS), have made the inference of transcription factor activities (TFAs) and TF-gene regulations possible. Network component analysis (NCA) represents an efficient computational framework for TRN inference from the information provided by microarrays, ChIP-on-chip, and prior information about TF-gene regulation. However, NCA suffers from several shortcomings. Recently, several algorithms based on the NCA framework have been proposed to overcome these shortcomings. This paper first reviews the computational principles behind NCA and then surveys the state-of-the-art NCA-based algorithms proposed in the literature for TRN reconstruction.