Search | VHL Regional Portal

1.

Standardized multi-omics of Earth's microbiomes reveals microbial and metabolite diversity.

Shaffer, Justin P; Nothias, Louis-Félix; Thompson, Luke R; Sanders, Jon G; Salido, Rodolfo A; Couvillion, Sneha P; Brejnrod, Asker D; Lejzerowicz, Franck; Haiminen, Niina; Huang, Shi; Lutz, Holly L; Zhu, Qiyun; Martino, Cameron; Morton, James T; Karthikeyan, Smruthi; Nothias-Esposito, Mélissa; Dührkop, Kai; Böcker, Sebastian; Kim, Hyun Woo; Aksenov, Alexander A; Bittremieux, Wout; Minich, Jeremiah J; Marotz, Clarisse; Bryant, MacKenzie M; Sanders, Karenina; Schwartz, Tara; Humphrey, Greg; Vásquez-Baeza, Yoshiki; Tripathi, Anupriya; Parida, Laxmi; Carrieri, Anna Paola; Beck, Kristen L; Das, Promi; González, Antonio; McDonald, Daniel; Ladau, Joshua; Karst, Søren M; Albertsen, Mads; Ackermann, Gail; DeReus, Jeff; Thomas, Torsten; Petras, Daniel; Shade, Ashley; Stegen, James; Song, Se Jin; Metz, Thomas O; Swafford, Austin D; Dorrestein, Pieter C; Jansson, Janet K; Gilbert, Jack A.

Nat Microbiol ; 7(12): 2128-2150, 2022 12.

Article in English | MEDLINE | ID: mdl-36443458

ABSTRACT

Despite advances in sequencing, lack of standardization makes comparisons across studies challenging and hampers insights into the structure and function of microbial communities across multiple habitats on a planetary scale. Here we present a multi-omics analysis of a diverse set of 880 microbial community samples collected for the Earth Microbiome Project. We include amplicon (16S, 18S, ITS) and shotgun metagenomic sequence data, and untargeted metabolomics data (liquid chromatography-tandem mass spectrometry and gas chromatography mass spectrometry). We used standardized protocols and analytical methods to characterize microbial communities, focusing on relationships and co-occurrences of microbially related metabolites and microbial taxa across environments, thus allowing us to explore diversity at extraordinary scale. In addition to a reference database for metagenomic and metabolomic data, we provide a framework for incorporating additional studies, enabling the expansion of existing knowledge in the form of an evolving community resource. We demonstrate the utility of this database by testing the hypothesis that every microbe and metabolite is everywhere but the environment selects. Our results show that metabolite diversity exhibits turnover and nestedness related to both microbial communities and the environment, whereas the relative abundances of microbially related metabolites vary and co-occur with specific microbial consortia in a habitat-specific manner. We additionally show the power of certain chemistry, in particular terpenoids, in distinguishing Earth's environments (for example, terrestrial plant surfaces and soils, freshwater and marine animal stool), as well as that of certain microbes including Conexibacter woesei (terrestrial soils), Haloquadratum walsbyi (marine deposits) and Pantoea dispersa (terrestrial plant detritus). 
This Resource provides insight into the taxa and metabolites within microbial communities from diverse habitats across Earth, informing both microbial and chemical ecology, and provides a foundation and methods for multi-omics microbiome studies of hosts and the environment.
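
The turnover and nestedness mentioned in this abstract are commonly quantified by partitioning pairwise Sørensen dissimilarity into a turnover component and a nestedness-resultant component. The abstract does not say which partition the authors used, so the following is only a minimal sketch under that assumption; the function name and the toy feature sets are hypothetical.

```python
from typing import Dict, Hashable, Set


def beta_partition(site1: Set[Hashable], site2: Set[Hashable]) -> Dict[str, float]:
    """Split pairwise Sorensen dissimilarity into turnover and nestedness.

    Assumption: presence/absence data and the Sorensen-based partition;
    this is not taken from the paper itself.
    """
    if not site1 or not site2:
        raise ValueError("both sites must contain at least one feature")
    a = len(site1 & site2)   # features shared by both sites
    b = len(site1 - site2)   # features unique to site 1
    c = len(site2 - site1)   # features unique to site 2
    sorensen = (b + c) / (2 * a + b + c)
    m = min(b, c)
    turnover = m / (a + m)            # Simpson dissimilarity (replacement)
    nestedness = sorensen - turnover  # remainder attributed to richness difference
    return {"sorensen": sorensen, "turnover": turnover, "nestedness": nestedness}


# Hypothetical metabolite sets: the second sample is a strict subset of the first.
nested = beta_partition({"m1", "m2", "m3", "m4"}, {"m1", "m2"})
# Hypothetical metabolite sets with no overlap at all.
replaced = beta_partition({"m1", "m2"}, {"m3", "m4"})
print(nested)
print(replaced)
```

In the first toy case all dissimilarity is attributed to nestedness, and in the second all of it is attributed to turnover, which is the distinction the abstract draws.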

Subject(s)

Microbiota , Animals , Microbiota/genetics , Metagenome , Metagenomics , Earth, Planet , Soil

2.

Combining explainable machine learning, demographic and multi-omic data to inform precision medicine strategies for inflammatory bowel disease.

Gardiner, Laura-Jayne; Carrieri, Anna Paola; Bingham, Karen; Macluskie, Graeme; Bunton, David; McNeil, Marian; Pyzer-Knapp, Edward O.

PLoS One ; 17(2): e0263248, 2022.

Article in English | MEDLINE | ID: mdl-35196350

ABSTRACT

Inflammatory bowel diseases (IBDs), including ulcerative colitis and Crohn's disease, affect several million individuals worldwide. These diseases are heterogeneous at the clinical, immunological and genetic levels and result from complex host and environmental interactions. Investigating drug efficacy for IBD can improve our understanding of why treatment response can vary between patients. We propose an explainable machine learning (ML) approach that combines bioinformatics and domain insight, to integrate multi-modal data and predict inter-patient variation in drug response. Using explanation of our models, we interpret the ML models' predictions to infer unique combinations of important features associated with pharmacological responses obtained during preclinical testing of drug candidates in ex vivo patient-derived fresh tissues. Our inferred multi-modal features that are predictive of drug efficacy include multi-omic data (genomic and transcriptomic), demographic, medicinal and pharmacological data. Our aim is to understand variation in patient responses before a drug candidate moves forward to clinical trials. As a pharmacological measure of drug efficacy, we measured the reduction in the release of the inflammatory cytokine TNFα from the fresh IBD tissues in the presence/absence of test drugs. We initially explored the effects of a mitogen-activated protein kinase (MAPK) inhibitor; however, we later showed our approach can be applied to other targets, test drugs or mechanisms of interest. Our best model predicted TNFα levels from demographic, medicinal and genomic features with an error of only 4.98% on unseen patients. We incorporated transcriptomic data to validate insights from genomic features. 
Our results showed variations in drug effectiveness (measured by ex vivo assays) between patients that differed in gender, age or condition and linked new genetic polymorphisms to patient response variation to the anti-inflammatory treatment BIRB796 (Doramapimod). Our approach models IBD drug response while also identifying its most predictive features as part of a transparent ML precision medicine strategy.

Subject(s)

Colitis, Ulcerative/genetics , Colitis, Ulcerative/metabolism , Crohn Disease/genetics , Crohn Disease/metabolism , Genomics/methods , Machine Learning , Precision Medicine/methods , Adolescent , Adult , Aged , Anti-Inflammatory Agents, Non-Steroidal/pharmacology , Colitis, Ulcerative/pathology , Crohn Disease/pathology , Drug Evaluation, Preclinical/methods , Female , Humans , Male , Mesalamine/pharmacology , Middle Aged , Naphthalenes/pharmacology , Phenylurea Compounds/pharmacology , Prednisolone/pharmacology , Pyrazoles/pharmacology , Signal Transduction/drug effects , Transcriptome/genetics , Tumor Necrosis Factor-alpha/metabolism , Young Adult

3.

Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data.

Jiang, Lingjing; Haiminen, Niina; Carrieri, Anna-Paola; Huang, Shi; Vázquez-Baeza, Yoshiki; Parida, Laxmi; Kim, Ho-Cheol; Swafford, Austin D; Knight, Rob; Natarajan, Loki.

Biometrics ; 78(3): 1155-1167, 2022 09.

Article in English | MEDLINE | ID: mdl-33914902

ABSTRACT

Feature selection is indispensable in microbiome data analysis, but it can be particularly challenging as microbiome data sets are high dimensional, underdetermined, sparse and compositional. Great efforts have recently been made on developing new methods for feature selection that handle the above data characteristics, but almost all methods were evaluated based on performance of model predictions. However, little attention has been paid to address a fundamental question: how appropriate are those evaluation criteria? Most feature selection methods often control the model fit, but the ability to identify meaningful subsets of features cannot be evaluated simply based on the prediction accuracy. If tiny changes to the data would lead to large changes in the chosen feature subset, then many selected features are likely to be a data artifact rather than real biological signal. This crucial need of identifying relevant and reproducible features motivated the reproducibility evaluation criterion such as Stability, which quantifies how robust a method is to perturbations in the data. In our paper, we compare the performance of popular model prediction metrics (MSE or AUC) with proposed reproducibility criterion Stability in evaluating four widely used feature selection methods in both simulations and experimental microbiome applications with continuous or binary outcomes. We conclude that Stability is a preferred feature selection criterion over model prediction metrics because it better quantifies the reproducibility of the feature selection method.
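
To make the idea of a stability criterion concrete, the sketch below scores how consistently a selector picks the same features across perturbed copies of the data (for example, bootstrap resamples). It assumes the frequency-based estimator of Nogueira and colleagues; the abstract does not give the exact formula used in the paper, and the function name and toy selections are hypothetical.

```python
from typing import List, Set


def selection_stability(selections: List[Set[int]], n_features: int) -> float:
    """Stability of M feature subsets chosen from n_features candidates.

    Assumption: the frequency-based estimator of Nogueira et al.;
    1.0 means the same subset was selected in every run.
    """
    m = len(selections)
    if m < 2:
        raise ValueError("need at least two selection runs")
    # Fraction of runs in which each feature was selected.
    freq = [sum(f in s for s in selections) / m for f in range(n_features)]
    # Unbiased sample variance of each feature's selection indicator.
    var = [m / (m - 1) * p * (1 - p) for p in freq]
    k_bar = sum(len(s) for s in selections) / m  # mean subset size
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    if denom == 0:
        raise ValueError("undefined when no features or all features are always selected")
    return 1 - (sum(var) / n_features) / denom


# Hypothetical runs: the same three features are selected every time.
print(selection_stability([{0, 1, 2}] * 4, n_features=10))
# Hypothetical runs: two runs that share no features.
print(selection_stability([{0, 1}, {2, 3}], n_features=4))
```

A selector can reach a low prediction error while scoring poorly here, which is the gap between prediction metrics and reproducibility that the abstract highlights.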

Subject(s)

Microbiota , Algorithms , Reproducibility of Results

4.

Efficient computation of Faith's phylogenetic diversity with applications in characterizing microbiomes.

Armstrong, George; Cantrell, Kalen; Huang, Shi; McDonald, Daniel; Haiminen, Niina; Carrieri, Anna Paola; Zhu, Qiyun; Gonzalez, Antonio; McGrath, Imran; Beck, Kristen L; Hakim, Daniel; Havulinna, Aki S; Méric, Guillaume; Niiranen, Teemu; Lahti, Leo; Salomaa, Veikko; Jain, Mohit; Inouye, Michael; Swafford, Austin D; Kim, Ho-Cheol; Parida, Laxmi; Vázquez-Baeza, Yoshiki; Knight, Rob.

Genome Res ; 31(11): 2131-2137, 2021 11.

Article in English | MEDLINE | ID: mdl-34479875

ABSTRACT

The number of publicly available microbiome samples is continually growing. As data set size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity (Faith's PD) is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's phylogenetic diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.
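
For readers unfamiliar with the metric, Faith's PD for a sample is the total branch length of the tree spanned by the taxa observed in that sample. The sketch below computes it naively on a toy tree by walking from each observed tip toward the root; it is not the stacked algorithm described in the abstract, and the tree encoding, names, and the choice to include the path to the root are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Toy rooted tree: node -> (parent, branch length to parent). The root has parent None.
Tree = Dict[str, Tuple[Optional[str], float]]


def faith_pd(tree: Tree, observed_tips: Iterable[str]) -> float:
    """Sum the lengths of all branches on paths from observed tips to the root."""
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node: Optional[str] = tip
        # Walk upward until the root or a node whose branch was already counted.
        while node is not None and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            if parent is not None:
                total += length
            node = parent
    return total


toy_tree: Tree = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 2.0),
    "t1": ("A", 0.5),
    "t2": ("A", 0.25),
    "t3": ("B", 1.0),
}

print(faith_pd(toy_tree, ["t1", "t2"]))  # close relatives share the branch to A
print(faith_pd(toy_tree, ["t1", "t3"]))  # distant relatives span more of the tree
```

Each branch is counted once per sample, so the cost of this naive version grows with the number of samples times the tree depth, which is the kind of bottleneck the abstract says arises on very large trees.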

Subject(s)

Microbiota , Microbiota/genetics , Phylogeny

5.

Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function.

Gardiner, Laura-Jayne; Rusholme-Pilcher, Rachel; Colmer, Josh; Rees, Hannah; Crescente, Juan Manuel; Carrieri, Anna Paola; Duncan, Susan; Pyzer-Knapp, Edward O; Krishna, Ritesh; Hall, Anthony.

Proc Natl Acad Sci U S A ; 118(32), 2021 08 10.

Article in English | MEDLINE | ID: mdl-34353905

ABSTRACT

The circadian clock is an important adaptation to life on Earth. Here, we use machine learning to predict complex, temporal, and circadian gene expression patterns in Arabidopsis. Most significantly, we classify circadian genes using DNA sequence features generated de novo from public, genomic resources, facilitating downstream application of our methods with no experimental work or prior knowledge needed. We use local model explanation that is transcript specific to rank DNA sequence features, providing a detailed profile of the potential circadian regulatory mechanisms for each transcript. Furthermore, we can discriminate the temporal phase of transcript expression using the local, explanation-derived, and ranked DNA sequence features, revealing hidden subclasses within the circadian class. Model interpretation/explanation provides the backbone of our methodological advances, giving insight into biological processes and experimental design. Next, we use model interpretation to optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints. Finally, we predict the circadian time from a single, transcriptomic timepoint, deriving marker transcripts that are most impactful for accurate prediction; this could facilitate the identification of altered clock function from existing datasets.

Subject(s)

Arabidopsis Proteins/genetics , Circadian Clocks/genetics , Circadian Rhythm/physiology , Machine Learning , Models, Biological , Apoproteins/genetics , Arabidopsis/genetics , Arabidopsis/physiology , Circadian Clocks/physiology , Circadian Rhythm/genetics , Ecotype , Gene Expression Profiling , Gene Expression Regulation, Plant , Phytochrome/genetics , Phytochrome A/genetics , Regulatory Sequences, Nucleic Acid

6.

SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment.

Marotz, Clarisse; Belda-Ferre, Pedro; Ali, Farhana; Das, Promi; Huang, Shi; Cantrell, Kalen; Jiang, Lingjing; Martino, Cameron; Diner, Rachel E; Rahman, Gibraan; McDonald, Daniel; Armstrong, George; Kodera, Sho; Donato, Sonya; Ecklu-Mensah, Gertrude; Gottel, Neil; Salas Garcia, Mariana C; Chiang, Leslie Y; Salido, Rodolfo A; Shaffer, Justin P; Bryant, Mac Kenzie; Sanders, Karenina; Humphrey, Greg; Ackermann, Gail; Haiminen, Niina; Beck, Kristen L; Kim, Ho-Cheol; Carrieri, Anna Paola; Parida, Laxmi; Vázquez-Baeza, Yoshiki; Torriani, Francesca J; Knight, Rob; Gilbert, Jack; Sweeney, Daniel A; Allard, Sarah M.

Microbiome ; 9(1): 132, 2021 06 08.

Article in English | MEDLINE | ID: mdl-34103074

ABSTRACT

BACKGROUND: SARS-CoV-2 is an RNA virus responsible for the coronavirus disease 2019 (COVID-19) pandemic. Viruses exist in complex microbial environments, and recent studies have revealed both synergistic and antagonistic effects of specific bacterial taxa on viral prevalence and infectivity. We set out to test whether specific bacterial communities predict SARS-CoV-2 occurrence in a hospital setting. METHODS: We collected 972 samples from hospitalized patients with COVID-19, their health care providers, and hospital surfaces before, during, and after admission. We screened for SARS-CoV-2 using RT-qPCR, characterized microbial communities using 16S rRNA gene amplicon sequencing, and used these bacterial profiles to classify SARS-CoV-2 RNA detection with a random forest model. RESULTS: Sixteen percent of surfaces from COVID-19 patient rooms had detectable SARS-CoV-2 RNA, although infectivity was not assessed. The highest prevalence was in floor samples next to patient beds (39%) and directly outside their rooms (29%). Although bed rail samples more closely resembled the patient microbiome compared to floor samples, SARS-CoV-2 RNA was detected less often in bed rail samples (11%). SARS-CoV-2 positive samples had higher bacterial phylogenetic diversity in both human and surface samples and higher biomass in floor samples. 16S microbial community profiles enabled high classifier accuracy for SARS-CoV-2 status in not only nares, but also forehead, stool, and floor samples. Across these distinct microbial profiles, a single amplicon sequence variant from the genus Rothia strongly predicted SARS-CoV-2 presence across sample types, with greater prevalence in positive surface and human samples, even when compared to samples from patients in other intensive care units prior to the COVID-19 pandemic. 
CONCLUSIONS: These results contextualize the vast diversity of microbial niches where SARS-CoV-2 RNA is detected and identify specific bacterial taxa that associate with the viral RNA prevalence both in the host and hospital environment.

Subject(s)

COVID-19 , SARS-CoV-2 , Hospitals , Humans , Pandemics , Phylogeny , RNA, Ribosomal, 16S/genetics , RNA, Viral/genetics

7.

Challenges in benchmarking metagenomic profilers.

Sun, Zheng; Huang, Shi; Zhang, Meng; Zhu, Qiyun; Haiminen, Niina; Carrieri, Anna Paola; Vázquez-Baeza, Yoshiki; Parida, Laxmi; Kim, Ho-Cheol; Knight, Rob; Liu, Yang-Yu.

Nat Methods ; 18(6): 618-626, 2021 06.

Article in English | MEDLINE | ID: mdl-33986544

ABSTRACT

Accurate microbial identification and abundance estimation are crucial for metagenomics analysis. Various methods for classification of metagenomic data and estimation of taxonomic profiles, broadly referred to as metagenomic profilers, have been developed. Nevertheless, benchmarking of metagenomic profilers remains challenging because some tools are designed to report relative sequence abundance while others report relative taxonomic abundance. Here we show how misleading conclusions can be drawn by neglecting this distinction between relative abundance types when benchmarking metagenomic profilers. Moreover, we show compelling evidence that interchanging sequence abundance and taxonomic abundance will influence both per-sample summary statistics and cross-sample comparisons. We suggest that the microbiome research community pay attention to potentially misleading biological conclusions arising from this issue when benchmarking metagenomic profilers, by carefully considering the type of abundance data that were analyzed and interpreted and clearly stating the strategy used for metagenomic profiling.
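
The distinction between the two abundance types can be illustrated with a small conversion. Sequence abundance is the fraction of sequenced material assigned to a taxon, so taxa with larger genomes contribute more reads per cell; dividing by genome length and renormalizing gives an estimate of taxonomic abundance. This is a simplified sketch that assumes genome length is the only correction needed; the function name, taxon labels, and genome sizes are hypothetical.

```python
from typing import Dict


def sequence_to_taxonomic_abundance(
    seq_abund: Dict[str, float], genome_len: Dict[str, float]
) -> Dict[str, float]:
    """Convert relative sequence abundance to relative taxonomic abundance.

    Assumption: read yield per genome is proportional to genome length only.
    """
    # Reads per base pair of genome is proportional to the number of genomes.
    per_genome = {taxon: seq_abund[taxon] / genome_len[taxon] for taxon in seq_abund}
    total = sum(per_genome.values())
    return {taxon: value / total for taxon, value in per_genome.items()}


# Hypothetical profile: two taxa contribute equal shares of the reads,
# but one genome is four times longer than the other.
seq_profile = {"taxon_small": 0.5, "taxon_large": 0.5}
genome_sizes = {"taxon_small": 2_000_000, "taxon_large": 8_000_000}
tax_profile = sequence_to_taxonomic_abundance(seq_profile, genome_sizes)
print(tax_profile)
```

In this toy case an even split of reads corresponds to roughly a four-to-one split of genomes, so a profiler reporting one type would look inaccurate if scored against a ground truth expressed in the other.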

Subject(s)

Benchmarking/methods , Metagenomics , Computational Biology/methods , Gene Expression Profiling , Microbiota/genetics , Sequence Analysis, DNA/methods

8.

EMPress Enables Tree-Guided, Interactive, and Exploratory Analyses of Multi-omic Data Sets.

Cantrell, Kalen; Fedarko, Marcus W; Rahman, Gibraan; McDonald, Daniel; Yang, Yimeng; Zaw, Thant; Gonzalez, Antonio; Janssen, Stefan; Estaki, Mehrbod; Haiminen, Niina; Beck, Kristen L; Zhu, Qiyun; Sayyari, Erfan; Morton, James T; Armstrong, George; Tripathi, Anupriya; Gauglitz, Julia M; Marotz, Clarisse; Matteson, Nathaniel L; Martino, Cameron; Sanders, Jon G; Carrieri, Anna Paola; Song, Se Jin; Swafford, Austin D; Dorrestein, Pieter C; Andersen, Kristian G; Parida, Laxmi; Kim, Ho-Cheol; Vázquez-Baeza, Yoshiki; Knight, Rob.

mSystems ; 6(2), 2021 Mar 16.

Article in English | MEDLINE | ID: mdl-33727399

ABSTRACT

Standard workflows for analyzing microbiomes often include the creation and curation of phylogenetic trees. Here we present EMPress, an interactive web tool for visualizing trees in the context of microbiome, metabolome, and other community data scalable to trees with well over 500,000 nodes. EMPress provides novel functionality (including ordination integration and animations) alongside many standard tree visualization features and thus simplifies exploratory analyses of many forms of 'omic data. IMPORTANCE: Phylogenetic trees are integral data structures for the analysis of microbial communities. Recent work has also shown the utility of trees constructed from certain metabolomic data sets, further highlighting their importance in microbiome research. The ever-growing scale of modern microbiome surveys has led to numerous challenges in visualizing these data. In this paper we used five diverse data sets to showcase the versatility and scalability of EMPress, an interactive web visualization tool. EMPress addresses the growing need for exploratory analysis tools that can accommodate large, complex multi-omic data sets.

9.

Explainable AI reveals changes in skin microbiome composition linked to phenotypic differences.

Carrieri, Anna Paola; Haiminen, Niina; Maudsley-Barton, Sean; Gardiner, Laura-Jayne; Murphy, Barry; Mayes, Andrew E; Paterson, Sarah; Grimshaw, Sally; Winn, Martyn; Shand, Cameron; Hadjidoukas, Panagiotis; Rowe, Will P M; Hawkins, Stacy; MacGuire-Flanagan, Ashley; Tazzioli, Jane; Kenny, John G; Parida, Laxmi; Hoptroff, Michael; Pyzer-Knapp, Edward O.

Sci Rep ; 11(1): 4565, 2021 02 25.

Article in English | MEDLINE | ID: mdl-33633172

ABSTRACT

Alterations in the human microbiome have been observed in a variety of conditions such as asthma, gingivitis, dermatitis and cancer, and much remains to be learned about the links between the microbiome and human health. The fusion of artificial intelligence with rich microbiome datasets can offer an improved understanding of the microbiome's role in human health. To gain actionable insights it is essential to consider both the predictive power and the transparency of the models by providing explanations for the predictions. We combine the collection of leg skin microbiome samples from two healthy cohorts of women with the application of an explainable artificial intelligence (EAI) approach that provides accurate predictions of phenotypes with explanations. The explanations are expressed in terms of variations in the relative abundance of key microbes that drive the predictions. We predict skin hydration, subject's age, pre/post-menopausal status and smoking status from the leg skin microbiome. The changes in microbial composition linked to skin hydration can accelerate the development of personalized treatments for healthy skin, while those associated with age may offer insights into the skin aging process. The leg microbiome signatures associated with smoking and menopausal status are consistent with previous findings from oral/respiratory tract microbiomes and vaginal/gut microbiomes respectively. This suggests that easily accessible microbiome samples could be used to investigate health-related phenotypes, offering potential for non-invasive diagnosis and condition monitoring. Our EAI approach sets the stage for new work focused on understanding the complex relationships between microbial communities and phenotypes. Our approach can be applied to predict any condition from microbiome samples and has the potential to accelerate the development of microbiome-based personalized therapeutics and non-invasive diagnostics.

Subject(s)

Artificial Intelligence , Biodiversity , Microbiota , Phenotype , Skin/microbiology , Adult , Aged , Aging , Computational Biology/methods , Data Analysis , Deep Learning , Female , Humans , Male , Menopause , Metagenome , Metagenomics/methods , Middle Aged , Smokers , Young Adult

10.

Microbial context predicts SARS-CoV-2 prevalence in patients and the hospital built environment.

Marotz, Clarisse; Belda-Ferre, Pedro; Ali, Farhana; Das, Promi; Huang, Shi; Cantrell, Kalen; Jiang, Lingjing; Martino, Cameron; Diner, Rachel E; Rahman, Gibraan; McDonald, Daniel; Armstrong, George; Kodera, Sho; Donato, Sonya; Ecklu-Mensah, Gertrude; Gottel, Neil; Garcia, Mariana C Salas; Chiang, Leslie Y; Salido, Rodolfo A; Shaffer, Justin P; Bryant, MacKenzie; Sanders, Karenina; Humphrey, Greg; Ackermann, Gail; Haiminen, Niina; Beck, Kristen L; Kim, Ho-Cheol; Carrieri, Anna Paola; Parida, Laxmi; Vázquez-Baeza, Yoshiki; Torriani, Francesca J; Knight, Rob; Gilbert, Jack A; Sweeney, Daniel A; Allard, Sarah M.

medRxiv ; 2020 Nov 22.

Article in English | MEDLINE | ID: mdl-33236030

ABSTRACT

Synergistic effects of bacteria on viral stability and transmission are widely documented but remain unclear in the context of SARS-CoV-2. We collected 972 samples from hospitalized ICU patients with coronavirus disease 2019 (COVID-19), their health care providers, and hospital surfaces before, during, and after admission. We screened for SARS-CoV-2 using RT-qPCR, characterized microbial communities using 16S rRNA gene amplicon sequencing, and contextualized the massive microbial diversity in this dataset in a meta-analysis of over 20,000 samples. Sixteen percent of surfaces from COVID-19 patient rooms were positive, with the highest prevalence in floor samples next to patient beds (39%) and directly outside their rooms (29%). Although bed rail samples increasingly resembled the patient microbiome throughout their stay, SARS-CoV-2 was less frequently detected there (11%). Despite surface contamination in almost all patient rooms, no health care workers providing COVID-19 patient care contracted the disease. SARS-CoV-2 positive samples had higher bacterial phylogenetic diversity across human and surface samples, and higher biomass in floor samples. 16S microbial community profiles allowed for high classifier accuracy for SARS-CoV-2 status in not only nares, but also forehead, stool and floor samples. Across these distinct microbial profiles, a single amplicon sequence variant from the genus Rothia was highly predictive of SARS-CoV-2 across sample types, and had higher prevalence in positive surface and human samples, even when comparing to samples from patients in another intensive care unit prior to the COVID-19 pandemic. These results suggest that bacterial communities contribute to viral prevalence both in the host and hospital environment.

11.

Using human in vitro transcriptome analysis to build trustworthy machine learning models for prediction of animal drug toxicity.

Gardiner, Laura-Jayne; Carrieri, Anna Paola; Wilshaw, Jenny; Checkley, Stephen; Pyzer-Knapp, Edward O; Krishna, Ritesh.

Sci Rep ; 10(1): 9522, 2020 06 12.

Article in English | MEDLINE | ID: mdl-32533004

ABSTRACT

During the development of new drugs or compounds there is a requirement for preclinical trials, commonly involving animal tests, to ascertain the safety of the compound prior to human trials. Machine learning techniques could provide an in-silico alternative to animal models for assessing drug toxicity, thus reducing expensive and invasive animal testing during clinical trials, for drugs that are most likely to fail safety tests. Here we present a machine learning model to predict kidney dysfunction, as a proxy for drug induced renal toxicity, in rats. To achieve this, we use inexpensive transcriptomic profiles derived from human cell lines after chemical compound treatment to train our models combined with compound chemical structure information. Genomics data due to its sparse, high-dimensional and noisy nature presents significant challenges in building trustworthy and transparent machine learning models. Here we address these issues by judiciously building feature sets from heterogenous sources and coupling them with measures of model uncertainty achieved through Gaussian Process based Bayesian models. We combine the use of insight into the feature-wise contributions to our predictions with the use of predictive uncertainties recovered from the Gaussian Process to improve the transparency and trustworthiness of the model.

Subject(s)

Drug-Related Side Effects and Adverse Reactions/genetics , Gene Expression Profiling , Machine Learning , Models, Theoretical , Animals , Humans , Quality Control , Uncertainty

12.

Human Skin, Oral, and Gut Microbiomes Predict Chronological Age.

Huang, Shi; Haiminen, Niina; Carrieri, Anna-Paola; Hu, Rebecca; Jiang, Lingjing; Parida, Laxmi; Russell, Baylee; Allaband, Celeste; Zarrinpar, Amir; Vázquez-Baeza, Yoshiki; Belda-Ferre, Pedro; Zhou, Hongwei; Kim, Ho-Cheol; Swafford, Austin D; Knight, Rob; Xu, Zhenjiang Zech.

mSystems ; 5(1), 2020 Feb 11.

Article in English | MEDLINE | ID: mdl-32047061

ABSTRACT

Human gut microbiomes are known to change with age, yet the relative value of human microbiomes across the body as predictors of age, and prediction robustness across populations is unknown. In this study, we tested the ability of the oral, gut, and skin (hand and forehead) microbiomes to predict age in adults using random forest regression on data combined from multiple publicly available studies, evaluating the models in each cohort individually. Intriguingly, the skin microbiome provides the best prediction of age (mean ± standard deviation, 3.8 ± 0.45 years, versus 4.5 ± 0.14 years for the oral microbiome and 11.5 ± 0.12 years for the gut microbiome). This also agrees with forensic studies showing that the skin microbiome predicts postmortem interval better than microbiomes from other body sites. Age prediction models constructed from the hand microbiome generalized to the forehead and vice versa, across cohorts, and results from the gut microbiome generalized across multiple cohorts (United States, United Kingdom, and China). Interestingly, taxa enriched in young individuals (18 to 30 years) tend to be more abundant and more prevalent than taxa enriched in elderly individuals (>60 yrs), suggesting a model in which physiological aging occurs concomitantly with the loss of key taxa over a lifetime, enabling potential microbiome-targeted therapeutic strategies to prevent aging. IMPORTANCE: Considerable evidence suggests that the gut microbiome changes with age or even accelerates aging in adults. Whether the age-related changes in the gut microbiome are more or less prominent than those for other body sites and whether predictions can be made about a person's age from a microbiome sample remain unknown. We therefore combined several large studies from different countries to determine which body site's microbiome could most accurately predict age. We found that the skin was the best, on average yielding predictions within 4 years of chronological age. 
This study sets the stage for future research on the role of the microbiome in accelerating or decelerating the aging process and in the susceptibility for age-related diseases.

13.

Streaming histogram sketching for rapid microbiome analytics.

Rowe, Will Pm; Carrieri, Anna Paola; Alcon-Giner, Cristina; Caim, Shabhonam; Shaw, Alex; Sim, Kathleen; Kroll, J Simon; Hall, Lindsay J; Pyzer-Knapp, Edward O; Winn, Martyn D.

Microbiome ; 7(1): 40, 2019 03 16.

Article in English | MEDLINE | ID: mdl-30878035

ABSTRACT

BACKGROUND: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. RESULTS: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed 'histosketch' that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a 'real life' example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. CONCLUSIONS: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. 
We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).
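
As a point of reference for the quantities named in this abstract, the sketch below builds an exact k-mer spectrum for a sequence and compares two spectra with the weighted Jaccard similarity. It computes the similarity exactly rather than estimating it from a fixed-size sketch, so it illustrates what a histosketch approximates and is not the HULK implementation; the function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_spectrum(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Weighted Jaccard similarity: sum of minimum counts over sum of maximum counts."""
    kmers = set(a) | set(b)
    numerator = sum(min(a[x], b[x]) for x in kmers)
    denominator = sum(max(a[x], b[x]) for x in kmers)
    # Two empty spectra are treated as identical by convention here.
    return numerator / denominator if denominator else 1.0


# Hypothetical reads differing in their final base.
spectrum_1 = kmer_spectrum("ACGTACGT", k=4)
spectrum_2 = kmer_spectrum("ACGTACGA", k=4)
print(weighted_jaccard(spectrum_1, spectrum_1))
print(weighted_jaccard(spectrum_1, spectrum_2))
```

The exact spectrum grows with the number of distinct k-mers, whereas the abstract reports a sketch of fixed size (3000 bytes), which is what makes catalogue-scale similarity search practical.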

Subject(s)

Bacteria/classification , Gastrointestinal Microbiome , Metagenomics/methods , Anti-Bacterial Agents/therapeutic use , Bacterial Infections/drug therapy , Cohort Studies , Humans , Infant, Newborn , Infant, Premature , Machine Learning , Sequence Analysis, DNA , Software

14.

Accelerating molecular discovery through data and physical sciences: Applications to peptide-membrane interactions.

Cipcigan, Flaviu; Carrieri, Anna Paola; Pyzer-Knapp, Edward O; Krishna, Ritesh; Hsiao, Ya-Wen; Winn, Martyn; Ryadnov, Maxim G; Edge, Colin; Martyna, Glenn; Crain, Jason.

J Chem Phys ; 148(24): 241744, 2018 Jun 28.

Article in English | MEDLINE | ID: mdl-29960328

ABSTRACT

Simulation and data analysis have evolved into powerful methods for discovering and understanding molecular modes of action and designing new compounds to exploit these modes. The combination provides a strong impetus to create and exploit new tools and techniques at the interfaces between physics, biology, and data science as a pathway to new scientific insight and accelerated discovery. In this context, we explore the rational design of novel antimicrobial peptides (short protein sequences exhibiting broad activity against multiple species of bacteria). We show how datasets can be harvested to reveal features which inform new design concepts. We introduce new analysis and visualization tools: a graphical representation of the k-mer spectrum as a fundamental property encoded in antimicrobial peptide databases and a data-driven representation to illustrate membrane binding and permeation of helical peptides.

Subject(s)

Anti-Bacterial Agents/chemistry , Antimicrobial Cationic Peptides/chemistry , Data Mining , Databases, Protein , Membranes/chemistry , Natural Science Disciplines , Bacteria/metabolism , Drug Discovery , Membranes/metabolism

15.

Sampling ARG of multiple populations under complex configurations of subdivision and admixture.

Carrieri, Anna Paola; Utro, Filippo; Parida, Laxmi.

Bioinformatics ; 32(7): 1048-56, 2016 04 01.

Article in English | MEDLINE | ID: mdl-26644417

ABSTRACT

MOTIVATION: Simulating complex evolution scenarios of multiple populations is an important task for answering many basic questions relating to population genomics. Apart from the population samples, the underlying Ancestral Recombinations Graph (ARG) is an additional important means in hypothesis checking and reconstruction studies. Furthermore, complex simulations require a plethora of interdependent parameters, making even the scenario specification highly non-trivial. RESULTS: We present an algorithm, SimRA, that simulates a generic multiple-population evolution model with admixture. It is based on random graphs that dramatically improve on the time and space requirements of the classical algorithm for single populations. Using the underlying random graphs model, we also derive closed forms of expected values of the ARG characteristics, i.e., height of the graph, number of recombinations, number of mutations and population diversity, in terms of its defining parameters. This is crucial in aiding the user to specify meaningful parameters for the complex scenario simulations, not through trial-and-error based on raw compute power but through intelligent parameter estimation. To the best of our knowledge this is the first time closed form expressions have been computed for the ARG properties. We show that the expected values closely match the empirical values through simulations. Finally, we demonstrate that SimRA produces the ARG in compact forms without compromising any accuracy. We demonstrate the compactness and accuracy through extensive experiments. AVAILABILITY AND IMPLEMENTATION: SimRA (Simulation based on Random graph Algorithms) source, executable, user manual and sample input-output sets are available for downloading at: https://github.com/ComputationalGenomics/SimRA CONTACT: parida@us.ibm.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
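
Checking a closed-form expectation against simulation can be sketched in Python for a much simpler setting. The example below is not SimRA and does not use its formulas: it assumes the standard single-population coalescent without recombination, where the expected tree height for n sampled lineages is 2(1 - 1/n) in coalescent time units. All function names are hypothetical.

```python
import random

def expected_tmrca(n):
    """Closed-form expected tree height (coalescent time units) for n lineages."""
    return 2.0 * (1.0 - 1.0 / n)

def simulate_tmrca(n, rng):
    """Draw one tree height: with k lineages, the waiting time to the
    next merger is exponential with rate k*(k-1)/2."""
    height = 0.0
    for k in range(n, 1, -1):
        height += rng.expovariate(k * (k - 1) / 2.0)
    return height

def mean_simulated_tmrca(n, replicates=20000, seed=1):
    """Average the simulated tree height over many independent replicates."""
    rng = random.Random(seed)
    return sum(simulate_tmrca(n, rng) for _ in range(replicates)) / replicates

print(expected_tmrca(10), mean_simulated_tmrca(10))
```

Having the expectation in closed form lets a user pick parameters that yield a target graph height directly instead of repeating simulations until the empirical value looks right.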

Subject(s)

Algorithms , Genetics, Population , Phylogeny , Genome, Human , Humans , Pedigree , Population Groups , Recombination, Genetic

16.

Explaining evolution via constrained persistent perfect phylogeny.

Bonizzoni, Paola; Carrieri, Anna Paola; Della Vedova, Gianluca; Trucco, Gabriella.

BMC Genomics ; 15 Suppl 6: S10, 2014.

Article in English | MEDLINE | ID: mdl-25572381

ABSTRACT

BACKGROUND: The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is, no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite sites assumption is violated because of, e.g., back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is, the fact that a character is allowed to return to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred to as PPP), and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree. RESULTS: We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters. CONCLUSIONS: A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model allows efficient explanation of data that do not conform to the classical perfect phylogeny model.
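
The conflict graph mentioned in the abstract can be sketched in Python. This is not the authors' CPPP algorithm: it only builds the graph, and it assumes the usual definition for binary matrices with an all-zero ancestral state, in which two characters conflict when their columns jointly contain the pairs (0,1), (1,0) and (1,1). The function name and the example matrices are hypothetical.

```python
from itertools import combinations

def conflict_graph(matrix):
    """Return the set of conflicting character pairs (i, j), i < j.

    matrix is a list of rows (species), each a list of 0/1 character states.
    """
    n_chars = len(matrix[0]) if matrix else 0
    edges = set()
    for i, j in combinations(range(n_chars), 2):
        # Collect the state pairs that the two columns exhibit together.
        pairs = {(row[i], row[j]) for row in matrix}
        if {(0, 1), (1, 0), (1, 1)} <= pairs:
            edges.add((i, j))
    return edges

conflicting = [[1, 0], [0, 1], [1, 1]]
compatible = [[1, 0], [1, 1], [0, 0]]
print(conflict_graph(conflicting), conflict_graph(compatible))
```

An empty edge set corresponds to the case that the abstract reports as solvable in polynomial time.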

Subject(s)

Evolution, Molecular , Models, Genetic , Phylogeny , Algorithms
