Results 1 - 16 of 16
1.
Nat Commun ; 15(1): 1664, 2024 Feb 23.
Article in English | MEDLINE | ID: mdl-38395976

ABSTRACT

Stem cells exist in vitro in a spectrum of interconvertible pluripotent states. Analyzing hundreds of hiPSCs derived from different individuals, we show the proportions of these pluripotent states vary considerably across lines. We discover 13 gene network modules (GNMs) and 13 regulatory network modules (RNMs), which are highly correlated with each other, suggesting that the coordinated co-accessibility of regulatory elements in the RNMs likely underlies the coordinated expression of genes in the GNMs. Epigenetic analyses reveal that regulatory networks underlying self-renewal and pluripotency are more complex than previously realized. Genetic analyses identify thousands of regulatory variants that overlapped predicted transcription factor binding sites and are associated with chromatin accessibility in the hiPSCs. We show that the master regulator of pluripotency, the NANOG-OCT4 Complex, and its associated network are significantly enriched for regulatory variants with large effects, suggesting that they play a role in the varying cellular proportions of pluripotency states between hiPSCs. Our work bins tens of thousands of regulatory elements in hiPSCs into discrete regulatory networks, shows that pluripotency and self-renewal processes have a surprising level of regulatory complexity, and suggests that genetic factors may contribute to cell state transitions in human iPSC lines.


Subject(s)
Induced Pluripotent Stem Cells , Humans , Induced Pluripotent Stem Cells/metabolism , Gene Regulatory Networks , Chromatin/genetics , Cell Differentiation/genetics , Octamer Transcription Factor-3/genetics
2.
bioRxiv ; 2023 Sep 19.
Article in English | MEDLINE | ID: mdl-37292794

ABSTRACT

Stem cells exist in vitro in a spectrum of interconvertible pluripotent states. Analyzing hundreds of hiPSCs derived from different individuals, we show the proportions of these pluripotent states vary considerably across lines. We discovered 13 gene network modules (GNMs) and 13 regulatory network modules (RNMs), which were highly correlated with each other, suggesting that the coordinated co-accessibility of regulatory elements in the RNMs likely underlay the coordinated expression of genes in the GNMs. Epigenetic analyses revealed that regulatory networks underlying self-renewal and pluripotency have a surprising level of complexity. Genetic analyses identified thousands of regulatory variants that overlapped predicted transcription factor binding sites and were associated with chromatin accessibility in the hiPSCs. We show that the master regulator of pluripotency, the NANOG-OCT4 Complex, and its associated network were significantly enriched for regulatory variants with large effects, suggesting that they may play a role in the varying cellular proportions of pluripotency states between hiPSCs. Our work captures the coordinated activity of tens of thousands of regulatory elements in hiPSCs and bins these elements into discrete functionally characterized regulatory networks, shows that regulatory elements in pluripotency networks harbor variants with large effects, and provides a rich resource for future pluripotent stem cell research.

3.
Nat Commun ; 11(1): 2927, 2020 06 10.
Article in English | MEDLINE | ID: mdl-32522982

ABSTRACT

Structural variants (SVs) and short tandem repeats (STRs) comprise a broad group of diverse DNA variants which vastly differ in their sizes and distributions across the genome. Here, we identify genomic features of SV classes and STRs that are associated with gene expression and complex traits, including their locations relative to eGenes, likelihood of being associated with multiple eGenes, associated eGene types (e.g., coding, noncoding, level of evolutionary constraint), effect sizes, linkage disequilibrium with tagging single nucleotide variants used in GWAS, and likelihood of being associated with GWAS traits. We identify a set of high-impact SVs/STRs associated with the expression of three or more eGenes via chromatin loops and show that they are highly enriched for being associated with GWAS traits. Our study provides insights into the genomic properties of structural variant classes and short tandem repeats that are associated with gene expression and human traits.


Subject(s)
Microsatellite Repeats/genetics , Cell Line , Genetic Variation/genetics , Genome-Wide Association Study , Humans , Linkage Disequilibrium/genetics , Multifactorial Inheritance , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics
4.
Nat Commun ; 11(1): 2928, 2020 06 10.
Article in English | MEDLINE | ID: mdl-32522985

ABSTRACT

Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assemble a set of 719 deep whole genome sequencing (WGS) samples (mean 42×) from 477 distinct individuals which we use to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We use 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and develop a systematic filtering strategy to create one of the most complete and well-characterized maps of SVs and STRs to date.


Subject(s)
Microsatellite Repeats/genetics , Whole Genome Sequencing/methods , Algorithms , Computational Biology , Genotype , Haplotypes/genetics , High-Throughput Nucleotide Sequencing , Humans
5.
Stem Cell Reports ; 13(5): 924-938, 2019 11 12.
Article in English | MEDLINE | ID: mdl-31668852

ABSTRACT

Despite the importance of understanding how variability across induced pluripotent stem cell (iPSC) lines due to non-genetic factors (clone and passage) influences their differentiation outcome, large-scale studies capable of addressing this question have not yet been conducted. Here, we differentiated 191 iPSC lines to generate iPSC-derived cardiovascular progenitor cells (iPSC-CVPCs). We observed cellular heterogeneity across the iPSC-CVPC samples due to varying fractions of two cell types: cardiomyocytes (CMs) and epicardium-derived cells (EPDCs). Comparing the transcriptomes of CM-fated and EPDC-fated iPSCs, we discovered that 91 signature genes and X chromosome dosage differences are associated with these two distinct cardiac developmental trajectories. In an independent set of 39 iPSCs differentiated into CMs, we confirmed that sex and transcriptional differences affect cardiac-fate outcome. Our study provides novel insights into how iPSC transcriptional and X chromosome gene dosage differences influence their response to differentiation stimuli and, hence, cardiac cell fate.


Subject(s)
Chromosomes, Human, X/genetics , Induced Pluripotent Stem Cells/cytology , Myocytes, Cardiac/cytology , Pericardium/cytology , Transcriptome , Cell Differentiation , Cells, Cultured , Female , Humans , Induced Pluripotent Stem Cells/metabolism , Male , Myocytes, Cardiac/metabolism , Pericardium/metabolism , X Chromosome Inactivation
6.
Nat Genet ; 51(10): 1506-1517, 2019 10.
Article in English | MEDLINE | ID: mdl-31570892

ABSTRACT

The cardiac transcription factor (TF) gene NKX2-5 has been associated with electrocardiographic (EKG) traits through genome-wide association studies (GWASs), but the extent to which differential binding of NKX2-5 at common regulatory variants contributes to these traits has not yet been studied. We analyzed transcriptomic and epigenomic data from induced pluripotent stem cell-derived cardiomyocytes from seven related individuals, and identified ~2,000 single-nucleotide variants associated with allele-specific effects (ASE-SNVs) on NKX2-5 binding. NKX2-5 ASE-SNVs were enriched for altered TF motifs, for heart-specific expression quantitative trait loci and for EKG GWAS signals. Using fine-mapping combined with epigenomic data from induced pluripotent stem cell-derived cardiomyocytes, we prioritized candidate causal variants for EKG traits, many of which were NKX2-5 ASE-SNVs. Experimentally characterizing two NKX2-5 ASE-SNVs (rs3807989 and rs590041) showed that they modulate the expression of target genes via differential protein binding in cardiac cells, indicating that they are functional variants underlying EKG GWAS signals. Our results show that differential NKX2-5 binding at numerous regulatory variants across the genome contributes to EKG phenotypes.


Subject(s)
Atrial Fibrillation/genetics , Atrial Fibrillation/pathology , Homeobox Protein Nkx-2.5/genetics , Homeobox Protein Nkx-2.5/metabolism , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Regulatory Elements, Transcriptional , Adolescent , Adult , Aged , Aged, 80 and over , Alleles , Child , Electrocardiography , Epigenomics , Female , Genetic Predisposition to Disease , Genome, Human , Genome-Wide Association Study , Humans , Induced Pluripotent Stem Cells/metabolism , Induced Pluripotent Stem Cells/pathology , Male , Middle Aged , Myocytes, Cardiac/metabolism , Myocytes, Cardiac/pathology , Phenotype , Protein Binding , Transcriptome , Young Adult
7.
Nat Commun ; 10(1): 2078, 2019 05 07.
Article in English | MEDLINE | ID: mdl-31064983

ABSTRACT

Genetic variants affecting pancreatic islet enhancers are central to T2D risk, but the gene targets of islet enhancer activity are largely unknown. We generate a high-resolution map of islet chromatin loops using Hi-C assays in three islet samples and use loops to annotate target genes of islet enhancers defined using ATAC-seq and published ChIP-seq data. We identify candidate target genes for thousands of islet enhancers, and find that enhancer looping is correlated with islet-specific gene expression. We fine-map T2D risk variants affecting islet enhancers, and find that candidate target genes of these variants defined using chromatin looping and eQTL mapping are enriched in protein transport and secretion pathways. At IGF2BP2, a fine-mapped T2D variant reduces islet enhancer activity and IGF2BP2 expression, and conditional inactivation of IGF2BP2 in mouse islets impairs glucose-stimulated insulin secretion. Our findings provide a resource for studying islet enhancer function and identifying genes involved in T2D risk.


Subject(s)
Chromatin/metabolism , Diabetes Mellitus, Type 2/genetics , Gene Regulatory Networks/genetics , Islets of Langerhans/metabolism , RNA-Binding Proteins/genetics , Adult , Animals , Cell Nucleus/metabolism , Chromatin Assembly and Disassembly/genetics , Diabetes Mellitus, Type 2/pathology , Enhancer Elements, Genetic/genetics , Female , Gene Expression Profiling , Genetic Predisposition to Disease , Glucose/metabolism , Humans , Insulin/metabolism , Islets of Langerhans/cytology , Male , Mice , Mice, Inbred C57BL , Mice, Knockout , Middle Aged , Molecular Conformation , Quantitative Trait Loci/genetics , RNA-Binding Proteins/metabolism
8.
Stem Cell Reports ; 12(6): 1342-1353, 2019 06 11.
Article in English | MEDLINE | ID: mdl-31080113

ABSTRACT

We evaluate whether human induced pluripotent stem cell-derived retinal pigment epithelium (iPSC-RPE) cells can be used to prioritize and functionally characterize causal variants at age-related macular degeneration (AMD) risk loci. We generated iPSC-RPE from six subjects and show that they have morphological and molecular characteristics similar to those of native RPE. We generated RNA-seq, ATAC-seq, and H3K27ac ChIP-seq data and observed high similarity in gene expression and enriched transcription factor motif profiles between iPSC-RPE and human fetal RPE. We performed fine mapping of AMD risk loci by integrating molecular data from the iPSC-RPE, adult retina, and adult RPE, which identified rs943080 as the probable causal variant at VEGFA. We show that rs943080 is associated with altered chromatin accessibility of a distal ATAC-seq peak, decreased overall gene expression of VEGFA, and allele-specific expression of a non-coding transcript. Our study thus provides a potential mechanism underlying the association of the VEGFA locus with AMD.


Subject(s)
Genetic Loci , Induced Pluripotent Stem Cells/metabolism , Macular Degeneration , Retinal Pigment Epithelium/metabolism , Vascular Endothelial Growth Factor A , Female , Humans , Induced Pluripotent Stem Cells/pathology , Macular Degeneration/genetics , Macular Degeneration/metabolism , Macular Degeneration/pathology , Retinal Pigment Epithelium/pathology , Sequence Analysis, RNA , Vascular Endothelial Growth Factor A/biosynthesis , Vascular Endothelial Growth Factor A/genetics
9.
Nat Commun ; 10(1): 1054, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30837461

ABSTRACT

While genetic variation at chromatin loops is relevant for human disease, the relationships between contact propensity (the probability that loci at loops physically interact), genetics, and gene regulation are unclear. We quantitatively interrogate these relationships by comparing Hi-C and molecular phenotype data across cell types and haplotypes. While chromatin loops consistently form across different cell types, they have subtle quantitative differences in contact frequency that are associated with larger changes in gene expression and H3K27ac. For the vast majority of loci with quantitative differences in contact frequency across haplotypes, the changes in magnitude are smaller than those across cell types; however, the proportional relationships between contact propensity, gene expression, and H3K27ac are consistent. These findings suggest that subtle changes in contact propensity have a biologically meaningful role in gene regulation and could be a mechanism by which regulatory genetic variants in loop anchors mediate effects on expression.


Subject(s)
Chromatin/genetics , DNA/genetics , Gene Expression Regulation , Histones/genetics , Quantitative Trait Loci/genetics , Adolescent , Adult , Aged , Cell Line , Chromatin/metabolism , DNA/metabolism , Female , Histones/metabolism , Humans , Induced Pluripotent Stem Cells , Male , Middle Aged , Myocytes, Cardiac , Nucleic Acid Conformation , Polymorphism, Single Nucleotide , Whole Genome Sequencing , Young Adult
10.
Circ Genom Precis Med ; 11(12): e002170, 2018 12.
Article in English | MEDLINE | ID: mdl-30562114

ABSTRACT

BACKGROUND: Identifying genetic variation associated with plasma protein levels, and the mechanisms by which they act, could provide insight into alterable processes involved in regulation of protein levels. Although protein levels can be affected by genetic variants, their estimation can also be biased by missense variants in coding exons causing technical artifacts. Integrating genome sequence genotype data with mass spectrometry-based protein level estimation could reduce bias, thereby improving detection of variation that affects RNA or protein metabolism. METHODS: Here, we integrate the blood plasma protein levels of 664 proteins from 165 participants of the Tromsø Study, measured via tandem mass tag mass spectrometry, with whole-exome sequencing data to identify common and rare genetic variation associated with peptide and protein levels (protein quantitative trait loci [pQTLs]). We additionally use literature and database searches to prioritize putative functional variants for each pQTL. RESULTS: We identify 109 independent associations (36 protein and 73 peptide) and use genotype data to exclude 49 (4 protein and 45 peptide) as technical artifacts. We describe 2 particular cases of rare variation: 1 associated with the complement pathway and 1 with platelet degranulation. We identify putative functional variants and show that pQTLs act through diverse molecular mechanisms that affect both RNA and protein metabolism. CONCLUSIONS: We show that although the majority of pQTLs exert their effects by modulating RNA metabolism, many affect protein levels directly. Our work demonstrates the extent to which pQTL studies are affected by technical artifacts and highlights how prioritizing the functional variant in pQTL studies can lead to insights into the molecular steps by which a protein may be regulated.


Subject(s)
Blood Proteins/analysis , Blood Proteins/genetics , Genetic Variation , Cohort Studies , Exons , Female , Genotype , Humans , Male , Mass Spectrometry , Proteome/genetics , Quantitative Trait Loci , Exome Sequencing
11.
Cell Rep ; 24(4): 883-894, 2018 07 24.
Article in English | MEDLINE | ID: mdl-30044985

ABSTRACT

To understand the mutational burden of human induced pluripotent stem cells (iPSCs), we sequenced genomes of 18 fibroblast-derived iPSC lines and identified different classes of somatic mutations based on structure, origin, and frequency. Copy-number alterations affected 295 kb in each sample and strongly impacted gene expression. UV-damage mutations were present in ∼45% of the iPSCs and accounted for most of the observed heterogeneity in mutation rates across lines. Subclonal mutations (not present in all iPSCs within a line) composed 10% of point mutations and, compared with clonal variants, showed an enrichment in active promoters and increased association with altered gene expression. Our study shows that, by combining WGS, transcriptome, and epigenome data, we can understand the mutational burden of each iPSC line on an individual basis and suggests that this information could be used to prioritize iPSC lines for models of specific human diseases and/or transplantation therapy.


Subject(s)
Induced Pluripotent Stem Cells/cytology , Induced Pluripotent Stem Cells/physiology , Cell Differentiation/physiology , Cells, Cultured , Cellular Reprogramming/genetics , Humans , Mutation , Mutation Rate
12.
BMC Syst Biol ; 12(1): 25, 2018 03 02.
Article in English | MEDLINE | ID: mdl-29499714

ABSTRACT

BACKGROUND: The efficacy of antibiotics against M. tuberculosis has been shown to be influenced by experimental media conditions. Investigations of M. tuberculosis growth in physiological conditions have described an environment that is different from common in vitro media. Thus, elucidating the interplay between available nutrient sources and antibiotic efficacy has clear medical relevance. While genome-scale reconstructions of M. tuberculosis have enabled the ability to interrogate media differences for the past 10 years, recent reconstructions have diverged from each other without standardization. A unified reconstruction of M. tuberculosis H37Rv would elucidate the impact of different nutrient conditions on antibiotic efficacy and provide new insights for therapeutic intervention. RESULTS: We present a new genome-scale model of M. tuberculosis H37Rv, named iEK1011, that unifies and updates previous M. tuberculosis H37Rv genome-scale reconstructions. We functionally assess iEK1011 against previous models and show that the model increases correct gene essentiality predictions on two different experimental datasets by 6% (53% to 60%) and 18% (60% to 71%), respectively. We compared simulations between in vitro and approximated in vivo media conditions to examine the predictive capabilities of iEK1011. The simulated differences recapitulated literature defined characteristics in the rewiring of TCA metabolism including succinate secretion, gluconeogenesis, and activation of both the glyoxylate shunt and the methylcitrate cycle. To assist efforts to elucidate mechanisms of antibiotic resistance development, we curated 16 metabolic genes related to antimicrobial resistance and approximated evolutionary drivers of resistance. Comparing simulations of these antibiotic resistance features between in vivo and in vitro media highlighted condition-dependent differences that may influence the efficacy of antibiotics. 
CONCLUSIONS: iEK1011 provides a computational knowledge base for exploring the impact of different environmental conditions on the metabolic state of M. tuberculosis H37Rv. As more experimental data and knowledge of M. tuberculosis H37Rv become available, a unified and standardized M. tuberculosis model will prove to be a valuable resource to the research community studying the systems biology of M. tuberculosis.


Subject(s)
Genomics/standards , Models, Genetic , Mycobacterium tuberculosis/genetics , Drug Resistance, Bacterial/genetics , Mycobacterium tuberculosis/drug effects , Mycobacterium tuberculosis/physiology , Reference Standards
13.
Genetics ; 207(4): 1301-1312, 2017 12.
Article in English | MEDLINE | ID: mdl-29074555

ABSTRACT

Expression quantitative trait loci (eQTL) studies have typically used single-variant association analysis to identify genetic variants correlated with gene expression. However, this approach has several drawbacks: causal variants cannot be distinguished from nonfunctional variants in strong linkage disequilibrium, combined effects from multiple causal variants cannot be captured, and low-frequency (<5% MAF) eQTL variants are difficult to identify. While these issues possibly could be overcome by using sparse polygenic models, which associate multiple genetic variants with gene expression simultaneously, the predictive performance of these models for eQTL studies has not been evaluated. Here, we assessed the ability of three sparse polygenic models (Lasso, Elastic Net, and BSLMM) to identify causal variants, and compared their efficacy to single-variant association analysis and a fine-mapping model. Using simulated data, we determined that, while these methods performed similarly when there was one causal SNP present at a gene, BSLMM substantially outperformed single-variant association analysis for prioritizing causal eQTL variants when multiple causal eQTL variants were present (1.6- to 5.2-fold higher recall at 20% precision), and identified up to 2.3-fold more low frequency variants as the top eQTL SNP. Analysis of real RNA-seq and whole-genome sequencing data of 131 iPSC samples showed that the eQTL SNPs identified by BSLMM had a higher functional enrichment in DHS sites and were more often low-frequency than those identified with single-variant association analysis. Our study showed that BSLMM is a more effective approach than single-variant association analysis for prioritizing multiple causal eQTL variants at a single gene.


Subject(s)
Genetic Predisposition to Disease , Genome-Wide Association Study/statistics & numerical data , Multifactorial Inheritance/genetics , Quantitative Trait Loci/genetics , Gene Expression/genetics , Genetic Variation , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide/genetics
14.
BMC Bioinformatics ; 18(1): 207, 2017 Apr 07.
Article in English | MEDLINE | ID: mdl-28388874

ABSTRACT

BACKGROUND: Genomic interaction studies use next-generation sequencing (NGS) to examine the interactions between two loci on the genome, with subsequent bioinformatics analyses typically including annotation, intersection, and merging of data from multiple experiments. While many file types and analysis tools exist for storing and manipulating single-locus NGS data, there is currently no file standard or analysis tool suite for manipulating and storing paired-genomic-loci: the data type resulting from "genomic interaction" studies. As genomic interaction sequencing data are becoming prevalent, a standard file format and tools for working with these data conveniently and efficiently are needed. RESULTS: This article details a file standard and novel software tool suite for working with paired-genomic-loci data. We present the paired-genomic-loci (PGL) file standard for genomic-interactions data, and the accompanying analysis tool suite "pgltools": a cross-platform, PyPy-compatible Python package available both as an easy-to-use UNIX package and as a Python module, for integration into pipelines of paired-genomic-loci analyses. CONCLUSIONS: Pgltools is a freely available, open source tool suite for manipulating paired-genomic-loci data. Source code, an in-depth manual, and a tutorial are available publicly at www.github.com/billgreenwald/pgltools, and a Python module of the operations can be installed from PyPI via the PyGLtools module.


Subject(s)
Chromatin/metabolism , Genomics/methods , Software , Chromatin/genetics , Chromatin Immunoprecipitation , Genetic Loci , High-Throughput Nucleotide Sequencing
15.
Stem Cell Reports ; 8(4): 1086-1100, 2017 04 11.
Article in English | MEDLINE | ID: mdl-28410642

ABSTRACT

Large-scale collections of induced pluripotent stem cells (iPSCs) could serve as powerful model systems for examining how genetic variation affects biology and disease. Here we describe the iPSCORE resource: a collection of systematically derived and characterized iPSC lines from 222 ethnically diverse individuals that allows for both familial and association-based genetic studies. iPSCORE lines are pluripotent with high genomic integrity (no or low numbers of somatic copy-number variants) as determined using high-throughput RNA-sequencing and genotyping arrays, respectively. Using iPSCs from a family of individuals, we show that iPSC-derived cardiomyocytes demonstrate gene expression patterns that cluster by genetic background, and can be used to examine variants associated with physiological and disease phenotypes. The iPSCORE collection contains representative individuals for risk and non-risk alleles for 95% of SNPs associated with human phenotypes through genome-wide association studies. Our study demonstrates the utility of iPSCORE for examining how genetic variants influence molecular and physiological traits in iPSCs and derived cell lines.


Subject(s)
Arrhythmias, Cardiac/genetics , Databases, Factual , Genetic Association Studies , Genetic Variation , Induced Pluripotent Stem Cells/metabolism , Myocytes, Cardiac/metabolism , Arrhythmias, Cardiac/ethnology , Arrhythmias, Cardiac/metabolism , Arrhythmias, Cardiac/physiopathology , Cell Differentiation , Cell Line , Cellular Reprogramming/genetics , Genotype , High-Throughput Nucleotide Sequencing , Humans , Induced Pluripotent Stem Cells/cytology , Multigene Family , Myocytes, Cardiac/cytology , Oligonucleotide Array Sequence Analysis , Phenotype , Polymorphism, Single Nucleotide , Racial Groups
16.
BMC Genomics ; 18(1): 296, 2017 04 13.
Article in English | MEDLINE | ID: mdl-28407798

ABSTRACT

BACKGROUND: Metagenomics is the study of the microbial genomes isolated from communities found on our bodies or in our environment. By correctly determining the relation between human health and the human associated microbial communities, novel mechanisms of health and disease can be found, thus enabling the development of novel diagnostics and therapeutics. Due to the diversity of the microbial communities, strategies developed for aligning human genomes cannot be utilized, and genomes of the microbial species in the community must be assembled de novo. However, in order to obtain the best metagenomic assemblies, it is important to choose the proper assembler. Due to the rapidly evolving nature of metagenomics, new assemblers are constantly created, and the field has not yet agreed on a standardized process. Furthermore, the truth sets used to compare these methods are either too simple (computationally derived diverse communities) or too complex (microbial communities of unknown composition), yielding results that are hard to interpret. In this analysis, we interrogate the strengths and weaknesses of five popular assemblers through the use of defined biological samples of known genomic composition and abundance. We assessed the performance of each assembler on its ability to reassemble genomes, call taxonomic abundances, and recreate open reading frames (ORFs). RESULTS: We tested five metagenomic assemblers: Omega, metaSPAdes, IDBA-UD, metaVelvet and MEGAHIT on known and synthetic metagenomic data sets. MetaSPAdes excelled in diverse sets, IDBA-UD performed well all around, metaVelvet had high accuracy in high abundance organisms, and MEGAHIT was able to accurately differentiate similar organisms within a community. At the ORF level, metaSPAdes and MEGAHIT had the fewest missing ORFs within diverse and similar communities respectively. CONCLUSIONS: Depending on the metagenomics question asked, the correct assembler for the task at hand will differ. It is important to choose the appropriate assembler, and thus clearly define the biological problem of an experiment, as different assemblers will give different answers to the same question.


Subject(s)
Chromosome Mapping/methods , Computational Biology/methods , Metagenomics/methods , Data Accuracy , Genome, Bacterial , Humans , Open Reading Frames , Software