ABSTRACT
The importance of sampling from globally representative populations has been well established in human genomics. In human microbiome research, however, we lack a full understanding of the global distribution of sampling in research studies. This information is crucial to better understand global patterns of microbiome-associated diseases and to extend the health benefits of this research to all populations. Here, we analyze the country of origin of all 444,829 human microbiome samples that are available from the world's 3 largest genomic data repositories, including the Sequence Read Archive (SRA). The samples are from 2,592 studies of 19 body sites, including 220,017 samples of the gut microbiome. We show that more than 71% of samples with a known origin come from Europe, the United States, and Canada, including 46.8% from the US alone, despite the country representing only 4.3% of the global population. We also find that central and southern Asia is the most underrepresented region: Countries such as India, Pakistan, and Bangladesh account for more than a quarter of the world population but make up only 1.8% of human microbiome samples. These results demonstrate a critical need to ensure more global representation of participants in microbiome studies.
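The imbalance described above can be restated as a ratio of sample share to population share. The following is only a minimal sketch of that arithmetic; the function name is hypothetical, and because the abstract says "more than a quarter" for central and southern Asia, 25% is used as an assumed lower bound on the population share.

```python
def representation_ratio(sample_share_pct: float, population_share_pct: float) -> float:
    """Ratio of a region's share of samples to its share of world population.

    A value above 1 means over-representation, below 1 under-representation.
    """
    if population_share_pct <= 0:
        raise ValueError("population share must be positive")
    return sample_share_pct / population_share_pct


# Figures quoted in the abstract: 46.8% of samples vs 4.3% of world population.
us_ratio = representation_ratio(46.8, 4.3)

# Assumption: "more than a quarter" is treated as at least 25%, so this
# ratio is an upper bound for central and southern Asia.
south_asia_ratio_upper = representation_ratio(1.8, 25.0)

print(f"United States: {us_ratio:.1f}x its population share")
print(f"Central and southern Asia: at most {south_asia_ratio_upper:.3f}x")
```

Read this way, the United States contributes roughly eleven times as many samples as its population share would suggest, while central and southern Asia contributes well under a tenth of its share.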
Subject(s)
Gastrointestinal Microbiome/genetics , Genomics/methods , Metagenome/genetics , Metagenomics/methods , Microbiota/genetics , Asia , Bangladesh , Canada , Developed Countries , Europe , Genomics/statistics & numerical data , Geography , Humans , India , Metagenomics/statistics & numerical data , Pakistan , United States
ABSTRACT
The advent of high-throughput metagenomic sequencing has prompted the development of efficient taxonomic profiling methods that measure the presence, abundance and phylogeny of organisms in a wide range of environmental samples. Multivariate sequence-derived abundance data also have the potential to enable inference of ecological associations between microbial populations, but several technical issues need to be accounted for, such as the compositional nature of the data, their extreme sparsity and overdispersion, and the frequent need to operate in under-determined regimes. The ecological network reconstruction problem is frequently cast into the paradigm of Gaussian Graphical Models (GGMs), for which efficient structure inference algorithms are available, such as the graphical lasso and neighborhood selection. Unfortunately, GGMs and their variants cannot properly account for the extremely sparse patterns occurring in real-world metagenomic taxonomic profiles. In particular, structural zeros (as opposed to sampling zeros), which correspond to true absences of biological signal, are not properly handled by most statistical methods. We present here a zero-inflated log-normal graphical model (available at https://github.com/vincentprost/Zi-LN) specifically aimed at handling such "biological" zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of microbial association networks, with the most notable gains obtained when analyzing taxonomic profiles displaying sparsity levels on par with real-world metagenomic datasets.
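To make the notion of a zero-inflated log-normal distribution concrete, the sketch below simulates a single taxon whose abundance is a structural zero with some probability and log-normally distributed otherwise. It is not the authors' Zi-LN implementation and does not model associations between taxa; the function name and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def sample_zi_lognormal(n, pi_zero, mu, sigma, rng):
    """Draw n abundances from a zero-inflated log-normal distribution.

    With probability pi_zero the value is a structural zero; otherwise it
    is exp(Normal(mu, sigma)), i.e. a log-normal abundance.
    """
    values = []
    for _ in range(n):
        if rng.random() < pi_zero:
            values.append(0.0)
        else:
            values.append(math.exp(rng.gauss(mu, sigma)))
    return values


# Hypothetical parameters: 80% structural zeros, log-abundance ~ N(2, 1).
rng = random.Random(0)
abundances = sample_zi_lognormal(n=1000, pi_zero=0.8, mu=2.0, sigma=1.0, rng=rng)

zero_fraction = sum(v == 0.0 for v in abundances) / len(abundances)
nonzero_logs = [math.log(v) for v in abundances if v > 0.0]
mean_log = sum(nonzero_logs) / len(nonzero_logs)

print(f"fraction of zeros: {zero_fraction:.2f}")
print(f"mean log-abundance of non-zero values: {mean_log:.2f}")
```

A plain Gaussian fitted to such data would have to explain the spike at zero and the log-normal bulk with one mean and one variance, which is the mismatch the abstract attributes to standard GGMs.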
Subject(s)
Microbiota , Models, Biological , Algorithms , Computational Biology , Computer Simulation , Metagenome , Metagenomics/statistics & numerical data , Microbial Consortia/genetics , Microbial Consortia/physiology , Microbiota/genetics , Microbiota/physiology , Multivariate Analysis , Normal Distribution , Synthetic Biology
ABSTRACT
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
Subject(s)
Metagenome , Metagenomics/methods , Microbiota/genetics , Algorithms , Computational Biology , Databases, Genetic/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Metagenomics/trends , Software
ABSTRACT
Federation is a popular concept in building distributed cyberinfrastructures, whereby computational resources are provided by multiple organizations through a unified portal, decreasing the complexity of moving data back and forth among multiple organizations. Federation has been used in bioinformatics only to a limited extent, namely, federation of datastores, e.g. SBGrid Consortium for structural biology and Gene Expression Omnibus (GEO) for functional genomics. Here, we posit that it is important to federate both computational resources (CPU, GPU, FPGA, etc.) and datastores to support popular bioinformatics portals, which face fast-increasing data volumes and processing requirements. A prime example, and one that we discuss here, is in genomics and metagenomics. It is critical that the data be processed without having to be transported across large network distances. We exemplify our design and development through our experience with metagenomics-RAST (MG-RAST), the most popular metagenomics analysis pipeline. Currently, it is hosted completely at Argonne National Laboratory. However, through a recently started collaborative National Institutes of Health project, we are taking steps toward federating this infrastructure. Because MG-RAST is a widely used resource, we have to move toward federation without disrupting its 50,000 annual users. In this article, we describe the computational tools that will be useful for federating a bioinformatics infrastructure and the open research challenges that we see in federating such infrastructures. We hope that our manuscript can spur greater federation of bioinformatics infrastructures by showing the steps involved and thus allow them to scale to support larger user bases.
Subject(s)
Genomics/statistics & numerical data , Information Dissemination/methods , Big Data , Computational Biology/methods , Confidentiality , Databases, Genetic/statistics & numerical data , Genetic Privacy , Humans , Metagenomics/statistics & numerical data , Software , United States
ABSTRACT
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.
Subject(s)
Databases, Genetic , Metagenomics/methods , Algorithms , Computational Biology/methods , Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Genetic Markers , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenome , Metagenomics/statistics & numerical data , Microbiota/genetics , Phylogeny , Sequence Alignment/statistics & numerical data
ABSTRACT
As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners; they will enable double barcoding and stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community's data analysis tasks.
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Metagenome , Metagenomics/methods , Software , Algorithms , Budgets , Computational Biology/methods , High-Throughput Nucleotide Sequencing/economics , High-Throughput Nucleotide Sequencing/statistics & numerical data , Internet , Metagenomics/economics , Metagenomics/statistics & numerical data , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , User-Computer Interface , Workflow
ABSTRACT
OBJECTIVE: To determine the presence and identity of extracellular bacteriophage (phage) families, genera and species in the vagina of pregnant women. DESIGN: Descriptive, observational cohort study. SETTING: São Paulo, Brazil. POPULATION: Pregnant women at 21-24 weeks' gestation. METHODS: Vaginal samples from 107 women whose vaginal microbiome and pregnancy outcomes were previously determined were analysed for phages by metagenomic sequencing. MAIN OUTCOME MEASURES: Identification of phage families, genera and species. RESULTS: Phages were detected in 96 (89.7%) of the samples. Six different phage families were identified: Siphoviridae in 69.2%, Myoviridae in 49.5%, Microviridae in 37.4%, Podoviridae in 20.6%, Herelleviridae in 10.3% and Inoviridae in 1.9% of the women. Four different phage families were present in 14 women (13.1%), three families in 20 women (18.7%), two families in 31 women (29.1%) and one family in 31 women (29.1%). The most common phage species detected were Bacillus phages in 48 (43.6%), Escherichia phages in 45 (40.9%), Staphylococcus phages in 40 (36.4%), Gokushovirus in 33 (30.0%) and Lactobacillus phages in 29 (26.4%) women. In a preliminary exploratory analysis, neither a particular phage family, the number of phage families present in the vagina, nor any particular phage species was associated with gestational age at delivery or with the bacterial community state type present in the vagina. CONCLUSIONS: Multiple phages are present in the vagina of most mid-trimester pregnant women. TWEETABLE ABSTRACT: Bacteriophages are present in the vagina of most pregnant women.
Subject(s)
Bacteriophages , Microbiota/physiology , Vagina/microbiology , Adult , Bacteriophages/classification , Bacteriophages/genetics , Bacteriophages/isolation & purification , Brazil , Female , Gestational Age , Humans , Metagenome , Metagenomics/methods , Metagenomics/statistics & numerical data , Pregnancy , Pregnancy Outcome/epidemiology
ABSTRACT
The taxonomic composition of microbial communities can be assessed using universal marker amplicon sequencing. The most common taxonomic markers are the 16S rDNA for bacterial communities and the internal transcribed spacer (ITS) region for fungal communities, but various other markers are used for barcoding eukaryotes. A crucial step in the bioinformatic analysis of amplicon sequences is the identification of representative sequences. This can be achieved using a clustering approach or by denoising raw sequencing reads. DADA2 is a widely adopted algorithm, released as an R library, that denoises marker-specific amplicons from next-generation sequencing and produces a set of representative sequences referred to as 'Amplicon Sequence Variants' (ASV). Here, we present Dadaist2, a modular pipeline, providing a complete suite for the analysis that ranges from raw sequencing reads to the statistics of numerical ecology. Dadaist2 implements a new approach that is specifically optimised for amplicons with variable lengths, such as the fungal ITS. The pipeline focuses on streamlining the data flow from the command line to R, with multiple options for statistical analysis and plotting, both interactive and automatic.
Subject(s)
DNA Barcoding, Taxonomic/statistics & numerical data , Metagenomics/statistics & numerical data , Microbiota/genetics , Software , Algorithms , Cluster Analysis , Computational Biology/methods , Data Interpretation, Statistical , High-Throughput Nucleotide Sequencing , Metadata , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA
ABSTRACT
Horizontal gene transfer (HGT) has changed the way we regard evolution. Instead of waiting for the next generation to establish new traits, organisms, bacteria in particular, can take a shortcut via HGT, which enables them to pass on genes from one individual to another, even across species boundaries. The tool Daisy offers the first HGT detection approach based on read mapping, providing evidence complementary to that of existing methods. However, Daisy relies on the acceptor and donor organisms involved in the HGT being known. We introduce DaisyGPS, a mapping-based pipeline that is able to identify acceptor and donor reference candidates of an HGT event based on sequencing reads. Acceptor and donor identification is akin to species identification in metagenomic samples based on sequencing reads, a problem addressed by metagenomic profiling tools. However, acceptor and donor references have certain properties that prevent these methods from being applied directly. DaisyGPS uses MicrobeGPS, a metagenomic profiling tool tailored towards estimating the genomic distance between organisms in the sample and the reference database. We enhance the underlying scoring system of MicrobeGPS to account for the mapping coverage patterns of an acceptor and donor involved in an HGT event, and report a ranked list of reference candidates. These candidates can then be further evaluated by tools like Daisy to establish HGT regions. We successfully validated our approach on both simulated and real data, and show its benefits in an investigation of outbreak data involving methicillin-resistant Staphylococcus aureus.
Subject(s)
Evolution, Molecular , Gene Transfer, Horizontal , Metagenome , Metagenomics/methods , Models, Genetic , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Disease Outbreaks/statistics & numerical data , Genetic Variation , Genome, Bacterial , Helicobacter pylori/genetics , Humans , Metagenomics/statistics & numerical data , Methicillin-Resistant Staphylococcus aureus/genetics , Mutation , Staphylococcal Infections/epidemiology , Staphylococcal Infections/microbiology
ABSTRACT
Large studies profiling microbial communities and their association with healthy or disease phenotypes are now commonplace. Processed data from many of these studies are publicly available but significant effort is required for users to effectively organize, explore and integrate it, limiting the utility of these rich data resources. Effective integrative and interactive visual and statistical tools to analyze many metagenomic samples can greatly increase the value of these data for researchers. We present Metaviz, a tool for interactive exploratory data analysis of annotated microbiome taxonomic community profiles derived from marker gene or whole metagenome shotgun sequencing. Metaviz is uniquely designed to address the challenge of browsing the hierarchical structure of metagenomic data features while rendering visualizations of data values that are dynamically updated in response to user navigation. We use Metaviz to provide the UMD Metagenome Browser web service, allowing users to browse and explore data for more than 7000 microbiomes from published studies. Users can also deploy Metaviz as a web service, or use it to analyze data through the metavizr package to interoperate with state-of-the-art analysis tools available through Bioconductor. Metaviz is free and open source with the code, documentation and tutorials publicly accessible.
Subject(s)
Computational Biology/methods , Metagenome/genetics , Metagenomics/methods , Whole Genome Sequencing/methods , Bacteria/classification , Bacteria/genetics , Child , Computational Biology/statistics & numerical data , Diarrhea/diagnosis , Diarrhea/genetics , Humans , Internet , Metagenomics/statistics & numerical data , Reproducibility of Results , Web Browser , Whole Genome Sequencing/statistics & numerical data
ABSTRACT
The reduction of the price of DNA sequencing has resulted in the emergence of large data sets to handle and analyze, especially in microbial ecosystems, which are characterized by high taxonomic and functional diversities. To assess the properties of these complex ecosystems, a conceptual background of the application of NGS technology and bioinformatics analysis to metagenomics is required. Accordingly, this article presents an overview of the evolution of knowledge of microbial ecology from traditional culture-dependent methods to culture-independent methods and the last frontier in knowledge, metagenomics. Topics that will be covered include sample preparation for NGS, starting with total DNA extraction and library preparation, followed by a brief discussion of the chemistry of NGS to help provide an understanding of which bioinformatics pipeline approach may be helpful for achieving a researcher's goals. The importance of selecting appropriate sequencing coverage and depth parameters to obtain a suitable measure of microbial diversity is discussed. As all DNA sequencing processes produce base-calling errors that compromise data analysis, including genome assembly and microbial functional analysis, dedicated software is presented and conceptually discussed with regard to potential applications in the general microbial ecology field.
Subject(s)
Computational Biology/methods , Industrial Microbiology/methods , Metagenomics/methods , Biodiversity , Gene Library , High-Throughput Nucleotide Sequencing/methods , Metagenomics/statistics & numerical data , Phylogeny , Quality Control
ABSTRACT
Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted 'atypical' nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.
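The idea of deriving a score threshold from an ROC analysis of reads with known origin can be sketched for a single window as follows. This is not ROCker's code: the selection criterion used here (Youden's J, the difference between true and false positive rates), the function name and the toy bit scores are all assumptions, and ROCker itself works with position-specific thresholds in sliding windows along the protein.

```python
def best_threshold(scores, labels):
    """Return (threshold, tpr, fpr) maximizing Youden's J = TPR - FPR.

    A read is called a match when its score is >= threshold. `labels`
    holds True for reads that truly derive from the target gene.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative read")

    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        tpr = tp / positives
        fpr = fp / negatives
        j = tpr - fpr
        # Strict comparison keeps the lowest threshold among ties.
        if best is None or j > best[0]:
            best = (j, t, tpr, fpr)
    return best[1], best[2], best[3]


# Hypothetical bit scores of simulated reads mapped within one window.
scores = [35.0, 42.0, 48.0, 51.0, 55.0, 60.0, 63.0, 70.0]
labels = [False, False, True, False, True, True, True, True]

threshold, tpr, fpr = best_threshold(scores, labels)
print(f"threshold={threshold}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```

On this toy data a cutoff of 55 excludes every non-target read at the cost of one true read; a single fixed e-value applied to the whole protein cannot make that trade-off separately for each region.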
Subject(s)
Metagenomics/statistics & numerical data , Aquatic Organisms/genetics , Computational Biology/methods , Databases, Genetic/statistics & numerical data , Ecosystem , Microbial Consortia/genetics , Phylogeny , ROC Curve , Soil Microbiology
ABSTRACT
The widespread application of next-generation sequencing technologies has revolutionized microbiome research by enabling high-throughput profiling of the genetic contents of microbial communities. How to analyze the resulting large complex datasets remains a key challenge in current microbiome studies. Over the past decade, powerful computational pipelines and robust protocols have been established to enable efficient raw data processing and annotation. The focus has shifted toward downstream statistical analysis and functional interpretation. Here, we introduce MicrobiomeAnalyst, a user-friendly tool that integrates recent progress in statistics and visualization techniques, coupled with novel knowledge bases, to enable comprehensive analysis of common data outputs produced from microbiome studies. MicrobiomeAnalyst contains four modules - the Marker Data Profiling module offers various options for community profiling, comparative analysis and functional prediction based on 16S rRNA marker gene data; the Shotgun Data Profiling module supports exploratory data analysis, functional profiling and metabolic network visualization of shotgun metagenomics or metatranscriptomics data; the Taxon Set Enrichment Analysis module helps interpret taxonomic signatures via enrichment analysis against >300 taxon sets manually curated from literature and public databases; finally, the Projection with Public Data module allows users to visually explore their data with a public reference data for pattern discovery and biological insights. MicrobiomeAnalyst is freely available at http://www.microbiomeanalyst.ca.
Subject(s)
Computational Biology/methods , Metabolic Networks and Pathways/genetics , Metagenomics/statistics & numerical data , Microbiota/genetics , Software , Computer Graphics , DNA Barcoding, Taxonomic/methods , Datasets as Topic , Female , Gastrointestinal Tract/microbiology , Humans , Internet , Male , Meta-Analysis as Topic , Metagenomics/methods , Mouth/microbiology , Phylogeny , RNA, Ribosomal, 16S/genetics , Skin/microbiology , Vagina/microbiology
ABSTRACT
BACKGROUND: A key step in microbiome sequencing analysis is read assignment to taxonomic units. This is often performed using one of four taxonomic classifications, namely SILVA, RDP, Greengenes or NCBI. It is unclear how similar these are and how to compare analysis results that are based on different taxonomies. RESULTS: We provide a method and software for mapping taxonomic entities from one taxonomy onto another. We use it to compare the four taxonomies and the Open Tree of life Taxonomy (OTT). CONCLUSIONS: While we find that SILVA, RDP and Greengenes map well into NCBI, and all four map well into the OTT, mapping the two larger taxonomies on to the smaller ones is problematic.
Subject(s)
Algorithms , Archaea/classification , Bacteria/classification , Gene Ontology/statistics & numerical data , Phylogeny , Archaea/genetics , Bacteria/genetics , Databases, Genetic , Metagenomics/methods , Metagenomics/statistics & numerical data , Microbiota/genetics , RNA, Ribosomal, 16S/genetics
ABSTRACT
With read lengths of currently up to 2 × 300 bp, high throughput and low sequencing costs, Illumina's MiSeq is becoming one of the most utilized sequencing platforms worldwide. The platform is manageable and affordable even for smaller labs. This enables quick turnaround on a broad range of applications such as targeted gene sequencing, metagenomics, small genome sequencing and clinical molecular diagnostics. However, Illumina error profiles are still poorly understood, and programs are therefore not designed for the idiosyncrasies of Illumina data. A better knowledge of the error patterns is essential for sequence analysis and vital if we are to draw valid conclusions. Studying true genetic variation in a population sample is fundamental for understanding diseases, evolution and origin. We conducted a large study on the error patterns for the MiSeq based on 16S rRNA amplicon sequencing data. We tested state-of-the-art library preparation methods for amplicon sequencing and showed that the library preparation method and the choice of primers are the most significant sources of bias and cause distinct error patterns. Furthermore, we tested the efficiency of various error correction strategies and identified quality trimming (Sickle) combined with error correction (BayesHammer) followed by read overlapping (PANDAseq) as the most successful approach, reducing substitution error rates on average by 93%.
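The quality-trimming step mentioned above can be illustrated with a simple sliding-window rule. The sketch below is not Sickle's implementation: it trims only the 3' end, assumes Phred+33 quality encoding, and uses a hypothetical window size and quality cutoff.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(c) - 33 for c in qual]


def trim_3prime(seq, qual, window=4, min_mean_q=20):
    """Trim the 3' end of a read using a sliding-window mean quality.

    The read is cut at the start of the first window whose mean quality
    falls below min_mean_q; if no window fails, it is returned unchanged.
    """
    scores = phred33(qual)
    if len(seq) != len(scores):
        raise ValueError("sequence and quality string differ in length")
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


# Toy read: six high-quality bases (Q40, 'I') followed by four low-quality
# bases (Q2, '#').
seq, qual = "ACGTACGTAC", "IIIIII####"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, trimmed_qual)
```

In the workflow the abstract describes, trimming of this kind precedes error correction and read overlapping, so that the later steps work on bases of reasonable quality.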
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Bias , Gene Library , High-Throughput Nucleotide Sequencing/statistics & numerical data , INDEL Mutation , Metagenomics/methods , Metagenomics/statistics & numerical data , Nucleic Acid Amplification Techniques/methods , Nucleic Acid Amplification Techniques/statistics & numerical data , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA/statistics & numerical data , Software
ABSTRACT
BACKGROUND: Microbiota-oriented studies based on metagenomic or metatranscriptomic sequencing have revolutionised our understanding of microbial ecology and the roles of both clinical and environmental microbes. The analysis of massive metatranscriptomic data requires extensive computational resources, a collection of bioinformatics tools and expertise in programming. RESULTS: We developed COMAN (Comprehensive Metatranscriptomics Analysis), a web-based tool dedicated to automatically and comprehensively analysing metatranscriptomic data. The COMAN pipeline includes quality control of raw reads and removal of reads derived from non-coding RNA, followed by functional annotation, comparative statistical analysis, pathway enrichment analysis, co-expression network analysis and high-quality visualisation. The essential data generated by COMAN are also provided in tabular format for additional analysis and integration with other software. The web server has an easy-to-use interface and detailed instructions, and is freely available at http://sbb.hku.hk/COMAN/. CONCLUSIONS: COMAN is an integrated web server dedicated to comprehensive functional analysis of metatranscriptomic data, translating massive amounts of reads into data tables and high-standard figures. It is expected to help researchers with less expertise in bioinformatics answer microbiota-related biological questions and to increase the accessibility and interpretation of microbiota RNA-Seq data.
Subject(s)
Computational Biology/methods , Metagenomics/methods , Microbiota/genetics , Software , Transcriptome , Computational Biology/statistics & numerical data , High-Throughput Nucleotide Sequencing , Humans , Internet , Metagenomics/statistics & numerical data , Sequence Analysis, RNA
ABSTRACT
High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. Performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.
Subject(s)
Computational Biology/methods , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Data Compression/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Sequence Alignment , Software
ABSTRACT
There is much interest in using high-throughput DNA sequencing methodology to monitor microorganisms, complex plant and animal communities. However, there are experimental and analytical issues to consider before applying a sequencing technology, which was originally developed for genome projects, to ecological projects. Many of these issues have been highlighted by recent microbial studies. Understanding how high-throughput sequencing is best implemented is important for the interpretation of recent results and the success of future applications. Addressing complex biological questions with metagenomics requires the interaction of researchers who bring different skill sets to problem solving. Educators can help by nurturing a collaborative interdisciplinary approach to genome science, which is essential for effective problem solving. Educators are in a position to help students, teachers, the public and policy makers interpret the new knowledge that metagenomics brings. To do this, they need to understand, not only the excitement of the science but also the pitfalls and shortcomings of methodology and research designs. We review these issues and some of the research directions that are helping to move the field forward.
Subject(s)
Environmental Monitoring/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Algorithms , Animals , Computational Biology/education , Databases, Genetic/statistics & numerical data , Ecosystem , Metagenomics/statistics & numerical data , Software
ABSTRACT
BACKGROUND: Differences in linkage disequilibrium and in allele substitution effects of QTL (quantitative trait loci) may hinder genomic prediction across populations. Our objective was to develop a deterministic formula to estimate the accuracy of across-population genomic prediction, for which reference individuals and selection candidates are from different populations, and to investigate the impact of differences in allele substitution effects across populations and of the number of QTL underlying a trait on the accuracy. METHODS: A deterministic formula to estimate the accuracy of across-population genomic prediction was derived based on selection index theory. Moreover, accuracies were deterministically predicted using a formula based on population parameters and empirically calculated using simulated phenotypes and a GBLUP (genomic best linear unbiased prediction) model. Phenotypes of 1033 Holstein-Friesian, 105 Groninger White Headed and 147 Meuse-Rhine-Yssel cows were simulated by sampling 3000, 300, 30 or 3 QTL from the available high-density SNP (single nucleotide polymorphism) information of three chromosomes, assuming a correlation of 1.0, 0.8, 0.6, 0.4, or 0.2 between allele substitution effects across breeds. The simulated heritability was set to 0.95 to resemble the heritability of deregressed proofs of bulls. RESULTS: Accuracies estimated with the deterministic formula based on selection index theory were similar to empirical accuracies for all scenarios, while accuracies predicted with the formula based on population parameters overestimated empirical accuracies by ~25 to 30%. When the between-breed genetic correlation differed from 1, i.e. allele substitution effects differed across breeds, empirical and deterministic accuracies decreased in proportion to the genetic correlation. Using a multi-trait model, it was possible to accurately estimate the genetic correlation between the breeds based on phenotypes and high-density genotypes. The number of QTL underlying the simulated trait did not affect the accuracy. CONCLUSIONS: The deterministic formula based on selection index theory estimated the accuracy of across-population genomic predictions well. The deterministic formula using population parameters overestimated the across-population genomic accuracy, but may still be useful because of its simplicity. Both formulas could accommodate genetic correlations between populations lower than 1. The number of QTL underlying a trait did not affect the accuracy of across-population genomic prediction using a GBLUP method.
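The reported proportionality between accuracy and the between-breed genetic correlation can be written as a one-line scaling. The sketch below illustrates only that relationship, not the deterministic formula derived in the study (which the abstract does not state); the baseline accuracy of 0.30 is a hypothetical value, while the correlations are the ones simulated in the study.

```python
def across_population_accuracy(accuracy_at_rg_one, genetic_correlation):
    """Scale an across-population accuracy by the genetic correlation.

    Encodes only the abstract's statement that accuracy decreases in
    proportion to the between-breed genetic correlation.
    """
    if not 0.0 <= genetic_correlation <= 1.0:
        raise ValueError("genetic correlation must lie in [0, 1]")
    return genetic_correlation * accuracy_at_rg_one


baseline = 0.30  # hypothetical accuracy when the genetic correlation is 1
correlations = [1.0, 0.8, 0.6, 0.4, 0.2]  # values simulated in the study

scaled = {r: across_population_accuracy(baseline, r) for r in correlations}
for r, acc in scaled.items():
    print(f"r_g = {r:.1f} -> accuracy = {acc:.2f}")
```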