ABSTRACT
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.
Subject(s)
Human Genome, Whole Genome Sequencing, Female, High-Throughput Nucleotide Sequencing/methods, Humans, INDEL Mutation, Male, Single Nucleotide Polymorphism
ABSTRACT
To sustain and develop the largest fully open human genomic resources the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org) was established. It is built on the foundation of the 1000 Genomes Project, which created the largest openly accessible catalogue of human genomic variation developed from samples spanning five continents. IGSR (i) maintains access to 1000 Genomes Project resources, (ii) updates 1000 Genomes Project resources to the GRCh38 human reference assembly, (iii) adds new data generated on 1000 Genomes Project cell lines, (iv) shares data from samples with a similarly open consent to increase the number of samples and populations represented in the resources and (v) provides support to users of these resources. Among recent updates are the release of variation calls from 1000 Genomes Project data calculated directly on GRCh38 and the addition of high coverage sequence data for the 2504 samples in the 1000 Genomes Project phase three panel. The data portal, which facilitates web-based exploration of the IGSR resources, has been updated to include samples which were not part of the 1000 Genomes Project and now presents a unified view of data and samples across almost 5000 samples from multiple studies. All data is fully open and publicly accessible.
Subject(s)
Computational Biology, Genetic Databases, Genetic Variation, Human Genome, Genomics, Software, Computational Biology/methods, Genomics/methods, Humans, User-Computer Interface, Web Browser
ABSTRACT
The International Genome Sample Resource (IGSR; http://www.internationalgenome.org) expands in data type and population diversity the resources from the 1000 Genomes Project. IGSR represents the largest open collection of human variation data and provides easy access to these resources. IGSR was established in 2015 to maintain and extend the 1000 Genomes Project data, which has been widely used as a reference set of human variation and by researchers developing analysis methods. IGSR has mapped all of the 1000 Genomes sequence to the newest human reference (GRCh38), and will release updated variant calls to ensure maximal usefulness of the existing data. IGSR is collecting new structural variation data on the 1000 Genomes samples from long read sequencing and other technologies, and will collect relevant functional data into a single comprehensive resource. IGSR is extending coverage with new populations sequenced by collaborating groups. Here, we present the new data and analysis that IGSR has made available. We have also introduced a new data portal that increases discoverability of our data (previously only browseable through our FTP site) by focusing on particular samples, populations or data sets of interest.
Subject(s)
Computational Biology/methods, Genetic Databases, Genetic Variation, Genome, Genomics/methods, Web Browser
ABSTRACT
The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) offers a rich platform for data sharing, publishing and archiving and a globally comprehensive data set for onward use by the scientific community. With a broad scope spanning raw sequencing reads, genome assemblies and functional annotation, the resource provides extensive data submission, search and download facilities across web and programmatic interfaces. Here, we outline ENA content and major access modalities, highlight major developments in 2016 and outline a number of examples of data reuse from ENA.
Subject(s)
Nucleic Acid Databases, DNA Sequence Analysis, RNA Sequence Analysis, Genomics, Internet, Molecular Sequence Annotation
ABSTRACT
For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ~1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.
Subject(s)
Genome/genetics, Phylogeny, Sus scrofa/classification, Sus scrofa/genetics, Animals, Demography, Animal Models, Molecular Sequence Data, Population Dynamics
ABSTRACT
The zebra finch is an important model organism in several fields with unique relevance to human neuroscience. Like other songbirds, the zebra finch communicates through learned vocalizations, an ability otherwise documented only in humans and a few other animals and lacking in the chicken, the only bird with a sequenced genome until now. Here we present a structural, functional and comparative analysis of the genome sequence of the zebra finch (Taeniopygia guttata), which is a songbird belonging to the large avian order Passeriformes. We find that the overall structures of the genomes are similar in zebra finch and chicken, but they differ in many intrachromosomal rearrangements, lineage-specific gene family expansions, the number of long-terminal-repeat-based retrotransposons, and mechanisms of sex chromosome dosage compensation. We show that song behaviour engages gene regulatory networks in the zebra finch brain, altering the expression of long non-coding RNAs, microRNAs, transcription factors and their targets. We also show evidence for rapid molecular evolution in the songbird lineage of genes that are regulated during song experience. These results indicate an active involvement of the genome in neural processes underlying vocal communication and identify potential genetic substrates for the evolution and regulation of this behaviour.
Subject(s)
Finches/genetics, Genome/genetics, 3' Untranslated Regions/genetics, Animals, Auditory Perception/genetics, Brain/physiology, Chickens/genetics, Molecular Evolution, Female, Finches/physiology, Gene Duplication, Gene Regulatory Networks/genetics, Male, MicroRNAs/genetics, Animal Models, Multigene Family/genetics, Retroelements/genetics, Sex Chromosomes/genetics, Terminal Repeat Sequences/genetics, Genetic Transcription/genetics, Animal Vocalization/physiology
ABSTRACT
The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidenced-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Subject(s)
Genetic Databases, Genomics, Animals, Gene Expression Regulation, Genetic Variation, Humans, Internet, Mice, Molecular Sequence Annotation, Rats, Software, Zebrafish/genetics
ABSTRACT
The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.
Subject(s)
Genetic Databases, Genomics, Animals, Gene Expression Regulation, Genetic Variation, Humans, Mice, Molecular Sequence Annotation, Rats
ABSTRACT
The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.
Subject(s)
Genetic Databases, Genomics, Animals, Genetic Variation, Humans, Mice, Molecular Sequence Annotation, Rats, Nucleic Acid Regulatory Sequences, Software, Zebrafish/genetics
ABSTRACT
Ensembl (http://www.ensembl.org) integrates genomic information for a comprehensive set of chordate genomes with a particular focus on resources for human, mouse, rat, zebrafish and other high-value sequenced genomes. We provide complete gene annotations for all supported species in addition to specific resources that target genome variation, function and evolution. Ensembl data is accessible in a variety of formats including via our genome browser, API and BioMart. This year marks the tenth anniversary of Ensembl and in that time the project has grown with advances in genome technology. As of release 56 (September 2009), Ensembl supports 51 species including marmoset, pig, zebra finch, lizard, gorilla and wallaby, which were added in the past year. Major additions and improvements to Ensembl since our previous report include the incorporation of the human GRCh37 assembly, enhanced visualisation and data-mining options for the Ensembl regulatory features and continued development of our software infrastructure.
Subject(s)
Computational Biology/methods, Genetic Databases, Nucleic Acid Databases, Access to Information, Animals, Computational Biology/trends, Protein Databases, Genetic Variation, Genomics/methods, Humans, Information Storage and Retrieval/methods, Internet, Tertiary Protein Structure, Software, Species Specificity
ABSTRACT
Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.
Subject(s)
Genetic Variation, Human Genome, Haplotypes, Female, Genotype, High-Throughput Nucleotide Sequencing, Humans, INDEL Mutation, Interspersed Repetitive Sequences, Male, Population Groups/genetics, Quantitative Trait Loci, Retroelements, DNA Sequence Analysis, Sequence Inversion, Whole Genome Sequencing
ABSTRACT
The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.
ABSTRACT
We present biallelic SNVs called from 2,548 samples across 26 populations from the 1000 Genomes Project, called directly on GRCh38. We believe this will be a useful reference resource for those using GRCh38, representing an improvement over the "lift-overs" of the 1000 Genomes Project data that have been available to date and providing a resource necessary for the full adoption of GRCh38 by the community. Here, we describe how the call set was created and provide benchmarking data describing how our call set compares to that produced by the final phase of the 1000 Genomes Project on GRCh37.
ABSTRACT
The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.
Subject(s)
Human Genome/genetics, Genomic Structural Variation, Genomics/methods, Haplotypes/genetics, Algorithms, Chromosome Mapping/methods, Genetic Databases, High-Throughput Nucleotide Sequencing/methods, Humans, INDEL Mutation, Whole Genome Sequencing/methods
ABSTRACT
Growing demand for staple crops like rice will need to be achieved predominately through agricultural intensification and more efficient use of inputs. To meet this demand it is essential that the genetic diversity within rice is fully utilized. The aus subpopulation is considered an underappreciated resource within that diversity. A new rice panel, the Bengal and Assam Aus Panel (BAAP) of 266 aus accessions was generated with ~2 million informative SNPs obtained using skim sequencing at ~4× depth. The BAAP was grown in the field in Bangladesh in the 'boro' season under both continuously flooded and Alternate Wetting and Drying (AWD) irrigation during 2013 and 2014 in Mymensingh and during 2014 in Madhupur. Heading date, grain mass, straw biomass and harvest index were measured. The majority (94%) of BAAP accessions flowered within a relatively small window of 10 days. The AWD irrigation treatment generally caused an increase in grain mass, but no significant genotype by treatment interactions were detected for this trait. Shoot biomass was the only trait that showed evidence of genotype by treatment interaction. The average LD (Linkage Disequilibrium) decay across the genome was 243 Kbp. Genome wide association mapping revealed 115 quantitative trait loci (QTLs). There was little evidence of QTLs specific to the irrigation treatment, and only a few QTLs co-localized with known genes. However, some QTLs were detected across multiple sites and years. These QTLs should be targets for breeding, and include a region around 2.2 Mbp on chromosome 1, a large region in the middle of chromosome 7 and two regions on chromosome 11 (~10 Mbp and ~29 Mbp). The BAAP appears to be a valuable addition to the growing collection of GWA mapping populations of rice.
ABSTRACT
The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.
Subject(s)
Contig Mapping/methods, Human Genome Project, Sequence Alignment/methods, Whole Genome Sequencing/methods, Algorithms, Contig Mapping/standards, Humans, Reference Standards, Sequence Alignment/standards, Whole Genome Sequencing/standards
ABSTRACT
The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html.
Subject(s)
Nucleic Acid Databases, Protein Databases, Internet, Molecular Sequence Annotation/methods, Animals, Humans, Mice
ABSTRACT
Genetic variation in an infectious disease pathogen can be driven by ecological niche dissimilarities arising from different host species and different geographical locations. Whole genome sequencing was used to compare E. coli O157 isolates from host reservoirs (cattle and sheep) from Scotland and to compare genetic variation of isolates (human, animal, environmental/food) obtained from Scotland, New Zealand, Netherlands, Canada and the USA. Nei's genetic distance calculated from core genome single nucleotide polymorphisms (SNPs) demonstrated that the animal isolates were from the same population. Investigation of the Shiga toxin bacteriophage and their insertion sites (SBI typing) revealed that cattle and sheep isolates had statistically indistinguishable rarefaction profiles, diversity and genotypes. In contrast, isolates from different countries exhibited significant differences in Nei's genetic distance and SBI typing. Hence, after successful international transmission, which has occurred on multiple occasions, local genetic variation occurs, resulting in a global patchwork of continental and trans-continental phylogeographic clades. These findings are important for three reasons: first, understanding transmission and evolution of infectious diseases associated with multiple host reservoirs and multi-geographic locations; second, highlighting the relevance of the sheep reservoir when considering farm based interventions; and third, improving our understanding of why human disease incidence varies across the world.
Subject(s)
Bacteriophages/genetics, Escherichia coli Infections/genetics, Escherichia coli O157/isolation & purification, Genome, Host-Pathogen Interactions/genetics, Phylogeography, Single Nucleotide Polymorphism/genetics, Animals, Cattle, Escherichia coli Infections/epidemiology, Escherichia coli Infections/microbiology, Escherichia coli O157/genetics, Genetic Variation, High-Throughput Nucleotide Sequencing, Humans, New Zealand/epidemiology, Sheep
ABSTRACT
BACKGROUND: We present the genome sequence of the tammar wallaby, Macropus eugenii, which is a member of the kangaroo family and the first representative of the iconic hopping mammals that symbolize Australia to be sequenced. The tammar has many unusual biological characteristics, including the longest period of embryonic diapause of any mammal, extremely synchronized seasonal breeding and prolonged and sophisticated lactation within a well-defined pouch. Like other marsupials, it gives birth to highly altricial young, and has a small number of very large chromosomes, making it a valuable model for genomics, reproduction and development. RESULTS: The genome has been sequenced to 2 × coverage using Sanger sequencing, enhanced with additional next generation sequencing and the integration of extensive physical and linkage maps to build the genome assembly. We also sequenced the tammar transcriptome across many tissues and developmental time points. Our analyses of these data shed light on mammalian reproduction, development and genome evolution: there is innovation in reproductive and lactational genes, rapid evolution of germ cell genes, and incomplete, locus-specific X inactivation. We also observe novel retrotransposons and a highly rearranged major histocompatibility complex, with many class I genes located outside the complex. Novel microRNAs in the tammar HOX clusters uncover new potential mammalian HOX regulatory elements. CONCLUSIONS: Analyses of these resources enhance our understanding of marsupial gene evolution, identify marsupial-specific conserved non-coding elements and critical genes across a range of biological systems, including reproduction, development and immunity, and provide new insight into marsupial and mammalian biology and genome evolution.
Subject(s)
Biological Evolution, Macropodidae/classification, Macropodidae/genetics, Transcriptome/genetics, Animals, Australia, Chromosome Mapping, Mammalian Chromosomes/genetics, Female, Gene Expression Regulation, Genome, Genomic Imprinting, Fluorescence In Situ Hybridization, Macropodidae/growth & development, MicroRNAs/genetics, MicroRNAs/metabolism, Molecular Sequence Data, Reproduction/genetics, Sequence Alignment, DNA Sequence Analysis
ABSTRACT
Ovine enzootic abortion (OEA) resulting from infection of sheep and goats with Chlamydophila abortus is of major economic importance worldwide. Over the last 50 years the serological diagnosis of infection has been based mainly on the complement fixation test (CFT), which lacks both sensitivity and specificity because of cross-reactive antibodies to other gram-negative bacteria, including another common chlamydial pathogen of sheep, Chlamydophila pecorum. In the present study, a series of overlapping recombinant antigens representing the polymorphic outer membrane protein POMP90 of C. abortus was assessed by enzyme-linked immunosorbent assay (ELISA) with a panel of 143 serum samples from sheep experimentally infected with C. abortus, from sheep clinically free of OEA, and from specific-pathogen-free lambs experimentally infected with different subtypes of C. pecorum. The results were compared to those obtained by CFT and another recently described test, an indirect ELISA (iELISA) with the recombinant OMP91B (rOMP91B) fragment (rOMP91B iELISA) (D. Longbottom, E. Psarrou, M. Livingstone, and E. Vretou, FEMS Microbiol. Lett. 195:157-161, 2001). The rOMP90-3 and rOMP90-4 ELISAs were identified as being more sensitive and specific than CFT. Assays with both fragments were evaluated further with a panel of 294 field serum samples from flocks with documented histories of abortion, from flocks with no clinical histories of abortion but which had a high proportion of samples seropositive by CFT, and from animals with no histories of abortion but from which various C. pecorum subtypes had been isolated. ELISAs with both POMP90 fragments outperformed CFT with serum samples from C. pecorum-infected animals, producing no false-positive results. However, the ELISA with the rOMP90-4 fragment appeared to be more sensitive than the one with rOMP90-3, as it identified more of the OEA-positive samples. 
The ELISA with the rOMP90-4 fragment was also able to identify apparently healthy animals that were infected with an enteric strain of C. abortus in flocks that were probably infected with both enteric C. abortus and C. pecorum strains. The identification of animals infected with enteric C. abortus is extremely important in controlling the spread of OEA. Overall, the new rOMP90-4 ELISA was found to be a more sensitive and specific test than CFT for differentiating animals infected with C. abortus from those infected with C. pecorum.