ABSTRACT
Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.
Subject(s)
Genetic Variation , Human Genome/immunology , Immune System , Algorithms , Computational Biology , DNA Copy Number Variations , Genomics/methods , Genomics/statistics & numerical data , HLA Antigens/genetics , Haplotypes , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immunogenetic Phenomena , Immunoglobulins/genetics , T-Cell Antigen Receptors/genetics , KIR Receptors/genetics , DNA Sequence Analysis/statistics & numerical data
ABSTRACT
Epstein-Barr virus (EBV) infects most of the world's population and is causally associated with several human cancers, but little is known about how EBV genetic variation might influence infection or EBV-associated disease. There are currently no published wild-type EBV genome sequences from a healthy individual and very few genomes from EBV-associated diseases. We have sequenced 71 geographically distinct EBV strains from cell lines, multiple types of primary tumor, and blood samples and the first EBV genome from the saliva of a healthy carrier. We show that the established genome map of EBV accurately represents all strains sequenced, but novel deletions are present in a few isolates. We have increased the number of type 2 EBV genomes sequenced from one to 12 and establish that the type 1/type 2 classification is a major feature of EBV genome variation, defined almost exclusively by variation of EBNA2 and EBNA3 genes, but geographic variation is also present. Single nucleotide polymorphism (SNP) density varies substantially across all known open reading frames and is highest in latency-associated genes. Some T-cell epitope sequences in EBNA3 genes show extensive variation across strains, and we identify codons under positive selection, both important considerations for the development of vaccines and T-cell therapy. We also provide new evidence for recombination between strains, which provides a further mechanism for the generation of diversity. Our results provide the first global view of EBV sequence variation and demonstrate an effective method for sequencing large numbers of genomes to further understand the genetics of EBV infection. IMPORTANCE: Most people in the world are infected by Epstein-Barr virus (EBV), and it causes several human diseases, which occur at very different rates in different parts of the world and are linked to host immune system variation. 
Natural variation in EBV DNA sequence may be important for normal infection and for causing disease. Here we used rapid, cost-effective sequencing to determine 71 new EBV sequences from different sample types and locations worldwide. We showed geographic variation in EBV genomes and identified the most variable parts of the genome. We identified protein sequences that seem to have been selected by the host immune system and detected variability in known immune epitopes. This gives the first overview of EBV genome variation, important for designing vaccines and immune therapy for EBV, and provides techniques to investigate relationships between viral sequence variation and EBV-associated diseases.
Subject(s)
Epstein-Barr Virus Infections/virology , Genetic Variation , Viral Genome , Human Herpesvirus 4/genetics , Amino Acid Sequence , Viral Antigens/genetics , Carrier State/virology , Tumor Cell Line , Viral DNA/genetics , T-Lymphocyte Epitopes/genetics , Epstein-Barr Virus Nuclear Antigens/genetics , Human Herpesvirus 4/classification , Human Herpesvirus 4/isolation & purification , Humans , Phylogeny , Single Nucleotide Polymorphism , Genetic Recombination , Viral Matrix Proteins/genetics
ABSTRACT
BACKGROUND: Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary. RESULTS: The performance of de novo short-read assembly followed by automatic annotation using the pubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses and the data are publicly available in the PubMLST Neisseria database. CONCLUSIONS: The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. 
The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting whole genome sequencing (WGS) data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease-causing lineages.
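Illustrative note: the core-genome criterion quoted above (loci present in at least 95% of the population) can be sketched as a simple presence count. This is a minimal sketch under stated assumptions, not the Genome Comparator implementation; the function name `core_loci`, the isolate identifiers and the toy presence data are all hypothetical.

```python
from typing import Dict, List, Set


def core_loci(presence: Dict[str, Set[str]], min_percent: int = 95) -> List[str]:
    """Return loci present in at least `min_percent` % of isolates.

    `presence` maps an isolate identifier to the set of loci found in it.
    Integer arithmetic is used so the threshold comparison is exact.
    """
    n_isolates = len(presence)
    if n_isolates == 0:
        return []
    counts: Dict[str, int] = {}
    for loci in presence.values():
        for locus in loci:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(
        locus
        for locus, count in counts.items()
        if count * 100 >= min_percent * n_isolates
    )


# Toy data (hypothetical): 20 isolates, locus names used only as labels.
isolates = {f"iso{i}": {"abcZ", "adk"} for i in range(20)}
isolates["iso0"].discard("adk")      # adk now present in 19/20 isolates (95%)
isolates["iso1"].add("porA")         # porA present in 1/20 isolates (5%)
for i in range(10):
    isolates[f"iso{i}"].add("fetA")  # fetA present in 10/20 isolates (50%)

core = core_loci(isolates)
```

With this toy input, `abcZ` and `adk` meet the threshold while `porA` and `fetA` fall into the accessory set.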
Subject(s)
Genomics/methods , Molecular Sequence Annotation , Neisseria meningitidis/genetics , Bacterial Genome/genetics , High-Throughput Nucleotide Sequencing , Phylogeny
ABSTRACT
Background. PCR amplification is a necessary step in many next-generation sequencing (NGS) library preparation methods [1, 2]. Whilst many PCR enzymes are developed to amplify single targets efficiently, accurately and with specificity, few are developed to meet the challenges imposed by NGS PCR, namely unbiased amplification of fragments spanning a wide range of sizes and GC content. As a result, PCR amplification during NGS library prep often results in bias toward GC-neutral and smaller fragments. As NGS has matured, optimized NGS library prep kits and polymerase formulations have emerged, and in this study we have tested a wide selection of available enzymes for both short-read Illumina library preparation and long fragment amplification ahead of long-read sequencing. We tested over 20 different high-fidelity PCR enzymes/NGS amplification mixes on a range of Illumina library templates of varying GC content and composition, and found that both the yield and the genome coverage uniformity of the commercially available enzymes varied dramatically. Three enzymes, Quantabio RepliQa Hifi Toughmix, Watchmaker Library Amplification Hot Start Master Mix (2X) 'Equinox' and Takara Ex Premier, were found to give a consistent performance, over all genomes, that closely mirrored that observed for PCR-free datasets. We also tested a range of enzymes for long-read sequencing by amplifying size-fractionated S. cerevisiae DNA of average size 21.6 and 13.4 kb, respectively. The enzymes of choice for short-read (Illumina) library fragment amplification are Quantabio RepliQa Hifi Toughmix, Watchmaker Library Amplification Hot Start Master Mix (2X) 'Equinox' and Takara Ex Premier, with RepliQa also being the best performing of the enzymes tested for long fragment amplification prior to long-read sequencing.
Subject(s)
DNA , Saccharomyces cerevisiae , Polymerase Chain Reaction/methods , Gene Library , High-Throughput Nucleotide Sequencing/methods
ABSTRACT
Pathogenic Neisseria meningitidis isolates contain a polysaccharide capsule that is the main virulence determinant for this bacterium. Thirteen capsular polysaccharides have been described, and nuclear magnetic resonance spectroscopy has enabled determination of the structure of capsular polysaccharides responsible for serogroup specificity. Molecular mechanisms involved in N. meningitidis capsule biosynthesis have also been identified, and genes involved in this process and in cell surface translocation are clustered at a single chromosomal locus termed cps. The use of multiple names for some of the genes involved in capsule synthesis, combined with the need for rapid diagnosis of serogroups commonly associated with invasive meningococcal disease, prompted a requirement for a consistent approach to the nomenclature of capsule genes. In this report, a comprehensive description of all N. meningitidis serogroups is provided, along with a proposed nomenclature, which was presented at the 2012 XVIIIth International Pathogenic Neisseria Conference.
Subject(s)
Bacterial Capsules/genetics , Bacterial Chromosomes , Bacterial Genes , Neisseria meningitidis/genetics , Bacterial Polysaccharides/genetics , Bacterial Capsules/metabolism , Genetic Loci , Humans , Magnetic Resonance Spectroscopy , Meningococcal Infections/microbiology , Multigene Family , Neisseria meningitidis/classification , Neisseria meningitidis/isolation & purification , Neisseria meningitidis/pathogenicity , Polymerase Chain Reaction , Bacterial Polysaccharides/biosynthesis , Bacterial Polysaccharides/isolation & purification , Serotyping , Terminology as Topic , Virulence
ABSTRACT
Citrobacter rodentium is a natural mouse pathogen that causes attaching and effacing (A/E) lesions. It shares a common virulence strategy with the clinically significant human A/E pathogens enteropathogenic E. coli (EPEC) and enterohaemorrhagic E. coli (EHEC) and is widely used to model this route of pathogenesis. We previously reported the complete genome sequence of C. rodentium ICC168, where we found that the genome displayed many characteristics of a newly evolved pathogen. In this study, through PFGE, sequencing of isolates showing variation, whole genome transcriptome analysis and examination of the mobile genetic elements, we found that, consistent with our previous hypothesis, the genome of C. rodentium is unstable as a result of repeat-mediated, large-scale genome recombination and because of active transposition of mobile genetic elements such as the prophages. We sequenced an additional C. rodentium strain, EX-33, to reveal that the reference strain ICC168 is representative of the species and that most of the inactivating mutations were common to both isolates and likely to have occurred early on in the evolution of this pathogen. We draw parallels with the evolution of other bacterial pathogens and conclude that C. rodentium is a recently evolved pathogen that may have emerged alongside the development of inbred mice as a model for human disease.
Subject(s)
Citrobacter rodentium/genetics , Citrobacter rodentium/pathogenicity , Bacterial Genome , Animals , Citrobacter rodentium/classification , Bacterial DNA/genetics , Animal Disease Models , Pulsed-Field Gel Electrophoresis , Enterobacteriaceae Infections/pathology , Female , Gene Expression Profiling , Gene Rearrangement , Genomic Instability , Humans , Interspersed Repetitive Sequences , Mice , Inbred C57BL Mice , Plasmids/genetics , Prophages/genetics , DNA Sequence Analysis , Virulence
ABSTRACT
Clostridium difficile has rapidly emerged as the leading cause of antibiotic-associated diarrheal disease, with the transcontinental spread of various PCR ribotypes, including 001, 017, 027 and 078. However, the genetic basis for the emergence of C. difficile as a human pathogen is unclear. Whole genome sequencing was used to analyze genetic variation and virulence of a diverse collection of thirty C. difficile isolates, to determine both macro and microevolution of the species. Horizontal gene transfer and large-scale recombination of core genes have shaped the C. difficile genome over both short and long time scales. Phylogenetic analysis demonstrates that C. difficile is a genetically diverse species, which has evolved within the last 1.1-85 million years. By contrast, the disease-causing isolates have arisen from multiple lineages, suggesting that virulence evolved independently in the highly epidemic lineages.
Subject(s)
Clostridioides difficile/genetics , Molecular Evolution , Computational Biology , Bacterial Gene Expression Regulation , Gene Transfer Techniques , Bacterial Genome , Molecular Sequence Data , Phylogeny , Polymerase Chain Reaction , Single Nucleotide Polymorphism , Genetic Recombination , DNA Sequence Analysis , Species Specificity , Time Factors , Virulence
ABSTRACT
In common with other bacterial taxa, members of the genus Neisseria are classified using a range of phenotypic and biochemical approaches, which are not entirely satisfactory in assigning isolates to species groups. Recently, there has been increasing interest in using nucleotide sequences for bacterial typing and taxonomy, but to date, no broadly accepted alternative to conventional methods is available. Here, the taxonomic relationships of 55 representative members of the genus Neisseria have been analysed using whole-genome sequence data. As genetic material belonging to the accessory genome is widely shared among different taxa but not present in all isolates, this analysis indexed nucleotide sequence variation within sets of genes, specifically protein-coding genes that were present and directly comparable in all isolates. Variation in these genes identified seven species groups, which were robust to the choice of genes and phylogenetic clustering methods used. The groupings were largely, but not completely, congruent with current species designations, with some minor changes in nomenclature and the reassignment of a few isolates necessary. In particular, these data showed that isolates classified as Neisseria polysaccharea are polyphyletic and probably include more than one taxonomically distinct organism. The seven groups could be reliably and rapidly generated with sequence variation within the 53 ribosomal protein subunit (rps) genes, further demonstrating that ribosomal multilocus sequence typing (rMLST) is a practicable and powerful means of characterizing bacteria at all levels, from domain to strain.
Subject(s)
Genomics/methods , Gram-Negative Bacterial Infections/microbiology , Neisseria/classification , Neisseria/isolation & purification , Phylogeny , Bacterial Typing Techniques/methods , Genetic Variation , Humans , Molecular Sequence Data , Neisseria/genetics , DNA Sequence Analysis
ABSTRACT
We present a genome assembly from a clonal population of Eimeria tenella Houghton parasites (Apicomplexa; Conoidasida; Eucoccidiorida; Eimeriidae). The genome sequence is 53.25 megabases in span. The entire assembly is scaffolded into 15 chromosomal pseudomolecules, with complete mitochondrion and apicoplast organellar genomes also present.
ABSTRACT
We present a genome assembly from an individual male Arvicola amphibius (the European water vole; Chordata; Mammalia; Rodentia; Cricetidae). The genome sequence is 2.30 gigabases in span. The majority of the assembly is scaffolded into 18 chromosomal pseudomolecules, including the X sex chromosome. Gene annotation of this assembly on Ensembl has identified 21,394 protein coding genes.
ABSTRACT
We present a genome assembly from an individual female Salmo trutta (the brown trout; Chordata; Actinopteri; Salmoniformes; Salmonidae). The genome sequence is 2.37 gigabases in span. The majority of the assembly is scaffolded into 40 chromosomal pseudomolecules. Gene annotation of this assembly on Ensembl has identified 43,935 protein coding genes.
ABSTRACT
We present a genome assembly from an individual female Aquila chrysaetos chrysaetos (the European golden eagle; Chordata; Aves; Accipitridae). The genome sequence is 1.23 gigabases in span. The majority of the assembly is scaffolded into 28 chromosomal pseudomolecules, including the W and Z sex chromosomes.
ABSTRACT
We present a genome assembly from an individual male Rattus norvegicus (the Norway rat; Chordata; Mammalia; Rodentia; Muridae). The genome sequence is 2.44 gigabases in span. The majority of the assembly is scaffolded into 20 chromosomal pseudomolecules, with both X and Y sex chromosomes assembled. This genome assembly, mRatBN7.2, represents the new reference genome for R. norvegicus and has been adopted by the Genome Reference Consortium.
ABSTRACT
We present a genome assembly from an individual female Streptopelia turtur (the European turtle dove; Chordata; Aves; Columbidae). The genome sequence is 1.18 gigabases in span. The majority of the assembly is scaffolded into 35 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled.
ABSTRACT
BACKGROUND: Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly. RESULTS: As part of the Vertebrate Genomes Project (VGP) we develop mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (> 10 kbp, PacBio or Nanopore) and short (100-300 bp, Illumina) reads. Our pipeline leads to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We observe that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we identify errors, missing sequences, and incomplete genes in those references, particularly in repetitive regions. Our assemblies also identify novel gene region duplications. The presence of repeats and duplications in over half of the species herein assembled indicates that their occurrence is a principle of mitochondrial structure rather than an exception, shedding new light on mitochondrial genome evolution and organization. CONCLUSIONS: Our results indicate that even in the "simple" case of vertebrate mitogenomes the completeness of many currently available reference sequences can be further improved, and caution should be exercised before claiming the complete assembly of a mitogenome, particularly from short reads alone.
Subject(s)
Gene Duplication , Mitochondrial Genome , Genomics , Nucleic Acid Repetitive Sequences , Vertebrates/genetics , Animals , Computational Biology/methods , Computational Biology/standards , Molecular Evolution , Genomics/methods , High-Throughput Nucleotide Sequencing
ABSTRACT
The 3.1-Mb genome of an outbreak methicillin-resistant Staphylococcus aureus (MRSA) strain (TW20) contains evidence of recently acquired DNA, including two large regions (635 kb and 127 kb). The strain is resistant to a wide range of antibiotics, antiseptics, and heavy metals due to resistance genes encoded on mobile genetic elements and also mutations in housekeeping genes.
Subject(s)
Multiple Bacterial Drug Resistance/genetics , Bacterial Genome/genetics , Methicillin-Resistant Staphylococcus aureus/genetics , Anti-Bacterial Agents/pharmacology , Local Anti-Infective Agents/pharmacology , Heavy Metals/toxicity , Methicillin-Resistant Staphylococcus aureus/drug effects , Genetic Models , Molecular Sequence Data
ABSTRACT
Citrobacter rodentium (formerly Citrobacter freundii biotype 4280) is a highly infectious pathogen that causes colitis and transmissible colonic hyperplasia in mice. In common with enteropathogenic and enterohemorrhagic Escherichia coli (EPEC and EHEC, respectively), C. rodentium exploits a type III secretion system (T3SS) to induce attaching and effacing (A/E) lesions that are essential for virulence. Here, we report the fully annotated genome sequence of the 5.3-Mb chromosome and four plasmids harbored by C. rodentium strain ICC168. The genome sequence revealed key information about the phylogeny of C. rodentium and identified 1,585 C. rodentium-specific (without orthologues in EPEC or EHEC) coding sequences, 10 prophage-like regions, and 17 genomic islands, including the locus for enterocyte effacement (LEE) region, which encodes a T3SS and effector proteins. Among the 29 T3SS effectors found in C. rodentium are all 22 of the core effectors of EPEC strain E2348/69. In addition, we identified a novel C. rodentium effector, named EspS. C. rodentium harbors two type VI secretion systems (T6SS) (CTS1 and CTS2), while EHEC contains only one T6SS (EHS). Our analysis suggests that C. rodentium and EPEC/EHEC have converged on a common host infection strategy through access to a common pool of mobile DNA and that C. rodentium has lost gene functions associated with a previous pathogenic niche.
Subject(s)
Citrobacter rodentium/genetics , Escherichia coli/genetics , Molecular Evolution , Bacterial Genome/genetics , Animals , Citrobacter rodentium/classification , Computational Biology , Humans , Male , Mice , Molecular Sequence Data , Phylogeny
ABSTRACT
Comparison of the complete genome sequence of Bacteroides fragilis 638R, originally isolated in the USA, was made with two previously sequenced strains isolated in the UK (NCTC 9343) and Japan (YCH46). The presence of 10 loci containing genes associated with polysaccharide (PS) biosynthesis, each including a putative Wzx flippase and Wzy polymerase, was confirmed in all three strains, despite a lack of cross-reactivity between NCTC 9343 and 638R surface PS-specific antibodies by immunolabelling and microscopy. Genomic comparisons revealed an exceptional level of PS biosynthesis locus diversity. Of the 10 divergent PS-associated loci apparent in each strain, none is similar between NCTC 9343 and 638R. YCH46 shares one locus with NCTC 9343, confirmed by mAb labelling, and a second different locus with 638R, making a total of 28 divergent PS biosynthesis loci amongst the three strains. The lack of expression of the phase-variable large capsule (LC) in strain 638R, observed in NCTC 9343, is likely to be due to a point mutation that generates a stop codon within a putative initiating glycosyltransferase, necessary for the expression of the LC in NCTC 9343. Other major sequence differences were observed to arise from different numbers and variety of inserted extra-chromosomal elements, in particular prophages. Extensive horizontal gene transfer has occurred within these strains, despite the presence of a significant number of divergent DNA restriction and modification systems that act to prevent acquisition of foreign DNA. The level of amongst-strain diversity in PS biosynthesis loci is unprecedented.
Subject(s)
Bacterial Capsules/genetics , Bacteroides fragilis/genetics , Genetic Variation , Bacterial Genome , Bacterial Capsules/biosynthesis , Bacteroides fragilis/isolation & purification , Comparative Genomic Hybridization , Bacterial DNA/genetics , Humans , Molecular Sequence Data , DNA Sequence Analysis
ABSTRACT
Salmonella enterica with the identical antigenic formula 6,7:c:1,5 can be differentiated biochemically and by disease syndrome. One grouping, Salmonella Paratyphi C, is currently considered a typhoidal serovar, responsible for enteric fever in humans. The human-restricted typhoidal serovars (S. Typhi and Paratyphi A, B and C) typically display high levels of genome degradation and are cited as an example of convergent evolution for host adaptation in humans. However, S. Paratyphi C presents a different clinical picture to S. Typhi/Paratyphi A, in a patient group with predisposition, raising the possibility that its natural history is different, and that infection is invasive salmonellosis rather than enteric fever. Using whole genome sequencing and metabolic pathway analysis, we compared the genomes of 17 S. Paratyphi C strains to other members of the 6,7:c:1,5 group and to two typhoidal serovars: S. Typhi and Paratyphi A. The genome degradation observed in S. Paratyphi C was much lower than that in S. Typhi/Paratyphi A, but similar to that in the other 6,7:c:1,5 strains. Genomic and metabolic comparisons revealed little to no overlap between S. Paratyphi C and the other typhoidal serovars, arguing against convergent evolution and instead providing evidence of a primary adaptation to pigs in accordance with the 6,7:c:1,5 strains.
ABSTRACT
We present a genome assembly from an individual male Sciurus carolinensis (the eastern grey squirrel; Vertebrata; Mammalia; Eutheria; Rodentia; Sciuridae). The genome sequence is 2.82 gigabases in span. The majority of the assembly (92.3%) is scaffolded into 21 chromosomal-level scaffolds, with both X and Y sex chromosomes assembled.