ABSTRACT
Scots pine (Pinus sylvestris L.) is one of the most widespread and economically important conifer species in the world. Applications like genomic selection and association studies, which could help accelerate breeding cycles, are challenging in Scots pine because of its large and repetitive genome. For this reason, genotyping tools for conifer species, and in particular for Scots pine, are commonly based on transcribed regions of the genome. In this article, we present the Axiom Psyl50K array, the first single nucleotide polymorphism (SNP) genotyping array for Scots pine based on whole-genome resequencing, that represents both genic and intergenic regions. This array was designed following a two-step procedure: first, 192 trees were sequenced, and a 430K SNP screening array was constructed. Then, 480 samples, including haploid megagametophytes, full-sib family trios, breeding population, and range-wide individuals from across Eurasia were genotyped with the screening array. The best 50K SNPs were selected based on quality, replicability, distribution across the draft genome assembly, balance between genic and intergenic regions, and genotype-environment and genotype-phenotype associations. Of the final 49 877 probes tiled in the array, 20 372 (40.84%) occur inside gene models, while the rest lie in intergenic regions. We also show that the Psyl50K array can yield enough high-confidence SNPs for genetic studies in pine species from North America and Eurasia. This new genotyping tool will be a valuable resource for high-throughput fundamental and applied research of Scots pine and other pine species.
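The genic/intergenic split reported above can be checked with a few lines of arithmetic. The sketch below uses only the two probe counts quoted in the abstract; the function and variable names are illustrative and are not taken from the paper.

```python
# Probe counts quoted in the abstract above.
TOTAL_PROBES = 49_877
GENIC_PROBES = 20_372


def genic_share(genic: int, total: int) -> float:
    """Return the percentage of probes that fall inside gene models."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * genic / total


# Probes not inside gene models are treated as intergenic, as in the abstract.
intergenic_probes = TOTAL_PROBES - GENIC_PROBES
genic_pct = genic_share(GENIC_PROBES, TOTAL_PROBES)

print(f"genic: {genic_pct:.2f}%, intergenic probes: {intergenic_probes}")
```

Rounded to two decimals, the computed share agrees with the 40.84% stated in the abstract.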
Subjects
Pinus sylvestris , Pinus , Humans , Pinus sylvestris/genetics , Polymorphism, Single Nucleotide/genetics , Genotype , Plant Breeding , Pinus/genetics , DNA, Intergenic
ABSTRACT
Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.
Subjects
Evolution, Molecular , Genome, Plant/genetics , Picea/genetics , Conserved Sequence/genetics , DNA Transposable Elements/genetics , Gene Silencing , Genes, Plant/genetics , Genomics , Internet , Introns/genetics , Phenotype , RNA, Untranslated/genetics , Sequence Analysis, DNA , Terminal Repeat Sequences/genetics , Transcription, Genetic/genetics
ABSTRACT
Domestic animals are excellent models for genetic studies of phenotypic evolution. They have evolved genetic adaptations to a new environment, the farm, and have been subjected to strong human-driven selection leading to remarkable phenotypic changes in morphology, physiology and behaviour. Identifying the genetic changes underlying these developments provides new insight into general mechanisms by which genetic variation shapes phenotypic diversity. Here we describe the use of massively parallel sequencing to identify selective sweeps of favourable alleles and candidate mutations that have had a prominent role in the domestication of chickens (Gallus gallus domesticus) and their subsequent specialization into broiler (meat-producing) and layer (egg-producing) chickens. We have generated 44.5-fold coverage of the chicken genome using pools of genomic DNA representing eight different populations of domestic chickens as well as red jungle fowl (Gallus gallus), the major wild ancestor. We report more than 7,000,000 single nucleotide polymorphisms, almost 1,300 deletions and a number of putative selective sweeps. One of the most striking selective sweeps found in all domestic chickens occurred at the locus for thyroid stimulating hormone receptor (TSHR), which has a pivotal role in metabolic regulation and photoperiod control of reproduction in vertebrates. Several of the selective sweeps detected in broilers overlapped genes associated with growth, appetite and metabolic regulation. We found little evidence that selection for loss-of-function mutations had a prominent role in chicken domestication, but we detected two deletions in coding sequences that we suggest are functionally important. This study has direct application to animal breeding and enhances the importance of the domestic chicken as a model organism for biomedical research.
Subjects
Chickens/genetics , Genetic Loci/genetics , Genome/genetics , Selection, Genetic/genetics , Amino Acid Sequence , Animals , Biological Evolution , Female , Male , Molecular Sequence Data , Polymorphism, Single Nucleotide , Sequence Alignment , Sequence Analysis, DNA , Sequence Deletion
ABSTRACT
BACKGROUND: Sampling genomes with Fosmid vectors and sequencing of pooled Fosmid libraries on the Illumina platform for massive parallel sequencing is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality. RESULTS: In order to sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package GAM-NGS. CONCLUSIONS: By exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process. We have made public the input data (FASTQ format) for the set of pools used in this study: ftp://congenie.org/congenie/Nystedt_2013/Assembly/ProcessedData/FosmidPools/ (alternatively accessible via http://congenie.org/downloads). The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/.
Subjects
Genetic Vectors , High-Throughput Nucleotide Sequencing/methods , Picea/genetics , Cloning, Molecular , Genome, Plant , High-Throughput Nucleotide Sequencing/economics , Software
ABSTRACT
Giardia intestinalis is a major cause of diarrheal disease worldwide and two major Giardia genotypes, assemblages A and B, infect humans. The genome of assemblage A parasite WB was recently sequenced, and the structurally compact 11.7 Mbp genome contains simplified basic cellular machineries and metabolism. We here performed 454 sequencing to 16x coverage of the assemblage B isolate GS, the only Giardia isolate successfully used to experimentally infect animals and humans. The two genomes show 77% nucleotide and 78% amino-acid identity in protein coding regions. Comparative analysis identified 28 unique GS and 3 unique WB protein coding genes, and the variable surface protein (VSP) repertoires of the two isolates are completely different. The promoters of several enzymes involved in the synthesis of the cyst-wall lack binding sites for encystation-specific transcription factors in GS. Several synteny-breaks were detected and verified. The tetraploid GS genome shows higher levels of overall allelic sequence polymorphism (0.5 versus <0.01% in WB). The genomic differences between WB and GS may explain some of the observed biological and clinical differences between the two isolates, and it suggests that assemblage A and B Giardia can be two different species.
Subjects
Genome, Protozoan , Giardia lamblia/genetics , Giardiasis/parasitology , Animals , Base Sequence , Gene Frequency , Genome, Bacterial/genetics , Giardia lamblia/classification , Humans , Introns , Molecular Sequence Data , Phylogeny , Polymorphism, Genetic , Porphyromonas gingivalis/genetics , Promoter Regions, Genetic , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , RNA Splicing , RNA, Messenger/metabolism , RNA, Protozoan/genetics , Sequence Alignment , Synteny
ABSTRACT
BACKGROUND: Segmental duplications (SD) have been found in genomes of various organisms, often accumulated at the ends of chromosomes. It has been assumed that the sequence homology between the SDs allows for ectopic interactions that may contribute to the emergence of new genes or gene variants through recombinatorial events. METHODS: In silico analysis of the 3D7 Plasmodium falciparum genome, conducted to investigate the subtelomeric compartments, led to the identification of subtelomeric SDs. Sequence variation and copy number polymorphisms of the SDs were studied by DNA sequencing, real-time quantitative PCR (qPCR) and fluorescent in situ hybridization (FISH). The levels of transcription and the developmental expression of copy number variant genes were investigated by qPCR. RESULTS: A block of six genes of >10 kilobases in size, including var, rif, pfmc-2tm and three hypothetical genes (n-, o- and q-gene), was found duplicated in the subtelomeric regions of chromosomes 1, 2, 3, 6, 7, 10 and 11 (SD1). The number of SD1 per genome was found to vary from 4 to 8 copies between different parasites. The intragenic regions of SD1 were found to be highly conserved across ten distinct fresh and long-term cultivated P. falciparum. Sequence variation was detected in an approximately 23 amino-acid long hypervariable region of a surface-exposed loop of PFMC-2TM. A hypothetical gene within SD1, the n-gene, encoding a PEXEL/VTS-containing two-transmembrane protein, was found expressed in ring stage parasites. The n-gene transcription levels were found to correlate with the number of n-gene copies. Fragments of SD1 harbouring two or three of the SD1-genes (o-gene, pfmc-2tm, q-gene) were also found in the 3D7 genome. In addition, a related second SD, SD2, of approximately 55% sequence identity to SD1 was found duplicated in a fresh clinical isolate but was only present in a single copy in 3D7 and in other P. falciparum lines or clones.
CONCLUSION: Plasmodium falciparum carries multiple sequence conserved SDs in the otherwise highly variable subtelomeres of its chromosomes. The uniqueness of the SDs amongst plasmodium species, and the conserved nature of the genes within, is intriguing and suggests an important role of the SD to P. falciparum.
Subjects
DNA, Protozoan/genetics , Gene Duplication , Plasmodium falciparum/genetics , Telomere , Animals , Computational Biology , Conserved Sequence , Gene Dosage , Gene Expression Profiling , In Situ Hybridization , Polymerase Chain Reaction , Protozoan Proteins/genetics , RNA, Messenger/biosynthesis , RNA, Protozoan/biosynthesis , Sequence Analysis, DNA , Sequence Homology
ABSTRACT
BACKGROUND: Ultra-deep pyrosequencing (UDPS) is used to identify rare sequence variants. The sequence depth is influenced by several factors including the error frequency of PCR and UDPS. This study investigated the characteristics and source of errors in raw and cleaned UDPS data. RESULTS: UDPS of a 167-nucleotide fragment of the HIV-1 SG3Δenv plasmid was performed on the Roche/454 platform. The plasmid was diluted to one copy, PCR amplified and subjected to bidirectional UDPS on three occasions. The dataset consisted of 47,693 UDPS reads. Raw UDPS data had an average error frequency of 0.30% per nucleotide site. Most errors were insertions and deletions in homopolymeric regions. We used a cleaning strategy that removed almost all indel errors but had little effect on substitution errors; this reduced the error frequency to 0.056% per nucleotide. In cleaned data the error frequency was similar in homopolymeric and non-homopolymeric regions, but varied considerably across sites. These site-specific error frequencies were moderately, but still significantly, correlated between runs (r=0.15-0.65) and between forward and reverse sequencing directions within runs (r=0.33-0.65). Furthermore, transition errors were 48 times more common than transversion errors (0.052% vs. 0.001%; p<0.0001). Collectively the results indicate that a considerable proportion of the sequencing errors that remained after data cleaning were generated during the PCR that preceded UDPS. CONCLUSIONS: A majority of the sequencing errors that remained after data cleaning were introduced by PCR prior to sequencing, which means that they will be independent of platform used for next-generation sequencing. The transition vs. transversion error bias in cleaned UDPS data will influence the detection limits of rare mutations and sequence variants.
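The transition/transversion distinction used above follows the standard definition: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), and any other substitution is a transversion. The sketch below shows how per-nucleotide error frequencies could be tallied by class. It is not the authors' pipeline; the function names and the toy mismatch list are hypothetical and serve only as illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so this is not a substitution")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def error_frequencies(mismatches, sites_sequenced: int) -> dict:
    """Return the error frequency (% per nucleotide site) for each class.

    mismatches: iterable of (reference_base, observed_base) pairs.
    sites_sequenced: total number of nucleotide sites examined.
    """
    if sites_sequenced <= 0:
        raise ValueError("sites_sequenced must be positive")
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in mismatches:
        counts[classify_substitution(ref, alt)] += 1
    return {kind: 100.0 * n / sites_sequenced for kind, n in counts.items()}


# Hypothetical toy data, not values from the study.
toy_mismatches = [("A", "G"), ("C", "T"), ("A", "C")]
toy_frequencies = error_frequencies(toy_mismatches, sites_sequenced=1000)
print(toy_frequencies)
```

In a real analysis the mismatch list would come from aligning cleaned reads to the known plasmid sequence, and `sites_sequenced` would be the number of aligned bases.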
Subjects
High-Throughput Nucleotide Sequencing/standards , Polymerase Chain Reaction/standards , Sequence Analysis, DNA/standards , Artifacts , Base Sequence , HIV-1/genetics
ABSTRACT
BACKGROUND: The tremendous output of massive parallel sequencing technologies requires automated robust and scalable sample preparation methods to fully exploit the new sequence capacity. METHODOLOGY: In this study, a method for automated library preparation of RNA prior to massively parallel sequencing is presented. The automated protocol uses precipitation onto carboxylic acid paramagnetic beads for purification and size selection of both RNA and DNA. The automated sample preparation was compared to the standard manual sample preparation. CONCLUSION/SIGNIFICANCE: The automated procedure was used to generate libraries for gene expression profiling on the Illumina HiSeq 2000 platform with the capacity of 12 samples per preparation with a significantly improved throughput compared to the standard manual preparation. The data analysis shows consistent gene expression profiles in terms of sensitivity and quantification of gene expression between the two library preparation methods.
Subjects
Gene Expression Profiling , High-Throughput Nucleotide Sequencing/methods , Automation , Cell Line, Tumor , Chemical Precipitation , DNA, Complementary/biosynthesis , Gene Expression Regulation, Neoplastic , Humans , Polymerase Chain Reaction , RNA, Neoplasm/genetics , RNA, Neoplasm/isolation & purification , RNA, Neoplasm/metabolism , Sequence Analysis, DNA
ABSTRACT
Trypanosoma cruzi is the causative agent of Chagas disease, which affects more than 9 million people in Latin America. We have generated a draft genome sequence of the TcI strain Sylvio X10/1 and compared it to the TcVI reference strain CL Brener to identify lineage-specific features. We found virtually no differences in the core gene content of CL Brener and Sylvio X10/1 by presence/absence analysis, but 6 open reading frames from CL Brener were missing in Sylvio X10/1. Several multicopy gene families, including DGF, mucin, MASP and GP63 were found to contain substantially fewer genes in Sylvio X10/1, based on sequence read estimations. 1,861 small insertion-deletion events and 77,349 nucleotide differences, 23% of which were non-synonymous and associated with radical amino acid changes, further distinguish these two genomes. There were 336 genes indicated as under positive selection, 145 unique to T. cruzi in comparison to T. brucei and Leishmania. This study provides a framework for further comparative analyses of two major T. cruzi lineages and also highlights the need for sequencing more strains to understand fully the genomic composition of this parasite.
Subjects
DNA, Protozoan/genetics , Genome, Protozoan , Sequence Analysis, DNA , Trypanosoma cruzi/genetics , DNA, Protozoan/chemistry , Humans , Latin America , Molecular Sequence Data , Mutagenesis, Insertional , Sequence Deletion , Sequence Homology , Synteny
ABSTRACT
BACKGROUND: Neurocysticercosis is a disease caused by the oral ingestion of eggs from the human parasitic worm Taenia solium. Although drugs are available they are controversial because of the side effects and poor efficiency. An expressed sequence tag (EST) library is a method used to describe the gene expression profile and sequence of mRNA from a specific organism and stage. Such information can be used in order to find new targets for the development of drugs and to get a better understanding of the parasite biology. METHODS AND FINDINGS: Here an EST library consisting of 5760 sequences from the pig cysticerca stage has been constructed. In the library 1650 unique sequences were found and of these, 845 sequences (52%) were novel to T. solium and not identified within other EST libraries. Furthermore, 918 sequences (55%) were of unknown function. Amongst the 25 most frequently expressed sequences 6 had no relevant similarity to other sequences found in the Genbank NR DNA database. A prediction of putative signal peptides was also performed and 4 among the 25 were found to be predicted with a signal peptide. Proposed vaccine and diagnostic targets T24, Tsol18/HP6 and Tso31d could also be identified among the 25 most frequently expressed. CONCLUSIONS: An EST library has been produced from pig cysticerca and analyzed. More than half of the different ESTs sequenced contained a sequence with no suggested function and 845 novel EST sequences have been identified. The library increases the knowledge about what genes are expressed and to what level. It can also be used to study different areas of research such as drug and diagnostic development together with parasite fitness via e.g. immune modulation.
Subjects
Cysticercosis/veterinary , Expressed Sequence Tags , Gene Library , Swine Diseases/parasitology , Taenia solium/genetics , Animals , Computational Biology , Sequence Analysis, DNA , Sequence Homology , Swine
ABSTRACT
BACKGROUND: Ultra-deep pyrosequencing (UDPS) allows identification of rare HIV-1 variants and minority drug resistance mutations, which are not detectable by standard sequencing. PRINCIPAL FINDINGS: Here, UDPS was used to analyze the dynamics of HIV-1 genetic variation in reverse transcriptase (RT) (amino acids 180-220) in six individuals consecutively sampled before, during and after failing 3TC and AZT containing antiretroviral treatment. Optimized UDPS protocols and bioinformatic software were developed to generate, clean and analyze the data. The data cleaning strategy reduced the error rate of UDPS to an average of 0.05%, which is lower than previously reported. Consequently, the cut-off for detection of resistance mutations was very low. A median of 16,016 (range 2,406-35,401) sequence reads were obtained per sample, which allowed detection and quantification of minority resistance mutations at amino acid position 181, 184, 188, 190, 210, 215 and 219 in RT. In four of five pre-treatment samples low levels (0.07-0.09%) of the M184I mutation were observed. Other resistance mutations, except T215A and T215I were below the detection limit. During treatment failure, M184V replaced M184I and dominated the population in combination with T215Y, while wild-type variants were rarely detected. Resistant virus disappeared rapidly after treatment interruption and was undetectable as early as after 3 months. In most patients, drug resistant variants were replaced by wild-type variants identical to those present before treatment, suggesting rebound from latent reservoirs. CONCLUSIONS: With this highly sensitive UDPS protocol preexisting drug resistance was infrequently observed; only M184I, T215A and T215I were detected at very low levels. Similarly, drug resistant variants in plasma quickly decreased to undetectable levels after treatment interruption. 
The study gives important insights into the dynamics of the HIV-1 quasispecies and is of relevance for future research and clinical use of the UDPS technology.