ABSTRACT
Microbial transglutaminases (MTGs) catalyze the formation of Gln-Lys isopeptide bonds and are widely used for the cross-linking of proteins and peptides in food and biotechnological applications (e.g. to improve the texture of protein-rich foods or in generating antibody-drug conjugates). Currently used MTGs have low substrate specificity, impeding their biotechnological use as enzymes that do not cross-react with nontarget substrates (i.e. as bio-orthogonal labeling systems). Here, we report the discovery of an MTG from Kutzneria albida (KalbTG), which exhibited no cross-reactivity with known MTG substrates or commonly used target proteins, such as antibodies. KalbTG was produced in Escherichia coli as a soluble and active enzyme in the presence of its natural inhibitor ammonium to prevent potentially toxic cross-linking activity. The crystal structure of KalbTG revealed a conserved core similar to other MTGs but very short surface loops, making it the smallest MTG characterized to date. Ultra-dense peptide array technology involving a pool of 1.4 million unique peptides identified specific recognition motifs for KalbTG in these peptides. We determined that the motifs YRYRQ and RYESK are the best Gln and Lys substrates of KalbTG, respectively. By first reacting a bifunctionalized peptide with the more specific KalbTG and in a second step with the less specific MTG from Streptomyces mobaraensis, a successful bio-orthogonal labeling system was demonstrated. Fusing the KalbTG recognition motif to an antibody allowed for site-specific and ratio-controlled labeling using low label excess. Its site specificity, favorable kinetics, ease of use, and cost-effective production render KalbTG an attractive tool for a broad range of applications, including production of therapeutic antibody-drug conjugates.
Subject(s)
Actinomycetales/enzymology , Proteins/chemistry , Proteins/metabolism , Transglutaminases/metabolism , Binding Sites , Models, Molecular , Peptides/chemistry , Peptides/metabolism , Protein Conformation , Staining and Labeling , Substrate Specificity , Transglutaminases/chemistry
ABSTRACT
Antibodies are of importance for the field of proteomics, both as reagents for imaging cells, tissues, and organs and as capturing agents for affinity enrichment in mass-spectrometry-based techniques. It is important to gain basic insights regarding the binding sites (epitopes) of antibodies and potential cross-reactivity to nontarget proteins. Knowledge about an antibody's linear epitopes is also useful in, for instance, developing assays involving the capture of peptides obtained from trypsin cleavage of samples prior to mass spectrometry analysis. Here, we describe, for the first time, the design and use of peptide arrays covering all human proteins for the analysis of antibody specificity, based on parallel in situ photolithic synthesis of a total of 2.1 million overlapping peptides. This has allowed analysis of on- and off-target binding of both monoclonal and polyclonal antibodies, complemented with precise mapping of epitopes based on full amino acid substitution scans. The analysis suggests that linear epitopes are relatively short, confined to five to seven residues, resulting in apparent off-target binding to peptides corresponding to a large number of unrelated human proteins. However, subsequent analysis using recombinant proteins suggests that these linear epitopes have a strict conformational component, thus giving us new insights regarding how antibodies bind to their antigens.
Subject(s)
Antibodies/genetics , Epitope Mapping/methods , Peptide Biosynthesis/genetics , Proteome , Amino Acid Sequence , Antibodies/immunology , Binding Sites , Epitopes/genetics , Epitopes/immunology , Humans , Mass Spectrometry , Peptide Biosynthesis/immunology , Trypsin
ABSTRACT
We applied whole-genome resequencing of Escherichia coli to monitor the acquisition and fixation of mutations that conveyed a selective growth advantage during adaptation to a glycerol-based growth medium. We identified 13 different de novo mutations in five different E. coli strains and monitored their fixation over a 44-d period of adaptation. We obtained proof that the observed spontaneous mutations were responsible for improved fitness by creating single, double and triple site-directed mutants that had growth rates matching those of the evolved strains. The success of this new genome-scale approach indicates that real-time evolution studies will now be practical in a wide variety of contexts.
Subject(s)
Directed Molecular Evolution , Escherichia coli/genetics , Genome, Bacterial , Adaptation, Physiological , Culture Media , Escherichia coli/growth & development , Escherichia coli/physiology , Genotype , Glycerol/metabolism , Mutagenesis, Site-Directed , Mutation , Selection, Genetic , Time Factors
ABSTRACT
Massively parallel DNA sequencing technologies have greatly increased our ability to generate large amounts of sequencing data at a rapid pace. Several methods have been developed to enrich for genomic regions of interest for targeted sequencing. We have compared three of these methods: Molecular Inversion Probes (MIP), Solution Hybrid Selection (SHS), and Microarray-based Genomic Selection (MGS). Using HapMap DNA samples, we compared each of these methods with respect to their ability to capture an identical set of exons and evolutionarily conserved regions associated with 528 genes (2.61 Mb). For sequence analysis, we developed and used a novel Bayesian genotype-assigning algorithm, Most Probable Genotype (MPG). All three capture methods were effective, but sensitivities (percentage of targeted bases associated with high-quality genotypes) varied for an equivalent amount of pass-filtered sequence: for example, 70% (MIP), 84% (SHS), and 91% (MGS) for 400 Mb. In contrast, all methods yielded similar accuracies of >99.84% when compared to Infinium 1M SNP BeadChip-derived genotypes and >99.998% when compared to 30-fold coverage whole-genome shotgun sequencing data. We also observed a low false-positive rate with all three methods; of the heterozygous positions identified by each of the capture methods, >99.57% agreed with 1M SNP BeadChip, and >98.840% agreed with the whole-genome shotgun data. In addition, we successfully piloted the genomic enrichment of a set of 12 pooled samples via the MGS method using molecular bar codes. We find that these three genomic enrichment methods are highly accurate and practical, with sensitivities comparable to that of 30-fold coverage whole-genome shotgun data.
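The sensitivity measure defined in the abstract above (percentage of targeted bases associated with high-quality genotypes) can be illustrated with a minimal sketch. The function name, the per-base quality scores, and the cutoff of 20 are hypothetical choices for illustration; the abstract does not state how the MPG algorithm scores genotypes or which threshold the authors applied.

```python
def sensitivity(genotype_quality, min_quality):
    """Percentage of targeted bases whose genotype call meets a quality cutoff.

    genotype_quality: one score per targeted base (None means no call was made).
    min_quality: hypothetical cutoff that defines a "high-quality" genotype.
    """
    if not genotype_quality:
        raise ValueError("no targeted bases supplied")
    # Count bases that were called and whose score reaches the cutoff.
    called = sum(1 for q in genotype_quality if q is not None and q >= min_quality)
    return 100.0 * called / len(genotype_quality)


# Toy data: 10 targeted bases, 7 of which reach the assumed cutoff of 20.
scores = [30, 12, 45, None, 50, 22, 8, 40, 33, 27]
print(sensitivity(scores, min_quality=20))
```

Uncalled bases stay in the denominator, so under this definition the value drops both when coverage is missing and when calls are of low quality.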
Subject(s)
Diabetes Mellitus, Type 2/genetics , Genome, Human , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Algorithms , Bayes Theorem , DNA/genetics , DNA Probes/genetics , Exons , Genotype , Humans , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
Many disease-associated variants identified by genome-wide association (GWA) studies are expected to regulate gene expression. Allele-specific expression (ASE) quantifies transcription from both haplotypes using individuals heterozygous at tested SNPs. We performed deep human transcriptome-wide resequencing (RNA-seq) for ASE analysis and expression quantitative trait locus discovery. We resequenced double poly(A)-selected RNA from primary CD4(+) T cells (n = 4 individuals, both activated and untreated conditions) and developed tools for paired-end RNA-seq alignment and ASE analysis. We generated an average of 20 million uniquely mapping 45 base reads per sample. We obtained sufficient read depth to test 1371 unique transcripts for ASE. Multiple biases inflate the false discovery rate, which we estimate to be approximately 50% for random SNPs. However, after controlling for these biases and considering the subset of SNPs that pass HapMap QC, 4.6% of heterozygous SNP-sample pairs show evidence of imbalance (P < 0.001). We validated four findings by both bacterial cloning and Sanger sequencing assays. We also found convincing evidence for allelic imbalance at multiple reporter exonic SNPs in CD6 for two samples heterozygous at the multiple sclerosis-associated variant rs17824933, linking GWA findings with variation in gene expression. Finally, we show in CD4(+) T cells from a further individual that high-throughput sequencing of genomic DNA and RNA-seq following enrichment for targeted gene sequences by sequence capture methods offers an unbiased means to increase the read depth for transcripts of interest, and therefore a method to investigate the regulatory role of many disease-associated genetic variants.
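As an illustration of how allelic imbalance at a heterozygous SNP might be scored from read counts, the sketch below applies an exact two-sided binomial test. This is an assumption made for illustration: the abstract does not name the statistical test the authors used, and the function name and the default expected reference fraction of 0.5 are hypothetical. Mapping and other biases of the kind the abstract mentions could shift the true null fraction away from 0.5.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance.

    ref_reads, alt_reads: read counts supporting each allele at one SNP.
    p_ref: expected fraction of reference reads under balance (assumed 0.5).
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this SNP")

    def pmf(k):
        # Probability of seeing exactly k reference reads out of n.
        return comb(n, k) * p_ref ** k * (1 - p_ref) ** (n - k)

    observed = pmf(ref_reads)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Toy counts: 40 reference reads versus 10 alternative reads.
print(binomial_two_sided_p(40, 10) < 0.001)
```

With balanced counts such as 5 and 5 the p-value is 1, while a 40 to 10 split falls below the P < 0.001 threshold quoted in the abstract.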
Subject(s)
Allelic Imbalance/genetics , Gene Expression Profiling/methods , Genome-Wide Association Study , High-Throughput Screening Assays/methods , Sequence Analysis, DNA/methods , Alleles , Base Pairing/genetics , Bias , Cells, Cultured , Computational Biology , Disease/genetics , Epigenesis, Genetic , False Positive Reactions , Genetic Loci/genetics , Heterozygote , Humans , Polymorphism, Single Nucleotide/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Reproducibility of Results
ABSTRACT
Soybean (Glycine max) is a self-pollinating species that has relatively low nucleotide polymorphism rates compared with other crop species. Despite the low rate of nucleotide polymorphisms, a wide range of heritable phenotypic variation exists. There is even evidence for heritable phenotypic variation among individuals within some cultivars. Williams 82, the soybean cultivar used to produce the reference genome sequence, was derived from backcrossing a Phytophthora root rot resistance locus from the donor parent Kingwa into the recurrent parent Williams. To explore the genetic basis of intracultivar variation, we investigated the nucleotide, structural, and gene content variation of different Williams 82 individuals. Williams 82 individuals exhibited variation in the number and size of introgressed Kingwa loci. In these regions of genomic heterogeneity, the reference Williams 82 genome sequence consists of a mosaic of Williams and Kingwa haplotypes. Genomic structural variation between Williams and Kingwa was maintained between the Williams 82 individuals within the regions of heterogeneity. Additionally, the regions of heterogeneity exhibited gene content differences between Williams 82 individuals. These findings show that genetic heterogeneity in Williams 82 primarily originated from the differential segregation of polymorphic chromosomal regions following the backcross and single-seed descent generations of the breeding process. We conclude that soybean haplotypes can possess a high rate of structural and gene content variation, and the impact of intracultivar genetic heterogeneity may be significant. This detailed characterization will be useful for interpreting soybean genomic data sets and highlights important considerations for research communities that are developing or utilizing a reference genome sequence.
Subject(s)
Genetic Variation , Genome, Plant , Glycine max/genetics , Comparative Genomic Hybridization , DNA, Plant/genetics , Haplotypes , Inbreeding , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide , Sequence Analysis, DNA
ABSTRACT
Sequence capture technologies, pioneered in mammalian genomes, enable the resequencing of targeted genomic regions. Most capture protocols require blocking DNA, the production of which in large quantities can prove challenging. A blocker-free, two-stage capture protocol was developed using NimbleGen arrays. The first capture depletes the library of repetitive sequences, while the second enriches for target loci. This strategy was used to resequence non-repetitive portions of an approximately 2.2 Mb chromosomal interval and a set of 43 genes dispersed in the 2.3 Gb maize genome. This approach achieved approximately 1800-3000-fold enrichment and 80-98% coverage of targeted bases. More than 2500 SNPs were identified in target genes. Low rates of false-positive SNP predictions were obtained, even in the presence of captured paralogous sequences. Importantly, it was possible to recover novel sequences from non-reference alleles. The ability to design novel repeat-subtraction and target capture arrays makes this technology accessible in any species.
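Fold enrichment figures like those quoted above are commonly computed as the on-target fraction of sequenced bases divided by the fraction of the genome that the target occupies. The sketch below uses that common definition as an assumption; the abstract does not give the authors' exact formula, and the 1 Mb target size and 80% on-target fraction are hypothetical numbers (only the 2.3 Gb genome size comes from the abstract).

```python
def fold_enrichment(on_target_bases, total_bases, target_size_bp, genome_size_bp):
    """Observed on-target fraction relative to the fraction expected by chance."""
    if total_bases <= 0 or target_size_bp <= 0 or genome_size_bp <= 0:
        raise ValueError("sizes and base counts must be positive")
    observed_fraction = on_target_bases / total_bases
    expected_fraction = target_size_bp / genome_size_bp
    return observed_fraction / expected_fraction


# Hypothetical run: 80% of sequenced bases fall on a 1 Mb target
# within a 2.3 Gb genome (the genome size is taken from the abstract).
print(round(fold_enrichment(80_000_000, 100_000_000, 1_000_000, 2_300_000_000)))
```

Under these toy inputs the result is about 1840-fold, which shows why a small target in a large genome can reach enrichment values in the thousands.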
Subject(s)
Genome, Plant , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Comparative Genomic Hybridization , DNA, Plant/genetics , Genes, Plant , Polymorphism, Single Nucleotide , Zea mays/genetics
ABSTRACT
Measurements of serum prostate-specific antigen (PSA) protein levels form the basis for a widely used test to screen men for prostate cancer. Germline variants in the gene that encodes the PSA protein (KLK3) have been shown to be associated with both serum PSA levels and prostate cancer. Based on a resequencing analysis of a 56 kb region on chromosome 19q13.33, centered on the KLK3 gene, we fine mapped this locus by genotyping tag SNPs in 3,522 prostate cancer cases and 3,338 controls from five case-control studies. We did not observe a strong association with the KLK3 variant, reported in previous studies to confer risk for prostate cancer (rs2735839; P = 0.20) but did observe three highly correlated SNPs (rs17632542, rs62113212 and rs62113214) associated with prostate cancer [P = 3.41 × 10(-4), per-allele trend odds ratio (OR) = 0.77, 95% CI = 0.67-0.89]. The signal was apparent only for nonaggressive prostate cancer cases with Gleason score <7 and disease stage
Subject(s)
Chromosomes, Human, Pair 19 , Genetic Predisposition to Disease , Kallikreins/genetics , Prostate-Specific Antigen/biosynthesis , Prostatic Neoplasms/genetics , Case-Control Studies , Chromosome Mapping , Germ-Line Mutation , Humans , Male , Polymorphism, Single Nucleotide
ABSTRACT
Single nucleotide polymorphisms (SNPs) in the KLK3 gene on chromosome 19q13.33 are associated with serum prostate-specific antigen (PSA) levels. Recent genome-wide association studies of prostate cancer have yielded conflicting results for association of the same SNPs with prostate cancer risk. Since the KLK3 gene encodes the PSA protein that forms the basis for a widely used screening test for prostate cancer, it is critical to fully characterize genetic variation in this region and assess its relationship with the risk of prostate cancer. We have conducted a next-generation sequence analysis in 78 individuals of European ancestry to characterize common (minor allele frequency, MAF >1%) genetic variation in a 56 kb region on chromosome 19q13.33 centered on the KLK3 gene (chr19:56,019,829-56,076,043 bps). We identified 555 polymorphic loci in the process including 116 novel SNPs and 182 novel insertion/deletion polymorphisms (indels). Based on tagging analysis, 144 loci are necessary to tag the region at an r(2) threshold of 0.8 and MAF of 1% or higher, while 86 loci are required to tag the region at an r(2) threshold of 0.8 and MAF >5%. Our sequence data augments coverage by 35 and 78% as compared to variants in dbSNP and HapMap, respectively. We observed six non-synonymous amino acid or frame shift changes in the KLK3 gene and three changes in each of the neighboring genes, KLK15 and KLK2. Our study has generated a detailed map of common genetic variation in the genomic region surrounding the KLK3 gene, which should be useful for fine-mapping the association signal as well as determining the contribution of this locus to prostate cancer risk and/or regulation of PSA expression.
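The tagging analysis above relies on pairwise linkage disequilibrium measured as r squared. A minimal sketch of that statistic for two biallelic loci follows; it assumes phased haplotypes coded as 0/1 alleles, and the function names are hypothetical. The abstract does not describe the tagging software or the phasing method the authors used; only the threshold of 0.8 is taken from it.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom


def is_tagged(haplotypes, threshold=0.8):
    """True if one locus captures the other at the given r^2 threshold."""
    return ld_r_squared(haplotypes) >= threshold


# Perfectly correlated loci: each allele at A always travels with one allele at B.
print(ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
```

A tag-SNP selection procedure would repeat this pairwise test across all loci and keep a subset such that every locus is tagged by at least one member of the subset.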
Subject(s)
Chromosomes, Human, Pair 19/genetics , Kallikreins/genetics , Polymorphism, Single Nucleotide , Prostate-Specific Antigen/genetics , Tissue Kallikreins/genetics , Female , Gene Frequency , Haplotypes , Humans , INDEL Mutation , Linkage Disequilibrium , Male , Mutation , Prostatic Neoplasms/ethnology , Prostatic Neoplasms/genetics , Sequence Analysis, DNA , White People/genetics
ABSTRACT
We have generated extreme ionizing radiation resistance in a relatively sensitive bacterial species, Escherichia coli, by directed evolution. Four populations of Escherichia coli K-12 were derived independently from strain MG1655, with each specifically adapted to survive exposure to high doses of ionizing radiation. D(37) values for strains isolated from two of the populations approached that exhibited by Deinococcus radiodurans. Complete genomic sequencing was carried out on nine purified strains derived from these populations. Clear mutational patterns were observed that both pointed to key underlying mechanisms and guided further characterization of the strains. In these evolved populations, passive genomic protection is not in evidence. Instead, enhanced recombinational DNA repair makes a prominent but probably not exclusive contribution to genome reconstitution. Multiple genes, multiple alleles of some genes, multiple mechanisms, and multiple evolutionary pathways all play a role in the evolutionary acquisition of extreme radiation resistance. Several mutations in the recA gene and a deletion of the e14 prophage both demonstrably contribute to and partially explain the new phenotype. Mutations in additional components of the bacterial recombinational repair system and the replication restart primosome are also prominent, as are mutations in genes involved in cell division, protein turnover, and glutamate transport. At least some evolutionary pathways to extreme radiation resistance are constrained by the temporally ordered appearance of specific alleles.
Subject(s)
Directed Molecular Evolution , Escherichia coli/genetics , Escherichia coli/radiation effects , Radiation, Ionizing , Chromatography, High Pressure Liquid , Electrophoresis, Gel, Pulsed-Field , Escherichia coli/growth & development , Mutation , Phylogeny , Rec A Recombinases/genetics , Rec A Recombinases/physiology
ABSTRACT
Forward genetics (phenotype-driven approaches) remain the primary source for allelic variants in the mouse. Unfortunately, the gap between observable phenotype and causative genotype limits the widespread use of spontaneous and induced mouse mutants. As alternatives to traditional positional cloning and mutation detection approaches, sequence capture and next-generation sequencing technologies can be used to rapidly sequence subsets of the genome. Application of these technologies to mutation detection efforts in the mouse has the potential to significantly reduce the time and resources required for mutation identification by abrogating the need for high-resolution genetic mapping, long-range PCR, and sequencing of individual PCR amplimers. As proof of principle, we used array-based sequence capture and pyrosequencing to sequence an allelic series from the classically defined Kit locus (approximately 200 kb) from each of five noncomplementing Kit mutants (one known allele and four unknown alleles) and have successfully identified and validated a nonsynonymous coding mutation for each allele. These data represent the first documentation and validation that these new technologies can be used to efficiently discover causative mutations. Importantly, these data also provide a specific methodological foundation for the development of large-scale mutation detection efforts in the laboratory mouse.
Subject(s)
DNA Mutational Analysis/methods , Mice/genetics , Mutation , Oligonucleotide Array Sequence Analysis/methods , Alleles , Amino Acid Sequence , Animals , Base Sequence , Female , Male , Mice, Inbred C57BL , Mice, Inbred DBA , Molecular Sequence Data , Sequence Alignment
ABSTRACT
BACKGROUND: The syphilis spirochete Treponema pallidum ssp. pallidum remains an enigmatic pathogen, since no virulence factors have been identified and the pathogenesis of the disease is poorly understood. Increasing rates of new syphilis cases per year have been observed recently. RESULTS: The genome of the SS14 strain was sequenced to high accuracy by an oligonucleotide array strategy requiring hybridization to only three arrays (Comparative Genome Sequencing, CGS). Gaps in the resulting sequence were filled with targeted dideoxy-terminator (DDT) sequencing and the sequence was confirmed by whole genome fingerprinting (WGF). When compared to the Nichols strain, 327 single nucleotide substitutions (224 transitions, 103 transversions), 14 deletions, and 18 insertions were found. On the proteome level, the highest frequency of amino acid-altering substitution polymorphisms was in novel genes, while the lowest was in housekeeping genes, as expected by their evolutionary conservation. Evidence was also found for hypervariable regions and multiple regions showing intrastrain heterogeneity in the T. pallidum chromosome. CONCLUSION: The observed genetic changes do not influence the ability of Treponema pallidum to cause syphilitic infection, since both SS14 and Nichols are virulent in rabbits. However, this is the first assessment of the degree of variation between the two syphilis pathogens and paves the way for phylogenetic studies of this fascinating organism.
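The split of the 327 substitutions into transitions and transversions follows the standard rule that a transition swaps a purine for a purine or a pyrimidine for a pyrimidine, while a transversion crosses the two classes. The sketch below shows that rule; the function name and the toy substitution list are hypothetical and are not data from the study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Toy list of (reference, variant) base pairs, for illustration only.
toy_calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C")]
print(Counter(classify_substitution(r, a) for r, a in toy_calls))
```

Applied to a full list of strain differences, the two counts should sum to the number of single nucleotide substitutions, as 224 and 103 sum to 327 in the abstract.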
Subject(s)
Genome, Bacterial , Oligonucleotide Array Sequence Analysis/methods , Treponema pallidum/genetics , Animals , Chromosome Mapping , DNA Fingerprinting , Humans , Molecular Sequence Data , Open Reading Frames , Polymorphism, Single Nucleotide , Rabbits , Reproducibility of Results , Sequence Analysis, DNA , Syphilis/microbiology , Treponema pallidum/isolation & purification , Treponema pallidum/pathogenicity
ABSTRACT
The Gram-positive pathogen Streptococcus pneumoniae, which can be responsible for serious cases of pneumonia and meningitis, has been intensely studied for almost 100 years. Many of the key experiments have been performed in two strains: the non-pathogenic S. pneumoniae R6 and its pathogenic progenitor, S. pneumoniae D39. Whereas the genomic sequence of the R6 strain has been published, there is relatively little genomic information available on D39. Since R6 was derived from D39, we wished to explore the utility of a new technology, Comparative Genome Sequencing, which uses a set of custom oligonucleotide arrays to compare DNA sequences between similar strains. We report here the nucleotide polymorphisms identified between the R6 strain and D39 based on an R6 sequencing array. During the process, we were also able to confirm all of the high confidence changes reported by the oligonucleotide array chip by sequencing the region in the genome around the changes identified with the genome hybridization chip. We also discuss the potential impact of some of the amino acid changes found between these two widely used strains of pneumococci.
Subject(s)
Genome, Bacterial/genetics , Polymorphism, Single Nucleotide , Streptococcus pneumoniae/genetics , Amino Acid Substitution/genetics , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis , Sequence Analysis, DNA , Species Specificity , Virulence/genetics
ABSTRACT
Considerable efforts have been made to develop technologies for selection of peptidic molecules that act as substrates or binders to a protein of interest. Here we demonstrate the combination of rational peptide array library design, parallel screening and stepwise evolution, to discover novel peptide hotspots. These hotspots can be systematically evolved to create high-affinity, high-specificity binding peptides to a protein target in a reproducible and digitally controlled process. The method can be applied to synthesize both linear and cyclic peptides, as well as peptides composed of natural and non-natural amino acid analogs, thereby enabling screens in a much more diverse chemical space. We apply this method to evolve, in a stepwise manner, peptide binders to streptavidin, a protein studied for over two decades, and report novel peptides that mimic key interactions of biotin with streptavidin.
Subject(s)
Peptide Library , Peptides/metabolism , Streptavidin/metabolism , Amino Acid Sequence , Binding Sites , Molecular Docking Simulation , Peptides/chemistry , Peptides, Cyclic/chemistry , Peptides, Cyclic/metabolism , Protein Binding , Proteins/chemistry , Proteins/metabolism , Streptavidin/chemistry
ABSTRACT
Light-directed synthesis of high-density microarrays is currently performed in the 3'-->5' direction due to constraints in existing synthesis chemistry. This results in the probes being unavailable for many common types of enzymatic modification. Arrays that are synthesized in the 5'-->3' direction could be utilized to perform parallel genotyping and resequencing directly on the array surface, dramatically increasing the throughput and reducing the cost relative to existing techniques. In this report we demonstrate the use of photoprotected phosphoramidite monomers for light-directed array synthesis in the 5'-->3' direction, using maskless array synthesis technology. These arrays have a dynamic range of >2.5 orders of magnitude, sensitivity below 1 pM and a coefficient of variance of <10% across the array surface. Arrays containing >150,000 probe sequences were hybridized to labeled mouse cRNA producing highly concordant data (average R(2) = 0.998). We have also shown that the 3' ends of array probes are available for sequence-specific primer extension and ligation reactions.
Subject(s)
Oligonucleotide Array Sequence Analysis/methods , Oligonucleotides/radiation effects , Molecular Structure , Oligonucleotides/chemical synthesis , Reproducibility of Results , Sensitivity and Specificity , Time Factors , Ultraviolet Rays
ABSTRACT
Although the locations of promoters and enhancers have been identified in several cell types, we still have limited information on their connectivity. We developed HiCap, which combines a 4-cutter restriction enzyme Hi-C with sequence capture of promoter regions. Applying the method to mouse embryonic stem cells, we identified promoter-anchored interactions involving 15,905 promoters and 71,984 distal regions. The distal regions were enriched for enhancer marks and transcription, and had a mean fragment size of only 699 bp--close to single-enhancer resolution. High-resolution maps of promoter-anchored interactions with HiCap will be important for detailed characterizations of chromatin interaction landscapes.
Subject(s)
Chromatin/chemistry , Enhancer Elements, Genetic , Genomics/methods , Promoter Regions, Genetic , Animals , Chromosome Mapping , Gene Expression , Gene Regulatory Networks , Mice , Transcription Factors/metabolism
ABSTRACT
SmMAK16 from the trematode Schistosoma mansoni is a protein that is known to localize in the nucleolus. Recent findings show that SmMAK16 is involved in 60S ribosomal subunit synthesis. Although the SmMAK16 protein contains putative nuclear localization signals (NLS), little is known about their precise function, redundancy or regulation. The goal of the current study was to identify and characterize the presence and functional regulation of the localization signals in SmMAK16. The SmMAK16 coding sequence and specific fragments were individually cloned in-frame into the pEGFP-C2 expression vector to encode Green Fluorescent Protein (GFP) fusion proteins. Constructs were individually transfected into COS-7 cells and fluorescent microscopy used to determine the cellular location and thus the presence of signals regulating nuclear and nucleolar localization. SmMAK16 was found to contain two NLSs and one nucleolar localization signal (NoLS). One of the signals contains a sequence identical to an established nucleolar detention signal that reportedly functions only under acidic cellular conditions. The localization of the SmMAK16-GFP constructs was analyzed under acidic conditions; however, altering pH did not influence the localization of SmMAK16. It has been previously reported that casein kinase 2 (CK2) can phosphorylate SmMAK16 at serines adjacent to one of the NLSs. One of these CK2 sites and the adjacent NLS are conserved with that of the SV40 Large T Antigen (LTA) and phosphorylation of this site in the SV40 LTA regulates the kinetics of the NLS. To discover if kinetic regulation also occurs in SmMAK16, mutant and wild type SmMAK16-GFP proteins were purified and injected into individual COS-7 cells. No difference in the rate of transport was found between wt and mutant SmMAK16 proteins. 
Therefore, SmMAK16 localizes to the nucleolus using three separate signals, two NLSs and one NoLS; however, these signals appear to function independently of pH and phosphorylation by CK2.
Subject(s)
Helminth Proteins/genetics , Helminth Proteins/metabolism , Nuclear Localization Signals , Schistosoma mansoni/genetics , Schistosoma mansoni/metabolism , Animals , COS Cells , Casein Kinase II/metabolism , Cell Nucleus/chemistry , Chlorocebus aethiops , DNA Mutational Analysis , Genes, Reporter , Green Fluorescent Proteins/analysis , Green Fluorescent Proteins/genetics , Hydrogen-Ion Concentration , Microscopy, Fluorescence , Phosphorylation , Recombinant Fusion Proteins/analysis , Recombinant Fusion Proteins/genetics
ABSTRACT
BACKGROUND: Enrichment of loci by DNA hybridization-capture, followed by high-throughput sequencing, is an important tool in modern genetics. Currently, the most common targets for enrichment are the protein coding exons represented by the consensus coding DNA sequence (CCDS). The CCDS, however, excludes many actual or computationally predicted coding exons present in other databases, such as RefSeq and Vega, and non-coding functional elements such as untranslated and regulatory regions. The number of variants per base pair (variant density) and our ability to interrogate regions outside of the CCDS regions is consequently less well understood. RESULTS: We examine capture sequence data from outside of the CCDS regions and find that extremes of GC content that are present in different subregions of the genome can reduce the local capture sequence coverage to less than 50% relative to the CCDS. This effect is due to biases inherent in both the Illumina and SOLiD sequencing platforms that are exacerbated by the capture process. Interestingly, for two subregion types, microRNA and predicted exons, the capture process yields higher than expected coverage when compared to whole genome sequencing. Lastly, we examine the variation present in non-CCDS regions and find that predicted exons, as well as exonic regions specific to RefSeq and Vega, show much higher variant densities than the CCDS. CONCLUSIONS: We show that regions outside of the CCDS perform less efficiently in capture sequence experiments. Further, we show that the variant density in computationally predicted exons is more than 2.5-times higher than that observed in the CCDS.
Subject(s)
Consensus Sequence , Exome , Exons , Open Reading Frames/genetics , Sequence Analysis, DNA , Alleles , Computational Biology , Gene Frequency , Genome, Human , Humans , Introns , Mutation Rate , Polymorphism, Single Nucleotide
ABSTRACT
We have developed a solution-based method for targeted DNA capture-sequencing that is directed to the complete human exome. This approach allows the discovery of more than 95% of all expected heterozygous single base variants, requires as little as 3 Gbp of raw sequence data, and constitutes an effective tool for identifying rare coding alleles in large-scale genomic studies.
Subject(s)
Base Pairing/genetics , Databases, Nucleic Acid , Exons/genetics , Sequence Analysis, DNA/methods , Gene Library , Haplotypes/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results , Sequence Alignment , Solutions
ABSTRACT
We have developed an optimized array-based approach for customizable allele-specific gene expression (ASE) analysis. The central features of the approach are the ability to select SNPs at will for detection, and the absence of need to PCR amplify the target. A surprisingly long probe length (39-49 nt) was needed for allelic discrimination. Reconstitution experiments demonstrate linearity of ASE over a broad range. Using this approach, we have discovered at least two novel imprinted genes: NLRP2, which encodes a member of the inflammasome, and OSBPL1A, which encodes a presumed oxysterol-binding protein, both of which were preferentially expressed from the maternal allele. In contrast, ERAP2, which encodes an aminopeptidase, did not show preferential parent-of-origin expression, but rather cis-acting nonimprinted differential allelic control. The approach is scalable to the whole genome and can be used for discovery of functional epigenetic modifications in patient samples.