
1.

Widespread polycistronic gene expression in green algae.

Gallaher, Sean D; Craig, Rory J; Ganesan, Iniyan; Purvine, Samuel O; McCorkle, Sean R; Grimwood, Jane; Strenkert, Daniela; Davidi, Lital; Roth, Melissa S; Jeffers, Tim L; Lipton, Mary S; Niyogi, Krishna K; Schmutz, Jeremy; Theg, Steven M; Blaby-Haas, Crysten E; Merchant, Sabeeha S.

Proc Natl Acad Sci U S A ; 118(7)2021 02 16.

Article En | MEDLINE | ID: mdl-33579822

Polycistronic gene expression, common in prokaryotes, was thought to be extremely rare in eukaryotes. The development of long-read sequencing of full-length transcript isomers (Iso-Seq) has facilitated a reexamination of that dogma. Using Iso-Seq, we discovered hundreds of examples of polycistronic expression of nuclear genes in two divergent species of green algae: Chlamydomonas reinhardtii and Chromochloris zofingiensis. Here, we employ a range of independent approaches to validate that multiple proteins are translated from a common transcript for hundreds of loci. A chromatin immunoprecipitation analysis using trimethylation of lysine 4 on histone H3 marks confirmed that transcription begins exclusively at the upstream gene. Quantification of polyadenylated [poly(A)] tails and poly(A) signal sequences confirmed that transcription ends exclusively after the downstream gene. Coexpression analysis found nearly perfect correlation for open reading frames (ORFs) within polycistronic loci, consistent with expression in a shared transcript. For many polycistronic loci, terminal peptides from both ORFs were identified from proteomics datasets, consistent with independent translation. Synthetic polycistronic gene pairs were transcribed and translated in vitro to recapitulate the production of two distinct proteins from a common transcript. The relative abundance of these two proteins can be modified by altering the Kozak-like sequence of the upstream gene. Replacement of the ORFs with selectable markers or reporters allows production of such heterologous proteins, speaking to utility in synthetic biology approaches. Conservation of a significant number of polycistronic gene pairs between C. reinhardtii, C. zofingiensis, and five other species suggests that this mechanism may be evolutionarily ancient and biologically important in the green algal lineage.

Chlorophyta/genetics , Gene Expression Regulation, Bacterial , Gene Expression Regulation, Plant , Plant Proteins/genetics , Open Reading Frames , Plant Proteins/metabolism , RNA, Messenger/genetics , Transcription, Genetic

2.

The ModelSEED Biochemistry Database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbes.

Seaver, Samuel M D; Liu, Filipe; Zhang, Qizhi; Jeffryes, James; Faria, José P; Edirisinghe, Janaka N; Mundy, Michael; Chia, Nicholas; Noor, Elad; Beber, Moritz E; Best, Aaron A; DeJongh, Matthew; Kimbrel, Jeffrey A; D'haeseleer, Patrik; McCorkle, Sean R; Bolton, Jay R; Pearson, Erik; Canon, Shane; Wood-Charlson, Elisha M; Cottingham, Robert W; Arkin, Adam P; Henry, Christopher S.

Nucleic Acids Res ; 49(D1): D575-D588, 2021 01 08.

Article En | MEDLINE | ID: mdl-32986834

For over 10 years, ModelSEED has been a primary resource for the construction of draft genome-scale metabolic models based on annotated microbial or plant genomes. Now being released, the biochemistry database serves as the foundation of biochemical data underlying ModelSEED and KBase. The biochemistry database embodies several properties that, taken together, distinguish it from other published biochemistry resources by: (i) including compartmentalization, transport reactions, charged molecules and proton balancing on reactions; (ii) being extensible by the user community, with all data stored in GitHub; and (iii) design as a biochemical 'Rosetta Stone' to facilitate comparison and integration of annotations from many different tools and databases. The database was constructed by combining chemical data from many resources, applying standard transformations, identifying redundancies and computing thermodynamic properties. The ModelSEED biochemistry is continually tested using flux balance analysis to ensure the biochemical network is modeling-ready and capable of simulating diverse phenotypes. Ontologies can be designed to aid in comparing and reconciling metabolic reconstructions that differ in how they represent various metabolic pathways. ModelSEED now includes 33,978 compounds and 36,645 reactions, available as a set of extensible files on GitHub, and available to search at https://modelseed.org/biochem and KBase.

Bacteria/metabolism , Databases, Factual , Fungi/metabolism , Metabolic Networks and Pathways , Molecular Sequence Annotation , Plants/metabolism , Bacteria/genetics , Genome, Bacterial , Thermodynamics

3.

The ModelSEED Biochemistry Database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbes.

Seaver, Samuel M D; Liu, Filipe; Zhang, Qizhi; Jeffryes, James; Faria, José P; Edirisinghe, Janaka N; Mundy, Michael; Chia, Nicholas; Noor, Elad; Beber, Moritz E; Best, Aaron A; DeJongh, Matthew; Kimbrel, Jeffrey A; D'haeseleer, Patrik; McCorkle, Sean R; Bolton, Jay R; Pearson, Erik; Canon, Shane; Wood-Charlson, Elisha M; Cottingham, Robert W; Arkin, Adam P; Henry, Christopher S.

Nucleic Acids Res ; 49(D1): D1555, 2021 01 08.

Article En | MEDLINE | ID: mdl-33179751

4.

Cell context dependent p53 genome-wide binding patterns and enrichment at repeats.

Botcheva, Krassimira; McCorkle, Sean R.

PLoS One ; 9(11): e113492, 2014.

Article En | MEDLINE | ID: mdl-25415302

The p53 ability to elicit stress specific and cell type specific responses is well recognized, but how that specificity is established remains to be defined. Whether upon activation p53 binds to its genomic targets in a cell type and stress type dependent manner is still an open question. Here we show that the p53 binding to the human genome is selective and cell context-dependent. We mapped the genomic binding sites for the endogenous wild type p53 protein in the human cancer cell line HCT116 and compared them to those we previously determined in the normal cell line IMR90. We report distinct p53 genome-wide binding landscapes in two different cell lines, analyzed under the same treatment and experimental conditions, using the same ChIP-seq approach. This is evidence for cell context dependent p53 genomic binding. The observed differences affect the p53 binding sites distribution with respect to major genomic and epigenomic elements (promoter regions, CpG islands and repeats). We correlated the high-confidence p53 ChIP-seq peaks positions with the annotated human repeats (UCSC Human Genome Browser) and observed both common and cell line specific trends. In HCT116, the p53 binding was specifically enriched at LINE repeats, compared to IMR90 cells. The p53 genome-wide binding patterns in HCT116 and IMR90 likely reflect the different epigenetic landscapes in these two cell lines, resulting from cancer-associated changes (accumulated in HCT116) superimposed on tissue specific differences (HCT116 has epithelial, while IMR90 has mesenchymal origin). Our data support the model for p53 binding to the human genome in a highly selective manner, mobilizing distinct sets of genes, contributing to distinct pathways.

DNA/metabolism , Genome, Human , Long Interspersed Nucleotide Elements , Tumor Suppressor Protein p53/metabolism , Binding Sites , Cell Line , Chromatin Immunoprecipitation , Epigenesis, Genetic , HCT116 Cells , Humans , Organ Specificity , Tumor Suppressor Protein p53/genetics

5.

Distinct p53 genomic binding patterns in normal and cancer-derived human cells.

Botcheva, Krassimira; McCorkle, Sean R; McCombie, W R; Dunn, John J; Anderson, Carl W.

Cell Cycle ; 10(24): 4237-49, 2011 Dec 15.

Article En | MEDLINE | ID: mdl-22127205

We report here genome-wide analysis of the tumor suppressor p53 binding sites in normal human cells. 743 high-confidence ChIP-seq peaks representing putative genomic binding sites were identified in normal IMR90 fibroblasts using a reference chromatin sample. More than 40% were located within 2 kb of a transcription start site (TSS), a distribution similar to that documented for individually studied, functional p53 binding sites and, to date, not observed by previous p53 genome-wide studies. Nearly half of the high-confidence binding sites in the IMR90 cells reside in CpG islands, in marked contrast to sites reported in cancer-derived cells. The distinct genomic features of the IMR90 binding sites do not reflect a distinct preference for specific sequences, since the de novo developed p53 motif based on our study is similar to those reported by genome-wide studies of cancer cells. More likely, the different chromatin landscape in normal, compared with cancer-derived cells, influences p53 binding via modulating availability of the sites. We compared the IMR90 ChIPseq peaks to the recently published IMR90 methylome and demonstrated that they are enriched at hypomethylated DNA. Our study represents the first genome-wide, de novo mapping of p53 binding sites in normal human cells and reveals that p53 binding sites reside in distinct genomic landscapes in normal and cancer-derived human cells.

DNA/genetics , Tumor Suppressor Protein p53/metabolism , Base Sequence , Binding Sites/genetics , Chromatin Immunoprecipitation , CpG Islands/genetics , DNA/metabolism , DNA Methylation/genetics , Fibroblasts , Genomics/methods , Humans , Molecular Sequence Data , Sequence Analysis, DNA , Tumor Suppressor Protein p53/genetics

6.

Differential binding of Escherichia coli McrA protein to DNA sequences that contain the dinucleotide m5CpG.

Mulligan, Elizabeth A; Hatchwell, Eli; McCorkle, Sean R; Dunn, John J.

Nucleic Acids Res ; 38(6): 1997-2005, 2010 Apr.

Article En | MEDLINE | ID: mdl-20015968

The Escherichia coli McrA protein, a putative C(5)-methylcytosine/C(5)-hydroxyl methylcytosine-specific nuclease, binds DNA with symmetrically methylated HpaII sequences (Cm5CGG), but its precise recognition sequence remains undefined. To determine McrA's binding specificity, we cloned and expressed recombinant McrA with a C-terminal StrepII tag (rMcrA-S) to facilitate protein purification and affinity capture of human DNA fragments with m5C residues. Sequence analysis of a subset of these fragments and electrophoretic mobility shift assays with model methylated and unmethylated oligonucleotides suggest that N(Y > R) m5CGR is the canonical binding site for rMcrA-S. In addition to binding HpaII-methylated double-stranded DNA, rMcrA-S binds DNA containing a single, hemimethylated HpaII site; however, it does not bind if A, C, T or U is placed across from the m5C residue, but does if I is opposite the m5C. These results provide the first systematic analysis of McrA's in vitro binding specificity.

CpG Islands , DNA Methylation , DNA Restriction Enzymes/metabolism , Escherichia coli Proteins/metabolism , 5-Methylcytosine/analysis , Base Sequence , Binding Sites , DNA/chemistry , DNA/metabolism , Humans

7.

Bioprospecting metagenomes: glycosyl hydrolases for converting biomass.

Li, Luen-Luen; McCorkle, Sean R; Monchy, Sebastien; Taghavi, Safiyh; van der Lelie, Daniel.

Biotechnol Biofuels ; 2: 10, 2009 May 18.

Article En | MEDLINE | ID: mdl-19450243

Throughout immeasurable time, microorganisms evolved and accumulated remarkable physiological and functional heterogeneity, and now constitute the major reserve for genetic diversity on earth. Using metagenomics, namely genetic material recovered directly from environmental samples, this biogenetic diversification can be accessed without the need to cultivate cells. Accordingly, microbial communities and their metagenomes, isolated from biotopes with high turnover rates of recalcitrant biomass, such as lignocellulosic plant cell walls, have become a major resource for bioprospecting; furthermore, this material is a major asset in the search for new biocatalytics (enzymes) for various industrial processes, including the production of biofuels from plant feedstocks. However, despite the contributions from metagenomics technologies consequent upon the discovery of novel enzymes, this relatively new enterprise requires major improvements. In this review, we compare function-based metagenome screening and sequence-based metagenome data mining, discussing the advantages and limitations of both methods. We also describe the unusual enzymes discovered via metagenomics approaches, and discuss the future prospects for metagenome technologies.

8.

A new binding motif for the transcriptional repressor REST uncovers large gene networks devoted to neuronal functions.

Otto, Stefanie J; McCorkle, Sean R; Hover, John; Conaco, Cecilia; Han, Jong-Jin; Impey, Soren; Yochum, Gregory S; Dunn, John J; Goodman, Richard H; Mandel, Gail.

J Neurosci ; 27(25): 6729-39, 2007 Jun 20.

Article En | MEDLINE | ID: mdl-17581960

The repressor element 1 (RE1) silencing transcription factor (REST) helps preserve the identity of nervous tissue by silencing neuronal genes in non-neural tissues. Moreover, in an epithelial model of tumorigenesis, loss of REST function is associated with loss of adhesion, suggesting the aberrant expression of REST-controlled genes encoding this property. To date, no adhesion molecules under REST control have been identified. Here, we used serial analysis of chromatin occupancy to perform genome-wide identification of REST-occupied target sequences (RE1 sites) in a kidney cell line. We discovered novel REST-binding motifs and found that the number of RE1 sites far exceeded previous estimates. A large family of targets encoding adhesion proteins was identified, as were genes encoding signature proteins of neuroendocrine tumors. Unexpectedly, genes considered exclusively non-neuronal also contained an RE1 motif and were expressed in neurons. This supports the model that REST binding is a critical determinant of neuronal phenotype.

Gene Regulatory Networks/physiology , Neurons/physiology , Repressor Proteins/genetics , Repressor Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Amino Acid Motifs , Animals , Binding Sites/physiology , Cell Line , Gene Expression Profiling , Mice , Neurons/metabolism , Repressor Proteins/biosynthesis , Transcription Factors/biosynthesis

9.

Paired-end genomic signature tags: a method for the functional analysis of genomes and epigenomes.

Dunn, John J; McCorkle, Sean R; Everett, Logan; Anderson, Carl W.

Genet Eng (N Y) ; 28: 159-73, 2007.

Article En | MEDLINE | ID: mdl-17153938

Because paired-end genomic signature tags are sequence-based, they have the potential to become an alternate tool to tiled microarray hybridization as a method for genome-wide localization of transcription factors and other sequence-specific DNA binding proteins. As outlined here, the method also can be used for global analysis of DNA methylation. One advantage of this approach is the ability to easily switch between different genome types without having to fabricate a new microarray for each and every DNA type. However, the method does have some disadvantages. Among the most rate-limiting steps of our PE-GST protocol are the need to concatemerize the diTAGs, size fractionate them and then clone them prior to sequencing. This is usually followed by additional steps to amplify and size select for long (≥ 500) concatemer inserts prior to sequencing. These time-consuming steps are important for standard DNA sequencing as they increase efficiency approximately 20-30-fold since each amplified concatemer can now provide information on multiple tags; the limitation on data acquisition is read length during sequencing. However, the development of new sequencing methods such as 454 Life Sciences' new nanotechnology-based sequencing instrument (41) could increase tag sequencing efficiency by several orders of magnitude (≥ 100,000 diTAG reads/run), which is sufficient to provide in-depth global analysis of all ChIP PE-GSTs in a single run. This is because the lengths of our paired-end diTAGs (approximately 60 bp) fall well within the region of high accuracy for read lengths on this instrument. In principle, sequence analysis of diTAGs could begin as soon as they are generated, thereby completely bypassing the need for the concatemerization, sizing, downstream cloning steps and sequencing template purification.
In addition, our protocol places any one of several unique four-base long nucleotide sequences, such as GATC, between each and every diTAG pair, which could be used to help the instrument's software keep base register and also provide a well-located peak height indicator in the middle of every sequence run. This additional feature could permit multiplexing of the data by simultaneous sequencing of several pooled libraries if each used a different linker sequence during diTAG formation (Figure 4).

Genomics/methods , Base Sequence , Chromatin Immunoprecipitation , CpG Islands , DNA/chemistry , DNA/genetics , DNA Methylation , DNA Restriction Enzymes , Epigenesis, Genetic , Genetic Engineering , Genome , Molecular Sequence Data

10.

A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome.

Zhang, Chaolin; Xuan, Zhenyu; Otto, Stefanie; Hover, John R; McCorkle, Sean R; Mandel, Gail; Zhang, Michael Q.

Nucleic Acids Res ; 34(8): 2238-46, 2006.

Article En | MEDLINE | ID: mdl-16670430

Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein-DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection.

Promoter Regions, Genetic , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites , Conserved Sequence , Genomics , Humans , Mice , Rats , Repressor Proteins/metabolism

11.

Linking enzyme sequence to function using Conserved Property Difference Locator to identify and annotate positions likely to control specific functionality.

Mayer, Kimberly M; McCorkle, Sean R; Shanklin, John.

BMC Bioinformatics ; 6: 284, 2005 Nov 30.

Article En | MEDLINE | ID: mdl-16318626

BACKGROUND: Families of homologous enzymes evolved from common progenitors. The availability of multiple sequences representing each activity presents an opportunity for extracting information specifying the functionality of individual homologs. We present a straightforward method for the identification of residues likely to determine class specific functionality in which multiple sequence alignments are converted to an annotated graphical form by the Conserved Property Difference Locator (CPDL) program. RESULTS: Three test cases, each comprised of two groups of functionally-distinct homologs, are presented. Of the test cases, one is a membrane and two are soluble enzyme families. The desaturase/hydroxylase data was used to design and test the CPDL algorithm because a comparative sequence approach had been successfully applied to manipulate the specificity of these enzymes. The other two cases, ATP/GTP cyclases, and MurD/MurE synthases were chosen because they are well characterized structurally and biochemically. For the desaturase/hydroxylase enzymes, the ATP/GTP cyclases and the MurD/MurE synthases, groups of 8 (of approximately 400), 4 (of approximately 150) and 10 (of >400) residues, respectively, of interest were identified that contain empirically defined specificity determining positions. CONCLUSION: CPDL consistently identifies positions near enzyme active sites that include those predicted from structural and/or biochemical studies to be important for specificity and/or function. This suggests that CPDL will have broad utility for the identification of potential class determining residues based on multiple sequence analysis of groups of homologous proteins. Because the method is sequence, rather than structure, based it is equally well suited for designing structure-function experiments to investigate membrane and soluble proteins.

Computational Biology/methods , Genomics/methods , Algorithms , Amino Acid Sequence , Animals , Arabidopsis/enzymology , Binding Sites , Dipeptides/chemistry , Models, Biological , Models, Molecular , Molecular Sequence Data , Peptidoglycan/chemistry , Programming Languages , Protein Structure, Tertiary , Software , Structure-Activity Relationship , Uridine Diphosphate N-Acetylmuramic Acid/chemistry
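
The comparison described in the abstract above (locating alignment positions that are conserved within each of two functional groups but differ between them) can be illustrated with a minimal sketch. This is not the CPDL program itself: the function name `conserved_difference_positions` is hypothetical, and the sketch assumes strict residue identity within each group, whereas the abstract indicates that CPDL works with conserved properties, which is not modeled here.

```python
from typing import List


def conserved_difference_positions(group_a: List[str], group_b: List[str]) -> List[int]:
    """Return 0-based alignment columns where each group is internally
    identical but the two groups carry different residues.

    Simplifying assumption (not from the source): conservation means
    strict identity, and columns containing gaps ('-') are skipped.
    """
    if not group_a or not group_b:
        raise ValueError("both groups need at least one aligned sequence")
    length = len(group_a[0])
    if any(len(seq) != length for seq in group_a + group_b):
        raise ValueError("all aligned sequences must have the same length")

    hits = []
    for col in range(length):
        residues_a = {seq[col] for seq in group_a}
        residues_b = {seq[col] for seq in group_b}
        conserved = len(residues_a) == 1 and len(residues_b) == 1
        has_gap = "-" in residues_a or "-" in residues_b
        if conserved and not has_gap and residues_a != residues_b:
            hits.append(col)
    return hits


if __name__ == "__main__":
    # Toy alignment: columns 1 and 4 separate the two groups.
    group_a = ["MKTAL", "MKSAL"]
    group_b = ["MRTAV", "MRSAV"]
    print(conserved_difference_positions(group_a, group_b))
```

Because the check uses only aligned sequences, it applies equally to membrane and soluble protein families, in line with the sequence-based rationale given in the abstract.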

12.

Defining the CREB regulon: a genome-wide analysis of transcription factor regulatory regions.

Impey, Soren; McCorkle, Sean R; Cha-Molstad, Hyunjoo; Dwyer, Jami M; Yochum, Gregory S; Boss, Jeremy M; McWeeney, Shannon; Dunn, John J; Mandel, Gail; Goodman, Richard H.

Cell ; 119(7): 1041-54, 2004 Dec 29.

Article En | MEDLINE | ID: mdl-15620361

The CREB transcription factor regulates differentiation, survival, and synaptic plasticity. The complement of CREB targets responsible for these responses has not been identified, however. We developed a novel approach to identify CREB targets, termed serial analysis of chromatin occupancy (SACO), by combining chromatin immunoprecipitation (ChIP) with a modification of SAGE. Using a SACO library derived from rat PC12 cells, we identified approximately 41,000 genomic signature tags (GSTs) that mapped to unique genomic loci. CREB binding was confirmed for all loci supported by multiple GSTs. Of the 6302 loci identified by multiple GSTs, 40% were within 2 kb of the transcriptional start of an annotated gene, 49% were within 1 kb of a CpG island, and 72% were within 1 kb of a putative cAMP-response element (CRE). A large fraction of the SACO loci delineated bidirectional promoters and novel antisense transcripts. This study represents the most comprehensive definition of transcription factor binding sites in a metazoan species.

Cyclic AMP Response Element-Binding Protein/metabolism , Genomics , Regulon/genetics , Response Elements/genetics , Transcription Factors/metabolism , Animals , Binding Sites , Chromatin Immunoprecipitation/methods , CpG Islands/genetics , Cyclic AMP/metabolism , Cyclic AMP Response Element-Binding Protein/genetics , DNA/genetics , DNA/metabolism , Gene Expression Regulation/drug effects , Gene Expression Regulation/radiation effects , Gene Library , Genome , Oligonucleotide Array Sequence Analysis , PC12 Cells , RNA, Antisense/genetics , RNA, Antisense/metabolism , Rats , Reproducibility of Results , Transcription Factors/genetics , Transcription, Genetic/genetics

13.

Transcript profiling of human platelets using microarray and serial analysis of gene expression.

Gnatenko, Dmitri V; Dunn, John J; McCorkle, Sean R; Weissmann, David; Perrotta, Peter L; Bahou, Wadie F.

Blood ; 101(6): 2285-93, 2003 Mar 15.

Article En | MEDLINE | ID: mdl-12433680

Human platelets are anucleate blood cells that retain cytoplasmic mRNA and maintain functionally intact protein translational capabilities. We have adapted complementary techniques of microarray and serial analysis of gene expression (SAGE) for genetic profiling of highly purified human blood platelets. Microarray analysis using the Affymetrix HG-U95Av2 approximately 12,600-probe set maximally identified the expression of 2147 (range, 13%-17%) platelet-expressed transcripts, with approximately 22% collectively involved in metabolism and receptor/signaling, and an overrepresentation of genes with unassigned function (32%). In contrast, a modified SAGE protocol using the Type IIS restriction enzyme MmeI (generating 21-base pair [bp] or 22-bp tags) demonstrated that 89% of tags represented mitochondrial (mt) transcripts (enriched in 16S and 12S ribosomal RNAs), presumably related to persistent mt-transcription in the absence of nuclear-derived transcripts. The frequency of non-mt SAGE tags paralleled average difference values (relative expression) for the most "abundant" transcripts as determined by microarray analysis, establishing the concordance of both techniques for platelet profiling. Quantitative reverse transcription-polymerase chain reaction (PCR) confirmed the highest frequency of mt-derived transcripts, along with the mRNAs for neurogranin (NGN, a protein kinase C substrate) and the complement lysis inhibitor clusterin among the top 5 most abundant transcripts. For confirmatory characterization, immunoblots and flow cytometric analyses were performed, establishing abundant cell-surface expression of clusterin and intracellular expression of NGN. These observations demonstrate a strong correlation between high transcript abundance and protein expression, and they establish the validity of transcript analysis as a tool for identifying novel platelet proteins that may regulate normal and pathologic platelet (and/or megakaryocyte) functions.

Blood Platelets/chemistry , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis , RNA, Messenger/blood , Base Sequence , Blood Platelets/metabolism , Calmodulin-Binding Proteins/blood , Calmodulin-Binding Proteins/genetics , Cell Separation , Clusterin , Deoxyribonucleases, Type II Site-Specific/metabolism , Gene Library , Glycoproteins/blood , Glycoproteins/genetics , Humans , Mitochondria/chemistry , Molecular Chaperones/blood , Molecular Chaperones/genetics , Nerve Tissue Proteins/blood , Nerve Tissue Proteins/genetics , Neurogranin , Reverse Transcriptase Polymerase Chain Reaction

14.

Genomic signature tags (GSTs): a system for profiling genomic DNA.

Dunn, John J; McCorkle, Sean R; Praissman, Laura A; Hind, Geoffrey; Van Der Lelie, Daniel; Bahou, Wadie F; Gnatenko, Dmitri V; Krause, Maureen K.

Genome Res ; 12(11): 1756-65, 2002 Nov.

Article En | MEDLINE | ID: mdl-12421763

Genomic signature tags (GSTs) are the products of a method we have developed for identifying and quantitatively analyzing genomic DNAs. The DNA is initially fragmented with a type II restriction enzyme. An oligonucleotide adaptor containing a recognition site for MmeI, a type IIS restriction enzyme, is then used to release 21-bp tags from fixed positions in the DNA relative to the sites recognized by the fragmenting enzyme. These tags are PCR-amplified, purified, concatenated, and then cloned and sequenced. The tag sequences and abundances are used to create a high-resolution GST sequence profile of the genomic DNA. GSTs are shown to be long enough for use as oligonucleotide primers to amplify adjacent segments of the DNA, which can then be sequenced to provide additional nucleotide information or used as probes to identify specific clones in metagenomic libraries. GST analysis of the 4.7-Mb Yersinia pestis EV766 genome using BamHI as the fragmenting enzyme and NlaIII as the tagging enzyme validated the precision of our approach. The GST profile predicts that this strain has several changes relative to the archetype CO92 strain, including deletion of a 57-kb region of the chromosome known to be an unstable pathogenicity island.

DNA Fingerprinting/methods , DNA, Bacterial/analysis , Binding Sites/genetics , DNA Fragmentation/genetics , DNA, Bacterial/metabolism , Deoxyribonuclease BamHI/metabolism , Deoxyribonucleases, Type II Site-Specific/genetics , Gene Library , Genome, Bacterial , Ligases/metabolism , Nucleic Acid Amplification Techniques/methods , Oligonucleotides/genetics , Polymerase Chain Reaction/methods , Yersinia pestis/genetics
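
The idea of tags taken from fixed positions relative to a recognition site, then counted to build a profile, can be sketched in silico. This is a loose illustration, not the published protocol: the function name `extract_tags` is hypothetical, and the sketch assumes that each tag is the 21 bases starting at a forward-strand occurrence of the NlaIII site CATG. The role of the fragmenting enzyme, the reverse strand, and the exact offset at which MmeI releases the tag are not modeled.

```python
from collections import Counter


def extract_tags(sequence: str, site: str = "CATG", tag_length: int = 21) -> Counter:
    """Count fixed-length tags that begin at each occurrence of `site`.

    Assumptions (for illustration only): forward strand only, the tag
    includes the recognition site itself, and sites too close to the
    end of the sequence to yield a full-length tag are ignored.
    """
    seq = sequence.upper()
    tags: Counter = Counter()
    start = seq.find(site)
    while start != -1:
        tag = seq[start:start + tag_length]
        if len(tag) == tag_length:
            tags[tag] += 1
        # Step by one base so overlapping sites are not missed.
        start = seq.find(site, start + 1)
    return tags


if __name__ == "__main__":
    toy_genome = ("CATG" + "A" * 17) * 2 + "CATGTT"
    for tag, count in extract_tags(toy_genome).items():
        print(tag, count)
```

The resulting tag-to-count mapping plays the role of the sequence profile described in the abstract; comparing two such profiles would flag tags that are present in one genome and absent in another.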