ABSTRACT
Nucleosome positioning on the chromatin strand plays a critical role in regulating accessibility of DNA to transcription factors and chromatin modifying enzymes. Hence, detailed information on nucleosome depletion or movement at cis-acting regulatory elements has the potential to identify predicted binding sites for trans-acting factors. Using a novel method based on enrichment of mononucleosomal DNA by bacterial artificial chromosome hybridization, we mapped nucleosome positions by deep sequencing across 250 kb, encompassing the cystic fibrosis transmembrane conductance regulator (CFTR) gene. CFTR shows tight tissue-specific regulation of expression, which is largely determined by cis-regulatory elements that lie outside the gene promoter. Although multiple elements are known, the repertoire of transcription factors that interact with these sites to activate or repress CFTR expression remains incomplete. Here, we show that specific nucleosome depletion corresponds to well-characterized binding sites for known trans-acting factors, including hepatocyte nuclear factor 1, Forkhead box A1 and CCCTC-binding factor. Moreover, the cell-type selective nucleosome positioning is effective in predicting binding sites for novel interacting factors, such as BAF155. Finally, we identify transcription factor binding sites that are overrepresented in regions where nucleosomes are depleted in a cell-specific manner. This approach recognizes the glucocorticoid receptor as a novel trans-acting factor that regulates CFTR expression in vivo.
Subjects
Chromosome Mapping , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Gene Silencing , Nucleosomes/metabolism , Glucocorticoid Receptors/physiology , Binding Sites , CCCTC-Binding Factor , Caco-2 Cells , Chromatin Immunoprecipitation , Cystic Fibrosis Transmembrane Conductance Regulator/metabolism , Dexamethasone/pharmacology , Genetic Loci , Glucocorticoids/pharmacology , Hepatocyte Nuclear Factor 3-alpha/metabolism , Humans , Nucleosomes/genetics , Protein Binding , Glucocorticoid Receptors/metabolism , Repressor Proteins/metabolism , Response Elements , DNA Sequence Analysis , Transcription Factors/metabolism
ABSTRACT
Eukaryotic genomes are packaged into nucleosome particles that occlude the DNA from interacting with most DNA binding proteins. Nucleosomes have higher affinity for particular DNA sequences, reflecting the ability of the sequence to bend sharply, as required by the nucleosome structure. However, it is not known whether these sequence preferences have a significant influence on nucleosome position in vivo, and thus regulate the access of other proteins to DNA. Here we isolated nucleosome-bound sequences at high resolution from yeast and used these sequences in a new computational approach to construct and validate experimentally a nucleosome-DNA interaction model, and to predict the genome-wide organization of nucleosomes. Our results demonstrate that genomes encode an intrinsic nucleosome organization and that this intrinsic organization can explain approximately 50% of the in vivo nucleosome positions. This nucleosome positioning code may facilitate specific chromosome functions including transcription factor binding, transcription initiation, and even remodelling of the nucleosomes themselves.
Subjects
Fungal DNA/genetics , Fungal Genome/genetics , Nucleosomes/genetics , Nucleosomes/metabolism , Saccharomyces cerevisiae/genetics , Base Sequence , Binding Sites , Chromatin Assembly and Disassembly , Fungal DNA/metabolism , Genomics , Response Elements/genetics , Thermodynamics , Transcription Factors/metabolism , Transcription Initiation Site
ABSTRACT
DNA sequences that are present in nucleosomes show a preferential periodicity of approximately 10 bp for certain dinucleotide signals, but the overall sequence similarity of nucleosomal DNA is weak, and traditional multiple sequence alignment tools fail to yield meaningful alignments. We develop a mixture model that characterizes the known dinucleotide periodicity probabilistically to improve the alignment of nucleosomal DNAs. We assume that a periodic dinucleotide signal of any type is emitted according to a probability distribution around a series of 'hot spots' that are equally spaced along nucleosomal DNA with a 10 bp period, but with a 1 bp phase shift across the middle of the nucleosome. We model the three statistically most significant dinucleotide signals, AA/TT, GC and TA, simultaneously, while allowing phase shifts between the signals. The alignment is obtained by maximizing the likelihood of both Watson and Crick strands simultaneously. The resulting alignment of 177 chicken nucleosomal DNA sequences revealed that all 10 distinct dinucleotides are periodic, although with only two distinct phases and varying intensity. By Fourier analysis, we show that our new alignment has enhanced periodicity and sequence identity compared with center alignment. The significance of the nucleosomal DNA sequence alignment is evaluated by comparing it with that obtained using the same model on non-nucleosomal sequences.
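The Fourier check of 10 bp periodicity mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' mixture model or their alignment procedure. It only measures spectral power of an AA/TT occurrence signal at a chosen period. The function names and the toy sequence are hypothetical, chosen for illustration.

```python
import cmath
import math


def dinucleotide_indicator(seq, targets=("AA", "TT")):
    """Return 1.0 at each position where a target dinucleotide starts, else 0.0."""
    seq = seq.upper()
    return [1.0 if seq[i:i + 2] in targets else 0.0 for i in range(len(seq) - 1)]


def power_at_period(signal, period):
    """Spectral power of the mean-centred signal at the given period (in bp)."""
    n = len(signal)
    mean = sum(signal) / n
    total = sum(
        (x - mean) * cmath.exp(-2j * math.pi * k / period)
        for k, x in enumerate(signal)
    )
    return abs(total) ** 2 / n


# Toy 140 bp sequence (an assumption, not real nucleosomal DNA):
# an AA dinucleotide placed every 10 bp on a GC background.
toy_sequence = ("AA" + "GCGCGCGC") * 14
signal = dinucleotide_indicator(toy_sequence)

for period in (7, 10, 13):
    print(period, round(power_at_period(signal, period), 4))
```

On this toy input the power at a 10 bp period is far larger than at 7 bp, which is the kind of contrast a periodicity analysis looks for. Real nucleosomal sequences give a much weaker signal, which is why alignment quality matters.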
Subjects
Statistical Models , Nucleosomes/chemistry , Sequence Alignment/methods , DNA Sequence Analysis/methods , Algorithms , Animals , Base Composition , Chickens/genetics , Dinucleotide Repeats , Fourier Analysis , Nucleotides/analysis , Periodicity
ABSTRACT
BACKGROUND: In expressed sequence tag (EST) sequencing, we are often interested in how many genes we can capture in an EST sample of a targeted size. This information provides insight into sequencing efficiency in experimental design, as well as clues to the diversity of expressed genes in the tissue from which the library was constructed. RESULTS: We propose a compound Poisson process model that can accurately predict the gene capture in a future EST sample based on an initial EST sample. It also allows estimation of the number of expressed genes in one cDNA library or co-expressed in two cDNA libraries. The superior performance of the new prediction method over an existing approach is established by a simulation study. Our analysis of four Arabidopsis thaliana EST sets suggests that the number of expressed genes present in four different cDNA libraries of Arabidopsis thaliana varies from 9155 (root) to 12005 (silique). An observed fraction of co-expressed genes in two different EST sets as low as 25% can correspond to an actual overlap fraction greater than 65%. CONCLUSION: The proposed method provides a convenient tool for gene capture prediction and cDNA library property diagnosis in EST sequencing.
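The gene capture question can be made concrete with a minimal sketch under a much simpler assumption than the compound Poisson process model proposed here. It assumes the relative abundance of each gene is known and that the EST count per gene is Poisson. The abstract does not give the authors' model in detail, so the function name and the abundance values below are hypothetical.

```python
import math


def expected_genes_captured(abundances, n_ests):
    """Expected number of distinct genes seen at least once in n_ests ESTs.

    Assumes (for illustration only) that the EST count of gene i is Poisson
    with mean n_ests * p_i, where p_i is its normalised abundance.
    """
    total = sum(abundances)
    return sum(1.0 - math.exp(-n_ests * a / total) for a in abundances)


# Hypothetical library: 100 genes, equally abundant.
uniform = [1.0] * 100
# Hypothetical library: 100 genes, a few of them highly expressed.
skewed = [50.0] * 5 + [1.0] * 95

for n in (100, 500, 2000):
    print(n, round(expected_genes_captured(uniform, n), 1),
          round(expected_genes_captured(skewed, n), 1))
```

With 100 equally abundant genes, 100 ESTs capture about 63.2 genes in expectation, since each gene is missed with probability e^-1. A skewed library captures fewer genes at the same sample size, which is why abundance heterogeneity has to be modelled when predicting capture from an initial sample.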
Subjects
Arabidopsis Proteins/genetics , Plant DNA/genetics , Expressed Sequence Tags , Gene Expression Profiling/methods , Genetic Models , Sequence Alignment/methods , DNA Sequence Analysis/methods , Algorithms , Base Sequence , Computer Simulation , Gene Library , Statistical Models , Molecular Sequence Data , Poisson Distribution
ABSTRACT
MOTIVATION: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in the analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated. RESULTS: We identify and quantify two types of error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error in the 5' EST case is approximately 10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g., P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error is due to insufficient overlap among sibling ESTs (ISO error) in 5' EST clustering. A novel statistical approach is proposed to correct ISO error to provide more accurate estimates of the true gene cluster profile.
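The two error types defined in this abstract can be illustrated with a minimal sketch that scores a clustering against known gene labels. The pairwise rate definitions used here are an assumption made for illustration, since the abstract does not state how the authors compute their rates. The function name and the toy data are hypothetical, and CAP3 itself is not invoked.

```python
from itertools import combinations


def pairwise_clustering_errors(true_gene, cluster):
    """Return (type1_rate, type2_rate) over all EST pairs.

    type1_rate: fraction of same-gene EST pairs placed in different clusters.
    type2_rate: fraction of different-gene EST pairs placed in the same cluster.
    Both definitions are illustrative assumptions.
    """
    same_gene = split = 0
    diff_gene = merged = 0
    for a, b in combinations(sorted(true_gene), 2):
        if true_gene[a] == true_gene[b]:
            same_gene += 1
            split += cluster[a] != cluster[b]
        else:
            diff_gene += 1
            merged += cluster[a] == cluster[b]
    type1 = split / same_gene if same_gene else 0.0
    type2 = merged / diff_gene if diff_gene else 0.0
    return type1, type2


# Hypothetical toy data: five ESTs from two genes, grouped into three clusters.
true_gene = {"e1": "g1", "e2": "g1", "e3": "g1", "e4": "g2", "e5": "g2"}
cluster = {"e1": "c1", "e2": "c1", "e3": "c2", "e4": "c2", "e5": "c3"}

type1, type2 = pairwise_clustering_errors(true_gene, cluster)
print(f"Type I rate: {type1:.3f}, Type II rate: {type2:.3f}")
```

In the toy data, two genes yield three clusters, so the apparent gene count is inflated by split clusters. This is the effect the abstract attributes to Type I error.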