Results 1 - 20 of 48
1.
Genome Res ; 34(5): 680-695, 2024 06 25.
Article in English | MEDLINE | ID: mdl-38777607

ABSTRACT

Gastric cancer (GC) is the fifth most common cancer worldwide and is a heterogeneous disease. Among GC subtypes, the mesenchymal phenotype (Mes-like) is more invasive than the epithelial phenotype (Epi-like). Although gene expression of the epithelial-to-mesenchymal transition (EMT) has been studied, the regulatory landscape shaping this process is not fully understood. Here we use ATAC-seq and RNA-seq data from a compendium of GC cell lines and primary tumors to detect drivers of regulatory state changes and their transcriptional responses. Using the ATAC-seq data, we developed a machine learning approach to determine the transcription factors (TFs) regulating the subtypes of GC. We identified TFs driving the mesenchymal (RUNX2, ZEB1, SNAI2, AP-1 dimer) and the epithelial (GATA4, GATA6, KLF5, HNF4A, FOXA2, GRHL2) states in GC. We identified DNA copy number alterations associated with dysregulation of these TFs, specifically deletion of GATA4 and amplification of MAPK9. Comparisons with bulk and single-cell RNA-seq data sets identified activation toward fibroblast-like epigenomic and expression signatures in Mes-like GC. The activation of this mesenchymal fibrotic program is associated with differentially accessible DNA cis-regulatory elements flanking upregulated mesenchymal genes. These findings establish a map of TF activity in GC and highlight the role of copy number driven alterations in shaping epigenomic regulatory programs as potential drivers of GC heterogeneity and progression.


Subjects
Epithelial-Mesenchymal Transition; Gene Expression Regulation, Neoplastic; Machine Learning; Stomach Neoplasms; Humans; Stomach Neoplasms/genetics; Stomach Neoplasms/pathology; Stomach Neoplasms/metabolism; Epithelial-Mesenchymal Transition/genetics; Transcription Factor AP-1/metabolism; Transcription Factor AP-1/genetics; Cell Line, Tumor; Fibrosis/genetics; Core Binding Factor Alpha 1 Subunit/genetics; Core Binding Factor Alpha 1 Subunit/metabolism; DNA Copy Number Variations; Core Binding Factor Alpha 2 Subunit
2.
Nat Methods ; 21(4): 723-734, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38504114

ABSTRACT

The ENCODE Consortium's efforts to annotate noncoding cis-regulatory elements (CREs) have advanced our understanding of gene regulatory landscapes. Pooled, noncoding CRISPR screens offer a systematic approach to investigate cis-regulatory mechanisms. The ENCODE4 Functional Characterization Centers conducted 108 screens in human cell lines, comprising >540,000 perturbations across 24.85 megabases of the genome. Using 332 functionally confirmed CRE-gene links in K562 cells, we established guidelines for screening endogenous noncoding elements with CRISPR interference (CRISPRi), including accurate detection of CREs that exhibit variable, often low, transcriptional effects. Benchmarking five screen analysis tools, we find that CASA produces the most conservative CRE calls and is robust to artifacts of low-specificity single guide RNAs. We uncover a subtle DNA strand bias for CRISPRi in transcribed regions with implications for screen design and analysis. Together, we provide an accessible data resource, predesigned single guide RNAs for targeting 3,275,697 ENCODE SCREEN candidate CREs with CRISPRi and screening guidelines to accelerate functional characterization of the noncoding genome.


Subjects
CRISPR-Cas Systems; Clustered Regularly Interspaced Short Palindromic Repeats; Humans; Clustered Regularly Interspaced Short Palindromic Repeats/genetics; CRISPR-Cas Systems/genetics; Genome; K562 Cells; RNA, Guide, CRISPR-Cas Systems
3.
Genome Res ; 31(9): 1638-1645, 2021 09.
Article in English | MEDLINE | ID: mdl-34285053

ABSTRACT

Massively parallel reporter assays (MPRAs) are a high-throughput method for evaluating in vitro activities of thousands of candidate cis-regulatory elements (CREs). In these assays, candidate sequences are cloned upstream or downstream from a reporter gene tagged by unique DNA sequences. However, tag sequences may themselves affect reporter gene expression and lead to major potential biases in the measured cis-regulatory activity. Here, we present a sequence-based method for correcting tag-sequence-specific effects and show that our method can significantly reduce this source of variation and improve the identification of functional regulatory variants by MPRAs. We also show that our model captures sequence features associated with post-transcriptional regulation of mRNA. Thus, this new method helps not only to improve detection of regulatory signals in MPRA experiments but also to design better MPRA protocols.


Subjects
Gene Expression Regulation; Regulatory Sequences, Nucleic Acid; Bias; Biological Assay; Genes, Reporter
4.
Gut ; 72(2): 226-241, 2023 02.
Article in English | MEDLINE | ID: mdl-35817555

ABSTRACT

OBJECTIVE: Gastric cancer (GC) comprises multiple molecular subtypes. Recent studies have highlighted mesenchymal-subtype GC (Mes-GC) as a clinically aggressive subtype with few treatment options. Combining multiple studies, we derived and applied a consensus Mes-GC classifier to define the Mes-GC enhancer landscape revealing disease vulnerabilities.

DESIGN: Transcriptomic profiles of ~1000 primary GCs and cell lines were analysed to derive a consensus Mes-GC classifier. Clinical and genomic associations were performed across >1200 patients with GC. Genome-wide epigenomic profiles (H3K27ac, H3K4me1 and assay for transposase-accessible chromatin with sequencing (ATAC-seq)) of 49 primary GCs and GC cell lines were generated to identify Mes-GC-specific enhancer landscapes. Upstream regulators and downstream targets of Mes-GC enhancers were interrogated using chromatin immunoprecipitation followed by sequencing (ChIP-seq), RNA sequencing, CRISPR/Cas9 editing, functional assays and pharmacological inhibition.

RESULTS: We identified and validated a 993-gene cancer-cell intrinsic Mes-GC classifier applicable to retrospective cohorts or prospective single samples. Multicohort analysis of Mes-GCs confirmed associations with poor patient survival, therapy resistance and few targetable genomic alterations. Analysis of enhancer profiles revealed a distinctive Mes-GC epigenomic landscape, with TEAD1 as a master regulator of Mes-GC enhancers and Mes-GCs exhibiting preferential sensitivity to TEAD1 pharmacological inhibition. Analysis of Mes-GC super-enhancers also highlighted NUAK1 kinase as a downstream target, with synergistic effects observed between NUAK1 inhibition and cisplatin treatment.

CONCLUSION: Our results establish a consensus Mes-GC classifier applicable to multiple transcriptomic scenarios. Mes-GCs exhibit a distinct epigenomic landscape, and TEAD1 inhibition and combinatorial NUAK1 inhibition/cisplatin may represent potential targetable options.


Subjects
Enhancer Elements, Genetic; Epigenesis, Genetic; Gene Expression Regulation, Neoplastic; Stomach Neoplasms; Humans; Cisplatin/metabolism; Cisplatin/therapeutic use; Prospective Studies; Protein Kinases/genetics; Repressor Proteins; Retrospective Studies; Stomach Neoplasms/genetics
5.
Gut ; 72(9): 1651-1663, 2023 09.
Article in English | MEDLINE | ID: mdl-36918265

ABSTRACT

OBJECTIVE: Gastric cancer (GC) is a leading cause of cancer mortality, with ARID1A being the second most frequently mutated driver gene in GC. We sought to decipher ARID1A-specific GC regulatory networks and examine therapeutic vulnerabilities arising from ARID1A loss.

DESIGN: Genomic profiling of GC patients including a Singapore cohort (>200 patients) was performed to derive mutational signatures of ARID1A inactivation across molecular subtypes. Single-cell transcriptomic profiles of ARID1A-mutated GCs were analysed to examine tumour microenvironmental changes arising from ARID1A loss. Genome-wide ARID1A binding and chromatin profiles (H3K27ac, H3K4me3, H3K4me1, ATAC-seq) were generated to identify gastric-specific epigenetic landscapes regulated by ARID1A. Distinct cancer hallmarks of ARID1A-mutated GCs were converged at the genomic, single-cell and epigenomic level, and targeted by pharmacological inhibition.

RESULTS: We observed prevalent ARID1A inactivation across GC molecular subtypes, with distinct mutational signatures and linked to a NFKB-driven proinflammatory tumour microenvironment. ARID1A-depletion caused loss of H3K27ac activation signals at ARID1A-occupied distal enhancers, but unexpectedly gain of H3K27ac at ARID1A-occupied promoters in genes such as NFKB1 and NFKB2. Promoter activation in ARID1A-mutated GCs was associated with enhanced gene expression, increased BRD4 binding, and reduced HDAC1 and CTCF occupancy. Combined targeting of promoter activation and tumour inflammation via bromodomain and NFKB inhibitors confirmed therapeutic synergy specific to ARID1A-genomic status.

CONCLUSION: Our results suggest a therapeutic strategy for ARID1A-mutated GCs targeting both tumour-intrinsic (BRD4-associated promoter activation) and extrinsic (NFKB immunomodulation) cancer phenotypes.


Subjects
Stomach Neoplasms; Transcription Factors; Humans; Transcription Factors/genetics; Transcription Factors/metabolism; Stomach Neoplasms/genetics; Stomach Neoplasms/therapy; Stomach Neoplasms/pathology; Nuclear Proteins/genetics; Epigenomics; Mutation; Tumor Microenvironment/genetics; DNA-Binding Proteins/genetics; Cell Cycle Proteins/genetics
6.
Annu Rev Genomics Hum Genet ; 21: 37-54, 2020 08 31.
Article in English | MEDLINE | ID: mdl-32443951

ABSTRACT

Spatiotemporal control of gene expression during development requires orchestrated activities of numerous enhancers, which are cis-regulatory DNA sequences that, when bound by transcription factors, support selective activation or repression of associated genes. Proper activation of enhancers is critical during embryonic development, adult tissue homeostasis, and regeneration, and inappropriate enhancer activity is often associated with pathological conditions such as cancer. Multiple consortia [e.g., the Encyclopedia of DNA Elements (ENCODE) Consortium and National Institutes of Health Roadmap Epigenomics Mapping Consortium] and independent investigators have mapped putative regulatory regions in a large number of cell types and tissues, but the sequence determinants of cell-specific enhancers are not yet fully understood. Machine learning approaches trained on large sets of these regulatory regions can identify core transcription factor binding sites and generate quantitative predictions of enhancer activity and the impact of sequence variants on activity. Here, we review these computational methods in the context of enhancer prediction and gene regulatory network models specifying cell fate.


Subjects
Computational Biology/methods; Enhancer Elements, Genetic; Gene Regulatory Networks; Genome, Human; Humans
7.
Am J Hum Genet ; 103(6): 874-892, 2018 12 06.
Article in English | MEDLINE | ID: mdl-30503521

ABSTRACT

The progressive loss of midbrain (MB) dopaminergic (DA) neurons defines the motor features of Parkinson disease (PD), and modulation of risk by common variants in PD has been well established through genome-wide association studies (GWASs). We acquired open chromatin signatures of purified embryonic mouse MB DA neurons because we anticipated that a fraction of PD-associated genetic variation might mediate the variants' effects within this neuronal population. Correlation with >2,300 putative enhancers assayed in mice revealed enrichment for MB cis-regulatory elements (CREs), and these data were reinforced by transgenic analyses of six additional sequences in zebrafish and mice. One CRE, within intron 4 of the familial PD gene SNCA, directed reporter expression in catecholaminergic neurons from transgenic mice and zebrafish. Sequencing of this CRE in 986 individuals with PD and 992 controls revealed two common variants associated with elevated PD risk. To assess potential mechanisms of action, we screened >16,000 proteins for DNA binding capacity and identified a subset whose binding is impacted by these enhancer variants. Additional genotyping across the SNCA locus identified a single PD-associated haplotype, containing the minor alleles of both of the aforementioned PD-risk variants. Our work posits a model for how common variation at SNCA might modulate PD risk and highlights the value of cell-context-dependent guided searches for functional non-coding variation.


Subjects
Chromatin/genetics; Dopaminergic Neurons/pathology; Enhancer Elements, Genetic/genetics; Genetic Predisposition to Disease/genetics; Parkinson Disease/genetics; alpha-Synuclein/genetics; Adult; Aged; Aged, 80 and over; Alleles; Animals; Disease Models, Animal; Female; Genotype; Humans; Introns/genetics; Male; Mice; Mice, Transgenic; Middle Aged; Pregnancy; Zebrafish
8.
Hum Mutat ; 40(9): 1280-1291, 2019 09.
Article in English | MEDLINE | ID: mdl-31106481

ABSTRACT

The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines. Reporter expression was measured relative to plasmid DNA to determine the impact of variants. The challenge was to predict the functional effects of variants on reporter expression. Comparative analysis of the full range of submitted prediction results identifies the most successful models of transcription factor binding sites, machine learning algorithms, and ways to choose among or incorporate diverse datatypes and cell-types for training computational models. These results have the potential to improve the design of future studies on more diverse sets of regulatory elements and aid the interpretation of disease-associated genetic variation.
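
The saturation mutagenesis design described above, in which every base of an element is substituted with each alternative nucleotide, can be illustrated with a short sketch. This is only an enumeration of the variant library under the assumption of single-nucleotide substitutions over the alphabet A, C, G, T; the function name and the toy sequence are hypothetical and are not taken from the challenge materials.

```python
def saturation_variants(seq, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-nucleotide substitution of seq, using 0-based positions."""
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                # Replace exactly one base and keep the rest of the element.
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


# Toy 4-bp "element": three alternative bases at each of four positions.
variants = list(saturation_variants("ACGT"))
```

A real element of length n yields 3n substitutions, which is the set of variants a predictor would have to score.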


Subjects
DNA/chemistry; Epigenomics/methods; Point Mutation; Binding Sites; Cell Line; Chromatin/genetics; DNA/metabolism; Enhancer Elements, Genetic; Genetic Predisposition to Disease; Humans; Machine Learning; Promoter Regions, Genetic; Transcription Factors/metabolism
9.
PLoS Comput Biol ; 14(12): e1006625, 2018 12.
Article in English | MEDLINE | ID: mdl-30562350

ABSTRACT

We report an experimental design issue in recent machine learning formulations of the enhancer-promoter interaction problem arising from the fact that many enhancer-promoter pairs share features. Cross-fold validation schemes which do not correctly separate these feature-sharing enhancer-promoter pairs into one test set report high accuracy, which actually arises from high training set accuracy and a failure to properly evaluate generalization performance. Cross-fold validation schemes which properly segregate pairs with shared features show markedly reduced ability to predict enhancer-promoter interactions from epigenomic state. Parameter scans with multiple models indicate that local epigenomic features of individual pairs of enhancers and promoters cannot distinguish with high accuracy those pairs that interact from those that do not, suggesting that additional information is required to predict enhancer-promoter interactions.
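
One way to keep feature-sharing pairs out of both training and test sets is to assign folds by group rather than by individual pair. The sketch below is an illustration only, not the scheme used in the paper: it assumes that two pairs "share features" when they share an enhancer or a promoter, groups such pairs with a union-find structure, and places whole groups into folds. The function name and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def grouped_folds(pairs, n_folds):
    """Assign (enhancer, promoter) pairs to folds so that pairs sharing an
    enhancer or a promoter always land in the same fold.
    Returns a list of folds, each a list of indices into pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link each enhancer to each promoter it is paired with.
    for enhancer, promoter in pairs:
        parent[find(("E", enhancer))] = find(("P", promoter))

    # Collect pair indices by connected component.
    groups = defaultdict(list)
    for index, (enhancer, _) in enumerate(pairs):
        groups[find(("E", enhancer))].append(index)

    # Place the largest groups first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


pairs = [("e1", "p1"), ("e1", "p2"), ("e2", "p2"),
         ("e3", "p3"), ("e4", "p4"), ("e4", "p5")]
folds = grouped_folds(pairs, n_folds=2)
```

With this assignment no enhancer or promoter appears on both sides of a train/test split, at the cost of folds that may be unequal in size.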


Subjects
Enhancer Elements, Genetic; Epigenesis, Genetic; Models, Genetic; Promoter Regions, Genetic; Cell Line; Computational Biology; Humans; K562 Cells; Machine Learning; Support Vector Machine
10.
Hum Mutat ; 38(9): 1251-1258, 2017 09.
Article in English | MEDLINE | ID: mdl-28120510

ABSTRACT

We participated in the Critical Assessment of Genome Interpretation eQTL challenge to further test computational models of regulatory variant impact and their association with human disease. Our prediction model is based on a discriminative gapped-kmer SVM (gkm-SVM) trained on genome-wide chromatin accessibility data in the cell type of interest. The comparisons with massively parallel reporter assays (MPRA) in lymphoblasts show that gkm-SVM is among the most accurate prediction models even though all other models used the MPRA data for model training, and gkm-SVM did not. In addition, we compare gkm-SVM with other MPRA datasets and show that gkm-SVM is a reliable predictor of expression and that deltaSVM is a reliable predictor of variant impact in K562 cells and mouse retina. We further show that DHS (DNase-I hypersensitive sites) and ATAC-seq (assay for transposase-accessible chromatin using sequencing) data are equally predictive substrates for training gkm-SVM, and that DHS regions flanked by H3K27Ac and H3K4me1 marks are more predictive than DHS regions alone.
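
The abstract does not spell out how a variant-impact score is computed. As a rough sketch, a deltaSVM-style score is commonly described as the summed difference in k-mer weights between the alternate and reference alleles over every k-mer window that overlaps the variant. The code below assumes that formulation; the function name, the choice k = 3 and the weight table are toy assumptions for illustration, not values from a trained gkm-SVM model.

```python
def delta_score(ref_seq, alt_seq, variant_pos, weights, k):
    """Sum weight differences (alt minus ref) over every k-mer window that
    overlaps variant_pos (0-based). Unlisted k-mers contribute a weight of 0."""
    if len(ref_seq) != len(alt_seq):
        raise ValueError("ref and alt must have equal length (substitutions only)")
    first = max(0, variant_pos - k + 1)          # leftmost overlapping window
    last = min(variant_pos, len(ref_seq) - k)    # rightmost overlapping window
    total = 0.0
    for start in range(first, last + 1):
        ref_kmer = ref_seq[start:start + k]
        alt_kmer = alt_seq[start:start + k]
        total += weights.get(alt_kmer, 0.0) - weights.get(ref_kmer, 0.0)
    return total


# Toy weights (hypothetical): positive values favour the active class.
weights = {"ACG": 1.0, "CGT": 0.5, "ATG": -1.0, "TGT": 0.25}
score = delta_score("AACGTT", "AATGTT", variant_pos=2, weights=weights, k=3)
```

A negative score here would mean the alternate allele replaces k-mers with higher weights by k-mers with lower weights.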


Subjects
Enhancer Elements, Genetic; Genetic Variation; Retina/chemistry; Sequence Analysis, DNA/methods; Animals; Computational Biology/methods; Histones/metabolism; Humans; K562 Cells; Mice; Models, Genetic; Quantitative Trait Loci; Retina/metabolism; Software; Support Vector Machine
11.
Hum Mutat ; 38(9): 1240-1250, 2017 09.
Article in English | MEDLINE | ID: mdl-28220625

ABSTRACT

In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.


Subjects
Computational Biology/methods; Gene Expression; Gene Expression Regulation; Genetic Predisposition to Disease; Humans; Quantitative Trait Loci
12.
Genome Res ; 24(12): 1932-44, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25319996

ABSTRACT

Combinatorial actions of relatively few transcription factors control hematopoietic differentiation. To investigate this process in erythro-megakaryopoiesis, we correlated the genome-wide chromatin occupancy signatures of four master hematopoietic transcription factors (GATA1, GATA2, TAL1, and FLI1) and three diagnostic histone modification marks with the gene expression changes that occur during development of primary cultured megakaryocytes (MEG) and primary erythroblasts (ERY) from murine fetal liver hematopoietic stem/progenitor cells. We identified a robust, genome-wide mechanism of MEG-specific lineage priming by a previously described stem/progenitor cell-expressed transcription factor heptad (GATA2, LYL1, TAL1, FLI1, ERG, RUNX1, LMO2) binding to MEG-associated cis-regulatory modules (CRMs) in multipotential progenitors. This is followed by genome-wide GATA factor switching that mediates further induction of MEG-specific genes following lineage commitment. Interaction between GATA and ETS factors appears to be a key determinant of these processes. In contrast, ERY-specific lineage priming is biased toward GATA2-independent mechanisms. In addition to its role in MEG lineage priming, GATA2 plays an extensive role in late megakaryopoiesis as a transcriptional repressor at loci defined by a specific DNA signature. Our findings reveal important new insights into how ERY and MEG lineages arise from a common bipotential progenitor via overlapping and divergent functions of shared hematopoietic transcription factors.


Subjects
Cell Differentiation; Cell Lineage; Erythropoiesis/physiology; Hematopoietic Stem Cells/cytology; Hematopoietic Stem Cells/metabolism; Thrombopoiesis/physiology; Transcription Factors/metabolism; Animals; Base Sequence; Basic Helix-Loop-Helix Transcription Factors/metabolism; Binding Sites; Chromatin/genetics; Chromatin/metabolism; Cluster Analysis; GATA1 Transcription Factor/metabolism; GATA2 Transcription Factor/metabolism; Gene Expression Profiling; Gene Silencing; Genome-Wide Association Study; Histones/metabolism; Mice; Models, Biological; Nucleotide Motifs; Protein Binding; Proto-Oncogene Protein c-fli-1/metabolism; Proto-Oncogene Proteins/metabolism; Proto-Oncogene Proteins c-ets/metabolism; T-Cell Acute Lymphocytic Leukemia Protein 1; Transcription Factors/genetics; Transcription, Genetic
13.
Bioinformatics ; 32(14): 2205-7, 2016 07 15.
Article in English | MEDLINE | ID: mdl-27153639

ABSTRACT

We present a new R package for training gapped-kmer SVM classifiers for DNA and protein sequences. We describe an improved algorithm for kernel matrix calculation that speeds run time by about 2 to 5-fold over our original gkmSVM algorithm. This package supports several sequence kernels, including: gkmSVM, kmer-SVM, mismatch kernel and wildcard kernel.

AVAILABILITY AND IMPLEMENTATION: gkmSVM package is freely available through the Comprehensive R Archive Network (CRAN), for Linux, Mac OS and Windows platforms. The C++ implementation is available at www.beerlab.org/gkmsvm

CONTACT: mghandi@gmail.com or mbeer@jhu.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Software; Support Vector Machine; Algorithms
14.
Proc Natl Acad Sci U S A ; 111(48): 17224-9, 2014 Dec 02.
Article in English | MEDLINE | ID: mdl-25413365

ABSTRACT

Although the similarities between humans and mice are typically highlighted, morphologically and genetically, there are many differences. To better understand these two species on a molecular level, we performed a comparison of the expression profiles of 15 tissues by deep RNA sequencing and examined the similarities and differences in the transcriptome for both protein-coding and -noncoding transcripts. Although commonalities are evident in the expression of tissue-specific genes between the two species, the expression for many sets of genes was found to be more similar in different tissues within the same species than between species. These findings were further corroborated by associated epigenetic histone mark analyses. We also find that many noncoding transcripts are expressed at a low level and are not detectable at appreciable levels across individuals. Moreover, the majority lack obvious sequence homologs between species, even when we restrict our attention to those which are most highly reproducible across biological replicates. Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms.


Subjects
DNA, Intergenic/genetics; Gene Expression Profiling/methods; Organ Specificity/genetics; Proteins/genetics; Animals; Epigenomics/methods; Evolution, Molecular; High-Throughput Nucleotide Sequencing; Humans; Mice, Inbred C57BL; Sequence Analysis, RNA; Species Specificity; Transcriptome/genetics
16.
Genome Res ; 22(11): 2290-301, 2012 Nov.
Article in English | MEDLINE | ID: mdl-23019145

ABSTRACT

We take a comprehensive approach to the study of regulatory control of gene expression in melanocytes that proceeds from large-scale enhancer discovery facilitated by ChIP-seq; to rigorous validation in silico, in vitro, and in vivo; and finally to the use of machine learning to elucidate a regulatory vocabulary with genome-wide predictive power. We identify 2489 putative melanocyte enhancer loci in the mouse genome by ChIP-seq for EP300 and H3K4me1. We demonstrate that these putative enhancers are evolutionarily constrained, enriched for sequence motifs predicted to bind key melanocyte transcription factors, located near genes relevant to melanocyte biology, and capable of driving reporter gene expression in melanocytes in culture (86%; 43/50) and in transgenic zebrafish (70%; 7/10). Next, using the sequences of these putative enhancers as a training set for a supervised machine learning algorithm, we develop a vocabulary of 6-mers predictive of melanocyte enhancer function. Lastly, we demonstrate that this vocabulary has genome-wide predictive power in both the mouse and human genomes. This study provides deep insight into the regulation of gene expression in melanocytes and demonstrates a powerful approach to the investigation of regulatory sequences that can be applied to other cell types.


Subjects
Artificial Intelligence; Chromatin Immunoprecipitation/methods; Enhancer Elements, Genetic; Melanocytes/metabolism; Algorithms; Animals; E1A-Associated p300 Protein/genetics; E1A-Associated p300 Protein/metabolism; Evolution, Molecular; Gene Expression Regulation; Genes, Reporter; Genome, Human; Histones/metabolism; Humans; Mice; Sequence Analysis, DNA/methods; Transcription Factors/metabolism; Zebrafish
17.
PLoS Comput Biol ; 10(7): e1003711, 2014 Jul.
Article in English | MEDLINE | ID: mdl-25033408

ABSTRACT

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
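
To make the gapped k-mer feature set concrete, the sketch below counts gapped k-mers by brute force: each window of length l is expanded into every pattern that keeps k informative positions and masks the remaining l - k positions. This is a naive enumeration for illustration, under the assumption that a gapped k-mer is defined by a window length l and k informative positions; it is not the tree-based kernel computation the abstract refers to, and the function name and the '-' gap symbol are hypothetical choices.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: for every window of length l, keep k informative
    positions and mask the other l - k positions with '-'."""
    counts = Counter()
    position_sets = [set(c) for c in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in position_sets:
            pattern = "".join(
                base if i in keep else "-" for i, base in enumerate(window)
            )
            counts[pattern] += 1
    return counts


# Toy example: windows of length 4 with 3 informative positions.
counts = gapped_kmer_counts("ACGTACGT", l=4, k=3)
```

Because each window contributes to several patterns, a gapped feature is observed more often than the full-length word it was derived from, which is the property that makes the counts less sparse.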


Subjects
Computational Biology/methods; Models, Genetic; Regulatory Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Base Sequence; Bayes Theorem; Chromatin Immunoprecipitation; Oligonucleotides/genetics; Organ Specificity/genetics; Support Vector Machine
18.
Nucleic Acids Res ; 41(Web Server issue): W544-56, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23771147

ABSTRACT

Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org.


Subjects
High-Throughput Nucleotide Sequencing; Regulatory Elements, Transcriptional; Sequence Analysis, DNA; Software; Support Vector Machine; Transcription Factors/metabolism; Animals; Binding Sites; Genomics; Humans; Internet; Mice
19.
Genome Res ; 21(12): 2167-80, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21875935

ABSTRACT

Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers.


Subjects
CREB-Binding Protein; Genome/physiology; Response Elements/physiology; Animals; Cerebral Cortex/cytology; Cerebral Cortex/metabolism; Genome-Wide Association Study/methods; Mice; Neurons/cytology; Neurons/metabolism; Oligonucleotide Array Sequence Analysis/methods; Organ Specificity/physiology; Sequence Analysis, DNA
20.
J Math Biol ; 69(2): 469-500, 2014 Aug.
Article in English | MEDLINE | ID: mdl-23861010

ABSTRACT

Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
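
The sparsity problem described above can be seen with a small numerical experiment: for a sequence of fixed length, the number of possible k-mers grows as 4^k while the number of windows stays roughly constant, so the fraction of k-mers ever observed collapses as k grows. The sketch below uses a random 2,000-bp sequence as a stand-in for real data; the sequence, the seed and the function name are assumptions made for illustration, and the code does not implement the minimum norm estimate itself.

```python
import random


def kmer_occupancy(seq, k):
    """Return (number of distinct k-mers observed in seq, 4**k possible k-mers)."""
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(observed), 4 ** k


random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(2000))

# One row per k: (k, distinct k-mers observed, possible k-mers, fraction observed).
rows = []
for k in (2, 4, 6, 8, 10):
    seen, possible = kmer_occupancy(seq, k)
    rows.append((k, seen, possible, seen / possible))
```

At k = 10 a 2,000-bp sequence has at most 1,991 windows against 1,048,576 possible 10-mers, so nearly every count is zero or one, which is the regime in which gapped k-mer counts are proposed as a more robust basis for estimation.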


Subjects
DNA/chemistry; Genome, Human; Transcription Factors/chemistry; Binding Sites; Humans