1.
Genome Res ; 31(9): 1646-1662, 2021 09.
Article in English | MEDLINE | ID: mdl-34285090

ABSTRACT

High-throughput sequencing-based assays measure different biochemical activities pertaining to gene regulation, genome-wide. These activities include transcription factor (TF)-DNA binding, enhancer activity, open chromatin, and more. A major goal is to understand underlying sequence components, or motifs, that can explain the measured activity. It is usually not one motif but a combination of motifs bound by cooperatively acting proteins that confers activity to such regions. Furthermore, regions can be diverse, governed by different combinations of TFs/motifs. Current approaches do not take into account this issue of combinatorial diversity. We present a new statistical framework, cisDIVERSITY, which models regions as diverse modules characterized by combinations of motifs while simultaneously learning the motifs themselves. Because cisDIVERSITY does not rely on knowledge of motifs, modules, cell type, or organism, it is general enough to be applied to regions reported by most high-throughput assays. For example, in enhancer predictions resulting from different assays (GRO-cap, STARR-seq, and those measuring chromatin structure), cisDIVERSITY discovers distinct modules and combinations of TF binding sites, some specific to the assay. From protein-DNA binding data, cisDIVERSITY identifies potential cofactors of the profiled TF, whereas from ATAC-seq data, it identifies tissue-specific regulatory modules. Finally, analysis of single-cell ATAC-seq data suggests that regions open in one cell-state encode information about future states, with certain modules staying open and others closing down in the next time point.


Subjects
DNA , Transcription Factors , Binding Sites/genetics , Chromatin Immunoprecipitation , DNA/genetics , DNA/metabolism , Protein Binding/genetics , Transcription Factors/genetics , Transcription Factors/metabolism
2.
Genome Res ; 31(4): 607-621, 2021 04.
Article in English | MEDLINE | ID: mdl-33514624

ABSTRACT

The establishment of centromeric chromatin and its propagation by the centromere-specific histone CENPA is mediated by epigenetic mechanisms in most eukaryotes. DNA replication origins, origin binding proteins, and replication timing of centromere DNA are important determinants of centromere function. The epigenetically regulated regional centromeres in the budding yeast Candida albicans have unique DNA sequences that replicate earliest in every chromosome and are clustered throughout the cell cycle. In this study, the genome-wide occupancy of the replication initiation protein Orc4 reveals its abundance at all centromeres in C. albicans. Orc4 is associated with four different DNA sequence motifs, one of which coincides with tRNA genes (tDNA) that replicate early and cluster together in space. Hi-C combined with genome-wide replication timing analyses shows that early replicating Orc4-bound regions interact with each other more strongly than with late replicating Orc4-bound regions. We simulate a polymer model of chromosomes of C. albicans and propose that the early replicating and highly enriched Orc4-bound sites preferentially localize around the clustered kinetochores. We also observe that Orc4 is constitutively localized to centromeres, and both Orc4 and the helicase Mcm2 are essential for cell viability and CENPA stability in C. albicans. Finally, we show that new molecules of CENPA are recruited to centromeres during late anaphase/telophase, which coincides with the stage at which the CENPA-specific chaperone Scm3 localizes to the kinetochore. We propose that the spatiotemporal localization of Orc4 within the nucleus, in collaboration with Mcm2 and Scm3, maintains centromeric chromatin stability and CENPA recruitment in C. albicans.


Subjects
Candida albicans , Centromere , Chromatin , Origin Recognition Complex/metabolism , Candida albicans/genetics , Centromere/genetics , Chromatin/chemistry , Chromatin/genetics , Chromatin/metabolism , Histones/metabolism , Kinetochores , Replication Origin/genetics
3.
Bioinformatics ; 37(Suppl_1): i367-i375, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252930

ABSTRACT

MOTIVATION: High-throughput chromatin immunoprecipitation (ChIP) sequencing-based assays capture genomic regions associated with the profiled transcription factor (TF). ChIP-exo is a modified protocol, which uses lambda exonuclease to digest DNA close to the TF-DNA complex, in order to improve on the positional resolution of the TF-DNA contact. Because the digestion occurs in the 5'-3' orientation, the protocol produces directional footprints close to the complex, on both sides of the double stranded DNA. Like all ChIP-based methods, ChIP-exo reports a mixture of different regions associated with the TF: those bound directly to the TF as well as via intermediaries. However, the distribution of footprints is likely to be indicative of the complex forming at the DNA. RESULTS: We present ExoDiversity, which uses a model-based framework to learn a joint distribution over footprints and motifs, thus resolving the mixture of ChIP-exo footprints into diverse binding modes. It uses no prior motif or TF information and automatically learns the number of different modes from the data. We show its application on a wide range of TFs and organisms/cell-types. Because its goal is to explain the complete set of reported regions, it is able to identify co-factor TF motifs that appear in a small fraction of the dataset. Further, ExoDiversity discovers small nucleotide variations within and outside canonical motifs, which co-occur with variations in footprints, suggesting that the TF-DNA structural configuration at those regions is likely to be different. Finally, we show that detected modes have specific DNA shape features and conservation signals, giving insights into the structure and function of the putative TF-DNA complexes. AVAILABILITY AND IMPLEMENTATION: The code for ExoDiversity is available on https://github.com/NarlikarLab/exoDIVERSITY. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
DNA , Exonucleases , Binding Sites , Chromatin Immunoprecipitation , DNA/metabolism , DNA Footprinting , Protein Binding , Sequence Analysis, DNA
4.
Immunity ; 35(2): 299-311, 2011 Aug 26.
Article in English | MEDLINE | ID: mdl-21867929

ABSTRACT

The transcription factor GATA3 plays an essential role during T cell development and T helper 2 (Th2) cell differentiation. To understand GATA3-mediated gene regulation, we identified genome-wide GATA3 binding sites in ten well-defined developmental and effector T lymphocyte lineages. In the thymus, GATA3 directly regulated many critical factors, including Th-POK, Notch1, and T cell receptor subunits. In the periphery, GATA3 induced a large number of Th2 cell-specific as well as Th2 cell-nonspecific genes, including several transcription factors. Our data also indicate that GATA3 regulates both active and repressive histone modifications of many target genes at their regulatory elements near GATA3 binding sites. Overall, although GATA3 binding exhibited both shared and cell-specific patterns among various T cell lineages, many genes were either positively or negatively regulated by GATA3 in a cell type-specific manner, suggesting that GATA3-mediated gene regulation depends strongly on cofactors existing in different T cells.


Subjects
GATA3 Transcription Factor/metabolism , Mutant Proteins/metabolism , T-Lymphocyte Subsets/metabolism , Th2 Cells/metabolism , Animals , Cell Lineage/genetics , DNA Methylation , GATA3 Transcription Factor/genetics , GATA3 Transcription Factor/immunology , Gene Expression Regulation , Genome/immunology , Genome-Wide Association Study , Histones/genetics , Histones/metabolism , Lymphopoiesis/genetics , Mice , Mice, Inbred C57BL , Mice, Transgenic , Mutant Proteins/genetics , Mutant Proteins/immunology , Protein Binding , Receptor, Notch1/genetics , Receptor, Notch1/metabolism , Receptors, Antigen, T-Cell, alpha-beta/genetics , Receptors, Antigen, T-Cell, alpha-beta/metabolism , T-Lymphocyte Subsets/immunology , T-Lymphocyte Subsets/pathology , Th2 Cells/immunology , Th2 Cells/pathology , Transcription Factors/genetics , Transcription Factors/metabolism
5.
Nucleic Acids Res ; 46(5): e29, 2018 03 16.
Article in English | MEDLINE | ID: mdl-29267972

ABSTRACT

We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h, on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.


Subjects
Algorithms , Computational Biology/methods , DNA/genetics , High-Throughput Nucleotide Sequencing/methods , Nucleotide Motifs/genetics , Binding Sites/genetics , Chromatin/genetics , Chromatin/metabolism , Chromatin Immunoprecipitation/methods , Cluster Analysis , DNA/chemistry , DNA/metabolism , Genomics/methods , Humans , Protein Binding , Reproducibility of Results , Transcription Factors/metabolism
6.
PLoS Comput Biol ; 14(4): e1006090, 2018 04.
Article in English | MEDLINE | ID: mdl-29684008

ABSTRACT

Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence-motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding-activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important to interpret the ChIP experiment and ultimately, the role of X in regulation. We present DIVERSITY, which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. DIVERSITY uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by DIVERSITY give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data.
Webserver at http://diversity.ncl.res.in/; standalone (Mac OS X/Linux) from https://github.com/NarlikarLab/DIVERSITY/releases/tag/v1.0.0.


Subjects
Chromatin Immunoprecipitation/statistics & numerical data , Software , Algorithms , Animals , Bayes Theorem , Binding Sites , Chromatin/genetics , Chromatin/metabolism , Computational Biology , DNA/genetics , DNA/metabolism , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Evolution, Molecular , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Neurons/metabolism , Nucleotide Motifs , Protein Binding , Sequence Analysis, DNA/statistics & numerical data , Transcription Factors/genetics , Transcription Factors/metabolism
7.
Bioinformatics ; 32(5): 779-81, 2016 03 01.
Article in English | MEDLINE | ID: mdl-26530723

ABSTRACT

Promoters have diverse regulatory architectures and thus activate genes differently. For example, some have a TATA-box, many others do not. Even the ones with it can differ in its position relative to the transcription start site (TSS). No Promoter Left Behind (NPLB) is an efficient, organism-independent method for characterizing such diverse architectures directly from experimentally identified genome-wide TSSs, without relying on known promoter elements. As a test case, we show its application in identifying novel architectures in the fly genome. AVAILABILITY AND IMPLEMENTATION: Web server at http://nplb.ncl.res.in; standalone version at https://github.com/computationalBiology/NPLB/ (Mac OSX/Linux). CONTACT: l.narlikar@ncl.res.in SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Promoter Regions, Genetic , Genome , Transcription Initiation Site
8.
Nucleic Acids Res ; 42(20): 12388-403, 2014 Nov 10.
Article in English | MEDLINE | ID: mdl-25326324

ABSTRACT

An important question in biology is how different promoter-architectures contribute to the diversity in regulation of transcription initiation. A step forward has been the production of genome-wide maps of transcription start sites (TSSs) using high-throughput sequencing. However, the subsequent step of characterizing promoters and their functions is still largely done on the basis of previously established promoter-elements like the TATA-box in eukaryotes or the -10 box in bacteria. Unfortunately, a majority of promoters and their activities cannot be explained by these few elements. Traditional motif discovery methods that identify novel elements also fail here, because TSS neighborhoods are often highly heterogeneous, containing no overrepresented motif. We present a new, organism-independent method that explicitly models this heterogeneity while unraveling different promoter-architectures. For example, in five bacteria, we detect the presence of a pyrimidine preceding the TSS under very specific circumstances. In tuberculosis, we show for the first time that the spacing between the bacterial -10 motif and TSS is utilized by the pathogen for dynamic gene-regulation. In eukaryotes, we identify several new elements that are important for development. Identified promoter-architectures show differential patterns of evolution, chromatin structure and TSS spread, suggesting distinct regulatory functions. This work highlights the importance of characterizing heterogeneity within high-throughput genomic data rather than analyzing average patterns of nucleotide composition.


Subjects
Genomics/methods , Promoter Regions, Genetic , Transcription Initiation Site , Animals , Chromatin/chemistry , Drosophila/genetics , Escherichia coli/genetics , Genome, Bacterial , Genome, Human , Humans , Klebsiella pneumoniae/genetics , Mycobacterium tuberculosis/genetics , Transcription, Genetic
9.
Nucleic Acids Res ; 41(1): 21-32, 2013 Jan 07.
Article in English | MEDLINE | ID: mdl-23093591

ABSTRACT

High-throughput chromatin immunoprecipitation has become the method of choice for identifying genomic regions bound by a protein. Such regions are then investigated for overrepresented sequence motifs, the assumption being that they must correspond to the binding specificity of the profiled protein. However, this approach often fails: many bound regions do not contain the 'expected' motif. This is because binding DNA directly at its recognition site is not the only way the protein can cause the region to immunoprecipitate. Its binding specificity can change through association with different co-factors, it can bind DNA indirectly, through intermediaries, or even enforce its function through long-range chromosomal interactions. Conventional motif discovery methods, though largely capable of identifying overrepresented motifs from bound regions, lack the ability to characterize such diverse modes of protein-DNA binding and binding specificities. We present a novel Bayesian method that identifies distinct protein-DNA binding mechanisms without relying on any motif database. The method successfully identifies co-factors of proteins that do not bind DNA directly, such as mediator and p300. It also predicts literature-supported enhancer-promoter interactions. Even for well-studied direct-binding proteins, this method provides compelling evidence for previously uncharacterized dependencies within positions of binding sites, long-range chromosomal interactions and dimerization.


Subjects
Chromatin Immunoprecipitation , DNA-Binding Proteins/metabolism , Software , Transcription Factors/metabolism , Animals , Bayes Theorem , DNA/chemistry , DNA/metabolism , Embryonic Stem Cells/metabolism , GATA3 Transcription Factor/metabolism , Genomics/methods , Mice , Protein Binding , T-Lymphocytes/metabolism
10.
Nucleic Acids Res ; 41(3): 1416-24, 2013 Feb 01.
Article in English | MEDLINE | ID: mdl-23267010

ABSTRACT

The structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the 'optimal' order, we investigated two model selection criteria: Akaike information criterion and Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxon from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.


Subjects
Markov Chains , Sequence Analysis, DNA/methods , Animals , Genomics/methods , Humans , Metagenomics/methods , Models, Statistical , Nucleotide Motifs , Promoter Regions, Genetic
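
The order-selection step described in the abstract above can be illustrated with a minimal sketch. It assumes a DNA alphabet of four symbols, maximum-likelihood transition estimates, and the standard BIC form, -2 log L + k log n, with k = 4^order × 3 free parameters. The function names (`markov_log_likelihood`, `bic`, `bic_optimal_order`) are hypothetical and are not taken from the paper or any accompanying software.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order, start):
    """Maximised log-likelihood of seq[start:] under a Markov model of the given order.

    Transition probabilities are maximum-likelihood estimates:
    count(context followed by symbol) / count(context).
    """
    if order > start:
        raise ValueError("start must be at least as large as order")
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1
    return sum(
        n * math.log(n / context_counts[context])
        for (context, _symbol), n in transition_counts.items()
    )


def bic(seq, order, start, alphabet_size=4):
    """BIC = -2 log L + k log n, with k free transition parameters (assumed form)."""
    n_obs = len(seq) - start
    n_params = (alphabet_size ** order) * (alphabet_size - 1)
    return -2.0 * markov_log_likelihood(seq, order, start) + n_params * math.log(n_obs)


def bic_optimal_order(seq, max_order):
    """Return the order in 0..max_order with the lowest BIC."""
    if len(seq) <= max_order:
        raise ValueError("sequence is too short for the requested max_order")
    # Score the same positions for every order so the BIC values are comparable.
    return min(range(max_order + 1), key=lambda k: bic(seq, k, max_order))
```

For a strictly periodic toy sequence such as `"ACGT" * 50`, each base fully determines the next, so `bic_optimal_order("ACGT" * 50, 3)` selects order 1: higher orders fit equally well but pay a larger parameter penalty.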
11.
Neuropsychiatr Dis Treat ; 20: 923-936, 2024.
Article in English | MEDLINE | ID: mdl-38716091

ABSTRACT

Introduction: Stigma contributes a significant part of the burden of schizophrenia (SCZ); reducing false positives in diagnosis would therefore be liberating for individuals with SCZ and desirable for clinicians. The stigmatization associated with schizophrenia underscores the need for high-precision diagnosis. In this study, we present an ensemble learning-based approach for high-precision diagnosis of SCZ using peripheral blood gene expression profiles. Methodology: The machine learning (ML) models, support vector machines (SVM), and prediction analysis for microarrays (PAM) were developed using differentially expressed genes (DEGs) as features. The SCZ samples were classified based on a voting ensemble classifier of SVM and PAM. Further, microarray-based learning was used to classify RNA sequencing (RNA-Seq) samples from our case-control study (Pune-SCZ) to assess cross-platform compatibility. Results: Ensemble learning using ML models resulted in a significantly higher precision of 80.41% (SD: 0.04) when compared to the individual models (SVM-radial: 71.69%, SD: 0.04 and PAM: 77.20%, SD: 0.02). The RNA sequencing samples from our case-control study (Pune-SCZ) resulted in a moderate precision (59.92%, SD: 0.05). The feature genes used for model building were enriched for biological processes such as response to stress, regulation of the immune system, and metabolism of organic nitrogen compounds. The network analysis identified RBX1, CUL4B, DDB1, PRPF19, and COPS4 as hub genes. Conclusion: In summary, this study developed robust models for higher diagnostic precision in psychiatric disorders. Future efforts will be directed towards multi-omic integration and developing "explainable" diagnostic models.

12.
Genome Res ; 20(3): 381-92, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20075146

ABSTRACT

The various organogenic programs deployed during embryonic development rely on the precise expression of a multitude of genes in time and space. Identifying the cis-regulatory elements responsible for this tightly orchestrated regulation of gene expression is an essential step in understanding the genetic pathways involved in development. We describe a strategy to systematically identify tissue-specific cis-regulatory elements that share combinations of sequence motifs. Using heart development as an experimental framework, we employed a combination of Gibbs sampling and linear regression to build a classifier that identifies heart enhancers based on the presence and/or absence of various sequence features, including known and putative transcription factor (TF) binding specificities. In distinguishing heart enhancers from a large pool of random noncoding sequences, the performance of our classifier is vastly superior to four commonly used methods, with an accuracy reaching 92% in cross-validation. Furthermore, most of the binding specificities learned by our method resemble the specificities of TFs widely recognized as key players in heart development and differentiation, such as SRF, MEF2, ETS1, SMAD, and GATA. Using our classifier as a predictor, a genome-wide scan identified over 40,000 novel human heart enhancers. Although the classifier used no gene expression information, these novel enhancers are strongly associated with genes expressed in the heart. Finally, in vivo tests of our predictions in mouse and zebrafish achieved a validation rate of 62%, significantly higher than what is expected by chance. These results support the existence of underlying cis-regulatory codes dictating tissue-specific transcription in mammalian genomes and validate our enhancer classifier strategy as a method to uncover these regulatory codes.


Subjects
Genome , Heart/embryology , Amino Acid Motifs/genetics , Animals , Base Sequence , Female , Humans , Mammals/genetics , Mice/embryology , Pregnancy , Protein Binding/genetics , Regulatory Sequences, Nucleic Acid/genetics , Reproducibility of Results
13.
Bioinformatics ; 28(4): 581-3, 2012 Feb 15.
Article in English | MEDLINE | ID: mdl-22199387

ABSTRACT

CLARE is a computational method designed to reveal sequence encryption of tissue-specific regulatory elements. Starting with a set of regulatory elements known to be active in a particular tissue/process, it learns the sequence code of the input set and builds a predictive model from features specific to those elements. The resulting model can then be applied to user-supplied genomic regions to identify novel candidate regulatory elements. CLARE's model also provides a detailed analysis of transcription factors that most likely bind to the elements, making it an invaluable tool for understanding mechanisms of tissue-specific gene regulation. AVAILABILITY: CLARE is freely accessible at http://clare.dcode.org/.


Subjects
Regulatory Sequences, Nucleic Acid , Software , Animals , Enhancer Elements, Genetic , Gene Expression Regulation , Genomics , Humans , Mice , Organ Specificity , Prosencephalon/metabolism , Transcription Factors/metabolism
14.
Heliyon ; 9(8): e18211, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37520992

ABSTRACT

Transcription factors (TFs) and their binding sites have evolved to interact cooperatively or competitively with each other. Here we examine in detail, across multiple cell lines, such cooperation or competition among TFs both in sequential and spatial proximity (using chromatin conformation capture assays), considering in vivo binding data as well as TF binding motifs in DNA. We ascertain significantly co-occurring ("attractive") or avoiding ("repulsive") TF pairs using robust randomized models that retain the essential characteristics of the experimental data. Across human cell lines TFs organize into two groups, with intra-group attraction and inter-group repulsion. This is true for both sequential and spatial proximity, and for both in vivo binding and sequence motifs. Attractive TF pairs exhibit significantly more physical interactions, suggesting an underlying mechanism. The two TF groups differ significantly in their genomic and network properties, as well as in their function: while one group regulates housekeeping functions, the other potentially regulates lineage-specific functions that are disrupted in cancer. Weaker binding sites tend to occur in spatially interacting regions of the genome. Our results suggest that a complex pattern of spatial cooperativity of TFs and chromatin has evolved with the genome to support housekeeping and lineage-specific functions.

15.
iScience ; 26(10): 107846, 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37767000

ABSTRACT

Early onset of type 2 diabetes and cardiovascular disease are common complications for women diagnosed with gestational diabetes. Prediabetes refers to a condition in which blood glucose levels are higher than normal, but not yet high enough to be diagnosed as type 2 diabetes. Currently, there is no accurate way of knowing which women with gestational diabetes are likely to develop postpartum prediabetes. This study aims to predict the risk of postpartum prediabetes in women diagnosed with gestational diabetes. Our sparse logistic regression approach selects only two variables - antenatal fasting glucose at OGTT and HbA1c soon after the diagnosis of GDM - as relevant, but gives an area under the receiver operating characteristic curve of 0.72, outperforming all other methods. We envision this to be a practical solution, which coupled with a targeted follow-up of high-risk women, could yield better cardiometabolic outcomes in women with a history of GDM.

16.
Nucleic Acids Res ; 38(6): e90, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20047961

ABSTRACT

As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.


Subjects
Promoter Regions, Genetic , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Base Sequence , Binding Sites , Conserved Sequence , Molecular Sequence Data , Sequence Alignment
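
The relaxed, alignment-free definition of conservation in the abstract above (a site counts as conserved in an orthologous sequence if it occurs anywhere in that sequence, on either strand) can be sketched as follows. This is only an illustration of the definition: exact string matching is assumed, the function names are hypothetical, and how the authors turn such scores into a prior for Gibbs sampling is not reproduced here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Reverse complement of an uppercase DNA string."""
    return site.translate(_COMPLEMENT)[::-1]


def is_conserved(site, ortholog):
    """True if the site occurs anywhere in the ortholog, in either orientation."""
    return site in ortholog or reverse_complement(site) in ortholog


def conserved_fraction(site, orthologs):
    """Fraction of orthologous sequences in which the site is conserved."""
    if not orthologs:
        return 0.0
    return sum(is_conserved(site, o) for o in orthologs) / len(orthologs)


def position_scores(seq, orthologs, width):
    """Conservation score for every window of the given width in seq.

    Assumption for illustration: scores like these could be normalised
    into a positional prior; the paper's actual prior may differ.
    """
    return [
        conserved_fraction(seq[i:i + width], orthologs)
        for i in range(len(seq) - width + 1)
    ]
```

Because no alignment or phylogeny is involved, each window is scored with simple substring lookups, which is what keeps the idea fast and simple.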
17.
PLoS One ; 17(3): e0264648, 2022.
Article in English | MEDLINE | ID: mdl-35255105

ABSTRACT

OBJECTIVE: The aim of the present study was to identify the factors associated with non-attendance of immediate postpartum glucose test using a machine learning algorithm following gestational diabetes mellitus (GDM) pregnancy. METHOD: A retrospective cohort study of all GDM women (n = 607) for postpartum glucose test due between January 2016 and December 2019 at the George Eliot Hospital NHS Trust, UK. RESULTS: Sixty-five percent of women attended postpartum glucose test. Type 2 diabetes was diagnosed in 2.8% and 21.6% had persistent dysglycaemia at 6-13 weeks post-delivery. Those who did not attend postpartum glucose test seem to be younger, multiparous, obese, and continued to smoke during pregnancy. They also had higher fasting glucose at antenatal oral glucose tolerance test. Our machine learning algorithm predicted postpartum glucose non-attendance with an area under the receiver operating characteristic curve of 0.72. The model could achieve a sensitivity of 70% with 66% specificity at a risk score threshold of 0.46. A total of 233 (38.4%) women attended subsequent glucose test at least once within the first two years of delivery and 24% had dysglycaemia. Compared to women who attended postpartum glucose test, those who did not attend had higher conversion rate to type 2 diabetes (2.5% vs 11.4%; p = 0.005). CONCLUSION: Postpartum screening following GDM is still poor. Women who did not attend postpartum screening appear to have higher metabolic risk and higher conversion to type 2 diabetes by two years post-delivery. Machine learning model can predict women who are unlikely to attend postpartum glucose test using simple antenatal factors. Enhanced, personalised education of these women may improve postpartum glucose screening.


Subjects
Diabetes Mellitus, Type 2 , Diabetes, Gestational , Blood Glucose/metabolism , Diabetes Mellitus, Type 2/diagnosis , Diabetes Mellitus, Type 2/epidemiology , Diabetes, Gestational/diagnosis , Diabetes, Gestational/epidemiology , Diabetes, Gestational/metabolism , Female , Glucose , Humans , Machine Learning , Male , Postpartum Period , Pregnancy , Retrospective Studies
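
The operating point quoted in the abstract above (70% sensitivity and 66% specificity at a risk-score threshold of 0.46) is a pair of numbers obtained by cutting a continuous risk score at one value. The sketch below shows how such a pair is computed in general. It assumes that a score at or above the threshold counts as a positive prediction, which the abstract does not state, and the function name is hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.

    labels: 1 (or True) for the positive class, 0 (or False) for the negative class.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold  # assumed convention
        if label and predicted_positive:
            tp += 1
        elif label:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Lowering the threshold trades specificity for sensitivity, so a reported threshold only makes sense together with both figures.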
18.
Brief Funct Genomic Proteomic ; 8(4): 215-30, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19498043

ABSTRACT

Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.


Subjects
Eukaryotic Cells/metabolism; Genome/genetics; Regulatory Sequences, Nucleic Acid/genetics; Animals; Computational Biology; Gene Expression Regulation; Health; Humans
19.
PLoS Comput Biol ; 3(11): e215, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17997593

ABSTRACT

Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence-sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data becomes increasingly available.
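The role of a positional prior in Gibbs sampling can be sketched as follows: when the site in one sequence is resampled, each candidate start position is weighted by its prior multiplied by the likelihood ratio of the window under the current motif model versus a background. This is a minimal, generic illustration and not the authors' implementation. The function names, the uniform 0.25 background, the pseudocount, and the toy sequences are all assumptions, and the priors are supplied as plain lists rather than derived from a nucleosome occupancy model.

```python
import random

ALPHABET = "ACGT"


def build_pwm(sequences, starts, width, skip, pseudocount=1.0):
    """Column-wise base frequencies from the current sites, leaving out sequence `skip`."""
    counts = [[pseudocount] * 4 for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][ALPHABET.index(seq[start + j])] += 1
    return [[c / sum(col) for c in col] for col in counts]


def site_posterior(seq, pwm, prior, background=0.25):
    """Posterior over start positions: prior[p] times the motif/background likelihood ratio."""
    width = len(pwm)
    weights = []
    for p in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            ratio *= pwm[j][ALPHABET.index(seq[p + j])] / background
        weights.append(prior[p] * ratio)
    total = sum(weights)  # assumes at least one position has non-zero prior
    return [w / total for w in weights]


def gibbs_motif_search(sequences, priors, width, sweeps=200, seed=0):
    """Resample one site per sequence in turn; priors[i][p] weights start p in sequence i."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(sweeps):
        for i, seq in enumerate(sequences):
            pwm = build_pwm(sequences, starts, width, skip=i)
            post = site_posterior(seq, pwm, priors[i])
            starts[i] = rng.choices(range(len(post)), weights=post)[0]
    return starts


# Hypothetical toy input: three short sequences with a uniform prior over start positions.
seqs = ["TTTTGACGTCATTT", "CCGACGTCACCCCC", "AAAAAAGACGTCAA"]
width = 7
uniform = [[1.0] * (len(s) - width + 1) for s in seqs]
print(gibbs_motif_search(seqs, uniform, width))
```

Replacing `uniform` with per-position weights that favour, for example, regions predicted to be nucleosome-depleted biases sampling toward those positions without changing the rest of the procedure.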


Subjects
DNA, Fungal/genetics; Nucleosomes/genetics; Oligonucleotide Array Sequence Analysis/methods; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae/genetics; Sequence Analysis, DNA/methods; Transcription Factors/genetics; Binding Sites; Protein Binding; Protein Interaction Mapping/methods
20.
Bioinformatics ; 22(14): e384-92, 2006 Jul 15.
Article in English | MEDLINE | ID: mdl-16873497

ABSTRACT

MOTIVATION: An important problem in molecular biology is to identify the locations at which a transcription factor (TF) binds to DNA, given a set of DNA sequences believed to be bound by that TF. In previous work, we showed that information in the DNA sequence of a binding site is sufficient to predict the structural class of the TF that binds it. In particular, this suggests that we can predict which locations in any DNA sequence are more likely to be bound by certain classes of TFs than others. Here, we argue that traditional methods for de novo motif finding can be significantly improved by adopting an informative prior probability that a TF binding site occurs at each sequence location. To demonstrate the utility of such an approach, we present priority, a powerful new de novo motif finding algorithm. RESULTS: Using data from TRANSFAC, we train three classifiers to recognize binding sites of basic leucine zipper, forkhead, and basic helix-loop-helix TFs. These classifiers are used to equip priority with three class-specific priors, in addition to a default prior to handle TFs of other classes. We apply priority and a number of popular motif finding programs to sets of yeast intergenic regions that are reported by ChIP-chip to be bound by particular TFs. priority identifies motifs the other methods fail to identify, and correctly predicts the structural class of the TF recognizing the identified binding sites. AVAILABILITY: Supplementary material and code can be found at http://www.cs.duke.edu/~amink/.


Subjects
DNA/chemistry; DNA/genetics; Models, Genetic; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Transcription Factors/chemistry; Transcription Factors/genetics; Algorithms; Amino Acid Motifs; Base Sequence; Binding Sites; Computer Simulation; Models, Chemical; Models, Molecular; Molecular Sequence Data; Protein Binding; Transcription Factors/classification