Abstract
BACKGROUND: Regulation of transcription is central to the emergence of new cell types during development, and it often involves activation of genes via proximal and distal regulatory regions. The activity of regulatory elements is determined by transcription factors (TFs) and epigenetic marks, but despite extensive mapping of such patterns, the extraction of regulatory principles remains challenging. RESULTS: Here we study differentially and similarly expressed genes along with their associated epigenomic profiles, chromatin accessibility and DNA methylation, during lineage specification at gastrulation in mice. Comparison of the three lineages allows us to identify genomic and epigenomic features that distinguish the two classes of genes. We show that differentially expressed genes are primarily regulated by distal elements, while similarly expressed genes are controlled by proximal housekeeping regulatory programs. Differentially expressed genes are relatively isolated within topologically associated domains, while similarly expressed genes tend to be located in gene clusters. Transcription of differentially expressed genes is associated with differentially open chromatin at distal elements including enhancers, while that of similarly expressed genes is associated with ubiquitously accessible chromatin at promoters. CONCLUSION: Based on these associations of (linearly) distal genes' transcription start sites (TSSs) and putative enhancers for developmental genes, our findings allow us to link putative enhancers to their target promoters and to infer lineage-specific repertoires of putative driver transcription factors, within which we define subgroups of pioneers and co-operators.
Subjects
Epigenomics , Essential Genes , Animals , Mice , Chromatin/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Expression Profiling
Abstract
BACKGROUND: Using artificial data to evaluate the performance of aligners and peak callers not only makes the evaluation more accurate and reliable, but also makes it possible to reduce computational time. One natural way to achieve such a time reduction is to map a single chromosome. RESULTS: We investigated whether single-chromosome mapping causes any artefacts in the aligners' performance. We compared the accuracy of seven aligners on well-controlled simulated benchmark data sampled from a single chromosome and from a whole genome. We found that commonly used statistical methods are insufficient to evaluate an aligner's performance, and applied a novel measure of read density distribution similarity, which revealed artefacts in the aligners' performance. We also calculated mismatch statistics and constructed mismatch frequency distributions along the read. CONCLUSIONS: Generating artificial data by mapping reads generated from a single chromosome to a reference chromosome is justified from the point of view of reducing benchmarking time. The proposed quality assessment method identifies inherent shortcomings of aligners that are not detected by conventional statistical methods and that can affect the quality of alignment of real data.
Subjects
Artifacts , Chromosome Mapping/methods , Genomics
Abstract
NOTCH1 mutant clones occupy the majority of normal human esophagus by middle age but are comparatively rare in esophageal cancers, suggesting NOTCH1 mutations drive clonal expansion but impede carcinogenesis. Here we test this hypothesis. Sequencing NOTCH1 mutant clones in aging human esophagus reveals frequent biallelic mutations that block NOTCH1 signaling. In mouse esophagus, heterozygous Notch1 mutation confers a competitive advantage over wild-type cells, an effect enhanced by loss of the second allele. Widespread Notch1 loss alters transcription but has minimal effects on the epithelial structure and cell dynamics. In a carcinogenesis model, Notch1 mutations were less prevalent in tumors than normal epithelium. Deletion of Notch1 reduced tumor growth, an effect recapitulated by anti-NOTCH1 antibody treatment. Notch1 null tumors showed reduced proliferation. We conclude that Notch1 mutations in normal epithelium are beneficial as wild-type Notch1 favors tumor expansion. NOTCH1 blockade may have therapeutic potential in preventing esophageal squamous cancer.
Subjects
Esophageal Neoplasms , Animals , Humans , Mice , Middle Aged , Carcinogenesis/pathology , Epithelium/pathology , Esophageal Neoplasms/genetics , Esophageal Neoplasms/pathology , Mutation , Notch1 Receptor/genetics
Abstract
The incidence of keratinocyte cancer (basal cell and squamous cell carcinomas of the skin) is 17-fold lower in Singapore than in the UK [1-3], despite Singapore receiving 2-3 times more ultraviolet (UV) radiation [4,5]. Aging skin contains somatic mutant clones from which such cancers develop [6,7]. We hypothesized that differences in keratinocyte cancer incidence may be reflected in the normal skin mutational landscape. Here we show that, compared to Singapore, aging facial skin from populations in the UK has a fourfold greater mutational burden, a predominant UV mutational signature, increased copy number aberrations and increased mutant TP53 selection. These features are shared by keratinocyte cancers from high-incidence and low-incidence populations [8-13]. In Singaporean skin, most mutations result from cell-intrinsic processes; mutant NOTCH1 and NOTCH2 are more strongly selected than in the UK. Aging skin in a high-incidence country has multiple features convergent with cancer that are not found in a low-risk country. These differences may reflect germline variation in UV-protective genes.
Subjects
Squamous Cell Carcinoma , Skin Neoplasms , Humans , Skin Neoplasms/genetics , Skin Neoplasms/pathology , Squamous Cell Carcinoma/genetics , Keratinocytes , Ultraviolet Rays/adverse effects , Mutation
Abstract
Aging normal human oesophagus accumulates TP53 mutant clones. These are the origin of most oesophageal squamous carcinomas, in which biallelic TP53 disruption is almost universal. However, how p53 mutant clones expand and contribute to cancer development is unclear. Here we show that inducing the p53R245W mutant in single oesophageal progenitor cells in transgenic mice confers a proliferative advantage and clonal expansion but does not disrupt normal epithelial structure. Loss of the remaining p53 allele in mutant cells results in genomically unstable p53R245W/null epithelium with giant polyaneuploid cells and copy number altered clones. In carcinogenesis, p53 mutation does not initiate tumour formation, but tumours developing from areas with p53 mutation and LOH are larger and show extensive chromosomal instability compared to lesions arising in wild type epithelium. We conclude that p53 has distinct functions at different stages of carcinogenesis and that LOH within p53 mutant clones in normal epithelium is a critical step in malignant transformation.
Subjects
Carcinogenesis , Tumor Suppressor Protein p53 , Humans , Mice , Animals , Tumor Suppressor Protein p53/genetics , Carcinogenesis/genetics , Clone Cells , Esophagus , Transgenic Mice , Chromosomal Instability , Mutation
Abstract
Esophageal squamous cell carcinoma (ESCC) shows remarkable variation in incidence that is not fully explained by known lifestyle and environmental risk factors. It has been speculated that an unknown exogenous exposure(s) could be responsible. Here we combine the fields of mutational signature analysis with cancer epidemiology to study 552 ESCC genomes from eight countries with varying incidence rates. Mutational profiles were similar across all countries studied. Associations between specific mutational signatures and ESCC risk factors were identified for tobacco, alcohol, opium and germline variants, with modest impacts on mutation burden. We find no evidence of a mutational signature indicative of an exogenous exposure capable of explaining differences in ESCC incidence. Apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC)-associated mutational signatures single-base substitution (SBS)2 and SBS13 were present in 88% and 91% of cases, respectively, and accounted for 25% of the mutation burden on average, indicating that APOBEC activation is a crucial step in ESCC tumor development.
Subjects
Esophageal Neoplasms/epidemiology , Esophageal Neoplasms/genetics , Esophageal Squamous Cell Carcinoma/epidemiology , Esophageal Squamous Cell Carcinoma/genetics , Mutation , APOBEC Deaminases/genetics , Adult , Aged , Aged 80 and over , Mitochondrial Aldehyde Dehydrogenase/genetics , Brazil/epidemiology , China/epidemiology , Female , Humans , Incidence , Iran/epidemiology , Male , Middle Aged , Tumor Suppressor Protein p53/genetics , United Kingdom/epidemiology , Whole Genome Sequencing
Abstract
MOTIVATION: Primary data analysis methods are of critical importance in second-generation DNA sequencing. Improved methods have the potential to increase yield and reduce error rates. Openly documented analysis tools enable users to understand the primary data, which is important for the optimization and validity of their scientific work. RESULTS: In this article, we describe Swift, a new tool for performing primary data analysis on the Illumina Solexa Sequencing Platform. Swift is the first tool, outside of the vendor's own software, that completes the full analysis process, from raw images through to base calls. As such, it provides an alternative to, and independent validation of, the vendor-supplied tool. Our results show that Swift is able to increase yield by 13.8% at a comparable error rate.
Subjects
DNA Sequence Analysis/methods , Software , Base Sequence , Computational Biology , Molecular Sequence Data
Abstract
In a recent study, 1373 highly conserved non-coding elements (CNEs) were detected by aligning the human and Takifugu rubripes (Fugu) genomes. The remarkable degree of sequence conservation in CNEs compared with their surroundings suggested comparing the base composition within CNEs with their 5' and 3' flanking regions. The analysis reveals a novel, sharp and distinct signal of nucleotide frequency bias precisely at the border between CNEs and flanking regions.
Subjects
DNA/genetics , Vertebrates/genetics , Animals , Base Composition , Conserved Sequence , DNA/chemistry , Humans , Takifugu/genetics
Abstract
In addition to protein coding sequence, the human genome contains a significant amount of regulatory DNA, the identification of which is proving somewhat recalcitrant to both in silico and functional methods. An approach that has been used with some success is comparative sequence analysis, whereby equivalent genomic regions from different organisms are compared in order to identify both similarities and differences. In general, similarities in sequence between highly divergent organisms imply functional constraint. We have used a whole-genome comparison between humans and the pufferfish, Fugu rubripes, to identify nearly 1,400 highly conserved non-coding sequences. Given the evolutionary divergence between these species, it is likely that these sequences are found in, and furthermore are essential to, all vertebrates. Most, and possibly all, of these sequences are located in and around genes that act as developmental regulators. Some of these sequences are over 90% identical across more than 500 bases, being more highly conserved than coding sequence between these two species. Despite this, we cannot find any similar sequences in invertebrate genomes. In order to begin to functionally test this set of sequences, we have used a rapid in vivo assay system using zebrafish embryos that allows tissue-specific enhancer activity to be identified. Functional data is presented for highly conserved non-coding sequences associated with four unrelated developmental regulators (SOX21, PAX6, HLXB9, and SHH), in order to demonstrate the suitability of this screen to a wide range of genes and expression patterns. Of 25 sequence elements tested around these four genes, 23 show significant enhancer activity in one or more tissues. We have identified a set of non-coding sequences that are highly conserved throughout vertebrates. 
They are found in clusters across the human genome, principally around genes that are implicated in the regulation of development, including many transcription factors. These highly conserved non-coding sequences are likely to form part of the genomic circuitry that uniquely defines vertebrate development.
Subjects
Developmental Gene Expression Regulation , Human Genome , Nucleic Acid Regulatory Sequences , Takifugu/genetics , Animals , Conserved Sequence , Genetic Databases , Genetic Enhancer Elements , Eye Proteins/metabolism , Genome , Green Fluorescent Proteins/metabolism , Hedgehog Proteins , High Mobility Group Proteins/metabolism , Homeodomain Proteins/metabolism , Humans , Molecular Sequence Data , Multigene Family , Neoplasm Proteins/metabolism , PAX6 Transcription Factor , Paired Box Transcription Factors/metabolism , Repressor Proteins/metabolism , SOXB2 Transcription Factors , DNA Sequence Analysis , Species Specificity , Trans-Activators/metabolism , Transcription Factors/metabolism
Abstract
In this paper we analyse the efficiency of two methods, rescaled range analysis and detrended fluctuation analysis, in distinguishing between coding DNA, regulatory DNA and non-coding non-regulatory DNA of Drosophila melanogaster. Both methods were used to estimate the degree of sequential dependence (or persistence) among nucleotides. We found that these three types of DNA can be discriminated by both methods, although rescaled range analysis performs slightly better than detrended fluctuation analysis. On average, non-coding, non-regulatory DNA has the highest degree of sequential persistence. Coding DNA could be characterised as being anti-persistent, which is in line with earlier findings of latent periodicity. Regulatory regions are shown to possess intermediate sequential dependency. Together with other available methods, rescaled range and detrended fluctuation analysis on the basis of a combined purine/pyrimidine and weak/strong classification of the nucleotides are useful tools for refined structural and functional segmentation of DNA.
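As an illustration of the kind of estimate the abstract refers to, the following is a minimal sketch of rescaled range (R/S) analysis on a purine/pyrimidine encoding of a DNA string. It is not the authors' implementation: the function names, the doubling chunk sizes and the `min_size` default are assumptions chosen for brevity. A slope above 0.5 is conventionally read as persistence and a slope below 0.5 as anti-persistence.

```python
import math
import random


def purine_pyrimidine_series(seq):
    """Map a DNA string to +1 (purine: A, G) or -1 (pyrimidine: C, T)."""
    return [1.0 if base in "AG" else -1.0 for base in seq.upper() if base in "ACGT"]


def rescaled_range(chunk):
    """Return R/S for one chunk, or None if the chunk has zero variance."""
    n = len(chunk)
    mean = sum(chunk) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / n)
    if std == 0:
        return None
    cumulative = lowest = highest = 0.0
    for x in chunk:
        # Running sum of deviations from the chunk mean.
        cumulative += x - mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    return (highest - lowest) / std


def hurst_rs(series, min_size=8):
    """Estimate the Hurst exponent as the slope of log(R/S) against log(chunk size)."""
    log_sizes, log_rs = [], []
    size = min_size
    while size <= len(series) // 2:
        values = [rescaled_range(series[i:i + size])
                  for i in range(0, len(series) - size + 1, size)]
        values = [v for v in values if v is not None]
        if values:
            log_sizes.append(math.log(size))
            log_rs.append(math.log(sum(values) / len(values)))
        size *= 2  # assumption: chunk sizes double at each step
    if len(log_sizes) < 2:
        raise ValueError("series too short for a slope estimate")
    mean_x = sum(log_sizes) / len(log_sizes)
    mean_y = sum(log_rs) / len(log_rs)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_sizes, log_rs))
    denominator = sum((x - mean_x) ** 2 for x in log_sizes)
    return numerator / denominator


if __name__ == "__main__":
    random.seed(0)
    toy = "".join(random.choice("ACGT") for _ in range(4096))  # synthetic sequence
    print(round(hurst_rs(purine_pyrimidine_series(toy)), 3))
```

On short sequences the R/S slope is known to be biased, so values from a sketch like this are only comparable between sequences of similar length.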
Subjects
Open Reading Frames/genetics , Untranslated RNA/genetics , Nucleic Acid Regulatory Sequences/genetics , Animals , Computers , DNA/analysis , DNA/genetics , Genome/genetics , Humans
Abstract
Identifying regions of DNA with extreme statistical characteristics is an important aspect of the structural analysis of complete genomes. Linguistic methods, mainly based on estimating word frequency, can be used for this as they allow for the delineation of regions of low complexity. Low complexity may be due to biased nucleotide composition, to tandem or dispersed repeats, to palindrome-hairpin structures, or to a combination of these features. We developed software tools in which various numerical measures of text complexity are implemented, including combinatorial and linguistic ones. We also added a Hurst exponent estimate to the software to measure dependencies in DNA sequences. By applying these tools to various functional genomic regions, we demonstrate that the complexity of introns and regulatory regions is lower than that of coding regions, whilst their Hurst exponent is larger. Further analysis of promoter sequences revealed that the lower complexity of these regions is associated with long-range correlations caused by transcription factor binding sites.
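A word-frequency complexity measure of the kind mentioned above can be sketched as follows. This uses one common definition, the fraction of possible k-mers actually observed, multiplied over several word lengths. The abstract does not say which measures the software implements, so the function names and the `max_k` default here are purely illustrative.

```python
def kmer_complexity(seq, k):
    """Distinct k-mers observed divided by the maximum possible for this length."""
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence shorter than k")
    observed = {seq[i:i + k] for i in range(n_windows)}
    # At most 4**k distinct words exist, and at most n_windows can occur.
    return len(observed) / min(4 ** k, n_windows)


def linguistic_complexity(seq, max_k=3):
    """Product of k-mer usage ratios for k = 1..max_k (assumed definition)."""
    result = 1.0
    for k in range(1, max_k + 1):
        result *= kmer_complexity(seq, k)
    return result
```

A homopolymer such as `"AAAAAAAA"` scores far lower than a mixed sequence of the same length, which is the behaviour a low-complexity detector relies on.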
Subjects
Algorithms , Chromosome Mapping/methods , Genetic Models , Sequence Alignment/methods , DNA Sequence Analysis/methods , Base Sequence , Computer Simulation , Statistical Data Interpretation , Entropy , Statistical Models , Molecular Sequence Data , Software
Abstract
One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present an unsupervised computational algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for locating putative transcription factor binding sites in a single stretch of DNA. To this end, we developed a procedure to approximate the order of the Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does not require any prior knowledge in the form of description or cross-genomic comparison; it is context sensitive and takes DNA heterogeneity into account.
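To make the idea of modelling a single stretch of DNA with a Markov chain concrete, here is a minimal sketch that fits a fixed-order chain to a sequence and scores each window by its log-likelihood under that chain. It is not the published algorithm: the fixed `order`, the add-one `pseudocount` and the window scoring are assumptions, and the authors' procedure for estimating the model order is not reproduced.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov(seq, order=1, pseudocount=1.0):
    """Return prob(context, base) estimated from transition counts in seq."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        row = counts.get(context, {})
        # Pseudocounts keep unseen transitions from having zero probability.
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def window_log_likelihoods(seq, width, order=1):
    """Score every window of `width` bases under the chain fitted to seq itself."""
    prob = fit_markov(seq, order)
    scores = []
    for start in range(order, len(seq) - width + 1):
        log_likelihood = sum(math.log(prob(seq[i - order:i], seq[i]))
                             for i in range(start, start + width))
        scores.append((start, log_likelihood))
    return scores
```

Windows whose score departs strongly from the rest of the sequence are candidates for closer inspection; how to threshold them is a separate statistical question that this sketch does not address.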
Subjects
Chromosome Mapping/methods , DNA/genetics , Nucleic Acid Regulatory Sequences/genetics , Sequence Alignment/methods , DNA Sequence Analysis/methods , Transcription Factors/genetics , Artificial Intelligence , Binding Sites , Computer Simulation , DNA/chemistry , Markov Chains , Genetic Models , Statistical Models , Automated Pattern Recognition , Protein Binding , Transcription Factors/chemistry
Abstract
BACKGROUND: This paper addresses the problem of recognising DNA cis-regulatory modules that are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. RESULTS: We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: an abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs per se. Though over-representation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA. CONCLUSION: We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. Potential applications of our method include annotation of new genomic sequences and motif discovery.
Subjects
Computational Biology/methods , DNA/chemistry , Drosophila melanogaster/genetics , Genome , DNA Sequence Analysis , Algorithms , Amino Acid Motifs , Animals , Base Sequence , Binding Sites , Cell Nucleus/metabolism , Chromatin/metabolism , Cluster Analysis , Genetic Databases , Insect Genes , Regulator Genes , Genomics , Statistical Models , Molecular Sequence Data , Genetic Transcription
Abstract
The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. However, in the particular case of SNP calling, a great number of false-positive SNPs may be obtained. One needs to distinguish putative SNPs from sequencing or other errors. We found that not only the probability of sequencing errors (i.e. the quality value) is important to distinguish an FP-SNP but also the conditional probability of "correcting" this error (the "second best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to retrieve DNA motifs that seem to be prone to sequencing errors, and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequence errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggested a simple method to correct the majority of mismatches, based on conditional probability of their "second" best intensity call. We attach a corresponding second call confidence (quality value) of being corrected to each mismatch.
Subjects
DNA Sequence Analysis/methods , Algorithms , Nucleotide Motifs , Single Nucleotide Polymorphism , Research Design
Abstract
As was the case at the beginning of the sequencing era, the new generation of short-read sequencing technologies still requires both accurate data processing methods and reliable measures of that accuracy. Inspired by the classic of the genre, the Phred method, we generalized those findings in the area of base quality value calibration. We introduce a simple, straightforward, statistically established way to measure the performance of a calibrator and to find an optimal way to assess its reliability. We illustrate the method by assessing the performance of several calibrators/predictors for Illumina Genome Analyser 2 (GA2) data. The choice of the best predictor is based on optimization of validity, discriminative ability and discrimination power for several candidate predictors. We applied the method to data from one experimental run for the genome of the phage φX, and found the best predictor out of ten candidates to be 'Purity', a statistic derived from corrected cluster intensities. The source code for the comparison of the predictors is available from the authors on request.
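For orientation, the Phred scale relates a quality value Q to an error probability p by Q = -10 log10(p). The sketch below shows that conversion together with a simple calibration check that compares predicted quality values with the empirical error rate of the bases assigned to them. The calibration table is an illustrative assumption, including its add-one smoothing, and is not the validity or discrimination measure defined by the authors.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality value."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


def calibration_table(calls):
    """Map each predicted quality to the empirical quality of its base calls.

    `calls` is an iterable of (predicted_q, is_error) pairs, for example from
    reads aligned to a known reference such as a control genome.
    """
    counts = defaultdict(lambda: [0, 0])  # predicted_q -> [errors, total]
    for predicted_q, is_error in calls:
        counts[predicted_q][0] += int(is_error)
        counts[predicted_q][1] += 1
    table = {}
    for predicted_q, (errors, total) in sorted(counts.items()):
        # Assumption: add-one smoothing so that zero observed errors stays finite.
        rate = (errors + 1) / (total + 2)
        table[predicted_q] = error_prob_to_phred(rate)
    return table
```

A well-calibrated predictor would give empirical values close to the predicted ones across the whole table; large gaps indicate over- or under-confident quality values.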
Subjects
Algorithms , Artifacts , Chromosome Mapping/methods , Statistical Data Interpretation , DNA Sequence Analysis/methods , Software , Base Sequence , Molecular Sequence Data
Abstract
There are no well-known properties in regulatory DNA analogous to those in coding sequences; the spatial location of regulatory elements is not regular, the consensus regulatory elements are often degenerate and there are no understandable rules governing their evolution. This makes it difficult to recognize regulatory regions within a genome. We review developments in the statistical characterization of regulatory regions and methods for their recognition in eukaryotic genomes.
Subjects
Computational Biology , DNA/physiology , Genome , Nucleic Acid Regulatory Sequences/genetics , DNA Sequence Analysis/statistics & numerical data , Predictive Value of Tests
Abstract
We recently identified approximately 1400 conserved non-coding elements (CNEs) shared by the genomes of fugu (Takifugu rubripes) and human that appear to be associated with developmental regulation in vertebrates [Woolfe, A., Goodson, M., Goode, D.K., Snell, P., McEwen, G.K., Vavouri, T., Smith, S.F., North, P., Callaway, H., Kelly, K., Walter, K., Abnizova, I., Gilks, W., Edwards, Y.J.K., Cooke, J.E., Elgar, G., 2005. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3 (1), e7]. This study encompassed a multi-disciplinary approach using bioinformatics, statistical methods and functional assays to identify and characterise the CNEs. Using an in vivo enhancer assay, over 90% of tested CNEs up-regulate tissue-specific GFP expression. Here we review our group's research in the field of characterising non-coding sequences conserved in vertebrates. We take this opportunity to discuss our research in progress and present some results of new and additional analyses. These include a phylogenomics analysis of CNEs, sequence conservation patterns in vertebrate CNEs and the distribution of human SNPs in the CNEs. We highlight the usefulness of the CNE dataset to help correlate genetic variation in health and disease. We also discuss the functional analysis using the enhancer assay and the enrichment of predicted transcription factor binding sites for two CNEs. Public access to the CNEs plus annotation is now possible and is described. The content of this review was presented by Dr. Y.J.K. Edwards at the TODAI International Symposium on Functional Genomics of the Pufferfish, Tokyo, Japan, 3-6 November 2004.