ABSTRACT
Elucidating the human transcriptional regulatory network is a challenge of the post-genomic era. Technical progress so far is impressive, including detailed understanding of regulatory mechanisms for at least a few genes in multicellular organisms, rapid and precise localization of regulatory regions within extensive regions of DNA by means of cross-species comparison, and de novo determination of transcription-factor binding specificities from large-scale yeast expression data. Here we address two problems involved in extending these results to the human genome: first, it has been unclear how many model organism genomes will be needed to delineate most regulatory regions; and second, the discovery of transcription-factor binding sites (response elements) from expression data has not yet been generalized from single-celled organisms to multicellular organisms. We found that 98% (74/75) of experimentally defined sequence-specific binding sites of skeletal-muscle-specific transcription factors are confined to the 19% of human sequences that are most conserved in the orthologous rodent sequences. We also found that, under this restriction, the binding specificities of all three major muscle-specific transcription factors (MYF, SRF and MEF2) can be computationally identified.
Subjects
Human Genome , Mice/genetics , Nucleic Acid Regulatory Sequences , Algorithms , Animals , Base Sequence , Consensus Sequence , Gene Expression Regulation , Humans , Genetic Models , Sequence Alignment , Genetic Transcription
ABSTRACT
GenBank, the national repository for nucleotide sequence data, has implemented a new model of scientific data management, which we term electronic data publishing. In traditional publishing, both scientific conclusions and supporting data are communicated via the printed page, and in electronic journal publishing, both types of information are communicated via electronic media. In electronic data publishing, by contrast, conclusions are published in a journal while data are published via a network-accessible, electronic database.
Subjects
Factual Databases , Electronics , Publishing , Base Sequence , DNA/genetics , Data Collection/methods , Human Genome Project , Humans , Software
ABSTRACT
Discovering new genes, and their functions, can be aided not only by special purpose gene (and coding region) finding software, but also by searches in key databases, and by programs for finding particular sites relevant to gene expression, such as promoters and splice sites. No one software package includes all the necessary tools. I describe here the main kinds of tools; their working principles, strengths and limitations; and how combined evidence from multiple tools can aid in optimum gene identification.
Subjects
Computational Biology , Factual Databases , Genes , Amino Acid Sequence , Animals , Base Sequence , Codon , DNA/chemistry , Exons , Humans , Molecular Sequence Data , Nucleic Acid Repetitive Sequences , Software
ABSTRACT
Myocyte-specific enhancer factor 2 (MEF2) is a family of closely related transcription factors that play a key role in the differentiation of muscle tissues and are important in the muscle-specific expression of a number of genes. Given the centrality of MEF2 in muscle differentiation, regulatory regions newly determined to be muscle specific are often studied for potential MEF2 binding sites. Possible sites are often located by comparison to a homologous gene or by matching to the consensus MEF2 sequence. Enough data have accumulated that a richer description of the MEF2 binding site, a position weight matrix, can be reliably constructed and its usefulness can be assessed. It was shown that scores from such a matrix approximate MEF2 binding energy and enable recognition of naturally occurring MEF2 sites with high sensitivity and specificity. Regulation of genes via MEF2-like sites is complicated by the fact that a number of transcription factors are involved. Not only is MEF2 itself a family of proteins, but several other, nonhomologous, transcription factors overlap MEF2 in DNA-binding specificity. Thus, more quantitative methods for recognizing potential sites may help with the lengthy process of disentangling the complex regulatory circuits of muscle-specific expression.
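The position weight matrix idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' matrix: the aligned sites below are hypothetical toy strings, the uniform background of 0.25 and the pseudocount of 1 are assumptions, and the function names are invented for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per column) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned sites, for illustration only (not curated MEF2 data).
toy_sites = ["CTAAAAATAG", "CTATTTATAG", "TTAAAAATAA", "CTAAATATAG"]
toy_pwm = build_pwm(toy_sites)
```

Calling `best_hit(toy_pwm, some_sequence)` scans every window and reports the best offset; a consensus-string match, by contrast, would give only a yes/no answer per window.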
Subjects
DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism , Amino Acid Sequence , Animals , Binding Sites/genetics , Biometry , DNA/metabolism , Humans , MEF2 Transcription Factors , Molecular Sequence Data , Muscles/metabolism , Site-Directed Mutagenesis , Myogenic Regulatory Factors
ABSTRACT
With the growing number of completely sequenced bacterial genes, accurate gene prediction in bacterial genomes remains an important problem. Although the existing tools predict genes in bacterial genomes with high overall accuracy, their ability to pinpoint the translation start site remains unsatisfactory. In this paper, we present a novel approach to bacterial start site prediction that takes into account multiple features of a potential start site, viz., ribosome binding site (RBS) binding energy, distance of the RBS from the start codon, distance from the beginning of the maximal ORF to the start codon, the start codon itself and the coding/non-coding potential around the start site. Mixed integer programming was used to optimize the discriminatory system. The accuracy of this approach is up to 90%, compared to 70%, using the most common tools in fully automated mode (that is, without expert human post-processing of results). The approach is evaluated using Bacillus subtilis, Escherichia coli and Pyrococcus furiosus. These three genomes cover a broad spectrum of bacterial genomes, since B. subtilis is a Gram-positive bacterium, E. coli is a Gram-negative bacterium and P. furiosus is an archaebacterium. A significant problem is generating a set of 'true' start sites for algorithm training, in the absence of experimental work. We found that sequence conservation between P. furiosus and the related Pyrococcus horikoshii clearly delimited the gene start in many cases, providing a sufficient training set.
Subjects
Initiator Codon , Bacterial Genome , Protein Biosynthesis , Algorithms , Amino Acid Sequence , Bacillus subtilis/genetics , Conserved Sequence , Escherichia coli/genetics , Molecular Sequence Data , Pyrococcus furiosus/genetics , Amino Acid Sequence Homology
ABSTRACT
We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.
Subjects
Base Sequence/genetics , DNA/genetics , Genes/genetics , Algorithms , Base Composition , Factual Databases , Discriminant Analysis , Humans , Open Reading Frames/genetics , Proteins/genetics
ABSTRACT
For many newly sequenced genes, sequence analysis of the putative protein yields no clue on function. It would be beneficial to be able to identify in the genome the regulatory regions that confer temporal and spatial expression patterns for the uncharacterized genes. Additionally, it would be advantageous to identify regulatory regions within genes of known expression pattern without performing the costly and time consuming laboratory studies now required. To achieve these goals, the wealth of case studies performed over the past 15 years will have to be collected into predictive models of expression. Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression. However, potential binding sites for these factors occur with sufficient frequency that it is rare for a gene to be found without one. Analysis of experimentally determined muscle regulatory sequences indicates that muscle expression requires multiple elements in close proximity. A model is generated with predictive capability for identifying these muscle-specific regulatory modules. Phylogenetic footprinting, the identification of sequences conserved between distantly related species, complements the statistical predictions. Through the use of logistic regression analysis, the model promises to be easily modified to take advantage of the elucidation of additional factors, cooperation rules, and spacing constraints.
Subjects
Gene Expression Regulation , Skeletal Muscle/metabolism , Nucleic Acid Regulatory Sequences , Transcription Factors/metabolism , Binding Sites , DNA Footprinting , Genetic Complementation Test , Genome , Mathematical Computing , Molecular Models , Phylogeny , Transcription Factors/genetics
ABSTRACT
A complex network of regulatory controls governs the patterns of gene expression. Enabled by the tools of molecular cloning, initial experimental queries into the gene regulatory network elucidated a wide array of transcription factors and their cognate binding sites from hundreds of genes. The recent fusion of genome-scale experimental tools, a more comprehensive gene catalog, and concomitant advances in computational methodology, has extended the range of questions being posed. The potential to further our understanding of the biochemical mechanisms of transcriptional regulation and to accelerate the delineation of regulatory control regions in the human genome is enormous.
Subjects
Computational Biology , Nucleic Acid Regulatory Sequences/genetics , Transcription Factors/metabolism , Genetic Transcription/genetics , Animals , Base Sequence , Binding Sites , DNA Footprinting , DNA-Binding Proteins/metabolism , Humans , Phylogeny , Genetic Promoter Regions/genetics
ABSTRACT
The MEF2 and MyoD families of transcriptional regulatory factors both play central roles in the terminal differentiation of skeletal muscle. Further, binding sites for the two families often occur nearby, and there have been a number of indications that members of the two families may bind coordinately. The present study provides evidence that known binding sites for the two occur with precise geometric restrictions related to the DNA helical repeat unit, that pairs of putative sites following these restrictions are indicative of skeletal muscle-specific transcriptional regulatory regions, and that the geometric relationship can help provide a consistent interpretation for data that has until now been difficult to explain.
Subjects
DNA-Binding Proteins/metabolism , Myogenin/metabolism , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites , Biological Evolution , Conserved Sequence , DNA-Binding Proteins/genetics , Genetic Enhancer Elements , Humans , MEF2 Transcription Factors , Molecular Sequence Data , Myogenic Regulatory Factors , Myogenin/genetics , Oligodeoxyribonucleotides , Transcription Factors/genetics , Genetic Transcription
ABSTRACT
The length of an open reading frame (ORF) is one important piece of evidence often used in locating new genes, particularly in organisms where splicing is rare. However, there have been no systematic studies quantifying the degree of correlation between length of ORF, on the one hand, and likelihood of gene function, on the other. In this paper, techniques are derived to estimate the conditional probability of gene function, given ORF length, based on evidence both from the databases and from simulation. Several complete chromosomes of Saccharomyces cerevisiae have now been sequenced, and considerable effort is being expended on locating and characterizing the genes in these sequences. Thus, we illustrate the techniques for this organism.
Subjects
Fungal Chromosomes , Factual Databases , Genes , Open Reading Frames , Saccharomyces cerevisiae/genetics , Amino Acid Sequence , Base Sequence , Fungal Proteins/chemistry , Fungal Proteins/genetics , Protein Biosynthesis , RNA Splicing
ABSTRACT
SCORE, a program for computer-assisted scoring of Southern blots of clone DNA, retains the use of expert human judgment while taking over much of the drudgery of the scoring task. The primary functions of the program are to help make an aligned overlay of the fluorescence gel image and the autoradiogram blot image, to keep track of band and lane locations and to store the resulting data directly into a database. Use of SCORE has resulted in greatly increased efficiency and accuracy.
Subjects
Southern Blotting , Software , Autoradiography , Chromosome Mapping/methods , DNA Fingerprinting/methods , Agar Gel Electrophoresis , Humans , Computer-Assisted Image Processing/methods
Subjects
DNA/genetics , DNA Sequence Analysis/methods , Base Sequence , Codon/genetics , Computational Biology , Computer Communication Networks , Factual Databases , Molecular Sequence Data , Transfer RNA/genetics , Nucleic Acid Repetitive Sequences , DNA Sequence Analysis/statistics & numerical data , Software
ABSTRACT
We show how to speed up sequence alignment algorithms of the type introduced by Needleman and Wunsch (and generalized by Sellers and others). Faster alignment algorithms have been introduced, but always at the cost of possibly getting sub-optimal alignments. Our modification results in the optimal alignment still being found, often in 1/10 the usual time. What we do is reorder the computation of the usual alignment matrix so that the optimal alignment is ordinarily found when only a small fraction of the matrix is filled. The number of matrix elements which have to be computed is related to the distance between the sequences being aligned; the better the optimal alignment, the faster the algorithm runs.
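The abstract does not spell out the reordering, so the sketch below should not be read as the paper's algorithm. It illustrates a related, well-known idea under simplifying assumptions (unit costs for mismatch and gap, distance only, no traceback): fill only the cells near the main diagonal, and widen the band until the result is provably optimal. The function names are hypothetical.

```python
def banded_distance(a, b, limit):
    """Return the edit distance of a and b if it is <= limit, else None.

    Only cells with |i - j| <= limit are filled, so similar sequences
    need a small fraction of the full matrix.
    """
    n, m = len(a), len(b)
    if abs(n - m) > limit:
        return None
    inf = float("inf")
    prev = {j: j for j in range(min(m, limit) + 1)}  # row 0 of the matrix
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - limit), min(m, i + limit) + 1):
            if j == 0:
                cur[j] = i  # first column: i deletions
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),  # match or mismatch
                prev.get(j, inf) + 1,                           # gap in b
                cur.get(j - 1, inf) + 1,                        # gap in a
            )
        prev = cur
    d = prev.get(m, inf)
    return d if d <= limit else None

def edit_distance(a, b):
    """Exact edit distance, found by doubling the band width until it suffices."""
    limit = 1
    while True:
        d = banded_distance(a, b, limit)
        if d is not None:
            return d
        limit *= 2
```

A band result no larger than `limit` is guaranteed optimal, because any alignment of that cost cannot leave the band. As in the abstract, the closer the two sequences, the fewer matrix elements are computed.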
Subjects
Base Sequence , Nucleic Acids , Computers , Information Systems
ABSTRACT
The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.
Subjects
Genes/genetics , Base Sequence/genetics , Codon/genetics , Exons/genetics , Gene Expression/genetics , Genetic Models , Sequence Homology
ABSTRACT
One expects that in DNA without protein coding function, stop codons (which constitute three of the 64 possible codons) should occur frequently in all reading frames, and that a long open reading frame (ORF) can be interpreted as a sign for the existence of a gene. We make a beginning on introducing quantitative measures of confidence into this inference--taking Saccharomyces cerevisiae as a sample case--and show that some common assumptions can reasonably be questioned. In particular we show that statistical support for the biological function of shorter ORFs listed as putative genes in recent papers is in fact very weak. This is an issue of practical as well as theoretical interest, since researching the function of a putative gene is difficult and expensive.
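The starting intuition (three stop codons out of 64) can be made concrete with a small calculation. The sketch assumes all 64 codons are equally likely and independent, which ignores base composition and is therefore only a first approximation; the function name is hypothetical.

```python
def prob_no_stop(n_codons: int, p_stop: float = 3 / 64) -> float:
    """Chance that n_codons consecutive random codons contain no stop codon.

    Assumes independent codons with a fixed stop probability (3/64 by default).
    """
    return (1.0 - p_stop) ** n_codons
```

Under this assumption a run of 100 stop-free codons arises by chance in under 1% of trials at a given position, but a genome offers a very large number of positions and six reading frames, which is one reason short ORFs alone are weak evidence of a gene.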
Subjects
Genes , Open Reading Frames , Base Composition , Yeast Artificial Chromosomes , Fungal DNA/genetics , Fungal Genes , Genetic Models , Saccharomyces cerevisiae/genetics
ABSTRACT
We give a test for protein coding regions which is based on simple and universal differences between protein-coding and noncoding DNA. The test is simple enough to use without a computer and is completely objective. The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time. We predict some new coding and noncoding regions in published sequences.
Subjects
DNA/genetics , Genes , Proteins/genetics , Computers , Genetic Models , Probability
ABSTRACT
A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However, there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
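One way an oligomer-counting coding measure might look is sketched below. This is an illustrative assumption, not the benchmarked measure from the paper: it ignores reading frame, uses a pseudocount of 1, and scores a window by the summed log-ratio of oligomer frequencies in coding versus noncoding training sequence. All names are hypothetical.

```python
import math
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer in a list of training sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def oligomer_score(window, coding, noncoding, k, pseudocount=1.0):
    """Log-likelihood ratio of a window: positive values look more coding-like."""
    n_kmers = 4 ** k
    coding_total = sum(coding.values()) + pseudocount * n_kmers
    noncoding_total = sum(noncoding.values()) + pseudocount * n_kmers
    total = 0.0
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        p_coding = (coding[kmer] + pseudocount) / coding_total
        p_noncoding = (noncoding[kmer] + pseudocount) / noncoding_total
        total += math.log(p_coding / p_noncoding)
    return total
```

In practice the tables would be trained on large annotated sets with `k = 6`; with the tiny strings used for testing here, `k = 3` keeps the counts meaningful.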
Subjects
Base Sequence , DNA/genetics , Genes , Genetic Techniques , Proteins/genetics , Algorithms , Base Composition , Codon/genetics , Exons , Fourier Analysis , Humans
ABSTRACT
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
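The decomposition step can be sketched as fitting a two-component normal mixture, with the mixing weight of the coding component read off as the coding-density estimate. The abstract does not say how the decomposition was computed; the expectation-maximization fit below, its quartile-based starting values, and the synthetic test scores are all assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def fit_two_normals(values, iterations=200):
    """Fit a two-normal mixture by EM; returns (weights, means, sigmas)."""
    xs = sorted(values)
    n = len(xs)
    mu = [xs[n // 4], xs[3 * n // 4]]          # start at the quartiles
    spread = (xs[-1] - xs[0]) / 4 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each window score
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and spread of each component
        for k in (0, 1):
            rk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / rk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / rk
            sigma[k] = max(math.sqrt(var), 1e-6)
            weight[k] = rk / n
    return weight, mu, sigma

def synthetic_scores(n_noncoding, n_coding, seed=0):
    """Made-up window scores: noncoding ~ N(0, 1), coding ~ N(5, 1)."""
    rng = random.Random(seed)
    return ([rng.gauss(0.0, 1.0) for _ in range(n_noncoding)]
            + [rng.gauss(5.0, 1.0) for _ in range(n_coding)])
```

With `weights, means, sigmas = fit_two_normals(scores)`, the weight of the component with the higher mean plays the role of the estimated coding fraction of the windows. Real coding statistics overlap far more than this synthetic example, which is why characterized error bounds matter.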
Subjects
Base Composition , Codon , DNA/chemistry , Proteins/genetics , Animals , Caenorhabditis elegans/genetics , Cosmids , DNA/analysis , Fungal Genes , Humans , DNA Sequence Analysis , Statistics as Topic
ABSTRACT
The nucleic acid sequence databases of Los Alamos National Laboratory, European Molecular Biology Laboratory, and others are organized in a single relational database. This organization with a suitable relational database management program facilitates the tasks of reporting statistics, making cross-references, and double-checking of the original databases.
Subjects
Base Sequence , Information Systems , Nucleic Acids
ABSTRACT
We model the base compositional structure of the human and Escherichia coli genomes. Three particular properties are first quantified: (1) There is a significant tendency for any region of either genome to have a strand-symmetric base composition. (2) The variation in base composition from region to region, within each genome, is very much larger than expected from common homogeneous stochastic models. (3) A given local base composition tends to persist over a scale of at least kilobases (E. coli) or tens of kilobases (human). Multidomain stochastic models from the literature are reviewed and sharpened. In particular, quantitative measurements of the third property lead us to suggest a significant shift in the style of domain models, in which the variation of A+T content with position is modeled by a random walk with frequent small steps rather than with large quantum jumps. As an application, we suggest a way to reduce the amount of computation in the assembly of large sequences from sequences of randomly chosen fragments.