Results 1 - 20 of 30
1.
Bioinformatics ; 17(4): 300-8, 2001 Apr.
Article in English | MEDLINE | ID: mdl-11301298

ABSTRACT

MOTIVATION AND RESULTS: A relational schema is described for capturing highly parallel gene expression experiments using different technologies. This schema grew out of efforts to build a database for collaborators working on different biological systems and using different types of platforms in their gene expression experiments as well as different types of image quantification software. The tables are conceptually organized into three categories of information: Platform, Experiment (which includes image scanning and quantification), and Data. The strengths of the schema are: (i) integrating information on array elements using a gene index; (ii) describing samples using ontologies; (iii) reducing an experiment to a single RNA source for precise descriptions yet not losing the relationships between experiments done at the same time or for the same project; and (iv) maintaining both raw and processed (e.g. cleansed and normalized) data and recording how the data is processed. The result is a novel schema, which can hold both array and non-array data, is extensible for detailed experimental descriptions that are precise and consistent, and allows for meaningful comparisons of genes between experiments.


Subjects
Databases, Factual; Gene Expression; Oligonucleotide Array Sequence Analysis
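The three categories named in the abstract above (Platform, Experiment, Data) can be pictured with a small relational sketch. This is only an illustration under stated assumptions: every table and column name below (`platform`, `array_element`, `experiment`, `expression_data`, `gene_index_id`, and so on) is hypothetical and is not taken from the published schema, and the inserted rows are made-up demo values.

```python
import sqlite3

# Hypothetical, heavily simplified tables; names and columns are illustrative only.
SCHEMA = """
CREATE TABLE platform (
    platform_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    technology  TEXT NOT NULL
);
CREATE TABLE array_element (
    element_id    INTEGER PRIMARY KEY,
    platform_id   INTEGER NOT NULL REFERENCES platform(platform_id),
    gene_index_id TEXT NOT NULL      -- link to a gene index
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    platform_id   INTEGER NOT NULL REFERENCES platform(platform_id),
    rna_source    TEXT NOT NULL,     -- one RNA source per experiment
    project       TEXT               -- groups related experiments
);
CREATE TABLE expression_data (
    experiment_id   INTEGER NOT NULL REFERENCES experiment(experiment_id),
    element_id      INTEGER NOT NULL REFERENCES array_element(element_id),
    raw_value       REAL NOT NULL,   -- raw quantification
    processed_value REAL,            -- cleansed or normalized value
    processing_note TEXT,            -- how the processed value was derived
    PRIMARY KEY (experiment_id, element_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO platform VALUES (1, 'demo array', 'microarray')")
    conn.execute("INSERT INTO array_element VALUES (10, 1, 'GENE_A')")
    conn.executemany(
        "INSERT INTO experiment VALUES (?, 1, ?, 'demo project')",
        [(100, "sample 1"), (101, "sample 2")],
    )
    conn.executemany(
        "INSERT INTO expression_data VALUES (?, 10, ?, ?, 'scaled by 0.5')",
        [(100, 200.0, 100.0), (101, 800.0, 400.0)],
    )
    return conn


def values_for_gene(conn: sqlite3.Connection, gene_index_id: str):
    """Return (rna_source, processed_value) pairs for one gene across experiments."""
    rows = conn.execute(
        """SELECT e.rna_source, d.processed_value
           FROM expression_data d
           JOIN array_element a ON a.element_id = d.element_id
           JOIN experiment e ON e.experiment_id = d.experiment_id
           WHERE a.gene_index_id = ?
           ORDER BY e.experiment_id""",
        (gene_index_id,),
    )
    return rows.fetchall()
```

Calling `values_for_gene(build_demo_db(), "GENE_A")` joins through the gene index, which is the kind of cross-experiment comparison the abstract says the schema is meant to allow. Keeping `raw_value`, `processed_value` and `processing_note` side by side mirrors the stated goal of retaining both raw and processed data together with a record of the processing.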
2.
Bioinformatics ; 16(8): 685-98, 2000 Aug.
Article in English | MEDLINE | ID: mdl-11099255

ABSTRACT

MOTIVATION: A protocol is described to attach expression patterns to genes represented in a collection of hybridization array experiments. Discrete values are used to provide an easily interpretable description of differential expression. Binning cutoffs for each sample type are chosen automatically, depending on the desired false-positive rate for the predictions of differential expression. Confidence levels are derived for the statement that changes in observed levels represent true changes in expression. We have a novel method for calculating this confidence, which gives better results than the standard methods. Our method reflects the broader change of focus in the field from studying a few genes with many replicates to studying many (possibly thousands of) genes simultaneously, but with relatively few replicates. Our approach differs from standard methods in that it exploits the fact that there are many genes on the arrays. These are used to estimate for each sample type an appropriate distribution that is employed to control the false-positive rate of the predictions made. Satisfactory results can be obtained using this method with as few as two replicates. RESULTS: The method is illustrated through applications to macroarray and microarray datasets. The first is an erythroid development dataset that we have generated using nylon filter arrays. Clones for genes whose expression is known in these cells were assigned expression patterns which are in accordance with what was expected and which are not picked up by the standard methods. Moreover, genes differentially expressed between normal and leukemic cells were identified. These included genes whose expression was altered upon induction of the leukemic cells to differentiate. The second application is to the microarray data by Alizadeh et al. (2000). Our results are in accordance with their major findings and offer confidence measures for the predictions made. They also provide new insights for further analysis.


Subjects
Databases, Factual; Gene Expression Profiling; Oligonucleotide Array Sequence Analysis; Algorithms; Humans; Leukemia, Erythroblastic, Acute/genetics; Nylons; Tumor Cells, Cultured
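The idea of choosing a binning cutoff from a desired false-positive rate, using the many genes on an array, can be illustrated with a deliberately simple stand-in. The sketch below is not the authors' procedure: it assumes that differences between two replicates of the same sample type, pooled over all genes, can serve as an empirical null distribution, and it takes the cutoff as a quantile of that pool. The function names are hypothetical.

```python
from typing import Sequence


def cutoff_from_replicates(rep1: Sequence[float], rep2: Sequence[float],
                           false_positive_rate: float) -> float:
    """Return a threshold that at most the requested fraction of
    replicate-vs-replicate differences (the assumed null) exceed."""
    if len(rep1) != len(rep2) or not rep1:
        raise ValueError("replicates must be non-empty and of equal length")
    if not 0.0 <= false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in [0, 1)")
    # Pool the absolute differences over all genes and sort them.
    null = sorted(abs(a - b) for a, b in zip(rep1, rep2))
    # Number of null differences that may lie strictly above the cutoff.
    allowed = int(false_positive_rate * len(null))
    return null[len(null) - allowed - 1]


def bin_change(diff: float, cutoff: float) -> int:
    """Discretize a between-sample difference: +1 up, -1 down, 0 unchanged."""
    if diff > cutoff:
        return 1
    if diff < -cutoff:
        return -1
    return 0
```

With ten genes whose replicate differences are 1 through 10 and a rate of 0.2, the cutoff is 8, so only the two largest null differences would be called as changes. A real analysis would work on suitably transformed intensities and would need far more genes than this toy case.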
3.
Science ; 288(5471): 1635-40, 2000 Jun 02.
Article in English | MEDLINE | ID: mdl-10834841

ABSTRACT

Blood cell production originates from a rare population of multipotent, self-renewing stem cells. A genome-wide gene expression analysis was performed in order to define regulatory pathways in stem cells as well as their global genetic program. Subtracted complementary DNA libraries from highly purified murine fetal liver stem cells were analyzed with bioinformatic and array hybridization strategies. A large percentage of the several thousand gene products that have been characterized correspond to previously undescribed molecules with properties suggestive of regulatory functions. The complete data, available in a biological process-oriented database, represent the molecular phenotype of the hematopoietic stem cell.


Subjects
Gene Expression Profiling; Genes; Hematopoietic Stem Cells/physiology; Proteins/genetics; Proteins/physiology; Amino Acid Sequence; Animals; Computational Biology; Databases, Factual; Expressed Sequence Tags; Gene Library; Hematopoietic Stem Cells/chemistry; Hematopoietic Stem Cells/cytology; Liver/cytology; Liver/embryology; Membrane Proteins/chemistry; Membrane Proteins/genetics; Membrane Proteins/physiology; Mice; Molecular Sequence Data; Polymerase Chain Reaction; Proteins/chemistry; Signal Transduction; Transcription Factors/chemistry; Transcription Factors/genetics; Transcription Factors/physiology
4.
Nucleic Acids Res ; 28(1): 298-301, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592253

ABSTRACT

Transcription Regulatory Regions Database (TRRD) has been developed for accumulation of experimental information on the structure-function features of regulatory regions of eukaryotic genes. Each entry in TRRD corresponds to a particular gene and contains a description of structure-function features of its regulatory regions (transcription factor binding sites, promoters, enhancers, silencers, etc.) and gene expression regulation patterns. The current release, TRRD 4.2.5, comprises the description of 760 genes, 3403 expression patterns, and >4600 regulatory elements including 3604 transcription factor binding sites, 600 promoters and 152 enhancers. This information was obtained through annotation of 2537 scientific publications. TRRD 4.2.5 is available through the WWW at http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/


Subjects
Databases, Factual; Transcription, Genetic; Enhancer Elements, Genetic; Internet; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid
5.
Biofizika ; 44(4): 649-54, 1999.
Article in Russian | MEDLINE | ID: mdl-10544815

ABSTRACT

A systemic approach is proposed, which makes it possible to increase the accuracy of recognition of functional sites in arbitrary DNA sequences. The approach is based on the Central limit theorem and consists in the averaging of a large number of recognitions of a particular site. To obtain a rather large number of recognitions within the framework of conventional methods of recognition, consensus, and frequency matrix, 20 novel oligonucleotide alphabets were used. The approach was used to study the binding sites of GATA-1 and C/EBP transcription factors. It was found that the averaged recognition of these sites is more precise than each of specific recognitions, which just follows from the Central limit theorem.


Subjects
DNA/metabolism; Genome, Human; Base Sequence; Binding Sites; CCAAT-Enhancer-Binding Proteins; DNA/genetics; DNA-Binding Proteins/metabolism; Erythroid-Specific DNA-Binding Factors; GATA1 Transcription Factor; Humans; Nuclear Proteins/metabolism; Transcription Factors/metabolism
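The averaging argument in the abstract above can be illustrated with a small simulation. It is a sketch under assumptions that are not in the source: each recognizer is modeled as returning the true score plus independent Gaussian noise, and `averaged_recognition` and `simulate` are hypothetical names. Real recognitions built on different oligonucleotide alphabets are unlikely to be fully independent, so the gain shown here is an idealized upper bound.

```python
import random
import statistics
from typing import Sequence, Tuple


def averaged_recognition(scores: Sequence[float]) -> float:
    """Mean of several recognition scores for one candidate site."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


def simulate(n_recognizers: int = 20, n_trials: int = 2000,
             true_score: float = 1.0, noise_sd: float = 0.5,
             seed: int = 0) -> Tuple[float, float]:
    """Return (spread of a single recognizer, spread of the averaged score)."""
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(n_trials):
        # Assumed model: independent Gaussian noise around the true score.
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_recognizers)]
        single.append(scores[0])
        averaged.append(averaged_recognition(scores))
    return statistics.stdev(single), statistics.stdev(averaged)
```

Under the independence assumption the spread of the averaged score shrinks roughly as one over the square root of the number of recognizers, which is the sense in which averaging many recognitions can be more precise than any single one.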
6.
Bioinformatics ; 15(7-8): 631-43, 1999.
Article in English | MEDLINE | ID: mdl-10487871

ABSTRACT

MOTIVATION: Recognition of functional sites remains a key event in the course of genomic DNA annotation. It is well known that a number of sites have their own specific oligonucleotide content. This pinpoints the fact that the preference of the site-specific nucleotide combinations at adjacent positions within an analyzed functional site could be informative for this site recognition. Hence, Web-available resources describing the site-specific oligonucleotide content of the functional DNA sites and applying the above approach for site recognition are needed. However, they have been poorly developed up to now. RESULTS: To describe the specific oligonucleotide content of the functional DNA sites, we introduce the oligonucleotide alphabets, out of which the frequency matrix for a given site could be constructed in addition to a traditional nucleotide frequency matrix. Thus, site recognition accuracy increases. This approach was implemented in the activated MATRIX database accumulating oligonucleotide frequency matrices of the functional DNA sites. We have demonstrated that the false-positive error of the functional site recognition decreases if the oligonucleotide frequency matrices are added to the nucleotide frequency matrices commonly used. AVAILABILITY: The MATRIX database is available on the Web, http://wwwmgs.bionet.nsc.ru/Dbases/MATRIX/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/consfreq/.


Subjects
DNA/genetics; DNA/metabolism; Databases, Factual; Algorithms; Base Sequence; Binding Sites/genetics; DNA-Binding Proteins/metabolism; Genome; Molecular Sequence Data; NFI Transcription Factors; Oligodeoxyribonucleotides/genetics; Transcription Factors/metabolism
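A position-specific frequency matrix over short oligonucleotides, as opposed to single nucleotides, can be sketched as follows. The sketch assumes plain overlapping k-mers taken from equal-length aligned sites; the "oligonucleotide alphabets" of the paper may be defined differently, and `kmer_frequency_matrix` is a hypothetical name rather than anything from the MATRIX database.

```python
from collections import Counter
from typing import Dict, List


def kmer_frequency_matrix(sites: List[str], k: int = 2) -> List[Dict[str, float]]:
    """For each start position, return the relative frequency of every k-mer
    observed there across the aligned sites (k=1 gives the usual
    nucleotide frequency matrix)."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    if not 1 <= k <= length:
        raise ValueError("k must be between 1 and the site length")
    matrix = []
    for pos in range(length - k + 1):
        # Count the k-mer starting at this position in every site.
        counts = Counter(site[pos:pos + k].upper() for site in sites)
        total = sum(counts.values())
        matrix.append({kmer: n / total for kmer, n in counts.items()})
    return matrix
```

For the four made-up sites `ACGT`, `ACGA`, `TCGT` and `ACTT`, the dinucleotide matrix has three columns, and the first one records `AC` at frequency 0.75 and `TC` at 0.25. Unlike a mononucleotide matrix, each column here captures a dependency between adjacent positions.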
7.
Bioinformatics ; 15(7-8): 669-86, 1999.
Article in English | MEDLINE | ID: mdl-10487874

ABSTRACT

MOTIVATION: The goal of the work was to develop a WWW-oriented computer system providing a maximal integration of informational and software resources on the regulation of gene expression and navigation through them. Rapid growth of the variety and volume of information accumulated in the databases on regulation of gene expression necessarily requires the development of computer systems for automated discovery of the knowledge that can be further used for analysis of regulatory genomic sequences. RESULTS: The GeneExpress system developed includes the following major informational and software modules: (1) Transcription Regulation (TRRD) module, which contains the databases on transcription regulatory regions of eukaryotic genes and TRRD Viewer for data visualization; (2) Site Activity Prediction (ACTIVITY), the module for analysis of functional site activity and its prediction; (3) Site Recognition module, which comprises (a) B-DNA-VIDEO system for detecting the conformational and physicochemical properties of DNA sites significant for their recognition, (b) Consensus and Weight Matrices (ConsFrec) and (c) Transcription Factor Binding Sites Recognition (TFBSR) systems for detecting conservative contextual regions of functional sites and their recognition; (4) Gene Networks (GeneNet), which contains an object-oriented database accumulating the data on gene networks and signal transduction pathways, and the Java-based Viewer for exploration and visualization of the GeneNet information; (5) mRNA Translation (Leader mRNA), designed to analyze structural and contextual properties of mRNA 5'-untranslated regions (5'-UTRs) and predict their translation efficiency; (6) other program modules designed to study the structure-function organization of regulatory genomic sequences and regulatory proteins. AVAILABILITY: GeneExpress is available at http://wwwmgs.bionet.nsc.ru/systems/GeneExpress/ and the links to the mirror site(s) can be found at http://wwwmgs.bionet.nsc.ru/mgs/links/mirrors.html.


Subjects
Computer Systems; Databases, Factual; Gene Expression; Algorithms; Artificial Intelligence; Base Sequence; Binding Sites/genetics; Chemical Phenomena; Chemistry, Physical; DNA/chemistry; DNA/genetics; DNA/metabolism; Eukaryotic Cells; Internet; Nucleic Acid Conformation; Promoter Regions, Genetic; Protein Biosynthesis; RNA, Messenger/genetics; Software; TATA Box; Transcription Factors/metabolism
8.
Bioinformatics ; 15(7-8): 654-68, 1999.
Article in English | MEDLINE | ID: mdl-10487873

ABSTRACT

MOTIVATION: A reliable recognition of transcription factor binding sites is essential for analysis of regulatory genomic sequences. The experimental data make evident an important role of DNA conformational features for site functioning. However, Internet-available tools for revealing conformational and physicochemical DNA features significant for the site functioning and subsequent use of these features for site recognition have not been developed up to now. RESULTS: We suggest an approach for revealing significant conformational and physicochemical properties of functional sites implemented in the database B-DNA-VIDEO. This database is designed to study the sets of various transcription factor binding sites, providing evidence that transcription factor binding sites are characterized by specific sets of significant conformational and physicochemical DNA properties. For a fixed site, by using the B-DNA features selected for this site recognition, the C-program recognizing this site may be generated, control tested and stored in the database B-DNA-VIDEO. Each B-DNA-VIDEO entry links to the Web-applet recognizing the site, whose significant B-DNA features are stored in this entry as the 'site recognition programs'. The pairwise linked entry-applet pairs are compiled within the B-DNA-VIDEO system, which is simultaneously the database and the program tools package applicable immediately for recognizing the sites stored in the database. Indeed, this is the novelty. Hence, B-DNA-VIDEO is the Web resource of both 'searching for static data' and 'active computation' type, that is why it was called an 'activated database'. AVAILABILITY: B-DNA-VIDEO is available at http://wwwmgs.bionet.nsc.ru/systems/BDNAVideo/ and the mirror site at http://www.cbil.upenn.edu/mgs/systems/consfreq/.


Subjects
DNA/chemistry; DNA/genetics; Databases, Factual; Transcription Factors/metabolism; Base Sequence; Binding Sites/genetics; Chemical Phenomena; Chemistry, Physical; DNA/metabolism; Internet; Molecular Sequence Data; Nucleic Acid Conformation; Software; TATA Box
9.
Bioinformatics ; 15(7-8): 687-703, 1999.
Article in English | MEDLINE | ID: mdl-10487875

ABSTRACT

MOTIVATION: The commonly accepted statistical mechanical theory is now multiply confirmed by using the weight matrix methods successfully recognizing DNA sites binding regulatory proteins in prokaryotes. Nevertheless, the recent evaluation of weight matrix methods application for transcription factor binding site recognition in eukaryotes has unexpectedly revealed that the matrix scores correlate better to each other than to the activity of DNA sites interacting with proteins. This observation points out that molecular mechanisms of DNA/protein recognition are more complicated in eukaryotes than in prokaryotes. As the extra events in eukaryotes, the following processes may be considered: (i) competition between the proteins and nucleosome core particle for DNA sites binding these proteins and (ii) interaction between two synergetic/antagonist proteins recognizing a composed element compiled from two DNA sites binding these proteins. That is why identification of the sequence-dependent DNA features correlating with affinity magnitudes of DNA sites interacting with a protein can pinpoint the molecular event limiting this protein/DNA recognition machinery. RESULTS: An approach for predicting site activity based on its primary nucleotide sequence has been developed. The approach is realized in the computer system ACTIVITY, containing the databases on site activity and on conformational and physicochemical DNA/RNA parameters. By using the system ACTIVITY, an analysis of some sites was provided and the methods for predicting site activity were constructed. The methods developed are in good agreement with the experimental data. AVAILABILITY: The database ACTIVITY is available at http://wwwmgs.bionet.nsc.ru/systems/Activity/ and the mirror site, http://www.cbil.upenn.edu/mgs/systems/activity/.


Subjects
Computer Systems; DNA/genetics; DNA/metabolism; Proteins/metabolism; Algorithms; Animals; Base Sequence; Binding Sites/genetics; Chemical Phenomena; Chemistry, Physical; DNA/chemistry; Databases, Factual; Humans; MADS Domain Proteins; MEF2 Transcription Factors; Molecular Sequence Data; Mutation; Myogenic Regulatory Factors/genetics; Myogenic Regulatory Factors/metabolism; Nucleic Acid Conformation; TATA Box
10.
Nucleic Acids Res ; 27(1): 200-3, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847180

ABSTRACT

EpoDB is a database of genes expressed in vertebrate red blood cells. It is also a prototype for the creation of cell and tissue-specific databases from multiple external sources. The information in EpoDB obtained from GenBank, SWISS-PROT, Transfac, TRRD and GERD is curated to provide high quality data for sequence analysis aimed at understanding gene regulation during erythropoiesis. New protocols have been developed for data integration and updating entries. Using a BLAST-based algorithm, we have grouped GenBank entries representing the same gene together. This sequence similarity protocol was also used to identify new entries to be included in EpoDB. We have recently implemented our database in Sybase (relational tables) in addition to SICStus Prolog to provide us with greater flexibility in asking complex queries that utilize information from multiple sources. New additions to the public web site (http://www.cbil.upenn.edu/epodb) for accessing EpoDB are the ability to retrieve groups of entries representing different variants of the same gene and to retrieve gene expression data. The BLAST query has been enhanced by incorporating BLASTView, an interactive and graphical display of BLAST results. We have also enhanced the queries for retrieving sequence from specified genes by the addition of MEME, a motif discovery tool, to the integrated analysis tools which include CLUSTALW and TESS.


Subjects
Databases, Factual; Erythrocytes/metabolism; Erythropoiesis/genetics; Gene Expression; Animals; Base Sequence; Information Storage and Retrieval; Internet; Sequence Homology; Software; Vertebrates
11.
Bioinformatics ; 15(10): 837-46, 1999 Oct.
Article in English | MEDLINE | ID: mdl-10705436

ABSTRACT

MOTIVATION: The presentation of genomics data in a perspicuous visual format is critical for its rapid interpretation and validation. Relatively few public database developers have the resources to implement sophisticated front-end user interfaces themselves. Accordingly, these developers would benefit from a reusable toolkit of user interface and data visualization components. RESULTS: We have designed the bioWidget toolkit as a set of JavaBean components. It includes a wide array of user interface components and defines an architecture for assembling applications. The toolkit is founded on established software engineering design patterns and principles, including componentry, Model-View-Controller, factored models and schema neutrality. As a proof of concept, we have used the bioWidget toolkit to create three extendible applications: AnnotView, BlastView and AlignView.


Subjects
Databases, Factual; Genome; User-Computer Interface; Amino Acid Sequence; Base Sequence; Computational Biology; Computer Graphics; Computer Simulation; DNA/genetics; Molecular Sequence Data; Proteins/genetics; Sequence Alignment
12.
Pac Symp Biocomput ; : 291-302, 1998.
Article in English | MEDLINE | ID: mdl-9697190

ABSTRACT

We describe a software framework, GAIA, that supports semi-automated annotation of uncharacterized sequence data. The annotation framework incorporates annotation by data source integration, data analysis, and manual data entry. Components of the system include a configurable, open data analysis pipeline, a relational information storage manager, and Java-based graphical user interfaces. We discuss design decisions and tradeoffs in building such a system, and policies and strategies for producing consistent, uniform, high quality annotation.


Subjects
Base Sequence; Chromosomes, Human, Pair 22; Computational Biology/methods; Genome, Human; Genome; Models, Genetic; Software; Computer Graphics; Expressed Sequence Tags; Humans; Physical Chromosome Mapping/methods; Templates, Genetic
13.
Genome Res ; 8(4): 362-76, 1998 Apr.
Article in English | MEDLINE | ID: mdl-9548972

ABSTRACT

We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization.


Subjects
Gene Expression/genetics; Genome, Human; Base Sequence/genetics; Cloning, Molecular; Computational Biology/methods; DNA, Complementary/analysis; Databases, Factual; Exons; False Positive Reactions; Humans; Introns
14.
Genome Res ; 8(3): 234-50, 1998 Mar.
Article in English | MEDLINE | ID: mdl-9521927

ABSTRACT

As increasing amounts of genomic sequence from many organisms become available, and as DNA sequences become a primary reagent in biologic investigations, the role of annotation as a prospective guide for laboratory experiments will expand rapidly. Here we describe a process of high-throughput, reliable annotation, called framework annotation, which is designed to provide a foundation for initial biologic characterization of previously unexamined sequence. To examine this concept in practice, we have constructed Genome Annotation and Information Analysis (GAIA), a prototype software architecture that implements several elements important for framework annotation. The center of GAIA consists of an annotation database and the associated data management subsystem that forms the software bus along which other components communicate. The schema for this database defines three principal concepts: (1) Entries, consisting of sequence and associated historical data; (2) Features, comprising information of biologic interest; and (3) Experiments, describing the evidence that supports Features. The database permits tracking of annotation results over time, as well as assessment of the reliability of particular results. New framework annotation is produced by CARTA, a set of autonomous sensors that perform automatic analyses and assert results into the annotation database. These results are available via a Web-based query interface that uses graphical Java applets as well as text-based HTML pages to display data at different levels of resolution and permit interactive exploration of annotation. We present results for initial application of framework annotation to a set of test sequences, demonstrating its effectiveness in providing a starting point for biologic investigation, and discuss ways in which the current prototype can be improved. The prototype is available for public use and comment at http://www.cbil.upenn.edu/gaia.


Subjects
Base Sequence; Computational Biology/methods; Human Genome Project; Online Systems; Software; Amino Acid Sequence; Animals; Databases, Factual; Humans; Molecular Sequence Data; Sequence Analysis, DNA/methods
15.
Genome Res ; 8(1): 18-28, 1998 Jan.
Article in English | MEDLINE | ID: mdl-9445484

ABSTRACT

To accelerate gene discovery and facilitate genetic mapping in the protozoan parasite Toxoplasma gondii, we have generated >7000 new ESTs from the 5' ends of randomly selected tachyzoite cDNAs. Comparison of the ESTs with the existing gene databases identified possible functions for more than 500 new T. gondii genes by virtue of sequence motifs shared with conserved protein families, including factors involved in transcription, translation, protein secretion, signal transduction, cytoskeleton organization, and metabolism. Despite this success in identifying new genes, more than 50% of the ESTs correspond to genes of unknown function, reflecting the divergent evolutionary status of this parasite. A newly recognized class of genes was identified based on its similarity to sequences known only from other members of the same phylum, therefore identifying sequences that are apparently restricted to the Apicomplexa. Such genes may underlie pathways common to this group of medically important parasites, therefore identifying potential targets for intervention.


Subjects
Apicomplexa/genetics; Gene Expression; Genes, Protozoan; Multigene Family; Toxoplasma/genetics; Animals; Computational Biology/methods; Conserved Sequence; DNA, Complementary/analysis; Humans; Protozoan Proteins/classification; Protozoan Proteins/genetics; Sequence Homology, Nucleic Acid
16.
Nucleic Acids Res ; 26(1): 288-9, 1998 Jan 01.
Article in English | MEDLINE | ID: mdl-9399855

ABSTRACT

EpoDB is a database designed for the study of gene regulation during differentiation and development of vertebrate red blood cells. In building EpoDB, we have taken the in advance approach to the data integration problem: we have extracted data relevant to red blood cells from GenBank, SWISS-PROT, TRRD (transcriptional regulation data) and GERD (expression levels data) to create a single integrated, highly curated view. Tools have been developed to automate data extraction from online resources, cleanse data of errors, enter information manually from the primary literature, generate a uniform, canonical representation of information and maintain data currency. The database is organized around biological features, e.g., genes, rather than sequences, which are supported by a controlled and consistent vocabulary for gene names and gene family names. Beyond the standard database queries, the functionality of EpoDB includes the ability to extract features and subsequences, display sequences and features graphically using bioWidget viewers and integrated analysis tools. EpoDB may be accessed at: http://cbil.humgen.upenn.edu/epodb/


Subjects
Databases, Factual; Erythropoiesis/genetics; Gene Expression Regulation, Developmental; Animals; Computer Communication Networks; Software; Vertebrates/genetics
17.
Article in English | MEDLINE | ID: mdl-9322048

ABSTRACT

Transcription factors, proteins required for the regulation of gene expression, recognize and bind short stretches of DNA on the order of 4 to 10 bases in length. In general, each factor recognizes a family of "similar" sequences rather than a single unique sequence. Ultimately, the transcriptional state of a gene is determined by the cooperative interaction of several bound factors. We have developed a method using Gibbs Sampling and the Minimum Description Length principle for automatically and reliably creating weight matrix models of binding sites from a database (TRANSFAC) of known binding site sequences. Determining the relationship between sequence and binding affinity for a particular factor is an important first step in predicting whether a given uncharacterized sequence is part of a promoter site or other control region. Here we describe the foundation for the methods we will use to develop weight matrix models for transcription factor binding sites.


Subjects
Algorithms; Models, Biological; Transcription Factors/metabolism; Base Sequence; Binding Sites/genetics; DNA/genetics; DNA/metabolism; Databases, Factual; Markov Chains; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data; Stochastic Processes
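What a weight matrix model of a binding site looks like once the sites are aligned can be shown with a short sketch. It covers only the final, simple step: it does not implement Gibbs Sampling or the Minimum Description Length principle named in the abstract, and it assumes pre-aligned, equal-length sites, a uniform background, and a pseudocount of 1. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def build_weight_matrix(sites: List[str], pseudocount: float = 1.0,
                        background: float = 0.25) -> List[Dict[str, float]]:
    """Log-odds (base 2) weight matrix from aligned sites of equal length."""
    if not sites or any(len(s) != len(sites[0]) for s in sites):
        raise ValueError("sites must be non-empty and of equal length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        # Smoothed frequency of each base, relative to the assumed background.
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return matrix


def score(matrix: List[Dict[str, float]], seq: str) -> float:
    """Sum of per-position weights; seq must contain only A, C, G, T."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match matrix width")
    return sum(column[base] for column, base in zip(matrix, seq))


def best_window(matrix: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max((score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

Scanning a sequence with `best_window` is one simple way such a matrix might be used to flag candidate sites in uncharacterized sequence. The score is only a proxy for binding affinity under the assumption that positions contribute independently.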
18.
J Biol Chem ; 271(3): 1738-47, 1996 Jan 19.
Article in English | MEDLINE | ID: mdl-8576177

ABSTRACT

Factor VII is a vitamin K-dependent coagulation protein essential for proper hemostasis. The human Factor VII gene spans 13 kilobase pairs and is located on chromosome 13 just 2.8 kilobase pairs 5' to the Factor X gene. In this report, we show that Factor VII transcripts are restricted to the liver and that steady state levels of mRNA are much lower than those of Factor X. The major transcription start site is mapped at -51 by RNase protection assay and primer extension experiments. The first 185 base pairs 5' of the translation start site are sufficient to confer maximal promoter activity in HepG2 cells. Protein binding sites are identified at nucleotides -51 to -32, -63 to -58, -108 to -84, and -233 to -215 by DNase I footprint analysis and gel mobility shift assays. A liver-enriched transcription factor, hepatocyte nuclear factor-4 (HNF-4), and a ubiquitous transcription factor, Sp1, are shown to bind within the first 108 base pairs of the promoter region at nucleotide sequences ACTTTG and CCCCTCCCCC, respectively. The importance of these binding sites in promoter activity is demonstrated through independent functional mutagenesis experiments, which show dramatically reduced promoter activity. Transactivation studies with an HNF-4 expression plasmid in HeLa cells also demonstrate the importance of HNF-4 in promoting transcription in non-hepatocyte derived cells. Additionally, the sequence of a naturally occurring allele containing a previously described decanucleotide insert polymorphism at -323 is shown to reduce promoter activity by 33% compared with the more common allelic sequence.


Subjects
Chromosomes, Human, Pair 13; Factor VII/genetics; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid; Amino Acid Sequence; Base Sequence; Basic Helix-Loop-Helix Leucine Zipper Transcription Factors; Cell Nucleus/metabolism; Chromosome Mapping; Consensus Sequence; DNA/chemistry; DNA/metabolism; DNA Footprinting; DNA Primers; DNA-Binding Proteins/metabolism; Deoxyribonuclease I; Factor VII/biosynthesis; Factor X/genetics; Gene Expression; HeLa Cells; Hepatocyte Nuclear Factor 4; Humans; Liver/metabolism; Molecular Sequence Data; Phosphoproteins/metabolism; RNA, Messenger/analysis; RNA, Messenger/biosynthesis; Sp1 Transcription Factor/metabolism; Transcription Factors/metabolism
19.
Comput Appl Biosci ; 10(4): 369-78, 1994 Jul.
Article in English | MEDLINE | ID: mdl-7804870

ABSTRACT

The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1), an Open Systems Interconnection protocol designed for the purpose of exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase) where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases and adaptation to database evolution, this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.


Subjects
Databases, Factual; Software; Algorithms; Amino Acid Sequence; Base Sequence; Database Management Systems; Humans; National Library of Medicine (U.S.); Software Design; United States
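The general problem of turning nested records into relational rows can be sketched in a few lines. This is not SORTEZ and does not parse ASN.1: it assumes the nested data has already been read into Python dictionaries and lists, and the `flatten` function, the table-naming rule and the sample entry are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Tables = Dict[str, List[Dict[str, Any]]]


def flatten(record: Dict[str, Any], table: str, tables: Tables,
            parent: Optional[Tuple[str, int]] = None) -> Tables:
    """Store the scalar fields of `record` as one row of `table`, and move
    nested dicts and list items into child tables linked by a foreign key.
    Assumes no field is itself named "id"."""
    rows = tables.setdefault(table, [])
    row_id = len(rows) + 1                      # surrogate primary key
    row: Dict[str, Any] = {"id": row_id}
    if parent is not None:
        parent_table, parent_id = parent
        row[f"{parent_table}_id"] = parent_id   # foreign key to the parent row
    rows.append(row)
    for key, value in record.items():
        child_table = f"{table}_{key}"
        if isinstance(value, dict):
            flatten(value, child_table, tables, (table, row_id))
        elif isinstance(value, list):
            for item in value:
                # Wrap scalar list items so they also become rows.
                child = item if isinstance(item, dict) else {"value": item}
                flatten(child, child_table, tables, (table, row_id))
        else:
            row[key] = value
    return tables
```

For a made-up entry such as `{"accession": "X00001", "seq": {"length": 120}, "keywords": ["globin"]}`, `flatten(entry, "entry", {})` produces an `entry` table plus `entry_seq` and `entry_keywords` child tables. The mechanical part is easy to automate; deciding which nested structures deserve their own tables, and how to name and type the columns, is where the abstract says domain expertise is still needed.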
20.
J Comput Biol ; 1(1): 3-14, 1994.
Article in English | MEDLINE | ID: mdl-8790449

ABSTRACT

We have developed a general system, QGB, for performing complex queries on the information in the DDBJ/EMBL/GenBank databases, including queries over the structural features of sequences implied in the FEATURE TABLE. Queries are formed in a Structured Query Language (SQL)-like syntax with language extensions to support complex types (e.g., sets, ordered sets, and records) appropriate for representing and querying sequence data. A novel aspect of QGB is its ability to deduce missing features and infer relationships among features as a consequence of constructing a parse tree of sequence structure from information described in the FEATURE TABLE. The grammar for the parse tree is implemented in a customized form of the Definite Clause Grammar syntax of the logic programming language Prolog. The logic grammar formalism was chosen because it provides a perspicuous representation for features and constraints, and Prolog provides an execution model for the grammar rules. Construction of the parse tree also identifies inconsistencies and errors in the FEATURE TABLE that can in some cases be corrected automatically and used to generate an augmented version of the table.


Subjects
Base Sequence; Database Management Systems; Databases, Factual; Information Storage and Retrieval; Hemoglobins/genetics; Humans; Karyotyping; Molecular Sequence Data; Programming Languages