1.
Nucleic Acids Res ; 34(13): 3687-97, 2006.
Article in English | MEDLINE | ID: mdl-16893953

ABSTRACT

Different biological notions of pathways are used in different pathway databases. Those pathway ontologies significantly impact pathway computations. Computational users of pathway databases will obtain different results depending on the pathway ontology used by the databases they employ, and different pathway ontologies are preferable for different end uses. We explore differences in pathway ontologies by comparing the BioCyc and KEGG ontologies. The BioCyc ontology defines a pathway as a conserved, atomic module of the metabolic network of a single organism, i.e. often regulated as a unit, whose boundaries are defined at high-connectivity stable metabolites. KEGG pathways are on average 4.2 times larger than BioCyc pathways, and combine multiple biological processes from different organisms to produce a substrate-centered reaction mosaic. We compared KEGG and BioCyc pathways using genome context methods, which determine the functional relatedness of pairs of genes. For each method we employed, a pair of genes randomly selected from a BioCyc pathway is more likely to be related by that method than is a pair of genes randomly selected from a KEGG pathway, supporting the conclusion that the BioCyc pathway conceptualization is closer to a single conserved biological process than is that of KEGG.


Subject(s)
Computational Biology , Databases, Genetic , Metabolism/genetics , Genomics/methods , Vocabulary, Controlled
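The comparison described in this abstract can be illustrated with a small sketch: for a given pathway and a given genome context method, compute the fraction of gene pairs in the pathway that the method marks as functionally related. The abstract speaks of randomly selected pairs; the sketch below enumerates all pairs instead, which is a simplification. The function name, gene identifiers, and toy data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def related_fraction(pathway_genes, related_pairs):
    """Fraction of unordered gene pairs in one pathway that a method marks as related.

    pathway_genes: iterable of gene identifiers in the pathway.
    related_pairs: set of frozensets, each holding two related gene identifiers.
    """
    pairs = list(combinations(sorted(set(pathway_genes)), 2))
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if frozenset(pair) in related_pairs)
    return hits / len(pairs)


# Hypothetical toy data: three related gene pairs reported by some method.
related = {frozenset(p) for p in [("g1", "g2"), ("g2", "g3"), ("g5", "g6")]}

small_pathway = ["g1", "g2", "g3"]                    # compact, single-process style
large_pathway = ["g1", "g2", "g3", "g4", "g5", "g6"]  # larger mosaic style

print(related_fraction(small_pathway, related))
print(related_fraction(large_pathway, related))
```

In this toy case the smaller pathway scores higher because a larger share of its gene pairs is related, which is the kind of difference the abstract reports between the two ontologies.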
2.
Nucleic Acids Res ; 33(13): 4035-9, 2005.
Article in English | MEDLINE | ID: mdl-16034025

ABSTRACT

We report on a new type of systematic annotation error in genome and pathway databases that results from the misinterpretation of partial Enzyme Commission (EC) numbers such as '1.1.1.-'. This error results in the assignment of genes annotated with a partial EC number to many or all biochemical reactions that are annotated with the same partial EC number. That inference is faulty because of the ambiguous nature of partial EC numbers. We have observed this type of error in multiple databases, including KEGG, VIMSS and IMG, all of which assign genes to KEGG pathways. The Escherichia coli subset of the KEGG database exhibits this error for 6.8% of its gene-reaction assignments. For example, KEGG contains 17 reactions that are annotated with EC 1.1.1.-. A group of three E.coli genes, b1580 [putative dehydrogenase, NAD(P)-binding, starvation-sensing protein], b3787 (UDP-N-acetyl-D-mannosaminuronic acid dehydrogenase) and b0207 (2,5-diketo-D-gluconate reductase B), is assigned to 15 of those reactions, despite experimental evidence indicating different single functions for two of the three genes. Furthermore, the databases (DBs) are internally inconsistent in that the description of gene functions for genes with partial EC numbers is inconsistent with the activities implied by reactions to which the genes were assigned. We infer that these inconsistencies result from the processing used to match gene products to reactions within KEGG's metabolic pathways. These errors affect scientists who use these DBs as online encyclopedias and they affect bioinformaticists who use these DBs to train and validate newly developed algorithms.


Subject(s)
Databases, Genetic , Enzymes/genetics , Genomics , Vocabulary, Controlled , Base Sequence , Escherichia coli/enzymology , Escherichia coli/genetics , Humans , Molecular Sequence Data , Reproducibility of Results
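The error described in this abstract can be reproduced in a few lines. The sketch below contrasts a naive matcher, which treats a partial EC number such as '1.1.1.-' as an ordinary key, with a conservative matcher that refuses to assign genes carrying partial EC numbers. The function names and the reaction identifiers are hypothetical; this is not the processing used by any of the databases named above.

```python
def is_partial(ec_number):
    """True if any of the EC number's fields is the placeholder '-'."""
    return "-" in ec_number.split(".")


def naive_match(gene_ec, reaction_ecs):
    """Faulty inference: string equality treats '1.1.1.-' as a specific activity."""
    return [rid for rid, ec in reaction_ecs.items() if ec == gene_ec]


def conservative_match(gene_ec, reaction_ecs):
    """Assign a gene to reactions only when its EC number is fully specified."""
    if is_partial(gene_ec):
        return []  # a partial EC number is ambiguous, so make no assignment
    return [rid for rid, ec in reaction_ecs.items() if ec == gene_ec]


# Hypothetical reactions: two annotated only to the sub-subclass level.
reactions = {"R1": "1.1.1.-", "R2": "1.1.1.-", "R3": "1.1.1.1"}

print(naive_match("1.1.1.-", reactions))         # assigned to every partial-EC reaction
print(conservative_match("1.1.1.-", reactions))  # no assignment
```

The design choice is to treat '-' as "unknown" rather than as a value: two entities that share an unknown field cannot be assumed to share the same activity.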
3.
Pac Symp Biocomput ; : 190-201, 2004.
Article in English | MEDLINE | ID: mdl-14992503

ABSTRACT

An important emerging need in Model Organism Databases (MODs) and other bioinformatics databases (DBs) is that of capturing the scientific evidence that supports the information within a DB. This need has become particularly acute as more DB content consists of computationally predicted information, such as predicted gene functions, operons, metabolic pathways, and protein properties. This paper presents an ontology for encoding the type of support and the degree of support for DB assertions, and for encoding the literature source in which that support is reported. The ontology includes a hierarchy of 35 evidence codes for modeling different types of wet-lab and computational evidence for the existence of operons and metabolic pathways, and for gene functions. We also describe an implementation of the ontology within the Pathway Tools software environment, which is used to query and update Pathway/Genome DBs such as EcoCyc, MetaCyc, and HumanCyc.


Subject(s)
Computational Biology , Databases, Genetic , Genomics/statistics & numerical data , Models, Genetic , Software
4.
Bioinformatics ; 20(5): 709-17, 2004 Mar 22.
Article in English | MEDLINE | ID: mdl-14751985

ABSTRACT

MOTIVATION: The prediction of transcription units (TUs, which are similar to operons) is an important problem that has been tackled using many different approaches. The availability of complete microbial genomes has made genome-wide TU predictions possible. Pathway-genome databases (PGDBs) add metabolic and other organizational (i.e. protein complexes) information to the annotated genome, and are able to capture TU organization information. These characteristics of PGDBs make them a suitable framework for the development and implementation of TU predictors. RESULTS: We implemented a TU predictor that uses only intergenic distance and functional classification of genes to predict TU boundaries, and applied it to EcoCyc, our PGDB of Escherichia coli. To this original predictor, we added information on metabolic pathways, protein complexes and transporters, all readily available in EcoCyc, in order to generate an enhanced predictor. The enhanced predictor correctly predicted 80% of the known E.coli TUs (69% of the known operons), a moderate improvement over the original predictor's performance (75% of TUs and 65% of operons correctly predicted), demonstrating that the extra information available in the PGDB does indeed improve prediction performance. Performance of this E.coli-based predictor on a genome other than that of E.coli was tested on BsubCyc, our computationally generated PGDB for Bacillus subtilis, for which a set of 100 known operons is available. Prediction accuracy decreased substantially (46% of the known operons correctly predicted). This was due in part to missing information in BsubCyc, which prevented full use of the predictor's features. The augmented predictor has been implemented as part of our Pathway Tools software suite, and can be used to populate a PGDB with predicted TUs. AVAILABILITY: The TU predictor is included in version 7.0 of the Pathway Tools software suite. 
Pathway Tools 7.0 is available free of charge to academic institutions and for a fee to commercial enterprises. It runs on Sun Solaris 8, Linux and Windows. TUs predicted on the Caulobacter crescentus and Mycobacterium tuberculosis (H37Rv) genomes are available in our CauloCyc and MtbrvCyc databases at the BioCyc web site (http://biocyc.org). To obtain version 7.0 of Pathway Tools, follow the directions on our web site, http://biocyc.org/download.shtml.


Subject(s)
Chromosome Mapping/methods , Databases, Genetic , Gene Expression Profiling/methods , Gene Expression Regulation, Bacterial/physiology , Genome, Bacterial , Transcription Factors/genetics , Transcription Factors/metabolism , Algorithms , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Information Storage and Retrieval/methods , Signal Transduction/physiology , Software , Structure-Activity Relationship
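As a rough illustration of the intergenic-distance idea mentioned in this abstract, the sketch below groups adjacent genes on the same strand into one transcription unit when the gap between them is small. It is a deliberately simplified stand-in: the abstract's predictor also uses functional classification, pathways, protein complexes and transporters, none of which appear here. The `Gene` type, the `predict_tus` function, and the 50 bp threshold are assumptions made for this example.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate, inclusive
    strand: str  # "+" or "-"


def predict_tus(genes, max_gap=50):
    """Group neighbouring same-strand genes separated by at most max_gap bases.

    max_gap is a hypothetical cut-off, not a value from the paper.
    """
    tus = []
    for gene in sorted(genes, key=lambda g: g.start):
        if tus:
            prev = tus[-1][-1]
            gap = gene.start - prev.end - 1  # bases strictly between the two genes
            if gene.strand == prev.strand and gap <= max_gap:
                tus[-1].append(gene)
                continue
        tus.append([gene])  # start a new transcription unit
    return [[g.name for g in tu] for tu in tus]


# Hypothetical genes on a toy chromosome.
genes = [
    Gene("a", 1, 100, "+"),
    Gene("b", 120, 300, "+"),
    Gene("c", 500, 700, "+"),
    Gene("d", 710, 900, "-"),
]
print(predict_tus(genes))
```

Here genes a and b fall into one unit (19-base gap, same strand), c starts a new unit because of the larger gap, and d is separated because it lies on the opposite strand.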
5.
Science ; 294(5550): 2317-23, 2001 Dec 14.
Article in English | MEDLINE | ID: mdl-11743193

ABSTRACT

The 5.67-megabase genome of the plant pathogen Agrobacterium tumefaciens C58 consists of a circular chromosome, a linear chromosome, and two plasmids. Extensive orthology and nucleotide colinearity between the genomes of A. tumefaciens and the plant symbiont Sinorhizobium meliloti suggest a recent evolutionary divergence. Their similarities include metabolic, transport, and regulatory systems that promote survival in the highly competitive rhizosphere; differences are apparent in their genome structure and virulence gene complement. Availability of the A. tumefaciens sequence will facilitate investigations into the molecular basis of pathogenesis and the evolutionary divergence of pathogenic and symbiotic lifestyles.


Subject(s)
Agrobacterium tumefaciens/genetics , Genome, Bacterial , Sequence Analysis, DNA , Agrobacterium tumefaciens/classification , Agrobacterium tumefaciens/pathogenicity , Agrobacterium tumefaciens/physiology , Bacterial Adhesion/genetics , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Carrier Proteins/genetics , Carrier Proteins/metabolism , Chromosomes, Bacterial/genetics , Conjugation, Genetic , DNA Replication , Genes, Bacterial , Genes, Regulator , Membrane Proteins/genetics , Membrane Proteins/metabolism , Molecular Sequence Data , Phylogeny , Plants/microbiology , Plasmids , Replicon , Rhizobiaceae/genetics , Rhizobiaceae/physiology , Sinorhizobium meliloti/genetics , Sinorhizobium meliloti/physiology , Symbiosis , Virulence/genetics
6.
Science ; 293(5537): 2040-4, 2001 Sep 14.
Article in English | MEDLINE | ID: mdl-11557880

ABSTRACT

A pathway database (DB) is a DB that describes biochemical pathways, reactions, and enzymes. The EcoCyc pathway DB (see http://ecocyc.org) describes the metabolic, transport, and genetic-regulatory networks of Escherichia coli. EcoCyc is an example of a computational symbolic theory, which is a DB that structures a scientific theory within a formal ontology so that it is available for computational analysis. It is argued that by encoding scientific theories in symbolic form, we open new realms of analysis and understanding for theories that would otherwise be too large and complex for scientists to reason with effectively.


Subject(s)
Computational Biology , Databases, Factual , Escherichia coli/genetics , Escherichia coli/metabolism , Genome, Bacterial , Artificial Intelligence , Culture Media , Escherichia coli/enzymology , Escherichia coli/growth & development , Internet , Software
7.
Bioinformatics ; 17(6): 526-32; discussion 533-4, 2001 Jun.
Article in English | MEDLINE | ID: mdl-11395429

ABSTRACT

PROBLEM STATEMENT: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. First is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information was present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. Second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL. RESULTS: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E.coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. 
Although many of these differences can be explained by the complexity of the DB, and by the curation processes used to create it, the scale of the differences is notable.


Subject(s)
Algorithms , Databases, Factual/standards , Gene Library , Human Genome Project , Base Sequence , Data Interpretation, Statistical , Escherichia coli/classification , Escherichia coli/genetics , Haemophilus influenzae/genetics , Helicobacter pylori/genetics , Open Reading Frames/genetics , Protein Biosynthesis/genetics , Species Specificity
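The match categories reported in this abstract (identical sequence, exact substring, or neither) can be sketched as a small classifier. The abstract does not say in which direction the substring relation was tested, so the sketch below checks both directions; that choice, the function name, and the toy sequences are assumptions. The BLAST-based category is left out because it needs an external tool.

```python
def classify_match(query_seq, reference_seqs):
    """Classify a translated protein sequence against a set of reference sequences."""
    if query_seq in reference_seqs:
        return "identical"
    # Assumption: a match counts if either sequence contains the other exactly.
    if any(query_seq in ref or ref in query_seq for ref in reference_seqs):
        return "substring"
    return "unmatched"


# Hypothetical reference set standing in for SWISS-PROT/TrEMBL entries.
references = {"MKTAYIAK", "MSLLNQ"}

print(classify_match("MKTAYIAK", references))  # same sequence
print(classify_match("KTAYIAK", references))   # initial methionine removed
print(classify_match("MAAAA", references))     # no exact relation
```

The second case mirrors one cause named in the abstract: a missing or removed methionine turns an identical match into a substring match.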
8.
Comp Funct Genomics ; 2(1): 25-7, 2001.
Article in English | MEDLINE | ID: mdl-18628940

ABSTRACT

A survey of GenBank entries for complete microbial genomes reveals that the majority do not conform to the GenBank standard. Typical deviations from the GenBank standard include records with information in incorrect fields, addition of extraneous and confusing information within a field, and omission of useful fields. This situation results from two principal causes: genome centres do not submit GenBank records in the proper form and the GenBank, EMBL and DDBJ staffs do not enforce the database standards that they have defined.

9.
Bioinformatics ; 16(3): 269-85, 2000 Mar.
Article in English | MEDLINE | ID: mdl-10869020

ABSTRACT

MOTIVATIONS: A number of important bioinformatics computations involve computing with function: executing computational operations whose inputs or outputs are descriptions of the functions of biomolecules. Examples include performing functional queries to sequence and pathway databases, and determining functional equality to evaluate algorithms that predict function from sequence. A prerequisite to computing with function is the existence of an ontology that provides a structured semantic encoding of function. Functional bioinformatics is an emerging subfield of bioinformatics that is concerned with developing ontologies and algorithms for computing with biological function. RESULTS: The article explores the notion of computing with function, and explains the importance of ontologies of function to bioinformatics. The functional ontology developed for the EcoCyc database is presented. This ontology can encode a diverse array of biochemical processes, including enzymatic reactions involving small-molecule substrates and macromolecular substrates, signal-transduction processes, transport events, and mechanisms of regulation of gene expression. The ontology is validated through its use to express complex functional queries for the EcoCyc DB. CONTACT: pkarp@ai.sri.com


Subject(s)
Computational Biology , Proteins/physiology , Computing Methodologies , Databases, Factual , Proteins/metabolism
10.
Genome Res ; 10(4): 568-76, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779499

ABSTRACT

The EcoCyc database characterizes the known network of Escherichia coli small-molecule metabolism. Here we present a computational analysis of the global properties of that network, which consists of 744 reactions that are catalyzed by 607 enzymes. The reactions are organized into 131 pathways. Of the metabolic enzymes, 100 are multifunctional, and 68 of the reactions are catalyzed by >1 enzyme. The network contains 791 chemical substrates. Other properties considered by the analysis include the distribution of enzyme subunit organization, and the distribution of modulators of enzyme activity and of enzyme cofactors. The dimensions chosen for this analysis can be employed for comparative functional analysis of complete genomes.


Subject(s)
Escherichia coli/metabolism , Catalysis , Computational Biology/methods , Databases, Factual , Enzyme Activation/genetics , Escherichia coli/enzymology , Escherichia coli/genetics , Genome, Bacterial , Multienzyme Complexes/genetics
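Global counts of the kind listed in this abstract can be derived from a simple list of (enzyme, reaction) catalysis pairs. The sketch below assumes that "multifunctional" means an enzyme that catalyzes more than one reaction; the data layout, the function name, and the toy pairs are hypothetical and do not come from EcoCyc.

```python
from collections import defaultdict


def network_summary(catalysis_pairs):
    """Summarize an enzyme-reaction network given (enzyme, reaction) pairs."""
    reactions_by_enzyme = defaultdict(set)
    enzymes_by_reaction = defaultdict(set)
    for enzyme, reaction in catalysis_pairs:
        reactions_by_enzyme[enzyme].add(reaction)
        enzymes_by_reaction[reaction].add(enzyme)
    return {
        "enzymes": len(reactions_by_enzyme),
        "reactions": len(enzymes_by_reaction),
        # Assumption: multifunctional = catalyzes more than one reaction.
        "multifunctional_enzymes": sum(
            1 for rxns in reactions_by_enzyme.values() if len(rxns) > 1
        ),
        "reactions_with_multiple_enzymes": sum(
            1 for enzs in enzymes_by_reaction.values() if len(enzs) > 1
        ),
    }


# Hypothetical toy network.
pairs = [("E1", "R1"), ("E1", "R2"), ("E2", "R2"), ("E3", "R3")]
print(network_summary(pairs))
```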
11.
Nucleic Acids Res ; 28(1): 56-9, 2000 Jan 01.
Article in English | MEDLINE | ID: mdl-10592180

ABSTRACT

EcoCyc is an organism-specific Pathway/Genome Database that describes the metabolic and signal-transduction pathways of Escherichia coli, its enzymes, and (a new addition) its transport proteins. MetaCyc is a new metabolic-pathway database that describes pathways and enzymes of many different organisms, with a microbial focus. Both databases are queried using the Pathway Tools graphical user interface, which provides a wide variety of query operations and visualization tools. EcoCyc and MetaCyc are available at http://ecocyc.PangeaSystems.com/ecocyc/


Subject(s)
Databases, Factual , Database Management Systems , Escherichia coli/genetics , Genome, Bacterial
12.
Trends Biotechnol ; 17(7): 275-81, 1999 Jul.
Article in English | MEDLINE | ID: mdl-10370234

ABSTRACT

Integrated pathway-genome databases describe the genes and genome of an organism, as well as its predicted pathways, reactions, enzymes and metabolites. In conjunction with visualization and analysis software, these databases provide a framework for improved understanding of microbial physiology and for antimicrobial drug discovery. We describe pathway-based analyses of the genomes of a number of medically relevant microorganisms and a novel software tool that visualizes gene-expression data on a diagram showing the whole metabolic network of the microorganism.


Subject(s)
Anti-Infective Agents/pharmacology , Database Management Systems , Genome, Bacterial , Genome, Fungal , Systems Integration , Anti-Bacterial Agents , Anti-Infective Agents/chemical synthesis , Bacteria/drug effects , Bacteria/genetics , Bacteria/metabolism , Drug Design , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Species Specificity
13.
Nucleic Acids Res ; 27(1): 55-8, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847140

ABSTRACT

The EcoCyc database describes the genome and gene products of Escherichia coli, its metabolic and signal-transduction pathways, and its tRNAs. The database describes 4391 genes of E.coli, 695 enzymes encoded by a subset of these genes, 904 metabolic reactions that occur in E.coli, and the organization of these reactions into 129 metabolic pathways. The EcoCyc graphical user interface allows scientists to query and explore the EcoCyc database using visualization tools such as genomic-map browsers and automatic layouts of metabolic pathways. EcoCyc has many references to the primary literature, and is a (qualitative) computational model of E.coli metabolism. EcoCyc is available at URL http://ecocyc.PangeaSystems.com/ecocyc/


Subject(s)
Databases, Factual , Escherichia coli/genetics , Escherichia coli/metabolism , Genes, Bacterial , Classification , Enzymes/genetics , Enzymes/metabolism , Genome, Bacterial , Information Storage and Retrieval , Internet , Signal Transduction , User-Computer Interface
15.
Nucleic Acids Res ; 26(1): 50-3, 1998 Jan 01.
Article in English | MEDLINE | ID: mdl-9399798

ABSTRACT

The encyclopedia of Escherichia coli genes and metabolism (EcoCyc) is a database that combines information about the genome and the intermediary metabolism of E.coli. The database describes 3030 genes of E.coli, 695 enzymes encoded by a subset of these genes, 595 metabolic reactions that occur in E.coli, and the organization of these reactions into 123 metabolic pathways. The EcoCyc graphical user interface allows scientists to query and explore the EcoCyc database using visualization tools such as genomic-map browsers and automatic layouts of metabolic pathways. EcoCyc can be thought of as an electronic review article because of its copious references to the primary literature, and as a (qualitative) computational model of E.coli metabolism. EcoCyc is available at URL http://ecocyc.PangeaSystems.com/ecocyc/


Subject(s)
Databases, Factual , Escherichia coli/genetics , Escherichia coli/metabolism , Genes, Bacterial , Computer Graphics , Databases, Factual/trends , Encyclopedias as Topic , User-Computer Interface
17.
Comput Appl Biosci ; 13(5): 537-43, 1997 Oct.
Article in English | MEDLINE | ID: mdl-9367126

ABSTRACT

MOTIVATION: Group contribution methods are frequently used for estimating physical properties of compounds from their molecular structures. An algorithm for estimating Gibbs energies of formation through group contribution methods has been automated in an object-oriented framework. The algorithm decomposes compound structures according to a basis set of groups. It permits the use of wildcards and is able to distinguish between ring groups and chain groups that use similar search structures. Past methods relied on manual decomposition of compounds into constituent groups. RESULTS: The software is written in Common LISP and requires < 2 min to estimate Gibbs energies of formation for a database of 780 species of varying size and complexity. The software allows rapid expansion to incorporate different basis sets and to estimate a variety of other physical properties.


Subject(s)
Biotransformation , Databases, Factual , Molecular Structure , Software , Algorithms , Artificial Intelligence , Thermodynamics
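The final step of a group contribution estimate is a weighted sum: once a compound has been decomposed into groups, the property is approximated by adding each group's contribution times the number of times it occurs. The sketch below shows only that summation. The automated structure decomposition described in the abstract is not reproduced, and the group names and numeric contributions are made-up placeholders, not published values.

```python
def estimate_from_groups(group_counts, contributions):
    """Sum group contributions: estimate = sum(count_i * contribution_i).

    Raises KeyError if the compound contains a group missing from the basis set.
    """
    total = 0.0
    for group, count in group_counts.items():
        total += count * contributions[group]
    return total


# Hypothetical basis set with placeholder contributions (arbitrary units).
contributions = {"CH3": -10.0, "CH2": 8.0, "OH": -150.0}

# Hypothetical decomposition of one compound into groups.
compound = {"CH3": 1, "CH2": 1, "OH": 1}

print(estimate_from_groups(compound, contributions))
```

Swapping in a different `contributions` table is the kind of basis-set expansion the abstract mentions, since the summation itself does not change.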
18.
Nature ; 388(6642): 539-47, 1997 Aug 07.
Article in English | MEDLINE | ID: mdl-9252185

ABSTRACT

Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified, underscoring the potential complexity of host-pathogen interaction. Based on the large number of sequence-related genes encoding outer membrane proteins and the presence of homopolymeric tracts and dinucleotide repeats in coding sequences, H. pylori, like several other mucosal pathogens, probably uses recombination and slipped-strand mispairing within repeats as mechanisms for antigenic variation and adaptive evolution. Consistent with its restricted niche, H. pylori has a few regulatory networks, and a limited metabolic repertoire and biosynthetic capacity. Its survival in acid conditions depends, in part, on its ability to establish a positive inside-membrane potential in low pH.


Subject(s)
Genome, Bacterial , Helicobacter pylori/genetics , Antigenic Variation , Bacterial Adhesion , Bacterial Proteins/metabolism , Base Sequence , Biological Evolution , Cell Division , DNA Repair , DNA, Bacterial/genetics , Gene Expression Regulation, Bacterial , Helicobacter pylori/metabolism , Helicobacter pylori/pathogenicity , Hydrogen-Ion Concentration , Molecular Sequence Data , Protein Biosynthesis , Recombination, Genetic , Transcription, Genetic , Virulence
19.
Nucleic Acids Res ; 25(1): 43-51, 1997 Jan 01.
Article in English | MEDLINE | ID: mdl-9016502

ABSTRACT

The Encyclopedia of Genes and Metabolism (EcoCyc) is a database that combines information about the genome and the intermediary metabolism of Escherichia coli. It describes 2970 genes of E.coli, 547 enzymes encoded by these genes, 702 metabolic reactions that occur in E.coli and the organization of these reactions into 107 metabolic pathways. The EcoCyc graphical user interface allows scientists to query and explore the EcoCyc database using visualization tools such as genomic-map browsers and automatic layouts of metabolic pathways. EcoCyc spans the space from sequence to function to allow scientists to investigate an unusually broad range of questions. EcoCyc can be thought of both as an electronic review article, because of its copious references to the primary literature, and as an in silico model of E.coli metabolism that can be probed and analyzed through computational means.


Subject(s)
Databases, Factual , Escherichia coli/genetics , Escherichia coli/metabolism , Genes, Bacterial , Amino Acid Sequence , Base Sequence , User-Computer Interface
20.
Article in English | MEDLINE | ID: mdl-9322021

ABSTRACT

We describe a novel approach for predicting the function of a protein from its amino-acid sequence. Given features that can be computed from the amino-acid sequence in a straightforward fashion (such as pI, molecular weight, and amino-acid composition), the technique allows us to answer questions such as: Is the protein an enzyme? If so, in which Enzyme Commission (EC) class does it belong? Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored the use of three different ML techniques in conjunction with training datasets derived from PDB and from Swiss-Prot. We also explored the use of several different feature sets. Our method is able to predict the first EC number of an enzyme with 74% accuracy (thereby assigning the enzyme to one of six broad categories of enzyme function), and to predict the second EC number of an enzyme with 68% accuracy (thereby assigning the enzyme to one of 57 subcategories of enzyme function). This technique could be a valuable complement to sequence-similarity searches and to pathway-analysis methods.


Subject(s)
Artificial Intelligence , Enzymes/chemistry , Enzymes/classification , Proteins/chemistry , Algorithms , Amino Acid Sequence , Databases, Factual , Enzymes/genetics , Proteins/genetics , Sequence Alignment
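One of the sequence-derived features named in this abstract, amino-acid composition, is easy to compute and gives a fixed-length vector that a classifier can consume. The sketch below shows that feature only; pI, molecular weight, and the classifiers themselves are not reproduced. The function name and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


# Hypothetical toy sequence.
features = composition_features("AAGG")
print(dict(zip(AMINO_ACIDS, features)))
```

A vector like this could be passed, together with an EC class label, to any standard supervised learner; which learner and which feature set work best is the question the abstract's experiments address.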