1.
Bioinformatics ; 31(8): 1274-8, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25540181

ABSTRACT

UNLABELLED: The Chemical Component Dictionary (CCD) is a chemical reference data resource that describes all residue and small molecule components found in Protein Data Bank (PDB) entries. The CCD contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands and solvent molecules. Each chemical definition includes descriptions of chemical properties such as stereochemical assignments, chemical descriptors, systematic chemical names and idealized coordinates. The content, preparation, validation and distribution of this CCD chemical reference dataset are described. AVAILABILITY AND IMPLEMENTATION: The CCD is updated regularly in conjunction with the scheduled weekly release of new PDB structure data. The CCD and amino acid variant reference datasets are hosted in the public PDB ftp repository at ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz, ftp://ftp.wwpdb.org/pub/pdb/data/monomers/aa-variants-v1.cif.gz, and its mirror sites, and can be accessed from http://wwpdb.org. CONTACT: jwest@rcsb.rutgers.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Databases, Chemical , Databases, Protein , Dictionaries, Chemical as Topic , Macromolecular Substances/chemistry , Molecular Sequence Annotation , Internet , Ligands , User-Computer Interface
2.
Acta Crystallogr D Biol Crystallogr ; 70(Pt 3): 904-6, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24598758

ABSTRACT

Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each C(α) coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.1 Å.
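
The compressibility argument can be illustrated with a small sketch. This is not the authors' procedure: the synthetic "coordinates", the use of `zlib`, and the helper name `compressed_bits_per_value` are all assumptions made for illustration. The idea is that digits beyond the real precision behave like noise, so each additional decimal place adds incompressible bits.

```python
import random
import zlib


def compressed_bits_per_value(values, decimals):
    """Format values at a fixed number of decimals, compress, return bits per value."""
    text = "\n".join(f"{v:.{decimals}f}" for v in values).encode("ascii")
    return 8 * len(zlib.compress(text, 9)) / len(values)


# Synthetic stand-in for coordinates (an assumption, not PDB data):
# a smooth trend plus Gaussian noise with a standard deviation of 0.1.
rng = random.Random(0)
coords = [0.01 * i + rng.gauss(0.0, 0.1) for i in range(5000)]

# Compare the cost in compressed bits of reporting 0 to 4 decimal places.
for decimals in range(5):
    print(decimals, round(compressed_bits_per_value(coords, decimals), 2))
```

In a run of this kind, the cost per value keeps rising as decimals are added; where the increase per extra digit approaches that of a purely random digit, the extra digits carry no recoverable structure.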


Subject(s)
Databases, Protein/standards , User-Computer Interface , Crystallography, X-Ray/standards , Dictionaries, Chemical as Topic , Magnetic Resonance Spectroscopy/standards , Microscopy, Electron/standards , Predictive Value of Tests , Random Allocation
3.
J Chem Inf Model ; 52(8): 1907-16, 2012 Aug 27.
Article in English | MEDLINE | ID: mdl-22725613

ABSTRACT

A previous paper [Spadaccini and Hall J. Chem. Inf. Model. doi:10.1021/ci300074v] details extensions to the STAR File [Hall J. Chem. Inf. Comput. Sci. 1991, 31, 326-333] syntax that will improve the exchange and archiving of electronic data. This paper describes a dictionary definition language (DDLm) for defining STAR File data items in a domain dictionary. A dictionary that defines the ontology and vocabulary of a discipline is built with DDLm, which is itself implemented in STAR, and is extensible and machine parsable. The DDLm is semantically rich and highly specific; provides strong data typing, data enumerations, and ranges; enables relationship keys between data items; and uses imbedded methods written in dREL [Spadaccini et al. J. Chem. Inf. Model. doi:10.1021/ci300076w] for data validation and evaluation and for refining data definitions. It promotes the modular definition of the discipline ontology and reuse through the ability to import definitions from other local and remote dictionaries, thus encouraging the sharing of data dictionaries within and across domains.


Subject(s)
Dictionaries, Chemical as Topic , Programming Languages , Electronic Data Processing , Informatics
4.
Bioinformatics ; 28(12): 1633-40, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22500000

ABSTRACT

MOTIVATION: The accurate identification of chemicals in text is important for many applications, including computer-assisted reconstruction of metabolic networks or retrieval of information about substances in drug development. But due to the diversity of naming conventions and traditions for such molecules, this task is highly complex and should be supported by computational tools. RESULTS: We present ChemSpot, a named entity recognition (NER) tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and International Union of Pure and Applied Chemistry entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It achieves an F(1) measure of 68.1% on the SCAI corpus, outperforming the only other freely available chemical NER tool, OSCAR4, by 10.8 percentage points. AVAILABILITY: ChemSpot is freely available at: http://www.informatik.hu-berlin.de/wbi/resources.


Subject(s)
Artificial Intelligence , Dictionaries, Chemical as Topic , Information Storage and Retrieval/methods , Natural Language Processing , Pharmaceutical Preparations/classification , Computational Biology/methods , Software , Terminology as Topic
5.
Behav Res Methods ; 44(4): 1015-27, 2012 Dec.
Article in English | MEDLINE | ID: mdl-22477438

ABSTRACT

Words that are homonyms (that is, words for which a single written and spoken form is associated with multiple, unrelated interpretations, such as COMPOUND, which can denote an <enclosure> or a <composite> meaning) are an invaluable class of items for studying word and discourse comprehension. When using homonyms as stimuli, it is critical to control for the relative frequencies of each interpretation, because this variable can drastically alter the empirical effects of homonymy. Currently, the standard method for estimating these frequencies is based on the classification of free associates generated for a homonym, but this approach is both assumption-laden and resource-demanding. Here, we outline an alternative norming methodology based on explicit ratings of the relative meaning frequencies of dictionary definitions. To evaluate this method, we collected and analyzed data in a norming study involving 544 English homonyms, using the eDom norming software that we developed for this purpose. Dictionary definitions were generally sufficient to exhaustively cover word meanings, and the methods converged on stable norms with fewer data and less effort on the part of the experimenter. The predictive validity of the norms was demonstrated in analyses of lexical decision data from the English Lexicon Project (Balota et al., Behavior Research Methods, 39, 445-459, 2007), and from Armstrong and Plaut (Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, 2223-2228, 2011). On the basis of these results, our norming method obviates relying on the unsubstantiated assumptions involved in estimating relative meaning frequencies on the basis of classification of free associates. Additional details of the norming procedure, the meaning frequency norms, and the source code, standalone binaries, and user manual for the software are available at http://edom.cnbc.cmu.edu.


Subject(s)
Comprehension , Language , Software , Adult , Dictionaries, Chemical as Topic , Female , Humans , Male , Reproducibility of Results , Terminology as Topic , Young Adult
6.
Molecules ; 17(3): 2877-928, 2012 Mar 07.
Article in English | MEDLINE | ID: mdl-22399140

ABSTRACT

A listing of carotenoids with heteroatoms (X = F, Cl, Br, I, Si, N, S, Se, Fe) directly attached to the carotenoid carbon skeleton has been compiled. The 178 listed carotenoids with C, H, X atoms demonstrate that the classical division of carotenoids into hydrocarbon carotenoids (C, H) and xanthophylls (C, H, O) has become obsolete.


Subject(s)
Carotenoids/classification , Xenobiotics/classification , Carotenoids/chemistry , Dictionaries, Chemical as Topic , Molecular Structure , Terminology as Topic , Xenobiotics/chemistry
7.
J Chem Inf Model ; 52(1): 51-62, 2012 Jan 23.
Article in English | MEDLINE | ID: mdl-22148717

ABSTRACT

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.


Subject(s)
Computational Biology/methods , Dictionaries, Chemical as Topic , Natural Language Processing , Software , Data Mining , Databases, Factual , Patents as Topic
8.
Proteomics ; 10(23): 4306-10, 2010 Dec.
Article in English | MEDLINE | ID: mdl-21082763

ABSTRACT

Recent developments in MS-based proteomics have increased the emphasis on peptides as a primary observable. While peptides are identified by tandem mass spectra, the link between peptide and protein remains implicit given the bottom-up nature of the experiment in which proteins are enzymatically digested prior to sequencing. It is therefore useful to provide a fast lookup from peptide to protein in order to systematically establish the broadest possible protein basis for the observed peptides. Here, we describe Pep2Pro, a fast web-service providing protein lookup by peptides covering the entire protein space comprising ∼10 million UniRef100 sequences. We demonstrate the usefulness of the service by reanalyzing peptides from two recent meta-proteomic data sets and identifying taxon-specific peptides, thereby implicating individual species as being present in these complex samples. The Pep2Pro web service can be accessed at http://www.pep2pro.org.
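
A peptide-to-protein lookup of this kind can be sketched with a simple substring index. This is only an illustration under stated assumptions, not the Pep2Pro implementation: the k-mer index, the function names `build_index` and `lookup`, and the protein IDs and sequences are all hypothetical.

```python
from collections import defaultdict


def build_index(proteins, k=4):
    """Map every length-k substring to the set of protein IDs containing it."""
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(pid)
    return index


def lookup(peptide, proteins, index, k=4):
    """Return sorted IDs of proteins that contain the peptide as an exact substring."""
    if len(peptide) < k:
        # Too short for the index: fall back to a full scan.
        return sorted(pid for pid, seq in proteins.items() if peptide in seq)
    # Use the first k-mer to narrow the candidates, then verify the full match.
    candidates = index.get(peptide[:k], set())
    return sorted(pid for pid in candidates if peptide in proteins[pid])


# Hypothetical toy database (made-up IDs and sequences).
proteins = {"P1": "MKTAYIAKQR", "P2": "GSHMKTAYLV", "P3": "AAAAAAAA"}
index = build_index(proteins)
print(lookup("KTAY", proteins, index))
print(lookup("TAYIAK", proteins, index))
```

A peptide that maps to a single protein or taxon is what makes it informative; a peptide shared by many entries only narrows the candidate set.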


Subject(s)
Databases, Protein , Dictionaries, Chemical as Topic , Proteomics/methods , Humans , Internet , Peptides/chemistry , Proteins/chemistry
9.
Nucleic Acids Res ; 38(Database issue): D488-91, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19767608

ABSTRACT

The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD, http://umbbd.msi.umn.edu/) began in 1995 and now contains information on almost 1200 compounds, over 800 enzymes, almost 1300 reactions and almost 500 microorganism entries. Besides these data, it includes a Biochemical Periodic Table (UM-BPT) and a rule-based Pathway Prediction System (UM-PPS) (http://umbbd.msi.umn.edu/predict/) that predicts plausible pathways for microbial degradation of organic compounds. Currently, the UM-PPS contains 260 biotransformation rules derived from reactions found in the UM-BBD and scientific literature. Public access to UM-BBD data is increasing. UM-BBD compound data are now contributed to PubChem and ChemSpider, the public chemical databases. A new mirror website of the UM-BBD, UM-BPT and UM-PPS is being developed at ETH Zürich to improve speed and reliability of online access from anywhere in the world.


Subject(s)
Biodegradation, Environmental , Computational Biology/methods , Databases, Genetic , Databases, Nucleic Acid , Access to Information , Biochemistry/methods , Biotransformation , Computational Biology/trends , Dictionaries, Chemical as Topic , Environmental Pollutants , Genome, Bacterial , Information Storage and Retrieval/methods , Internet , Microbiology , Software , User-Computer Interface
10.
Nucleic Acids Res ; 38(Database issue): D249-54, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19854951

ABSTRACT

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/. This article reports on new features in ChEBI since the last NAR report in 2007, including substructure and similarity searching, a submission tool for authoring of ChEBI datasets by the community and a 30-fold increase in the number of chemical structures stored in ChEBI.


Subject(s)
Computational Biology/methods , Databases, Factual , Agrochemicals/chemistry , Animals , Biological Products/chemistry , Computational Biology/trends , Dictionaries, Chemical as Topic , Humans , Information Storage and Retrieval/methods , Internet , Pharmaceutical Preparations/chemistry , Software , User-Computer Interface , Vocabulary, Controlled
11.
Nucleic Acids Res ; 38(Database issue): D255-66, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19933261

ABSTRACT

The PubChem BioAssay database (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological activities of small molecules and small interfering RNAs (siRNAs) hosted by the US National Institutes of Health (NIH). It archives experimental descriptions of assays and biological test results and makes the information freely accessible to the public. A PubChem BioAssay data entry includes an assay description, a summary and detailed test results. Each assay record is linked to the molecular target, whenever possible, and is cross-referenced to other National Center for Biotechnology Information (NCBI) database records. 'Related BioAssays' are identified by examining the assay target relationship and activity profile of commonly tested compounds. A key goal of PubChem BioAssay is to make the biological activity information easily accessible through the NCBI information retrieval system, Entrez, and various web-based PubChem services. An integrated suite of data analysis tools is available to optimize the utility of the chemical structure and biological activity information within PubChem, enabling researchers to aggregate, compare and analyze biological test results contributed by multiple organizations. In this work, we describe the PubChem BioAssay database, including data model, bioassay deposition and utilities that PubChem provides for searching, downloading and analyzing the biological activity information contained therein.


Subject(s)
Biological Assay , Computational Biology/methods , Databases, Factual , Dictionaries, Chemical as Topic , Animals , Computational Biology/trends , Databases, Protein , Humans , Information Storage and Retrieval/methods , Internet , National Library of Medicine (U.S.) , Pharmaceutical Preparations/chemistry , Pharmacology , Software , Structure-Activity Relationship , United States
12.
Ambix ; 56(1): 36-48, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19831258

ABSTRACT

The Arabic MS Sprenger 1908 (Staatsbibliothek, Berlin) is a handbook of medieval alchemy. Among the works it preserves, we can find the only extant witness to the Arabic original of the well-known Liber de aluminibus et salibus. In this paper, I focus on a detailed alchemical dictionary preserved in this manuscript (fols. 3r-6r) whose explicit aim is to clarify the meaning of the secret language used by the alchemists to conceal the names of substances and operations. Other versions of the same alchemical lexicon are found in Syriac and karsuni in MSS Oriental 1593 and Egerton 709, both preserved in the British Library. After describing these manuscripts, I analyse the contents of this dictionary, its structure, its different versions, and the features of the alchemical language that it attests to, providing some examples to show how this kind of dictionary is still a useful tool for the contemporary researcher.


Subject(s)
Alchemy , Dictionaries, Chemical as Topic , Manuscripts as Topic/history , Arabia , Chemistry/history , History, 20th Century , History, Medieval , Language
13.
Bioinformatics ; 25(22): 2983-91, 2009 Nov 15.
Article in English | MEDLINE | ID: mdl-19759196

ABSTRACT

MOTIVATION: From the scientific community, a lot of effort has been spent on the correct identification of gene and protein names in text, while less effort has been spent on the correct identification of chemical names. Dictionary-based term identification has the power to recognize the diverse representation of chemical information in the literature and map the chemicals to their database identifiers. RESULTS: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the different processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary. AVAILABILITY: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.
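
Dictionary-based term identification with a filtering rule can be sketched as follows. This is a minimal illustration, not the published pipeline: the dictionary entries, the `CHEM:` identifiers, the blocked-term list and the function name `find_chemicals` are all hypothetical.

```python
import re

# Hypothetical mini-dictionary: lower-cased term -> made-up identifier.
CHEM_DICT = {
    "acetylsalicylic acid": "CHEM:0001",
    "aspirin": "CHEM:0001",
    "caffeine": "CHEM:0002",
    "lead": "CHEM:0003",
}
# Terms suppressed by a filtering rule because they are ambiguous in ordinary English.
AMBIGUOUS = {"lead"}


def find_chemicals(text, dictionary=CHEM_DICT, blocked=AMBIGUOUS, max_words=3):
    """Return (surface form, identifier) pairs using greedy longest-match lookup."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            key = candidate.lower()
            if key in dictionary and key not in blocked:
                hits.append((candidate, dictionary[key]))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return hits


print(find_chemicals("Acetylsalicylic acid and caffeine lead to fewer headaches."))
```

Blocking "lead" shows the trade-off reported in the abstract: the filter removes a likely false positive (higher precision) but would also miss genuine mentions of the metal (lower recall).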


Subject(s)
Computational Biology/methods , Dictionaries, Chemical as Topic , Information Storage and Retrieval/methods , Abstracting and Indexing/methods , Dictionaries as Topic , Natural Language Processing , Pharmaceutical Preparations/chemistry , Software , Unified Medical Language System
14.
Curr Protoc Bioinformatics ; Chapter 14: 14.9.1-14.9.20, 2009 Jun.
Article in English | MEDLINE | ID: mdl-19496059

ABSTRACT

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on "small" chemical compounds. This unit provides a detailed guide to browsing, searching, downloading, and programmatic access to the ChEBI database.


Subject(s)
Computational Biology/methods , Databases, Factual , Dictionaries, Chemical as Topic , Software , Databases, Genetic , Databases, Protein
15.
Rev Hist Pharm (Paris) ; 57(363): 267-76, 2009 Oct.
Article in French | MEDLINE | ID: mdl-20481122

ABSTRACT

Very few science books circulated among European scientists as effectively as Lemery's. His Course of Chemistry was published 18 times in French and was translated into Latin, English, German, Spanish, Italian and Dutch. His Treatise or Dictionary of Drugs was published 14 times in French and translated into foreign languages. His Universal Pharmacopoeia was published no fewer than 17 times in French and was also translated into foreign languages. The longevity of these books was quite unusual: for example, his Dictionary, first published in 1698, was published again in 1807!


Subject(s)
Chemistry/history , Pharmacology/history , Pharmacopoeias as Topic/history , Dictionaries, Chemical as Topic , France , History, 17th Century , History, 18th Century , History, 19th Century
16.
Nucleic Acids Res ; 37(Database issue): D195-200, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18842629

ABSTRACT

The increasing structural information about target-bound compounds provides a rich basis for studying the binding mechanisms of metabolites and drugs. SuperSite is a database that combines this structural information with various tools for the analysis of molecular recognition. The main data set comprises 8000 metabolites, including 1300 drugs, bound to about 290,000 different receptor binding sites. The analysis tools include features such as the highlighting of evolutionarily conserved receptor residues, the marking of putative binding pockets and the superpositioning of different binding sites of the same ligand. User-defined compounds can be edited or uploaded and will be superimposed with the most similar co-crystallized ligand. The user can examine all results online with the molecule viewer Jmol. An implemented search algorithm allows the screening of uploaded proteins in order to detect potential drug binding sites that are similar to known binding pockets. The large data set of target-bound compounds, in combination with the provided analysis tools, makes it possible to inspect the characteristics of molecular recognition, especially for drug-target interactions. SuperSite is publicly available at: http://bioinformatics.charite.de/supersite.


Subject(s)
Databases, Protein , Pharmaceutical Preparations/chemistry , Proteins/chemistry , Binding Sites , Computer Graphics , Dictionaries, Chemical as Topic , Encyclopedias as Topic , Ligands , Metabolism , Software , Vitamin B 6/chemistry
17.
Bioinformatics ; 24(13): i268-76, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586724

ABSTRACT

MOTIVATION: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals, such as SMILES, InChI, IUPAC or trivial names, exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in this way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, as well as its evaluation and the conversion rate with available name-to-structure tools. RESULTS: We present an IUPAC name recognizer with an F(1) measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F(1) measure 81.5%). The remaining recognition problems lie in detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. AVAILABILITY: We plan to publish the corpora and annotation guideline, as well as the conditional random field model as a UIMA component.


Subject(s)
Abstracting and Indexing/methods , Dictionaries, Chemical as Topic , MEDLINE , Natural Language Processing , Pharmaceutical Preparations/classification , Terminology as Topic , Vocabulary, Controlled , Artificial Intelligence , Pattern Recognition, Automated/methods
18.
Nucleic Acids Res ; 36(Database issue): D344-50, 2008 Jan.
Article in English | MEDLINE | ID: mdl-17932057

ABSTRACT

Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/


Subject(s)
Databases, Factual , Dictionaries, Chemical as Topic , Agrochemicals/chemistry , Biological Products/chemistry , Indicators and Reagents/chemistry , Internet , Isotopes/chemistry , Pharmaceutical Preparations/chemistry , User-Computer Interface , Vocabulary, Controlled
19.
Nucleic Acids Res ; 36(Database issue): D426-33, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18073189

ABSTRACT

The Worldwide Protein Data Bank (wwPDB; wwpdb.org) is the international collaboration that manages the deposition, processing and distribution of the PDB archive. The online PDB archive at ftp://ftp.wwpdb.org is the repository for the coordinates and related information for more than 47 000 structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques. The members of the wwPDB, namely RCSB PDB (USA), MSD-EBI (Europe), PDBj (Japan) and BMRB (USA), have remediated this archive to address inconsistencies that have been introduced over the years. The scope and methods used in this project are presented.


Subject(s)
Databases, Protein , Macromolecular Substances/chemistry , Archives , Crystallography, X-Ray , Databases, Protein/standards , Dictionaries, Chemical as Topic , Internet , Microscopy, Electron , Nuclear Magnetic Resonance, Biomolecular , Nucleic Acids/chemistry , Proteins/chemistry , Reproducibility of Results , Terminology as Topic
20.
Bioinformatics ; 23(4): 515-6, 2007 Feb 15.
Article in English | MEDLINE | ID: mdl-17204463

ABSTRACT

MOTIVATION: The size of current protein databases is a challenge for many Bioinformatics applications, both in terms of processing speed and information redundancy. It may therefore be desirable to efficiently reduce the database of interest to a maximally representative subset. RESULTS: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We used the SCOP40 domain database by translating protein structures into character strings by means of a structural alphabet and by extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Therefore, optimized subsets are maximally representative for the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4. AVAILABILITY: http://mathbio.nimr.mrc.ac.uk/~jkleinj/MinSet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
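
An entropy score over a constant-length fragment dictionary might look like the sketch below. The abstract does not give the exact scoring formula, so plain Shannon entropy of the fragment frequencies is an assumption here, as is the function name `fragment_entropy`; the Suffix Tree and Genetic Algorithm parts of MinSet are not reproduced.

```python
import math
from collections import Counter


def fragment_entropy(strings, k):
    """Shannon entropy (bits) of the length-k fragment distribution over all strings."""
    # The fragment dictionary: every length-k substring with its count.
    counts = Counter(s[i:i + k] for s in strings for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Toy strings over a made-up structural alphabet.
print(fragment_entropy(["AAAA"], 1))        # a single fragment type
print(fragment_entropy(["ABAB"], 1))        # two equally frequent fragment types
print(fragment_entropy(["ABAB", "BBBA"], 2))
```

A subset whose fragment distribution has entropy close to that of the full database retains most of its local-structure diversity, which is the sense in which a subset can be called representative.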


Subject(s)
Algorithms , Data Compression/methods , Database Management Systems , Databases, Protein , Proteins/chemistry , Proteins/classification , Sequence Analysis, Protein/methods , Dictionaries, Chemical as Topic , Peptide Fragments/chemistry , Peptide Fragments/classification , Software