ABSTRACT
The Chemical Component Dictionary (CCD) is a chemical reference data resource that describes all residue and small molecule components found in Protein Data Bank (PDB) entries. The CCD contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands and solvent molecules. Each chemical definition includes descriptions of chemical properties such as stereochemical assignments, chemical descriptors, systematic chemical names and idealized coordinates. The content, preparation, validation and distribution of this CCD chemical reference dataset are described. AVAILABILITY AND IMPLEMENTATION: The CCD is updated regularly in conjunction with the scheduled weekly release of new PDB structure data. The CCD and amino acid variant reference datasets are hosted in the public PDB ftp repository at ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif.gz, ftp://ftp.wwpdb.org/pub/pdb/data/monomers/aa-variants-v1.cif.gz, and its mirror sites, and can be accessed from http://wwpdb.org. CONTACT: jwest@rcsb.rutgers.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
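As an illustration of working with the distributed file, the following minimal sketch lists the component identifiers in a locally downloaded copy of components.cif.gz. It assumes each component is stored as its own mmCIF data block introduced by a `data_` line; the function names and the local path argument are hypothetical and not part of any wwPDB tool.

```python
import gzip
import io
from typing import Iterable, List


def component_ids(lines: Iterable[str]) -> List[str]:
    """Collect the identifiers of mmCIF data blocks (lines starting with 'data_')."""
    ids = []
    for line in lines:
        if line.startswith("data_"):
            ids.append(line[len("data_"):].strip())
    return ids


def component_ids_from_gzip(path: str) -> List[str]:
    """Read a gzip-compressed CIF file from a local path (hypothetical location)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return component_ids(handle)


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file, so the sketch runs offline.
    sample = "data_ALA\n_chem_comp.id ALA\n#\ndata_HOH\n_chem_comp.id HOH\n"
    print(component_ids(io.StringIO(sample)))
```

Streaming line by line keeps memory use flat, which matters because the full dictionary is a large file.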
Subjects
Databases, Chemical; Databases, Protein; Dictionaries, Chemical as Topic; Macromolecular Substances/chemistry; Molecular Sequence Annotation; Internet; Ligands; User-Computer Interface
ABSTRACT
Atomic coordinates in the Worldwide Protein Data Bank (wwPDB) are generally reported to greater precision than the experimental structure determinations have actually achieved. By using information theory and data compression to study the compressibility of protein atomic coordinates, it is possible to quantify the amount of randomness in the coordinate data and thereby to determine the realistic precision of the reported coordinates. On average, the value of each C(α) coordinate in a set of selected protein structures solved at a variety of resolutions is good to about 0.1 Å.
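The underlying idea can be sketched in a few lines: digits that carry no real information behave like random noise and barely compress, while digits that reflect structure compress well. The example below is only an illustration on synthetic numbers, using zlib as a stand-in compressor; it is not the procedure used in the paper, and the function names and the synthetic signal are assumptions.

```python
import random
import zlib


def bits_per_digit(digits: str) -> float:
    """Compressed size, in bits per character, of a string of decimal digits."""
    raw = digits.encode("ascii")
    return 8 * len(zlib.compress(raw, 9)) / len(raw)


def decimal_digit(value: float, place: int) -> str:
    """Digit at the given decimal place (1 = tenths) of a value printed to 3 decimals."""
    return f"{abs(value):.3f}".split(".")[1][place - 1]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "coordinates": a slow ramp plus Gaussian noise (sd 0.05),
    # so only the leading decimal place carries systematic signal.
    coords = [10.0 + 0.001 * i + rng.gauss(0.0, 0.05) for i in range(5000)]
    for place in (1, 2, 3):
        column = "".join(decimal_digit(c, place) for c in coords)
        print(place, round(bits_per_digit(column), 2))
```

A decimal place whose digit column costs close to the price of random digits is contributing noise rather than information.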
Subjects
Databases, Protein/standards; User-Computer Interface; Crystallography, X-Ray/standards; Dictionaries, Chemical as Topic; Magnetic Resonance Spectroscopy/standards; Microscopy, Electron/standards; Predictive Value of Tests; Random Allocation
ABSTRACT
A previous paper [Spadaccini and Hall J. Chem. Inf. Model. doi:10.1021/ci300074v] details extensions to the STAR File [Hall J. Chem. Inf. Comput. Sci. 1991, 31, 326-333] syntax that will improve the exchange and archiving of electronic data. This paper describes a dictionary definition language (DDLm) for defining STAR File data items in a domain dictionary. A dictionary that defines the ontology and vocabulary of a discipline is built with DDLm, which is itself implemented in STAR, and is extensible and machine parsable. The DDLm is semantically rich and highly specific; provides strong data typing, data enumerations, and ranges; enables relationship keys between data items; and uses imbedded methods written in dREL [Spadaccini et al. J. Chem. Inf. Model. doi:10.1021/ci300076w] for data validation and evaluation and for refining data definitions. It promotes the modular definition of the discipline ontology and reuse through the ability to import definitions from other local and remote dictionaries, thus encouraging the sharing of data dictionaries within and across domains.
Subjects
Dictionaries, Chemical as Topic; Programming Languages; Electronic Data Processing; Informatics
ABSTRACT
MOTIVATION: The accurate identification of chemicals in text is important for many applications, including computer-assisted reconstruction of metabolic networks or retrieval of information about substances in drug development. But due to the diversity of naming conventions and traditions for such molecules, this task is highly complex and should be supported by computational tools. RESULTS: We present ChemSpot, a named entity recognition (NER) tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and International Union of Pure and Applied Chemistry entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It achieves an F(1) measure of 68.1% on the SCAI corpus, outperforming the only other freely available chemical NER tool, OSCAR4, by 10.8 percentage points. AVAILABILITY: ChemSpot is freely available at: http://www.informatik.hu-berlin.de/wbi/resources.
Subjects
Artificial Intelligence; Dictionaries, Chemical as Topic; Information Storage and Retrieval/methods; Natural Language Processing; Pharmaceutical Preparations/classification; Computational Biology/methods; Software; Terminology as Topic
ABSTRACT
Words that are homonyms (that is, words for which a single written and spoken form is associated with multiple, unrelated interpretations, such as COMPOUND, which can denote an < enclosure > or a < composite > meaning) are an invaluable class of items for studying word and discourse comprehension. When using homonyms as stimuli, it is critical to control for the relative frequencies of each interpretation, because this variable can drastically alter the empirical effects of homonymy. Currently, the standard method for estimating these frequencies is based on the classification of free associates generated for a homonym, but this approach is both assumption-laden and resource-demanding. Here, we outline an alternative norming methodology based on explicit ratings of the relative meaning frequencies of dictionary definitions. To evaluate this method, we collected and analyzed data in a norming study involving 544 English homonyms, using the eDom norming software that we developed for this purpose. Dictionary definitions were generally sufficient to exhaustively cover word meanings, and the methods converged on stable norms with fewer data and less effort on the part of the experimenter. The predictive validity of the norms was demonstrated in analyses of lexical decision data from the English Lexicon Project (Balota et al., Behavior Research Methods, 39, 445-459, 2007), and from Armstrong and Plaut (Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, 2223-2228, 2011). On the basis of these results, our norming method obviates relying on the unsubstantiated assumptions involved in estimating relative meaning frequencies on the basis of classification of free associates. Additional details of the norming procedure, the meaning frequency norms, and the source code, standalone binaries, and user manual for the software are available at http://edom.cnbc.cmu.edu.
Subjects
Comprehension; Language; Software; Adult; Dictionaries, Chemical as Topic; Female; Humans; Male; Reproducibility of Results; Terminology as Topic; Young Adult
ABSTRACT
A listing of carotenoids with heteroatoms (X = F, Cl, Br, I, Si, N, S, Se, Fe) directly attached to the carotenoid carbon skeleton has been compiled. The 178 listed carotenoids with C, H, X atoms demonstrate that the classical division of carotenoids into hydrocarbon carotenoids (C, H) and xanthophylls (C, H, O) has become obsolete.
Subjects
Carotenoids/classification; Xenobiotics/classification; Carotenoids/chemistry; Dictionaries, Chemical as Topic; Molecular Structure; Terminology as Topic; Xenobiotics/chemistry
ABSTRACT
The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
Subjects
Computational Biology/methods; Dictionaries, Chemical as Topic; Natural Language Processing; Software; Data Mining; Databases, Factual; Patents as Topic
ABSTRACT
Recent developments in MS-based proteomics have increased the emphasis on peptides as a primary observable. While peptides are identified by tandem mass spectra, the link between peptide and protein remains implicit given the bottom-up nature of the experiment in which proteins are enzymatically digested prior to sequencing. It is therefore useful to provide a fast lookup from peptide to protein in order to systematically establish the broadest possible protein basis for the observed peptides. Here, we describe Pep2Pro, a fast web-service providing protein lookup by peptides covering the entire protein space comprising ~10 million UniRef100 sequences. We demonstrate the usefulness of the service by reanalyzing peptides from two recent meta-proteomic data sets and identifying taxon-specific peptides, thereby implicating individual species as being present in these complex samples. The Pep2Pro web service can be accessed at http://www.pep2pro.org.
Subjects
Databases, Protein; Dictionaries, Chemical as Topic; Proteomics/methods; Humans; Internet; Peptides/chemistry; Proteins/chemistry
ABSTRACT
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/. This article reports on new features in ChEBI since the last NAR report in 2007, including substructure and similarity searching, a submission tool for authoring of ChEBI datasets by the community and a 30-fold increase in the number of chemical structures stored in ChEBI.
Subjects
Computational Biology/methods; Databases, Factual; Agrochemicals/chemistry; Animals; Biological Products/chemistry; Computational Biology/trends; Dictionaries, Chemical as Topic; Humans; Information Storage and Retrieval/methods; Internet; Pharmaceutical Preparations/chemistry; Software; User-Computer Interface; Vocabulary, Controlled
ABSTRACT
The PubChem BioAssay database (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological activities of small molecules and small interfering RNAs (siRNAs) hosted by the US National Institutes of Health (NIH). It archives experimental descriptions of assays and biological test results and makes the information freely accessible to the public. A PubChem BioAssay data entry includes an assay description, a summary and detailed test results. Each assay record is linked to the molecular target, whenever possible, and is cross-referenced to other National Center for Biotechnology Information (NCBI) database records. 'Related BioAssays' are identified by examining the assay target relationship and activity profile of commonly tested compounds. A key goal of PubChem BioAssay is to make the biological activity information easily accessible through the NCBI information retrieval system, Entrez, and various web-based PubChem services. An integrated suite of data analysis tools is available to optimize the utility of the chemical structure and biological activity information within PubChem, enabling researchers to aggregate, compare and analyze biological test results contributed by multiple organizations. In this work, we describe the PubChem BioAssay database, including data model, bioassay deposition and utilities that PubChem provides for searching, downloading and analyzing the biological activity information contained therein.
Subjects
Biological Assay; Computational Biology/methods; Databases, Factual; Dictionaries, Chemical as Topic; Animals; Computational Biology/trends; Databases, Protein; Humans; Information Storage and Retrieval/methods; Internet; National Library of Medicine (U.S.); Pharmaceutical Preparations/chemistry; Pharmacology; Software; Structure-Activity Relationship; United States
ABSTRACT
The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD, http://umbbd.msi.umn.edu/) began in 1995 and now contains information on almost 1200 compounds, over 800 enzymes, almost 1300 reactions and almost 500 microorganism entries. Besides these data, it includes a Biochemical Periodic Table (UM-BPT) and a rule-based Pathway Prediction System (UM-PPS) (http://umbbd.msi.umn.edu/predict/) that predicts plausible pathways for microbial degradation of organic compounds. Currently, the UM-PPS contains 260 biotransformation rules derived from reactions found in the UM-BBD and scientific literature. Public access to UM-BBD data is increasing. UM-BBD compound data are now contributed to PubChem and ChemSpider, the public chemical databases. A new mirror website of the UM-BBD, UM-BPT and UM-PPS is being developed at ETH Zürich to improve speed and reliability of online access from anywhere in the world.
Subjects
Biodegradation, Environmental; Computational Biology/methods; Databases, Genetic; Databases, Nucleic Acid; Access to Information; Biochemistry/methods; Biotransformation; Computational Biology/trends; Dictionaries, Chemical as Topic; Environmental Pollutants; Genome, Bacterial; Information Storage and Retrieval/methods; Internet; Microbiology; Software; User-Computer Interface
ABSTRACT
The Arabic MS Sprenger 1908 (Staatsbibliothek, Berlin) is a handbook of medieval alchemy. Among the works it preserves, we can find the only extant witness to the Arabic original of the well-known Liber de aluminibus et salibus. In this paper, I focus on a detailed alchemical dictionary preserved in this manuscript (fols. 3r-6r) whose explicit aim is to clarify the meaning of the secret language used by the alchemists to conceal the names of substances and operations. Other versions of the same alchemical lexicon are found in Syriac and karsuni in MSS Oriental 1593 and Egerton 709, both preserved in the British Library. After describing these manuscripts, I analyse the contents of this dictionary, its structure, its different versions, and the features of the alchemical language that it attests to, providing some examples to show how this kind of dictionary is still a useful tool for the contemporary researcher.
Subjects
Alchemy; Dictionaries, Chemical as Topic; Manuscripts as Topic/history; Arabia; Chemistry/history; History, 20th Century; History, Medieval; Language
ABSTRACT
MOTIVATION: The scientific community has spent a lot of effort on the correct identification of gene and protein names in text, while less effort has been spent on the correct identification of chemical names. Dictionary-based term identification has the power to recognize the diverse representation of chemical information in the literature and map the chemicals to their database identifiers. RESULTS: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the different processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary. AVAILABILITY: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist.
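The abstract reports precision and recall but no combined score. Assuming the standard balanced F-measure (an assumption, since the abstract does not state one), the overall figures for the combined dictionary correspond to roughly:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.67 \times 0.40}{0.67 + 0.40}
    = \frac{0.536}{1.07}
    \approx 0.50
```

This derived value is for orientation only. It is not a figure reported by the authors and should not be compared directly with F-measures obtained on other corpora.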
Subjects
Computational Biology/methods; Dictionaries, Chemical as Topic; Information Storage and Retrieval/methods; Abstracting and Indexing/methods; Dictionaries as Topic; Natural Language Processing; Pharmaceutical Preparations/chemistry; Software; Unified Medical Language System
ABSTRACT
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on "small" chemical compounds. This unit provides a detailed guide to browsing, searching, downloading, and programmatic access to the ChEBI database.
Subjects
Computational Biology/methods; Databases, Factual; Dictionaries, Chemical as Topic; Software; Databases, Genetic; Databases, Protein
ABSTRACT
Very few science books circulated among European scientists as efficiently as Lemery's. His Course of Chemistry was published 18 times in French and was translated into Latin, English, German, Spanish, Italian and Dutch. His Treatise or Dictionary of Drugs was published 14 times in French and translated into foreign languages. His Universal Pharmacopoeia was published no fewer than 17 times in French and was also translated into foreign languages. The longevity of these books was quite unusual: for example, his Dictionary, first published in 1698, was published again in 1807!
Subjects
Chemistry/history; Pharmacology/history; Pharmacopoeias as Topic/history; Dictionaries, Chemical as Topic; France; History, 17th Century; History, 18th Century; History, 19th Century
ABSTRACT
The increasing structural information about target-bound compounds provides a rich basis for studying the binding mechanisms of metabolites and drugs. SuperSite is a database that combines this structural information with various tools for the analysis of molecular recognition. The main data comprise 8000 metabolites, including 1300 drugs, bound to about 290,000 different receptor binding sites. The analysis tools include features such as the highlighting of evolutionarily conserved receptor residues, the marking of putative binding pockets and the superpositioning of different binding sites of the same ligand. User-defined compounds can be edited or uploaded and will be superimposed with the most similar co-crystallized ligand. The user can examine all results online with the molecule viewer Jmol. An implemented search algorithm allows the screening of uploaded proteins in order to detect potential drug binding sites that are similar to known binding pockets. The huge data set of target-bound compounds, in combination with the provided analysis tools, allows the characteristics of molecular recognition to be inspected, especially for drug-target interactions. SuperSite is publicly available at: http://bioinformatics.charite.de/supersite.
Subjects
Databases, Protein; Pharmaceutical Preparations/chemistry; Proteins/chemistry; Binding Sites; Computer Graphics; Dictionaries, Chemical as Topic; Encyclopedias as Topic; Ligands; Metabolism; Software; Vitamin B 6/chemistry
ABSTRACT
MOTIVATION: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, such as SMILES, InChI, IUPAC or trivial names. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in this way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, as well as its evaluation and the conversion rate with available name-to-structure tools. RESULTS: We present an IUPAC name recognizer with an F(1) measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F(1) measure 81.5%). The remaining recognition problem is detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. AVAILABILITY: We plan to publish the corpora, the annotation guideline and the conditional random field model as a UIMA component.
Subjects
Abstracting and Indexing/methods; Dictionaries, Chemical as Topic; MEDLINE; Natural Language Processing; Pharmaceutical Preparations/classification; Terminology as Topic; Vocabulary, Controlled; Artificial Intelligence; Pattern Recognition, Automated/methods
ABSTRACT
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/
Subjects
Databases, Factual; Dictionaries, Chemical as Topic; Agrochemicals/chemistry; Biological Products/chemistry; Indicators and Reagents/chemistry; Internet; Isotopes/chemistry; Pharmaceutical Preparations/chemistry; User-Computer Interface; Vocabulary, Controlled
ABSTRACT
The Worldwide Protein Data Bank (wwPDB; wwpdb.org) is the international collaboration that manages the deposition, processing and distribution of the PDB archive. The online PDB archive at ftp://ftp.wwpdb.org is the repository for the coordinates and related information for more than 47 000 structures, including proteins, nucleic acids and large macromolecular complexes that have been determined using X-ray crystallography, NMR and electron microscopy techniques. The members of the wwPDB, namely RCSB PDB (USA), MSD-EBI (Europe), PDBj (Japan) and BMRB (USA), have remediated this archive to address inconsistencies that have been introduced over the years. The scope and methods used in this project are presented.
Subjects
Databases, Protein; Macromolecular Substances/chemistry; Archives; Crystallography, X-Ray; Databases, Protein/standards; Dictionaries, Chemical as Topic; Internet; Microscopy, Electron; Nuclear Magnetic Resonance, Biomolecular; Nucleic Acids/chemistry; Proteins/chemistry; Reproducibility of Results; Terminology as Topic
ABSTRACT
MOTIVATION: The size of current protein databases is a challenge for many Bioinformatics applications, both in terms of processing speed and information redundancy. It may be therefore desirable to efficiently reduce the database of interest to a maximally representative subset. RESULTS: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We used the SCOP40 domain database by translating protein structures into character strings by means of a structural alphabet and by extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Therefore, optimized subsets are maximally representative for the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4. AVAILABILITY: http://mathbio.nimr.mrc.ac.uk/~jkleinj/MinSet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
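To make the notion of an entropy score over a constant-length fragment dictionary concrete, the sketch below computes the Shannon entropy of the length-k fragment distribution in a set of strings. It is an illustrative stand-in only: the exact MinSet score, the structural alphabet, the suffix tree and the genetic algorithm are not reproduced, and the function name and toy strings are assumptions.

```python
import math
from collections import Counter
from typing import Iterable


def fragment_entropy(strings: Iterable[str], k: int) -> float:
    """Shannon entropy (in bits) of the distribution of length-k fragments."""
    counts = Counter(
        s[i:i + k] for s in strings for i in range(len(s) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no string is long enough to yield a fragment
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy "database" of encoded strings and a candidate subset of it.
    full = ["ABAB", "ABAB", "CCCC"]
    subset = ["ABAB", "CCCC"]
    for k in (1, 2):
        print(k, fragment_entropy(full, k), fragment_entropy(subset, k))
```

A subset whose fragment entropy stays close to that of the full set retains most of the local diversity, which is the intuition behind selecting a maximally representative subset.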