1.
J Cheminform ; 16(1): 19, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38378618

ABSTRACT

The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, the common concern is the quality of both the chemical structure information and associated experimental data. This is especially true when those data are collected from multiple sources as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read, and then the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative modeling QSAR projects to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready structures" to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry. Both QSAR and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as docker container resources for the scientific community. Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes molecular descriptors' accuracy and reliability. The freely available resources in KNIME, GitHub, and docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
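
The standardization steps described above map naturally onto open-source cheminformatics toolkits. The following is a minimal sketch of a comparable pipeline using RDKit rather than the authors' KNIME workflow; the input SMILES and the exact order of operations are illustrative assumptions, not a reproduction of the published rules.

```python
# Sketch of a "QSAR-ready"-style standardization pass with RDKit (not the authors'
# KNIME workflow): desalt, strip stereochemistry, canonicalize tautomers,
# neutralize charges where possible, and remove duplicates via InChIKey.
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                                  # unparsable structure
    mol = SaltRemover().StripMol(mol)                 # desalting
    Chem.RemoveStereochemistry(mol)                   # strip stereo for 2D QSAR forms
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize when possible
    return mol

seen, unique = set(), []
for smi in ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"]:  # toy input
    mol = standardize(smi)
    if mol is None:
        continue
    key = Chem.MolToInchiKey(mol)                     # duplicate removal on InChIKey
    if key not in seen:
        seen.add(key)
        unique.append(Chem.MolToSmiles(mol))
print(unique)   # both inputs collapse to the same parent structure
```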

2.
Mol Pharm ; 19(2): 674-689, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34964633

ABSTRACT

Tuberculosis (TB) is a major global health challenge, with approximately 1.4 million deaths per year. There is still a need to develop novel treatments for patients infected with Mycobacterium tuberculosis (Mtb). There have been many large-scale phenotypic screens that have led to the identification of thousands of new compounds. Yet there is very limited investment in TB drug discovery, which points to the need for new methods to increase the efficiency of drug discovery against Mtb. We have used machine learning approaches to learn from public Mtb data, resulting in many data sets and models with robust enrichment and hit rates, leading to the discovery of new active compounds. Recently, we curated predominantly small-molecule Mtb data and developed new machine learning classification models with 18,886 molecules at different activity cutoffs. We now describe the further validation of these Bayesian models using a library of over 1000 molecules synthesized as part of the EU-funded New Medicines for TB and More Medicines for TB programs. We highlight molecular features that are enriched in these active compounds. In addition, we provide new regression and classification models that can be used to score compound libraries or to design new molecules. We have also visualized these molecules in the context of known molecular targets and identified clusters in chemical property space, which may aid future target identification efforts. Finally, we are making these data sets publicly available, representing a significant addition to the Mtb inhibition data available in the public domain.


Subject(s)
Mycobacterium tuberculosis; Tuberculosis; Antitubercular Agents/chemistry; Bayes Theorem; Humans; Machine Learning; Tuberculosis/drug therapy
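
As a rough illustration of the Bayesian fingerprint-based classification described above (not the authors' exact descriptors, cutoffs, or software), a minimal sketch with RDKit Morgan fingerprints and a Bernoulli naive Bayes classifier might look like this; the SMILES and activity labels are placeholder data.

```python
# Hedged sketch: Bernoulli naive Bayes on binary Morgan fingerprints, standing in
# for the Bayesian Mtb activity models described in the abstract.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

# Placeholder data: (SMILES, active-at-cutoff label); the real models used ~18,886 molecules.
data = [("CCO", 0), ("c1ccccc1O", 0), ("CC(=O)Nc1ccc(O)cc1", 1),
        ("Clc1ccc(cc1)C(=O)N", 1), ("CCN(CC)CC", 0), ("O=C(O)c1ccccc1", 1)]

def fp(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

X = np.array([fp(s) for s, _ in data])
y = np.array([label for _, label in data])

clf = BernoulliNB()
scores = cross_val_score(clf, X, y, cv=3, scoring="balanced_accuracy")
print("Cross-validated balanced accuracy:", scores.mean())
```
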
3.
J Phys Chem Lett ; 12(38): 9213-9219, 2021 Sep 30.
Article in English | MEDLINE | ID: mdl-34529429

ABSTRACT

The use of machine learning in chemistry has become common practice. At the same time, despite the success of modern machine learning methods, a shortage of data limits their use. Transfer learning can help solve this problem. The methodology assumes that a model built on a sufficiently large data set captures general features of the chemical structures on which it was trained, and that reusing these features on a smaller data set will greatly improve the quality of the new model. In this paper, we develop this approach for small organic molecules, implementing transfer learning with graph convolutional neural networks. The paper shows a significant improvement in the performance of models for target properties with limited data. The effects of the data set composition on model quality and the applicability domain of the resulting models are also considered.
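
To make the transfer-learning idea concrete, here is a minimal, self-contained sketch in plain PyTorch with dense adjacency matrices and random synthetic data; it is not the authors' architecture or data, just an illustration of pretraining a small graph convolutional model and then freezing its convolutional layers while fine-tuning a fresh readout head on a smaller target set.

```python
# Hedged sketch of transfer learning with a toy graph convolutional network:
# pretrain on a large "source" set, then freeze the convolutions and fine-tune
# only the readout head on a small "target" set. All data here are random.
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, n_feat=16, n_hidden=32):
        super().__init__()
        self.gc1 = nn.Linear(n_feat, n_hidden)   # shared graph-convolution weights
        self.gc2 = nn.Linear(n_hidden, n_hidden)
        self.head = nn.Linear(n_hidden, 1)       # property-specific readout

    def forward(self, adj, x):
        # H' = ReLU(A . H . W), twice, then mean-pool over atoms
        h = torch.relu(self.gc1(adj @ x))
        h = torch.relu(self.gc2(adj @ h))
        return self.head(h.mean(dim=1)).squeeze(-1)

def train(model, adj, x, y, epochs=200, lr=1e-2):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(adj, x), y)
        loss.backward()
        opt.step()

torch.manual_seed(0)
# Synthetic "molecules": batches of (adjacency, atom features, target property).
adj_src, x_src, y_src = torch.rand(200, 10, 10), torch.rand(200, 10, 16), torch.rand(200)
adj_tgt, x_tgt, y_tgt = torch.rand(20, 10, 10), torch.rand(20, 10, 16), torch.rand(20)

model = TinyGCN()
train(model, adj_src, x_src, y_src)          # 1) pretrain on the large source set

for layer in (model.gc1, model.gc2):         # 2) freeze the learned graph features
    for p in layer.parameters():
        p.requires_grad = False
model.head = nn.Linear(32, 1)                # fresh head for the new endpoint
train(model, adj_tgt, x_tgt, y_tgt)          # 3) fine-tune on the small target set
print("fine-tuned target MSE:", nn.functional.mse_loss(model(adj_tgt, x_tgt), y_tgt).item())
```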

4.
Molecules ; 26(11)2021 May 27.
Article in English | MEDLINE | ID: mdl-34072262

ABSTRACT

Modern structure-property models are widely used in chemistry; however, in many cases they are still a kind of "black box" with no clear path from molecular structure to target property. Here we present an example of using deep learning not only to build a model but also to determine the key structural fragments of ligands that influence metal complexation. For a series of chemically similar lanthanide ions, we collected data on complex stability, built models predicting stability constants, and decoded the models to obtain the key fragments responsible for complexation efficiency. The results correlate well with experiment as well as with modern theories of complexation. The mutual location of the binding centers was shown to have the main influence on the constants.
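
One simple way to approximate the "decoding" step described above, attributing a prediction back to structural fragments, is bit occlusion on a fingerprint model: switch off each fingerprint bit in turn, record how the predicted stability constant changes, and map the most influential bits back to atoms. The sketch below uses RDKit and a random forest on synthetic data; it is an illustrative stand-in, not the authors' deep-learning decoding procedure.

```python
# Hedged sketch: occlusion-style attribution of a fingerprint model's prediction
# to Morgan-fingerprint bits, then back to the atoms that set those bits.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bits_and_info(smiles, n_bits=1024):
    mol = Chem.MolFromSmiles(smiles)
    info = {}
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits, bitInfo=info)
    return mol, np.array(bv), info

# Placeholder training set of ligand SMILES with synthetic "stability constants".
train_smiles = ["OCCN(CCO)CCO", "O=C(O)CN(CC(=O)O)CC(=O)O", "NCCN", "OCCOCCO", "O=C(O)CCC(=O)O"]
rng = np.random.default_rng(0)
X = np.array([bits_and_info(s)[1] for s in train_smiles])
y = rng.normal(size=len(train_smiles))          # synthetic targets for the sketch
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

mol, x, info = bits_and_info("O=C(O)CN(CC(=O)O)CC(=O)O")  # ligand to explain
base = model.predict(x.reshape(1, -1))[0]
impact = {}
for bit in np.flatnonzero(x):                   # occlude each set bit in turn
    x_off = x.copy()
    x_off[bit] = 0
    impact[bit] = base - model.predict(x_off.reshape(1, -1))[0]

top_bit = max(impact, key=lambda b: abs(impact[b]))
atoms = [atom for atom, _radius in info[top_bit]]
print(f"most influential bit {top_bit}, rooted at atoms {atoms}, impact {impact[top_bit]:.3f}")
```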

6.
J Chem Inf Model ; 60(1): 22-28, 2020 01 27.
Article in English | MEDLINE | ID: mdl-31860296

ABSTRACT

Nowadays, the development of new functional materials and chemical compounds using machine learning (ML) techniques is a hot topic and involves several crucial steps, one of which is the choice of chemical structure representation. The classical approach of rigorous feature engineering in ML typically improves the performance of the predictive model, but at the same time it narrows the scope of applicability and decreases the physical interpretability of predicted results. In this study, we present graph convolutional neural networks (GCNNs) as an architecture that allows the properties of compounds from diverse domains of chemical space to be predicted successfully using a minimal set of meaningful descriptors. The applicability of GCNN models has been demonstrated on a wide range of domain-specific chemical properties. Their performance is comparable to state-of-the-art techniques, while the architecture removes the need for precise feature engineering.


Subject(s)
Machine Learning; Neural Networks, Computer; Algorithms; Crystallization; Density Functional Theory; Models, Molecular; Structure-Activity Relationship
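
The minimal-descriptor idea hinges on feeding the model a graph rather than engineered features. Below is a small, hedged sketch of how a molecule can be turned into the adjacency matrix and simple atom-feature matrix a GCNN consumes, using RDKit; the feature choice (atomic number, degree, aromaticity) is illustrative only.

```python
# Hedged sketch: convert a SMILES string into graph inputs (adjacency matrix plus
# a minimal atom-feature matrix) of the kind a graph convolutional network consumes.
from rdkit import Chem
import numpy as np

def mol_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    adj = np.eye(n)                                   # self-loops
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = 1.0
    # Minimal per-atom features: atomic number, heavy-atom degree, aromatic flag.
    feats = np.array([[atom.GetAtomicNum(),
                       atom.GetDegree(),
                       int(atom.GetIsAromatic())] for atom in mol.GetAtoms()],
                     dtype=float)
    # Symmetric degree normalization D^-1/2 (A + I) D^-1/2, as used in many GCN layers.
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return adj_norm, feats

adj, feats = mol_to_graph("c1ccccc1O")  # phenol
print(adj.shape, feats.shape)           # (7, 7) adjacency, (7, 3) atom features
```
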
7.
J Cheminform ; 11(1): 60, 2019 Sep 18.
Article in English | MEDLINE | ID: mdl-33430972

ABSTRACT

BACKGROUND: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and the ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction. METHODS: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure-activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN). RESULTS: The three methods delivered comparable performance on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R2) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and the performance of our models compared favorably to the commercial products. CONCLUSIONS: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data and provided as free and open-source software on GitHub.
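
As a hedged, minimal sketch of the modeling step (not the published workflow, which used PaDEL descriptors and SVM+kNN, XGB, and DNN models trained on the DataWarrior set), the pattern of fitting a regressor on fingerprints and reporting RMSE and R2 looks like this; the file name `pka_data.csv` and its column names are assumptions for illustration.

```python
# Hedged sketch: fingerprint-based pKa regression with RMSE and R2 reporting.
# The CSV path and column names are illustrative; the paper used PaDEL descriptors
# and SVM/kNN, XGBoost, and DNN models rather than this single SVR.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("pka_data.csv")          # assumed columns: "smiles", "pka_acidic"

def fp(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

X = np.array([fp(s) for s in df["smiles"]])
y = df["pka_acidic"].to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = SVR(C=10.0, epsilon=0.1).fit(X_tr, y_tr)

pred = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, pred))
print(f"RMSE = {rmse:.2f}, R2 = {r2_score(y_te, pred):.2f}")
```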

8.
Mol Pharm ; 15(10): 4346-4360, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29672063

ABSTRACT

Tuberculosis is a global health dilemma. In 2016, the WHO reported 10.4 million incident cases and 1.7 million deaths. The need to develop new treatments for those infected with Mycobacterium tuberculosis (Mtb) has led to many large-scale phenotypic screens and many thousands of new active compounds identified in vitro. However, with limited funding, efforts to discover new active molecules against Mtb need to be more efficient. Several computational machine learning approaches have been shown to have good enrichment and hit rates. We have curated small-molecule Mtb data and developed new models with a total of 18,886 molecules at activity cutoffs of 10 µM, 1 µM, and 100 nM. These data sets were used to evaluate different machine learning methods (including deep learning) and metrics and to generate predictions for additional molecules published in 2017. One Mtb model, a Bayesian model combining in vitro and in vivo data at a 100 nM activity cutoff, yielded the following metrics for 5-fold cross-validation: accuracy = 0.88, precision = 0.22, recall = 0.91, specificity = 0.88, kappa = 0.31, and MCC = 0.41. We have also curated an evaluation set (n = 153 compounds) published in 2017, and when used to test our model it showed comparable statistics (accuracy = 0.83, precision = 0.27, recall = 1.00, specificity = 0.81, kappa = 0.36, and MCC = 0.47). We have also compared these models with additional machine learning algorithms, showing that Bayesian machine learning models constructed with literature Mtb data generated by different laboratories were generally equivalent to or outperformed deep neural networks on external test sets. Finally, we compared our training and test sets to show that they were suitably diverse and distinct and thus represent useful evaluation sets. Such Mtb machine learning models could help prioritize compounds for testing in vitro and in vivo.


Subject(s)
Antitubercular Agents/pharmacology; Mycobacterium tuberculosis/drug effects; Bayes Theorem; Drug Discovery; Machine Learning; Support Vector Machine
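
The evaluation metrics quoted above (accuracy, precision, recall, specificity, Cohen's kappa, and MCC under 5-fold cross-validation) can all be computed with scikit-learn; the sketch below shows the pattern on placeholder arrays `X` and `y` rather than the curated Mtb data.

```python
# Hedged sketch: 5-fold cross-validated classification metrics of the kind reported
# in the abstract. X and y are random placeholders for the curated Mtb data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             cohen_kappa_score, matthews_corrcoef, confusion_matrix)

rng = np.random.default_rng(0)
X = rng.random((200, 64))                      # placeholder descriptors
y = rng.integers(0, 2, size=200)               # placeholder activity labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=cv)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("accuracy   ", accuracy_score(y, pred))
print("precision  ", precision_score(y, pred))
print("recall     ", recall_score(y, pred))
print("specificity", tn / (tn + fp))
print("kappa      ", cohen_kappa_score(y, pred))
print("MCC        ", matthews_corrcoef(y, pred))
```
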
9.
Mol Pharm ; 14(12): 4462-4475, 2017 12 04.
Article in English | MEDLINE | ID: mdl-29096442

ABSTRACT

Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint-type molecular descriptors paired with Bayesian methods resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications, from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has found considerable traction for many artificial intelligence applications. We have previously suggested the need for a comparison of different machine learning methods with deep learning across an array of varying data sets applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we have used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole-cell screens, individual proteins, and physicochemical properties, as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient, and others. Based on ranked normalized scores for the metrics or data sets, deep neural networks (DNN) ranked higher than SVM, which in turn was ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar-type plots indicates when models are inferior or perhaps overtrained. These results also suggest the need to assess deep learning further using multiple metrics with much larger scale comparisons, prospective testing, and assessment of different fingerprints and DNN architectures beyond those used here.


Subject(s)
Drug Discovery/methods; Machine Learning; Neural Networks, Computer; Bayes Theorem; Datasets as Topic
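
The "ranked normalized scores" used above to compare methods can be reproduced in spirit with a small pandas calculation: rank each method on each metric, rescale the ranks to [0, 1], and average. The numbers below are invented placeholders, not the study's results.

```python
# Hedged sketch: rank methods per metric, normalize ranks to [0, 1], and average,
# mirroring the "ranked normalized score" comparison. Values are placeholders.
import pandas as pd

metrics = pd.DataFrame(
    {"AUC":   {"DNN": 0.83, "SVM": 0.81, "Bayesian": 0.79, "RF": 0.80},
     "F1":    {"DNN": 0.66, "SVM": 0.64, "Bayesian": 0.61, "RF": 0.62},
     "Kappa": {"DNN": 0.42, "SVM": 0.40, "Bayesian": 0.37, "RF": 0.38},
     "MCC":   {"DNN": 0.44, "SVM": 0.41, "Bayesian": 0.38, "RF": 0.39}})

ranks = metrics.rank(axis=0, ascending=True)           # higher metric -> higher rank
normalized = (ranks - 1) / (len(metrics) - 1)          # rescale ranks to [0, 1]
overall = normalized.mean(axis=1).sort_values(ascending=False)
print(overall)                                         # composite ranking of methods
```
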
10.
J Cheminform ; 8: 66, 2016.
Article in English | MEDLINE | ID: mdl-27933103

ABSTRACT

BACKGROUND: Three-dimensional (3D) printed crystal structures are useful for chemistry teaching and research. Current manual methods of converting crystal structures into 3D printable files are time-consuming and tedious. To overcome this limitation, we developed a programmatic method that allows for facile conversion of thousands of crystal structures directly into 3D printable files. RESULTS: A collection of over 30,000 crystal structures in crystallographic information file (CIF) format from the Crystallography Open Database (COD) was programmatically converted into 3D printable files (VRML format) using Jmol scripting. The conversion of the 30,000 CIFs proceeded as expected; however, some inconsistencies and unintended results were observed with co-crystallized structures, racemic mixtures, and structures with large counterions, which led to 3D printable files that did not contain the desired chemical structure. Potential solutions to these challenges are considered and discussed. Further, a searchable Jmol 3D Print website was created that allows users both to discover the 3D file dataset created in this work and to create custom 3D printable files for any structure in the COD. CONCLUSIONS: Over 30,000 crystal structures were programmatically converted into 3D printable files, giving users quick access to a sizable collection of 3D printable crystal structures. Further, any crystal structure (>350,000) in the COD can now be conveniently converted into 3D printable file formats using the Jmol 3D Print website created in this work. The 3D Print website also allows users to convert their own CIFs into 3D printable files. The 3D file data, scripts, and the Jmol 3D Print website are provided openly to the community in an effort to promote discovery and use of 3D printable crystal structures. The 3D file dataset and Jmol 3D Print website will find wide use with researchers and educators seeking to 3D print chemical structures, while the scripts will be useful for programmatically converting large database collections of crystal structures into 3D printable files.
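
The batch conversion rests on Jmol's `load` and `write` script commands. A hedged sketch of generating such a script from Python is shown below; the directory layout and rendering settings are illustrative, the exact `write VRML` and unit-cell syntax should be checked against your Jmol version, and running the script requires a Jmol installation (invocation details vary by platform), so no command-line call is shown.

```python
# Hedged sketch: emit a Jmol script (.spt) that loads each CIF and exports a VRML
# file, in the spirit of the programmatic conversion described above. Paths are
# illustrative, and the exact "write VRML" syntax should be verified for your Jmol.
from pathlib import Path

cif_dir = Path("cod_cifs")            # assumed folder of downloaded COD CIF files
out_dir = Path("vrml_out")
out_dir.mkdir(exist_ok=True)

lines = []
for cif in sorted(cif_dir.glob("*.cif")):
    wrl = out_dir / (cif.stem + ".wrl")
    lines.append(f'load "{cif.as_posix()}" {{1 1 1}};')   # load one unit cell (adjust as needed)
    lines.append('spacefill 20%; wireframe 0.15;')        # simple printable rendering
    lines.append(f'write VRML "{wrl.as_posix()}";')

Path("convert_to_vrml.spt").write_text("\n".join(lines))
print(f"Wrote Jmol script with {len(lines) // 3} conversions; run it with Jmol.")
```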

11.
Article in English | MEDLINE | ID: mdl-26989153

ABSTRACT

The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web-interface visual environments built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing data access privileges to be determined in a finely granular manner without flooding the security subsystem with a multiplicity of rules. The HIVE infrastructure allows engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL: https://hive.biochemistry.gwu.edu.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; User-Computer Interface; Computational Biology; Mutation/genetics; Poliovirus/genetics; Poliovirus Vaccines/immunology; Proteomics; Recombination, Genetic; Sequence Alignment; Statistics as Topic
12.
J Cheminform ; 7: 30, 2015.
Article in English | MEDLINE | ID: mdl-26155308

ABSTRACT

BACKGROUND: There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, and the inexperience of software users, myriad issues can exist with chemical structure representations online. In order to help facilitate validation and standardization of chemical structure datasets from various sources, we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets. RESULTS: The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or potentially require manual review. Each identified issue is assigned one of three levels of severity - Information, Warning, and Error - in order to conveniently inform the user of the need to browse and review subsets of their data. The validation process covers atoms and bonds (e.g., flagging query atoms and bonds), valences, and stereochemistry. The standard form of submission for collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and in preparation of data for the Open PHACTS project. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and of the ChEMBL 17 data set. The CVSP website is located at http://cvsp.chemspider.com/. CONCLUSION: A platform for the validation and standardization of chemical structure representations of various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files to produce more homogeneous compound representations for exchange and interchange between online databases. While the CVSP platform is designed with flexibility inherent in the rules that can be used for processing the data, we have produced a recommended rule set based on our own experience with large data sets such as DrugBank, ChEMBL, and data sets from ChemSpider.
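
The cross-validation of SDF connection tables against associated SMILES fields can be approximated locally with RDKit by comparing InChIs, as in the hedged sketch below; the file name and the `SMILES` property tag are assumptions about how a particular deposition file is laid out, not CVSP's implementation.

```python
# Hedged sketch: flag records whose SDF connection table disagrees with the SMILES
# stored in an SD data field, by comparing standard InChIs (not CVSP itself).
from rdkit import Chem

issues = []
for i, mol in enumerate(Chem.SDMolSupplier("deposition.sdf")):   # assumed input file
    if mol is None:
        issues.append((i, "Error", "connection table could not be parsed"))
        continue
    if not mol.HasProp("SMILES"):                                # assumed field name
        issues.append((i, "Information", "no SMILES field to cross-check"))
        continue
    from_field = Chem.MolFromSmiles(mol.GetProp("SMILES"))
    if from_field is None:
        issues.append((i, "Error", "SMILES field could not be parsed"))
    elif Chem.MolToInchi(mol) != Chem.MolToInchi(from_field):
        issues.append((i, "Warning", "SMILES and connection table describe different structures"))

for record, severity, message in issues:
    print(f"record {record}: {severity}: {message}")
```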

13.
J Chem Inf Model ; 55(3): 501-9, 2015 Mar 23.
Article in English | MEDLINE | ID: mdl-25679543

ABSTRACT

In designing an Electronic Lab Notebook (ELN), there is a balance to be struck between keeping it as general and multidisciplinary as possible, for simplicity of use and maintenance, and introducing more domain-specific functionality to increase its appeal to target research areas. Here, we describe the results of a collaboration between the Royal Society of Chemistry (RSC) and the University of Southampton, guided by the aims of the Dial-a-Molecule Grand Challenge, intended to achieve the best of both worlds and augment a discipline-agnostic ELN, LabTrove, with chemistry-specific functionality using data provided by the ChemSpider platform. This has been done using plug-in technology to ensure that the chemistry functionality can be transferred to other ELNs, and equally that other subject-specific functionality can be transferred to LabTrove, with minimal effort. The resulting product, ChemTrove, has undergone a usability trial by selected academics, and the resulting feedback will guide the future development of the underlying ELN technology.


Subject(s)
Chemistry/methods; Information Storage and Retrieval; Internet; Software; Laboratories
14.
J Comput Aided Mol Des ; 28(10): 1023-30, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25086851

ABSTRACT

Since 2009 the Royal Society of Chemistry (RSC) has been delivering access to chemistry data and cheminformatics tools via the ChemSpider database and has garnered a significant community following in terms of usage and contribution to the platform. ChemSpider has focused only on those chemical entities that can be represented as molecular connection tables or, more specifically, for which an InChI can be generated from the input structure. As a structure-centric hub, ChemSpider is built around the molecular structure, with other data and links associated with that structure. As a result, the platform has been limited in the types of data that can be managed and in the flexibility of its searches, and it is constrained by the data model. New technologies and approaches, specifically a shift from relational to NoSQL databases and the growing importance of the semantic web, have motivated RSC to rearchitect and create a more generic data repository utilizing these new technologies. This article provides an overview of our activities in delivering data-sharing platforms for the chemistry community, including the development of the new data repository expanding into more extensive domains of chemistry data.


Subject(s)
Databases, Chemical; Societies, Scientific; Cooperative Behavior; Information Dissemination; Internet; United Kingdom; User-Computer Interface
15.
J Cheminform ; 5(1): 23, 2013 May 08.
Article in English | MEDLINE | ID: mdl-23657106

ABSTRACT

BACKGROUND: Making data available as Linked Data using the Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis. RESULTS: This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferenceable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying. CONCLUSIONS: We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the decision support presented here.
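
Querying linked data such as ChEMBL-RDF typically means sending SPARQL to an endpoint. The sketch below uses the SPARQLWrapper package with a placeholder endpoint URL and a deliberately generic query, since the exact predicates depend on the ChEMBL-RDF release and the ontologies (e.g., CHEMINF, CiTO) in use.

```python
# Hedged sketch: a generic SPARQL query against a ChEMBL-RDF endpoint. The endpoint
# URL is a placeholder and the predicates are illustrative; consult the release
# documentation for the actual vocabulary (CHEMINF, CiTO, etc.).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/chembl/sparql")   # placeholder endpoint
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?compound ?label
    WHERE {
        ?compound rdfs:label ?label .
        FILTER(CONTAINS(LCASE(?label), "imatinib"))
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["compound"]["value"], row["label"]["value"])
```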

16.
Drug Discov Today ; 17(13-14): 685-701, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22426180

ABSTRACT

In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data mining, for computer models, and for integration into systems to aid drug discovery. There is, however, a responsibility to ensure that the data are of high quality, so that time is not wasted on erroneous searches, models are underpinned by accurate data, and the improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of the authors' experiences using online chemical compound databases, critique the approaches taken to assemble the data, and suggest approaches to deliver definitive reference data sources.


Subject(s)
Databases, Chemical/standards; Drug Discovery/methods; Public Sector; Quality Improvement/trends; Databases, Chemical/trends; Drug Discovery/standards; Internet/standards; Quality Control
17.
J Am Soc Mass Spectrom ; 23(1): 179-85, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22069037

ABSTRACT

In many cases, a compound that is an unknown to an investigator is actually known in the chemical literature, a reference database, or an internet resource. We refer to these types of compounds as "known unknowns." ChemSpider is a very valuable internet database of known compounds useful in the identification of these types of compounds in commercial, environmental, forensic, and natural product samples. The database contains over 26 million entries from hundreds of data sources and is provided as a free resource to the community. Accurate-mass mass spectrometry data are used to query the database by either elemental composition or monoisotopic mass. Searching by elemental composition is the preferred approach. However, it is often difficult to determine a unique elemental composition for compounds with molecular weights greater than 600 Da. In these cases, searching by monoisotopic mass is advantageous. In either case, the search results are refined by sorting on the number of references associated with each compound, in descending order. This raises the most useful candidates to the top of the list for further evaluation. These approaches were shown to be successful in identifying "known unknowns" encountered in our laboratory and compounds of interest to others.
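
The monoisotopic-mass search strategy can be mimicked locally: compute exact masses for a candidate list, keep those within a mass tolerance of the measured value, and sort by an external popularity measure such as a reference count. In the hedged sketch below the candidate SMILES and reference counts are invented placeholders, not ChemSpider data or its API.

```python
# Hedged sketch: rank candidate structures for a measured accurate mass by filtering
# on monoisotopic mass (ppm tolerance) and sorting by a reference count, mimicking
# the "known unknowns" workflow. Candidates and counts are placeholders.
from rdkit import Chem
from rdkit.Chem.Descriptors import ExactMolWt

measured_mass = 151.0633          # e.g., monoisotopic mass from accurate-mass MS
tolerance_ppm = 5.0

candidates = [                    # (SMILES, reference count) - illustrative only
    ("CC(=O)Nc1ccc(O)cc1", 12500),   # acetaminophen
    ("COc1ccc(N)cc1C=O", 40),
    ("NC(=O)c1ccccc1OC", 310),
]

hits = []
for smi, refs in candidates:
    mass = ExactMolWt(Chem.MolFromSmiles(smi))
    ppm_error = abs(mass - measured_mass) / measured_mass * 1e6
    if ppm_error <= tolerance_ppm:
        hits.append((refs, smi, mass, ppm_error))

for refs, smi, mass, err in sorted(hits, reverse=True):   # most-referenced first
    print(f"{smi}: {mass:.4f} Da ({err:.1f} ppm), {refs} references")
```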

18.
J Comput Aided Mol Des ; 25(6): 533-54, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21660515

ABSTRACT

The Online Chemical Modeling Environment (OCHEM) is a web-based platform that aims to automate and simplify the typical steps required for QSAR modeling. The platform consists of two major subsystems: a database of experimental measurements and a modeling framework. The user-contributed database provides a set of tools for easy input, search and modification of thousands of records. The OCHEM database is based on the wiki principle and focuses primarily on the quality and verifiability of the data. The database is tightly integrated with the modeling framework, which supports all the steps required to create a predictive model: data search, calculation and selection of a wide variety of molecular descriptors, application of machine learning methods, validation, analysis of the model, and assessment of the applicability domain. Compared with other similar systems, OCHEM is not intended to re-implement existing tools or models but rather to invite the original authors to contribute their results, make them publicly available, share them with other users, and become members of the growing research community. Our intention is to make OCHEM a widely used platform for performing QSPR/QSAR studies online and sharing them with other users on the Web. The ultimate goal of OCHEM is to collect all possible chemoinformatics tools within one simple, reliable and user-friendly resource. OCHEM is free for web users and is available online at http://www.ochem.eu.


Subject(s)
Databases, Factual; Internet; Models, Chemical; Information Dissemination; Information Management; Quantitative Structure-Activity Relationship; User-Computer Interface
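
One common way to implement the applicability-domain check mentioned above is a nearest-neighbour similarity criterion: a query compound is inside the domain if its maximum Tanimoto similarity to the training set exceeds a threshold. The sketch below is a generic illustration with RDKit, not OCHEM's own procedure, and the 0.3 threshold is an arbitrary assumption.

```python
# Hedged sketch: a nearest-neighbour Tanimoto applicability-domain check, a generic
# stand-in for the applicability-domain assessment OCHEM offers (threshold assumed).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

training_smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCOC(=O)C"]
training_fps = [fp(s) for s in training_smiles]

def in_domain(query_smiles, threshold=0.3):
    sims = DataStructs.BulkTanimotoSimilarity(fp(query_smiles), training_fps)
    return max(sims) >= threshold, max(sims)

for q in ["CCCO", "c1ccc2ccccc2c1", "O=S(=O)(O)O"]:
    ok, best = in_domain(q)
    print(f"{q}: max similarity {best:.2f} -> {'inside' if ok else 'outside'} domain")
```
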
20.
J Cheminform ; 2(1): 3, 2010 03 23.
Article in English | MEDLINE | ID: mdl-20331846

ABSTRACT

BACKGROUND: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. RESULTS: We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one-third to one-quarter the size of Chemlist, at around 300 k names. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. CONCLUSIONS: We conclude the following: (1) the ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
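
The precision, recall, and F-score comparison above follows the standard pattern for dictionary-based term recognition: match dictionary entries against text, compare the matched spans with gold-standard annotations, and count true/false positives and false negatives. The sketch below is a minimal, case-insensitive illustration on toy data, without the rule-based filtering and disambiguation steps the paper applies.

```python
# Hedged sketch: dictionary-based chemical term matching scored against a toy
# gold-standard annotation, reporting precision, recall, and F-score.
import re

dictionary = {"aspirin", "acetic acid", "paracetamol", "lead"}   # toy dictionary
text = "Aspirin was dissolved in acetic acid; the lead author added paracetamol."
gold = {(0, 7), (25, 36), (60, 71)}   # character spans of true chemical mentions

matches = set()
for term in dictionary:
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
        matches.add((m.start(), m.end()))

tp = len(matches & gold)
precision = tp / len(matches) if matches else 0.0
recall = tp / len(gold) if gold else 0.0
f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
# Note the false positive from the ambiguous token "lead" - exactly the kind of
# case the paper's disambiguation rules are designed to filter out.
```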
