Subject(s)
X-Ray Crystallography, Chemical Compound Databases, Fraud, Scientific Misconduct, X-Ray Crystallography/standards, Chemical Compound Databases/standards, Chemical Compound Databases/statistics & numerical data, Fraud/statistics & numerical data, Scientific Misconduct/statistics & numerical data

ABSTRACT
The Worldwide Protein Data Bank (wwPDB) has provided validation reports based on recommendations from community Validation Task Forces for structures in the PDB since 2013. To further enhance validation of small molecules, as recommended by the 2016 Ligand Validation Workshop, wwPDB, Global Phasing Ltd., and the Noguchi Institute recently formed a public/private partnership to incorporate some of their software tools into the wwPDB validation package. Augmented wwPDB validation report features include: two-dimensional (2D) diagrams of small-molecule ligands and carbohydrates, highlighting geometric validation outcomes; 2D topological diagrams of oligosaccharides present in branched entities, generated using the 2D Symbol Nomenclature for Glycans representation; and views of 3D electron density maps for ligands and carbohydrates, illustrating the goodness-of-fit between the atomic structure and experimental data (X-ray crystallographic structures only). These improvements will increase confidence in ligand conformations and ligand-macromolecule interactions, which will aid in understanding biochemical function and contribute to small-molecule drug discovery.
Subject(s)
Carbohydrates/chemistry, Protein Databases/standards, Molecular Docking Simulation/methods, Proteomics/methods, Small Molecule Libraries/chemistry, Cheminformatics/methods, Chemical Compound Databases/standards, Humans, Ligands, Protein Binding, Proteome/chemistry, Proteome/metabolism

ABSTRACT
The European Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) Regulation requires marketed chemicals to be evaluated for Ready Biodegradability (RB), with in silico prediction accepted as a valid alternative to experimental testing. However, currently available models may not be suitable for predicting compounds of industrial interest, owing to accuracy and applicability-domain restrictions. In this work, we present a new, extended RB dataset (2830 compounds) assembled by merging several public data sources. It was used to train classification models, which were externally validated and benchmarked against existing tools on a set of 316 compounds from an industrial context. The new models showed good predictive power (balanced accuracy, BA = 0.74-0.79) and data coverage (83-91%). The Generative Topographic Mapping approach identified several chemotypes and structural motifs unique to the industrial dataset, highlighting the chemical classes for which currently available models may give less reliable predictions. Finally, public and industrial data were merged into a global dataset containing 3146 compounds, the largest reported in the literature so far, covering chemotypes absent from the public data. Thus, the predictive model developed on the global dataset has a larger applicability domain than the existing ones.
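
As a minimal sketch of how such a classifier could be trained and scored with the balanced-accuracy metric quoted above, the snippet below assumes a hypothetical curated file rb_dataset.csv with precomputed numeric descriptors and an RB label; the original work's exact descriptors, algorithms and curation rules are not reproduced here.

```python
# Hypothetical sketch: training and evaluating a ready-biodegradability (RB) classifier
# with balanced accuracy, the metric quoted in the abstract. File name, column names and
# the random-forest model are assumptions, not the published pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

data = pd.read_csv("rb_dataset.csv")            # assumed columns: descriptor_* and "ready_biodegradable"
X = data.filter(like="descriptor_")             # numeric molecular descriptors
y = data["ready_biodegradable"]                 # 1 = readily biodegradable, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Balanced accuracy = (sensitivity + specificity) / 2, robust to class imbalance.
ba = balanced_accuracy_score(y_test, model.predict(X_test))
print(f"Balanced accuracy: {ba:.2f}")
```
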
Subject(s)
Chemical Compound Databases, Environmental Pollutants/chemistry, Chemical Models, Algorithms, Benchmarking, Environmental Biodegradation, Computer Simulation, Chemical Compound Databases/standards, Quantitative Structure-Activity Relationship, Reproducibility of Results

ABSTRACT
The discovery of antiviral drugs is a rapidly developing area of medicinal chemistry research, driven by the emergence of resistant variants and outbreaks of poorly studied viral diseases. The amount of antiviral activity data available in ChEMBL grows steadily, but the virus taxonomy annotation of these data is not sufficient for thorough studies of antiviral chemical space. We developed a procedure for semi-automatic extraction of antiviral activity data from ChEMBL and mapped them to the virus taxonomy maintained by the International Committee on Taxonomy of Viruses (ICTV). The procedure is based on lists of virus-related values of ChEMBL annotation fields and a dictionary of virus names and acronyms mapped to ICTV taxa. Applying this data extraction procedure retrieves from ChEMBL 1.6 times more assays, linked to 2.5 times more compounds and data points, than the ChEMBL web interface allows. Mapping these data to ICTV taxa makes it possible to analyse all the compounds tested against each viral species. Activity values and structures of the compounds were standardized, and an antiviral activity profile was created for each standard structure. The data set compiled using this procedure was named ViralChEMBL. As case studies, we compared descriptor and scaffold distributions for the full ChEMBL and its 'viral' and 'non-viral' subsets, identified the most studied compounds and created a self-organizing map for ViralChEMBL. Our approach to data annotation proved to be a very efficient tool for the study of antiviral chemical space.
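
A minimal sketch of the dictionary-based mapping idea described above is shown below; the miniature dictionary and assay descriptions are made up for illustration, whereas the published procedure relies on curated lists of ChEMBL annotation values and a much larger dictionary of virus names and acronyms mapped to ICTV taxa.

```python
# Illustrative sketch of dictionary-based mapping of assay descriptions to virus taxa.
import re
from typing import Optional

VIRUS_DICT = {                      # made-up miniature dictionary: name/acronym -> ICTV taxon
    "hiv-1": "Human immunodeficiency virus 1",
    "hepatitis c virus": "Hepacivirus C",
    "influenza a": "Influenza A virus",
}

def map_assay_to_taxon(description: str) -> Optional[str]:
    """Return the ICTV taxon matched in an assay description, if any."""
    text = description.lower()
    for name, taxon in VIRUS_DICT.items():
        # Word-boundary matching avoids spurious hits from short names inside other words.
        if re.search(rf"\b{re.escape(name)}\b", text):
            return taxon
    return None

assays = [
    "Inhibition of HIV-1 reverse transcriptase",
    "Antiviral activity against Influenza A virus (H3N2)",
    "Cytotoxicity in HepG2 cells",
]
for description in assays:
    print(description, "->", map_assay_to_taxon(description))
```
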
Subject(s)
Antiviral Agents/chemistry, Antiviral Agents/classification, Data Curation, Chemical Compound Databases, Chemical Compound Databases/standards, Decision Making, Reference Standards

ABSTRACT
Identification of discrepant data in aggregated databases is a key step in data curation and remediation. We have applied the ALATIS approach, which is based on the IUPAC International Chemical Identifier (InChI) model, to the full PubChem Compound database to generate unique and reproducible compound and atom identifiers for all entries with available three-dimensional structures. This exercise also served to identify entries with discrepancies between structures and chemical formulas or InChI strings. The use of unique compound identifiers and atom nomenclature should support more rigorous links between small-molecule databases, including those containing atom-specific information of the type available from crystallography and spectroscopy. The comprehensive results from this analysis are publicly available through our webserver (http://alatis.nmrfam.wisc.edu/).
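
As a hedged illustration of this kind of structure-versus-identifier consistency check (not the ALATIS implementation itself, which additionally assigns unique atom labels), one could recompute the formula and InChI from each deposited structure with RDKit and flag mismatches; the record fields and example entries below are assumptions.

```python
# Sketch: recompute molecular formula and InChI from each structure and flag entries
# whose recorded values disagree. Record layout and example data are assumed.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

records = [
    # consistent entry
    {"cid": "962",   "smiles": "O",   "formula": "H2O",   "inchi": "InChI=1S/H2O/h1H2"},
    # deliberately inconsistent entry: the recorded formula does not match the structure
    {"cid": "99999", "smiles": "CCO", "formula": "C2H4O", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
]

for rec in records:
    mol = Chem.MolFromSmiles(rec["smiles"])
    if mol is None:
        print(rec["cid"], "unparsable structure")
        continue
    calc_formula = rdMolDescriptors.CalcMolFormula(mol)
    calc_inchi = Chem.MolToInchi(mol)
    if calc_formula != rec["formula"] or calc_inchi != rec["inchi"]:
        print(rec["cid"], "discrepancy: recorded formula", rec["formula"],
              "vs computed", calc_formula)
```
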
Subject(s)
Data Accuracy, Chemical Compound Databases, Chemical Compound Databases/standards

ABSTRACT
A set of 1880 known drugs was collected and analysed for six mainstream molecular descriptors: MW, log P, HA, HD, RB and PSA. The statistical distribution of each descriptor was fitted to a Gaussian function, giving a mathematical tool to calculate a weighted score, or Index, for each descriptor. Known Drug Indexes (KDIs) were derived either by summation or by multiplication of the Indexes, giving one number for each molecule. The KDI summation and multiplication methods have theoretical maxima of 6 and 1, respectively. By both methods, methysergide (5.89/0.90), amsacrine (5.89/0.89) and fluorometholone (5.88/0.88) score as the most well-balanced pharmaceuticals. The KDIs are useful tools for identifying the most well-balanced screening compounds based on the properties of known drugs; a screening collection can thus be optimised to include only quality compounds, which in turn produce tractable hit and lead compounds from the screening campaign.
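
As a hedged illustration of how such an index could be computed, the sketch below scores the six descriptors named in the abstract with Gaussian weights and combines them by summation and by multiplication; the Gaussian means and widths are placeholder assumptions, not the values fitted to the 1880-drug set in the study.

```python
# Sketch of the Known Drug Index idea: each descriptor is scored with a Gaussian weight
# (1.0 at the fitted mean, falling off with distance), then the six scores are combined
# by summation (theoretical maximum 6) or multiplication (theoretical maximum 1).
import math

GAUSSIAN_PARAMS = {            # descriptor: (mean, sigma); illustrative placeholder values
    "MW": (350.0, 150.0),
    "logP": (2.5, 2.0),
    "HA": (5.0, 3.0),
    "HD": (2.0, 1.5),
    "RB": (5.0, 3.0),
    "PSA": (80.0, 40.0),
}

def descriptor_score(value, mean, sigma):
    """Gaussian weight, normalised so the score is 1.0 at the mean."""
    return math.exp(-((value - mean) ** 2) / (2.0 * sigma ** 2))

def kdi(descriptors):
    scores = [descriptor_score(descriptors[d], *GAUSSIAN_PARAMS[d]) for d in GAUSSIAN_PARAMS]
    return sum(scores), math.prod(scores)      # (KDI by summation, KDI by multiplication)

example = {"MW": 353.4, "logP": 2.6, "HA": 4, "HD": 1, "RB": 4, "PSA": 70.0}
print("KDI (sum, product): %.2f, %.2f" % kdi(example))
```
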
Subject(s)
Chemical Compound Databases/standards, Drug Discovery/methods, High-Throughput Screening Assays/methods, Small Molecule Libraries/standards, Algorithms, Humans, Quantitative Structure-Activity Relationship, Small Molecule Libraries/chemistry, Small Molecule Libraries/pharmacology

ABSTRACT
A key consideration at the screening stages of drug discovery is in vitro metabolic stability, often measured in human liver microsomes. Computational prediction models can be built using the large quantity of experimental data available from public databases, but these databases typically contain data measured with various protocols in different laboratories, raising the issue of data quality. In this study, we retrieved intrinsic clearance (CLint) measurements from an open database and performed extensive manual curation. Chemical descriptors were then calculated using freely available software, and prediction models were built using machine learning algorithms. The models trained on the curated data performed better than those trained on the non-curated data and achieved performance comparable to previously published models, demonstrating the importance of manual curation in data preparation. The curated data were made available to make our models fully reproducible.
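
A minimal sketch of the described workflow, under assumed file and column names, is shown below: freely available RDKit descriptors are computed from SMILES and a machine-learning regressor is cross-validated on curated CLint values. The study's actual descriptor set, algorithms and curation rules are not reproduced here.

```python
# Sketch: compute freely available RDKit descriptors and cross-validate a model on
# curated intrinsic clearance (CLint) data. File and column names are assumptions.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data = pd.read_csv("clint_curated.csv")        # assumed columns: "smiles", "log_clint"

def featurise(smiles):
    """Small set of freely available RDKit descriptors; None for unparsable SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

features = data["smiles"].map(featurise)
mask = features.notna()                        # drop records whose SMILES could not be parsed
X = list(features[mask])
y = data.loc[mask, "log_clint"]

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("5-fold cross-validated R2: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```
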
Subject(s)
Chemical Compound Databases/standards, Drug Discovery/methods, Hepatobiliary Elimination, Machine Learning, Drug Discovery/standards, Humans, Metabolic Clearance Rate, Liver Microsomes/metabolism

ABSTRACT
Glycoinformatics is an actively developing scientific discipline that provides scientists with access to data on natural glycans and with various tools for processing them. However, the informatization of glycomics still has a long way to go before it catches up with genomics and proteomics. In this Viewpoint, we review the current situation in glycoinformatics and discuss its achievements and shortcomings, emphasizing the major drawbacks: the lack of recognized standards, protocols, data indices and tools, and the informational isolation of existing projects. We reiterate possible solutions to these persistent issues and describe our vision of an ideal glycoinformatics project.
Subject(s)
Carbohydrates/analysis, Chemical Compound Databases, Glycomics, Animals, Computational Biology/methods, Computational Biology/standards, Chemical Compound Databases/standards, Glycomics/methods, Glycomics/standards, Humans, Software

ABSTRACT
NMR is a widely used analytical technique with a growing number of data repositories available. As a result, demands have emerged for a vendor-agnostic, open data format for long-term archiving of NMR data, with the aim of easing and encouraging the sharing, comparison, and reuse of NMR data. Here we present nmrML, an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters, and, where available, spectral metadata such as chemical structures associated with spectral assignments. The nmrML format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e., metabolomics experiments. To facilitate format conversions, we provide nmrML converters for the Bruker, JEOL and Agilent/Varian vendor formats. In addition, easy-to-use web-based spectral viewing, processing, and spectral assignment tools that read and write nmrML have been developed. Software libraries and web services for data validation are available for tool developers and end users. The nmrML format has already been adopted for capturing and disseminating NMR data for small molecules by several open-source data processing tools and metabolomics reference spectral libraries, e.g., serving as the storage format for the MetaboLights data repository. The nmrML open access data standard has been endorsed by the Metabolomics Standards Initiative (MSI), and we encourage user participation and feedback to increase usability and make it a successful standard.
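
Because nmrML is XML-based, a generic XML parser is enough to inspect a file without format-specific tooling. The sketch below assumes only a well-formed document named example.nmrML; it does not assume any particular schema element names.

```python
# Sketch: walk an nmrML document with the standard library and report its element tags.
import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("example.nmrML")               # assumed file name; any well-formed XML works
root = tree.getroot()

tag_counts = Counter(elem.tag for elem in root.iter())
print("Root element:", root.tag)
for tag, count in tag_counts.most_common(10):  # the ten most frequent element types
    print(f"{count:5d}  {tag}")
```
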
Subject(s)
Chemical Compound Databases/standards, Magnetic Resonance Spectroscopy/statistics & numerical data, Metabolomics/methods, Software

ABSTRACT
WikiPathways (wikipathways.org) captures the collective knowledge represented in biological pathways. By providing this knowledge as a curated, machine-readable database, it enables omics data analysis and visualization. WikiPathways and other pathway databases are used by research groups in many fields to analyze experimental data. Owing to the open and collaborative nature of the WikiPathways platform, the content keeps growing and becoming more accurate, making WikiPathways a reliable and rich pathway database. Previously, however, the focus was primarily on genes and proteins, leaving many metabolites with only limited annotation. Recent curation efforts focused on improving the annotation of metabolism and metabolic pathways by associating unmapped metabolites with database identifiers and by providing more detailed interaction knowledge. Here, we report the outcomes of these continued growth and curation efforts, such as a doubling of the number of annotated metabolite nodes in WikiPathways. Furthermore, we introduce OpenAPI documentation of our web services and FAIR (Findable, Accessible, Interoperable and Reusable) annotation of resources to increase the interoperability of the knowledge encoded in these pathways and of experimental omics data. New search options, monthly downloads, more links to metabolite databases, and new portals make pathway knowledge more easily accessible to individual researchers and research communities.
Subject(s)
Chemical Compound Databases, Metabolomics, Animals, Data Curation, Data Mining, Chemical Compound Databases/standards, Genetic Databases, Humans, Metabolic Networks and Pathways, Quality Control, Search Engine, Software

ABSTRACT
The Library of Integrated Network-Based Cellular Signatures (LINCS) is an NIH Common Fund program that catalogs how human cells globally respond to chemical, genetic, and disease perturbations. Resources generated by LINCS include experimental and computational methods, visualization tools, molecular and imaging data, and signatures. By assembling an integrated picture of the range of responses of human cells exposed to many perturbations, the LINCS program aims to better understand human disease and to advance the development of new therapies. Perturbations under study include drugs, genetic perturbations, tissue micro-environments, antibodies, and disease-causing mutations. Responses to perturbations are measured by transcript profiling, mass spectrometry, cell imaging, and biochemical methods, among other assays. The LINCS program focuses on cellular physiology shared among tissues and cell types relevant to an array of diseases, including cancer, heart disease, and neurodegenerative disorders. This Perspective describes LINCS technologies, datasets, tools, and approaches to data accessibility and reusability.
Asunto(s)
Catalogación/métodos , Biología de Sistemas/métodos , Biología Computacional/métodos , Bases de Datos de Compuestos Químicos/normas , Perfilación de la Expresión Génica/métodos , Biblioteca de Genes , Humanos , Almacenamiento y Recuperación de la Información/métodos , Programas Nacionales de Salud , National Institutes of Health (U.S.)/normas , Transcriptoma , Estados UnidosRESUMEN
The threshold of toxicological concern (TTC) approach is a resource-effective de minimis method for the safety assessment of chemicals, based on distributional analysis of the results of a large number of toxicological studies. It is increasingly used to screen and prioritize substances with low exposure for which there is little or no toxicological information. The first step in the approach is the identification of substances that may be DNA-reactive mutagens, to which the lowest TTC value is applied. This TTC value was based on an analysis of the cancer potency database and involved a number of assumptions that no longer reflect the state of the science, some of which were not as transparent as they could have been. Hence, review and updating of the database are proposed, using inclusion and exclusion criteria that reflect current knowledge. A strategy for the selection of appropriate substances for TTC determination, based on consideration of the weight of evidence for genotoxicity and carcinogenicity, is outlined. Identification of substances that are carcinogenic through a DNA-reactive mutagenic mode of action, and of those that clearly act by a non-genotoxic mode of action, will enable determination of the protectiveness of both the TTC for DNA-reactive mutagenicity and the TTC applied by default to substances that may be carcinogenic but are unlikely to be DNA-reactive mutagens (i.e., Cramer class I-III compounds). Critical to the application of the TTC approach to substances that are likely to be DNA-reactive mutagens is the reliability of the software tools used to identify such compounds. Current methods for this task are reviewed and recommendations made for their application.
Subject(s)
Carcinogens/chemistry, Chemical Compound Databases/standards, Mutagens/chemistry, Software/standards, Humans, Risk Assessment

ABSTRACT
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. A common concern is the quality of both the chemical structure information and the associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals, using the publicly available PHYSPROP physicochemical property and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers: chemical name, CASRN, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues such as hypervalency and incorrect stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process: the performance of QSAR models built on only the highest-quality subset of the original dataset was compared with that of models built on the larger curated and corrected dataset, and the latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets and is being made publicly available for further use and integration by the scientific community.
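
As a hedged sketch of one curation check of the kind described, the snippet below verifies that the SMILES and MolBlock supplied for a record encode the same structure, using the InChI as a normalising comparator; the record layout is assumed, and the published KNIME workflow performs many additional checks (names, CASRNs, valence, stereochemistry).

```python
# Sketch: do the SMILES and MolBlock of a record encode the same structure?
from rdkit import Chem

def consistent(smiles: str, molblock: str) -> bool:
    mol_from_smiles = Chem.MolFromSmiles(smiles)
    mol_from_block = Chem.MolFromMolBlock(molblock)
    if mol_from_smiles is None or mol_from_block is None:
        return False                              # unparsable input counts as a problem
    return Chem.MolToInchi(mol_from_smiles) == Chem.MolToInchi(mol_from_block)

# Tiny self-contained demonstration: generate a MolBlock for benzene and compare it
# against a matching and a non-matching SMILES.
benzene_block = Chem.MolToMolBlock(Chem.MolFromSmiles("c1ccccc1"))
print(consistent("c1ccccc1", benzene_block))      # True
print(consistent("c1ccncc1", benzene_block))      # False (pyridine vs benzene)
```
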
Subject(s)
Data Curation/methods, Chemical Compound Databases/standards, Datasets as Topic/standards, Quantitative Structure-Activity Relationship, Machine Learning, Molecular Structure

ABSTRACT
The emergence of a number of publicly available bioactivity databases, such as ChEMBL, PubChem BioAssay and BindingDB, has raised awareness of the topics of data curation, quality and integrity. Here we provide an overview and discussion of current and future approaches to activity, assay and target data curation in the ChEMBL database. This curation process involves several manual and automated steps and aims to: (1) maximise data accessibility and comparability; (2) improve data integrity and flag outliers, ambiguities and potential errors; and (3) add further curated annotations and mappings, thus increasing the usefulness and accuracy of the ChEMBL data for all users, and for modellers in particular. Issues related to activity, assay and target data curation and integrity, along with their potential impact on users of the data, are discussed, alongside robust selection and filtering strategies for avoiding or minimising these issues, depending on the desired application.
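
As one illustration of the kind of user-side filtering such curation advice suggests, the sketch below groups replicate pChEMBL measurements by compound-target pair and flags pairs whose replicates disagree by more than one log unit; the column names follow common ChEMBL exports but are assumptions here.

```python
# Sketch: flag compound-target pairs whose replicate activity values are discordant.
import pandas as pd

activities = pd.DataFrame({                      # illustrative records, not real ChEMBL data
    "molecule_chembl_id": ["CHEMBL25", "CHEMBL25", "CHEMBL25", "CHEMBL521"],
    "target_chembl_id":   ["CHEMBL204", "CHEMBL204", "CHEMBL204", "CHEMBL204"],
    "pchembl_value":      [6.1, 6.3, 8.9, 5.2],
})

grouped = activities.groupby(["molecule_chembl_id", "target_chembl_id"])["pchembl_value"]
summary = grouped.agg(["count", "median", "min", "max"])
summary["flag_discordant"] = (summary["max"] - summary["min"]) > 1.0   # > 1 log unit spread
print(summary)
```
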
Subject(s)
Bioassay, Data Accuracy, Chemical Compound Databases, Data Curation/standards, Chemical Compound Databases/standards, Factual Databases, Inhibitory Concentration 50

ABSTRACT
The French National Compound Library (Chimiothèque Nationale) was created in 2003 as a federation of local collections. It contains more than 56 000 small molecules and natural compounds synthesised or isolated in different laboratories over the years, which explains the diversity of the collection. The strength of this initiative is its ability to connect chemists and biologists for the development of hits, a process that involves the synthesis of analogues and/or chemical tools to find new targets. These collaborations lead to the identification of new chemical probes. Such probes, able to modulate a biological function, are essential for studying biological pathways and can also be useful for therapeutic applications. This article describes the major achievements and perspectives of the French Chemical Library.
Subject(s)
Small Molecule Libraries, Chemical Compound Databases/standards, Chemical Compound Databases/supply & distribution, Chemical Compound Databases/trends, Preclinical Drug Evaluation, Drug Information Services/standards, Drug Information Services/supply & distribution, Drug Information Services/trends, France, Humans, Information Dissemination, Molecular Conformation, Small Molecule Libraries/supply & distribution

ABSTRACT
In this paper we take a historical view of e-Science and e-Research developments within the chemical sciences at the University of Southampton, showing the development of several stages of the evolving data ecosystem as chemistry moves into the digital age of the 21st century. We cover our research on the representation of chemical information in the context of the World Wide Web (WWW) and its semantic enhancement (the Semantic Web), illustrated by the representation of quantities and units within the Semantic Web. We explore the changing nature of laboratories as computing power becomes increasingly powerful and pervasive, looking specifically at the function and role of electronic or digital notebooks. Having focused on the creation of chemical data and information in context, we finish the paper by following the use and reuse of these data as facilitated by digital repositories, and their importance for the exchange of chemical information, touching on the issues of open and/or intelligent access to the data.
Subject(s)
Computer Simulation/trends, Chemical Compound Databases/standards, Chemical Compound Databases/trends, Internet

ABSTRACT
Molecular information systems play an important part in modern data-driven drug discovery. They not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound-target-pathway-disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members are analyzed in terms of the underlying data concepts and associations needed to answer them. Publicly available data sources used to address these questions, as well as the need for and potential of semantic web-based technology, are presented.
Subject(s)
Chemical Compound Databases, Pharmaceutical Databases, Drug Discovery/methods, Information Systems, Semantics, Systems Integration, Data Mining, Chemical Compound Databases/standards, Pharmaceutical Databases/standards, Drug Discovery/standards, Guidelines as Topic, Information Systems/standards, Knowledge Bases, Molecular Structure, Structure-Activity Relationship

ABSTRACT
With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns, representing major families of drug target proteins, for benchmarking LB-CADD methods. Each data set is in the public domain through PubChem and has been carefully collated using confirmation screens that validate active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework, BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure-activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed, including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme that combines orthogonal machine learning algorithms into a single predictor is tested. Enrichments ranging from 15 to 101 at a true positive rate (TPR) cutoff of 25% are observed.
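
A hedged sketch of two evaluation elements mentioned above follows: a consensus score obtained by averaging the outputs of several models, and an enrichment value at a fixed true-positive-rate cutoff. The enrichment definition used here (precision at the cutoff divided by the prevalence of actives) is one common convention and may differ in detail from the BCL::ChemInfo implementation; the labels and scores are synthetic.

```python
# Sketch: consensus scoring and enrichment at a fixed true-positive-rate (TPR) cutoff.
import numpy as np

def consensus(scores_per_model):
    """Average per-model scores (rows: models, columns: compounds)."""
    return np.mean(np.asarray(scores_per_model), axis=0)

def enrichment_at_tpr(y_true, y_score, tpr_cutoff=0.25):
    """Precision at the score threshold recovering tpr_cutoff of actives, over prevalence."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float))   # best-scoring compounds first
    hits = np.cumsum(y_true[order])                          # actives recovered so far
    n_needed = int(np.ceil(tpr_cutoff * y_true.sum()))       # actives needed for this TPR
    k = int(np.searchsorted(hits, n_needed)) + 1              # compounds screened to get there
    precision = hits[k - 1] / k
    return precision / y_true.mean()

rng = np.random.default_rng(0)
labels = rng.random(1000) < 0.05                              # synthetic library, ~5% actives
model_scores = [labels + rng.normal(0, 0.8, 1000) for _ in range(3)]   # three noisy models
cons = consensus(model_scores)
print("Enrichment at 25%% TPR: %.1f" % enrichment_at_tpr(labels, cons))
```
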