ABSTRACT
Using an open-access distribution model, the Crystallography Open Database (COD, http://www.crystallography.net) collects all known 'small molecule / small to medium sized unit cell' crystal structures and makes them freely available on the Internet. As of today, the COD has aggregated ~150,000 structures, offering basic search capabilities and the possibility to download the whole database, or parts thereof, using a variety of standard open communication protocols. A newly developed website allows all registered users to deposit published and as yet unpublished structures as personal communications or pre-publication depositions. Such a setup enables many users to extend the COD database simultaneously. This increases the possibilities for growth of the COD database, and is the first step towards establishing a worldwide Internet-based collaborative platform dedicated to the collection and curation of structural knowledge.
Subject(s)
Crystallography, Factual Databases, Cooperative Behavior, Molecular Models, User-Computer Interface
ABSTRACT
The Open Databases Integration for Materials Design (OPTIMADE) application programming interface (API) empowers users with holistic access to a growing federation of databases, enhancing the accessibility and discoverability of materials and chemical data. Since the first release of the OPTIMADE specification (v1.0), the API has undergone significant development, leading to the v1.2 release, and has underpinned multiple scientific studies. In this work, we highlight the latest features of the API format and the accompanying software tools, and provide an update on the implementation of OPTIMADE in contributing materials databases. We end by providing several use cases that demonstrate the utility of the OPTIMADE API in materials research and that continue to drive its ongoing development.
ABSTRACT
Published reports of chemical compounds often contain multiple machine-readable descriptions which may supplement each other in order to yield coherent and complete chemical representations. This publication presents a method to cross-check such descriptions using a canonical representation and isomorphism of molecular graphs. If immediate agreement between compound descriptions is not found, the algorithm derives the minimal set of simplifications required for both descriptions to arrive at a matching form (if any). The proposed algorithm is used to cross-check chemical descriptions from the Crystallography Open Database to identify coherently described entries as well as those requiring further curation.
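The idea of comparing two descriptions through a canonical form can be illustrated with a minimal sketch. This is not the published algorithm: it is a brute-force canonicalisation that is only practical for very small graphs, and the function names (`canonical_form`, `same_molecule`) and the heavy-atom-only example molecules are assumptions made for illustration.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """Return a canonical (labels, edges) tuple for a small labelled graph.

    Brute force: try every atom ordering and keep the lexicographically
    smallest result. Isomorphic graphs yield the same tuple.
    """
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[new_index] = old_index
        new_of_old = {old: new for new, old in enumerate(perm)}
        labels = tuple(atoms[old] for old in perm)
        edges = tuple(sorted(
            tuple(sorted((new_of_old[a], new_of_old[b]))) for a, b in bonds
        ))
        candidate = (labels, edges)
        if best is None or candidate < best:
            best = candidate
    return best

def same_molecule(desc_a, desc_b):
    """Two descriptions agree if their canonical forms are identical."""
    return canonical_form(*desc_a) == canonical_form(*desc_b)

# Heavy-atom skeletons only (hydrogens omitted for brevity).
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])       # C-C-O
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])       # O-C-C, other atom order
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])  # C-O-C

print(same_molecule(ethanol_a, ethanol_b))
print(same_molecule(ethanol_a, dimethyl_ether))
```

Production tools would replace the factorial-time search with a dedicated canonical-labelling or graph-isomorphism routine; the sketch only shows why a shared canonical form makes the comparison independent of atom ordering.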
ABSTRACT
Knowledge about the 3-dimensional structure, orientation and interaction of chemical compounds is important in many areas of science and technology. X-ray crystallography is one of the experimental techniques capable of providing a large amount of structural information for a given compound, and it is widely used for characterisation of organic and metal-organic molecules. The method provides precise 3D coordinates of atoms inside crystals; however, it does not directly deliver information about certain chemical characteristics such as bond orders, delocalization, charges, lone electron pairs or lone electrons. These aspects of a molecular model have to be derived from crystallographic data using refined information about interatomic distances and atom types as well as employing general chemical knowledge. This publication describes a curated automatic pipeline for the derivation of chemical attributes of molecules from crystallographic models. The method is applied to build a catalogue of chemical entities in an open-access crystallographic database, the Crystallography Open Database (COD). The catalogue of such chemical entities is provided openly as a derived database. The content of this catalogue and the problems arising in the fully automated pipeline are discussed, along with the possibilities to introduce manual data curation into the process.
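As an illustration of how connectivity might be inferred from interatomic distances and atom types, the following sketch marks two atoms as bonded when their distance is below the sum of their covalent radii plus a tolerance. It is a generic heuristic, not the pipeline described in the abstract; the tolerance value, the function name `find_bonds` and the small radius table are assumptions, and the coordinates are assumed to be Cartesian (already converted from fractional coordinates).

```python
import math

# Covalent radii in Å for a few elements (commonly tabulated values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.45  # Å; assumed slack added to the sum of radii

def find_bonds(atoms):
    """Return index pairs (i, j), i < j, of atoms judged to be bonded.

    `atoms` is a list of (element, (x, y, z)) with Cartesian coordinates in Å.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (el_i, pos_i), (el_j, pos_j) = atoms[i], atoms[j]
            limit = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds

# Approximate geometry of an isolated water molecule.
water = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.9572, 0.0, 0.0)),
    ("H", (-0.2400, 0.9266, 0.0)),
]
print(find_bonds(water))
```

Distance criteria of this kind yield only connectivity; bond orders, charges and delocalization need additional chemical rules, which is the harder part the abstract refers to.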
ABSTRACT
We introduce a protein-ligand binding database (PLBD) that presents thermodynamic and kinetic data of reversible protein interactions with small molecule compounds. The manually curated binding data are linked to protein-ligand crystal structures, enabling structure-thermodynamics correlations to be determined. The database contains over 5500 binding datasets of 556 sulfonamide compound interactions with the 12 catalytically active human carbonic anhydrase isozymes defined by fluorescent thermal shift assay, isothermal titration calorimetry, inhibition of enzymatic activity and surface plasmon resonance. In the PLBD, the intrinsic thermodynamic parameters of interactions are provided, which account for the binding-linked protonation reactions. In addition to the protein-ligand binding affinities, the database provides calorimetrically measured binding enthalpies, providing additional mechanistic understanding. The PLBD can be applied to investigations of protein-ligand recognition and could be integrated into small molecule drug design. Database URL https://plbd.org/.
Subject(s)
Carbonic Anhydrase Inhibitors, Carbonic Anhydrases, Humans, Ligands, Carbonic Anhydrase Inhibitors/pharmacology, Carbonic Anhydrase Inhibitors/chemistry, Carbonic Anhydrase Inhibitors/metabolism, Thermodynamics, Carbonic Anhydrases/chemistry, Carbonic Anhydrases/metabolism, Protein Binding
ABSTRACT
Data curation practices of the Crystallography Open Database (COD) are described with additional focus being placed on the formal validation using the Crystallographic Information Framework (CIF). The cif_validate program, capable of validating CIF files against both the DDL1 and the DDLm dictionaries, is presented and used to process the entirety of the COD. Validation results collected from over 450 000 CIF files are demonstrated to be a useful resource in the data maintenance process as well as the development of the underlying ontologies. A set of programs intended to aid in the dictionary migration from DDL1 to DDLm is also presented.
ABSTRACT
The Open Databases Integration for Materials Design (OPTIMADE) consortium has designed a universal application programming interface (API) to make materials databases accessible and interoperable. We outline the first stable release of the specification, v1.0, which is already supported by many leading databases and several software packages. We illustrate the advantages of the OPTIMADE API through worked examples on each of the public materials databases that support the full API specification.
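To give a flavour of what a query against such an API looks like, the sketch below builds a request URL for the `structures` endpoint using the OPTIMADE filter syntax. It only constructs the URL and performs no network request. The base URL is an assumption (any provider's base URL can be substituted), and the helper name `structures_query_url` is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed provider base URL; replace with the base URL of any OPTIMADE provider.
BASE_URL = "https://www.crystallography.net/cod/optimade"

def structures_query_url(base_url, elements, page_limit=5):
    """Build a /v1/structures URL for entries containing all given elements."""
    quoted = ", ".join(f'"{el}"' for el in elements)
    params = {
        "filter": f"elements HAS ALL {quoted}",
        "response_fields": "chemical_formula_reduced,nelements",
        "page_limit": page_limit,
    }
    return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"

url = structures_query_url(BASE_URL, ["Si", "O"])
print(url)

# Decode the query string again to show the filter that the server would see.
query = parse_qs(urlsplit(url).query)
print(query["filter"][0])
```

Because every compliant provider accepts the same filter grammar and query parameters, the same function can be pointed at a different base URL without other changes.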
ABSTRACT
The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will intensify these challenges, such that automated and scalable solutions become crucial. In recent years, we have been developing AiiDA (aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands of processes per hour, while automatically preserving and storing the full data provenance in a relational database, making it queryable and traversable and thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error handling features and a flexible plugin model to allow interfacing with external simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.
ABSTRACT
Detailed crystallographic information provided by X-ray diffraction (XRD) is complementary to molecular information provided by Raman spectroscopy. Accordingly, the combined use of these techniques allows the identification of an unknown compound without ambiguity. However, a full combination of Raman and XRD results requires an appropriate and reliable reference database with complete information. This is already available for XRD. The main objective of this paper is to introduce and describe the recently developed Raman Open Database (ROD, http://solsa.crystallography.net/rod). It comprises a collection of high-quality uncorrected Raman spectra. The novelty of this database is its interconnectedness with other open databases such as the Crystallography Open Database (http://www.crystallography.net/cod) and the Theoretical Crystallography Open Database (http://www.crystallography.net/tcod/). The syntax adopted to format entries in the ROD is based on the CIF format, which is recognized and used worldwide and offers a simple way to exchange, write and describe data. ROD also uses JCAMP-DX files as an alternative format for submitted spectra. JCAMP-DX files are compatible to varying degrees with most commercial Raman software and can be read and edited using standard text editors.
ABSTRACT
Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical properties from molecular structure. In this article, the ongoing work to describe the chemical connectivity of entries contained in the Crystallography Open Database (COD) in SMILES format is reported. This collection of SMILES is publicly available for chemical (substructure) search or for any other purpose on an open-access basis, as is the COD itself. The conventions that have been followed for the representation of compounds that do not fit into the valence bond theory are outlined for the most frequently found cases. The procedure for getting the SMILES out of the CIF files starts with checking whether the atoms in the asymmetric unit are a chemically acceptable image of the compound. When they are not (molecule on a symmetry element, disorder, polymeric species, etc.), the previously published cif_molecule program is used to obtain such an image in many cases. The program package Open Babel is then applied to get SMILES strings from the CIF files (either those taken directly from the COD or those produced by cif_molecule when applicable). The results are then checked and/or fixed by a human editor, in a computer-aided task that at present still consumes a great deal of human time. Even though the procedure still needs to be improved to make it more automatic (and hence faster), it has already yielded more than 160,000 curated chemical structures. The purpose of this article is to announce the existence of this work to the chemical community and to spread the use of its results.
ABSTRACT
Two-dimensional (2D) materials have emerged as promising candidates for next-generation electronic and optoelectronic applications. Yet, only a few dozen 2D materials have been successfully synthesized or exfoliated. Here, we search for 2D materials that can be easily exfoliated from their parent compounds. Starting from 108,423 unique, experimentally known 3D compounds, we identify a subset of 5,619 compounds that appear layered according to robust geometric and bonding criteria. High-throughput calculations using van der Waals density functional theory, validated against experimental structural data and calculated random phase approximation binding energies, further allowed the identification of 1,825 compounds that are either easily or potentially exfoliable. In particular, the subset of 1,036 easily exfoliable cases provides novel structural prototypes and simple ternary compounds as well as a large portfolio of materials to search from for optimal properties. For a subset of 258 compounds, we explore vibrational, electronic, magnetic and topological properties, identifying 56 ferromagnetic and antiferromagnetic systems, including half-metals and half-semiconductors.
ABSTRACT
In order to make results of computational scientific research findable, accessible, interoperable and re-usable, it is necessary to decorate them with standardised metadata. However, there are a number of technical and practical challenges that make this process difficult to achieve in practice. Here the implementation of a protocol is presented to tag crystal structures with their computed properties, without the need for human intervention to curate the data. This protocol leverages the capabilities of AiiDA, an open-source platform to manage and automate scientific computational workflows, and the TCOD, an open-access database storing computed materials properties using a well-defined and exhaustive ontology. Based on these, the complete procedure to deposit computed data in the TCOD database is automated. All relevant metadata are extracted from the full provenance information that AiiDA tracks and stores automatically while managing the calculations. Such a protocol also enables reproducibility of scientific data in the field of computational materials science. As a proof of concept, the AiiDA-TCOD interface is used to deposit 170 theoretical structures together with their computed properties and their full provenance graphs, consisting of over 4600 AiiDA nodes.
ABSTRACT
A freely available small-molecule structure database, the Crystallography Open Database (COD), is used for the extraction of molecular-geometry information on small-molecule compounds. The results are used for the generation of new ligand descriptions, which are subsequently used by macromolecular model-building and structure-refinement software. To increase the reliability of the derived data, and therefore the new ligand descriptions, the entries from this database were subjected to very strict validation. The selection criteria made sure that the crystal structures used to derive atom types, bond and angle classes are of sufficiently high quality. Any suspicious entries at a crystal or molecular level were removed from further consideration. The selection criteria included (i) the resolution of the data used for refinement (entries solved at 0.84 Å resolution or higher) and (ii) the structure-solution method (structures must be from a single-crystal experiment and all atoms of generated molecules must have full occupancies), as well as basic sanity checks such as (iii) consistency between the valences and the number of connections between atoms, (iv) acceptable bond-length deviations from the expected values and (v) detection of atomic collisions. The derived atom types and bond classes were then validated using high-order moment-based statistical techniques. The results of the statistical analyses were fed back to fine-tune the atom typing. The developed procedure was repeated four times, resulting in fine-grained atom typing, bond and angle classes. The procedure will be repeated in the future as and when new entries are deposited in the COD. The whole procedure can also be applied to any source of small-molecule structures, including the Cambridge Structural Database and the ZINC database.
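Criteria (i) and (ii) above lend themselves to a simple filter. The sketch below is only an illustration of that idea: the `Entry` record, its field names and the example identifiers are hypothetical, and the sanity checks (iii) to (v) are not covered.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical summary record for one crystal structure."""
    entry_id: str
    resolution: float      # Å; a smaller value means higher resolution
    single_crystal: bool   # True if from a single-crystal experiment
    min_occupancy: float   # lowest site occupancy among all atoms

MAX_RESOLUTION = 0.84  # Å, threshold quoted in the abstract

def passes_selection(entry):
    """Apply criteria (i) and (ii): resolution, experiment type, full occupancy."""
    return (
        entry.resolution <= MAX_RESOLUTION
        and entry.single_crystal
        and entry.min_occupancy >= 1.0
    )

entries = [
    Entry("example-a", 0.80, True, 1.0),   # passes
    Entry("example-b", 0.95, True, 1.0),   # resolution too low
    Entry("example-c", 0.78, True, 0.5),   # partially occupied site
    Entry("example-d", 0.70, False, 1.0),  # not a single-crystal experiment
]
selected = [e.entry_id for e in entries if passes_selection(e)]
print(selected)
```

Note that "0.84 Å resolution or higher" translates to a numerical value of at most 0.84, since better resolution corresponds to a smaller number.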
Subject(s)
X-Ray Crystallography, Molecular Conformation, Small Molecule Libraries/chemistry, Factual Databases, Ligands, Molecular Models, Software
ABSTRACT
The program AceDRG is designed for the derivation of stereochemical information about small molecules. It uses local chemical and topological environment-based atom typing to derive and organize bond lengths and angles from a small-molecule database: the Crystallography Open Database (COD). Information about the hybridization states of atoms, whether they belong to small rings (up to seven-membered rings), ring aromaticity and nearest-neighbour information is encoded in the atom types. All atoms from the COD have been classified according to the generated atom types. All bonds and angles have also been classified according to the atom types and, in a certain sense, bond types. Derived data are tabulated in a machine-readable form that is freely available from CCP4. AceDRG can also generate stereochemical information, provided that the basic bonding pattern of a ligand is known. The basic bonding pattern is perceived from one of the computational chemistry file formats, including SMILES, mmCIF, SDF MOL and SYBYL MOL2 files. Using the bonding chemistry, atom types, and bond and angle tables generated from the COD, AceDRG derives the 'ideal' bond lengths, angles, plane groups, aromatic rings and chirality information, and writes them to an mmCIF file that can be used by the refinement program REFMAC5 and the model-building program Coot. Other refinement and model-building programs such as PHENIX and BUSTER can also use these files. AceDRG also generates one or more coordinate sets corresponding to the most favourable conformation(s) of a given ligand. AceDRG employs RDKit for chemistry perception and for initial conformation generation, as well as for the interpretation of SMILES strings, SDF MOL and SYBYL MOL2 files.
Subject(s)
Small Molecule Libraries/chemistry, Software, Bacterial Proteins/chemistry, Bacterial Proteins/metabolism, Binding Sites, X-Ray Crystallography, Factual Databases, Deinococcus/chemistry, Deinococcus/metabolism, Intramolecular Transferases/chemistry, Intramolecular Transferases/metabolism, Ligands, Molecular Models, Molecular Conformation, Mycobacterium tuberculosis/chemistry, Mycobacterium tuberculosis/metabolism, Repressor Proteins/chemistry, Repressor Proteins/metabolism, Small Molecule Libraries/metabolism, Stereoisomerism
ABSTRACT
A syntax-correcting CIF parser, COD::CIF::Parser, is presented that can parse CIF 1.1 files and accurately report the position and the nature of the discovered syntactic problems. In addition, the parser is able to automatically fix the most common and the most obvious syntactic deficiencies of the input files. Bindings for Perl, C and Python programming environments are available. Based on COD::CIF::Parser, the cod-tools package for manipulating the CIFs in the Crystallography Open Database (COD) has been developed. The cod-tools package has been successfully used for continuous updates of the data in the automated COD data deposition pipeline, and to check the validity of COD data against the IUCr data validation guidelines. The performance, capabilities and applications of different parsers are compared.
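What "reporting the position and the nature" of a syntactic problem means can be shown with a toy scanner. The sketch below is not COD::CIF::Parser and does not use its bindings; it is a deliberately small, assumption-laden illustration that detects only one kind of problem (a quoted string left open at the end of a line) and reports its line and column. The function name `find_unterminated_quotes` is hypothetical.

```python
def find_unterminated_quotes(cif_text):
    """Return (line, column, message) for quoted strings left open on a line.

    Follows the CIF 1.1 convention that a quote closes a string only when it
    is followed by whitespace or the end of the line. Semicolon-delimited
    text fields are skipped without further checks.
    """
    problems = []
    in_text_field = False
    for lineno, line in enumerate(cif_text.splitlines(), start=1):
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        col = 0
        while col < len(line):
            ch = line[col]
            at_token_start = col == 0 or line[col - 1].isspace()
            if ch == "#" and at_token_start:
                break  # comment runs to the end of the line
            if ch in "'\"" and at_token_start:
                end = col + 1
                while end < len(line):
                    closes = line[end] == ch and (
                        end + 1 == len(line) or line[end + 1].isspace()
                    )
                    if closes:
                        break
                    end += 1
                if end >= len(line):
                    problems.append((lineno, col + 1, f"unterminated {ch} string"))
                    break
                col = end + 1
            else:
                col += 1
    return problems

sample = "data_test\n_chemical_name_common 'aspirin\n_cell_length_a 10.5\n"
for line, column, message in find_unterminated_quotes(sample):
    print(f"line {line}, column {column}: {message}")
```

A correcting parser would go one step further and, for example, close the string at the end of the line while emitting the same diagnostic, so that the rest of the file can still be processed.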
ABSTRACT
Crystallographic investigations deliver high-accuracy information about positions of atoms in crystal unit cells. For chemists, however, the structure of a molecule is most often of interest. The structure must thus be reconstructed from crystallographic files using symmetry information and chemical properties of atoms. Most existing algorithms faithfully reconstruct separate molecules but not the overall stoichiometry of the complex present in a crystal. Here, an algorithm that can reconstruct stoichiometrically correct multimolecular ensembles is described. This algorithm uses only the crystal symmetry information for determining molecule numbers and their stoichiometric ratios. The algorithm can be used by chemists and crystallographers as a standalone implementation for investigating above-molecular ensembles or as a function implemented in graphical crystal analysis software. The greatest envisaged benefit of the algorithm, however, is for the users of large crystallographic and chemical databases, since it will permit database maintainers to generate stoichiometrically correct chemical representations of crystal structures automatically and to match them against chemical databases, enabling multidisciplinary searches across multiple databases.
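How symmetry alone can fix molecule counts may be seen in a small sketch: the number of copies of a molecule per unit cell equals the number of distinct images of that molecule under the space-group operations. The code below is not the published algorithm; it uses the molecular centroid as a stand-in for the whole molecule (an assumption), exact fractions to avoid rounding problems, and a hypothetical two-component crystal in space group P-1.

```python
from fractions import Fraction as F
from functools import reduce
from math import gcd

# Symmetry operations of P-1 as (rotation matrix, translation),
# acting on fractional coordinates.
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
INVERSION = ((-1, 0, 0), (0, -1, 0), (0, 0, -1))
P_MINUS_1 = [(IDENTITY, (0, 0, 0)), (INVERSION, (0, 0, 0))]

def apply(op, xyz):
    """Apply one symmetry operation and wrap the result into the unit cell."""
    rot, trans = op
    return tuple(
        (sum(rot[i][j] * xyz[j] for j in range(3)) + trans[i]) % 1
        for i in range(3)
    )

def copies_per_cell(ops, centroid):
    """Count distinct symmetry images of a molecular centroid in one cell."""
    return len({apply(op, centroid) for op in ops})

# Hypothetical crystal: a complex on the inversion centre, a solvent molecule
# on a general position.
centroids = {
    "complex": (F(0), F(0), F(0)),
    "solvent": (F(1, 4), F(1, 3), F(1, 5)),
}
counts = {name: copies_per_cell(P_MINUS_1, c) for name, c in centroids.items()}
divisor = reduce(gcd, counts.values())
ratio = {name: n // divisor for name, n in counts.items()}
print(counts, ratio)
```

Here the complex maps onto itself under inversion while the solvent does not, so the cell contains one complex for every two solvent molecules, a ratio that is lost if each molecule is reconstructed in isolation.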