ABSTRACT
Pfizer's Crystal Structure Database (CSDB) is a key enabling technology that gives scientists on structure-based projects rapid access to Pfizer's vast library of in-house crystal structures, as well as a significant number of structures imported from the Protein Data Bank. In addition to capturing basic information such as the asymmetric unit coordinates and reflection data, CSDB employs a variety of automated methods, first to ensure a standard level of annotation and error checking, and then to add significant value for design teams by processing the structures through a sequence of algorithms that prepares them for use in modeling. The structures are made available, both as the original asymmetric unit as submitted and as the final prepared structures, through REST-based web services that are consumed by several client desktop applications. The structures can be searched by keyword, sequence, submission date, ligand substructure, ligand similarity, and other common queries.
Subject(s)
Algorithms; Databases, Protein; Humans; Ligands

ABSTRACT
Chemical mixtures have recently become a focus of work on open standards and data structures for capturing machine-readable descriptions for informatics uses. At present, essentially all information about mixtures is transmitted as short text descriptions that are readable only by trained scientists, and there are no accessible repositories of marked-up mixture data. We have designed a machine learning tool that can interpret mixture descriptions and upgrade them to the high-level Mixfile format, which can in turn be used to generate Mixtures InChI notation. The interpretation achieves a high success rate and can be used at scale to mark up large catalogs and inventories, with some expert checking to catch edge cases. The training data accumulated during the project is made openly available, along with previously released mixture editing tools and utilities.