ABSTRACT
In a typical high-throughput screening (HTS) campaign, less than 1% of the small-molecule library is characterized by confirmatory experiments. As much as 99% of the library's molecules are set aside--and not included in downstream analysis--even though some of them would prove active if sent for confirmatory testing. These missing experimental measurements prevent screeners from identifying active molecules. In this study, we propose managing missing measurements using imputation--a powerful technique from the machine learning community--to fill in accurate guesses where measurements are missing. We then use these imputed measurements to construct an imputed visualization of HTS results, based on the scaffold tree visualization from the literature. This imputed visualization identifies almost all groups of active molecules in an HTS, even those that would otherwise be missed. We validate our methodology by simulating HTS experiments using data from eight quantitative HTS campaigns, and we discuss the implications for drug discovery. In particular, this method can rapidly and economically identify novel active molecules, each of which could exhibit novel binding or selectivity while also representing new intellectual property.
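
As a concrete illustration of the imputation step, the Python sketch below fills a missing confirmatory measurement with the Tanimoto-weighted average of the most similar molecules that were actually tested. The fingerprint-set representation, the choice of k, and the weighted-average rule are assumptions made for illustration, not the paper's exact model.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints (sets of feature ids)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def impute_activity(query_fp, measured, k=5):
    """Guess a missing activity as the Tanimoto-weighted mean of the k most
    similar molecules with confirmed measurements.

    measured: list of (fingerprint, activity) pairs from confirmatory tests.
    """
    neighbors = sorted(measured, key=lambda m: tanimoto(query_fp, m[0]),
                       reverse=True)[:k]
    weights = [tanimoto(query_fp, fp) for fp, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        return None  # no informative neighbors; leave the measurement missing
    return sum(w * act for w, (_, act) in zip(weights, neighbors)) / total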
Subject(s)
High-Throughput Screening Assays; Small Molecule Libraries; Artificial Intelligence; Drug Discovery; Humans; Software
ABSTRACT

MOTIVATION: In high-throughput screens (HTS) of small molecules for activity in an in vitro assay, it is common to search for active scaffolds: scaffolds with at least one example confirmed as active. The number of active scaffolds better reflects the success of the screen than the number of active molecules, yet many existing algorithms for deciding which hits should be sent for confirmatory testing neglect this concern. RESULTS: We derived a new extension of a recently proposed economic framework, diversity-oriented prioritization (DOP), that aims--by changing which hits are sent for confirmatory testing--to maximize the number of scaffolds with at least one confirmed active. In both retrospective and prospective experiments, DOP accurately predicted the number of scaffold discoveries in a batch of confirmatory experiments, improved the rate of scaffold discovery by 8-17%, and was surprisingly robust to the size of the confirmatory test batches. As an extension of our previously reported economic framework, DOP can be used to decide the optimal number of hits to send for confirmatory testing by iteratively computing the cost of discovering an additional scaffold, the marginal cost of discovery. CONTACT: swamidass@wustl.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
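
To make the objective concrete, the Python sketch below computes the expected number of scaffolds with at least one confirmed active in a candidate batch, selects hits greedily against that objective, and prices the marginal cost of discovery as cost per marginal expected discovery. The data structures (hit/scaffold pairs, per-hit confirmation probabilities) and the greedy selection are illustrative assumptions, not a reproduction of the paper's economic framework.

from collections import defaultdict

def expected_new_scaffolds(batch, p_active):
    """Expected number of scaffolds with >= 1 confirmed active in the batch.

    batch: list of (hit_id, scaffold_id); p_active: hit_id -> probability
    that the hit confirms as active.
    """
    p_no_active = defaultdict(lambda: 1.0)
    for hit, scaffold in batch:
        p_no_active[scaffold] *= 1.0 - p_active[hit]
    return sum(1.0 - p for p in p_no_active.values())

def prioritize(hits, p_active, batch_size):
    """Greedily build the batch that maximizes expected scaffold discoveries."""
    chosen, remaining = [], list(hits)
    for _ in range(batch_size):
        best = max(remaining,
                   key=lambda h: expected_new_scaffolds(chosen + [h], p_active))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def marginal_cost(chosen, next_hit, p_active, cost_per_test=1.0):
    """Cost of one more test divided by its marginal expected discoveries."""
    gain = (expected_new_scaffolds(chosen + [next_hit], p_active)
            - expected_new_scaffolds(chosen, p_active))
    return cost_per_test / gain if gain > 0 else float("inf")

Because adding a second hit from an already-covered scaffold yields a diminishing gain, the greedy criterion naturally diversifies the batch across scaffolds.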
Subject(s)
High-Throughput Screening Assays; Algorithms; Cluster Analysis
ABSTRACT

Most methods of deciding which hits from a screen to send for confirmatory testing assume that all confirmed actives are equally valuable and aim only to maximize the number of confirmed hits. In contrast, "utility-aware" methods are informed by models of screeners' preferences and can increase the rate at which useful information is discovered. Clique-oriented prioritization (COP) extends a recently proposed economic framework and aims--by changing which hits are sent for confirmatory testing--to maximize the number of scaffolds with at least two confirmed active examples. In both retrospective and prospective experiments, COP enables accurate predictions of the number of clique discoveries in a batch of confirmatory experiments and improves the rate of clique discovery by more than 3-fold. In contrast, other similarity-based methods like ontology-based pattern identification (OPI) and local hit-rate analysis (LHR) reduce the rate of scaffold discovery by about half. The utility-aware algorithm used to implement COP is general enough to implement several other important models of screener preferences.
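
The clique objective can be sketched in the same style: the expected number of scaffolds with at least two confirmed actives, computed from per-hit confirmation probabilities. The batch representation below is an assumption for illustration only.

from collections import defaultdict

def expected_cliques(batch, p_active):
    """Expected number of scaffolds with >= 2 confirmed actives.

    batch: list of (hit_id, scaffold_id); p_active: hit_id -> probability
    that the hit confirms as active.
    """
    by_scaffold = defaultdict(list)
    for hit, scaffold in batch:
        by_scaffold[scaffold].append(p_active[hit])
    total = 0.0
    for probs in by_scaffold.values():
        p_zero = 1.0
        for p in probs:                    # P(no hit confirms)
            p_zero *= 1.0 - p
        p_one = 0.0                        # P(exactly one hit confirms)
        for i, p in enumerate(probs):
            term = p
            for j, q in enumerate(probs):
                if j != i:
                    term *= 1.0 - q
            p_one += term
        total += 1.0 - p_zero - p_one      # P(at least two confirm)
    return total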
Subject(s)
Caenorhabditis elegans Proteins/antagonists & inhibitors; Chromosomal Proteins, Non-Histone/antagonists & inhibitors; Drug Discovery/methods; Jumonji Domain-Containing Histone Demethylases/antagonists & inhibitors; Algorithms; Animals; Caenorhabditis elegans; Caenorhabditis elegans Proteins/chemistry; Chromosomal Proteins, Non-Histone/chemistry; High-Throughput Screening Assays; Humans; Jumonji Domain-Containing Histone Demethylases/chemistry; Models, Molecular; Small Molecule Libraries
ABSTRACT

Public databases that store the data from small-molecule screens are a rich and untapped resource of chemical and biological information. However, screening databases are unorganized, which makes interpreting their data difficult. We propose a method of inferring workflow graphs--which encode the relationships between assays in screening projects--directly from screening data and using these workflows to organize each project's data. On the basis of four heuristics regarding the organization of screening projects, we designed an algorithm that extracts a project's workflow graph from screening data. Where possible, the algorithm is evaluated by comparing each project's inferred workflow to its documentation. In the majority of cases, there are no discrepancies between the two. Most errors can be traced to points in the project where screeners chose additional molecules to test based on structural similarity to promising molecules, a case our algorithm is not yet capable of handling. Nonetheless, these workflows accurately organize most of the data and also provide a method of visualizing a screening project. This method is robust enough to build a workflow-oriented front-end to PubChem and is currently being used regularly by both our lab and our collaborators. A Python implementation of the algorithm is available online, and a searchable database of all PubChem workflows is available at http://swami.wustl.edu/flow.
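
To give a flavor of how such inference can work, the toy Python sketch below implements one plausible heuristic: assay B is placed downstream of assay A when B tested fewer molecules than A and most of B's tested molecules were among A's actives. The actual algorithm relies on four heuristics that are not reproduced here; the threshold and data layout are assumptions.

def infer_edges(assays, overlap_threshold=0.8):
    """Infer candidate workflow edges between assays.

    assays: dict assay_id -> {'tested': set of molecule ids,
                              'actives': set of molecule ids}.
    Returns (upstream, downstream) pairs.
    """
    edges = []
    for a_id, a in assays.items():
        for b_id, b in assays.items():
            if a_id == b_id or not b['tested']:
                continue
            # Heuristic: B follows A if B is smaller than A and most of
            # B's tested molecules were scored active in A.
            if len(b['tested']) >= len(a['tested']):
                continue
            overlap = len(b['tested'] & a['actives']) / len(b['tested'])
            if overlap >= overlap_threshold:
                edges.append((a_id, b_id))
    return edges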
Subject(s)
Algorithms; Data Mining/methods; Databases, Chemical; Drug Evaluation, Preclinical/methods; Computational Biology; Computer Graphics; Database Management Systems; High-Throughput Screening Assays/methods; Molecular Structure; Small Molecule Libraries/pharmacology
ABSTRACT

Identifying the overrepresented substructures from a set of molecules with similar activity is a common task in chemical informatics. Existing substructure miners are deterministic, requiring the activity of all mined molecules to be known with high confidence. In contrast, we introduce pGraphSig, a probabilistic structure miner, which effectively mines structures from noisy data, where many molecules are labeled only with their probability of being active. We benchmark pGraphSig on data from several small-molecule high-throughput screens, finding that it identifies overrepresented structures more effectively than a deterministic structure miner.
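
In the same spirit, here is a minimal sketch of probabilistic substructure scoring: each substructure occurrence is weighted by its molecule's probability of being active, and substructures whose expected active rate exceeds the library's base rate are flagged. The enumeration of substructures and this particular scoring rule are illustrative assumptions, not pGraphSig's actual statistic.

from collections import defaultdict

def overrepresented(molecules, min_ratio=2.0):
    """Rank substructures whose expected rate among actives exceeds the
    library's base rate by at least min_ratio.

    molecules: list of (substructure_ids, p_active) pairs, where
    substructure_ids is the set of substructures the molecule contains
    (substructure enumeration is assumed to happen elsewhere).
    """
    counts = defaultdict(lambda: [0.0, 0])
    for substructs, p_active in molecules:
        for s in substructs:
            counts[s][0] += p_active   # expected occurrences among actives
            counts[s][1] += 1          # occurrences in the whole set
    base_rate = sum(p for _, p in molecules) / len(molecules)
    if base_rate == 0.0:
        return []
    scored = [(s, (exp / n) / base_rate)
              for s, (exp, n) in counts.items()
              if (exp / n) >= min_ratio * base_rate]
    return sorted(scored, key=lambda pair: -pair[1])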