Search | Biblioteca Virtual em Saúde

1.

The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods.

Zdrazil, Barbara; Felix, Eloy; Hunter, Fiona; Manners, Emma J; Blackshaw, James; Corbett, Sybilla; de Veij, Marleen; Ioannidis, Harris; Lopez, David Mendez; Mosquera, Juan F; Magarinos, Maria Paula; Bosc, Nicolas; Arcila, Ricardo; Kizilören, Tevfik; Gaulton, Anna; Bento, A Patrícia; Adasme, Melissa F; Monecke, Peter; Landrum, Gregory A; Leach, Andrew R.

Nucleic Acids Res ; 52(D1): D1180-D1192, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37933841

ABSTRACT

ChEMBL (https://www.ebi.ac.uk/chembl/) is a manually curated, high-quality, large-scale, open, FAIR and Global Core Biodata Resource of bioactive molecules with drug-like properties, previously described in the 2012, 2014, 2017 and 2019 Nucleic Acids Research Database Issues. Since its introduction in 2009, ChEMBL's content has changed dramatically in size and diversity of data types. Through incorporation of multiple new datasets from depositors since the 2019 update, ChEMBL now contains slightly more bioactivity data from deposited data vs data extracted from literature. In collaboration with the EUbOPEN consortium, chemical probe data is now regularly deposited into ChEMBL. Release 27 made curated data available for compounds screened for potential anti-SARS-CoV-2 activity from several large-scale drug repurposing screens. In addition, new patent bioactivity data have been added to the latest ChEMBL releases, and various new features have been incorporated, including a Natural Product likeness score, updated flags for Natural Products, a new flag for Chemical Probes, and the initial annotation of the action type for ~270 000 bioactivity measurements.

Subjects

Drug Discovery; Databases, Factual; Time Factors

2.

Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise.

Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 64(5): 1560-1567, 2024 03 11.

Article in English | MEDLINE | ID: mdl-38394344

ABSTRACT

As part of the ongoing quest to find or construct large data sets for use in validating new machine learning (ML) approaches for bioactivity prediction, it has become distressingly common for researchers to combine literature IC50 data generated using different assays into a single data set. It is well-known that there are many situations where this is a scientifically risky thing to do, even when the assays are against exactly the same target, but the risks of assays being incompatible are even higher when pulling data from large collections of literature data like ChEMBL. Here, we estimate the amount of noise present in combined data sets using cases where measurements for the same compound are reported in multiple assays against the same target. This approach shows that IC50 assays selected using minimal curation settings have poor agreement with each other: almost 65% of the points differ by more than 0.3 log units, 27% differ by more than one log unit, and the correlation between the assays, as measured by Kendall's τ, is only 0.51. Requiring that most of the assay metadata in ChEMBL matches ("maximal curation") in order to combine two assays improves the situation (48% of the points differ by more than 0.3 log units, 13% by more than one log unit, and Kendall's τ is 0.71) at the expense of having smaller data sets. Surprisingly, our analysis shows similar amounts of noise when combining data from different literature Ki assays. We suggest that good scientific practice requires careful curation when combining data sets from different assays and hope that our maximal curation strategy will help to improve the quality of the data that are being used to build and validate ML models for bioactivity prediction. To help achieve this, the code and ChEMBL queries that we used for the maximal curation approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays.

Subjects

Machine Learning; Software; Biological Assay
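
The agreement statistics quoted in the abstract above (fraction of points differing by more than 0.3 or 1.0 log units, and Kendall's τ) can be illustrated with a minimal sketch. It assumes the input is a list of paired values on a log scale (for example pIC50) for the same compound in two assays. The function names and the toy data are hypothetical and are not taken from the authors' repository.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b via an O(n^2) pair scan (fine for a small sketch)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes nothing
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def agreement_stats(pairs):
    """pairs: (value_in_assay_A, value_in_assay_B) on a log scale, one per compound."""
    a = [p[0] for p in pairs]
    b = [p[1] for p in pairs]
    diffs = [abs(u - v) for u, v in pairs]
    n = len(pairs)
    return {
        "frac_gt_0.3": sum(d > 0.3 for d in diffs) / n,
        "frac_gt_1.0": sum(d > 1.0 for d in diffs) / n,
        "kendall_tau": kendall_tau_b(a, b),
    }


# Hypothetical toy data: four compounds measured in two assays.
toy_pairs = [(6.0, 6.1), (7.0, 7.5), (5.0, 6.2), (8.0, 7.9)]
print(agreement_stats(toy_pairs))
```

For real data sets, a library routine such as `scipy.stats.kendalltau` would normally replace the quadratic pair scan.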

3.

Understanding and Quantifying Molecular Flexibility: Torsion Angular Bin Strings.

Braun, Jessica; Katzberger, Paul; Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 64(20): 7917-7924, 2024 Oct 28.

Article in English | MEDLINE | ID: mdl-39390326

ABSTRACT

Molecular flexibility is a commonly used, but not easily quantified term. It is at the core of understanding composition and size of a conformational ensemble and contributes to many molecular properties. For many computational workflows, it is necessary to reduce a conformational ensemble to meaningful representatives, however defining them and guaranteeing the ensemble's completeness is difficult. We introduce the concepts of torsion angular bin strings (TABS) as a discrete vector representation of a conformer's dihedral angles and the number of possible TABS (nTABS) as an estimation for the ensemble size of a molecule, respectively. Here, we show that nTABS corresponds to an upper limit for the size of the conformational space of small molecules and compare the classification of conformer ensembles by TABS with classifications by RMSD. Overcoming known drawbacks like the molecular size dependency and threshold picking of the RMSD measure, TABS is shown to meaningfully discretize the conformational space and hence allows e.g. for fast checks of the coverage of the conformational space. The current proof-of-concept implementation is based on the ETKDGv3 conformer generator as implemented in the RDKit and known torsion preferences extracted from small-molecule crystallographic data.

Subjects

Molecular Conformation; Models, Molecular
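
The idea of encoding a conformer as a discrete vector of torsion bins can be sketched as follows. This is only an illustration: it assumes each torsion has a hypothetical list of sorted bin edges in degrees, whereas the published method derives its bins from crystallographic torsion preferences. The naive product of bin counts ignores symmetry and correlations between torsions, so it serves only as a rough upper bound in the spirit of nTABS.

```python
from math import prod


def bin_index(angle_deg, edges):
    """Index of the bin containing angle_deg.

    edges: sorted bin edges in [0, 360). Bin k spans [edges[k], edges[k+1]);
    the last bin wraps around 360 degrees back to the first edge.
    """
    a = angle_deg % 360.0
    for k in range(len(edges) - 1, -1, -1):
        if a >= edges[k]:
            return k
    return len(edges) - 1  # below the first edge: inside the wrapping bin


def tabs(torsions_deg, edges_per_torsion):
    """Discrete bin string (as a tuple) for one conformer."""
    return tuple(bin_index(a, e) for a, e in zip(torsions_deg, edges_per_torsion))


def naive_ntabs(edges_per_torsion):
    """Product of bin counts: a crude upper bound on distinct bin strings."""
    return prod(len(e) for e in edges_per_torsion)


# Hypothetical molecule with three torsions and hand-picked bin edges.
edges = [[0, 120, 240], [0, 120, 240], [0, 180]]
conformers = [[65, 178, 10], [62, -175, 15], [-60, 180, 190]]
observed = {tabs(c, edges) for c in conformers}
print(len(observed), "distinct bin strings out of at most", naive_ntabs(edges))
```

Comparing the number of observed bin strings with the upper bound gives a quick, threshold-free indication of how much of the discretized space an ensemble covers.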

4.

lwreg: A Lightweight System for Chemical Registration and Data Storage.

Landrum, Gregory A; Braun, Jessica; Katzberger, Paul; Lehner, Marc T; Riniker, Sereina.

J Chem Inf Model ; 64(16): 6247-6252, 2024 Aug 26.

Article in English | MEDLINE | ID: mdl-39114929

ABSTRACT

Here, we present lwreg, a lightweight, yet flexible chemical registration system supporting the capture of both two-dimensional molecular structures (topologies) and three-dimensional conformers. lwreg is open source, with a simple Python API, and is designed to be easily integrated into computational workflows. In addition to lwreg itself, we also introduce a straightforward schema for storing experimental data and metadata in the registration database. This direct connection between compound structural information and data generated using those structures creates a powerful tool for data analysis and experimental reproducibility. The software is available at and installable directly from https://github.com/rinikerlab/lightweight-registration.

Subjects

Information Storage and Retrieval; Software; Databases, Chemical; Molecular Conformation

5.

DASH properties: Estimating atomic and molecular properties from a dynamic attention-based substructure hierarchy.

Lehner, Marc T; Katzberger, Paul; Maeder, Niels; Landrum, Gregory A; Riniker, Sereina.

J Chem Phys ; 161(7), 2024 Aug 21.

Article in English | MEDLINE | ID: mdl-39145551

ABSTRACT

Recently, we presented a method to assign atomic partial charges based on the DASH (dynamic attention-based substructure hierarchy) tree with high efficiency and quantum mechanical (QM)-like accuracy. In addition, the approach can be considered "rule based" (where the rules are derived from the attention values of a graph neural network), and thus each assignment is fully explainable by visualizing the underlying molecular substructures. In this work, we demonstrate that these hierarchically sorted substructures capture the key features of the local environment of an atom and allow us to predict different atomic properties with high accuracy without building a new DASH tree for each property. The fast prediction of atomic properties in molecules with the DASH tree can, for example, be used as an efficient way to generate feature vectors for machine learning without the need for expensive QM calculations. The final DASH tree with the different atomic properties as well as the complete dataset with wave functions is made freely available.

6.

DASH: Dynamic Attention-Based Substructure Hierarchy for Partial Charge Assignment.

Lehner, Marc T; Katzberger, Paul; Maeder, Niels; Schiebroek, Carl C G; Teetz, Jakob; Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 63(19): 6014-6028, 2023 Oct 09.

Article in English | MEDLINE | ID: mdl-37738206

ABSTRACT

We present a robust and computationally efficient approach for assigning partial charges of atoms in molecules. The method is based on a hierarchical tree constructed from attention values extracted from a graph neural network (GNN), which was trained to predict atomic partial charges from accurate quantum-mechanical (QM) calculations. The resulting dynamic attention-based substructure hierarchy (DASH) approach provides fast assignment of partial charges with the same accuracy as the GNN itself, is software-independent, and can easily be integrated in existing parametrization pipelines, as shown for the Open force field (OpenFF). The implementation of the DASH workflow, the final DASH tree, and the training set are available as open source/open data from public repositories.

7.

Incorporating NOE-Derived Distances in Conformer Generation of Cyclic Peptides with Distance Geometry.

Wang, Shuzhe; Krummenacher, Kajo; Landrum, Gregory A; Sellers, Benjamin D; Di Lello, Paola; Robinson, Sarah J; Martin, Bryan; Holden, Jeffrey K; Tom, Jeffrey Y K; Murthy, Anastasia C; Popovych, Nataliya; Riniker, Sereina.

J Chem Inf Model ; 62(3): 472-485, 2022 02 14.

Article in English | MEDLINE | ID: mdl-35029985

ABSTRACT

Nuclear magnetic resonance (NMR) data from NOESY (nuclear Overhauser enhancement spectroscopy) and ROESY (rotating frame Overhauser enhancement spectroscopy) experiments can easily be combined with distance geometry (DG) based conformer generators by modifying the molecular distance bounds matrix. In this work, we extend the modern DG based conformer generator ETKDG, which has been shown to reproduce experimental crystal structures from small molecules to large macrocycles well, to include NOE-derived interproton distances. In noeETKDG, the experimentally derived interproton distances are incorporated into the distance bounds matrix as loose upper (or lower) bounds to generate large conformer sets. Various subselection techniques can subsequently be applied to yield a conformer bundle that best reproduces the NOE data. The approach is benchmarked using a set of 24 (mostly) cyclic peptides for which NOE-derived distances as well as reference solution structures obtained by other software are available. With respect to other packages currently available, the advantages of noeETKDG are its speed and that no prior force-field parametrization is required, which is especially useful for peptides with unnatural amino acids. The resulting conformer bundles can be further processed with the use of structural refinement techniques to improve the modeling of the intramolecular nonbonded interactions. The noeETKDG code is released as a fully open-source software package available at www.github.com/rinikerlab/customETKDG.

Subjects

Peptides, Cyclic; Peptides; Magnetic Resonance Imaging; Magnetic Resonance Spectroscopy/methods; Models, Molecular; Protein Conformation
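
The step of folding NOE-derived distances into a distance bounds matrix can be sketched in a few lines. The layout is an assumption made for this illustration: a square matrix whose upper triangle holds upper bounds and whose lower triangle holds lower bounds. The `tolerance` parameter is a hypothetical stand-in for the "loose" bounds mentioned in the abstract. This is not the noeETKDG code.

```python
def apply_noe_upper_bounds(bounds, noe_restraints, tolerance=0.5):
    """Return a copy of the bounds matrix with NOE upper bounds applied.

    bounds: square list of lists; for i < j, bounds[i][j] is the upper bound
            and bounds[j][i] the lower bound (assumed layout for this sketch).
    noe_restraints: iterable of (atom_i, atom_j, distance) tuples.
    tolerance: slack added to each NOE distance (hypothetical default).
    """
    new = [row[:] for row in bounds]
    for i, j, dist in noe_restraints:
        lo, hi = min(i, j), max(i, j)
        upper = dist + tolerance
        lower = new[hi][lo]
        # Only tighten an upper bound, and never push it below the lower bound.
        if lower <= upper < new[lo][hi]:
            new[lo][hi] = upper
    return new


# Hypothetical three-atom bounds matrix (distances in angstroms).
bounds = [
    [0.0, 5.0, 8.0],
    [1.5, 0.0, 6.0],
    [2.0, 1.8, 0.0],
]
restraints = [(0, 2, 3.0), (2, 1, 7.0), (0, 1, 0.5)]
for row in apply_noe_upper_bounds(bounds, restraints):
    print(row)
```

Restraints that would loosen an existing bound or contradict a lower bound are ignored here; a real workflow would more likely report such conflicts than drop them silently.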

8.

GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning.

Esposito, Carmen; Landrum, Gregory A; Schneider, Nadine; Stiefl, Nikolaus; Riniker, Sereina.

J Chem Inf Model ; 61(6): 2623-2640, 2021 06 28.

Article in English | MEDLINE | ID: mdl-34100609

ABSTRACT

Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.

Subjects

Algorithms; Machine Learning
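
The core idea of moving the decision threshold away from 0.5 without retraining can be sketched as a simple scan. The choice of Cohen's kappa as the metric, the candidate thresholds and the toy probabilities are assumptions made for this illustration. The published GHOST procedure is more elaborate than this single scan.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = minority/active class)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def best_threshold(y_true, probs, thresholds):
    """Pick the threshold with the highest kappa; the model is not retrained."""
    best_t, best_k = None, float("-inf")
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k


# Hypothetical imbalanced set: 2 actives among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.45, 0.35, 0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02]
print(best_threshold(y_true, probs, [0.5, 0.4, 0.3, 0.2]))
```

With the default threshold of 0.5 every compound in this toy set is predicted inactive (kappa of 0), which is the majority-class overprediction the abstract describes.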

9.

rdScaffoldNetwork: The Scaffold Network Implementation in RDKit.

Kruger, Franziska; Stiefl, Nikolaus; Landrum, Gregory A.

J Chem Inf Model ; 60(7): 3331-3335, 2020 07 27.

Article in English | MEDLINE | ID: mdl-32584031

ABSTRACT

We present an implementation of the scaffold network in the open source cheminformatics toolkit RDKit. Scaffold networks have been introduced in the literature as a powerful method to navigate and analyze large screening data sets in medicinal chemistry. Such a network can be created by iteratively applying predefined fragmentation rules to the investigated set of small molecules and by linking the produced fragments according to their descendence. This procedure results in a network graph, where the nodes correspond to the fragments and the edges correspond to the operations producing one fragment from another. In extension to the scaffold network implementations suggested in the literature, the presented implementation in RDKit allows an enhanced flexibility in terms of customizing the fragmentation rules and enables the inclusion of atom- and bond-generic scaffolds into the network. The output, providing node and edge information on the network, enables a simple and elegant navigation through the network, laying the basis to organize and better understand the data set being investigated.

Subjects

Cheminformatics; Software; Chemistry, Pharmaceutical

10.

Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences.

Wang, Shuzhe; Witek, Jagna; Landrum, Gregory A; Riniker, Sereina.

J Chem Inf Model ; 60(4): 2044-2058, 2020 04 27.

Article in English | MEDLINE | ID: mdl-32155061

ABSTRACT

The conformer generator ETKDG is a stochastic search method that utilizes distance geometry together with knowledge derived from experimental crystal structures. It has been shown to generate good conformers for acyclic, flexible molecules. This work builds on ETKDG to improve conformer generation of molecules containing small or large aliphatic (i.e., non-aromatic) rings. For one, we devise additional torsional-angle potentials to describe small aliphatic rings and adapt the previously developed potentials for acyclic bonds to facilitate the sampling of macrocycles. However, due to the larger number of degrees of freedom of macrocycles, the conformational space to sample is much broader than for small molecules, creating a challenge for conformer generators. We therefore introduce different heuristics to restrict the search space of macrocycles and bias the sampling toward more experimentally relevant structures. Specifically, we show the usage of elliptical geometry and customizable Coulombic interactions as heuristics. The performance of the improved ETKDG is demonstrated on test sets of diverse macrocycles and cyclic peptides. The code developed here will be incorporated into the 2020.03 release of the open-source cheminformatics library RDKit.

Subjects

Heuristics; Peptides, Cyclic; Models, Molecular; Molecular Conformation

11.

Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.

Schneider, Nadine; Fechner, Nikolas; Landrum, Gregory A; Stiefl, Nikolaus.

J Chem Inf Model ; 57(8): 1816-1831, 2017 08 28.

Article in English | MEDLINE | ID: mdl-28715190

ABSTRACT

Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". 
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.

Subjects

Data Mining/methods; Databases, Chemical; Algorithms

12.

What's What: The (Nearly) Definitive Guide to Reaction Role Assignment.

Schneider, Nadine; Stiefl, Nikolaus; Landrum, Gregory A.

J Chem Inf Model ; 56(12): 2336-2346, 2016 12 27.

Article in English | MEDLINE | ID: mdl-28024398

ABSTRACT

When analyzing chemical reactions it is essential to know which molecules are actively involved in the reaction and which educts will form the product molecules. Assigning reaction roles, like reactant, reagent, or product, to the molecules of a chemical reaction might be a trivial problem for hand-curated reaction schemes but it is more difficult to automate, an essential step when handling large amounts of reaction data. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles which is also applicable to rather unbalanced and noisy reaction schemes. Given a set of molecules involved and knowing the product(s) of a reaction we assign the most probable reactants and sort out the remaining reagents. Our approach was validated using two different data sets: one hand-curated data set comprising about 680 diverse reactions extracted from patents which span more than 200 different reaction types and include up to 18 different reactants. A second set consists of 50 000 randomly picked reactions from US patents. The results of the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets our method assigns the reaction roles correctly for the vast majority of the reactions, achieving an accuracy of 88% and 97% respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit and the data sets and Jupyter notebooks used for evaluation of the new method are available in the Supporting Information of this publication.

Subjects

Drug Discovery; Models, Chemical; Software; Algorithms; Databases, Chemical; Drug Discovery/methods; Indicators and Reagents/chemistry; Patents as Topic

13.

Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.

Riniker, Sereina; Landrum, Gregory A.

J Chem Inf Model ; 55(12): 2562-74, 2015 Dec 28.

Article in English | MEDLINE | ID: mdl-26575315

ABSTRACT

Small organic molecules are often flexible, i.e., they can adopt a variety of low-energy conformations in solution that exist in equilibrium with each other. Two main search strategies are used to generate representative conformational ensembles for molecules: systematic and stochastic. In the first approach, each rotatable bond is sampled systematically in discrete intervals, limiting its use to molecules with a small number of rotatable bonds. Stochastic methods, on the other hand, sample the conformational space of a molecule randomly and can thus be applied to more flexible molecules. Different methods employ different degrees of experimental data for conformer generation. So-called knowledge-based methods use predefined libraries of torsional angles and ring conformations. In the distance geometry approach, on the other hand, a smaller amount of empirical information is used, i.e., ideal bond lengths, ideal bond angles, and a few ideal torsional angles. Distance geometry is a computationally fast method to generate conformers, but it has the downside that purely distance-based constraints tend to lead to distorted aromatic rings and sp² centers. To correct this, the resulting conformations are often minimized with a force field, adding computational complexity and run time. Here we present an alternative strategy that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data. The torsional angles are described by a previously developed set of hierarchically structured SMARTS patterns. The new approach is implemented in the open-source cheminformatics library RDKit, and its performance is assessed by comparing the diversity of the generated ensemble and the ability to reproduce crystal conformations taken from the crystal structures of small molecules and protein-ligand complexes.

Subjects

Algorithms; Models, Molecular; Stochastic Processes; Molecular Conformation; Organic Chemicals/chemistry

14.

Get Your Atoms in Order--An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm.

Schneider, Nadine; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(10): 2111-20, 2015 Oct 26.

Article in English | MEDLINE | ID: mdl-26441310

ABSTRACT

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins. Furthermore, each cheminformatics toolkit or software provides its own version of a canonical ordering, most based on unpublished algorithms, which also complicates the generation of a universal unique identifier for molecules. We present an alternative canonicalization approach that uses a standard stable-sorting algorithm instead of a Morgan-like index. Two new invariants that allow canonical ordering of molecules with dependent chirality as well as those with highly symmetrical cyclic graphs have been developed. The new approach proved to be robust and fast when tested on the 1.45 million compounds of the ChEMBL 20 data set in different scenarios like random renumbering of input atoms or SMILES round tripping. Our new algorithm is able to generate a canonical order of the atoms of protein molecules within a few milliseconds. The novel algorithm is implemented in the open-source cheminformatics toolkit RDKit. With this paper, we provide a reference Python implementation of the algorithm that could easily be integrated in any cheminformatics toolkit. This provides a first step toward a common standard for canonical atom ordering to generate a universal unique identifier for molecules other than InChI.

Subjects

Algorithms; Models, Molecular; Small Molecule Libraries/chemistry; Software; Stereoisomerism
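
Ranking atoms by repeatedly sorting on their current rank plus the ranks of their neighbors can be illustrated with a deliberately simplified sketch. It assumes atomic numbers as the only starting invariant and stops once the ranks no longer change. It omits the tie-breaking step and the additional invariants described in the abstract, so symmetry-equivalent atoms keep the same rank. It is not the published algorithm or the RDKit implementation.

```python
def dense_rank(keys):
    """Map each key to its position among the sorted distinct keys."""
    order = {k: r for r, k in enumerate(sorted(set(keys)))}
    return [order[k] for k in keys]


def refine_ranks(invariants, neighbors):
    """Refine atom ranks until stable.

    invariants: one sortable starting value per atom (here: atomic number).
    neighbors: adjacency list, neighbors[i] = indices bonded to atom i.
    """
    ranks = dense_rank(invariants)
    while True:
        # New key: own rank first, then the sorted ranks of the neighbors.
        keys = [
            (ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
            for i in range(len(ranks))
        ]
        new_ranks = dense_rank(keys)
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks


# Heavy atoms of ethanol (C-C-O) and propane (C-C-C) as toy graphs.
chain = [[1], [0, 2], [1]]
print(refine_ranks([6, 6, 8], chain))  # all three atoms distinguished
print(refine_ranks([6, 6, 6], chain))  # terminal carbons remain tied
```

The tie left between the two terminal carbons of propane shows why a complete canonicalization needs an explicit tie-breaking stage after refinement.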

15.

Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity.

Schneider, Nadine; Lowe, Daniel M; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(1): 39-53, 2015 Jan 26.

Article in English | MEDLINE | ID: mdl-25541888

ABSTRACT

Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 out of 50 of the reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all python scripts required to reproduce the analysis are provided in the Supporting Information.

Subjects

Artificial Intelligence; Databases, Chemical; Models, Chemical; Cluster Analysis; Organic Chemistry Phenomena; Patents as Topic; Reproducibility of Results
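
Building a reaction fingerprint as the difference between product and reactant fingerprints can be sketched with count dictionaries. The feature labels below are hypothetical placeholders rather than real atom-pair codes, and the agent contribution via physicochemical properties mentioned in the abstract is left out.

```python
from collections import Counter


def difference_fingerprint(reactant_fps, product_fps):
    """Sum of product feature counts minus sum of reactant feature counts.

    Each fingerprint is a mapping feature -> count. Features whose counts
    cancel out are dropped, so only what the reaction changes remains.
    """
    total = Counter()
    for fp in product_fps:
        total.update(fp)
    for fp in reactant_fps:
        total.subtract(fp)
    return {feature: count for feature, count in total.items() if count != 0}


# Hypothetical amide-formation example with made-up feature labels.
reactants = [{"C-C": 2, "C-O": 1}, {"C-N": 1}]
products = [{"C-C": 2, "C-N": 1, "N-C=O": 1}]
print(difference_fingerprint(reactants, products))
```

Features that are unchanged between the two sides cancel, which is what makes the difference vector characteristic of the transformation rather than of the molecules involved.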

16.

Using information from historical high-throughput screens to predict active compounds.

Riniker, Sereina; Wang, Yuan; Jenkins, Jeremy L; Landrum, Gregory A.

J Chem Inf Model ; 54(7): 1880-91, 2014 Jul 28.

Article in English | MEDLINE | ID: mdl-24933016

ABSTRACT

Modern high-throughput screening (HTS) is a well-established approach for hit finding in drug discovery that is routinely employed in the pharmaceutical industry to screen more than a million compounds within a few weeks. However, as the industry shifts to more disease-relevant but more complex phenotypic screens, the focus has moved to piloting smaller but smarter chemically/biologically diverse subsets followed by an expansion around hit compounds. One standard method for doing this is to train a machine-learning (ML) model with the chemical fingerprints of the tested subset of molecules and then select the next compounds based on the predictions of this model. An alternative approach would be to take advantage of the wealth of bioactivity information contained in older (full-deck) screens using so-called HTS fingerprints, where each element of the fingerprint corresponds to the outcome of a particular assay, as input to machine-learning algorithms. We constructed HTS fingerprints using two collections of data: 93 in-house assays and 95 publicly available assays from PubChem. For each source, an additional set of 51 and 46 assays, respectively, was collected for testing. Three different ML methods, random forest (RF), logistic regression (LR), and naïve Bayes (NB), were investigated for both the HTS fingerprint and a chemical fingerprint, Morgan2. RF was found to be best suited for learning from HTS fingerprints yielding area under the receiver operating characteristic curve (AUC) values >0.8 for 78% of the internal assays and enrichment factors at 5% (EF(5%)) >10 for 55% of the assays. The RF(HTS-fp) generally outperformed the LR trained with Morgan2, which was the best ML method for the chemical fingerprint, for the majority of assays. In addition, HTS fingerprints were found to retrieve more diverse chemotypes. Combining the two models through heterogeneous classifier fusion led to a similar or better performance than the best individual model for all assays. 
Further validation using a pair of in-house assays and data from a confirmatory screen (including a prospective set of around 2000 compounds selected based on our approach) confirmed the good performance. Thus, the combination of machine-learning with HTS fingerprints and chemical fingerprints utilizes information from both domains and presents a very promising approach for hit expansion, leading to more hits. The source code used with the public data is provided.

Subjects

High-Throughput Screening Assays/methods; Informatics/methods; Algorithms; Artificial Intelligence; Bayes Theorem; Logistic Models
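
The enrichment factor at 5% reported in the abstract above compares the hit rate among the top-scored fraction of compounds with the hit rate of the whole set. A minimal sketch follows. The rounding rule for the size of the top fraction and the toy data are assumptions made for this illustration.

```python
def enrichment_factor(scores, labels, fraction=0.05):
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    scores: model scores, higher means more likely active.
    labels: 1 for active, 0 for inactive, aligned with scores.
    """
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        return 0.0
    return (hits_top / n_top) / (total_hits / n)


# Hypothetical screen: 40 compounds, 4 actives, scores already descending.
scores = [1.0 - i / 40 for i in range(40)]
labels = [0] * 40
for idx in (0, 4, 10, 20):
    labels[idx] = 1
print(enrichment_factor(scores, labels, fraction=0.05))
```

An enrichment factor of 1 corresponds to random selection; with 5% of the set selected, the maximum attainable value is 20.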

17.

Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing.

Riniker, Sereina; Fechner, Nikolas; Landrum, Gregory A.

J Chem Inf Model ; 53(11): 2829-36, 2013 Nov 25.

Article in English | MEDLINE | ID: mdl-24171408

ABSTRACT

The concept of data fusion - the combination of information from different sources describing the same object with the expectation to generate a more accurate representation - has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity search performance. Machine-learning (ML) methods based on fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature. The heterogeneous version of classifier fusion - fusing the predictions from different model types - has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods, RF, naïve Bayes (NB), and logistic regression (LR), with four 2D fingerprints, atom pairs, topological torsions, RDKit fingerprint, and circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets together with the adapted source code for ML methods are provided in the Supporting Information.

Subjects

Algorithms, Artificial Intelligence, Data Mining, High-Throughput Screening Assays/statistics & numerical data, Proteins/chemistry, User-Computer Interface, Bayes Theorem, Benchmarking, Chemical Databases, Decision Making, Ligands, Logistic Models, Molecular Models, Proteins/agonists, Proteins/antagonists & inhibitors
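
The idea of fusing predictions from different model types can be illustrated with a minimal sketch. The abstract above does not state which fusion rule was used, so the "best rank across models" rule below, the function names, and the example scores are all illustrative assumptions rather than the authors' implementation. Each model is assumed to output a per-molecule probability of activity.

```python
from typing import Dict, List


def ranks_from_scores(scores: List[float]) -> List[int]:
    """Return 1-based ranks (1 = highest score); ties keep input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def fuse_by_best_rank(model_scores: Dict[str, List[float]]) -> List[int]:
    """Order molecule indices by their best rank over all models.

    Ties in best rank are broken by the highest score any model assigned.
    This is one hypothetical fusion rule, chosen only for illustration.
    """
    score_lists = list(model_scores.values())
    n = len(score_lists[0])
    per_model_ranks = [ranks_from_scores(s) for s in score_lists]
    best_rank = [min(r[i] for r in per_model_ranks) for i in range(n)]
    best_score = [max(s[i] for s in score_lists) for i in range(n)]
    return sorted(range(n), key=lambda i: (best_rank[i], -best_score[i]))


# Made-up activity probabilities for four molecules from three model types.
example_scores = {
    "RF": [0.9, 0.2, 0.6, 0.1],
    "NB": [0.3, 0.8, 0.7, 0.2],
    "LR": [0.5, 0.4, 0.95, 0.1],
}
fused_order = fuse_by_best_rank(example_scores)
print(fused_order)  # [2, 0, 1, 3]
```

Working on ranks rather than raw probabilities avoids having to calibrate the outputs of different model types against one another, which is one reason rank-based rules are a common choice for this kind of fusion.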

18.

SIMPD: an algorithm for generating simulated time splits for validating machine learning approaches.

Landrum, Gregory A; Beckers, Maximilian; Lanini, Jessica; Schneider, Nadine; Stiefl, Nikolaus; Riniker, Sereina.

J Cheminform ; 15(1): 119, 2023 Dec 11.

Article in English | MEDLINE | ID: mdl-38082357

ABSTRACT

Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, the time-stamped project data that this requires are not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits that more accurately reflect the differences in properties and machine-learning performance observed for temporal splits than other standard approaches such as random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 99 public data sets that can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.
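
For orientation, the plain temporal split that SIMPD is designed to imitate can be sketched as follows. This is not the SIMPD algorithm itself (which uses a multi-objective genetic algorithm and needs no dates); it only shows the baseline idea of training on earlier compounds and testing on later ones. The record layout, the `time_split` function, and the compound identifiers and dates are hypothetical.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical record: (compound identifier, date the compound was registered).
Record = Tuple[str, date]


def time_split(
    records: List[Record], train_fraction: float = 0.8
) -> Tuple[List[Record], List[Record]]:
    """Sort records chronologically; earliest part trains, the rest tests."""
    ordered = sorted(records, key=lambda r: r[1])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up compounds with made-up registration dates.
records = [
    ("cpd-A", date(2019, 3, 1)),
    ("cpd-B", date(2021, 7, 15)),
    ("cpd-C", date(2018, 11, 2)),
    ("cpd-D", date(2020, 1, 20)),
    ("cpd-E", date(2022, 5, 9)),
]
train, test = time_split(records, train_fraction=0.8)
print([cid for cid, _ in train])  # ['cpd-C', 'cpd-A', 'cpd-D', 'cpd-B']
print([cid for cid, _ in test])   # ['cpd-E']
```

A split like this needs a date for every compound, which is exactly what public data sets usually lack; that gap is the motivation the abstract gives for simulating time splits instead.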

19.

Corrections to "Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity".

Schneider, Nadine; Lowe, Daniel M; Sayle, Roger A; Landrum, Gregory A.

J Chem Inf Model ; 55(2): 474, 2015 Feb 23.

Article in English | MEDLINE | ID: mdl-25647286

20.

KNIME for reproducible cross-domain analysis of life science data.

Fillbrunn, Alexander; Dietz, Christian; Pfeuffer, Julianus; Rahn, René; Landrum, Gregory A; Berthold, Michael R.

J Biotechnol ; 261: 149-156, 2017 Nov 10.

Article in English | MEDLINE | ID: mdl-28757290

ABSTRACT

Experiments in the life sciences often involve tools from a variety of domains such as mass spectrometry, next-generation sequencing, or image processing. Passing the data between those tools often involves complex scripts for controlling data flow, data transformation, and statistical analysis. Such scripts are not only prone to be platform dependent; they also tend to grow as the experiment progresses and are seldom well documented, which hinders the reproducibility of the experiment. Workflow systems such as KNIME Analytics Platform aim to solve these problems by providing a platform for connecting tools graphically and guaranteeing the same results on different operating systems. As open-source software, KNIME allows scientists and programmers to provide their own extensions to the scientific community. In this review paper we present selected extensions from the life sciences that simplify data exploration, analysis, and visualization and are interoperable due to KNIME's unified data model. Additionally, we name other workflow systems that are commonly used in the life sciences and highlight their similarities and differences to KNIME.

Subjects

Computational Biology, Software, Biological Science Disciplines, High-Throughput Nucleotide Sequencing, Computer-Assisted Image Processing, Mass Spectrometry
