1 - 20 of 32
1.
Nucleic Acids Res ; 2024 Apr 15.
Article En | MEDLINE | ID: mdl-38619040

When preparing biomolecular structures for molecular dynamics simulations, pKa calculations are required to provide at least a representative protonation state at a given pH value. Neglecting this step and adopting the reference protonation states of the amino acid residues in water often leads to wrong electrostatics and nonphysical simulations. Fortunately, several methods have been developed to prepare structures considering the protonation preference of residues in their specific environments (pKa values), and some are even available for online usage. In this work, we present the PypKa server, which allows users to run physics-based, as well as ML-accelerated methods suitable for larger systems, to obtain pKa values, isoelectric points, titration curves, and structures with representative pH-dependent protonation states compatible with commonly used force fields (AMBER, CHARMM, GROMOS). The user may upload a custom structure or submit an identifier code from PDB or UniProtKB. The results for over 200k structures taken from the Protein Data Bank and the AlphaFold DB have been precomputed, and their data can be retrieved without extra calculations. All this information can also be obtained from an application programming interface (API) facilitating its usage and integration into existing pipelines as well as other web services. The web server is available at pypka.org.
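
As a rough illustration of how a pKa value translates into a pH-dependent protonation state, the sketch below evaluates the Henderson-Hasselbalch relation for a single, non-interacting titratable site. It is not the physics-based or ML method used by PypKa; the function names and the example pKa of 6.5 (a histidine-like site) are assumptions chosen for illustration.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of an isolated titratable site that is protonated at a given pH.

    Henderson-Hasselbalch for a single site: f = 1 / (1 + 10^(pH - pKa)).
    Interactions between sites are ignored in this simplified model.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def titration_curve(pka: float, ph_values):
    """Return (pH, protonated fraction) pairs for one site."""
    return [(ph, protonated_fraction(pka, ph)) for ph in ph_values]


if __name__ == "__main__":
    # Hypothetical site with pKa 6.5, sampled at a few pH values.
    for ph, fraction in titration_curve(6.5, [4.0, 6.5, 7.4, 9.0]):
        print(f"pH {ph:4.1f}: protonated fraction {fraction:.3f}")
```

At pH equal to the pKa the site is half protonated, which is why a shifted pKa in a specific protein environment can change the representative protonation state at physiological pH.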

2.
Bioinformatics ; 40(1)2024 01 02.
Article En | MEDLINE | ID: mdl-38175786

SUMMARY: We created bigwig-loader, a data-loader for epigenetic profiles from BigWig files that decompresses and processes information for multiple intervals from multiple BigWig files in parallel. This is an access pattern needed to create training batches for typical machine learning models on epigenetics data. Using a new codec, the decompression can be done on a graphical processing unit (GPU) making it fast enough to create the training batches during training, mitigating the need for saving preprocessed training examples to disk. AVAILABILITY AND IMPLEMENTATION: The bigwig-loader installation instructions and source code can be accessed at https://github.com/pfizer-opensource/bigwig-loader.


Epigenomics , Software , Epigenesis, Genetic
3.
J Chem Inf Model ; 64(7): 2331-2344, 2024 Apr 08.
Article En | MEDLINE | ID: mdl-37642660

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.


Benchmarking , Quantitative Structure-Activity Relationship , Biological Assay , Machine Learning
4.
Chem Res Toxicol ; 2023 Sep 10.
Article En | MEDLINE | ID: mdl-37690056

Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties cannot be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights into using 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity in physical space. We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.

5.
Nat Rev Drug Discov ; 22(11): 895-916, 2023 11.
Article En | MEDLINE | ID: mdl-37697042

Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.


Artificial Intelligence , Biological Products , Humans , Algorithms , Machine Learning , Drug Discovery , Drug Design , Biological Products/pharmacology
7.
Chem Sci ; 14(12): 3235-3246, 2023 Mar 22.
Article En | MEDLINE | ID: mdl-36970100

Automated synthesis planning is key for efficient generative chemistry. Since reactions of given reactants may yield different products depending on conditions such as the chemical context imposed by specific reagents, computer-aided synthesis planning should benefit from recommendations of reaction conditions. Traditional synthesis planning software, however, typically proposes reactions without specifying such conditions, relying on human organic chemists who know the conditions to carry out suggested reactions. In particular, reagent prediction for arbitrary reactions, a crucial aspect of condition recommendation, has been largely overlooked in cheminformatics until recently. Here we employ the Molecular Transformer, a state-of-the-art model for reaction prediction and single-step retrosynthesis, to tackle this problem. We train the model on the US patents dataset (USPTO) and test it on Reaxys to demonstrate its out-of-distribution generalization capabilities. Our reagent prediction model also improves the quality of product prediction: the Molecular Transformer is able to substitute the reagents in the noisy USPTO data with reagents that enable product prediction models to outperform those trained on plain USPTO. This makes it possible to improve upon the state-of-the-art in reaction product prediction on the USPTO MIT benchmark.

9.
J Chem Theory Comput ; 18(8): 5068-5078, 2022 Aug 09.
Article En | MEDLINE | ID: mdl-35837736

Existing computational methods for estimating pKa values in proteins rely on theoretical approximations and lengthy computations. In this work, we use a data set of 6 million theoretically determined pKa shifts to train deep learning models, which are shown to rival the physics-based predictors. These neural networks managed to infer the electrostatic contributions of different chemical groups and learned the importance of solvent exposure and close interactions, including hydrogen bonds. Although trained only using theoretical data, our pKAI+ model displayed the best accuracy in a test set of ∼750 experimental values. Inference times allow speedups of more than 1000× compared to physics-based methods. By combining speed, accuracy, and a reasonable understanding of the underlying physics, our models provide a game-changing solution for fast estimations of macroscopic pKa values from ensembles of microscopic values as well as for many downstream applications such as molecular docking and constant-pH molecular dynamics simulations.


Deep Learning , Molecular Docking Simulation , Molecular Dynamics Simulation , Proteins/chemistry , Static Electricity
10.
Int J Mol Sci ; 22(23)2021 Nov 28.
Article En | MEDLINE | ID: mdl-34884688

In silico protein-ligand binding prediction is an ongoing area of research in computational chemistry and machine learning based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to create an accurate model of the protein-ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform and computer interpretable representations of proteins and ligands. Previous studies in PCM modeling rely on pre-defined, handcrafted feature extraction methods, and many methods use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings that outperform complex, human-engineered representations. Several different embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein-ligand binding activities and find that unsupervised-learned representations significantly outperform handcrafted representations.


Cheminformatics/methods , Proteins/metabolism , Unsupervised Machine Learning , Ligands , Quantitative Structure-Activity Relationship
11.
Chem Sci ; 12(42): 14174-14181, 2021 Nov 03.
Article En | MEDLINE | ID: mdl-34760202

The automatic recognition of the molecular content of a molecule's graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research. Recent advances in neural machine translation enable the auto-encoding of molecular structures in a continuous vector space of fixed size (latent representation) with low reconstruction errors. In this paper, we present a fast and accurate model combining deep convolutional neural network learning from molecule depictions and a pre-trained decoder that translates the latent representation into the SMILES representation of the molecules. This combination allows us to precisely infer a molecular structure from an image. Our rigorous evaluation shows that Img2Mol is able to correctly translate up to 88% of the molecular depictions into their SMILES representation. A pretrained version of Img2Mol is made publicly available on GitHub for non-commercial users.

12.
Bioinformatics ; 38(1): 297-298, 2021 12 22.
Article En | MEDLINE | ID: mdl-34260689

SUMMARY: pKa values of ionizable residues and isoelectric points of proteins provide valuable local and global insights about their structure and function. These properties can be estimated with reasonably good accuracy using Poisson-Boltzmann and Monte Carlo calculations at a considerable computational cost (from some minutes to several hours). pKPDB is a database of over 12 M theoretical pKa values calculated over 120k protein structures deposited in the Protein Data Bank. By providing precomputed pKa and pI values, users can retrieve results instantaneously for their protein(s) of interest while also saving countless hours and resources that would be spent on repeated calculations. Furthermore, there is an ever-growing imbalance between experimental pKa and pI values and the number of resolved structures. This database will complement the experimental and computational data already available and can also provide crucial information regarding buried residues that are under-represented in experimental measurements. AVAILABILITY AND IMPLEMENTATION: Gzipped csv files containing pKa and isoelectric point values can be downloaded from https://pypka.org/pKPDB. To query a single PDB code please use the PypKa free server at https://pypka.org. The pKPDB source code can be found at https://github.com/mms-fcul/pKPDB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Proteins , Software , Databases, Protein
13.
Toxicology ; 458: 152846, 2021 06 30.
Article En | MEDLINE | ID: mdl-34216698

The 3Rs concept, calling for replacement, reduction and refinement of animal experimentation, is receiving increasing attention around the world, and has found its way into legislation, in particular in the European Union. This is in line with continuing high-level efforts of the European Commission to support development and implementation of 3Rs methods. In this respect, the European project called "ONTOX: ontology-driven and artificial intelligence-based repeated dose toxicity testing of chemicals for next generation risk assessment" was recently initiated with the goal to provide a functional and sustainable solution for advancing human risk assessment of chemicals without the use of animals in line with the principles of 21st century toxicity testing and next generation risk assessment. ONTOX will deliver a generic strategy to create new approach methodologies (NAMs) in order to predict systemic repeated dose toxicity effects that, upon combination with tailored exposure assessment, will enable human risk assessment. For proof-of-concept purposes, focus is put on NAMs addressing adversities in the liver, kidneys and developing brain induced by a variety of chemicals. The NAMs each consist of a computational system based on artificial intelligence and are fed by biological, toxicological, chemical and kinetic data. Data are consecutively integrated in physiological maps, quantitative adverse outcome pathway networks and ontology frameworks. Supported by artificial intelligence, data gaps are identified and are filled by targeted in vitro and in silico testing. ONTOX is anticipated to have a deep and long-lasting impact at many levels, in particular by consolidating Europe's world-leading position regarding the development, exploitation, regulation and application of animal-free methods for human risk assessment of chemicals.


Artificial Intelligence , Gene Ontology , Toxicity Tests , Animal Testing Alternatives , Animals , Computer Simulation , European Union , Humans , In Vitro Techniques , Risk Assessment
14.
Bioinformatics ; 37(6): 861-867, 2021 05 05.
Article En | MEDLINE | ID: mdl-33241296

MOTIVATION: Image-based profiling combines high-throughput screening with multiparametric feature analysis to capture the effect of perturbations on biological systems. This technology has attracted increasing interest in the field of plant phenotyping, promising to accelerate the discovery of novel herbicides. However, the extraction of meaningful features from unlabeled plant images remains a big challenge. RESULTS: We describe a novel data-driven approach to find feature representations from plant time-series images in a self-supervised manner by using time as a proxy for image similarity. In the spirit of transfer learning, we first apply an ImageNet-pretrained architecture as a base feature extractor. Then, we extend this architecture with a triplet network to refine and reduce the dimensionality of extracted features by ranking relative similarities between consecutive and non-consecutive time points. Without using any labels, we produce compact, organized representations of plant phenotypes and demonstrate their superior applicability to clustering, image retrieval and classification tasks. Besides time, our approach could be applied using other surrogate measures of phenotype similarity, thus providing a versatile method of general interest to the phenotypic profiling community. AVAILABILITY AND IMPLEMENTATION: Source code is provided in https://github.com/bayer-science-for-a-better-life/plant-triplet-net. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Plants , Software , Cluster Analysis
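
The idea in entry 14 of using time as a proxy for image similarity can be sketched with a plain triplet margin loss: an anchor frame is pulled toward a temporally adjacent frame and pushed away from a distant one. The code below is a minimal pure-Python sketch under that assumption, not the authors' implementation (which is in the linked repository); the function names, the `min_gap` parameter and the margin value are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def sample_time_triplet(n_frames, min_gap=2, rng=random):
    """Pick (anchor, positive, negative) frame indices from one time series.

    The positive is a consecutive time point; the negative is at least
    `min_gap` steps away (hypothetical sampling rule for illustration).
    """
    if min_gap < 2 or n_frames < 2 * min_gap:
        raise ValueError("need min_gap >= 2 and n_frames >= 2 * min_gap")
    anchor = rng.randrange(n_frames)
    positives = [i for i in (anchor - 1, anchor + 1) if 0 <= i < n_frames]
    negatives = [i for i in range(n_frames) if abs(i - anchor) >= min_gap]
    return anchor, rng.choice(positives), rng.choice(negatives)


if __name__ == "__main__":
    a, p, n = sample_time_triplet(10, rng=random.Random(0))
    print("anchor, positive, negative frame indices:", a, p, n)
    print("loss on toy embeddings:", triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

In a real pipeline the vectors passed to `triplet_loss` would be the outputs of the feature extractor, and the loss would be minimized with a deep learning framework rather than evaluated by hand.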
15.
Bioinformatics ; 36(13): 4093-4094, 2020 07 01.
Article En | MEDLINE | ID: mdl-32369561

SUMMARY: Optimizing small molecules in a drug discovery project is a notoriously difficult task as multiple molecular properties have to be considered and balanced at the same time. In this work, we present our novel interactive in silico compound optimization platform termed grünifai to support the ideation of the next generation of compounds under the constraints of a multiparameter objective. grünifai integrates adjustable in silico models, a continuous representation of the chemical space, a scalable particle swarm optimization algorithm and the possibility to actively steer the compound optimization through providing feedback on generated intermediate structures. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are freely available under an MIT license and are openly available on GitHub (https://github.com/jrwnter/gruenifai). The backend, including the optimization method and distribution on multiple GPU nodes is written in Python 3. The frontend is written in ReactJS.


Algorithms , Software , Computer Simulation , Documentation , Research Design
16.
Nat Commun ; 11(1): 10, 2020 01 03.
Article En | MEDLINE | ID: mdl-31900408

Finding new molecules with a desired biological activity is an extremely difficult task. In this context, artificial intelligence and generative models have been used for molecular de novo design and compound optimization. Herein, we report a generative model that bridges systems biology and molecular design, conditioning a generative adversarial network with transcriptomic data. By doing so, we can automatically design molecules that have a high probability to induce a desired transcriptomic profile. As long as the gene expression signature of the desired state is provided, this model is able to design active-like molecules for desired targets without any previous target annotation of the training compounds. Molecules designed by this model are more similar to active compounds than the ones identified by similarity of gene expression signatures. Overall, this method represents an alternative approach to bridge chemistry and biology in the long and difficult road of drug discovery.


Artificial Intelligence , Drug Design , Pharmaceutical Preparations/chemical synthesis , Neural Networks, Computer , Pharmaceutical Preparations/chemistry , Transcriptome
17.
Chem Sci ; 11(38): 10378-10389, 2020 Sep 11.
Article En | MEDLINE | ID: mdl-34094299

Protecting molecular structures from disclosure against external parties is of great relevance for industrial and private associations, such as pharmaceutical companies. Within the framework of external collaborations, it is common to exchange datasets by encoding the molecular structures into descriptors. Molecular fingerprints such as the extended-connectivity fingerprints (ECFPs) are frequently used for such an exchange, because they typically perform well on quantitative structure-activity relationship tasks. ECFPs are often considered to be non-invertible due to the way they are computed. In this paper, we present a fast reverse-engineering method to deduce the molecular structure given revealed ECFPs. Our method includes the Neuraldecipher, a neural network model that predicts a compact vector representation of compounds, given ECFPs. We then utilize another pre-trained model to retrieve the molecular structure as SMILES representation. We demonstrate that our method is able to reconstruct molecular structures to some extent, and improves, when ECFPs with larger fingerprint sizes are revealed. For example, given ECFP count vectors of length 4096, we are able to correctly deduce up to 69% of molecular structures on a validation set (112 K unique samples) with our method.

18.
Molecules ; 25(1)2019 Dec 21.
Article En | MEDLINE | ID: mdl-31877719

Simple physico-chemical properties, like logD, solubility, or melting point, can reveal a great deal about how a compound under development might later behave. These data are typically measured for most compounds in drug discovery projects in a medium throughput fashion. Collecting and assembling all the Bayer in-house data related to these properties allowed us to apply powerful machine learning techniques to predict the outcome of those assays for new compounds. In this paper, we report our finding that, especially for predicting physicochemical ADMET endpoints, a multitask graph convolutional approach appears a highly competitive choice. For seven endpoints of interest, we compared the performance of that approach to fully connected neural networks and different single task models. The new model shows increased predictive performance compared to previous modeling methods and will allow early prioritization of compounds even before they are synthesized. In addition, our model follows the generalized solubility equation without being explicitly trained under this constraint.


Drug Discovery/methods , Pharmaceutical Preparations/chemistry , Algorithms , Machine Learning , Models, Chemical , Neural Networks, Computer , Pharmaceutical Preparations/chemical synthesis , Quantitative Structure-Activity Relationship
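
For readers unfamiliar with the generalized solubility equation mentioned in entry 18, a minimal sketch follows. It uses the commonly cited Yalkowsky form, log S = 0.5 - 0.01 (MP - 25) - log P, with S the aqueous solubility in mol/L, MP the melting point in degrees Celsius and log P the octanol-water partition coefficient; the melting-point term is dropped for compounds that are liquid at 25 degrees Celsius. The function name and example values are assumptions, and this is not the model described in the abstract.

```python
def gse_log_solubility(log_p: float, melting_point_c: float) -> float:
    """Estimate log10 aqueous solubility (mol/L) with the generalized
    solubility equation: log S = 0.5 - 0.01 * (MP - 25) - log P.

    The melting-point term is set to zero for compounds melting below 25 C.
    """
    mp_term = 0.01 * max(melting_point_c - 25.0, 0.0)
    return 0.5 - mp_term - log_p


if __name__ == "__main__":
    # Hypothetical compound: log P = 2.0, melting point = 125 C.
    print(gse_log_solubility(2.0, 125.0))
```

The equation makes the trade-off explicit: each additional log P unit and each additional 100 degrees of melting point cost roughly one log unit of solubility.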
19.
Chem Sci ; 10(34): 8016-8024, 2019 Sep 14.
Article En | MEDLINE | ID: mdl-31853357

One of the main challenges in small molecule drug discovery is finding novel chemical compounds with desirable properties. In this work, we propose a novel method that combines in silico prediction of molecular properties such as biological activity or pharmacokinetics with an in silico optimization algorithm, namely Particle Swarm Optimization. Our method takes a starting compound as input and proposes new molecules with more desirable (predicted) properties. It navigates a machine-learned continuous representation of a drug-like chemical space guided by a defined objective function. The objective function combines multiple in silico prediction models, defined desirability ranges and substructure constraints. We demonstrate that our proposed method is able to consistently find more desirable molecules for the studied tasks in relatively short time. We hope that our method can support medicinal chemists in accelerating and improving the lead optimization process.
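
The optimization loop named in this abstract, Particle Swarm Optimization, can be sketched in a few lines. The version below is a generic textbook PSO minimizing a toy objective in a continuous space; it is not the authors' code, and the hyperparameters (`inertia`, `c_personal`, `c_global`), the bounds and the quadratic objective are assumptions chosen for illustration. In the paper's setting the search space would be the learned molecular representation and the objective would combine property predictions.

```python
import random


def particle_swarm_minimize(objective, dim, n_particles=20, n_iters=100,
                            inertia=0.7, c_personal=1.5, c_global=1.5,
                            bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` over a box with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=personal_score.__getitem__)
    global_best = personal_best[best_idx][:]
    global_score = personal_score[best_idx]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to own best, pull to swarm best.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                                    + c_global * r2 * (global_best[d] - positions[i][d]))
                # Move and clamp to the search box.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < personal_score[i]:
                personal_score[i] = score
                personal_best[i] = positions[i][:]
                if score < global_score:
                    global_score = score
                    global_best = positions[i][:]
    return global_best, global_score


if __name__ == "__main__":
    # Toy objective: squared distance to the point (1, 1).
    best, score = particle_swarm_minimize(lambda x: sum((xi - 1.0) ** 2 for xi in x), dim=2)
    print(best, score)
```

Because PSO only needs objective values and no gradients, it can wrap arbitrary black-box scoring functions, which is one plausible reason it suits objectives built from several independent prediction models.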

20.
Chem Sci ; 10(6): 1692-1701, 2019 Feb 14.
Article En | MEDLINE | ID: mdl-30842833

There has been a recent surge of interest in using machine learning across chemical space in order to predict properties of molecules or design molecules and materials with the desired properties. Most of this work relies on defining clever feature representations, in which the chemical graph structure is encoded in a uniform way such that predictions across chemical space can be made. In this work, we propose to exploit the powerful ability of deep neural networks to learn a feature representation from low-level encodings of a huge corpus of chemical structures. Our model borrows ideas from neural machine translation: it translates between two semantically equivalent but syntactically different representations of molecular structures, compressing the meaningful information both representations have in common in a low-dimensional representation vector. Once the model is trained, this representation can be extracted for any new molecule and utilized as a descriptor. In fair benchmarks with respect to various human-engineered molecular fingerprints and graph-convolution models, our method shows competitive performance in modelling quantitative structure-activity relationships in all analysed datasets. Additionally, we show that our descriptor significantly outperforms all baseline molecular fingerprints in two ligand-based virtual screening tasks. Overall, our descriptors show the most consistent performances in all experiments. The continuity of the descriptor space and the existence of the decoder that permits deducing a chemical structure from an embedding vector allow for exploration of the space and open up new opportunities for compound optimization and idea generation.

...