Results 1 - 20 of 117
1.
Brief Bioinform ; 23(5), 2022 09 20.
Article in English | MEDLINE | ID: mdl-35880623

ABSTRACT

Adoption of recently developed methods from machine learning has given rise to the creation of drug-discovery knowledge graphs (KGs) that utilize the interconnected nature of the domain. Graph-based modelling of the data, combined with KG embedding (KGE) methods, is promising, as it provides a more intuitive representation and is suitable for inference tasks such as predicting missing links. One common application is to produce ranked lists of genes for a given disease, where the rank is based on the perceived likelihood of association between the gene and the disease. It is thus critical that these predictions are not only pertinent but also biologically meaningful. However, KGs can be biased either directly due to the underlying data sources that are integrated or due to modelling choices in the construction of the graph, one consequence of which is that certain entities can become topologically overrepresented. We demonstrate the effect of these inherent structural imbalances, resulting in densely connected entities being highly ranked no matter the context. We provide support for this observation across different datasets, models, and predictive tasks. Further, we present various graph perturbation experiments which lend further support to the observation that KGE models can be influenced more by the frequency of entities than by any biological information encoded within the relations. Our results highlight the importance of data modelling choices and emphasize the need for practitioners to be mindful of these issues when interpreting model outputs and during KG composition.
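
The degree bias described in this abstract can be checked with a simple diagnostic: count how often each entity occurs in the graph and correlate that count with the score a model assigns. The sketch below is only an illustration under stated assumptions; the triples, entity names, and scores are hypothetical toy values, and the helper functions are not taken from the paper.

```python
from collections import Counter

def entity_degrees(triples):
    """Count how many triples each entity appears in, as head or tail."""
    degree = Counter()
    for head, _relation, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    return degree

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical toy knowledge graph (head, relation, tail).
triples = [
    ("GENE_A", "associated_with", "DISEASE_1"),
    ("GENE_A", "associated_with", "DISEASE_2"),
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_B", "associated_with", "DISEASE_2"),
    ("GENE_C", "associated_with", "DISEASE_1"),
]
# Hypothetical link-prediction scores for each gene against one query disease.
scores = {"GENE_A": 0.91, "GENE_B": 0.55, "GENE_C": 0.20}

degree = entity_degrees(triples)
genes = sorted(scores)
rho = spearman([degree[g] for g in genes], [scores[g] for g in genes])
print(f"Spearman correlation between degree and score: {rho:.2f}")
```

A correlation close to 1 across many unrelated query diseases would be one hint that rankings follow connectivity rather than biology; it is not proof on its own.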


Subjects
Machine Learning , Pattern Recognition, Automated , Knowledge
2.
Brief Bioinform ; 23(6), 2022 11 19.
Article in English | MEDLINE | ID: mdl-36151740

ABSTRACT

Drug discovery and development is a complex and costly process. Machine learning approaches are being investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Of these, approaches that use knowledge graphs (KGs) show promise in many tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritization. In a drug discovery KG, crucial elements including genes, diseases and drugs are represented as entities, while relationships between them indicate an interaction. However, to construct high-quality KGs, suitable data are required. In this review, we detail publicly available sources suitable for use in constructing drug discovery focused KGs. We aim to help guide machine learning and KG practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The datasets are selected via strict criteria, categorized according to the primary type of information they contain, and considered in terms of what information could be extracted to build a KG. We then present a comparative analysis of existing public drug discovery KGs and an evaluation of selected motivating case studies from the literature. Additionally, we raise numerous and unique challenges and issues associated with the domain and its datasets, while also highlighting key future research directions. We hope this review will motivate the use of KGs in solving key and emerging questions in the drug discovery domain.


Subjects
Machine Learning , Pattern Recognition, Automated , Drug Discovery , Knowledge , Information Storage and Retrieval
3.
J Chem Inf Model ; 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38950185

ABSTRACT

Machine-learning (ML) and deep-learning (DL) approaches to predict the molecular properties of small molecules are increasingly deployed within the design-make-test-analyze (DMTA) drug design cycle. Despite this uptake, there are only a few automated packages to aid their development and deployment that also support uncertainty estimation, model explainability, and other key aspects of model usage. This represents a key unmet need within the field, and the large number of molecular representations and algorithms (and associated parameters) means it is nontrivial to robustly optimize, evaluate, reproduce, and deploy models. Here, we present QSARtuna, a molecule property prediction modeling pipeline, written in Python and utilizing the Optuna, Scikit-learn, RDKit, and ChemProp packages, which enables the efficient and automated comparison between molecular representations and machine learning models. The platform was developed with the increasingly important aspects of model uncertainty quantification and explainability considered by design. We describe our framework and provide illustrative examples to demonstrate the capability of the software when applied to simple molecular property, reaction/reactivity prediction, and DNA encoded library enrichment classification. We hope that the release of QSARtuna will further spur innovation in automatic ML modeling and provide a platform for education of best practices in molecular property modeling. The code for the QSARtuna framework is made freely available via GitHub.

4.
J Chem Inf Model ; 64(7): 2331-2344, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37642660

ABSTRACT

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.


Subjects
Benchmarking , Quantitative Structure-Activity Relationship , Biological Assay , Machine Learning
5.
Bioinformatics ; 38(21): 4951-4952, 2022 10 31.
Article in English | MEDLINE | ID: mdl-36073898

ABSTRACT

SUMMARY: We present Icolos, a workflow manager written in Python as a tool for automating complex structure-based workflows for drug design. Icolos can be used as a standalone tool, for example in virtual screening campaigns, or can be used in conjunction with deep learning-based molecular generation facilitated for example by REINVENT, a previously published molecular de novo design package. In this publication, we focus on the internal structure and general capabilities of Icolos, using molecular docking experiments as an illustrative example. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at https://github.com/MolecularAI/Icolos under the Apache 2.0 license. Tutorial notebooks containing minimal working examples can be found at https://github.com/MolecularAI/IcolosCommunity. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug Design , Software , Workflow , Molecular Docking Simulation
6.
J Chem Inf Model ; 63(7): 1841-1846, 2023 04 10.
Article in English | MEDLINE | ID: mdl-36959737

ABSTRACT

We introduce the AiZynthTrain Python package for training synthesis models in a robust, reproducible, and extensible way. It contains two pipelines that create a template-based one-step retrosynthesis model and a RingBreaker model that can be straightforwardly integrated in retrosynthesis software. We train such models on the publicly available reaction data set from the U.S. Patent and Trademark Office (USPTO), and these are the first retrosynthesis models created in a completely reproducible end-to-end fashion, starting with the original reaction data source and ending with trained machine-learning models. In particular, we show that employing new heuristics implemented in the pipeline greatly improves the ability of the RingBreaker model for disconnecting ring systems. Furthermore, we demonstrate the robustness of the pipeline by training on a more diverse but proprietary data set. We envisage that this framework will be extended with other synthesis models in the future.


Subjects
Machine Learning , Software
7.
J Chem Inf Model ; 63(4): 1099-1113, 2023 02 27.
Article in English | MEDLINE | ID: mdl-36758178

ABSTRACT

Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
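
The headline figure in this abstract is a root-mean-square error (RMSE) in log units. As a reminder of what that number measures, here is a minimal sketch of the calculation; the solubility values are made-up illustrations, not data from the challenge.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical log-solubility values (log units), for illustration only.
observed_log_s = [-2.0, -3.5, -4.1, -1.2]
predicted_log_s = [-2.5, -3.0, -5.0, -1.0]

error = rmse(predicted_log_s, observed_log_s)
print(f"RMSE: {error:.2f} log units")
```

Because solubility is expressed on a logarithmic scale, an RMSE of 1 log unit corresponds to a typical error of roughly a factor of ten in the underlying concentration.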


Subjects
Deep Learning , Solubility , Neural Networks, Computer , Machine Learning , Algorithms
8.
J Chem Inf Model ; 62(9): 2093-2100, 2022 05 09.
Article in English | MEDLINE | ID: mdl-34757744

ABSTRACT

Here, we explore the impact of different graph traversal algorithms on molecular graph generation. We do this by training a graph-based deep molecular generative model to build structures using a node order determined via either a breadth- or depth-first search algorithm. What we observe is that using a breadth-first traversal leads to better coverage of training data features compared to a depth-first traversal. We have quantified these differences using a variety of metrics on a data set of natural products. These metrics include percent validity, molecular coverage, and molecular shape. We also observe that by using either a breadth- or depth-first traversal it is possible to overtrain the generative models, at which point the results with either graph traversal algorithm are identical.
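
To make the two node orders concrete, the sketch below traverses a small branched graph breadth-first and depth-first. The adjacency list is a hypothetical five-atom skeleton, and the functions are generic textbook traversals, not the generative model's own code.

```python
from collections import deque

def bfs_order(adjacency, start):
    """Visit nodes level by level, nearest neighbors first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

def dfs_order(adjacency, start):
    """Follow one branch to its end before backtracking."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(adjacency[node]):
            if neighbor not in seen:
                stack.append(neighbor)
    return order

# Hypothetical branched skeleton: atom 0 bonded to 1 and 2, 1 to 3, 2 to 4.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}

print("BFS order:", bfs_order(adjacency, 0))
print("DFS order:", dfs_order(adjacency, 0))
```

In the breadth-first order, both branches are started before either is finished, whereas the depth-first order completes one branch before touching the other; this is the difference in build order that the abstract compares.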


Subjects
Algorithms
9.
J Chem Inf Model ; 62(20): 4863-4872, 2022 Oct 24.
Article in English | MEDLINE | ID: mdl-36219571

ABSTRACT

Machine learning provides effective computational tools for exploring the chemical space via deep generative models. Here, we propose a new reinforcement learning scheme to fine-tune graph-based deep generative models for de novo molecular design tasks. We show how our computational framework can successfully guide a pretrained generative model toward the generation of molecules with a specific property profile, even when such molecules are not present in the training set and unlikely to be generated by the pretrained model. We explored the following tasks: generating molecules of decreasing/increasing size, increasing drug-likeness, and increasing bioactivity. Using the proposed approach, we achieve a model which generates diverse compounds with predicted DRD2 activity for 95% of sampled molecules, outperforming previously reported methods on this metric.


Subjects
Drug Design , Machine Learning
10.
J Chem Inf Model ; 62(9): 2046-2063, 2022 05 09.
Article in English | MEDLINE | ID: mdl-34460269

ABSTRACT

Because of the strong relationship between the desired molecular activity and its structural core, the screening of focused, core-sharing chemical libraries is a key step in lead optimization. Despite the plethora of current research focused on in silico methods for molecule generation, to our knowledge, no tool capable of designing such libraries has been proposed. In this work, we present a novel tool for de novo drug design called LibINVENT. It is capable of rapidly proposing chemical libraries of compounds sharing the same core while maximizing a range of desirable properties. To further help the process of designing focused libraries, the user can list specific chemical reactions that can be used for the library creation. LibINVENT is therefore a flexible tool for generating virtual chemical libraries for lead optimization in a broad range of scenarios. Additionally, the shared core ensures that the compounds in the library are similar, possess desirable properties, and can also be synthesized under the same or similar conditions. The LibINVENT code is freely available in our public repository at https://github.com/MolecularAI/Lib-INVENT. The code necessary for data preprocessing is further available at: https://github.com/MolecularAI/Lib-INVENT-dataset.


Subjects
Drug Design , Small Molecule Libraries , Small Molecule Libraries/chemistry
11.
J Chem Inf Model ; 61(8): 3899-3907, 2021 08 23.
Article in English | MEDLINE | ID: mdl-34342428

ABSTRACT

We present a novel algorithm to compute the distance between synthetic routes based on tree edit distances. Such distances can be used to cluster synthesis routes generated using a retrosynthesis prediction tool. We show that the clustering of selected routes from a retrosynthesis analysis is performed in less than 10 s on average and only constitutes seven percent of the total time (prediction + clustering). Furthermore, we are able to show that representative routes from each cluster can be used to reduce the set of predicted routes. Finally, we show with a number of examples that the algorithm gives intuitive clusters that can be easily rationalized and that the routes in a cluster tend to use similar chemistry. The algorithm is included in the latest version of open-source AiZynthFinder software (https://github.com/MolecularAI/aizynthfinder) and as a separate package (https://github.com/MolecularAI/route-distances).
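
The idea of a tree edit distance between routes can be illustrated with the classic recursion for ordered, labelled trees with unit costs. This is a simplified sketch under stated assumptions: it is not the algorithm or code shipped in AiZynthFinder or route-distances, and the route trees and node labels are hypothetical.

```python
from functools import lru_cache

# A tree is a tuple (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.

def forest_size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + forest_size(children) for _label, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Minimum number of node insertions, deletions, and relabelings."""
    if not f and not g:
        return 0
    if not f:
        return forest_size(g)  # insert every node of g
    if not g:
        return forest_size(f)  # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the root of the rightmost tree in f; its children move up.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the root of the rightmost tree in g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if they differ.
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, match)

def tree_distance(a, b):
    return forest_distance((a,), (b,))

# Hypothetical routes: the target is made from two starting materials,
# either directly or via one of two different intermediates.
leaves = (("start_A", ()), ("start_B", ()))
route_via_1 = ("target", (("intermediate_1", leaves),))
route_via_2 = ("target", (("intermediate_2", leaves),))
route_direct = ("target", leaves)

print(tree_distance(route_via_1, route_via_2))   # one relabeling
print(tree_distance(route_via_1, route_direct))  # one deletion
```

A matrix of such pairwise distances is the kind of input a standard clustering method would need to group similar routes.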


Subjects
Software , Algorithms , Cluster Analysis
12.
J Chem Inf Model ; 61(6): 2572-2581, 2021 06 28.
Article in English | MEDLINE | ID: mdl-34015916

ABSTRACT

In recent years, deep molecular generative models have emerged as promising methods for de novo molecular design. Thanks to the rapid advance of deep learning techniques, deep learning architectures such as recurrent neural networks, variational autoencoders, and adversarial networks have been successfully employed for constructing generative models. Recently, quite a few metrics have been proposed to evaluate these deep generative models. However, many of these metrics cannot evaluate the chemical space coverage of sampled molecules. This work presents a novel and complementary metric for evaluating deep molecular generative models. The metric is based on the chemical space coverage of a reference dataset, GDB-13. The performance of seven different molecular generative models was compared by calculating what fraction of the structures, ring systems, and functional groups could be reproduced from the largely unseen reference set when using only a small fraction of GDB-13 for training. The results show that the performance of the generative models studied varies significantly using the benchmark metrics introduced herein, such that the generalization capabilities of the generative models can be clearly differentiated. In addition, the coverages of GDB-13 ring systems and functional groups were compared between the models. Our study provides a useful new metric that can be used for evaluating and comparing generative models.
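
At its core, a coverage metric of this kind is the fraction of a reference set that appears among the sampled molecules. The sketch below shows that calculation on plain strings; it assumes the structures have already been converted to a canonical form (for example canonical SMILES) so that string equality means structural identity, and the molecule lists are hypothetical.

```python
def coverage(generated, reference):
    """Fraction of distinct reference items that occur among the generated items."""
    reference_set = set(reference)
    if not reference_set:
        raise ValueError("reference set must not be empty")
    recovered = set(generated) & reference_set
    return len(recovered) / len(reference_set)

# Hypothetical canonical structure strings, for illustration only.
reference = ["CCO", "CCN", "CCC", "C1CC1"]
generated = ["CCO", "CCO", "CCC", "CCCl"]  # duplicates and off-reference samples

print(f"Coverage: {coverage(generated, reference):.2f}")
```

Duplicates in the sample do not raise the score, and molecules outside the reference set are simply ignored, so the value only rewards breadth across the reference space. The same function applies unchanged to sets of ring systems or functional groups.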


Subjects
Neural Networks, Computer
13.
J Chem Inf Model ; 61(3): 1444-1456, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33661004

ABSTRACT

The understanding of the mechanism-of-action (MoA) of compounds and the prediction of potential drug targets play an important role in small-molecule drug discovery. The aim of this work was to compare chemical and cell morphology information for bioactivity prediction. The comparison was performed using bioactivity data from the ExCAPE database, image data (in the form of CellProfiler features) from the Cell Painting data set (the largest publicly available data set of cell images with ∼30,000 compound perturbations), and extended connectivity fingerprints (ECFPs) using the multitask Bayesian matrix factorization (BMF) approach Macau. We found that the BMF Macau and random forest (RF) performance were overall similar when ECFPs were used as compound descriptors. However, BMF Macau outperformed RF in 159 out of 224 targets (71%) when image data were used as compound information. Using BMF Macau, 100 (corresponding to about 45%) and 90 (about 40%) of the 224 targets were predicted with high predictive performance (AUC > 0.8) with ECFP data and image data as side information, respectively. There were targets better predicted by image data as side information, such as β-catenin, and others better predicted by fingerprint-based side information, such as proteins belonging to the G-protein-Coupled Receptor 1 family, which could be rationalized from the underlying data distributions in each descriptor domain. In conclusion, both cell morphology changes and chemical structure information contain information about compound bioactivity, which is also partially complementary, and can hence contribute to in silico MoA analysis.
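
The AUC threshold used in this abstract refers to the area under the ROC curve, which equals the probability that a randomly chosen active compound is scored above a randomly chosen inactive one. A minimal sketch of that pairwise definition follows; the scores and labels are hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))

# Hypothetical predictions for one target: 1 = active, 0 = inactive.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]

auc = roc_auc(scores, labels)
print(f"AUC: {auc:.2f}, high performance: {auc > 0.8}")
```

Counting targets with AUC above 0.8, as the abstract does, then amounts to applying this function per target and tallying the results.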


Subjects
Drug Discovery , Proteins , Bayes Theorem , Computer Simulation , Databases, Factual
14.
Bioorg Med Chem ; 44: 116308, 2021 08 15.
Article in English | MEDLINE | ID: mdl-34280849

ABSTRACT

We have demonstrated the utility of a 3D shape and pharmacophore similarity scoring component in molecular design with a deep generative model trained with reinforcement learning. Using dopamine receptor type 2 (DRD2) as an example and its antagonist haloperidol 1 as a starting point in a ligand-based design context, we have shown in a retrospective study that a 3D similarity-enabled generative model can discover new leads in the absence of any other information. It can be efficiently used for scaffold hopping and generation of novel series. 3D similarity-based models were compared against 2D QSAR-based models, indicating a significant degree of orthogonality between the generated outputs, with the former producing a more diverse output. In addition, when the two scoring components are combined for training of the generative model, the result is a more efficient exploration of desirable chemical space compared to the individual components.


Subjects
Drug Design , Haloperidol/pharmacology , Receptors, Dopamine D2/metabolism , Haloperidol/chemical synthesis , Haloperidol/chemistry , Humans , Ligands , Models, Molecular , Molecular Structure , Quantitative Structure-Activity Relationship , Structure-Activity Relationship
15.
J Chem Inf Model ; 60(10): 4546-4559, 2020 10 26.
Article in English | MEDLINE | ID: mdl-32865408

ABSTRACT

In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a probability of binding to a protein target is not yet satisfactorily addressed. In this study, we compared the performance of three such methods, namely, Platt scaling (PS), isotonic regression (IR), and Venn-ABERS predictors (VA), in calibrating prediction scores obtained from ligand-target prediction comprising the Naïve Bayes, support vector machines, and random forest (RF) algorithms. Calibration quality was assessed on bioactivity data available at AstraZeneca for 40 million data points (compound-target pairs) across 2112 targets and performance was assessed using stratified shuffle split (SSS) and leave 20% of scaffolds out (L20SO) validation. VA achieved the best calibration performances across all machine learning algorithms and cross validation methods tested and also the lowest (best) Brier score loss (mean squared difference between the outputted probability estimates assigned to a compound and the actual outcome). In comparison, the PS and IR methods can actually degrade the assigned probability estimates, particularly for the RF for SSS and during L20SO. Sphere exclusion, a method to sample additional (putative) inactive compounds, was shown to inflate the overall Brier score loss performance, through the artificial requirement for inactive molecules to be dissimilar to active compounds, but was shown to result in overconfident estimators. VA was able to successfully calibrate the probability estimates for even small calibration sets. The multiprobability values (lower and upper probability boundary intervals) were shown to produce large discordance for test set molecules that are neither very similar nor very dissimilar to the active training set, which were hence difficult to predict, suggesting that multiprobability discordance can be used as an estimate for target prediction uncertainty. 
Overall, we were able to show in this work that VA scaling of target prediction models is able to improve probability estimates in all testing instances and is currently being applied for in-house approaches.
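
The abstract defines the Brier score loss as the mean squared difference between the predicted probability and the actual outcome, and uses the gap between the lower and upper multiprobability bounds as an uncertainty signal. Both quantities can be sketched in a few lines. The probabilities, outcomes, and bounds below are hypothetical, and the Venn-ABERS calibration step that would produce such bounds is not implemented here.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(outcomes)

def interval_width(lower, upper):
    """Discordance between the lower and upper probability bounds."""
    return upper - lower

# Hypothetical calibrated probabilities and true activity labels.
probabilities = [0.9, 0.2, 0.6, 0.1]
outcomes = [1, 0, 1, 0]
score = brier_score(probabilities, outcomes)
print(f"Brier score loss: {score:.3f} (lower is better)")

# Hypothetical (lower, upper) bounds for two test compounds.
bounds = [(0.05, 0.10), (0.30, 0.70)]
widths = [interval_width(lo, hi) for lo, hi in bounds]
print("Interval widths:", [round(w, 2) for w in widths])
```

In this toy example the second compound has the wider interval, which under the abstract's interpretation would mark its prediction as the less certain of the two.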


Subjects
Machine Learning , Support Vector Machine , Bayes Theorem , Ligands , Probability
16.
J Chem Inf Model ; 60(6): 2977-2988, 2020 06 22.
Article in English | MEDLINE | ID: mdl-32311268

ABSTRACT

The potential to predict solvation free energies (SFEs) in any solvent using a machine learning (ML) model based on thermodynamic output extracted exclusively from 3D-RISM simulations in water is investigated. The multisolvent models take into account both the solute and solvent description and offer the possibility to predict SFEs of any solute in any solvent with root mean squared errors of less than 1 kcal/mol. Validations that involve exclusion of fractions or clusters of the solutes or solvents exemplify the models' capability to predict SFEs of novel solutes and solvents with diverse chemical profiles. In addition to being predictive, our models can identify the solute and solvent features that influence SFE predictions. Furthermore, using 3D-RISM hydration thermodynamic output to predict SFEs in any organic solvent reduces the need to run 3D-RISM simulations in all these solvents. Altogether, our multisolvent models for SFE predictions that take advantage of the solvation effects are expected to have an impact in the property prediction space.


Subjects
Water , Entropy , Solutions , Solvents , Thermodynamics
17.
J Chem Inf Model ; 60(12): 5918-5922, 2020 12 28.
Article in English | MEDLINE | ID: mdl-33118816

ABSTRACT

In the past few years, we have witnessed a renaissance of the field of molecular de novo drug design. The advancements in deep learning and artificial intelligence (AI) have triggered an avalanche of ideas on how to translate such techniques to a variety of domains including the field of drug design. A range of architectures have been devised to find the optimal way of generating chemical compounds by using either graph- or string (SMILES)-based representations. With this application note, we aim to offer the community a production-ready tool for de novo design, called REINVENT. It can be effectively applied on drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating the chemical space. It can facilitate the idea generation process by bringing to the researcher's attention the most promising compounds. REINVENT's code is publicly available at https://github.com/MolecularAI/Reinvent.


Subjects
Artificial Intelligence , Drug Design , Drug Discovery
18.
Int J Mol Sci ; 21(24), 2020 Dec 15.
Article in English | MEDLINE | ID: mdl-33334026

ABSTRACT

Non-alcoholic fatty liver disease (NAFLD) has a large impact on global health. At the onset of disease, NAFLD is characterized by hepatic steatosis defined by the accumulation of triglycerides stored as lipid droplets. Developing therapeutics against NAFLD and progression to non-alcoholic steatohepatitis (NASH) remains a high priority in the medical and scientific community. Drug discovery programs to identify potential therapeutic compounds have supported high throughput/high-content screening of in vitro human-relevant models of NAFLD to accelerate development of efficacious anti-steatotic medicines. Human induced pluripotent stem cell (hiPSC) technology is a powerful platform for disease modeling and therapeutic assessment for cell-based therapy and personalized medicine. In this study, we applied AstraZeneca's chemogenomic library, hiPSC technology and multiplexed high content screening to identify compounds that significantly reduced intracellular neutral lipid content. Among 13,000 compounds screened, we identified hits that protect against hiPSC-derived hepatic endoplasmic reticulum stress-induced steatosis by a mechanism of action including inhibition of the cyclin D3-cyclin-dependent kinase 2-4 (CDK2-4)/CCAAT-enhancer-binding proteins (C/EBPα)/diacylglycerol acyltransferase 2 (DGAT2) pathway, followed by alteration of the expression of downstream genes related to NAFLD. These findings demonstrate that our phenotypic platform provides a reliable approach in drug discovery, to identify novel drugs for treatment of fatty liver disease as well as to elucidate their underlying mechanisms.


Subjects
Drug Screening Assays, Antitumor , Endoplasmic Reticulum Stress/drug effects , Hepatocytes/cytology , Hepatocytes/drug effects , Hepatocytes/metabolism , Induced Pluripotent Stem Cells/cytology , Lipid Metabolism/drug effects , Signal Transduction/drug effects , Animals , CCAAT-Enhancer-Binding Proteins/metabolism , Computational Biology/methods , Cyclin-Dependent Kinase 2/metabolism , Diacylglycerol O-Acyltransferase/metabolism , Drug Screening Assays, Antitumor/methods , High-Throughput Nucleotide Sequencing , Humans , Lipid Droplets/metabolism , Liver/drug effects , Liver/metabolism , Liver/pathology , Protein Kinase Inhibitors/pharmacology
19.
J Mol Cell Cardiol ; 127: 204-214, 2019 02.
Article in English | MEDLINE | ID: mdl-30597148

ABSTRACT

Over 5 million people in the United States suffer from heart failure, due to the limited ability to regenerate functional cardiac tissue. One potential therapeutic strategy is to enhance proliferation of resident cardiomyocytes. However, phenotypic screening for therapeutic agents is challenged by the limited ability of conventional markers to discriminate between cardiomyocyte proliferation and endoreplication (e.g. polyploidy and multinucleation). Here, we developed a novel assay that combines automated live-cell microscopy and image processing algorithms to discriminate between proliferation and endoreplication by quantifying changes in the number of nuclei, changes in the number of cells, binucleation, and nuclear DNA content. We applied this assay to further prioritize hits from a primary screen for DNA synthesis, identifying 30 compounds that enhance proliferation of human induced pluripotent stem cell-derived cardiomyocytes. Among the most active compounds from the phenotypic screen are clinically approved L-type calcium channel blockers from multiple chemical classes whose activities were confirmed across different sources of human induced pluripotent stem cell-derived cardiomyocytes. Identification of compounds that stimulate human cardiomyocyte proliferation may provide new therapeutic strategies for heart failure.


Subjects
Calcium Channels, L-Type/metabolism , Induced Pluripotent Stem Cells/cytology , Myocytes, Cardiac/cytology , Myocytes, Cardiac/metabolism , Cell Proliferation , DNA/biosynthesis , Humans , Image Processing, Computer-Assisted , Phenotype , Ploidies
20.
Bioinformatics ; 34(1): 72-79, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28961699

ABSTRACT

Motivation: In silico approaches often fail to utilize bioactivity data available for orthologous targets due to insufficient evidence highlighting the benefit of such an approach. Deeper investigation into orthologue chemical space and its influence on expanding compound and target coverage is necessary to improve confidence in this practice. Results: Here we present an analysis of the orthologue chemical space in ChEMBL and PubChem and its impact on target prediction. We highlight that the number of conflicting bioactivities between human and orthologues is low and that annotations are overall compatible. Chemical space analysis shows that orthologues are chemically dissimilar to human with high intra-group similarity, suggesting they could effectively extend the chemical space modelled. Based on these observations, we show the benefit of orthologue inclusion in terms of novel target coverage. We also benchmarked predictive models using a time-series split and using bioactivities from Chemistry Connect and HTS data available at AstraZeneca, showing that orthologue bioactivity inclusion statistically improved performance. Availability and implementation: Orthologue-based bioactivity prediction and the compound training set are available at www.github.com/lhm30/PIDGINv2. Contact: ab454@cam.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Computer Simulation , Drug Discovery/methods , Proteins/metabolism , Sequence Homology, Amino Acid , Animals , Humans , Ligands , Models, Biological , Proteins/drug effects