1.
J Chem Inf Model ; 62(9): 2111-2120, 2022 05 09.
Article in English | MEDLINE | ID: mdl-35034452

ABSTRACT

Finding synthesis routes for molecules of interest is essential in the discovery of new drugs and materials. To find such routes, computer-assisted synthesis planning (CASP) methods are employed, which rely on a single-step model of chemical reactivity. In this study, we introduce a template-based single-step retrosynthesis model based on Modern Hopfield Networks, which learn an encoding of both molecules and reaction templates in order to predict the relevance of templates for a given molecule. The template representation allows generalization across different reactions and significantly improves the performance of template relevance prediction, especially for templates with few or zero training examples. With inference speed up to orders of magnitude faster than baseline methods, we improve or match the state-of-the-art performance for top-k exact match accuracy for k ≥ 3 in the retrosynthesis benchmark USPTO-50k. Code to reproduce the results is available at github.com/ml-jku/mhn-react.
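The template relevance step described above can be sketched as an attention-style lookup, in which an encoded molecule is compared against encoded templates and the similarities are turned into a probability distribution. This is only a minimal illustration under stated assumptions: the vectors are hand-written toy encodings, the names `template_relevance` and `top_k` are hypothetical, and the learned encoders of the published model (see github.com/ml-jku/mhn-react) are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def template_relevance(molecule_vec, template_vecs, beta=1.0):
    """Score every template by its scaled dot product with the molecule encoding.

    beta is an assumed inverse-temperature parameter: larger values make the
    distribution sharper.
    """
    scores = [
        beta * sum(m * t for m, t in zip(molecule_vec, tv))
        for tv in template_vecs
    ]
    return softmax(scores)

def top_k(probs, k):
    """Indices of the k most relevant templates, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Toy encodings (hypothetical): one molecule, three reaction templates.
molecule = [1.0, 0.0, 1.0]
templates = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
probs = template_relevance(molecule, templates, beta=2.0)
ranking = top_k(probs, 2)
```

Because every template is scored through its encoding rather than through a per-template output unit, a template with few training examples can still receive a sensible score if its encoding resembles that of well-represented templates.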

2.
PLoS Comput Biol ; 9(2): e1002899, 2013.
Article in English | MEDLINE | ID: mdl-23436985

ABSTRACT

Infection with HIV cannot currently be cured; however, it can be controlled by combination treatment with multiple antiretroviral drugs. Given that the viral genotype differs for virtually every patient, the question arises which drug combination to use to achieve effective treatment. With the availability of viral genotypic data and clinical phenotypic data, it has become possible to create computational models able to predict an optimal treatment regimen for an individual patient. Current models are based only on sequence data derived from viral genotyping; chemical similarity of drugs is not considered. To explore the added value of including chemical similarity, we applied proteochemometric (PCM) models, which combine chemical and protein target properties in a single bioactivity model. Our dataset was a large-scale clinical database of genotypic and phenotypic information (in total ca. 300,000 drug-mutant bioactivity data points; 4 (NNRTI), 8 (NRTI) or 9 (PI) drugs; and 10,700 (NNRTI), 10,500 (NRTI) or 27,000 (PI) mutants). Our models achieved a prediction error below 0.5 log fold change. Moreover, when directly compared with previously published models derived from sequence data, PCM performed better in resistance classification and in prediction of log fold change (0.76 log units versus 0.91). Furthermore, we were able to confirm known, and identify previously unpublished, resistance-conferring mutations of HIV reverse transcriptase (e.g. K102Y, T216M) and HIV protease (e.g. Q18N, N88G) from our dataset. Finally, we applied our models prospectively to the public HIV resistance database from Stanford University, obtaining a correct resistance prediction rate of 84% on the full set (compared to 80% in previous work on a high-quality subset). We conclude that proteochemometric models are able to accurately predict phenotypic resistance from genotypic data, even for novel mutants and mixtures. Furthermore, we add an applicability domain to the prediction, informing the user about the reliability of predictions.
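The core idea of a proteochemometric model, describing each data point by both the compound and the target, can be sketched as follows. This is an illustration only: the descriptor values, the names `pcm_features` and `predict_1nn`, and the activity values are all hypothetical, and a 1-nearest-neighbour lookup stands in for whatever learner the study actually used.

```python
import math

def pcm_features(compound_desc, target_desc):
    """A PCM data point is the compound descriptor joined with the target descriptor."""
    return list(compound_desc) + list(target_desc)

def predict_1nn(train, query):
    """Stand-in learner: return the activity of the closest training pair."""
    nearest = min(train, key=lambda row: math.dist(row[0], query))
    return nearest[1]

# Hypothetical toy descriptors for two drugs and two viral variants.
compounds = {"drugA": [0.2, 1.0], "drugB": [0.9, 0.1]}
mutants = {"wild_type": [0.0, 0.0], "mutant_1": [1.0, 0.0]}

# Each training row: (compound + mutant features, made-up log fold change).
train = [
    (pcm_features(compounds["drugA"], mutants["wild_type"]), 0.1),
    (pcm_features(compounds["drugA"], mutants["mutant_1"]), 1.4),
    (pcm_features(compounds["drugB"], mutants["wild_type"]), 0.0),
]

# The drug-mutant pair below was never measured, yet it can still be scored
# because both of its halves are described in the shared feature space.
query = pcm_features(compounds["drugB"], mutants["mutant_1"])
prediction = predict_1nn(train, query)
```

A sequence-only model would have no way to relate drugB to drugA; joining the two descriptor blocks is what lets information transfer across drugs as well as across mutants.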


Subjects
Anti-HIV Agents/chemistry; Anti-HIV Agents/pharmacology; Computational Biology/methods; Drug Discovery/methods; HIV/drug effects; Models, Biological; Artificial Intelligence; Databases, Genetic; HIV/genetics; Mutation; Phenotype; Reproducibility of Results
3.
J Cheminform ; 15(1): 20, 2023 Feb 11.
Article in English | MEDLINE | ID: mdl-36774523

ABSTRACT

Artificial intelligence is revolutionizing many aspects of the pharmaceutical industry. Deep learning models are now routinely applied to guide drug discovery projects, leading to faster and improved findings, but there are still many tasks with enormous unrealized potential. One such task is reaction yield prediction. Every year more than one fifth of all synthesis attempts result in product yields that are either zero or too low. This equates to chemical and human resources being spent on activities that ultimately do not progress the programs, leading to a triple loss when accounting for the opportunity cost of the time wasted. In this work we pre-train a BERT model on more than 16 million reactions from 4 different data sources and fine-tune it to obtain an uncertainty-calibrated global yield prediction model. This model improves upon the state of the art not only through the increase in pre-training data but also by introducing a new embedding layer, which solves a few limitations of SMILES and enables integration of additional information such as equivalents and molecule role into the reaction encoding. The model is called BERT Enriched Embedding (BEE). It is benchmarked on an open-source dataset against a state-of-the-art synthesis-focused BERT, showing a near 20-point improvement in R² score. The model is then fine-tuned and tested on an internal company data benchmark, and a prospective study shows that applying the model can reduce the total number of negative reactions (yield under 5%) run at Janssen by at least 34%. Lastly, we corroborate the previous results through experimental validation, by directly deploying the model in an ongoing drug discovery project and showing that it can also be used successfully as a reagent recommender thanks to its fast inference speed and reliable confidence estimation, a critical feature for industry application.

4.
Nat Genet ; 51(7): 1082-1091, 2019 07.
Article in English | MEDLINE | ID: mdl-31253980

ABSTRACT

Most candidate drugs currently fail later-stage clinical trials, largely due to poor prediction of efficacy on early target selection [1]. Drug targets with genetic support are more likely to be therapeutically valid [2,3], but the translational use of genome-scale data such as from genome-wide association studies for drug target discovery in complex diseases remains challenging [4-6]. Here, we show that integration of functional genomic and immune-related annotations, together with knowledge of network connectivity, maximizes the informativeness of genetics for target validation, defining the target prioritization landscape for 30 immune traits at the gene and pathway level. We demonstrate how our genetics-led drug target prioritization approach (the priority index) successfully identifies current therapeutics, predicts activity in high-throughput cellular screens (including L1000, CRISPR, mutagenesis and patient-derived cell assays), enables prioritization of under-explored targets and allows for determination of target-level trait relationships. The priority index is an open-access, scalable system accelerating early-stage drug target selection for immune-mediated disease.


Subjects
Arthritis, Rheumatoid/genetics; Drug Discovery; Gene Regulatory Networks; Genome, Human; Immunity, Innate/genetics; Quantitative Trait Loci; Selection, Genetic; Arthritis, Rheumatoid/drug therapy; Arthritis, Rheumatoid/immunology; Gene Expression Regulation; Genome-Wide Association Study; Humans; Polymorphism, Single Nucleotide
6.
Chem Sci ; 9(24): 5441-5451, 2018 Jun 28.
Article in English | MEDLINE | ID: mdl-30155234

ABSTRACT

Deep learning is currently the most successful machine learning technique in a wide range of application areas and has recently been applied successfully in drug discovery research to predict potential drug targets and to screen for active molecules. However, due to (1) the lack of large-scale studies, (2) the compound series bias that is characteristic of drug discovery datasets and (3) the hyperparameter selection bias that comes with the high number of potential deep learning architectures, it remains unclear whether deep learning can indeed outperform existing computational methods in drug discovery tasks. We therefore assessed the performance of several deep learning methods on a large-scale drug discovery dataset and compared the results with those of other machine learning and target prediction methods. To avoid potential biases from hyperparameter selection or compound series, we used a nested cluster-cross-validation strategy. We found (1) that deep learning methods significantly outperform all competing methods and (2) that the predictive performance of deep learning is in many cases comparable to that of tests performed in wet labs (i.e., in vitro assays).
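The nested cluster-cross-validation strategy mentioned above can be sketched as follows. The sketch rests on assumptions: compounds are taken to be pre-assigned to clusters (compound series), whole clusters are dealt into folds, hyperparameters are chosen in an inner loop that never sees the outer test fold, and the names `cluster_folds`, `nested_cluster_cv`, `fit` and `score` are hypothetical. The toy "model" is a bare threshold, chosen only to keep the example self-contained.

```python
def cluster_folds(cluster_ids, n_folds):
    """Assign whole clusters to folds so that no cluster straddles two folds."""
    clusters = sorted(set(cluster_ids))
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    return [fold_of_cluster[c] for c in cluster_ids]

def nested_cluster_cv(samples, cluster_ids, n_folds, candidates, fit, score):
    """Outer loop estimates performance; inner loop selects the hyperparameter."""
    folds = cluster_folds(cluster_ids, n_folds)
    outer_scores = []
    for test_fold in range(n_folds):
        inner_folds = [f for f in range(n_folds) if f != test_fold]

        def inner_score(param):
            # Average validation score over the inner folds only.
            total = 0.0
            for val_fold in inner_folds:
                train = [s for s, f in zip(samples, folds)
                         if f not in (test_fold, val_fold)]
                val = [s for s, f in zip(samples, folds) if f == val_fold]
                total += score(fit(train, param), val)
            return total / len(inner_folds)

        best = max(candidates, key=inner_score)
        train = [s for s, f in zip(samples, folds) if f != test_fold]
        test = [s for s, f in zip(samples, folds) if f == test_fold]
        outer_scores.append(score(fit(train, best), test))
    return outer_scores

# Toy data: (feature, label) pairs grouped into three hypothetical clusters.
samples = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (0.3, 0), (0.7, 1)]
cluster_ids = ["a", "a", "b", "b", "c", "c"]

def fit(train, threshold):
    return threshold  # the "model" is just the threshold itself

def score(model, data):
    return sum((x > model) == bool(y) for x, y in data) / len(data)

scores = nested_cluster_cv(samples, cluster_ids, 3, [0.5, 0.95], fit, score)
```

Splitting by cluster addresses the compound series bias, while the inner loop addresses the hyperparameter selection bias, since the outer test fold plays no part in choosing `best`. At least three folds are needed so that the inner training set is never empty.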

7.
J Cheminform ; 5(1): 41, 2013 Sep 23.
Article in English | MEDLINE | ID: mdl-24059694

ABSTRACT

BACKGROUND: While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 different protein descriptor sets have been compared with respect to their behavior in perceiving similarities between amino acids. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI and BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants). We investigate to what extent descriptor sets show collinear as well as orthogonal behavior via principal component analysis (PCA). RESULTS: In describing amino acid similarities, MS-WHIM, T-scales and ST-scales show related behavior, as do the VHSE, FASGAI, and ProtFP (PCA3) descriptor sets. Conversely, the ProtFP (PCA5), ProtFP (PCA8), Z-scales (Binned), and BLOSUM descriptor sets show behavior that is distinct from one another as well as from both of the clusters above. Generally, the use of more principal components (>3 per amino acid, per descriptor) leads to significant differences in the way amino acids are described, even though the later principal components capture less variation per component of the original input data. CONCLUSION: In this work a comparison is provided of how similarly (and differently) currently available amino acid descriptor sets behave when converting structure to property space. The results obtained enable molecular modelers to select suitable amino acid descriptor sets for structure-activity analyses, e.g. those showing complementary behavior.

8.
J Cheminform ; 5(1): 42, 2013 Sep 24.
Article in English | MEDLINE | ID: mdl-24059743

ABSTRACT

BACKGROUND: While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants); in addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks, which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants. RESULTS: The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), while errors for individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last. CONCLUSIONS: While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is still, on average, surprisingly similar. Still, combining sets describing complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is fundamental for bioactivity modeling, from both the ligand and the protein side.
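The two figures of merit quoted in this abstract, root-mean-square error (RMSE) for regression and the Matthews correlation coefficient (MCC) for binary classification, follow their standard definitions. A minimal sketch is given below; the function names and the toy inputs are illustrative and not taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is reported as 0 when it is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy values (hypothetical), e.g. log-unit activities and active/inactive labels.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_mcc = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

MCC ranges from -1 to 1, so a difference of 0.7 between individual proteins is very large compared with the roughly 0.01 gained by combining descriptor sets.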

9.
J Med Chem ; 55(16): 7010-20, 2012 Aug 23.
Article in English | MEDLINE | ID: mdl-22827545

ABSTRACT

The four subtypes of adenosine receptors form relevant drug targets in the treatment of, e.g., diabetes and Parkinson's disease. In the present study, we aimed at finding novel small molecule ligands for these receptors using virtual screening approaches based on proteochemometric (PCM) modeling. We combined bioactivity data from all human and rat receptors in order to widen available chemical space. After training and validating a proteochemometric model on this combined data set (Q² of 0.73, RMSE of 0.61), we virtually screened a vendor database of 100,910 compounds. Of 54 compounds purchased, six novel high affinity adenosine receptor ligands were confirmed experimentally, one of which displayed an affinity of 7 nM on the human adenosine A1 receptor. We conclude that the combination of rat and human data performs better than human data only. Furthermore, we conclude that proteochemometric modeling is an efficient method to quickly screen for novel bioactive compounds.


Subjects
Databases, Chemical; Models, Molecular; Receptors, Purinergic P1/chemistry; Animals; Artificial Intelligence; Binding Sites; CHO Cells; Computer Simulation; Cricetinae; Cricetulus; Humans; Ligands; Radioligand Assay; Rats; Receptor, Adenosine A1/chemistry; Receptor, Adenosine A1/metabolism; Receptor, Adenosine A2A/chemistry; Receptor, Adenosine A2A/metabolism; Receptor, Adenosine A2B/chemistry; Receptor, Adenosine A2B/metabolism; Receptor, Adenosine A3/chemistry; Receptor, Adenosine A3/metabolism; Receptors, Purinergic P1/metabolism; Structure-Activity Relationship
10.
PLoS One ; 6(11): e27518, 2011.
Article in English | MEDLINE | ID: mdl-22132107

ABSTRACT

In quite a few diseases, drug resistance due to target variability poses a serious problem in pharmacotherapy. This is certainly true for HIV, and hence it is often unknown which drug is best to use or to develop against an individual HIV strain. In this work we applied 'proteochemometric' modeling of HIV non-nucleoside reverse transcriptase inhibitors (NNRTIs) to support preclinical development by predicting compound performance on multiple mutants in the lead selection stage. Proteochemometric models are based on both small molecule and target properties and can thus capture multi-target activity relationships simultaneously, the targets in this case being a set of 14 HIV reverse transcriptase (RT) mutants. We validated our model by experimentally confirming model predictions for 317 untested compound-mutant pairs, with a prediction error comparable with assay variability (RMSE 0.62). Furthermore, depending on the similarity of a new mutant to the training set, we could predict with high accuracy which compound will be most effective on a sequence with a previously unknown genotype. Hence, our models allow the evaluation of compound performance on untested sequences and the selection of the most promising leads for further preclinical research. The modeling concept is likely to be applicable also to other target families with genetic variability, like other viruses or bacteria, or with similar orthologs, like GPCRs.


Subjects
Drug Evaluation, Preclinical/methods; Models, Molecular; Proteomics/methods; Reverse Transcriptase Inhibitors/analysis; Reverse Transcriptase Inhibitors/chemistry; Amino Acid Sequence; Binding Sites; Databases as Topic; HIV Reverse Transcriptase/antagonists & inhibitors; HIV Reverse Transcriptase/chemistry; Humans; Ligands; Molecular Sequence Data; Mutation/genetics; Reproducibility of Results; Reverse Transcriptase Inhibitors/pharmacology
11.
Protein Sci ; 19(4): 742-52, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20120021

ABSTRACT

In this work, we describe two novel approaches to utilize the dynamic structure information implicitly contained in large crystal structure data sets. The first approach visualizes both consistent as well as variable ligand-induced changes in ligand-bound compared with apo protein crystal structures. For this purpose, information was mined from B-factors and ligand-induced residue displacements in multiple crystal structures, minimizing experimental error and noise. With this approach, the mechanism of action of non-nucleoside reverse transcriptase inhibitors (NNRTIs) as an inseparable combination of distortion of protein dynamics and conformational changes of HIV-1 reverse transcriptase was corroborated (a combination of the previously proposed "molecular arthritis" and "distorted site" mechanisms). The second approach presented here uses "consensus structures" to map common binding features that are present in a set of structures of NNRTI-bound HIV-1 reverse transcriptase. Consensus structures are based on different levels of structural overlap of multiple crystal structures and are used to analyze protein-ligand interactions. The structures are shown to yield information about conserved hydrogen bonding interactions as well as binding-pocket flexibility, shape, and volume. From the consensus structures, a common wild type NNRTI binding pocket emerges. Furthermore, we were able to identify a conserved backbone hydrogen bond acceptor at P236 and a novel hydrophobic subpocket, which are not yet utilized by current drugs. Our methods introduced here reinterpret the atom information and make use of the data variability by using multiple structures, complementing classical 3D structural information of single structures.


Subjects
Crystallography, X-Ray; Data Mining/methods; HIV Reverse Transcriptase/chemistry; Proteins/chemistry; Reverse Transcriptase Inhibitors/chemistry; Binding Sites; Databases, Protein; Hydrogen Bonding; Ligands; Models, Molecular; Structure-Activity Relationship
12.
J Chem Inf Model ; 47(4): 1279-93, 2007.
Article in English | MEDLINE | ID: mdl-17511441

ABSTRACT

Chemoinformatics is a large scientific discipline that deals with the storage, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information. Chemoinformatics techniques are used extensively in drug discovery and development. Although many consider it a mature field, the advent of high-throughput experimental techniques and the need to analyze very large data sets have brought new life and challenges to it. Here, we review a selection of papers published in 2006 that caught our attention with regard to the novelty of the methodology that was presented. The field is seeing significant growth, which will be further catalyzed by the widespread availability of public databases to support the development and validation of new approaches.


Subjects
Informatics; Combinatorial Chemistry Techniques; Drug Industry; Genomics; Quantitative Structure-Activity Relationship
13.
J Chem Inf Model ; 47(2): 295-301, 2007.
Article in English | MEDLINE | ID: mdl-17381167

ABSTRACT

We have developed a Java library for substructure matching that features easy-to-read syntax and extensibility. This molecular query language (MQL) is grounded on a context-free grammar, which allows for straightforward modification and extension. The formal description of MQL is provided in this paper. Molecule primitives are atoms, bonds, properties, branching, and rings. User-defined features can be added via a Java interface. In MQL, molecules are represented as graphs. Substructure matching was implemented using the Ullmann algorithm because of favorable run-time performance. The Ullmann algorithm carries out a fast subgraph isomorphism search by combining backtracking with effective forward checking. MQL software design was driven by the aim to facilitate the use of various cheminformatics toolkits. Two Java interfaces provide a bridge from our MQL package to an external toolkit: the first one provides the matching rules for every feature of a particular toolkit; the second one converts the found match from the internal format of MQL to the format of the external toolkit. We already implemented these interfaces for the Chemistry Development Toolkit.
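The combination of backtracking with forward checking described above can be illustrated with a short substructure search over labelled graphs. This is a simplified sketch in the spirit of the Ullmann algorithm, not the MQL implementation: it is written in Python rather than Java, it ignores bond orders, and the function name `find_substructure` and the input format (element labels plus index pairs for bonds) are assumptions made for the example.

```python
def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping query-index -> target-index, or None if there is none."""

    def adjacency(n, bonds):
        adj = [set() for _ in range(n)]
        for a, b in bonds:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    q_adj = adjacency(len(query_atoms), query_bonds)
    t_adj = adjacency(len(target_atoms), target_bonds)

    # Initial candidates: same element label and at least as many neighbours.
    candidates = [
        {t for t in range(len(target_atoms))
         if target_atoms[t] == query_atoms[q] and len(t_adj[t]) >= len(q_adj[q])}
        for q in range(len(query_atoms))
    ]

    def extend(mapping, cands):
        q = len(mapping)
        if q == len(query_atoms):
            return list(mapping)
        for t in sorted(cands[q]):
            # Forward checking: prune the candidates of all unassigned query atoms.
            pruned = list(cands[: q + 1])
            feasible = True
            for q2 in range(q + 1, len(query_atoms)):
                c = cands[q2] - {t}          # each target atom is used once
                if q2 in q_adj[q]:
                    c = c & t_adj[t]         # bonded query atoms need bonded images
                if not c:
                    feasible = False         # dead end detected before recursing
                    break
                pruned.append(c)
            if feasible:
                result = extend(mapping + [t], pruned)
                if result is not None:
                    return result
        return None  # backtrack

    return extend([], candidates)

# Query: a C-O fragment. Target: ethanol heavy atoms, C-C-O.
match = find_substructure(["C", "O"], [(0, 1)],
                          ["C", "C", "O"], [(0, 1), (1, 2)])
```

Here the first carbon is rejected by forward checking alone, because none of its neighbours can serve as the oxygen, so the search never descends into that branch.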


Subjects
Computer Simulation; Models, Molecular; Software Design; Algorithms; Molecular Structure; Pharmaceutical Preparations/chemistry
14.
J Chem Inf Comput Sci ; 43(3): 1077-84, 2003.
Article in English | MEDLINE | ID: mdl-12767167

ABSTRACT

The paper describes a fast and flexible descriptor selection method using a genetic algorithm variant (GA-SEC). The relevance of the descriptors is measured using Shannon entropy (SE) and differential Shannon entropy (DSE), which have very low memory requirements and allow the processing of huge data sets. A small number of the most important descriptors is then used automatically to build a value prediction model. The most important descriptors are not linear combinations of other descriptors, but transparent, pure descriptors. We used an artificial neural network (ANN) model to predict the aqueous solubility logS and the octanol/water partition coefficient logP. The logS data set was divided into a training set of 1016 compounds and a test set of 253 compounds. A correlation coefficient of 0.93 and an empirical standard deviation of 0.54 were achieved. The logP data set was divided into a training set of 1853 compounds and a test set of 138 compounds. A correlation coefficient of 0.92 and an empirical standard deviation of 0.44 were achieved.
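The Shannon entropy of a descriptor can be computed from a histogram of its values, which is why only a handful of bin counts need to be kept in memory. The sketch below assumes equal-width binning over a fixed range; the bin count, the range, and the function name `shannon_entropy` are illustrative choices rather than the settings used in the paper, and the differential variant (DSE) is not shown.

```python
import math
from collections import Counter

def shannon_entropy(values, n_bins, lo, hi):
    """Entropy in bits of a descriptor, using n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    # A value equal to hi is placed in the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A descriptor spread over both bins carries information ...
spread = shannon_entropy([0.1, 0.2, 0.6, 0.9], n_bins=2, lo=0.0, hi=1.0)
# ... while a constant descriptor carries none.
constant = shannon_entropy([0.5, 0.5, 0.5, 0.5], n_bins=2, lo=0.0, hi=1.0)
```

A descriptor whose values all fall in one bin has zero entropy and cannot discriminate between compounds, so ranking descriptors by entropy is a cheap first filter before any model is trained.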


Subjects
Algorithms; Models, Chemical; Organic Chemicals/chemistry; Water/chemistry; Databases, Factual; Entropy; Models, Genetic; Neural Networks, Computer; Octanols/chemistry; Organic Chemicals/analysis; Regression Analysis; Solubility
15.
J Chem Inf Comput Sci ; 44(3): 921-30, 2004.
Article in English | MEDLINE | ID: mdl-15154758

ABSTRACT

The paper describes different aspects of classification models based on molecular data sets, with a focus on feature selection methods. In particular, model quality and the avoidance of high variance on unseen data (overfitting) are discussed with respect to the feature selection problem. We present several standard approaches and modifications of our Genetic Algorithm based on Shannon Entropy Cliques (GA-SEC) algorithm, as well as its extension to classification problems using boosting.

16.
J Chem Inf Comput Sci ; 44(3): 931-9, 2004.
Article in English | MEDLINE | ID: mdl-15154759

ABSTRACT

We show that the topological polar surface area (TPSA) descriptor and the radial distribution function (RDF) applied to electronic and steric atom properties, like the conjugated electrotopological state (CETS), are the most relevant features/descriptors for predicting human intestinal absorption (HIA) out of a large set of 2934 features/descriptors. For a HIA data set of 196 molecules with measured HIA values, 2934 features/descriptors were calculated using JOELib and MOE. We used an adaptive boosting algorithm (AdaBoost.M1) to solve the binary classification problem and Genetic Algorithms based on Shannon Entropy Cliques (GA-SEC) variants as hybrid feature selection algorithms. The selection of relevant features was applied with respect to the generalization ability of the classification model, avoiding a high variance for unseen molecules (overfitting).
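The AdaBoost.M1 scheme named above reweights the training molecules after every round so that later weak learners concentrate on the cases that are still misclassified. The sketch below follows the standard formulation (weights of correctly classified samples are multiplied by beta = err / (1 - err), and each learner votes with weight log(1 / beta)). The decision stumps, the single toy feature and all function names are assumptions for illustration; the paper's actual weak learner and descriptors are not reproduced.

```python
import math

def train_stump(X, y, w):
    """Pick the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                preds = [polarity if row[j] > thr else 1 - polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] > thr else 1 - polarity

def adaboost_m1(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)   # clamp so a perfect stump keeps a finite vote
        if err >= 0.5:
            break                    # weak learner is no better than chance
        beta = err / (1.0 - err)
        # Down-weight the samples this stump already classifies correctly.
        w = [wi * (beta if stump_predict(stump, row) == t else 1.0)
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((stump, math.log(1.0 / beta)))
    return ensemble

def predict(ensemble, row):
    """Weighted vote over the two class labels 0 and 1."""
    votes = {0: 0.0, 1: 0.0}
    for stump, alpha in ensemble:
        votes[stump_predict(stump, row)] += alpha
    return max(votes, key=votes.get)

# Toy data (hypothetical): one descriptor, class 1 = well absorbed.
X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 0, 1, 1]
ensemble = adaboost_m1(X, y, rounds=3)
predictions = [predict(ensemble, row) for row in X]
```

With only 196 molecules and 2934 candidate descriptors, an ensemble like this can fit noise easily, which is why the abstract pairs boosting with feature selection aimed at generalization.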


Subjects
Intestinal Absorption; Models, Theoretical; Humans
17.
J Mol Model ; 9(4): 235-41, 2003 Aug.
Article in English | MEDLINE | ID: mdl-12720113

ABSTRACT

The compressed feature matrix (CFM) is a feature-based molecular descriptor for the fast processing of pharmacochemical applications such as adaptive similarity search, pharmacophore development and substructure search. Depending on the particular purpose, the descriptor may be generated from either topological or Euclidean molecular data. To ensure flexible use, the assignment of structural patterns to feature types can be freely defined by the user. This step is based on a graph algorithm for substructure search, which resembles the common substructure descriptors. While these merely allow a screening for the predefined patterns, the CFM permits a real substructure/subgraph search, provided that all desired elements of the query substructure are described by the selected feature set. In this work, the CFM-based substructure search is evaluated with regard to both the different outputs resulting from varying feature sets and the search speed. As a benchmark we use the programmable atom typer (PATTY) graph algorithm. When comparing the two methods, the CFM-based matrix algorithm is up to several hundred times faster than PATTY, and when using the CFM as a basis for substructure screening, the search speed is accelerated by three orders of magnitude. Thus, the CFM-based substructure search complies with the requirements for interactive usage, even for the evaluation of several hundred thousand compounds. The concept of the CFM is implemented in the software COFEA.

Figure: CFM-based substructure search using the compounds dopamine and benzene-1,2-diol.


Subjects
Computer Simulation; Molecular Structure; Algorithms; Evaluation Studies as Topic; Models, Molecular; Molecular Conformation