1.
Faraday Discuss ; 2024 Sep 25.
Article in English | MEDLINE | ID: mdl-39320108

ABSTRACT

Machine learning has gained popularity for predicting molecular properties based on molecular structure. This study explores the uncertainty estimates of neural fingerprint-based models by comparing pure graph neural networks (GNN) to classical machine learning algorithms combined with neural fingerprints. We investigate the advantage of extracting the neural fingerprint from the GNN and integrating it into a method known for producing better-calibrated probability estimates. Comparisons are made using three classical machine learning methods and the Chemprop model, considering different molecular representations and calibration techniques. We utilize 19 datasets from Toxcast, reflecting real-world scenarios with balanced accuracies ranging from 0.6 to 0.8. Results demonstrate that neural fingerprints combined with classical machine learning methods exhibit a slight decrease in prediction performance compared to the native Chemprop model. However, these models provide significantly improved uncertainty estimates. Notably, uncertainty estimates of neural fingerprint-based methods remain relatively robust for molecules dissimilar to the training set. This suggests that methods like random forest with neural fingerprints can deliver strong prediction performance and reliable uncertainty estimates. When considering both performance and uncertainty, the calibrated Chemprop model and the combination of neural fingerprints with random forest or support vector classifier (SVC) yield comparable results. Surprisingly, the SVC method shows promising performance when combined with neural or count fingerprints. These findings are particularly relevant in real-world industrial projects where accurate predictions and reliable uncertainty estimates are crucial.
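The distinction the abstract draws between prediction performance and the quality of uncertainty estimates can be illustrated with two standard calibration measures, the Brier score and the expected calibration error. The sketch below is only an illustration under stated assumptions: the function names, the equal-width binning, and the toy data are invented here and are not taken from the study.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> float:
    """Weighted mean gap between the mean predicted probability and the
    observed positive rate, computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - pos_rate)
    return ece


# Toy data (hypothetical): both models make the same 6 of 8 correct calls
# at a 0.5 threshold, but only one reports probabilities that match the
# observed frequencies.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
calibrated = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
overconfident = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

for name, probs in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(name, brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Both toy models classify identically, so a metric such as balanced accuracy cannot tell them apart; the calibration measures can.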

2.
J Chem Inf Model ; 64(16): 6259-6280, 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39136669

ABSTRACT

Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints in statistical models and classical machine learning to advanced deep learning approaches. In this review, we aim to distill insights from current research on employing transformer models for MPP. We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pretraining data, optimal architecture selections, and promising pretraining objectives. Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field's understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.


Subjects
Machine Learning; Drug Discovery/methods; Deep Learning
3.
J Chem Inf Model ; 2024 Sep 17.
Article in English | MEDLINE | ID: mdl-39288001

ABSTRACT

The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. MolPipeline provides the building blocks to enable seamless integration of common cheminformatics tasks within scikit-learn's pipeline framework, such as scaffold splits and molecular standardization, making pipeline building easily adaptable to diverse project requirements.

4.
Risk Anal ; 42(2): 224-238, 2022 02.
Article in English | MEDLINE | ID: mdl-33300210

ABSTRACT

For hazard classifications of chemicals, continuous data from animal- or nonanimal testing methods are often dichotomized into binary positive/negative outcomes by defining classification thresholds (CT). Experimental data are, however, subject to biological and technical variability. Each test method's precision is limited, resulting in uncertainty about the positive/negative outcome if the experimental result is close to the CT. Borderline ranges (BR) around the CT were suggested, which represent ranges in which the study result is ambiguous, that is, positive or negative results are equally likely. The BR reflects a method's precision uncertainty. This article explores and compares different approaches to quantify the BR. Besides using the pooled standard deviation, we determine the BR by means of the median absolute deviation (MAD), with a sequential combination of both methods, and by using nonparametric bootstrapping. Furthermore, we quantify the BR for different hazardous effects, including nonanimal tests for skin corrosion, eye irritation, skin irritation, and skin sensitization as well as for an animal test on skin sensitization (the local lymph node assay, LLNA). Additionally, for one method (direct peptide reactivity assay) the BR was determined experimentally and compared to calculated BRs. Our results demonstrate that (i) the precision of the methods determines the size of their BRs, (ii) there is no "perfect" method to derive a BR, and (iii) a consensus on BR is needed to account for the limited precision of testing methods.
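A borderline range around a classification threshold can be sketched from replicate measurements using either of the two spread estimates the abstract names. This is a minimal illustration, not the article's procedure: the symmetric "threshold plus or minus spread" convention, the way the MAD is pooled across substances, the function names, and the replicate values are all assumptions made here.

```python
import math
import statistics


def pooled_sd(groups):
    """Pooled standard deviation over replicate groups (each with >= 2 values)."""
    num = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return math.sqrt(num / den)


def pooled_mad(groups):
    """Median of the absolute deviations of all values from their own
    group median (one assumed way of pooling the MAD)."""
    deviations = [abs(x - statistics.median(g)) for g in groups for x in g]
    return statistics.median(deviations)


def borderline_range(threshold, spread):
    """Symmetric interval threshold +/- spread (an assumed convention)."""
    return (threshold - spread, threshold + spread)


def classify(result, threshold, spread):
    """Return 'borderline' inside the range, else 'positive' or 'negative'."""
    low, high = borderline_range(threshold, spread)
    if low <= result <= high:
        return "borderline"
    return "positive" if result > threshold else "negative"


# Hypothetical replicate results for two substances in the same assay.
replicates = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
ct = 5.0  # hypothetical classification threshold

sd = pooled_sd(replicates)
mad = pooled_mad(replicates)
print("pooled SD:", sd, "BR:", borderline_range(ct, sd))
print("pooled MAD:", mad, "BR:", borderline_range(ct, mad))
print(classify(5.5, ct, mad), classify(7.0, ct, mad), classify(3.0, ct, mad))
```

Because the MAD is less sensitive to outlying replicates than the standard deviation, the two estimates can give noticeably different range widths for the same data, which is one reason the choice of method matters.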


Subjects
Animal Testing Alternatives; Local Lymph Node Assay; Animal Testing Alternatives/methods; Animals; Skin; Uncertainty
5.
Chem Res Toxicol ; 34(2): 396-411, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33185102

ABSTRACT

Disturbance of the thyroid hormone homeostasis has been associated with adverse health effects such as goiters and impaired mental development in humans and thyroid tumors in rats. In vitro and in silico methods for predicting the effects of small molecules on thyroid hormone homeostasis are currently being explored as alternatives to animal experiments, but are still in an early stage of development. The aim of this work was the development of a battery of in silico models for a set of targets involved in molecular initiating events of thyroid hormone homeostasis: deiodinases 1, 2, and 3, thyroid peroxidase (TPO), thyroid hormone receptor (TR), sodium/iodide symporter, thyrotropin-releasing hormone receptor, and thyroid-stimulating hormone receptor. The training data sets were compiled from the ToxCast database and related scientific literature. Classical statistical approaches as well as several machine learning methods (including random forest, support vector machine, and neural networks) were explored in combination with three data balancing techniques. The models were trained on molecular descriptors and fingerprints and evaluated on holdout data. Furthermore, multi-task neural networks combining several end points were investigated as a possible way to improve the performance of models for which the experimental data available for model training are limited. Classifiers for TPO and TR performed particularly well, with F1 scores of 0.83 and 0.81 on the holdout data set, respectively. Models for the other studied targets yielded F1 scores of up to 0.77. An in-depth analysis of the reliability of predictions was performed for the most relevant models. All data sets used in this work for model development and validation are available in the Supporting Information.


Subjects
Homeostasis/drug effects; Small Molecule Libraries/pharmacology; Thyroid Hormones/metabolism; Animals; Databases, Factual; Humans; Machine Learning; Models, Molecular; Molecular Structure; Small Molecule Libraries/chemistry
6.
J Chem Inf Model ; 61(7): 3255-3272, 2021 07 26.
Article in English | MEDLINE | ID: mdl-34153183

ABSTRACT

Computational methods such as machine learning approaches have a strong track record of success in predicting the outcomes of in vitro assays. In contrast, their ability to predict in vivo endpoints is more limited due to the high number of parameters and processes that may influence the outcome. Recent studies have shown that the combination of chemical and biological data can yield better models for in vivo endpoints. The ChemBioSim approach presented in this work aims to enhance the performance of conformal prediction models for in vivo endpoints by combining chemical information with (predicted) bioactivity assay outcomes. Three in vivo toxicological endpoints, capturing genotoxic (MNT), hepatic (DILI), and cardiological (DICC) issues, were selected for this study due to their high relevance for the registration and authorization of new compounds. Since the sparsity of available biological assay data is challenging for predictive modeling, predicted bioactivity descriptors were introduced instead. Thus, a machine learning model for each of the 373 collected biological assays was trained and applied on the compounds of the in vivo toxicity data sets. Besides the chemical descriptors (molecular fingerprints and physicochemical properties), these predicted bioactivities served as descriptors for the models of the three in vivo endpoints. For this study, a workflow based on a conformal prediction framework (a method for confidence estimation) built on random forest models was developed. Furthermore, the most relevant chemical and bioactivity descriptors for each in vivo endpoint were preselected with lasso models. The incorporation of bioactivity descriptors increased the mean F1 scores of the MNT model from 0.61 to 0.70 and for the DICC model from 0.72 to 0.82 while the mean efficiencies increased by roughly 0.10 for both endpoints. In contrast, for the DILI endpoint, no significant improvement in model performance was observed. 
Besides pure performance improvements, an analysis of the most important bioactivity features allowed detection of novel and less intuitive relationships between the predicted biological assay outcomes used as descriptors and the in vivo endpoints. This study presents how the prediction of in vivo toxicity endpoints can be improved by the incorporation of biological information (which is not necessarily captured by chemical descriptors) in an automated workflow without the need for adding experimental workload for the generation of bioactivity descriptors, as predicted outcomes of bioactivity assays were utilized. All bioactivity CP models for deriving the predicted bioactivities, as well as the in vivo toxicity CP models, can be freely downloaded from https://doi.org/10.5281/zenodo.4761225.


Subjects
Liver; Machine Learning; Biological Assay; Molecular Conformation
7.
J Chem Inf Model ; 59(8): 3370-3388, 2019 08 26.
Article in English | MEDLINE | ID: mdl-31361484

ABSTRACT

Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial data sets spanning a wide variety of chemical end points. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary data sets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.


Subjects
Neural Networks, Computer; Computer Graphics
9.
Sci Rep ; 12(1): 7244, 2022 05 04.
Article in English | MEDLINE | ID: mdl-35508546

ABSTRACT

Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
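The calibration step described in the abstract, and the idea of swapping in a calibration set that resembles the new data without retraining the model, can be sketched as a class-wise (Mondrian) inductive conformal classifier. This is a generic textbook-style sketch rather than the study's implementation: the nonconformity measure (one minus the predicted class probability), the function names, and all numbers are assumptions made for illustration.

```python
def p_value(cal_scores, test_score):
    """Fraction of calibration scores at least as nonconforming as the
    test score, with the usual +1 smoothing."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(class_probs, cal_scores_by_class, significance):
    """Return every label whose p-value exceeds the significance level.

    class_probs: {label: model probability for the query compound}
    cal_scores_by_class: {label: nonconformity scores of calibration
                          compounds whose true label is that label}
    """
    labels = set()
    for label, prob in class_probs.items():
        score = 1.0 - prob  # low predicted probability = more nonconforming
        if p_value(cal_scores_by_class[label], score) > significance:
            labels.add(label)
    return labels


# Hypothetical calibration scores from data similar to the training set.
original_cal = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.1, 0.2, 0.3, 0.4]}

# Hypothetical updated scores for class 1 from newer data, on which the
# unchanged model is less certain about true actives.
updated_cal = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.6, 0.7, 0.8, 0.95]}

query = {0: 0.9, 1: 0.1}  # model output for one query compound
print(prediction_set(query, original_cal, significance=0.25))
print(prediction_set(query, updated_cal, significance=0.25))
```

With the original calibration set the query receives the single label 0; with the updated set both labels are retained. A two-label set is an inconclusive prediction, which mirrors the trade-off the abstract reports: validity is restored at the cost of efficiency.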


Subjects
Biological Assay; Machine Learning; Calibration; Molecular Conformation
10.
J Cheminform ; 12(1): 24, 2020 Apr 14.
Article in English | MEDLINE | ID: mdl-33431007

ABSTRACT

Risk assessment of newly synthesised chemicals is a prerequisite for regulatory approval. In this context, in silico methods have great potential to reduce time, cost, and ultimately animal testing as they make use of the ever-growing amount of available toxicity data. Here, KnowTox is presented, a novel pipeline that combines three different in silico toxicology approaches to allow for confident prediction of potentially toxic effects of query compounds, i.e. machine learning models for 88 endpoints, alerts for 919 toxic substructures, and computational support for read-across. It is mainly based on the ToxCast dataset, containing after preprocessing a sparse matrix of 7912 compounds tested against 985 endpoints. When applying machine learning models, applicability and reliability of predictions for new chemicals are of utmost importance. Therefore, first, the conformal prediction technique was deployed, comprising an additional calibration step and per definition creating internally valid predictors at a given significance level. Second, to further improve validity and information efficiency, two adaptations are suggested, exemplified at the androgen receptor antagonism endpoint. An absolute increase in validity of 23% on the in-house dataset of 534 compounds could be achieved by introducing KNNRegressor normalisation. This increase in validity comes at the cost of efficiency, which could again be improved by 20% for the initial ToxCast model by balancing the dataset during model training. Finally, the value of the developed pipeline for risk assessment is discussed using two in-house triazole molecules. Compared to a single toxicity prediction method, complementing the outputs of different approaches can have a higher impact on guiding toxicity testing and de-selecting most likely harmful development-candidate compounds early in the development process.

11.
Article in English | MEDLINE | ID: mdl-32522348

ABSTRACT

At the 2019 annual meeting of the European Environmental Mutagen and Genomics Society, a workshop session on the use of read-across concepts in toxicology was held. The goal of this session was to give the audience an overview of general read-across concepts. In ECHA's read-across assessment framework, the starting point is chemical similarity. There are several approaches and algorithms available for calculating chemical similarity based on molecular descriptors, distance/similarity measures and weighting schemata for specific endpoints. Therefore, algorithms that adapt themselves to the data (endpoint/s) and provide a good ability to distinguish between structurally similar and dissimilar molecules regarding specific endpoints are needed, and their use is discussed. Toxicodynamic endpoints are usually the focus of read-across cases. However, without appropriate attention to kinetics and metabolism, such cases are unlikely to be successful. To further enhance the quality of read-across cases, new approach methods can be very useful. Examples based on a biological approach using plasma metabolomics in rats are given. Finally, with the availability of large data sets of structure-activity relationships, in silico tools have been developed which provide hitherto undiscovered information. Automated processes are now able to assess the chemical-activity space around the target substance, and examples are given demonstrating a high predictivity for certain endpoints of toxicity. Thus, this session provides not only current state-of-the-art criteria for good read-across, but also indicates how read-across can be further developed in the near future.


Subjects
Hazardous Substances/chemistry; Mutagens/chemistry; Algorithms; Animals; Databases, Factual; Humans; Metabolomics/methods; Risk Assessment
12.
J Med Chem ; 63(16): 8667-8682, 2020 08 27.
Article in English | MEDLINE | ID: mdl-32243158

ABSTRACT

Artificial intelligence and machine learning have demonstrated their potential role in predictive chemistry and synthetic planning of small molecules; there are at least a few reports of companies employing in silico synthetic planning into their overall approach to accessing target molecules. A data-driven synthesis planning program is one component being developed and evaluated by the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, comprising MIT and 13 chemical and pharmaceutical company members. Together, we wrote this perspective to share how we think predictive models can be integrated into medicinal chemistry synthesis workflows, how they are currently used within MLPDS member companies, and the outlook for this field.


Subjects
Chemistry Techniques, Synthetic/methods; Chemistry, Pharmaceutical/methods; Machine Learning; Chemical Industry/methods; Drug Discovery/methods; Models, Chemical; Pharmaceutical Research/methods
13.
J Cheminform ; 9(1): 44, 2017 Aug 03.
Article in English | MEDLINE | ID: mdl-29086213

ABSTRACT

The goal of defining an applicability domain for a predictive classification model is to identify the region in chemical space where the model's predictions are reliable. The boundary of the applicability domain is defined with the help of a measure that shall reflect the reliability of an individual prediction. Here, the available measures are differentiated into those that flag unusual objects and which are independent of the original classifier and those that use information of the trained classifier. The former set of techniques is referred to as novelty detection while the latter is designated as confidence estimation. A review of the available confidence estimators shows that most of these measures estimate the probability of class membership of the predicted objects which is inversely related to the error probability. Thus, class probability estimates are natural candidates for defining the applicability domain but were not comprehensively included in previous benchmark studies. The focus of the present study is to find the best measure for defining the applicability domain for a given binary classification technique and to determine the performance of novelty detection versus confidence estimation. Six different binary classification techniques in combination with ten data sets were studied to benchmark the various measures. The area under the receiver operating characteristic curve (AUC ROC) was employed as main benchmark criterion. It is shown that class probability estimates constantly perform best to differentiate between reliable and unreliable predictions. Previously proposed alternatives to class probability estimates do not perform better than the latter and are inferior in most cases. Interestingly, the impact of defining an applicability domain depends on the observed area under the receiver operator characteristic curve. 
That means that it depends on the level of difficulty of the classification problem (expressed as AUC ROC) and will be largest for intermediately difficult problems (range AUC ROC 0.7-0.9). In the ranking of classifiers, classification random forests performed best on average. Hence, classification random forests in combination with the respective class probability estimate are a good starting point for predictive binary chemoinformatic classifiers with applicability domain.
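The benchmark idea in the abstract, scoring how well a confidence measure separates reliable from unreliable predictions by AUC ROC and then restricting predictions to an applicability domain, can be sketched as follows. The code is an illustration under assumptions, not the study's benchmark: it takes the probability of the predicted class as the confidence measure, uses a fixed 0.5 decision threshold, and the function names, the confidence cutoff, and the data are all made up.

```python
def confidence(prob_positive):
    """Confidence of a binary prediction: probability of the predicted class."""
    return max(prob_positive, 1.0 - prob_positive)


def auc(scores, is_positive):
    """AUC ROC as the probability that a positive item outranks a negative
    one (ties count as one half)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_inside_domain(probs, labels, min_confidence):
    """Accuracy and coverage when only predictions with confidence of at
    least min_confidence are accepted."""
    kept = [(p, y) for p, y in zip(probs, labels) if confidence(p) >= min_confidence]
    if not kept:
        return None, 0.0
    n_correct = sum((p >= 0.5) == bool(y) for p, y in kept)
    return n_correct / len(kept), len(kept) / len(probs)


# Hypothetical predicted probabilities of the positive class and true labels.
probs = [0.97, 0.9, 0.62, 0.55, 0.12, 0.05, 0.42, 0.3]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

# Which predictions are correct at the 0.5 decision threshold?
correct = [(p >= 0.5) == bool(y) for p, y in zip(probs, labels)]

# How well does confidence rank correct predictions above incorrect ones?
print("AUC of confidence:", auc([confidence(p) for p in probs], correct))
print("all predictions:", accuracy_inside_domain(probs, labels, 0.5))
print("inside domain:", accuracy_inside_domain(probs, labels, 0.8))
```

In this toy example, accepting only predictions with confidence of at least 0.8 raises accuracy from 0.75 to 1.0 while halving coverage, the kind of trade-off an applicability domain is meant to expose.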

14.
Mol Inform ; 35(5): 160-80, 2016 05.
Article in English | MEDLINE | ID: mdl-27492083

ABSTRACT

Classification rules are often used in chemoinformatics to predict categorical properties of drug candidates related to bioactivity from explanatory variables, which encode the respective molecular structures (i.e. molecular descriptors). To avoid predictions with an unduly large error probability, the domain the classifier is applied to should be restricted to the domain covered by the training set objects. This latter domain is commonly referred to as applicability domain in chemoinformatics. Conceptually, the applicability domain defines the region in space where the "normal" objects are located. Defining the border of the applicability domain may then be viewed as detecting anomalous or novel objects or as detecting outliers. Currently two different types of measures are in use. The first one defines the applicability domain solely in terms of the molecular descriptor space, which is referred to as novelty detection. The second type defines the applicability domain in terms of the expected reliability of the predictions which is referred to as confidence estimation. Both types are systematically differentiated here and the most popular measures are reviewed. It will be shown that all common chemoinformatic classifiers have built-in confidence scores. Since confidence estimation uses information of the class labels for computing the confidence scores, it is expected to be more efficient in reducing the error rate than novelty detection, which solely uses the information of the explanatory variables.


Subjects
Databases, Pharmaceutical; Algorithms; Models, Chemical; Molecular Structure; Pharmaceutical Preparations/chemistry; Pharmaceutical Preparations/classification; Quantitative Structure-Activity Relationship