ABSTRACT
Modern nanotechnology has generated numerous datasets from in vitro and in vivo studies of nanomaterials, with some available on nanoinformatics portals. However, existing databases lack the digitized data and tools needed for machine learning studies. Here, we report a nanoinformatics platform that accurately annotates nanostructures into machine-readable data files and provides modeling toolkits. This platform, publicly accessible at https://vinas-toolbox.com/, has annotated nanostructures of 14 material types. The associated nanodescriptor data and assay test results are appropriate for modeling purposes. The modeling toolkits enable data standardization, data visualization, and machine learning model development to predict properties and bioactivities of new nanomaterials. Moreover, a library of virtual nanostructures with predicted properties and bioactivities is available to guide the synthesis of new nanomaterials. This platform provides data-driven computational modeling for the nanoscience community, significantly aiding the development of safe and effective nanomaterials.
Subjects
Machine Learning; Nanostructures; Nanostructures/chemistry; Nanotechnology/methods; Software; Computer Simulation; Humans
ABSTRACT
Failure of animal models to predict hepatotoxicity in humans has created a push to develop biological pathway-based alternatives, such as those that use in vitro assays. Public screening programs (e.g., the ToxCast/Tox21 programs) have tested thousands of chemicals using in vitro high-throughput screening (HTS) assays. Developing pathway-based models for simple biological pathways, such as endocrine disruption, has proven successful, but development remains a challenge for complex toxicities like hepatotoxicity because of the many biological events involved. To this end, we aimed to develop a computational strategy for building pathway-based models of complex toxicities. Using a database of 2171 chemicals with human hepatotoxicity classifications, we identified 157 out of 1600+ ToxCast/Tox21 HTS assays as associated with human hepatotoxicity. A computational framework was then used to group these assays by biological target or mechanism into 52 key event (KE) models of hepatotoxicity. Each KE model outputs a KE score summarizing a chemical's potency against a hepatotoxicity-relevant biological target or mechanism. Grouping hepatotoxic chemicals by chemical structure revealed chemical classes with high KE scores that plausibly inform their hepatotoxicity mechanisms. Using KE scores and supervised learning to predict in vivo hepatotoxicity, with toxicokinetic information included, improved predictive performance. This new approach can serve as a universal computational toxicology strategy for various chemical toxicity evaluations.
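For illustration, a minimal sketch of the KE-scoring idea, assuming a table of per-assay AC50 values and an assay-to-key-event mapping; the column names, event labels, and the maximum-potency summary rule are assumptions for this sketch, not the published implementation:
```python
import numpy as np
import pandas as pd

# Illustrative input: one row per (chemical, assay) hit with an AC50 in uM,
# plus a mapping from each assay to its biological target/mechanism group.
hits = pd.DataFrame({
    "chemical": ["chem1", "chem1", "chem2", "chem2"],
    "assay":    ["A", "B", "A", "C"],
    "ac50_uM":  [0.5, 3.0, 20.0, 1.2],
})
assay_to_ke = {"A": "KE_mito_dysfunction", "B": "KE_mito_dysfunction",
               "C": "KE_nuclear_receptor"}
hits["ke"] = hits["assay"].map(assay_to_ke)

# Convert potency to a "larger is more potent" scale (pAC50, molar units).
hits["pac50"] = -np.log10(hits["ac50_uM"] * 1e-6)

# One KE score per chemical and key event: here the maximum pAC50 across the
# grouped assays (an assumed summary rule; a mean or median is also plausible).
ke_scores = hits.groupby(["chemical", "ke"])["pac50"].max().unstack(fill_value=0.0)
print(ke_scores)
```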
Subjects
Chemical and Drug Induced Liver Injury; High-Throughput Screening Assays; Animals; Humans; Toxicokinetics; Databases, Factual; Biological Assay
ABSTRACT
Traditional methodologies for assessing chemical toxicity are expensive and time-consuming. Computational modeling approaches have emerged as low-cost alternatives, especially those used to develop quantitative structure-activity relationship (QSAR) models. However, conventional QSAR models have limited training data, leading to low predictivity for new compounds. We developed a data-driven modeling approach for constructing carcinogenicity-related models and used these models to identify potential new human carcinogens. To this end, we used a probe carcinogen dataset from the US Environmental Protection Agency's Integrated Risk Information System (IRIS) to identify relevant PubChem bioassays. The responses of 25 PubChem assays were significantly associated with carcinogenicity. Eight assays showed predictivity for carcinogenicity and were selected for QSAR model training. Using 5 machine learning algorithms and 3 types of chemical fingerprints, 15 QSAR models were developed for each PubChem assay dataset. These models showed acceptable predictivity during 5-fold cross-validation (average CCR = 0.71). Using our QSAR models, we correctly predicted and ranked the carcinogenic potential of 342 IRIS compounds (PPV = 0.72). The models predicted potential new carcinogens, which were validated by a literature search. This study presents an automated technique that can be applied to prioritize potential toxicants using validated QSAR models based on extensive training sets from public data resources.
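As a hedged sketch of this kind of modeling grid, the snippet below runs several off-the-shelf classifiers over binary fingerprints and reports the CCR from 5-fold cross-validation (CCR, the correct classification rate, corresponds to balanced accuracy in scikit-learn). The algorithm choices and random data are placeholders, since the abstract does not name the exact five algorithms:
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# X: binary chemical fingerprints (n_compounds x n_bits); y: activity calls.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))
y = rng.integers(0, 2, size=200)

algorithms = {
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "NB":  BernoulliNB(),
    "kNN": KNeighborsClassifier(),
    "LR":  LogisticRegression(max_iter=1000),
}
for name, clf in algorithms.items():
    # CCR (correct classification rate) == balanced accuracy in scikit-learn.
    ccr = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: mean CCR = {ccr.mean():.2f}")
```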
Subjects
Algorithms; Quantitative Structure-Activity Relationship; Humans; Computer Simulation; Carcinogens/toxicity; Biological Assay
ABSTRACT
Modern nanotechnology provides efficient and cost-effective nanomaterials (NMs). The increasing use of NMs raises serious concerns regarding nanotoxicity in humans. Traditional animal testing of nanotoxicity is expensive and time-consuming. Modeling studies using machine learning (ML) approaches are promising alternatives that evaluate nanotoxicity directly from nanostructure features. However, NMs, including two-dimensional nanomaterials (2DNMs) such as graphene, have complex structures that are difficult to annotate and quantify for modeling purposes. To address this issue, we constructed a virtual graphene library using nanostructure annotation techniques. Irregular graphene structures were generated by modifying virtual nanosheets. The nanostructures were digitalized from the annotated graphenes. Based on the annotated nanostructures, geometrical nanodescriptors were computed using a Delaunay tessellation approach for ML modeling. Partial least squares regression (PLSR) models for the graphenes were built and validated using a leave-one-out cross-validation (LOOCV) procedure. The resulting models showed good predictivity for four toxicity-related endpoints, with coefficients of determination (R²) ranging from 0.558 to 0.822. This study provides a novel nanostructure annotation strategy for generating high-quality nanodescriptors for ML model development, which can be widely applied to nanoinformatics studies of graphene and other NMs.
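A minimal sketch of the descriptor-and-model mechanics, assuming toy 3D point clouds in place of annotated graphene structures: scipy's Delaunay tessellation supplies simple geometric descriptors, and a PLSR model is validated by LOOCV. The descriptor definitions below are illustrative, not the published nanodescriptors:
```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def delaunay_descriptors(coords):
    """Toy geometric descriptors from a Delaunay tessellation of a point
    cloud: simplex count and edge-length statistics. The published
    nanodescriptors are richer; this only shows the mechanics."""
    tri = Delaunay(coords)
    edges = set()
    for simplex in tri.simplices:  # collect unique edges of each simplex
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                edges.add(tuple(sorted((simplex[i], simplex[j]))))
    lengths = [np.linalg.norm(coords[a] - coords[b]) for a, b in edges]
    return [len(tri.simplices), np.mean(lengths), np.std(lengths)]

rng = np.random.default_rng(0)
# One random point cloud per "annotated nanostructure" (illustrative data).
X = np.array([delaunay_descriptors(rng.random((50, 3))) for _ in range(30)])
y = rng.random(30)  # placeholder endpoint values

pls = PLSRegression(n_components=2)
y_pred = cross_val_predict(pls, X, y, cv=LeaveOneOut())
q2 = 1 - np.sum((y - y_pred.ravel()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"LOOCV q2 = {q2:.3f}")
```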
ABSTRACT
For hazard identification, classification, and labeling purposes, guideline animal studies are required by law to evaluate the developmental toxicity potential of new and existing chemical products. However, guideline developmental toxicity studies are costly, time-consuming, and require many laboratory animals. Computational modeling has emerged as a promising, animal-sparing, and cost-effective method for evaluating the developmental toxicity potential of chemicals, such as endocrine disruptors. We aimed to develop a predictive and explainable computational model for developmental toxicants. To this end, a comprehensive dataset of 1244 chemicals with developmental toxicity classifications was curated from public repositories and literature sources. Data from 2140 toxicological high-throughput screening assays were extracted from PubChem and the ToxCast program for this dataset and combined with information about 834 chemical fragments to group assays based on their chemical-mechanistic relationships. This effort revealed two assay clusters, containing 83 and 76 assays, respectively, with high positive predictive rates for developmental toxicants identified with animal testing guidelines (PPV = 72.4% and 77.3% during cross-validation). These two assay clusters can be used as developmental toxicity models and were applied to predict new chemicals for external validation. This study provides a new strategy for constructing alternative chemical developmental toxicity evaluations that can be replicated for other toxicity modeling studies.
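A hedged sketch of one way such assay clusters could be derived: hierarchical clustering of assays by the similarity of their activity profiles across chemicals. The published grouping also used chemical fragments and mechanistic relationships; this shows only the profile-clustering skeleton on invented data:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Rows = assays, columns = chemicals; 1 = active call (illustrative data).
rng = np.random.default_rng(0)
assay_profiles = rng.integers(0, 2, size=(40, 300)).astype(bool)

# Cluster assays whose activity calls co-occur across chemicals.
dist = pdist(assay_profiles, metric="jaccard")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=0.8, criterion="distance")  # threshold assumed
print("assay cluster labels:", clusters)
```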
Subjects
High-Throughput Screening Assays; Toxicity Tests; Animals; Biological Assay; Female; Hazardous Substances; High-Throughput Screening Assays/methods; Pregnancy; Risk Assessment; Toxicity Tests/methods
ABSTRACT
As defined by the World Health Organization, an endocrine disruptor is an exogenous substance or mixture that alters function(s) of the endocrine system and consequently causes adverse health effects in an intact organism, its progeny, or (sub)populations. Traditional experimental testing regimens to identify toxicants that induce endocrine disruption can be expensive and time-consuming. Computational modeling has emerged as a promising and cost-effective alternative method for screening and prioritizing potentially endocrine-active compounds. The efficient identification of suitable chemical descriptors and machine-learning algorithms, including deep learning, is a considerable challenge for computational toxicology studies. Here, we sought to apply classic machine-learning algorithms and deep-learning approaches to a panel of over 7500 compounds tested against 18 Toxicity Forecaster assays related to nuclear estrogen receptor (ERα and ERβ) activity. Three binary fingerprints (Extended Connectivity FingerPrints, Functional Connectivity FingerPrints, and Molecular ACCess System) were used as chemical descriptors in this study. Each descriptor was combined with four machine-learning and two deep-learning (standard and multitask neural networks) approaches to construct models for all 18 ER assays. The resulting model performance was evaluated using the area under the receiver operating characteristic curve (AUC) values obtained from a fivefold cross-validation procedure. The results showed that individual models have AUC values ranging from 0.56 to 0.86. External validation was conducted using two additional sets of compounds (n = 592 and n = 966) with experimentally established interactions with nuclear ER. An agonist, antagonist, or binding score was determined for each compound by averaging its predicted probabilities across the relevant assay models, yielding AUC values ranging from 0.63 to 0.91. The results suggest that multitask neural networks offer advantages when modeling mechanistically related endpoints. Consensus predictions based on the average values of individual models remain the best modeling strategy for computational toxicity evaluations.
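For reference, the three binary fingerprints can be generated with the open-source RDKit toolkit (ECFP6 corresponds to a Morgan fingerprint of radius 3, FCFP6 to its feature-based variant); this is a standard recipe, not necessarily the exact software used in the study:
```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as an example

# ECFP6: Morgan fingerprint with radius 3 (diameter 6).
ecfp6 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
# FCFP6: the same, but with pharmacophoric feature invariants.
fcfp6 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048,
                                              useFeatures=True)
# MACCS keys: 166 predefined structural keys.
maccs = MACCSkeys.GenMACCSKeys(mol)

print(ecfp6.GetNumOnBits(), fcfp6.GetNumOnBits(), maccs.GetNumOnBits())
```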
Subjects
Machine Learning; Models, Statistical; Receptors, Estrogen; Algorithms; Animals; Computational Biology; Databases, Chemical; Deep Learning; Endocrine Disruptors/metabolism; Endocrine Disruptors/toxicity; Humans; Mice; Protein Binding; Receptors, Estrogen/antagonists & inhibitors; Receptors, Estrogen/drug effects; Receptors, Estrogen/metabolism
ABSTRACT
Traditional experimental testing to identify endocrine disruptors that enhance estrogenic signaling relies on expensive and labor-intensive experiments. We sought to design a knowledge-based deep neural network (k-DNN) approach to reveal and organize public high-throughput screening data for compounds with nuclear estrogen receptor α and β (ERα and ERβ) binding potentials. The target activity was rodent uterotrophic bioactivity driven by ERα/ERβ activation. After training, the resultant network successfully inferred critical relationships among ERα/ERβ target bioassays, shown as the weights of 6521 edges between 1071 neurons. The network uses an adverse outcome pathway (AOP) framework to mimic the signaling pathway initiated by ERα and identify compounds that mimic endogenous estrogens (i.e., estrogen mimetics). The k-DNN can predict estrogen mimetics by activating neurons representing several events in the ERα/ERβ signaling pathway. Therefore, this virtual pathway model, starting from a compound's chemistry initiating ERα activation and ending with rodent uterotrophic bioactivity, can efficiently and accurately prioritize new estrogen mimetics (AUC = 0.864-0.927). This k-DNN method is a potentially universal computational toxicology strategy for utilizing public high-throughput screening data to characterize hazards and prioritize potentially toxic compounds.
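A minimal sketch, not the authors' k-DNN: a feed-forward network with auxiliary outputs standing in for intermediate pathway events (ER binding, then transactivation, then the uterotrophic outcome), so that hidden layers are supervised along the AOP. Layer sizes and event names are assumptions:
```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(1024,), name="chemical_descriptors")
h1 = tf.keras.layers.Dense(64, activation="relu")(inputs)
# Auxiliary head for an early pathway event (assumed label: ER binding).
binding = tf.keras.layers.Dense(1, activation="sigmoid", name="er_binding")(h1)
h2 = tf.keras.layers.Dense(32, activation="relu")(h1)
# Auxiliary head for an intermediate event (assumed: ER transactivation).
transact = tf.keras.layers.Dense(1, activation="sigmoid",
                                 name="er_transactivation")(h2)
h3 = tf.keras.layers.Dense(16, activation="relu")(h2)
# Final head: the in vivo outcome (rodent uterotrophic bioactivity).
outcome = tf.keras.layers.Dense(1, activation="sigmoid", name="uterotrophic")(h3)

model = tf.keras.Model(inputs, [binding, transact, outcome])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```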
Subjects
Adverse Outcome Pathways; Estrogen Receptor beta; Estrogen Receptor alpha; Estrogens; High-Throughput Screening Assays; Neural Networks, Computer
ABSTRACT
PURPOSE: This study aimed to conduct a meta-analysis investigating the distribution of Epstein-Barr virus (EBV) and human papillomavirus (HPV) stratified according to histological nasopharyngeal carcinoma (NPC) type. MATERIALS AND METHODS: We performed a meta-analysis to produce pooled prevalence estimates in a random-effects model. We also calculated attributable fractions of viral combinations in NPC, stratified according to histological type. RESULTS: The prevalence of HPV DNA was higher in WHO Type I (34.4%) than in WHO Type II/III (18.4%). The attributable fractions of WHO Type I NPC were predominantly double-negative EBV(-)HPV(-) NPC (56.4%) and EBV(-)HPV(+) NPC (21.5%), in contrast to WHO Type II/III, in which the predominant pattern was EBV(+)HPV(-) NPC (87.5%). Co-infection with both EBV and HPV was uncommon, and double-negative infection was more common in WHO Type I NPC. CONCLUSION: A significant proportion of WHO Type I NPC was either double-negative EBV(-)HPV(-) or EBV(-)HPV(+).
Subjects
Alphapapillomavirus/isolation & purification; Cyclin-Dependent Kinase Inhibitor p16/isolation & purification; Epstein-Barr Virus Infections/diagnosis; Herpesvirus 4, Human/isolation & purification; Nasopharyngeal Carcinoma/virology; Nasopharyngeal Neoplasms/virology; Papillomavirus Infections/diagnosis; Biomarkers; Epstein-Barr Virus Infections/virology; Humans; Nasopharyngeal Carcinoma/pathology; Nasopharyngeal Neoplasms/pathology; Papillomavirus Infections/virology; Prognosis
ABSTRACT
Digitalizing complex nanostructures into data structures suitable for machine learning modeling, without losing nanostructure information, has been a major challenge. Deep learning frameworks, particularly convolutional neural networks (CNNs), are especially adept at handling multidimensional and complex inputs. In this study, CNNs were applied to model nanoparticle activities exclusively from nanostructures. The nanostructures were represented by virtual molecular projections, a multidimensional digitalization of nanostructures, and used as input data to train the CNNs. To this end, 77 nanoparticles with various activity and/or physicochemical property results were used for modeling. The resulting CNN model predictions show high correlations with the experimental results. Analysis of a trained CNN showed quantitatively that its neurons recognize distinct nanostructure features critical to activities and physicochemical properties. This "end-to-end" deep learning approach is well suited to digitalizing complex nanostructures for data-driven machine learning modeling and can be broadly applied to rationally design nanoparticles with desired activities.
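A hedged sketch of this modeling setup, assuming each nanoparticle's virtual molecular projections are stacked into a 64x64 tensor with 6 channels; the projection dimensions and network shape are illustrative, not the published architecture:
```python
import tensorflow as tf

# Input: stacked 2D projections of one nanostructure (assumed 64x64x6).
inputs = tf.keras.Input(shape=(64, 64, 6))
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
# Regression head for one activity/physicochemical property value.
outputs = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```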
ABSTRACT
A variety of machine learning methods, such as naive Bayesian classifiers, support vector machines, and, more recently, deep neural networks, are demonstrating their utility for drug discovery and development. These methods leverage the increasingly large datasets created from high-throughput screening and allow prediction of bioactivities for targets and molecular properties with increased levels of accuracy. We have only just begun to exploit the potential of these techniques, but they may already be fundamentally changing the research process for identifying new molecules and/or repurposing old drugs. The integrated, end-to-end (E2E) application of such machine learning models is broadly relevant and has considerable implications for developing future therapies and their targeting.
Subjects
Computational Biology/methods; Machine Learning; Algorithms; Bayes Theorem; Computer Simulation; Drug Design; Drug Development; Drug Discovery; Drug Repositioning; Humans; Nanomedicine; Neural Networks, Computer; Support Vector Machine; Technology, Pharmaceutical/trends
ABSTRACT
The U.S. Environmental Protection Agency (EPA) periodically releases in vitro data across a variety of targets, including the estrogen receptor (ER). In 2015, the EPA used these data to construct mathematical models of ER agonist and antagonist pathways to prioritize chemicals for endocrine disruption testing. However, such mathematical models require in vitro data before estrogenic activity can be predicted, whereas machine learning methods are capable of prospective prediction from molecular structure alone. The current study describes the generation and evaluation of Bayesian machine learning models grouped by the EPA's ER agonist pathway model, using multiple data types with the proprietary software Assay Central. External predictions with three test sets of in vitro and in vivo reference chemicals with agonist activity classifications were compared to previous mathematical model publications. Training data sets were subjected to additional machine learning algorithms and compared with rank-normalized scores of internal five-fold cross-validation statistics. External predictions were found to be comparable or superior to those of previous studies published by the EPA. When assessing six additional algorithms for the training data sets, Assay Central performed similarly at a reduced computational cost. This study demonstrates that machine learning can prioritize chemicals for future in vitro and in vivo testing of ER agonism.
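Assay Central is proprietary, so as an open-source stand-in the sketch below fits a Bernoulli naive Bayes classifier over binary fingerprints, the same family of Bayesian model, on placeholder data:
```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 1024))   # binary fingerprints (placeholder)
y = rng.integers(0, 2, size=500)           # ER agonist call (0/1, placeholder)

clf = BernoulliNB()
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {auc.mean():.2f}")
```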
Subjects
Endocrine Disruptors; Receptors, Estrogen; Bayes Theorem; Endocrine Disruptors/toxicity; Machine Learning; Prospective Studies
ABSTRACT
PURPOSE: To investigate the association between race and ethnicity and prognosis in head and neck cancers (HNC), while controlling for socioeconomic status (SES). MATERIALS AND METHODS: Medline, Scopus, EMBASE, and the Cochrane Library were used to identify studies for inclusion, from database inception to March 5, 2019. Studies that analyzed the role of race and ethnicity in overall survival (OS) for malignancies of the head and neck were included. For inclusion, a study needed to report a multivariate analysis controlling for some proxy of SES (for example, household income or employment status). Pooled estimates were generated using a random-effects model. Subgroup analysis by tumor sub-site, meta-regression, and sensitivity analyses were also performed. RevMan 5.3, Meta Essentials, and OpenMeta[Analyst] were used for statistical analysis. RESULTS: Ten studies from 2004 to 2019 with a total of 108,990 patients were included for analysis. After controlling for SES, tumor stage, and treatment variables, blacks were found to have poorer survival than whites (HR = 1.27, 95% CI: 1.18-1.36, p < 0.00001). Subgroup analysis by sub-site and sensitivity analysis agreed with the primary result. No differences in survival across sub-sites were observed. Meta-regression did not identify any factors associated with the pooled estimate. CONCLUSIONS: In HNC, blacks have poorer OS than whites even after controlling for socioeconomic factors.
Subjects
Head and Neck Neoplasms/ethnology; Head and Neck Neoplasms/mortality; Racial Groups; Social Class; Humans; Prognosis; Survival Rate
ABSTRACT
The human immunodeficiency virus (HIV) causes over a million deaths every year and has a huge economic impact in many countries. The first class of drugs approved was the nucleoside reverse transcriptase inhibitors. Newer generations of reverse transcriptase inhibitors are increasingly undermined by drug-resistant strains of HIV, and hence alternatives are urgently needed. We have recently pioneered the use of Bayesian machine learning to generate models with public data to identify new compounds for testing against different disease targets. The current study used the NIAID ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database for machine learning studies. We curated and cleaned data from HIV-1 wild-type cell-based and reverse transcriptase (RT) DNA polymerase inhibition assays. For compounds in this database with ≤1 µM activity, HIV-1 RT DNA polymerase inhibition and cell-based HIV-1 inhibition are correlated (Pearson r = 0.44, n = 1137, p < 0.0001). Models were trained using multiple machine learning approaches (Bernoulli Naive Bayes, AdaBoost Decision Tree, Random Forest, support vector classification, k-Nearest Neighbors, and deep neural networks, as well as consensus approaches), and their predictive abilities were compared. Our comparison of different machine learning methods demonstrated that support vector classification, deep learning, and a consensus were generally comparable and not significantly different from each other, using 5-fold cross-validation and 24 training and test set combinations. Consistent with our previous studies on various targets, training and testing with multiple data sets did not demonstrate a significant difference between support vector machines and deep neural networks.
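A minimal sketch of the reported correlation check between the enzymatic and cell-based endpoints, using scipy on placeholder potency values (the arrays below are simulated, not the curated data):
```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Placeholder pIC50-style values for the two endpoints (n = 1137 as reported).
rt_inhibition = rng.normal(6.0, 1.0, size=1137)            # enzymatic potency
cell_inhibition = 0.5 * rt_inhibition + rng.normal(0, 1, 1137)  # cell-based

r, p = pearsonr(rt_inhibition, cell_inhibition)
print(f"Pearson r = {r:.2f}, p = {p:.1e}, n = {len(rt_inhibition)}")
```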
Subjects
Anti-HIV Agents/pharmacology; HIV Infections/drug therapy; HIV Reverse Transcriptase/antagonists & inhibitors; HIV/drug effects; Machine Learning; Reverse Transcriptase Inhibitors/pharmacology; Bayes Theorem; Databases, Factual; Decision Trees; Drug Discovery; HIV Infections/virology; Humans; Neural Networks, Computer; Support Vector Machine
ABSTRACT
Summary: We have developed a public Chemical In vitro-In vivo Profiling (CIIPro) portal, which can automatically extract in vitro biological data from public resources (i.e., PubChem) for user-supplied compounds. For compounds with in vivo target activity data (e.g., animal toxicity testing results), the integrated cheminformatics algorithm optimizes the extracted biological data using in vitro-in vivo correlations. The resulting in vitro biological data can then be used for read-across risk assessment of the target compounds. Additionally, the CIIPro portal can identify the most similar compounds based on their optimized bioprofiles. The CIIPro portal provides powerful new assessment capabilities to the scientific community and can be easily integrated with other cheminformatics tools. Availability and Implementation: ciipro.rutgers.edu. Contact: danrusso@scarletmail.rutgers.edu or hao.zhu99@rutgers.edu
Subjects
Computational Biology/methods; Software; Toxicology/methods; Animals; Biosimilar Pharmaceuticals; Risk Assessment/methods
ABSTRACT
Many chemicals that disrupt endocrine function have been linked to a variety of adverse biological outcomes. However, screening for endocrine disruption using in vitro or in vivo approaches is costly and time-consuming. Computational methods, e.g., quantitative structure-activity relationship models, have become more reliable due to bigger training sets, increased computing power, and advanced machine learning algorithms, such as multilayered artificial neural networks. Machine learning models can be used to predict the endocrine-disrupting capabilities of compounds, such as binding to the estrogen receptor (ER), allowing for prioritization and further testing. In this work, an exhaustive comparison of multiple machine learning algorithms, chemical spaces, and evaluation metrics for ER binding was performed on public data sets curated using in-house cheminformatics software (Assay Central). Chemical features used in modeling consisted of binary fingerprints (ECFP6, FCFP6, ToxPrint, or MACCS keys) and continuous molecular descriptors from RDKit. Each feature set was subjected to classic machine learning algorithms (Bernoulli Naive Bayes, AdaBoost Decision Tree, Random Forest, Support Vector Machine) and Deep Neural Networks (DNN). Models were evaluated using a variety of metrics: recall, precision, F1 score, accuracy, area under the receiver operating characteristic curve, Cohen's kappa, and Matthews correlation coefficient. For predicting compounds within the training set, DNN achieved higher accuracy than the other methods; however, in 5-fold cross-validation and external test set predictions, DNN and most classic machine learning models performed similarly regardless of the data set or molecular descriptors used. We also used rank-normalized scores as a performance criterion for each machine learning method, and Random Forest performed best on the validation set when ranked by metric or by data set. These results suggest that classic machine learning algorithms may be sufficient to develop high-quality predictive models of ER activity.
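For reference, the full evaluation panel named above can be computed with scikit-learn; the labels and predictions below are placeholders:
```python
import numpy as np
from sklearn.metrics import (recall_score, precision_score, f1_score,
                             accuracy_score, roc_auc_score,
                             cohen_kappa_score, matthews_corrcoef)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                      # placeholder labels
y_prob = np.clip(y_true * 0.6 + rng.random(200) * 0.5, 0, 1)  # fake probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
print("accuracy ", accuracy_score(y_true, y_pred))
print("ROC AUC  ", roc_auc_score(y_true, y_prob))
print("kappa    ", cohen_kappa_score(y_true, y_pred))
print("MCC      ", matthews_corrcoef(y_true, y_pred))
```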
Subjects
Machine Learning; Receptors, Estrogen/metabolism; Algorithms; Animals; Bayes Theorem; Humans; Protein Binding; Software; Support Vector Machine
ABSTRACT
Tuberculosis is a global health dilemma. In 2016, the WHO reported 10.4 million incident cases and 1.7 million deaths. The need to develop new treatments for those infected with Mycobacterium tuberculosis (Mtb) has led to many large-scale phenotypic screens and many thousands of new active compounds identified in vitro. However, with limited funding, efforts to discover new active molecules against Mtb need to be more efficient. Several computational machine learning approaches have been shown to have good enrichment and hit rates. We have curated small-molecule Mtb data and developed new models with a total of 18,886 molecules, using activity cutoffs of 10 µM, 1 µM, and 100 nM. These data sets were used to evaluate different machine learning methods (including deep learning) and metrics and to generate predictions for additional molecules published in 2017. One Mtb model, a Bayesian model combining in vitro and in vivo data at a 100 nM activity cutoff, yielded the following metrics for 5-fold cross-validation: accuracy = 0.88, precision = 0.22, recall = 0.91, specificity = 0.88, kappa = 0.31, and MCC = 0.41. We also curated an evaluation set (n = 153 compounds) published in 2017; when used to test our model, it showed comparable statistics (accuracy = 0.83, precision = 0.27, recall = 1.00, specificity = 0.81, kappa = 0.36, and MCC = 0.47). We also compared these models with additional machine learning algorithms, showing that Bayesian machine learning models constructed with literature Mtb data generated by different laboratories were generally equivalent to, or outperformed, deep neural networks on external test sets. Finally, we compared our training and test sets to show that they were suitably diverse and distinct to represent useful evaluation sets. Such Mtb machine learning models could help prioritize compounds for testing in vitro and in vivo.
Subjects
Antitubercular Agents/pharmacology; Mycobacterium tuberculosis/drug effects; Bayes Theorem; Drug Discovery; Machine Learning; Support Vector Machine
ABSTRACT
Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint-type molecular descriptors, paired with Bayesian methods, resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications, from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has gained considerable traction for many artificial intelligence applications. We have previously suggested the need to compare different machine learning methods with deep learning across an array of varying data sets applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas disease, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole-cell screens, individual proteins, and physicochemical properties, as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient, and others. Based on rank-normalized scores for the metrics or data sets, Deep Neural Networks (DNN) ranked higher than SVM, which in turn ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar-type plots indicates when models are inferior or perhaps overtrained. These results also suggest the need to assess deep learning further using multiple metrics, much larger-scale comparisons, prospective testing, and assessment of different fingerprints and DNN architectures beyond those used here.
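A hedged sketch of the rank-normalization scheme: each metric column is ranked across methods, ranks are rescaled to [0, 1], and per-method means give the overall ordering. The table values below are invented, not the published results:
```python
import pandas as pd

# Invented metric values for three methods (rows) across four metrics (columns).
scores = pd.DataFrame(
    {"AUC": [0.80, 0.78, 0.74], "F1": [0.70, 0.72, 0.65],
     "kappa": [0.45, 0.41, 0.38], "MCC": [0.46, 0.43, 0.37]},
    index=["DNN", "SVM", "NB"],
)
# Rank within each metric (1 = best), then rescale so best -> 1.0, worst -> 0.0.
ranks = scores.rank(ascending=False, axis=0)
normalized = 1 - (ranks - 1) / (len(scores) - 1)
print(normalized.mean(axis=1).sort_values(ascending=False))
```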
Subjects
Drug Discovery/methods; Machine Learning; Neural Networks, Computer; Bayes Theorem; Datasets as Topic
ABSTRACT
Computational modeling has emerged as a time-saving and cost-effective alternative to traditional animal testing for assessing chemicals for their potential hazards. However, few computational modeling studies of immunotoxicity have been reported, and few models are available for predicting immunotoxicants, owing to the lack of training data and the complex mechanisms of immunotoxicity. In this study, we employed a data-driven quantitative structure-activity relationship (QSAR) modeling workflow to extensively enlarge the limited training data by revealing multiple targets involved in immunotoxicity. To this end, a probe data set of 6,341 chemicals was obtained from a high-throughput screening (HTS) assay testing for activation of the aryl hydrocarbon receptor (AhR) signaling pathway, a key event leading to immunotoxicity. Searching this probe data set against PubChem yielded 3,183 assays with testing results for varying proportions of these 6,341 compounds. One hundred assays were selected to develop QSAR models based on their correlations to AhR agonism. Twelve individual QSAR models were built for each assay using combinations of four machine-learning algorithms and three molecular fingerprints. Five-fold cross-validation of the resulting models showed good predictivity (average CCR = 0.73). A total of 20 assays were further selected based on QSAR model performance, and their resulting QSAR models showed good predictivity for identifying potential immunotoxicants among external chemicals. This study provides a computational modeling strategy that can utilize large public toxicity data sets for modeling immunotoxicity and other toxicity endpoints that have limited training data and complicated toxicity mechanisms.
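A hedged sketch of the assay-selection step: scoring a candidate PubChem assay by the strength of association between its activity calls and the AhR probe-set calls on overlapping chemicals. The use of Fisher's exact test, the data, and the retention rule are assumptions for this sketch, not the published selection criterion:
```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
# Placeholder binary calls: AhR probe-set actives vs. one candidate assay.
ahr_calls = rng.integers(0, 2, size=2000)
assay_calls = (ahr_calls ^ (rng.random(2000) < 0.3)).astype(int)  # 30% noise

# 2x2 contingency table of (AhR active/inactive) x (assay active/inactive).
table = np.array([[np.sum((ahr_calls == a) & (assay_calls == b))
                   for b in (1, 0)] for a in (1, 0)])
odds, p = fisher_exact(table)
print(f"odds ratio = {odds:.2f}, p = {p:.1e}")  # retain assay if strongly associated
```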
ABSTRACT
High-throughput screening (HTS) techniques are increasingly being adopted by a variety of fields of toxicology. Notably, large-scale research efforts from government, industrial, and academic laboratories are screening millions of chemicals against a variety of biomolecular targets, producing an enormous amount of publicly available HTS assay data. These HTS assay data provide toxicologists with important information on how chemicals interact with different biomolecular targets and illustrate potential toxicity mechanisms. Open public data repositories, such as the National Institutes of Health's PubChem (http://pubchem.ncbi.nlm.nih.gov), were established to accept, store, and share HTS data. Through the PubChem website, users can rapidly obtain PubChem assay results for compounds by using different chemical identifiers (including SMILES, InChIKey, IUPAC names, etc.). However, obtaining these data in a user-friendly format suitable for modeling and other informatics analyses (e.g., gathering PubChem data for hundreds or thousands of chemicals in a modeling-friendly format) directly through the PubChem web portal is not feasible. This chapter introduces two approaches to obtaining HTS assay results for large datasets of compounds from the PubChem portal. First, programmatic access via PubChem's PUG-REST web service using the Python programming language is described. Second, users who lack programming skills can directly obtain PubChem data for a large set of compounds by using the freely available Chemical In vitro-In vivo Profiling (CIIPro) portal (http://www.ciipro.rutgers.edu).
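A minimal PUG-REST example in the spirit of the first approach, using the requests library: fetch a compound's PubChem CID from its SMILES and then its bioassay summary as CSV. The endpoint paths follow public PUG-REST conventions; error handling and the rate limiting required for large compound sets are omitted from this sketch:
```python
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin as an example

# Resolve the SMILES to a PubChem compound identifier (CID).
r = requests.get(f"{BASE}/compound/smiles/{smiles}/cids/JSON", timeout=30)
cid = r.json()["IdentifierList"]["CID"][0]

# Retrieve the compound's bioassay summary table as CSV.
summary = requests.get(f"{BASE}/compound/cid/{cid}/assaysummary/CSV", timeout=30)
print(summary.text.splitlines()[0])  # header row of the assay summary table
```
Looping this pattern over hundreds or thousands of identifiers (with a short sleep between calls to respect PubChem's usage limits) is what makes the programmatic route practical where the web portal is not.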