ABSTRACT
In protein-ligand docking, the score assigned to a protein-ligand complex is approximate. In particular, the internal energy of the ligand is difficult to compute precisely using a molecular-mechanics-based force field, which introduces significant noise into the rank-ordering of ligands. We propose an open-source protocol (https://github.com/UnixJunkie/MMO), using two quantum mechanics (QM) single-point energy calculations, plus a Monte Carlo (MC) based ligand minimization procedure in between, to estimate ligand strain after docking. The MC simulation uses the ANI-2x (QM-approximating) force field and is performed in dihedral space. On some protein targets, strain filtering after docking significantly improves hit rates. We performed a structure-based virtual screening campaign on nine protein targets from the Laboratoire d'Innovation Thérapeutique-PubChem assays (LIT-PCBA) dataset using the Cambridge Crystallographic Data Centre's Genetic Optimisation for Ligand Docking (GOLD) software. Then, docked ligands were submitted to the strain estimation protocol and the impact on hit rate was analyzed. As with docking, the method does not always work. However, if sufficient active and inactive molecules are known for a given protein target, its efficiency can be evaluated.
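The strain estimate described above can be illustrated with a minimal sketch. It assumes that strain is taken as the energy of the docked conformer minus the energy of the conformer obtained after Monte Carlo minimization in dihedral space. The names `toy_energy`, `mc_minimize` and `strain` are hypothetical, and the toy torsional potential is only a stand-in for the QM single-point and ANI-2x energies of the actual protocol.

```python
import math
import random

def toy_energy(dihedrals):
    """Hypothetical stand-in for a single-point energy (arbitrary units).
    A real run would call a QM or ANI-2x calculator here instead."""
    return sum(1.5 * (1.0 + math.cos(3.0 * phi)) for phi in dihedrals)

def mc_minimize(dihedrals, energy, steps=2000, max_step=0.3, kT=0.6, seed=0):
    """Metropolis Monte Carlo in dihedral space.
    Returns the lowest-energy state visited and its energy."""
    rng = random.Random(seed)
    current = list(dihedrals)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    for _ in range(steps):
        # Perturb one randomly chosen dihedral angle (radians).
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-max_step, max_step)
        e_trial = energy(trial)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_trial <= e_cur or rng.random() < math.exp(-(e_trial - e_cur) / kT):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best

def strain(docked_dihedrals, energy=toy_energy):
    """Energy of the docked conformer minus energy of the relaxed conformer."""
    e_docked = energy(docked_dihedrals)
    _, e_relaxed = mc_minimize(docked_dihedrals, energy)
    return e_docked - e_relaxed
```

Because the search starts from the docked conformer and keeps the best state seen, the estimate in this sketch is never negative. A strain filter would then discard docked ligands whose estimate exceeds a user-chosen cutoff.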
ABSTRACT
In the presence of structural data, one sometimes needs to compare 3D ligands. We design an overlay-free method to rank-order 3D molecules in the pharmacophore feature space. The proposed encoding includes only two fittable parameters, is sparse, and is not too high dimensional. At the cost of an additional parameter, to delineate the binding site from a protein-ligand complex, the method can compare binding sites. The method was benchmarked on the LIT-PCBA data set for ligand-based virtual screening experiments and the sc-PDB and a Vertex data set when comparing binding sites. In similarity searches, the proposed method outperforms an open-source software doing optimal superposition of ligand-based pharmacophores and RDKit's 3D pharmacophore fingerprints. When comparing binding sites, the method is competitive with state-of-the-art approaches. On a single CPU core, up to 374,000 ligands/s or 132,000 binding sites/s can be rank-ordered. The "AutoCorrelation of Pharmacophore Features" open-source software is released at https://github.com/tsudalab/ACP4.
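To make the idea of an overlay-free, sparse encoding concrete, here is a rough sketch of a pharmacophore autocorrelation. It is not the ACP4 implementation: the abstract does not name the two fittable parameters, so the distance `cutoff` and `bin_width` used here are assumptions, as are the function names and the feature labels.

```python
import math
from collections import Counter

def autocorrelation(features, cutoff=5.0, bin_width=1.0):
    """Encode a 3D molecule as sparse counts of feature pairs.
    features: list of (feature_type, (x, y, z)).
    Keys are (type_a, type_b, distance_bin) with type_a <= type_b."""
    counts = Counter()
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            (ta, pa), (tb, pb) = features[i], features[j]
            d = math.dist(pa, pb)
            if d < cutoff:  # pairs at or beyond the cutoff are ignored
                a, b = sorted((ta, tb))
                counts[(a, b, int(d // bin_width))] += 1
    return counts

def tanimoto(c1, c2):
    """Tanimoto similarity between two sparse count vectors."""
    keys = set(c1) | set(c2)
    num = sum(min(c1[k], c2[k]) for k in keys)
    den = sum(max(c1[k], c2[k]) for k in keys)
    return num / den if den else 1.0
```

Only inter-feature distances enter the encoding, so two copies of a molecule that differ by a rotation or translation receive the same vector, and no superposition step is needed before comparing them.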
Subjects
Pharmacophore; Software; Ligands; Binding Sites
ABSTRACT
In structure-based virtual screening (SBVS), a binding site on a protein structure is used to search for ligands with favorable nonbonded interactions. Because it is computationally difficult, docking is time-consuming and any docking user will eventually encounter a chemical library that is too big to dock. This problem might arise because there is not enough computing power or because preparing and storing so many three-dimensional (3D) ligands requires too much space. In this study, however, we show that quality regressors can be trained to predict docking scores from molecular fingerprints. Although typical docking has a screening rate of less than one ligand per second on one CPU core, our regressors can predict about 5800 docking scores per second. This approach allows us to focus docking on the portion of a database that is predicted to have docking scores below a user-chosen threshold. Herein, usage examples are shown, where only 25% of a ligand database is docked, without any significant virtual screening performance loss. We call this method "lean-docking". To validate lean-docking, a massive docking campaign using several state-of-the-art docking software packages was undertaken on an unbiased data set, with only wet-lab tested active and inactive molecules. Although regressors allow the screening of a larger chemical space, even at a constant docking power, it is also clear that significant progress in the virtual screening power of docking scores is desirable.
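The selection step of lean-docking can be sketched as follows. The abstract does not say which regressor or fingerprint is used, so a k-nearest-neighbour average over Tanimoto similarity is swapped in here purely for illustration; all names and the tiny data layout are hypothetical. Lower docking scores are assumed to be better, as implied by the threshold described above.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def predict_score(fp, train, k=2):
    """Predicted docking score: mean score of the k most similar
    training molecules. train is a list of (fingerprint, docking_score)."""
    nearest = sorted(train, key=lambda t: tanimoto(fp, t[0]), reverse=True)[:k]
    return sum(score for _, score in nearest) / len(nearest)

def select_for_docking(library, train, fraction=0.25, k=2):
    """Return the names of the library molecules with the lowest (best)
    predicted docking scores; only these would be sent to real docking."""
    ranked = sorted(library, key=lambda name: predict_score(library[name], train, k))
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]
```

In practice the training set would come from docking a small random sample of the database, and `fraction` (or an explicit score threshold) would be chosen by the user.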
Subjects
Small Molecule Libraries; Binding Sites; Ligands; Molecular Docking Simulation; Protein Binding
ABSTRACT
MOTIVATION: Genome-wide identification of the transcriptomic responses of human cell lines to drug treatments is a challenging issue in medical and pharmaceutical research. However, drug-induced gene expression profiles are largely unknown and unobserved for all combinations of drugs and human cell lines, which is a serious obstacle in practical applications. RESULTS: Here, we developed a novel computational method to predict unknown parts of drug-induced gene expression profiles for various human cell lines and predict new drug therapeutic indications for a wide range of diseases. We proposed a tensor-train weighted optimization (TT-WOPT) algorithm to predict the potential values for unknown parts in tensor-structured gene expression data. Our results revealed that the proposed TT-WOPT algorithm can accurately reconstruct drug-induced gene expression data for a range of human cell lines in the Library of Integrated Network-based Cellular Signatures. The results also revealed that in comparison with the use of original gene expression profiles, the use of imputed gene expression profiles improved the accuracy of drug repositioning. We also performed a comprehensive prediction of drug indications for diseases with gene expression profiles, which suggested many potential drug indications that were not predicted by previous approaches. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Computational Biology; Transcriptome; Algorithms; Cell Line; Drug Repositioning; Humans
ABSTRACT
In ligand-based virtual screening, high-throughput screening (HTS) data sets can be exploited to train classification models. Such models can be used to prioritize yet untested molecules, from the most likely active (against a protein target of interest) to the least likely active. In this study, a single-parameter ranking method with an Applicability Domain (AD) is proposed. In effect, Kernel Density Estimates (KDE) are revisited to improve their computational efficiency and incorporate an AD. Two modifications are proposed: (i) using vanishing kernels (i.e., kernel functions with a finite support) and (ii) using the Tanimoto distance between molecular fingerprints as a radial basis function. This construction is termed "Vanishing Ranking Kernels" (VRK). Using VRK on 21 HTS assays, it is shown that VRK can compete in performance with a graph convolutional deep neural network. VRK are conceptually simple and fast to train. During training, they require optimizing a single parameter. A trained VRK model usually defines an active AD. Exploiting this AD can significantly increase the screening frequency of a VRK model. Software: https://github.com/UnixJunkie/rankers. Data sets: https://zenodo.org/record/1320776 and https://zenodo.org/record/3540423.
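A small sketch may help picture a vanishing kernel over the Tanimoto distance. The quadratic kernel shape and the "actives minus inactives" scoring rule are assumptions chosen for illustration, not necessarily those of the published method; the single tunable parameter is the `bandwidth`, and all function names are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def vanishing_kernel(d, bandwidth):
    """Quadratic kernel with finite support: exactly zero at and beyond the bandwidth."""
    if d >= bandwidth:
        return 0.0
    u = d / bandwidth
    return 1.0 - u * u

def vrk_score(query, actives, inactives, bandwidth):
    """Return (score, in_domain).
    score: kernel mass from training actives minus mass from training inactives.
    in_domain: False when no training molecule lies within the bandwidth."""
    pos = sum(vanishing_kernel(tanimoto_distance(query, a), bandwidth) for a in actives)
    neg = sum(vanishing_kernel(tanimoto_distance(query, i), bandwidth) for i in inactives)
    return pos - neg, (pos > 0.0 or neg > 0.0)
```

Because the kernel vanishes beyond the bandwidth, a query with no training molecule nearby receives no evidence at all. It can be flagged as outside the applicability domain and skipped, which is one way such a domain can raise screening throughput.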
Subjects
Neural Networks, Computer; Software; Ligands
ABSTRACT
In Quantitative Structure-Activity Relationship (QSAR) modeling, one must come up with an activity model but also with an applicability domain for that model. Some existing methods to create an applicability domain are complex, hard to implement, and/or difficult to interpret. Also, they often require the user to select a threshold value, or they embed an empirical constant. In this work, we propose a trivial to interpret and fully automatic Distance-Based Boolean Applicability Domain (DBBAD) algorithm for category QSAR. In retrospective experiments on High Throughput Screening data sets, this applicability domain improves the classification performance and early retrieval of support vector machine and random forest based classifiers, while improving the scaffold diversity among top-ranked active molecules.
Subjects
Algorithms; Computational Biology/methods; Drug Evaluation, Preclinical; High-Throughput Screening Assays; Quantitative Structure-Activity Relationship
ABSTRACT
The SAMPL challenges provide an ideal opportunity for unbiased evaluation and comparison of different approaches used in computational drug design. During the fourth round of this SAMPL challenge, we participated in the virtual screening and binding pose prediction on inhibitors targeting the HIV-1 integrase enzyme. For virtual screening, we used well known and widely used in silico methods combined with personal in cerebro insights and experience. Regular docking only performed slightly better than random selection, but the performance was significantly improved upon incorporation of additional filters based on pharmacophore queries and electrostatic similarities. The best performance was achieved when logical selection was added. For the pose prediction, we utilized a similar consensus approach that amalgamated the results of the Glide-XP docking with structural knowledge and rescoring. The pose prediction results revealed that docking displayed reasonable performance in predicting the binding poses. However, prediction performance can be improved utilizing scientific experience and rescoring approaches. In both the virtual screening and pose prediction challenges, the top performance was achieved by our approaches. Here we describe the methods and strategies used in our approaches and discuss the rationale of their performances.
Subjects
Computer-Aided Design; HIV Integrase Inhibitors/chemistry; HIV Integrase Inhibitors/pharmacology; HIV Integrase/metabolism; HIV-1/enzymology; Molecular Docking Simulation; Drug Design; HIV Infections/drug therapy; HIV Infections/enzymology; HIV Infections/virology; HIV Integrase/chemistry; Humans; Protein Binding; Software
ABSTRACT
Cystic fibrosis (CF) is a monogenetic disease caused by the mutation of CFTR, a cAMP-regulated Cl- channel expressed at the apical plasma membrane (PM) of epithelia. ΔF508-CFTR, the most common mutant in CF, fails to reach the PM due to its misfolding and premature degradation at the endoplasmic reticulum (ER). Recently, CFTR modulators have been developed to correct CFTR abnormalities, with some being used as therapeutic agents for CF treatment. One notable example is Trikafta, a triple combination of CFTR modulators (TEZ/ELX/IVA), which significantly enhances the functionality of ΔF508-CFTR on the PM. However, there is room for improvement in its therapeutic effectiveness since TEZ/ELX/IVA does not fully stabilize ΔF508-CFTR on the PM. To discover new CFTR modulators, we conducted a virtual screening of approximately 4.3 million compounds based on the chemical structures of existing CFTR modulators. This effort led us to identify a novel CFTR ligand named FR3. Unlike clinically available CFTR modulators, FR3 appears to operate through a distinct mechanism of action. FR3 enhances the functional expression of ΔF508-CFTR on the apical PM in airway epithelial cell lines by stabilizing NBD1. Notably, FR3 counteracted the degradation of mature ΔF508-CFTR, which still occurs despite the presence of TEZ/ELX/IVA. Furthermore, FR3 corrected the defective PM expression of a misfolded ABCB1 mutant. Therefore, FR3 may be a potential lead compound for addressing diseases resulting from the misfolding of ABC transporters.
ABSTRACT
The COVID-19 pandemic continues to pose a substantial threat to human lives and is likely to do so for years to come. Despite the availability of vaccines, searching for efficient small-molecule drugs that are widely available, including in low- and middle-income countries, is an ongoing challenge. In this work, we report the results of an open science community effort, the "Billion molecules against COVID-19 challenge", to identify small-molecule inhibitors against SARS-CoV-2 or relevant human receptors. Participating teams used a wide variety of computational methods to screen a minimum of 1 billion virtual molecules against 6 protein targets. Overall, 31 teams participated, and they suggested a total of 639,024 molecules, which were subsequently ranked to find 'consensus compounds'. The organizing team coordinated with various contract research organizations (CROs) and collaborating institutions to synthesize and test 878 compounds for biological activity against proteases (Nsp5, Nsp3, TMPRSS2), nucleocapsid N, RdRP (only the Nsp12 domain), and (alpha) spike protein S. Overall, 27 compounds with weak inhibition/binding were experimentally identified by binding-, cleavage-, and/or viral suppression assays and are presented here. Open science approaches such as the one presented here contribute to the knowledge base of future drug discovery efforts in finding better SARS-CoV-2 treatments.
Subjects
COVID-19; SARS-CoV-2; Humans; Pandemics; Biological Assay; Drug Discovery
ABSTRACT
Increasing the variety of antimicrobial peptides is crucial in meeting the global challenge of multi-drug-resistant bacterial pathogens. While several deep-learning-based peptide design pipelines are reported, they may not be optimal in data efficiency. High efficiency requires a well-compressed latent space, where optimization is likely to fail due to numerous local minima. We present a multi-objective peptide design pipeline based on a discrete latent space and D-Wave quantum annealer with the aim of solving the local minima problem. To achieve multi-objective optimization, multiple peptide properties are encoded into a score using non-dominated sorting. Our pipeline is applied to design therapeutic peptides that are antimicrobial and non-hemolytic at the same time. From 200 000 peptides designed by our pipeline, four peptides proceeded to wet-lab validation. Three of them showed high anti-microbial activity, and two are non-hemolytic. Our results demonstrate how quantum-based optimizers can be taken advantage of in real-world medical studies.
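Non-dominated sorting, mentioned above as the way several properties are folded into one score, can be sketched in a few lines. This is a generic textbook version, assuming every objective is to be maximized and using the Pareto front index as the score; the function names are hypothetical, and the pipeline's own encoding may differ.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective
    and strictly better on at least one (maximization)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def non_dominated_ranks(points):
    """Front index of each point: 0 for the Pareto front,
    1 for the front left once front 0 is removed, and so on."""
    ranks = [None] * len(points)
    remaining = set(range(len(points)))
    front = 0
    while remaining:
        # Points not dominated by any other point still in play.
        current = {i for i in remaining
                   if not any(dominates(points[j], points[i])
                              for j in remaining if j != i)}
        for i in current:
            ranks[i] = front
        remaining -= current
        front += 1
    return ranks
```

For example, with objectives (antimicrobial score, non-hemolytic score), a peptide that is good on both lands in front 0, while one that is matched or beaten on both by another peptide falls into a later front.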
ABSTRACT
In protein folding, clustering is commonly used as one way to identify the best decoy produced. Initializing the pairwise distance matrix for a large decoy set is computationally expensive. We have proposed a fast method that works even on large decoy sets. This method is implemented in software called Durandal. Durandal has been shown to be consistently faster than other software performing fast exact clustering. In some cases, Durandal can even outperform the speed of an approximate method. Durandal uses the triangle inequality to accelerate exact clustering, without compromising the distance function. Recently, we have further enhanced the performance of Durandal by incorporating a Quaternion-based characteristic polynomial method that has increased the speed of Durandal by between 13% and 27% compared with the previous version. Durandal source code is available under the GNU General Public License at http://www.riken.jp/zhangiru/software/durandal_released_qcp.tgz. Alternatively, a compiled version of Durandal is also distributed with the nightly builds of the Phenix (http://www.phenix-online.org/) crystallographic software suite (Adams et al., Acta Crystallogr Sect D 2010, 66, 213).
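The way a metric's triangle inequality can spare exact distance computations is shown in the sketch below. It is not Durandal's algorithm: it uses a single pivot and plain Euclidean points as a stand-in for RMSD between decoys, and the function name is hypothetical. For any pivot p, |d(a,p) - d(b,p)| <= d(a,b) <= d(a,p) + d(b,p), so many pairs can be classified from the bounds alone.

```python
import math

def count_neighbors(points, cutoff, dist=math.dist):
    """Count pairs within `cutoff` of each other, using distances to a
    pivot to bound each pair distance before computing it exactly.
    Returns (pairs_within_cutoff, exact_distance_calls_skipped)."""
    pivot = points[0]
    to_pivot = [dist(p, pivot) for p in points]
    within = skipped = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            lower = abs(to_pivot[i] - to_pivot[j])
            upper = to_pivot[i] + to_pivot[j]
            if lower > cutoff:        # certainly too far apart
                skipped += 1
            elif upper <= cutoff:     # certainly close enough
                within += 1
                skipped += 1
            elif dist(points[i], points[j]) <= cutoff:
                within += 1           # bounds undecided: exact distance needed
    return within, skipped
```

The result is identical to a brute-force count, which is the sense in which this kind of acceleration stays exact. The saving grows when the exact distance is expensive relative to the bound.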
Subjects
Protein Folding; Proteins/chemistry; Software; Cluster Analysis
ABSTRACT
MOTIVATION: Clustering is commonly used to identify the best decoy among many generated in protein structure prediction when using energy alone is insufficient. Calculation of the pairwise distance matrix for a large decoy set is computationally expensive. Typically, only a reduced set of decoys using energy filtering is subjected to clustering analysis. A fast clustering method for a large decoy set would be beneficial to protein structure prediction and this still poses a challenge. RESULTS: We propose a method using propagation of geometric constraints to accelerate exact clustering, without compromising the distance measure. Our method can be used with any metric distance. Metrics that are expensive to compute and have known cheap lower and upper bounds will benefit most from the method. We compared our method's accuracy against published results from the SPICKER clustering software on 40 large decoy sets from the I-TASSER protein folding engine. We also performed some additional speed comparisons on six targets from the 'semfold' decoy set. In our tests, our method chose a better decoy than the energy criterion in 25 out of 40 cases versus 20 for SPICKER. Our method also was shown to be consistently faster than another fast software performing exact clustering named Calibur. In some cases, our approach can even outperform the speed of an approximate method. AVAILABILITY: Our C++ software is released under the GNU General Public License. It can be downloaded from http://www.riken.jp/zhangiru/software/durandal_released.tgz.
Subjects
Protein Conformation; Software; Algorithms; Cluster Analysis; Entropy; Protein Folding; Proteins/chemistry
ABSTRACT
Ab initio phasing is one of the remaining challenges in protein crystallography. Recent progress in computational structure prediction has enabled the generation of de novo models with high enough accuracy to solve the phase problem ab initio. This 'ab initio phasing with de novo models' method first generates a huge number of de novo models and then selects some lowest energy models to solve the phase problem using molecular replacement. The amount of CPU time required is huge even for small proteins and this has limited the utility of this method. Here, an approach is described that significantly reduces the computing time required to perform ab initio phasing with de novo models. Instead of performing molecular replacement after the completion of all models, molecular replacement is initiated during the course of each simulation. The approach principally focuses on avoiding the refinement of the best and the worst models and terminating the entire simulation early once suitable models for phasing have been obtained. In a benchmark data set of 20 proteins, this method is over two orders of magnitude faster than the conventional approach. It was observed that in most cases molecular-replacement solutions were determined soon after the coarse-grained models were turned into full-atom representations. It was also found that all-atom refinement was hardly able to change the models sufficiently to enable successful molecular replacement if the coarse-grained models were not very close to the native structure. Therefore, it remains critical to generate good-quality coarse-grained models to enable subsequent all-atom refinement for successful ab initio phasing by molecular replacement.
Subjects
Crystallography, X-Ray/methods; Proteins/chemistry
ABSTRACT
UNLABELLED: Bioinformaticians are tackling increasingly computation-intensive tasks. In the meantime, workstations are shifting towards multi-core architectures and even massively multi-core may be the norm soon. Bag-of-Tasks (BoT) applications are commonly encountered in bioinformatics. They consist of a large number of independent computation-intensive tasks. This note introduces PAR, a scalable, dynamic, parallel and distributed execution engine for Bag-of-Tasks. PAR is aimed at multi-core architectures and small clusters. Accelerations obtained thanks to PAR on two different applications are shown. AVAILABILITY: PAR is released under the GNU General Public License version three and can be freely downloaded (http://download.savannah.gnu.org/releases/par/par.tgz).
Subjects
Computational Biology/methods; Software
ABSTRACT
BACKGROUND: In recent years, in silico molecular design is regaining interest. To generate on a computer molecules with optimized properties, scoring functions can be coupled with a molecular generator to design novel molecules with a desired property profile. RESULTS: In this article, a simple method is described to generate only valid molecules at high frequency ([Formula: see text] molecule/s using a single CPU core), given a molecular training set. The proposed method generates diverse SMILES (or DeepSMILES) encoded molecules while also showing some propensity at training set distribution matching. When working with DeepSMILES, the method reaches peak performance ([Formula: see text] molecule/s) because it relies almost exclusively on string operations. The "Fast Assembly of SMILES Fragments" software is released as open-source at https://github.com/UnixJunkie/FASMIFRA . Experiments regarding speed, training set distribution matching, molecular diversity and benchmark against several other methods are also shown.
ABSTRACT
Glycans play important roles in cell communication, protein interaction, and immunity, and structural changes in glycans are associated with the regulation of a range of biological pathways involved in disease. However, our understanding of the detailed relationships between specific diseases and glycans is very limited. In this study, we proposed an omics-based method to investigate the correlations between glycans and a wide range of human diseases. We analyzed the gene expression patterns of glycogenes (glycosyltransferases and glycosidases) for 79 different diseases. A biological pathway-based glycogene signature was constructed to identify the alteration in glycan biosynthesis and the associated glycan structures for each disease state. The degradation of N-glycan and keratan sulfate, for example, may promote the growth or metastasis of multiple types of cancer, including endometrial, gastric, and nasopharyngeal. Our results also revealed that commonalities between diseases can be interpreted using glycogene expression patterns, as well as the associated glycan structure patterns at the level of the affected pathway. The proposed method is expected to be useful for understanding the relationships between glycans, glycogenes, and disease and identifying disease-specific glycan biomarkers.
Subjects
Biomarkers, Tumor/genetics; Neoplasms/genetics; Polysaccharides/genetics; Biomarkers, Tumor/metabolism; Carbohydrate Conformation; Humans; Neoplasms/metabolism; Polysaccharides/metabolism
ABSTRACT
InhA, the enoyl-acyl carrier protein reductase of Mycobacterium tuberculosis (mtInhA), which controls mycobacterial cell wall construction, has been targeted in the development of antituberculosis drugs. Previously, our in silico structure-based drug screening study identified a novel class of compounds (designated KES4), which is capable of inhibiting the enzymatic activity of mtInhA, as well as mycobacterial growth. The compounds are composed of four ring structures (A-D), and the MD simulation predicted specific interactions with mtInhA of the D-ring and methylene group between the B-ring and C-ring; however, there is still room for improvement in the A-ring structure. In this study, a structure-activity relationship study of the A-ring was attempted with the assistance of in silico docking simulations. In brief, the virtual chemical library of A-ring-modified KES4 was constructed and subjected to in silico docking simulation against mtInhA using the GOLD program. Among the selected candidates, we achieved synthesis of seven compounds, and the bioactivities (effects on InhA activity and mycobacterial growth and cytotoxicity) of the synthesized molecules were evaluated. Among the compounds tested, two candidates (compounds 3d and 3f) exhibited properties superior to those of the lead compound KES4 as mtInhA-targeted anti-infectives for mycobacteria.
Subjects
Antitubercular Agents/pharmacology; Bacterial Proteins/antagonists & inhibitors; Mycobacterium tuberculosis/drug effects; Oxidoreductases/antagonists & inhibitors; Antitubercular Agents/chemistry; Computer Simulation; Molecular Docking Simulation; Structure-Activity Relationship
ABSTRACT
BACKGROUND: OCaml is a functional programming language with strong static types, Hindley-Milner type inference and garbage collection. In this article, we share our experience in prototyping chemoinformatics and structural bioinformatics software in OCaml. RESULTS: First, we introduce the language, list entry points for chemoinformaticians who would be interested in OCaml and give code examples. Then, we list some scientific open source software written in OCaml. We also present recent open source libraries useful in chemoinformatics. The parallelization of OCaml programs and their performance is also shown. Finally, tools and methods useful when prototyping scientific software in OCaml are given. CONCLUSIONS: In our experience, OCaml is a programming language of choice for method development in chemoinformatics and structural bioinformatics.
ABSTRACT
BACKGROUND: In ligand-based virtual screening experiments, a known active ligand is used in similarity searches to find putative active compounds for the same protein target. When there are several known active molecules, screening using all of them is more powerful than screening using a single ligand. A consensus query can be created by either screening serially with different ligands before merging the obtained similarity scores, or by combining the molecular descriptors (i.e. chemical fingerprints) of those ligands. RESULTS: We report on the discriminative power and speed of several consensus methods, on two datasets made only of experimentally verified molecules. The two datasets contain a total of 19 protein targets, 3776 known active and ~2 × 10^6 inactive molecules. Three chemical fingerprints are investigated: MACCS 166 bits, ECFP4 2048 bits and an unfolded version of MOLPRINT2D. Four different consensus policies and five consensus sizes were benchmarked. CONCLUSIONS: The best consensus method is to rank candidate molecules using the maximum score obtained by each candidate molecule versus all known actives. When the number of actives used is small, the same screening performance can be approached by a consensus fingerprint. However, if the computational exploration of the chemical space is limited by speed (i.e. throughput), a consensus fingerprint can outperform this consensus of scores.
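The two families of consensus query contrasted above can be sketched side by side. Fingerprints are represented as sets of on-bits, and the union of bits is used as the consensus fingerprint; that merging rule is only one possible policy, assumed here for illustration, since the abstract does not spell out the four policies benchmarked. Function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def max_score(candidate, actives):
    """Consensus of scores: best similarity of the candidate to any known active.
    Costs one similarity computation per active."""
    return max(tanimoto(candidate, a) for a in actives)

def consensus_fingerprint_score(candidate, actives):
    """Consensus of fingerprints: similarity to a single merged fingerprint
    (here the union of the actives' bits). Costs one similarity computation."""
    merged = set().union(*actives)
    return tanimoto(candidate, merged)
```

The cost difference is visible in the code: the maximum-score policy scales with the number of actives for every candidate, whereas the merged fingerprint can be built once and then compared a single time per candidate. That is consistent with the throughput advantage reported for consensus fingerprints.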
ABSTRACT
Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps and the allowed amino acid sequences matching a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
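The combination of a distance threshold, a sequence constraint and ranking by distance can be pictured with the short sketch below. It does not use Fragger's interface or file formats: the function name, the `(sequence, distance)` tuples and the example pattern are all hypothetical, and the distances are assumed to have been computed already.

```python
import re

def filter_hits(hits, seq_pattern, max_dist):
    """Keep fragments whose one-letter sequence fully matches `seq_pattern`
    and whose distance to the query is at most `max_dist`, then rank them
    by increasing distance. hits is a list of (sequence, distance)."""
    regex = re.compile(seq_pattern)
    kept = [(seq, d) for seq, d in hits if d <= max_dist and regex.fullmatch(seq)]
    return sorted(kept, key=lambda h: h[1])
```

For instance, the pattern `G[AP]VL` accepts only four-residue fragments starting with glycine, followed by alanine or proline, then valine and leucine.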