ABSTRACT
MOTIVATION: Accurate annotation of protein functions is fundamental for understanding molecular and cellular physiology. Data-driven methods hold promise for systematically deriving rules underlying the relationship between protein structure and function. However, the choice of protein structural representation is critical. Pre-defined biochemical features emphasize certain aspects of protein properties while ignoring others, and therefore may fail to capture critical information in complex protein sites. RESULTS: In this paper, we present a general framework that applies 3D convolutional neural networks (3DCNNs) to structure-based protein functional site detection. The framework can extract task-dependent features automatically from the raw atom distributions. We benchmarked our method against other methods and demonstrate better or comparable performance for site detection. Our deep 3DCNNs achieved an average recall of 0.955 at a precision threshold of 0.99 on PROSITE families, detected 98.89 and 92.88% of nitric oxide synthase and TRYPSIN-like enzyme sites in Catalytic Site Atlas, and showed good performance on challenging cases where sequence motifs are absent but a function is known to exist. Finally, we inspected the individual contributions of each atom to the classification decisions and show that our models successfully recapitulate known 3D features within protein functional sites. AVAILABILITY AND IMPLEMENTATION: The 3DCNN models described in this paper are available at https://simtk.org/projects/fscnn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
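The headline metric above, recall at a fixed precision threshold, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `recall_at_precision`, the tie handling, and the toy scores are assumptions made for illustration. It sweeps the score cutoff from high to low and reports the best recall among cutoffs whose precision meets the threshold.

```python
def recall_at_precision(scores, labels, min_precision=0.99):
    """Highest recall over all score cutoffs whose precision >= min_precision.

    scores: classifier scores (higher means more likely a functional site).
    labels: 1 for true sites, 0 for non-sites.
    Returns 0.0 if no cutoff reaches the requested precision.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Sort by descending score; lowering the cutoff admits more predictions.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Admit every example tied at this score before measuring precision.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall


# Toy data (hypothetical): three true sites, two decoys.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
print(recall_at_precision(toy_scores, toy_labels, min_precision=0.99))
```

On the toy data, only the two top-scoring predictions can be accepted before a decoy lowers precision below 0.99, so two of the three true sites are recovered.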
Subjects
Neural Networks, Computer; Proteins
ABSTRACT
Accurate determination of target-ligand interactions is crucial in the drug discovery process. In this paper, we propose a graph-convolutional (Graph-CNN) framework for predicting protein-ligand interactions. First, we built an unsupervised graph-autoencoder to learn fixed-size representations of protein pockets from a set of representative druggable protein binding sites. Second, we trained two Graph-CNNs to automatically extract features from pocket graphs and 2D ligand graphs, respectively, driven by binding classification labels. We demonstrate that graph-autoencoders can learn fixed-size representations for protein pockets of varying sizes and the Graph-CNN framework can effectively capture protein-ligand binding interactions without relying on target-ligand complexes. Across several metrics, Graph-CNNs achieved better or comparable performance to 3DCNN ligand-scoring, AutoDock Vina, RF-Score, and NNScore on common virtual screening benchmark data sets. Visualization of key pocket residues and ligand atoms contributing to the classification decisions confirms that our networks are able to detect important interface residues and ligand atoms within the pockets and ligands, respectively.
Subjects
Drug Development; Neural Networks, Computer; Binding Sites; Databases, Chemical; Drug Discovery; Ligands; Protein Binding; Protein Conformation; Proteins/chemistry
ABSTRACT
Our goal is to answer the question: compared with experimental structures, how useful are predicted models for functional annotation? We assessed the functional utility of predicted models by comparing the performances of a suite of methods for functional characterization on the predictions and the experimental structures. We identified 28 sites in 25 protein targets to perform functional assessment. These 28 sites included nine sites with known ligand binding (holo-sites), nine sites that are expected or suggested by experimental authors for small molecule binding (apo-sites), and ten sites containing important motifs, loops, or key residues with important disease-associated mutations. We evaluated the utility of the predictions by comparing their microenvironments to the experimental structures. Overall structural quality correlates with functional utility. However, the best-ranked predictions (global) may not have the best functional quality (local). Our assessment provides an ability to discriminate between predictions with high structural quality. When assessing ligand-binding sites, most prediction methods have higher performance on apo-sites than holo-sites. Some servers show consistently high performance for certain types of functional sites. Finally, many functional sites are associated with protein-protein interaction. We also analyzed biologically relevant features from the protein assemblies of two targets where the active site spanned the protein-protein interface. For the assembly targets, we find that the features in the models are mainly determined by the choice of template.
Subjects
Biological Products/metabolism; Computational Biology/methods; Models, Molecular; Models, Statistical; Protein Conformation; Proteins/chemistry; Proteins/metabolism; Binding Sites; Catalytic Domain; Humans; Ligands; Protein Binding
ABSTRACT
BACKGROUND: Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation. Most current methods rely on features that are manually selected based on knowledge about protein structures. These are often general-purpose but not optimized for the specific application of interest. In this paper, we present a general framework that applies 3D convolutional neural network (3DCNN) technology to structure-based protein analysis. The framework automatically extracts task-specific features from the raw atom distribution, driven by supervised labels. As a pilot study, we use our network to analyze local protein microenvironments surrounding the 20 amino acids, and predict the amino acids most compatible with environments within a protein structure. To further validate the power of our method, we construct two amino acid substitution matrices from the prediction statistics and use them to predict effects of mutations in T4 lysozyme structures. RESULTS: Our deep 3DCNN achieves a two-fold increase in prediction accuracy compared to models that employ conventional hand-engineered features and successfully recapitulates known information about similar and different microenvironments. Models built from our predictions and substitution matrices achieve an 85% accuracy predicting outcomes of the T4 lysozyme mutation variants. Our substitution matrices contain rich information relevant to mutation analysis compared to well-established substitution matrices. Finally, we present a visualization method to inspect the individual contributions of each atom to the classification decisions. 
CONCLUSIONS: End-to-end trained deep learning networks consistently outperform methods using hand-engineered features, suggesting that the 3DCNN framework is well suited for analysis of protein microenvironments and may be useful for other protein structural analyses.
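To make "raw atom distribution" concrete, the sketch below shows one common way to turn atom coordinates into a voxel grid with one channel per element, the kind of input a 3DCNN consumes. The abstract does not specify the representation, so the channel set (`CHANNELS`), the 20 Å box, the 1 Å voxel size, and the function name `voxelize` are all assumptions for illustration. A sparse dictionary stands in for the dense tensor a real pipeline would build.

```python
from collections import defaultdict

# Assumed element channels; the abstract does not list the ones actually used.
CHANNELS = ("C", "N", "O", "S")


def voxelize(atoms, center, box_size=20.0, voxel_size=1.0):
    """Count atoms per (channel, i, j, k) voxel in a cube centered on `center`.

    atoms: iterable of (element, x, y, z) tuples, coordinates in angstroms.
    center: (x, y, z) of the microenvironment center.
    Returns a sparse dict mapping (channel, i, j, k) to an atom count.
    """
    n_voxels = int(box_size / voxel_size)
    half = box_size / 2.0
    grid = defaultdict(int)
    for element, x, y, z in atoms:
        if element not in CHANNELS:
            continue  # elements without a channel (e.g. hydrogen) are ignored
        # Floor division keeps atoms just outside the low edge out of voxel 0.
        index = [
            int((coord - origin + half) // voxel_size)
            for coord, origin in zip((x, y, z), center)
        ]
        if all(0 <= i < n_voxels for i in index):
            grid[(CHANNELS.index(element), *index)] += 1
    return dict(grid)


# Hypothetical atoms around a site centered at the origin.
example_atoms = [
    ("C", 0.0, 0.0, 0.0),
    ("N", 0.5, 0.2, -0.3),
    ("O", 50.0, 0.0, 0.0),  # outside the box, dropped
    ("H", 0.0, 0.0, 0.0),   # no channel, dropped
]
print(voxelize(example_atoms, center=(0.0, 0.0, 0.0)))
```

Separate element channels play the role that color channels play in image CNNs, which lets the network learn features from atom positions rather than from pre-defined biochemical descriptors.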
Subjects
Amino Acids/metabolism; Brain/metabolism; Nerve Net/physiology; Animals; Area Under Curve; Embryonic Development; In Situ Hybridization; Mice; ROC Curve; Transcriptome
ABSTRACT
DNA-Encoded Chemical Libraries (DELs) have emerged as efficient and cost-effective ligand discovery tools, which enable the generation of protein-ligand interaction data of unprecedented size. In this article, we present an approach that combines DEL screening and instance-level deep learning modeling to identify tumor-targeting ligands against carbonic anhydrase IX (CAIX), a clinically validated marker of hypoxia and clear cell renal cell carcinoma. We present a new ligand identification and hit-to-lead strategy driven by machine learning models trained on DELs, which expand the scope of DEL-derived chemical motifs. CAIX-screening datasets obtained from three different DELs were used to train machine learning models for generating novel hits, dissimilar to elements present in the original DELs. Out of the 152 novel potential hits that were identified with our approach and screened in an in vitro enzymatic inhibition assay, 70% displayed submicromolar activities (IC50 < 1 µM). To generate lead compounds that are functionalized with anticancer payloads, analogues of top hits were prioritized for synthesis based on the predicted CAIX affinity and synthetic feasibility. Three lead candidates showed accumulation on the surface of CAIX-expressing tumor cells in cellular binding assays. The best compound displayed an in vitro KD of 5.7 nM and selectively targeted tumors in mice bearing human renal cell carcinoma lesions. Our results demonstrate the synergy between DEL and machine learning for the identification of novel hits and for the successful translation of lead candidates for in vivo targeting applications.
ABSTRACT
Chemoinformatics is an established discipline focusing on extracting, processing and extrapolating meaningful data from chemical structures. With the rapid explosion of chemical 'big' data from HTS and combinatorial synthesis, machine learning has become an indispensable tool for drug designers to mine chemical information from large compound databases to design drugs with important biological properties. We first review the processing layers of the chemoinformatics pipeline and then introduce machine learning models commonly used in drug discovery and QSAR analysis. We present basic principles and recent case studies to demonstrate the utility of machine learning techniques in chemoinformatics analyses, and we discuss limitations and future directions to guide further development in this evolving field.
Subjects
Drug Discovery/methods; Informatics; Machine Learning; Pharmaceutical Preparations/chemistry; Animals; Diffusion of Innovation; High-Throughput Screening Assays; Humans; Molecular Structure; Pattern Recognition, Automated; Quantitative Structure-Activity Relationship
ABSTRACT
Microenvironment stiffening plays a crucial role in tumorigenesis. While filopodia are generally thought to be one of the cellular mechanosensors for probing environmental stiffness, the effects of environmental stiffness on filopodial activities of cancer cells remain unclear. In this work, we investigated the filopodial activities of human lung adenocarcinoma cells CL1-5 cultured on substrates of tunable stiffness using a novel platform. The platform consists of an optical system called structured illumination nano-profilometry, which allows time-lapse visualization of filopodial activities without fluorescence labeling. The culturing substrates were composed of polyvinyl chloride mixed with an environmentally friendly plasticizer to yield Young's moduli ranging from 20 to 60 kPa. Cell viability studies showed that the viability of cells cultured on the substrates was similar to that of cells cultured on commonly used elastomers such as polydimethylsiloxane. Time-lapse live cell images were acquired and the filopodial activities in response to substrates with varying degrees of stiffness were analyzed. Statistical analyses revealed that lung cancer cells cultured on softer substrates appeared to have longer filopodia, higher filopodial densities with respect to the cellular perimeter, and slower filopodial retraction rates. Nonetheless, the temporal analysis of filopodial activities revealed that whether a filopodium extends or retracts is purely a stochastic process without dependency on substrate stiffness. The differences in filopodial activities between lung cancer cells cultured on substrates with different degrees of stiffness vanished when the myosin II activities were inhibited by treating the cells with blebbistatin, which suggests that the filopodial activities are closely modulated by the adhesion strength of the cells. Our data quantitatively relate filopodial activities of lung cancer cells to environmental stiffness and should shed light on the understanding and treatment of cancer progression and metastasis.