ABSTRACT
SUMMARY: The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure, and computational design tools are used to improve the efficiency to deploy successful designs in genetically engineered systems. GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high-performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome. This new software package can be used to design and test gene entanglements for microbial engineering projects using arbitrary sets of user-specified gene pairs. AVAILABILITY AND IMPLEMENTATION: The GENTANGLE source code and its submodules are freely available on GitHub at https://github.com/BiosecSFA/gentangle. The DATANGLE (DATA for genTANGLE) repository contains related data and results and is freely available on GitHub at https://github.com/BiosecSFA/datangle. The GENTANGLE container is freely available on Singularity Cloud Library at https://cloud.sylabs.io/library/khyox/gentangle/gentangle.sif. The GENTANGLE repository wiki (https://github.com/BiosecSFA/gentangle/wiki), website (https://biosecsfa.github.io/gentangle/), and user manual contain detailed instructions on how to use the different components of software and data, including examples and reproducing the results. The code is licensed under the GNU Affero General Public License version 3 (https://www.gnu.org/licenses/agpl.html).
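The idea of two genes "translated in different reading frames" can be illustrated with a minimal sketch. This is not GENTANGLE code: the DNA string is a made-up example and `translate` is a hypothetical helper that uses only the standard genetic code.

```python
# Illustrative only: read one DNA strand in two reading frames.
# The codon table is the standard genetic code in compact form (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` starting at offset `frame` (0, 1, or 2).

    A trailing partial codon is dropped; stop codons appear as '*'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical shared region: the same nucleotides encode two different peptides.
shared = "ATGGCTAAAGGT"
print(translate(shared, 0))  # peptide read in frame 0
print(translate(shared, 1))  # peptide read in frame +1
```

Because each nucleotide participates in a codon of both frames, a change that is silent in one frame can alter the amino acid in the other, which is why designing such pairs benefits from computational search.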
Subject(s)
Software; Computational Biology/methods; Genome, Microbial; Genetic Engineering/methods
ABSTRACT
Protein-ligand interactions are essential to drug discovery and drug development efforts. Desirable on-target or multitarget interactions are the first step in finding an effective therapeutic, while undesirable off-target interactions are the first step in assessing safety. In this work, we introduce a novel ligand-based featurization and mapping of human protein pockets to identify closely related protein targets and to project novel drugs into a hybrid protein-ligand feature space to identify their likely protein interactions. Using structure-based template matches from PDB, protein pockets are featured by the ligands that bind to their best co-complex template matches. The simplicity and interpretability of this approach provide a granular characterization of the human proteome at the protein-pocket level instead of the traditional protein-level characterization by family, function, or pathway. We demonstrate the power of this featurization method by clustering a subset of the human proteome and evaluating the predicted cluster associations of over 7000 compounds.
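A minimal sketch of ligand-based pocket featurization follows. It assumes each pocket is represented by the set of ligand identifiers bound in its template matches and compares pockets with Jaccard similarity; the abstract does not state the similarity measure, so that choice, the pocket names, and the ligand codes are all illustrative assumptions.

```python
# Illustrative only: pockets described by the ligands seen in their template matches.
# Pocket names and ligand codes are hypothetical; Jaccard similarity is an assumed metric.

def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


pocket_ligands = {
    "pocket_A": {"ATP", "ADP", "STU"},
    "pocket_B": {"ATP", "ANP", "STU"},
    "pocket_C": {"HEM"},
}


def rank_pockets(query: set, pockets: dict) -> list:
    """Rank pockets by similarity of their ligand sets to a query ligand set."""
    scored = [(name, jaccard(query, ligands)) for name, ligands in pockets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Project a hypothetical compound, described by ligands it resembles, onto the pockets.
print(rank_pockets({"ATP", "STU"}, pocket_ligands))
```

Representing a pocket by what binds to it keeps the features interpretable: two pockets are close when they share ligands, regardless of protein family.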
Subject(s)
Proteome; Humans; Protein Binding; Binding Sites; Protein Conformation; Ligands; Cluster Analysis
ABSTRACT
The growing capabilities of synthetic biology and organic chemistry demand tools to guide syntheses toward useful molecules. Here, we present Molecular AutoenCoding Auto-Workaround (MACAW), a tool that uses a novel approach to generate molecules predicted to meet a desired property specification (e.g., a binding affinity of 50 nM or an octane number of 90). MACAW describes molecules by embedding them into a smooth multidimensional numerical space, avoiding uninformative dimensions that previous methods often introduce. The coordinates in this embedding provide a natural choice of features for accurately predicting molecular properties, which we demonstrate with examples for cetane and octane numbers, flash points, and histamine H1 receptor binding affinity. The approach is computationally efficient and well-suited to the small- and medium-size datasets commonly used in biosciences. We showcase the utility of MACAW for virtual screening by identifying molecules with high predicted binding affinity to the histamine H1 receptor and limited affinity to the muscarinic M2 receptor, which are targets of medicinal relevance. Combining these predictive capabilities with a novel generative algorithm for molecules allows us to recommend molecules with a desired property value (i.e., inverse molecular design). We demonstrate this capability by recommending molecules with predicted octane numbers of 40, 80, and 120, which is an important characteristic of biofuels. Thus, MACAW augments classical retrosynthesis tools by providing recommendations for molecules on specification.
Subject(s)
Octanes; Receptors, Histamine H1; Algorithms; Protein Binding
ABSTRACT
The identification of promising lead compounds showing pharmacological activities toward a biological target is essential in early stage drug discovery. With the recent increase in available small-molecule databases, virtual high-throughput screening using physics-based molecular docking has emerged as an essential tool in assisting fast and cost-efficient lead discovery and optimization. However, the best scored docking poses are often suboptimal, resulting in incorrect screening and chemical property calculation. We address the pose classification problem by leveraging data-driven machine learning approaches to identify correct docking poses from AutoDock Vina and Glide screens. To enable effective classification of docking poses, we present two convolutional neural network approaches: a three-dimensional convolutional neural network (3D-CNN) and an attention-based point cloud network (PCN) trained on the PDBbind refined set. We demonstrate the effectiveness of our proposed classifiers on multiple evaluation data sets including the standard PDBbind CASF-2016 benchmark data set and various compound libraries with structurally different protein targets including an ion channel data set extracted from Protein Data Bank (PDB) and an in-house KCa3.1 inhibitor data set. Our experiments show that excluding false positive docking poses using the proposed classifiers improves virtual high-throughput screening to identify novel molecules against each target protein compared to the initial screen based on the docking scores.
Subject(s)
Ion Channels; Neural Networks, Computer; Ligands; Molecular Docking Simulation; Protein Binding
ABSTRACT
Predicting accurate protein-ligand binding affinities is an important task in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the application of deep convolutional and graph neural network-based approaches, it remains unclear what the relative advantages of each approach are and how they compare with physics-based methodologies that have found more mainstream success in virtual screening pipelines. We present fusion models that combine features and inference from complementary representations to improve binding affinity prediction. This, to our knowledge, is the first comprehensive study that uses a common series of evaluations to directly compare the performance of three-dimensional (3D)-convolutional neural networks (3D-CNNs), spatial graph neural networks (SG-CNNs), and their fusion. We use temporal and structure-based splits to assess performance on novel protein targets. To test the practical applicability of our models, we examine their performance in cases that assume that the crystal structure is not available. In these cases, binding free energies are predicted using docking pose coordinates as the inputs to each model. In addition, we compare these deep learning approaches to predictions based on docking scores and molecular mechanic/generalized Born surface area (MM/GBSA) calculations. Our results show that the fusion models make more accurate predictions than their constituent neural network models as well as docking scoring and MM/GBSA rescoring, with the benefit of greater computational efficiency than the MM/GBSA method. Finally, we provide the code to reproduce our results and the parameter files of the trained models used in this work. The software is available as open source at https://github.com/llnl/fast. 
Model parameter files are available at ftp://gdo-bioinformatics.ucllnl.org/fast/pdbbind2016_model_checkpoints/.
Subject(s)
Neural Networks, Computer; Proteins; Ligands; Protein Binding; Proteins/metabolism; Software
ABSTRACT
Cholestatic liver injury is frequently associated with drug inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients. We report our development of classification and regression models for BSEP inhibition with substantially improved performance over previously published models. We assessed the performance effects of different methods of chemical featurization, data set partitioning, and class labeling and identified the methods producing models that generalized best to novel chemical entities.
Subject(s)
Chemical and Drug Induced Liver Injury; Cholestasis; ATP Binding Cassette Transporter, Subfamily B, Member 11; ATP-Binding Cassette Transporters; Humans; Machine Learning
ABSTRACT
We present a new approach to estimate the binding affinity from given three-dimensional poses of protein-ligand complexes. In this scheme, every protein-ligand atom pair makes an additive free-energy contribution. The sum of these pairwise contributions then gives the total binding free energy or the logarithm of the dissociation constant. The pairwise contribution is calculated by a function implemented via a neural network that takes the properties of the two atoms and their distance as input. The pairwise function is trained using a portion of the PDBbind 2018 data set. The model achieves good accuracy for affinity predictions when evaluated with PDBbind 2018 and with the CASF-2016 benchmark, comparing favorably to many scoring functions such as that of AutoDock Vina. The framework here may be extended to incorporate other factors to further improve its accuracy and power.
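The additive scheme described above can be sketched as a double sum over protein-ligand atom pairs. In this sketch the pairwise function is a toy analytic stand-in for the trained neural network, and the `Atom` type, the distance cutoff, and all numeric constants are assumptions made for illustration only.

```python
# Illustrative only: total score = sum of pairwise contributions over all
# protein-ligand atom pairs. `toy_pair_term` is a stand-in for the learned
# pairwise function; its form and constants are hypothetical.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str
    xyz: tuple  # (x, y, z) coordinates in angstroms


def toy_pair_term(protein_atom: Atom, ligand_atom: Atom, distance: float) -> float:
    """Toy contribution: a favorable well near 4 angstroms, zero beyond a cutoff."""
    if distance > 8.0:  # assumed cutoff
        return 0.0
    depth = 0.2 if protein_atom.element == ligand_atom.element else 0.1
    return -depth * math.exp(-((distance - 4.0) ** 2))


def binding_score(protein, ligand, pair_fn=toy_pair_term) -> float:
    """Sum pair_fn over every (protein atom, ligand atom) pair."""
    return sum(
        pair_fn(p, l, math.dist(p.xyz, l.xyz))
        for p in protein
        for l in ligand
    )


protein = [Atom("C", (0.0, 0.0, 0.0)), Atom("N", (1.5, 0.0, 0.0))]
ligand = [Atom("C", (4.0, 0.0, 0.0)), Atom("O", (5.0, 1.0, 0.0))]
print(binding_score(protein, ligand))
```

One consequence of additivity is that the total decomposes exactly into per-atom or per-pair contributions, which makes it straightforward to inspect which contacts drive a prediction.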
Subject(s)
Drug Design; Neural Networks, Computer; Ligands; Molecular Docking Simulation; Protein Binding
ABSTRACT
Accurately predicting small molecule partitioning and hydrophobicity is critical in the drug discovery process. There are many heterogeneous chemical environments within a cell and the entire human body. For example, drugs must be able to cross the hydrophobic cellular membrane to reach their intracellular targets, and hydrophobicity is an important driving force for drug-protein binding. Atomistic molecular dynamics (MD) simulations are routinely used to calculate free energies of small molecules binding to proteins, crossing lipid membranes, and solvation but are computationally expensive. Machine learning (ML) and empirical methods are also used throughout drug discovery but rely on experimental data, limiting the domain of applicability. We present atomistic MD simulations calculating 15,000 small molecule free energies of transfer from water to cyclohexane. This large data set is used to train ML models that predict the free energies of transfer. We show that a spatial graph neural network model achieves the highest accuracy, followed closely by a 3D-convolutional neural network, and that shallow learning based on the chemical fingerprint is significantly less accurate. A mean absolute error of ~4 kJ/mol compared to the MD calculations was achieved for our best ML model. We also show that including data from the MD simulation improves the predictions, test the transferability of each model to a diverse set of molecules, and show that multitask learning improves the predictions. This work provides insight into the hydrophobicity of small molecules and ML cheminformatics modeling, and our data set will be useful for designing and testing future ML cheminformatics methods.
Subject(s)
Deep Learning; Molecular Dynamics Simulation; Entropy; Humans; Hydrophobic and Hydrophilic Interactions; Thermodynamics
ABSTRACT
One of the key requirements for incorporating machine learning (ML) into the drug discovery process is complete traceability and reproducibility of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing ML models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of ML and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical data sets covering a wide range of parameters. Our key findings indicate that traditional molecular fingerprints underperform other feature representation methods. We also find that data set size correlates directly with prediction performance, which points to the need to expand public data sets. Uncertainty quantification can help predict model error, but correlation with error varies considerably between data sets and model types. Our findings point to the need for an extensible pipeline that can be shared to make model building more widely accessible and reproducible. This software is open source and available at: https://github.com/ATOMconsortium/AMPL.
Subject(s)
Drug Discovery; Software; Machine Learning; Reproducibility of Results
ABSTRACT
BACKGROUND: The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity. RESULTS: We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on residual neural networks for encoding features as well as predicting tumor growth, our model explains 94% of the response variance. While our best result is achieved with a combination of molecular feature types (gene expression, microRNA and proteome), we show that most of the predictive power comes from drug descriptors. To further demonstrate value in detecting anticancer therapy, we rank the drug pairs for each cell line based on model predicted combination effect and recover 80% of the top pairs with enhanced activity. CONCLUSIONS: We present promising results in applying deep learning to predicting combinational drug response. Our feature analysis indicates screening data involving more cell lines are needed for the models to make better use of molecular features.
Subject(s)
Deep Learning/trends; Drug Evaluation, Preclinical/methods; Cell Line, Tumor; Humans; National Cancer Institute (U.S.); Neural Networks, Computer; United States
ABSTRACT
Identifying causative disease agents in human patients from shotgun metagenomic sequencing (SMS) presents a powerful tool to apply when other targeted diagnostics fail. Numerous technical challenges remain, however, before SMS can move beyond the role of research tool. Accurately separating the known and unknown organism content remains difficult, particularly when SMS is applied as a last resort. The true amount of human DNA that remains in a sample after screening against the human reference genome and filtering nonbiological components left from library preparation has previously been underreported. In this study, we create the most comprehensive collection of microbial and reference-free human genetic variation available in a database optimized for efficient metagenomic search by extracting sequences from GenBank and the 1000 Genomes Project. The results reveal new human sequences found in individual Human Microbiome Project (HMP) samples. Individual samples contain up to 95% human sequence, and 4% of the individual HMP samples contain 10% or more human reads. Left unidentified, human reads can complicate and slow down further analysis and lead to inaccurately labeled microbial taxa and ultimately lead to privacy concerns as more human genome data is collected.
Subject(s)
Genome, Microbial; Metagenome; Metagenomics/methods; Microbiota; Computational Biology/methods; Databases, Nucleic Acid; Humans; ROC Curve
ABSTRACT
The organisms in aerosol microenvironments, especially densely populated urban areas, are relevant to maintenance of public health and detection of potential epidemic or biothreat agents. To examine aerosolized microorganisms in this environment, we performed sequencing on the material from an urban aerosol surveillance program. Whole metagenome sequencing was applied to DNA extracted from air filters obtained during periods from each of the four seasons. The composition of bacteria, plants, fungi, invertebrates, and viruses demonstrated distinct temporal shifts. Bacillus thuringiensis serovar kurstaki was detected in samples known to be exposed to aerosolized spores, illustrating the potential utility of this approach for identification of intentionally introduced microbial agents. Together, these data demonstrate the temporally dependent metagenomic complexity of urban aerosols and the potential of genomic analytical techniques for biosurveillance and monitoring of threats to public health.
Subject(s)
Air Microbiology; DNA, Bacterial/isolation & purification; Metagenomics/methods; Bacillus thuringiensis/isolation & purification; Bacteria/classification; Bacteria/isolation & purification; Biomass; Cities; DNA Copy Number Variations; DNA, Bacterial/genetics; District of Columbia; Environmental Monitoring; Fungi/classification; Fungi/isolation & purification; Metagenome; Seasons; Sequence Alignment; Sequence Analysis, DNA
ABSTRACT
Combat wound healing and resolution are highly affected by the resident microbial flora. We therefore sought to achieve comprehensive detection of microbial populations in wounds using novel genomic technologies and bioinformatics analyses. We employed a microarray capable of detecting all sequenced pathogens for interrogation of 124 wound samples from extremity injuries in combat-injured U.S. service members. A subset of samples was also processed via next-generation sequencing and metagenomic analysis. Array analysis detected microbial targets in 51% of all wound samples, with Acinetobacter baumannii being the most frequently detected species. Multiple Pseudomonas species were also detected in tissue biopsy specimens. Detection of the Acinetobacter plasmid pRAY correlated significantly with wound failure, while detection of enteric-associated bacteria was associated significantly with successful healing. Whole-genome sequencing revealed broad microbial biodiversity between samples. The total wound bioburden did not associate significantly with wound outcome, although temporal shifts were observed over the course of treatment. Given that standard microbiological methods do not detect the full range of microbes in each wound, these data emphasize the importance of supplementation with molecular techniques for thorough characterization of wound-associated microbes. Future application of genomic protocols for assessing microbial content could allow application of specialized care through early and rapid identification and management of critical patterns in wound bioburden.
Subject(s)
Bacteria/classification; Bacteria/isolation & purification; Biota; High-Throughput Nucleotide Sequencing/methods; Microarray Analysis/methods; Wound Infection/microbiology; Adult; Bacteria/genetics; Bacterial Load; Humans; Military Personnel; Wound Healing; Young Adult
ABSTRACT
MOTIVATION: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge. RESULTS: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample. AVAILABILITY: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat CONTACT: allen99@llnl.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
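The idea of moving cost into an off-line taxonomy/genome index can be sketched as a k-mer lookup table. This is a simplified illustration, not the LMAT implementation: the toy taxonomy, the genome strings, the lowest-common-ancestor rule for shared k-mers, and the majority-vote read classifier are all assumptions made for the example.

```python
# Illustrative only: build a k-mer -> taxon index once (off-line), then classify
# reads with dictionary lookups. Taxonomy, genomes, and the voting rule are hypothetical.
from collections import Counter

PARENT = {"root": None, "Bacteria": "root", "E_coli": "Bacteria", "B_subtilis": "Bacteria"}


def lineage(taxon):
    """Return the path from `taxon` up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in PARENT."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa share no ancestor")


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(genomes, k):
    """Off-line step: map each k-mer to the LCA of all taxa whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """On-line step: label a read by the most common taxon among its indexed k-mers."""
    votes = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    return votes.most_common(1)[0][0] if votes else None


genomes = {"E_coli": "ACGTACGGA", "B_subtilis": "ACGTTTGCA"}
index = build_index(genomes, k=4)
print(classify("GTACGG", index, k=4))
```

Once the index exists, classifying a read costs only a handful of hash lookups, which is what lets the expensive genome comparison be paid once rather than per sample.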
Subject(s)
Metagenomics/methods; Phylogeny; Algorithms; Classification/methods; Databases, Nucleic Acid; Genome; High-Throughput Nucleotide Sequencing; Software
ABSTRACT
Viral populations in natural infections can have a high degree of sequence diversity, which can directly impact immune escape. However, antibody potency is often tested in vitro with relatively clonal viral populations, such as laboratory virus or pseudotyped virus stocks, which may not accurately represent the genetic diversity of circulating viral genotypes. This can affect the validity of viral phenotype assays, such as antibody neutralization assays. To address this issue, we tested whether recombinant virus carrying SARS-CoV-2 spike (VSV-SARS-CoV-2-S) stocks could be made more genetically diverse by passage, and whether a stock passaged under selective pressure was more capable of escaping monoclonal antibody (mAb) neutralization than unpassaged stock or viral stock passaged without selective pressure. We passaged VSV-SARS-CoV-2-S four times concurrently in three cell lines and then six times with or without polyclonal antiserum selection pressure. All three of the monoclonal antibodies tested neutralized the viral population present in the unpassaged stock. The viral inoculum derived from serial passage without antiserum selection pressure was neutralized by two of the three mAbs. However, the viral inoculum derived from serial passage under antiserum selection pressure escaped neutralization by all three mAbs. Deep sequencing revealed the rapid acquisition of multiple mutations associated with antibody escape in the VSV-SARS-CoV-2-S that had been passaged in the presence of antiserum, including key mutations present in currently circulating Omicron subvariants. These data indicate that viral stock that was generated under polyclonal antiserum selection pressure better reflects the natural environment of the circulating virus and may yield more biologically relevant outcomes in phenotypic assays.
Thus, mAb assessment assays that utilize a more genetically diverse, biologically relevant, virus stock may yield data that are relevant for prediction of mAb efficacy and for enhancing biosurveillance.
Subject(s)
Antibodies, Neutralizing; COVID-19; Humans; SARS-CoV-2/genetics; Antibodies, Viral; Neutralization Tests; Immune Sera; Spike Glycoprotein, Coronavirus/genetics
ABSTRACT
BACKGROUND: High throughput sequencing is beginning to make a transformative impact in the area of viral evolution. Deep sequencing has the potential to reveal the mutant spectrum within a viral sample at high resolution, thus enabling the close examination of viral mutational dynamics both within and between hosts. The challenge, however, is to accurately model the errors in the sequencing data and differentiate real viral mutations, particularly those that exist at low frequencies, from sequencing errors. RESULTS: We demonstrate that overlapping read pairs (ORP), generated by combining short fragment sequencing libraries and longer sequencing reads, significantly reduce sequencing error rates and improve rare variant detection accuracy. Using this sequencing protocol and an error model optimized for variant detection, we are able to capture a large number of genetic mutations present within a viral population at ultra-low frequency levels (<0.05%). CONCLUSIONS: Our rare variant detection strategies have important implications beyond viral evolution and can be applied to any basic and clinical research area that requires the identification of rare mutations.
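Why overlapping read pairs lower the error rate can be shown with a minimal consensus sketch. It assumes the two mates cover exactly the same fragment with equal length and masks any position where they disagree; the function names and the masking rule are illustrative assumptions, not the error model from the study.

```python
# Illustrative only: each base of the fragment is read twice (once per mate).
# Positions where the two reads disagree are masked with 'N' instead of being
# reported as variants. Full, equal-length overlap is assumed for simplicity.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orp_consensus(read1: str, read2: str) -> str:
    """Merge a fully overlapping read pair; read2 is given as sequenced (opposite strand)."""
    mate = reverse_complement(read2)
    if len(mate) != len(read1):
        raise ValueError("this sketch assumes fully overlapping reads of equal length")
    return "".join(a if a == b else "N" for a, b in zip(read1, mate))


# Hypothetical pair: the mates disagree at one position (index 4), so it is masked.
read1 = "ACGTTGCA"
read2 = "TGCTACGT"
print(orp_consensus(read1, read2))
```

If errors in the two reads are independent, a false variant survives only when both mates make the same error at the same position, which is far less likely than a single-read error.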
Subject(s)
DNA Mutational Analysis/methods; High-Throughput Nucleotide Sequencing/methods; Mutation; Viruses/genetics; Benchmarking; Genome, Viral/genetics; Polymerase Chain Reaction
ABSTRACT
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Some methods require changing the NN architecture or training procedure, limiting the selection of NN models. Moreover, predictive uncertainty can come from different sources. It is important to have the ability to separately model different types of predictive uncertainty, as the model can take assorted actions depending on the source of uncertainty. In this paper, we examine UQ methods that estimate different sources of predictive uncertainty for NN models aiming at protein-ligand binding prediction. We use our prior knowledge on chemical compounds to design the experiments. By utilizing a visualization method we create non-overlapping and chemically diverse partitions from a collection of chemical compounds. These partitions are used as training and test set splits to explore NN model uncertainty. We demonstrate how the uncertainties estimated by the selected methods describe different sources of uncertainty under different partitions and featurization schemes and the relationship to prediction error.
ABSTRACT
Minimizing the human and economic costs of the COVID-19 pandemic and future pandemics requires the ability to develop and deploy effective treatments for novel pathogens as soon as possible after they emerge. To this end, we introduce a new computational pipeline for the rapid identification and characterization of binding sites in viral proteins along with the key chemical features, which we call chemotypes, of the compounds predicted to interact with those same sites. The composition of source organisms for the structural models associated with an individual binding site is used to assess the site's degree of structural conservation across different species, including other viruses and humans. We propose a search strategy for novel therapeutics that involves the selection of molecules preferentially containing the most structurally rich chemotypes identified by our algorithm. While we demonstrate the pipeline on SARS-CoV-2, it is generalizable to any new virus, as long as either experimentally solved structures for its proteins are available or sufficiently accurate predicted structures can be constructed.
ABSTRACT
Molecular biology methods and technologies have advanced substantially over the past decade. These new molecular methods should be incorporated among the standard tools of planetary protection (PP) and could be validated for incorporation by 2026. To address the feasibility of applying modern molecular techniques to such an application, NASA conducted a technology workshop with private industry partners, academics, and government agency stakeholders, along with NASA staff and contractors. The technical discussions and presentations of the Multi-Mission Metagenomics Technology Development Workshop focused on modernizing and supplementing the current PP assays. The goals of the workshop were to assess the state of metagenomics and other advanced molecular techniques in the context of providing a validated framework to supplement the bacterial endospore-based NASA Standard Assay and to identify knowledge and technology gaps. In particular, workshop participants were tasked with discussing metagenomics as a stand-alone technology to provide rapid and comprehensive analysis of total nucleic acids and viable microorganisms on spacecraft surfaces, thereby allowing for the development of tailored and cost-effective microbial reduction plans for each hardware item on a spacecraft. Workshop participants recommended metagenomics approaches as the only data source that can adequately feed into quantitative microbial risk assessment models for evaluating the risk of forward (exploring extraterrestrial planet) and back (Earth harmful biological) contamination. Participants were unanimous that a metagenomics workflow, in tandem with rapid targeted quantitative (digital) PCR, represents a revolutionary advance over existing methods for the assessment of microbial bioburden on spacecraft surfaces. The workshop highlighted low biomass sampling, reagent contamination, and inconsistent bioinformatics data analysis as key areas for technology development. 
Finally, it was concluded that implementing metagenomics as an additional workflow for addressing concerns of NASA's robotic mission will represent a dramatic improvement in technology advancement for PP and will benefit future missions where mission success is affected by backward and forward contamination.
Subject(s)
Planets; Space Flight; United States; Humans; Extraterrestrial Environment; Metagenomics; United States National Aeronautics and Space Administration; Spacecraft; Policy
ABSTRACT
We present a structure-based method for finding and evaluating structural similarities in protein regions relevant to ligand binding. PDBspheres comprises an exhaustive library of protein structure regions ('spheres') adjacent to complexed ligands derived from the Protein Data Bank (PDB), along with methods to find and evaluate structural matches between a protein of interest and spheres in the library. PDBspheres uses the LGA (Local-Global Alignment) structure alignment algorithm as the main engine for detecting structural similarities between the protein of interest and template spheres from the library, which currently contains >2 million spheres. To assess confidence in structural matches, an all-atom-based similarity metric takes side chain placement into account. Here, we describe the PDBspheres method, demonstrate its ability to detect and characterize binding sites in protein structures, show how PDBspheres, a strictly structure-based method, performs on a curated dataset of 2528 ligand-bound and ligand-free crystal structures, and use PDBspheres to cluster pockets and assess structural similarities among protein binding sites of 4876 structures in the 'refined set' of the PDBbind 2019 dataset.