Results 1 - 20 of 168
1.
Patterns (N Y) ; 4(12): 100865, 2023 Dec 08.
Article in English | MEDLINE | ID: mdl-38106612

ABSTRACT

Chemical similarity searches are a widely used family of in silico methods for identifying pharmaceutical leads. These methods historically relied on structure-based comparisons to compute similarity. Here, we use a chemical language model to create a vector-based chemical search. We extend previous implementations by creating a prompt engineering strategy that utilizes two different chemical string representation algorithms: one for the query and the other for the database. We explore this method by reviewing search results from nine queries with diverse targets. We find that the method identifies molecules with similar patent-derived functionality to the query, as determined by our validated LLM-assisted patent summarization pipeline. Further, many of these functionally similar molecules have different structures and scaffolds from the query, making them unlikely to be found with traditional chemical similarity searches. This method may serve as a new tool for the discovery of novel molecular structural classes that achieve target functionality.
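The vector-based search described in this abstract can be illustrated with a minimal sketch. It assumes each molecule has already been embedded as a numeric vector and ranks database entries by cosine similarity to the query. The three-dimensional vectors and the names `mol_A`, `mol_B`, `mol_C`, and `vector_search` are hypothetical placeholders, not the authors' model, data, or code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, database, top_k=2):
    """Return the top_k (name, score) pairs ranked by cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in database.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Hypothetical toy embeddings; real ones would come from a chemical language model.
database = {
    "mol_A": [0.9, 0.1, 0.0],
    "mol_B": [0.0, 1.0, 0.0],
    "mol_C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
hits = vector_search(query, database)
```

Because ranking depends only on vector geometry, two molecules with different scaffolds can still score as close neighbors if the embedding model places them near each other.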

2.
bioRxiv ; 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-38014091

ABSTRACT

Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date only ten class II microcins have been described, and discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In datasets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.

3.
Res Sq ; 2023 Sep 13.
Article in English | MEDLINE | ID: mdl-37790501

ABSTRACT

Antimicrobial peptides commonly act by disrupting bacterial membranes, but also frequently damage mammalian membranes. Deciphering the rules governing membrane selectivity is critical to understanding their function and enabling their therapeutic use. Past attempts to decipher these rules have failed because they cannot interrogate adequate peptide sequence variation. To overcome this problem, we develop deep mutational surface localized antimicrobial display (dmSLAY), which reveals comprehensive positional residue importance and flexibility across an antimicrobial peptide sequence. We apply dmSLAY to Protegrin-1, a potent yet toxic antimicrobial peptide, and identify thousands of sequence variants that positively or negatively influence its antibacterial activity. Further analysis reveals that avoiding large aromatic residues and eliminating disulfide-bonded cysteine pairs while maintaining membrane-bound secondary structure greatly improves Protegrin-1 bacterial specificity. Moreover, dmSLAY datasets enable machine learning to expand our analysis to include over 5.7 million sequence variants and reveal full Protegrin-1 mutational profiles driving either bacterial or mammalian membrane specificity. Our results describe an innovative, high-throughput approach for elucidating antimicrobial peptide sequence-structure-function relationships, which can inform synthetic peptide-based drug design.

4.
Sci Rep ; 13(1): 13280, 2023 08 16.
Article in English | MEDLINE | ID: mdl-37587128

ABSTRACT

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
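The idea of combining sequence-based and structure-based predictions can be sketched in a few lines. The abstract describes a combined model that takes the individual model predictions as input. The sketch below substitutes a much simpler weighted average of per-residue probabilities, so it illustrates the principle rather than the authors' method. The probability values and the function names are hypothetical.

```python
def combine_predictions(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two per-class probability dictionaries."""
    classes = set(prob_a) | set(prob_b)
    return {
        c: weight_a * prob_a.get(c, 0.0) + (1.0 - weight_a) * prob_b.get(c, 0.0)
        for c in classes
    }

def predict(probs):
    """Return the class (amino acid) with the highest probability."""
    return max(probs, key=probs.get)

# Hypothetical predictions for one residue position.
seq_model = {"L": 0.5, "K": 0.4, "V": 0.1}      # stand-in for a sequence-based model
struct_model = {"L": 0.2, "K": 0.1, "V": 0.7}   # stand-in for a structure-based model
combined = combine_predictions(seq_model, struct_model)
best = predict(combined)
```

A learned combiner could weight each model differently depending on context (for example, buried versus solvent-exposed sites), which a fixed average cannot do.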


Subject(s)
Amino Acids , Antifibrinolytic Agents , Amino Acid Sequence , Electric Power Supplies , Language
5.
Mol Biol Evol ; 40(9)2023 09 01.
Article in English | MEDLINE | ID: mdl-37619989

ABSTRACT

The most highly expressed genes in microbial genomes tend to use a limited set of synonymous codons, often referred to as "preferred codons." The existence of preferred codons is commonly attributed to selection pressures on various aspects of protein translation including accuracy and/or speed. However, gene expression is condition-dependent and even within single-celled organisms transcript and protein abundances can vary depending on a variety of environmental and other factors. Here, we show that growth rate-dependent expression variation is an important constraint that significantly influences the evolution of gene sequences. Using large-scale transcriptomic and proteomic data sets in Escherichia coli and Saccharomyces cerevisiae, we confirm that codon usage biases are strongly associated with gene expression but highlight that this relationship is most pronounced when gene expression measurements are taken during rapid growth conditions. Specifically, genes whose relative expression increases during periods of rapid growth have stronger codon usage biases than comparably expressed genes whose expression decreases during rapid growth conditions. These findings highlight that gene expression measured in any particular condition tells only part of the story regarding the forces shaping the evolution of microbial gene sequences. More generally, our results imply that microbial physiology during rapid growth is critical for explaining long-term translational constraints.
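Codon usage bias can be quantified in several ways. As one illustration, the sketch below computes relative synonymous codon usage (RSCU), a standard measure in which each codon's count is divided by the mean count across its synonymous codons. This is an assumed choice for illustration; the abstract does not say which metric the authors used, and `toy_gene` is a made-up sequence.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)

def rscu(sequence):
    """RSCU per codon: observed count / mean count of its synonymous codons."""
    seq = sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa, family in SYNONYMS.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids absent from the sequence
        mean = total / len(family)
        for c in family:
            values[c] = counts[c] / mean
    return values

# Hypothetical toy gene: Met, 3x Lys (AAA, AAA, AAG), 3x Glu (all GAA), stop.
toy_gene = "".join(["ATG", "AAA", "AAA", "AAG", "GAA", "GAA", "GAA", "TAA"])
values = rscu(toy_gene)
```

An RSCU of 1.0 means a codon is used as often as expected under uniform synonymous usage; values above 1.0 indicate preference.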


Subject(s)
Codon Usage , Magnoliopsida , Proteomics , Escherichia coli/genetics , Protein Biosynthesis , Saccharomyces cerevisiae/genetics , Bias
6.
bioRxiv ; 2023 Sep 10.
Article in English | MEDLINE | ID: mdl-37547010

ABSTRACT

Antimicrobial peptides commonly act by disrupting bacterial membranes, but also frequently damage mammalian membranes. Deciphering the rules governing membrane selectivity is critical to understanding their function and enabling their therapeutic use. Past attempts to decipher these rules have failed because they cannot interrogate adequate peptide sequence variation. To overcome this problem, we develop deep mutational surface localized antimicrobial display (dmSLAY), which reveals comprehensive positional residue importance and flexibility across an antimicrobial peptide sequence. We apply dmSLAY to Protegrin-1, a potent yet toxic antimicrobial peptide, and identify thousands of sequence variants that positively or negatively influence its antibacterial activity. Further analysis reveals that avoiding large aromatic residues and eliminating disulfide-bonded cysteine pairs while maintaining membrane-bound secondary structure greatly improves Protegrin-1 bacterial specificity. Moreover, dmSLAY datasets enable machine learning to expand our analysis to include over 5.7 million sequence variants and reveal full Protegrin-1 mutational profiles driving either bacterial or mammalian membrane specificity. Our results describe an innovative, high-throughput approach for elucidating antimicrobial peptide sequence-structure-function relationships, which can inform synthetic peptide-based drug design.

7.
bioRxiv ; 2023 Jul 11.
Article in English | MEDLINE | ID: mdl-37502928

ABSTRACT

CRISPR-associated transposons (CASTs) co-opt CRISPR-Cas proteins and Tn7-family transposons for RNA-guided vertical and horizontal transmission. CASTs encode minimal CRISPR arrays but cannot acquire new spacers. Here, we show that CASTs instead co-opt defense-associated CRISPR arrays for horizontal transmission. A bioinformatic analysis shows that all CAST sub-types co-occur with defense-associated CRISPR-Cas systems. Using an E. coli quantitative transposition assay, we show that CASTs use CRISPR RNAs (crRNAs) from these defense systems for horizontal gene transfer. A high-resolution structure of the type I-F CAST-Cascade in complex with a type III-B crRNA reveals that Cas6 recognizes direct repeats via sequence-independent π-π interactions. In addition to using heterologous CRISPR arrays, type V CASTs can also transpose via a crRNA-independent unguided mechanism, even when the S15 co-factor is over-expressed. Over-expressing S15 and the trans-activating CRISPR RNA (tracrRNA) or a single guide RNA (sgRNA) reduces, but does not abrogate, off-target integration for type V CASTs. Exploiting new spacers in defense-associated CRISPR arrays explains how CASTs horizontally transfer to new hosts. More broadly, this work will guide further efforts to engineer the activity and specificity of CASTs for gene editing applications.

8.
bioRxiv ; 2023 Jul 13.
Article in English | MEDLINE | ID: mdl-36993177

ABSTRACT

The most highly expressed genes in microbial genomes tend to use a limited set of synonymous codons, often referred to as "preferred codons." The existence of preferred codons is commonly attributed to selection pressures on various aspects of protein translation including accuracy and/or speed. However, gene expression is condition-dependent and even within single-celled organisms transcript and protein abundances can vary depending on a variety of environmental and other factors. Here, we show that growth rate-dependent expression variation is an important constraint that significantly influences the evolution of gene sequences. Using large-scale transcriptomic and proteomic data sets in Escherichia coli and Saccharomyces cerevisiae, we confirm that codon usage biases are strongly associated with gene expression but highlight that this relationship is most pronounced when gene expression measurements are taken during rapid growth conditions. Specifically, genes whose relative expression increases during periods of rapid growth have stronger codon usage biases than comparably expressed genes whose expression decreases during rapid growth conditions. These findings highlight that gene expression measured in any particular condition tells only part of the story regarding the forces shaping the evolution of microbial gene sequences. More generally, our results imply that microbial physiology during rapid growth is critical for explaining long-term translational constraints.

9.
bioRxiv ; 2023 Jul 09.
Article in English | MEDLINE | ID: mdl-36993648

ABSTRACT

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.

10.
Sci Adv ; 9(2): eade0008, 2023 01 13.
Article in English | MEDLINE | ID: mdl-36630516

ABSTRACT

Peptide macrocycles are a rapidly emerging class of therapeutic, yet the design of their structure and activity remains challenging. This is especially true for those with β-hairpin structure due to weak folding properties and a propensity for aggregation. Here, we use proteomic analysis and common antimicrobial features to design a large peptide library with macrocyclic β-hairpin structure. Using an activity-driven high-throughput screen, we identify dozens of peptides killing bacteria through selective membrane disruption and analyze their biochemical features via machine learning. Active peptides contain a unique constrained structure and are highly enriched for cationic charge with arginine in their turn region. Our results provide a synthetic strategy for structured macrocyclic peptide design and discovery while also elucidating characteristics important for β-hairpin antimicrobial peptide activity.


Subject(s)
Anti-Bacterial Agents , Proteomics , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/chemistry , Peptides/pharmacology , Peptides/chemistry , Bacteria
11.
Curr Opin Struct Biol ; 78: 102518, 2023 02.
Article in English | MEDLINE | ID: mdl-36603229

ABSTRACT

Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that even though the specifics of the machine learning algorithms matter, the more important constraint comes from the availability and quality of the data used during training. In cases where little experimental data are available, unsupervised and self-supervised pre-training with generic protein datasets can still perform well after subsequent refinement via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.


Subject(s)
Machine Learning , Neural Networks, Computer , Algorithms , Amino Acid Sequence , Mutation
12.
ArXiv ; 2023 Dec 18.
Article in English | MEDLINE | ID: mdl-38196747

ABSTRACT

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

13.
Appl Environ Microbiol ; 88(23): e0148622, 2022 12 13.
Article in English | MEDLINE | ID: mdl-36394322

ABSTRACT

Microcins are a class of antimicrobial peptides produced by certain Gram-negative bacterial species to kill or inhibit the growth of competing bacteria. Only 10 unique, experimentally validated class II microcins have been identified, and the majority of these come from Escherichia coli. Although the current representation of microcins is sparse, they exhibit a diverse array of molecular functionalities, uptake mechanisms, and target specificities. This broad diversity from such a small representation suggests that microcins may have untapped potential for bioprospecting peptide antibiotics from genomic data sets. We used a systematic bioinformatics approach to search for verified and novel class II microcins in E. coli and other species within its family, Enterobacteriaceae. Nearly one-quarter of the E. coli genome assemblies contained one or more microcins, where the prevalence of hits to specific microcins varied by isolate phylogroup. E. coli isolates from human extraintestinal and poultry meat sources were enriched for microcins, while those from freshwater were depleted. Putative microcins were found in various abundances across all five distinct phylogenetic lineages of Enterobacteriaceae, with a particularly high prevalence in the "Klebsiella" clade. Representative genome assemblies from species across the Enterobacterales order, as well as a few outgroup species, also contained putative microcin sequences. This study suggests that microcins have a complicated evolutionary history, spanning far beyond our limited knowledge of the currently validated microcins. Efforts to functionally characterize these newly identified microcins have great potential to open a new field of peptide antibiotics and microbiome modulators and elucidate the ways in which bacteria compete with each other.

IMPORTANCE Class II microcins are small bacteriocins produced by strains of Gram-negative bacteria in the Enterobacteriaceae. They are generally understood to play a role in interbacterial competition, although direct evidence of this is limited, and they could prove informative in developing new peptide antibiotics. However, few examples of verified class II microcins exist, and novel microcins are difficult to identify due to their sequence diversity, making it complicated to study them as a group. Here, we overcome this limitation by developing a bioinformatics pipeline to detect microcins in silico. Using this pipeline, we demonstrate that both verified and novel class II microcins are widespread within and outside the Enterobacteriaceae, which has not been systematically shown previously. The observed prevalence of class II microcins suggests that they are ecologically important, and the elucidation of novel microcins provides a resource that can be used to expand our knowledge of the structure and function of microcins as antibacterials.


Subject(s)
Bacteriocins , Escherichia coli , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/chemistry , Bacteria , Bacteriocins/genetics , Bacteriocins/pharmacology , Bacteriocins/chemistry , Enterobacteriaceae , Escherichia coli/genetics , Peptides/genetics , Phylogeny
14.
PLoS One ; 17(5): e0268883, 2022.
Article in English | MEDLINE | ID: mdl-35617346

ABSTRACT

Synthetic biology has successfully advanced our ability to design and implement complex, time-varying genetic circuits to control the expression of recombinant proteins. However, these circuits typically require the production of regulatory genes whose only purpose is to coordinate expression of other genes. When designing very small genetic constructs, such as viral genomes, we may want to avoid introducing such auxiliary gene products while nevertheless encoding complex expression dynamics. To this end, here we demonstrate that varying only the placement and strengths of promoters, terminators, and RNase cleavage sites in a computational model of a bacteriophage genome is sufficient to achieve solutions to a variety of basic gene expression patterns. We discover these genetic solutions by computationally evolving genomes to reproduce desired gene expression time-course data. Our approach shows that non-trivial patterns can be evolved, including patterns where the relative ordering of genes by abundance changes over time. We find that some patterns are easier to evolve than others, and comparable expression patterns can be achieved via different genetic architectures. Our work opens up a novel avenue to genome engineering via fine-tuning the balance of gene expression and gene degradation rates.
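The approach of computationally evolving a genome toward a target expression pattern can be illustrated with a deliberately simplified sketch. It assumes a toy read-through model in which each gene's expression is the sum of all upstream promoter strengths, with no terminators, RNase sites, or time dynamics, and it uses a basic mutate-and-accept loop. The model, the function names, and the target values are all hypothetical and far simpler than the bacteriophage simulation described in the abstract.

```python
import random

def expression(promoters):
    """Toy read-through model: each gene is transcribed from every upstream promoter."""
    levels, running = [], 0.0
    for strength in promoters:
        running += strength
        levels.append(running)
    return levels

def error(promoters, target):
    """Sum of squared differences between modeled and target expression."""
    return sum((e - t) ** 2 for e, t in zip(expression(promoters), target))

def evolve(target, generations=2000, seed=0):
    """Mutate one promoter strength at a time; keep mutants that are no worse."""
    rng = random.Random(seed)
    genome = [1.0] * len(target)
    best = error(genome, target)
    for _ in range(generations):
        mutant = list(genome)
        i = rng.randrange(len(mutant))
        mutant[i] = max(0.0, mutant[i] + rng.gauss(0.0, 0.2))  # strengths stay non-negative
        score = error(mutant, target)
        if score <= best:
            genome, best = mutant, score
    return genome, best

target = [1.0, 3.0, 6.0]  # hypothetical desired expression levels for three genes
genome, final_error = evolve(target)
```

Even in this toy setting, some targets are unreachable: because promoter strengths cannot be negative, a downstream gene can never be expressed below an upstream one, which echoes the abstract's point that some patterns are easier to evolve than others.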


Subject(s)
Gene Regulatory Networks , Synthetic Biology , Gene Expression , Genes, Regulator , Promoter Regions, Genetic
15.
Proc Natl Acad Sci U S A ; 118(49)2021 12 07.
Article in English | MEDLINE | ID: mdl-34845024

ABSTRACT

CRISPR-associated Tn7 transposons (CASTs) co-opt cas genes for RNA-guided transposition. CASTs are exceedingly rare in genomic databases; recent surveys have reported Tn7-like transposons that co-opt Type I-F, I-B, and V-K CRISPR effectors. Here, we expand the diversity of reported CAST systems via a bioinformatic search of metagenomic databases. We discover architectures for all known CASTs, including arrangements of the Cascade effectors, target homing modalities, and minimal V-K systems. We also describe families of CASTs that have co-opted the Type I-C and Type IV CRISPR-Cas systems. Our search for non-Tn7 CASTs identifies putative candidates that include a nuclease dead Cas12. These systems shed light on how CRISPR systems have coevolved with transposases and expand the programmable gene-editing toolkit.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats/genetics , DNA Transposable Elements/genetics , Bacterial Proteins/metabolism , CRISPR-Associated Proteins/metabolism , CRISPR-Cas Systems/genetics , CRISPR-Cas Systems/physiology , Clustered Regularly Interspaced Short Palindromic Repeats/physiology , DNA Transposable Elements/physiology , Endonucleases/genetics , Gene Editing , Metagenome , Metagenomics/methods , RNA, Guide, Kinetoplastida/genetics , Transposases/genetics
16.
J Biol Phys ; 47(4): 435-454, 2021 12.
Article in English | MEDLINE | ID: mdl-34751854

ABSTRACT

One fundamental problem of protein biochemistry is to predict protein structure from amino acid sequence. The inverse problem, predicting either entire sequences or individual mutations that are consistent with a given protein structure, has received much less attention even though it has important applications in both protein engineering and evolutionary biology. Here, we ask whether 3D convolutional neural networks (3D CNNs) can learn the local fitness landscape of protein structure to reliably predict either the wild-type amino acid or the consensus in a multiple sequence alignment from the local structural context surrounding the site of interest. We find that the network can predict wild type with good accuracy, and that network confidence is a reliable measure of whether a given prediction is likely to be correct. Predictions of consensus are less accurate and are primarily driven by whether or not the consensus matches the wild type. Our work suggests that high-confidence mis-predictions of the wild type may identify sites that are primed for mutation and likely targets for protein engineering.


Subject(s)
Neural Networks, Computer , Proteins , Amino Acid Sequence , Amino Acids , Proteins/genetics
17.
Sci Rep ; 11(1): 9622, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33953215

ABSTRACT

Viruses experience selective pressure on the timing and order of events during infection to maximize the number of viable offspring they produce. Additionally, they may experience variability in cellular environments encountered, as individual eukaryotic cells can display variation in gene expression among cells. This leads to a dynamic phenotypic landscape that viruses must face to replicate. To examine replication dynamics displayed by viruses faced with this variable landscape, we have developed a method for fitting a stochastic mechanistic model of viral infection to time-lapse imaging data from high-throughput single-cell poliovirus infection experiments. The model's mechanistic parameters provide estimates of several aspects associated with the virus's intracellular dynamics. We examine distributions of parameter estimates and assess their variability to gain insight into the root causes of variability in viral growth dynamics. We also fit our model to experiments performed under various drug treatments and examine which parameters differ under these conditions. We find that parameters associated with translation and early stage viral replication processes are essential for the model to capture experimentally observed dynamics. In aggregate, our results suggest that differences in viral growth data generated under different treatments can largely be captured by steps that occur early in the replication process.


Subject(s)
Models, Biological , Poliovirus/physiology , Time-Lapse Imaging , Virus Replication/physiology , Host-Pathogen Interactions , Humans
18.
PeerJ ; 9: e11396, 2021.
Article in English | MEDLINE | ID: mdl-33996289

ABSTRACT

Bacteriophages are broadly classified into two distinct lifestyles: temperate and virulent. Temperate phages are capable of a latent phase of infection within a host cell (lysogenic cycle), whereas virulent phages directly replicate and lyse host cells upon infection (lytic cycle). Accurate lifestyle identification is critical for determining the role of individual phage species within ecosystems and their effect on host evolution. Here, we present BACPHLIP, a BACterioPHage LIfestyle Predictor. BACPHLIP detects the presence of a set of conserved protein domains within an input genome and uses these data to predict lifestyle via a Random Forest classifier that was trained on a dataset of 634 phage genomes. On an independent test set of 423 phages, BACPHLIP has an accuracy of 98%, greatly exceeding that of the previously existing tools (79%). BACPHLIP is freely available on GitHub (https://github.com/adamhockenberry/bacphlip), and the code used to build and test the classifier is provided in a separate repository (https://github.com/adamhockenberry/bacphlip-model-dev) for users wishing to interrogate and re-train the underlying classification model.

19.
Protein Sci ; 30(3): 613-623, 2021 03.
Article in English | MEDLINE | ID: mdl-33389765

ABSTRACT

The beta hairpin motif is a ubiquitous protein structural motif that can be found in molecules across the tree of life. This motif, which is also popular in synthetically designed proteins and peptides, is known for its stability and adaptability to broad functions. Here, we systematically probe all 49,000 unique beta hairpin substructures contained within the Protein Data Bank (PDB) to uncover key characteristics correlated with stable beta hairpin structure, including amino acid biases and enriched interstrand contacts. We find that position specific amino acid preferences, while seen throughout the beta hairpin structure, are most evident within the turn region, where they depend on subtle turn dynamics associated with turn length and secondary structure. We also establish a set of broad design principles, such as the inclusion of aspartic acid residues at a specific position and the careful consideration of desired secondary structure when selecting residues for the turn region, that can be applied to the generation of libraries encoding proteins or peptides containing beta hairpin structures.


Subject(s)
Amino Acid Motifs , Computational Biology/methods , Databases, Protein , Proteins , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics
20.
Article in English | MEDLINE | ID: mdl-35445164

ABSTRACT

Gene clusters are sets of co-localized, often contiguous genes that together perform specific functions, many of which are relevant to biotechnology. There is a need for software tools that can extract candidate gene clusters from vast amounts of available genomic data. Therefore, we developed Opfi: a modular pipeline for identification of arbitrary gene clusters in assembled genomic or metagenomic sequences. Opfi contains functions for annotation, deduplication, and visualization of putative gene clusters. It utilizes a customizable rule-based filtering approach for selection of candidate systems that adhere to user-defined criteria. Opfi is implemented in Python, and is available on the Python Package Index and on Bioconda (Grüning et al., 2018).
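Rule-based filtering of candidate gene clusters can be sketched generically as follows. This is not Opfi's API: the function `passes_rules`, its parameters, and the toy loci are hypothetical, chosen only to show how user-defined criteria (required genes, forbidden genes, a maximum genomic span) might be applied to annotated candidates.

```python
def passes_rules(cluster, required, forbidden=(), max_span=None):
    """Check one candidate cluster, given as (gene_name, start, end) tuples on one contig."""
    names = {name for name, _, _ in cluster}
    if not set(required) <= names:
        return False  # at least one required gene is missing
    if names & set(forbidden):
        return False  # a forbidden gene is present
    if max_span is not None and cluster:
        span = max(end for _, _, end in cluster) - min(start for _, start, _ in cluster)
        if span > max_span:
            return False  # genes are too far apart to count as one cluster
    return True

# Hypothetical annotated candidates (coordinates in base pairs).
candidates = {
    "locus_1": [("cas12k", 100, 2100), ("tnsB", 2500, 4300), ("tnsC", 4400, 5300)],
    "locus_2": [("cas12k", 100, 2100), ("tnsB", 30000, 31800)],
    "locus_3": [("tnsB", 50, 1850), ("tnsC", 1900, 2800)],
}
kept = [
    name
    for name, genes in candidates.items()
    if passes_rules(genes, required=["cas12k", "tnsB"], max_span=10000)
]
```

Here `locus_2` is rejected because its genes span more than 10,000 bp and `locus_3` because it lacks a required gene, leaving only `locus_1`.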
