2.
Digit Discov ; 3(6): 1150-1159, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38873033

ABSTRACT

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

3.
Patterns (N Y) ; 4(12): 100865, 2023 Dec 08.
Article in English | MEDLINE | ID: mdl-38106612

ABSTRACT

Chemical similarity searches are a widely used family of in silico methods for identifying pharmaceutical leads. These methods historically relied on structure-based comparisons to compute similarity. Here, we use a chemical language model to create a vector-based chemical search. We extend previous implementations by creating a prompt engineering strategy that utilizes two different chemical string representation algorithms: one for the query and the other for the database. We explore this method by reviewing search results from nine queries with diverse targets. We find that the method identifies molecules with similar patent-derived functionality to the query, as determined by our validated LLM-assisted patent summarization pipeline. Further, many of these functionally similar molecules have different structures and scaffolds from the query, making them unlikely to be found with traditional chemical similarity searches. This method may serve as a new tool for the discovery of novel molecular structural classes that achieve target functionality.
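
The vector-based search described above can be illustrated with a minimal sketch. It assumes each molecule has already been mapped to an embedding vector by some chemical language model; the molecule names and three-dimensional vectors below are hypothetical toy values, and cosine similarity is one common ranking choice rather than necessarily the measure used in the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, database):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy embeddings; real embeddings would come from a model
# and have hundreds of dimensions.
database = {
    "mol_A": [0.9, 0.1, 0.0],
    "mol_B": [0.0, 1.0, 0.0],
    "mol_C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, database)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the ranking depends only on the vectors, two molecules with different scaffolds can still score as close neighbors if the model embeds them similarly.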

4.
bioRxiv ; 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-38014091

ABSTRACT

Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date only ten class II microcins have been described, and discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In datasets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.

5.
Res Sq ; 2023 Sep 13.
Article in English | MEDLINE | ID: mdl-37790501

ABSTRACT

Antimicrobial peptides commonly act by disrupting bacterial membranes, but also frequently damage mammalian membranes. Deciphering the rules governing membrane selectivity is critical to understanding their function and enabling their therapeutic use. Past attempts to decipher these rules have failed because they cannot interrogate adequate peptide sequence variation. To overcome this problem, we develop deep mutational surface localized antimicrobial display (dmSLAY), which reveals comprehensive positional residue importance and flexibility across an antimicrobial peptide sequence. We apply dmSLAY to Protegrin-1, a potent yet toxic antimicrobial peptide, and identify thousands of sequence variants that positively or negatively influence its antibacterial activity. Further analysis reveals that avoiding large aromatic residues and eliminating disulfide bound cysteine pairs while maintaining membrane bound secondary structure greatly improves Protegrin-1 bacterial specificity. Moreover, dmSLAY datasets enable machine learning to expand our analysis to include over 5.7 million sequence variants and reveal full Protegrin-1 mutational profiles driving either bacterial or mammalian membrane specificity. Our results describe an innovative, high-throughput approach for elucidating antimicrobial peptide sequence-structure-function relationships which can inform synthetic peptide-based drug design.

6.
Mol Biol Evol ; 40(9)2023 09 01.
Article in English | MEDLINE | ID: mdl-37619989

ABSTRACT

The most highly expressed genes in microbial genomes tend to use a limited set of synonymous codons, often referred to as "preferred codons." The existence of preferred codons is commonly attributed to selection pressures on various aspects of protein translation including accuracy and/or speed. However, gene expression is condition-dependent and even within single-celled organisms transcript and protein abundances can vary depending on a variety of environmental and other factors. Here, we show that growth rate-dependent expression variation is an important constraint that significantly influences the evolution of gene sequences. Using large-scale transcriptomic and proteomic data sets in Escherichia coli and Saccharomyces cerevisiae, we confirm that codon usage biases are strongly associated with gene expression but highlight that this relationship is most pronounced when gene expression measurements are taken during rapid growth conditions. Specifically, genes whose relative expression increases during periods of rapid growth have stronger codon usage biases than comparably expressed genes whose expression decreases during rapid growth conditions. These findings highlight that gene expression measured in any particular condition tells only part of the story regarding the forces shaping the evolution of microbial gene sequences. More generally, our results imply that microbial physiology during rapid growth is critical for explaining long-term translational constraints.
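
As an illustration of what a codon usage bias measurement can look like, the sketch below computes relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The abstract does not state which bias metric the authors used, so RSCU is an assumption here, and the codon table is deliberately partial and the input sequence hypothetical.

```python
from collections import Counter

# Partial synonymous-codon table from the standard genetic code,
# restricted to three amino acids to keep the example short.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Phe": ["TTT", "TTC"],
}

def rscu(cds):
    """Relative synonymous codon usage for codons in SYNONYMOUS.

    A value above 1 means the codon is used more often than expected
    under equal usage of its synonyms; below 1 means less often.
    Amino acids absent from the sequence are skipped.
    """
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    values = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values

# Hypothetical coding fragment: AAA AAA AAG GAA GAG
values = rscu("AAAAAAAAGGAAGAG")
for codon, value in sorted(values.items()):
    print(codon, round(value, 3))
```

Comparing such values between genes that are upregulated and genes that are downregulated during rapid growth is one way the reported association could be examined.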


Subjects
Codon Usage , Magnoliopsida , Proteomics , Escherichia coli/genetics , Protein Biosynthesis , Saccharomyces cerevisiae/genetics , Bias
7.
Sci Rep ; 13(1): 13280, 2023 08 16.
Article in English | MEDLINE | ID: mdl-37587128

ABSTRACT

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
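
The idea of a combined model that takes individual model predictions as input can be sketched very simply. The abstract does not describe the combined model's architecture, so the weighted averaging below is a stand-in assumption, not the authors' method; the function names and per-residue probabilities are hypothetical.

```python
def combine_predictions(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two per-site amino acid probability tables."""
    combined = {}
    for aa in set(prob_a) | set(prob_b):
        combined[aa] = (weight_a * prob_a.get(aa, 0.0)
                        + (1.0 - weight_a) * prob_b.get(aa, 0.0))
    return combined

def top_prediction(probs):
    """Amino acid with the highest probability."""
    return max(probs, key=probs.get)

# Hypothetical predictions for a single site, truncated to three residues.
sequence_model = {"L": 0.5, "K": 0.4, "E": 0.1}   # e.g. a language model
structure_model = {"L": 0.2, "K": 0.7, "E": 0.1}  # e.g. a 3D CNN

combined = combine_predictions(sequence_model, structure_model)
print(top_prediction(sequence_model), top_prediction(combined))
```

In this toy case the sequence model alone favors L, while the combination favors K because the structure model is more confident. A learned combiner could additionally weight each model by where it tends to be reliable, such as buried versus solvent-exposed sites.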


Subjects
Amino Acids , Antifibrinolytic Agents , Amino Acid Sequence , Electric Power Supplies , Language
8.
bioRxiv ; 2023 Sep 10.
Article in English | MEDLINE | ID: mdl-37547010

ABSTRACT

Antimicrobial peptides commonly act by disrupting bacterial membranes, but also frequently damage mammalian membranes. Deciphering the rules governing membrane selectivity is critical to understanding their function and enabling their therapeutic use. Past attempts to decipher these rules have failed because they cannot interrogate adequate peptide sequence variation. To overcome this problem, we develop deep mutational surface localized antimicrobial display (dmSLAY), which reveals comprehensive positional residue importance and flexibility across an antimicrobial peptide sequence. We apply dmSLAY to Protegrin-1, a potent yet toxic antimicrobial peptide, and identify thousands of sequence variants that positively or negatively influence its antibacterial activity. Further analysis reveals that avoiding large aromatic residues and eliminating disulfide bound cysteine pairs while maintaining membrane bound secondary structure greatly improves Protegrin-1 bacterial specificity. Moreover, dmSLAY datasets enable machine learning to expand our analysis to include over 5.7 million sequence variants and reveal full Protegrin-1 mutational profiles driving either bacterial or mammalian membrane specificity. Our results describe an innovative, high-throughput approach for elucidating antimicrobial peptide sequence-structure-function relationships which can inform synthetic peptide-based drug design.

9.
bioRxiv ; 2023 Jul 11.
Article in English | MEDLINE | ID: mdl-37502928

ABSTRACT

CRISPR-associated transposons (CASTs) co-opt CRISPR-Cas proteins and Tn7-family transposons for RNA-guided vertical and horizontal transmission. CASTs encode minimal CRISPR arrays but cannot acquire new spacers. Here, we show that CASTs instead co-opt defense-associated CRISPR arrays for horizontal transmission. A bioinformatic analysis shows that all CAST sub-types co-occur with defense-associated CRISPR-Cas systems. Using an E. coli quantitative transposition assay, we show that CASTs use CRISPR RNAs (crRNAs) from these defense systems for horizontal gene transfer. A high-resolution structure of the type I-F CAST-Cascade in complex with a type III-B crRNA reveals that Cas6 recognizes direct repeats via sequence-independent π-π interactions. In addition to using heterologous CRISPR arrays, type V CASTs can also transpose via a crRNA-independent unguided mechanism, even when the S15 co-factor is over-expressed. Over-expressing S15 and the trans-activating CRISPR RNA (tracrRNA) or a single guide RNA (sgRNA) reduces, but does not abrogate, off-target integration for type V CASTs. Exploiting new spacers in defense-associated CRISPR arrays explains how CASTs horizontally transfer to new hosts. More broadly, this work will guide further efforts to engineer the activity and specificity of CASTs for gene editing applications.

10.
bioRxiv ; 2023 Jul 13.
Article in English | MEDLINE | ID: mdl-36993177

ABSTRACT

The most highly expressed genes in microbial genomes tend to use a limited set of synonymous codons, often referred to as "preferred codons." The existence of preferred codons is commonly attributed to selection pressures on various aspects of protein translation including accuracy and/or speed. However, gene expression is condition-dependent and even within single-celled organisms transcript and protein abundances can vary depending on a variety of environmental and other factors. Here, we show that growth rate-dependent expression variation is an important constraint that significantly influences the evolution of gene sequences. Using large-scale transcriptomic and proteomic data sets in Escherichia coli and Saccharomyces cerevisiae, we confirm that codon usage biases are strongly associated with gene expression but highlight that this relationship is most pronounced when gene expression measurements are taken during rapid growth conditions. Specifically, genes whose relative expression increases during periods of rapid growth have stronger codon usage biases than comparably expressed genes whose expression decreases during rapid growth conditions. These findings highlight that gene expression measured in any particular condition tells only part of the story regarding the forces shaping the evolution of microbial gene sequences. More generally, our results imply that microbial physiology during rapid growth is critical for explaining long-term translational constraints.

11.
bioRxiv ; 2023 Jul 09.
Article in English | MEDLINE | ID: mdl-36993648

ABSTRACT

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.

12.
Sci Adv ; 9(2): eade0008, 2023 01 13.
Article in English | MEDLINE | ID: mdl-36630516

ABSTRACT

Peptide macrocycles are a rapidly emerging class of therapeutics, yet the design of their structure and activity remains challenging. This is especially true for those with β-hairpin structure due to weak folding properties and a propensity for aggregation. Here, we use proteomic analysis and common antimicrobial features to design a large peptide library with macrocyclic β-hairpin structure. Using an activity-driven high-throughput screen, we identify dozens of peptides killing bacteria through selective membrane disruption and analyze their biochemical features via machine learning. Active peptides contain a unique constrained structure and are highly enriched for cationic charge with arginine in their turn region. Our results provide a synthetic strategy for structured macrocyclic peptide design and discovery while also elucidating characteristics important for β-hairpin antimicrobial peptide activity.


Subjects
Anti-Bacterial Agents , Proteomics , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/chemistry , Peptides/pharmacology , Peptides/chemistry , Bacteria
13.
Curr Opin Struct Biol ; 78: 102518, 2023 02.
Article in English | MEDLINE | ID: mdl-36603229

ABSTRACT

Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that even though the specifics of the machine learning algorithms matter, the more important constraint comes from the data availability and quality utilized during training. In cases where little experimental data are available, unsupervised and self-supervised pre-training with generic protein datasets can still perform well after subsequent refinement via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.


Subjects
Machine Learning , Neural Networks, Computer , Algorithms , Amino Acid Sequence , Mutation
14.
ArXiv ; 2023 Dec 18.
Article in English | MEDLINE | ID: mdl-38196747

ABSTRACT

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

15.
Appl Environ Microbiol ; 88(23): e0148622, 2022 12 13.
Article in English | MEDLINE | ID: mdl-36394322

ABSTRACT

Microcins are a class of antimicrobial peptides produced by certain Gram-negative bacterial species to kill or inhibit the growth of competing bacteria. Only 10 unique, experimentally validated class II microcins have been identified, and the majority of these come from Escherichia coli. Although the current representation of microcins is sparse, they exhibit a diverse array of molecular functionalities, uptake mechanisms, and target specificities. This broad diversity from such a small representation suggests that microcins may have untapped potential for bioprospecting peptide antibiotics from genomic data sets. We used a systematic bioinformatics approach to search for verified and novel class II microcins in E. coli and other species within its family, Enterobacteriaceae. Nearly one-quarter of the E. coli genome assemblies contained one or more microcins, where the prevalence of hits to specific microcins varied by isolate phylogroup. E. coli isolates from human extraintestinal and poultry meat sources were enriched for microcins, while those from freshwater were depleted. Putative microcins were found in various abundances across all five distinct phylogenetic lineages of Enterobacteriaceae, with a particularly high prevalence in the "Klebsiella" clade. Representative genome assemblies from species across the Enterobacterales order, as well as a few outgroup species, also contained putative microcin sequences. This study suggests that microcins have a complicated evolutionary history, spanning far beyond our limited knowledge of the currently validated microcins. Efforts to functionally characterize these newly identified microcins have great potential to open a new field of peptide antibiotics and microbiome modulators and elucidate the ways in which bacteria compete with each other.

IMPORTANCE: Class II microcins are small bacteriocins produced by strains of Gram-negative bacteria in the Enterobacteriaceae. They are generally understood to play a role in interbacterial competition, although direct evidence of this is limited, and they could prove informative in developing new peptide antibiotics. However, few examples of verified class II microcins exist, and novel microcins are difficult to identify due to their sequence diversity, making it complicated to study them as a group. Here, we overcome this limitation by developing a bioinformatics pipeline to detect microcins in silico. Using this pipeline, we demonstrate that both verified and novel class II microcins are widespread within and outside the Enterobacteriaceae, which has not been systematically shown previously. The observed prevalence of class II microcins suggests that they are ecologically important, and the elucidation of novel microcins provides a resource that can be used to expand our knowledge of the structure and function of microcins as antibacterials.


Subjects
Bacteriocins , Escherichia coli , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/chemistry , Bacteria , Bacteriocins/genetics , Bacteriocins/pharmacology , Bacteriocins/chemistry , Enterobacteriaceae , Escherichia coli/genetics , Peptides/genetics , Phylogeny
16.
PLoS One ; 17(5): e0268883, 2022.
Article in English | MEDLINE | ID: mdl-35617346

ABSTRACT

Synthetic biology has successfully advanced our ability to design and implement complex, time-varying genetic circuits to control the expression of recombinant proteins. However, these circuits typically require the production of regulatory genes whose only purpose is to coordinate expression of other genes. When designing very small genetic constructs, such as viral genomes, we may want to avoid introducing such auxiliary gene products while nevertheless encoding complex expression dynamics. To this end, here we demonstrate that varying only the placement and strengths of promoters, terminators, and RNase cleavage sites in a computational model of a bacteriophage genome is sufficient to achieve solutions to a variety of basic gene expression patterns. We discover these genetic solutions by computationally evolving genomes to reproduce desired gene expression time-course data. Our approach shows that non-trivial patterns can be evolved, including patterns where the relative ordering of genes by abundance changes over time. We find that some patterns are easier to evolve than others, and comparable expression patterns can be achieved via different genetic architectures. Our work opens up a novel avenue to genome engineering via fine-tuning the balance of gene expression and gene degradation rates.


Subjects
Gene Regulatory Networks , Synthetic Biology , Gene Expression , Genes, Regulator , Promoter Regions, Genetic
17.
J Biol Phys ; 47(4): 435-454, 2021 12.
Article in English | MEDLINE | ID: mdl-34751854

ABSTRACT

One fundamental problem of protein biochemistry is to predict protein structure from amino acid sequence. The inverse problem, predicting either entire sequences or individual mutations that are consistent with a given protein structure, has received much less attention even though it has important applications in both protein engineering and evolutionary biology. Here, we ask whether 3D convolutional neural networks (3D CNNs) can learn the local fitness landscape of protein structure to reliably predict either the wild-type amino acid or the consensus in a multiple sequence alignment from the local structural context surrounding a site of interest. We find that the network can predict wild type with good accuracy, and that network confidence is a reliable measure of whether a given prediction is likely to be correct. Predictions of consensus are less accurate and are primarily driven by whether or not the consensus matches the wild type. Our work suggests that high-confidence mis-predictions of the wild type may identify sites that are primed for mutation and likely targets for protein engineering.


Subjects
Neural Networks, Computer , Proteins , Amino Acid Sequence , Amino Acids , Proteins/genetics
18.
Proc Natl Acad Sci U S A ; 118(49)2021 12 07.
Article in English | MEDLINE | ID: mdl-34845024

ABSTRACT

CRISPR-associated Tn7 transposons (CASTs) co-opt cas genes for RNA-guided transposition. CASTs are exceedingly rare in genomic databases; recent surveys have reported Tn7-like transposons that co-opt Type I-F, I-B, and V-K CRISPR effectors. Here, we expand the diversity of reported CAST systems via a bioinformatic search of metagenomic databases. We discover architectures for all known CASTs, including arrangements of the Cascade effectors, target homing modalities, and minimal V-K systems. We also describe families of CASTs that have co-opted the Type I-C and Type IV CRISPR-Cas systems. Our search for non-Tn7 CASTs identifies putative candidates that include a nuclease dead Cas12. These systems shed light on how CRISPR systems have coevolved with transposases and expand the programmable gene-editing toolkit.


Subjects
Clustered Regularly Interspaced Short Palindromic Repeats/genetics , DNA Transposable Elements/genetics , Bacterial Proteins/metabolism , CRISPR-Associated Proteins/metabolism , CRISPR-Cas Systems/genetics , CRISPR-Cas Systems/physiology , Clustered Regularly Interspaced Short Palindromic Repeats/physiology , DNA Transposable Elements/physiology , Endonucleases/genetics , Gene Editing , Metagenome , Metagenomics/methods , RNA, Guide, Kinetoplastida/genetics , Transposases/genetics
19.
Sci Rep ; 11(1): 9622, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33953215

ABSTRACT

Viruses experience selective pressure on the timing and order of events during infection to maximize the number of viable offspring they produce. Additionally, they may experience variability in cellular environments encountered, as individual eukaryotic cells can display variation in gene expression among cells. This leads to a dynamic phenotypic landscape that viruses must face to replicate. To examine replication dynamics displayed by viruses faced with this variable landscape, we have developed a method for fitting a stochastic mechanistic model of viral infection to time-lapse imaging data from high-throughput single-cell poliovirus infection experiments. The model's mechanistic parameters provide estimates of several aspects associated with the virus's intracellular dynamics. We examine distributions of parameter estimates and assess their variability to gain insight into the root causes of variability in viral growth dynamics. We also fit our model to experiments performed under various drug treatments and examine which parameters differ under these conditions. We find that parameters associated with translation and early stage viral replication processes are essential for the model to capture experimentally observed dynamics. In aggregate, our results suggest that differences in viral growth data generated under different treatments can largely be captured by steps that occur early in the replication process.


Subjects
Models, Biological , Poliovirus/physiology , Time-Lapse Imaging , Virus Replication/physiology , Host-Pathogen Interactions , Humans
20.
PeerJ ; 9: e11396, 2021.
Article in English | MEDLINE | ID: mdl-33996289

ABSTRACT

Bacteriophages are broadly classified into two distinct lifestyles: temperate and virulent. Temperate phages are capable of a latent phase of infection within a host cell (lysogenic cycle), whereas virulent phages directly replicate and lyse host cells upon infection (lytic cycle). Accurate lifestyle identification is critical for determining the role of individual phage species within ecosystems and their effect on host evolution. Here, we present BACPHLIP, a BACterioPHage LIfestyle Predictor. BACPHLIP detects the presence of a set of conserved protein domains within an input genome and uses this data to predict lifestyle via a Random Forest classifier that was trained on a dataset of 634 phage genomes. On an independent test set of 423 phages, BACPHLIP has an accuracy of 98%, greatly exceeding that of the previously existing tools (79%). BACPHLIP is freely available on GitHub (https://github.com/adamhockenberry/bacphlip) and the code used to build and test the classifier is provided in a separate repository (https://github.com/adamhockenberry/bacphlip-model-dev) for users wishing to interrogate and re-train the underlying classification model.
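
The feature representation described above, presence or absence of a fixed set of conserved protein domains, can be sketched as a binary vector. This is an illustration only and does not reproduce BACPHLIP's code or its domain set; the domain identifiers and the function name are hypothetical placeholders, and the classifier itself is omitted.

```python
# Hypothetical, fixed ordering of conserved domains used as features.
DOMAINS = ["domain_1", "domain_2", "domain_3", "domain_4"]

def presence_vector(detected, domains=DOMAINS):
    """Encode detected domains as a 0/1 vector in a fixed feature order.

    Domains not in the feature set are ignored, and repeated hits to
    the same domain still count as a single presence.
    """
    found = set(detected)
    return [1 if domain in found else 0 for domain in domains]

# Hypothetical domain hits for one genome, e.g. from a profile search.
features = presence_vector(["domain_3", "domain_1", "domain_9", "domain_1"])
print(features)
```

A vector like this, computed for every genome in a labeled training set, is the kind of input a Random Forest classifier could be trained on to separate temperate from virulent phages.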
