Results 1 - 6 of 6
1.
Nat Commun ; 15(1): 6176, 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-39039051

ABSTRACT

Generative deep learning is reshaping drug design. Chemical language models (CLMs) - which generate molecules in the form of molecular strings - bear particular promise for this endeavor. Here, we introduce a recent deep learning architecture, termed the Structured State Space Sequence (S4) model, into de novo drug design. In addition to its unprecedented performance in various fields, S4 has shown remarkable capabilities to learn the global properties of sequences. This aspect is intriguing in chemical language modeling, where complex molecular properties like bioactivity can 'emerge' from separated portions in the molecular string. This observation gives rise to the following question: Can S4 advance chemical language modeling for de novo design? To provide an answer, we systematically benchmark S4 against state-of-the-art CLMs on an array of drug discovery tasks, such as the identification of bioactive compounds, and the design of drug-like molecules and natural products. S4 shows a superior capacity to learn complex molecular properties, while at the same time exploring diverse scaffolds. Finally, when applied prospectively to kinase inhibition, S4 designs eight out of ten molecules that are predicted as highly active by molecular dynamics simulations. Taken together, these findings advocate for the introduction of S4 into chemical language modeling - uncovering its untapped potential in the molecular sciences.


Subject(s)
Molecular Dynamics Simulation , Drug Design , Deep Learning , Chemical Models , Drug Discovery/methods , Biological Products/chemistry
2.
Curr Opin Struct Biol ; 86: 102818, 2024 06.
Article in English | MEDLINE | ID: mdl-38669740

ABSTRACT

Deep learning is becoming increasingly relevant in drug discovery, from de novo design to protein structure prediction and synthesis planning. However, it is often challenged by the small data regimes typical of certain drug discovery tasks. In such scenarios, deep learning approaches, which are notoriously 'data-hungry', might fail to live up to their promise. Developing novel approaches to leverage the power of deep learning in low-data scenarios is attracting great attention, and future developments are expected to propel the field further. This mini-review provides an overview of recent low-data-learning approaches in drug discovery, analyzing their hurdles and advantages. Finally, we venture to provide a forecast of future research directions in low-data learning for drug discovery.


Subject(s)
Deep Learning , Drug Discovery , Drug Discovery/methods , Humans , Proteins/chemistry , Proteins/metabolism
3.
Mol Inform ; 43(3): e202300249, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38196065

ABSTRACT

Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.


Subject(s)
Algorithms , Proteins , Humans , Ligands , Proteins/chemistry , Machine Learning , Molecular Structure
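
The tf-idf selection step described in the abstract of entry 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regex tokenizer and the token n-gram "chemical words" are assumptions standing in for the learned Byte Pair Encoding, WordPiece, or Unigram vocabularies, the function names are hypothetical, and the ligand sets are toy SMILES strings rather than real affinity data.

```python
import math
import re
from collections import Counter

# Assumption: a simple regex tokenizer for SMILES (atoms, bonds, ring digits).
# It stands in for the learned subword vocabularies named in the abstract.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[A-Za-z]|\d|[=#()+\-/\\@.%]")


def chemical_words(smiles, n=3):
    """Return overlapping n-grams of SMILES tokens as toy 'chemical words'."""
    tokens = SMILES_TOKEN.findall(smiles)
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def key_words(target_ligands, top_k=3):
    """Treat each target's ligand set as one document; rank its words by tf-idf."""
    docs = {
        target: Counter(w for smi in ligands for w in chemical_words(smi))
        for target, ligands in target_ligands.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many targets' documents does each word occur?
    doc_freq = Counter(w for counts in docs.values() for w in counts)
    ranked = {}
    for target, counts in docs.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked[target] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return ranked


# Toy ligand sets (illustrative only, not measured binders of any target).
toy = {
    "target_A": ["CC(=O)Nc1ccccc1", "CC(=O)Nc1ccc(O)cc1"],
    "target_B": ["CCN(CC)CC", "CCN(CC)CCO"],
}
print(key_words(toy))
```

Words that occur in every target's document receive an idf of zero under this weighting, so only target-specific fragments rise to the top, which mirrors the abstract's observation that key chemical words are specific to protein targets.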
4.
J Comput Biol ; 30(11): 1240-1245, 2023 11.
Article in English | MEDLINE | ID: mdl-37988394

ABSTRACT

Robust generalization of drug-target affinity (DTA) prediction models is a notoriously difficult problem in computational drug discovery. In this article, we present pydebiaseddta, a software package for improving the generalizability of DTA prediction models to novel ligands and/or proteins. pydebiaseddta serves as the practical implementation of the DebiasedDTA training framework, which advocates modifying the training distribution to mitigate the effect of spurious correlations in the training data set, which lead to substantially degraded performance for novel ligands and proteins. Written in the Python programming language, pydebiaseddta combines a user-friendly streamlined interface with a feature-rich and highly modifiable architecture. With this article we introduce our software, showcase its main functionalities, and describe practical ways for new users to engage with it.


Subject(s)
Programming Languages , Software , Proteins , Drug Discovery
5.
J Comput Biol ; 30(11): 1226-1239, 2023 11.
Article in English | MEDLINE | ID: mdl-37988395

ABSTRACT

Statistical models that accurately predict the binding affinity of an input ligand-protein pair can greatly accelerate drug discovery. Such models are trained on available ligand-protein interaction data sets, which may contain biases that lead the predictor models to learn data set-specific, spurious patterns instead of generalizable relationships. As a result, the prediction performance of these models drops dramatically for previously unseen biomolecules. Various approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading overall prediction performance. In this article, we present DebiasedDTA, a novel training framework for drug-target affinity (DTA) prediction models that addresses data set biases to improve the generalizability of such models. DebiasedDTA relies on reweighting the training samples to achieve robust generalization, and is thus applicable to most DTA prediction models. Extensive experiments with different biomolecule representations, model architectures, and data sets demonstrate that DebiasedDTA achieves improved generalizability in predicting drug-target affinities.


Subject(s)
Statistical Models , Proteins , Ligands , Proteins/chemistry , Drug Discovery
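
The abstract of entry 5 states only that training samples are reweighted; it does not say how the weights are obtained. The sketch below therefore rests on a labeled assumption: weights are taken to be proportional to the per-sample error of a hypothetical weak "guide" model, so that samples which are not explained by easy patterns count more in the loss. The function names are illustrative and are not the pydebiaseddta API.

```python
import math


def importance_weights(guide_errors, floor=1e-6):
    """Normalize per-sample guide-model errors into weights that sum to 1.

    Assumption: samples the weak guide model predicts poorly are less likely
    to be explained by data set-specific shortcuts, so they get larger weights.
    """
    raw = [max(err, floor) for err in guide_errors]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each sample contributes according to its weight."""
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred))


# Toy affinities and predictions (illustrative numbers only).
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
guide_errors = [0.1, 0.1, 0.8]  # hypothetical absolute errors of the guide model

weights = importance_weights(guide_errors)
uniform = [1.0 / len(y_true)] * len(y_true)

print(weighted_mse(y_true, y_pred, uniform))  # plain MSE
print(weighted_mse(y_true, y_pred, weights))  # reweighted loss
```

Because the reweighting touches only the loss, the same idea can wrap any predictor that accepts per-sample weights, which is consistent with the abstract's claim that the framework applies to most DTA prediction models.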
6.
Mol Inform ; 40(5): e2000212, 2021 05.
Article in English | MEDLINE | ID: mdl-33225594

ABSTRACT

Identification of high-affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical-language-based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. Our aim is to process the patterns in SMILES as a language to predict protein-ligand affinity, even when we cannot infer the function from the sequence. We used eXtreme Gradient Boosting to predict protein-ligand affinities in the KIBA and BindingDB data sets. ChemBoost was able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost was more robust to changes in protein sequence similarity and successfully captured the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.


Subject(s)
Drug Discovery/methods , Machine Learning , Protein Binding , Computational Biology/methods , Computational Chemistry/methods , Ligands