Pesquisa | Secretaria de Estado da Saúde

Improving Molecule Generation and Drug Discovery with a Knowledge-enhanced Generative Model.

Malusare, Aditya; Aggarwal, Vaneet.

ArXiv ; 2024 Feb 13.

Artigo em Inglês | MEDLINE | ID: mdl-38410649

RESUMO

Recent advancements in generative models have established state-of-the-art benchmarks in the generation of molecules and novel drug candidates. Despite these successes, a significant gap persists between generative models and the utilization of extensive biomedical knowledge, often systematized within knowledge graphs, whose potential to inform and enhance generative processes has not been realized. In this paper, we present a novel approach that bridges this divide by developing a framework for knowledge-enhanced generative models called K-DReAM. We develop a scalable methodology to extend the functionality of knowledge graphs while preserving semantic integrity, and incorporate this contextual information into a generative framework to guide a diffusion-based model. The integration of knowledge graph embeddings with our generative model furnishes a robust mechanism for producing novel drug candidates possessing specific characteristics while ensuring validity and synthesizability. K-DReAM outperforms state-of-the-art generative models on both unconditional and targeted generation tasks.

Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision.

Malusare, Aditya; Kothandaraman, Harish; Tamboli, Dipesh; Lanman, Nadia A; Aggarwal, Vaneet.

ArXiv ; 2024 Feb 14.

Artigo em Inglês | MEDLINE | ID: mdl-38410643

RESUMO

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promotors and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

Detalhe da pesquisa