1.
bioRxiv ; 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39131267

ABSTRACT

Protein Language Models (pLMs) have revolutionized the computational modeling of protein systems, building numerical embeddings that are centered around structural features. To enhance the breadth of biochemically relevant properties available in protein embeddings, we engineered the Annotation Vocabulary, a transformer-readable language of protein properties defined by structured ontologies. We trained Annotation Transformers (AT) from the ground up to recover masked protein property inputs without reference to amino acid sequences, building a new numerical feature space on protein descriptions alone. We leverage AT representations in various model architectures, for both protein representation and generation. To showcase the merit of Annotation Vocabulary integration, we performed 515 diverse downstream experiments. Using a novel loss function and only $3 in commercial compute, our premier representation model CAMP produces state-of-the-art embeddings for five out of 15 common datasets with competitive performance on the rest, highlighting the computational efficiency of latent space curation with Annotation Vocabulary. To standardize the comparison of de novo generated protein sequences, we suggest a new sequence alignment-based score that is more flexible and biologically relevant than traditional language modeling metrics. Our generative model, GSM, produces high alignment scores from annotation-only prompts with a BERT-like generation scheme. Of particular note, many GSM hallucinations return statistically significant BLAST hits, where enrichment analysis shows properties matching the annotation prompt - even when the ground truth has low sequence identity to the entire training set. Overall, the Annotation Vocabulary toolbox presents a promising pathway to replace traditional tokens with members of ontologies and knowledge graphs, enhancing transformer models in specific domains.
The concise, accurate, and efficient descriptions of proteins by the Annotation Vocabulary offer a novel way to build numerical representations of proteins for protein annotation and design.
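
The abstract above mentions a sequence alignment-based score for comparing generated and reference sequences but does not define it. As a rough illustration of the general idea only, the sketch below computes a Needleman-Wunsch global alignment score and normalizes it by the reference's self-alignment score. The function names, the match/mismatch/gap values, and the normalization are all assumptions made for this example, not the authors' metric.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative defaults (an assumption), not a
    substitution matrix such as BLOSUM62.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def normalized_alignment_score(generated: str, reference: str) -> float:
    """Alignment score relative to the reference aligned with itself.

    Hypothetical normalization: 1.0 means identical sequences, and
    negative raw scores are clipped to 0.0.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    best = global_alignment_score(reference, reference)
    return max(0.0, global_alignment_score(generated, reference) / best)
```

Unlike a per-token metric, a score of this kind tolerates insertions and deletions, so a generated sequence that is shifted by one residue is not penalized at every position.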

2.
bioRxiv ; 2023 Sep 17.
Article in English | MEDLINE | ID: mdl-37745387

ABSTRACT

Recent advancements in Protein Language Models (pLMs) have enabled high-throughput analysis of proteins through primary sequence alone. At the same time, newfound evidence illustrates that codon usage bias is remarkably predictive and can even change the final structure of a protein. Here, we explore these findings by extending the traditional vocabulary of pLMs from amino acids to codons to encapsulate more information inside CoDing Sequences (CDS). We build upon traditional transfer learning techniques with a novel pipeline of token embedding matrix seeding, masked language modeling, and student-teacher knowledge distillation, called MELD. This transformed the pretrained ProtBERT into cdsBERT, a pLM with a codon vocabulary trained on a massive corpus of CDS. Interestingly, cdsBERT variants produced a highly biochemically relevant latent space, outperforming their amino acid-based counterparts on enzyme commission number prediction. Further analysis revealed that synonymous codon token embeddings moved distinctly in the embedding space, showcasing unique additions of information across broad phylogeny inside these traditionally "silent" mutations. This embedding movement correlated significantly with average usage bias across phylogeny. Future fine-tuned organism-specific codon pLMs may show a more significant increase in codon usage fidelity. This work highlights the potential of the codon vocabulary to improve current state-of-the-art structure and function prediction, which will require a codon pLM foundation model and the addition of high-quality CDS to large-scale protein databases.
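
To make the amino acid versus codon vocabulary distinction concrete, here is a minimal sketch of codon-level tokenization. It is not the cdsBERT tokenizer: the names `codon_tokenize`, `translate`, and `CODON_TO_ID`, and the choice of token ids, are assumptions for illustration. The translation table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))
# Hypothetical vocabulary: one integer id per codon (64 tokens).
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def codon_tokenize(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens (RNA input is accepted)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    unknown = [c for c in codons if c not in CODON_TO_ID]
    if unknown:
        raise ValueError(f"unrecognized codons: {unknown}")
    return codons


def translate(codons: list[str]) -> str:
    """Map codon tokens to amino acid letters ('*' marks a stop codon)."""
    return "".join(CODON_TO_AA[c] for c in codons)


tokens = codon_tokenize("ATGGCTGCCTAA")
ids = [CODON_TO_ID[t] for t in tokens]
protein = translate(tokens)
```

In this example `GCT` and `GCC` both translate to alanine, so an amino acid vocabulary sees the same token twice, while the codon vocabulary assigns them different ids. That difference is the extra information a codon-level model can use.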

3.
Sci Rep ; 13(1): 2088, 2023 02 06.
Article in English | MEDLINE | ID: mdl-36747072

ABSTRACT

In this study, we investigate how an organism's codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement.
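
The two inputs this abstract relies on, codon usage frequencies and candidate ORFs, can be sketched as follows. This is a simplified illustration under stated assumptions: only the forward strand is scanned, ORFs run from the first `ATG` to the next in-frame stop codon, and the helper names `codon_usage` and `find_orfs` and the `min_codons` parameter are hypothetical. The authors' statistical acceptance test is not described in the abstract and is not reproduced here.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_usage(seq: str) -> dict[str, float]:
    """Relative frequency of each codon in frame 0 (trailing bases ignored)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def find_orfs(seq: str, min_codons: int = 2) -> list[str]:
    """Candidate ORFs (ATG ... stop, inclusive) in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append(seq[start:i + 3])
                start = None
    return orfs


candidates = find_orfs("CCATGGCTTAAGG")
usage = codon_usage("ATGGCTGCTTAA")
```

A downstream classifier could then compare the codon usage vector of each candidate against a reference distribution to accept or reject it; which comparison to use is left open here.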

Subjects
Genomics, Machine Learning, Phylogeny, Reproducibility of Results, Codon/genetics, Nucleotides

4.
J Pers Med ; 11(12)2021 Dec 04.
Article in English | MEDLINE | ID: mdl-34945766

ABSTRACT

Heart diseases are some of the most common and pressing threats to human health worldwide. The American Heart Association and the National Institutes of Health jointly work to annually update data on cardiac diseases. In 2018, 126.9 million Americans were reported as having some form of cardiac disorder, with an estimated direct and indirect total cost of USD 363.4 billion. This necessitates developing therapeutic interventions for heart diseases to improve human life expectancy and provide economic relief. In this review, we look into gamma-secretase as a potential therapeutic target for cardiac diseases. Gamma-secretase, an aspartyl protease enzyme, is responsible for the cleavage and activation of a number of substrates that are relevant to normal cardiac development and function, as found in mutation studies. Some of these substrates are involved in downstream signaling processes and crosstalk with pathways relevant to heart diseases. Most of the substrates and signaling events we explored were found to be potentially beneficial to maintain cardiac function in diseased conditions. This review presents an updated overview of the current knowledge on gamma-secretase processing of cardiac-relevant substrates and seeks to understand if the modulation of gamma-secretase activity would be beneficial to combat cardiac diseases.

5.
J Pers Med ; 11(12)2021 Dec 16.
Article in English | MEDLINE | ID: mdl-34945845

ABSTRACT

Heat shock protein 90 (Hsp90) is a molecular chaperone that interacts with up to 10% of the proteome. The extensive involvement in protein folding and regulation of protein stability within cells makes Hsp90 an attractive therapeutic target to correct multiple dysfunctions. Many of the clients of Hsp90 are found in pathways known to be pathogenic in the heart, ranging from transforming growth factor β (TGF-β) and mitogen-activated protein kinase (MAPK) signaling to tumor necrosis factor α (TNFα), Gs and Gq G-protein-coupled receptor (GPCR) and calcium (Ca2+) signaling. These pathways can therefore be targeted through modulation of Hsp90 activity. The activity of Hsp90 can be targeted through small-molecule inhibition. However, small-molecule inhibitors of Hsp90 have been found to be cardiotoxic in some cases. In this regard, specific targeting of Hsp90 by modulation of post-translational modifications (PTMs) emerges as an attractive strategy. In this review, we aim to address how Hsp90 functions, where Hsp90 interacts within pathological pathways, and current knowledge of small molecules and PTMs known to modulate Hsp90 activity and their potential as therapeutics in cardiac diseases.
