Masked Language Modeling for Resource Constrained Biological Natural Language Processing.
Article in English | MEDLINE | ID: mdl-38083556
ABSTRACT
Recent advances in Natural Language Processing (NLP) have produced state-of-the-art results on several sequence-to-sequence (seq2seq) tasks. Enhancements in embedders and their training methodologies have shown significant improvements on downstream tasks. Word vector models such as Word2Vec, FastText, and GloVe were widely used in place of one-hot encoded vectors for years, until the advent of deep contextualized embedders. Protein sequences consist of 20 naturally occurring amino acids and can be treated as the language of nature; combinations of these amino acids give rise to biological function. The choice of vector representation and architecture design for a biological task depends heavily on the nature of the task. We utilize unlabelled protein sequences to train a Convolution and Gated Recurrent Network (CGRN) embedder using the Masked Language Modeling (MLM) technique, which yields a significant performance boost under resource-constrained settings on two downstream tasks: an F1-score (Q8) of 73.1% on Secondary Structure Prediction (SSP) and an F1-score of 84% on Intrinsically Disordered Region Prediction (IDRP). We also compare different architectures on the downstream tasks to show the impact of the nature of a biological task on model performance.
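The core idea of the MLM pre-training described above is to corrupt a protein sequence by hiding a fraction of its residues and train the embedder to recover them. The following is a minimal sketch of that masking step only; the 15% masking rate, the `<mask>` token, and the function names are illustrative BERT-style defaults, not the paper's exact settings or code.

```python
import random

# The 20 naturally occurring amino acids, treated as the "vocabulary"
# of the protein language, as described in the abstract.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK_TOKEN = "<mask>"

def mask_sequence(sequence, mask_prob=0.15, seed=None):
    """MLM corruption: hide a fraction of residues for the model to predict.

    Returns (masked_tokens, labels), where labels hold the original residue
    at masked positions and None elsewhere, so training loss is computed
    only over the hidden residues.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for residue in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(residue)   # model must recover this residue
        else:
            masked.append(residue)
            labels.append(None)      # position is not scored
    return masked, labels

# Hypothetical example sequence, not from the paper's dataset.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, labels = mask_sequence(seq, seed=0)
```

An embedder such as the CGRN would then consume `masked` and be trained to predict the residues recorded in `labels`, learning contextual representations from unlabelled sequences alone.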
Subject(s)

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Natural Language Processing / Language Language: English Journal: Annu Int Conf IEEE Eng Med Biol Soc Year: 2023 Document type: Article
