ABSTRACT
CD4+ T cell responses are exquisitely antigen specific and directed toward peptide epitopes displayed by human leukocyte antigen class II (HLA-II) on antigen-presenting cells. Underrepresentation of diverse alleles in ligand databases and an incomplete understanding of factors affecting antigen presentation in vivo have limited progress in defining principles of peptide immunogenicity. Here, we employed monoallelic immunopeptidomics to identify 358,024 HLA-II binders, with a particular focus on HLA-DQ and HLA-DP. We uncovered peptide-binding patterns across a spectrum of binding affinities and enrichment of structural antigen features. These aspects underpinned the development of the context-aware predictor of T cell antigens (CAPTAn), a deep learning model that predicts peptide antigens based on their affinity to HLA-II and the full sequence of their source proteins. CAPTAn was instrumental in discovering prevalent T cell epitopes from bacteria in the human microbiome and a pan-variant epitope from SARS-CoV-2. Together, CAPTAn and its associated datasets present a resource for antigen discovery and for unraveling the genetic associations of HLA alleles with immunopathologies.
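As a loose illustration of the modeling idea only (not the published CAPTAn architecture), the sketch below scores a candidate peptide from both its own sequence and the full sequence of its source protein, as the abstract describes; all layer choices, sizes, and names are assumptions.

# Minimal sketch, assuming a two-encoder design: one encoder for the short
# peptide and one for the full source protein supplying sequence context.
# Not the authors' model; sizes are illustrative.
import torch
import torch.nn as nn

AA_VOCAB = 21  # 20 amino acids + padding index 0

class ContextAwareAntigenScorer(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(AA_VOCAB, embed_dim, padding_idx=0)
        # Encoder for the candidate peptide (putative HLA-II binder).
        self.peptide_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Encoder for the full source protein, providing context.
        self.protein_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Joint head: peptide summary + protein summary -> antigen score.
        self.head = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, peptide_ids, protein_ids):
        _, pep_h = self.peptide_enc(self.embed(peptide_ids))
        _, prot_h = self.protein_enc(self.embed(protein_ids))
        pep = torch.cat([pep_h[-2], pep_h[-1]], dim=-1)   # forward + backward states
        prot = torch.cat([prot_h[-2], prot_h[-1]], dim=-1)
        return self.head(torch.cat([pep, prot], dim=-1)).squeeze(-1)

# Toy usage: one 15-mer peptide inside a 120-residue protein (random tokens).
model = ContextAwareAntigenScorer()
score = model(torch.randint(1, 21, (1, 15)), torch.randint(1, 21, (1, 120)))
print(score.shape)  # torch.Size([1])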
Subjects
COVID-19, Deep Learning, Humans, Captan, SARS-CoV-2, HLA Antigens, T-Lymphocyte Epitopes, Peptides
ABSTRACT
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
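A minimal sketch of the variant-scoring pattern such models enable: a toy convolutional network over one-hot DNA whose output difference between reference and alternate alleles serves as a predicted variant effect. The architecture and output tracks are placeholders, not any specific published model.

# Illustrative only: tiny CNN over one-hot DNA; a variant "effect" is the
# difference in predicted signal between alternate and reference alleles.
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    x = torch.zeros(4, len(seq))
    for i, b in enumerate(seq):
        x[BASES[b], i] = 1.0
    return x.unsqueeze(0)  # shape (1, 4, L)

class RegulatoryCNN(nn.Module):
    def __init__(self, n_tracks=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(64, n_tracks),  # e.g. one output per epigenetic track
        )

    def forward(self, x):
        return self.net(x)

model = RegulatoryCNN()
ref = "ACGT" * 25
alt = ref[:50] + "A" + ref[51:]  # single-nucleotide substitution at position 50
with torch.no_grad():
    delta = model(one_hot(alt)) - model(one_hot(ref))
print(delta)  # predicted variant effect on the modeled track(s)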
Subjects
Deep Learning, Gene Expression Regulation, Genetic Transcription, Humans, Human Genome, Genetic Epigenesis
ABSTRACT
Bangladesh's subtropical climate with an abundance of sunlight throughout the greater portion of the year results in increased effectiveness of solar panels. Solar irradiance forecasting is an essential aspect of grid-connected photovoltaic systems to efficiently manage solar power's variation and uncertainty and to assist in balancing power supply and demand. This is why it is essential to forecast solar irradiation accurately. Many meteorological factors influence solar irradiation, which has a high degree of fluctuation and uncertainty. Predicting solar irradiance multiple steps ahead makes it difficult for forecasting models to capture long-term sequential relationships. Attention-based models are widely used in the field of Natural Language Processing for their ability to learn long-term dependencies within sequential data. In this paper, our aim is to present an attention-based model framework for multivariate time series forecasting. Using data from two different locations in Bangladesh with a resolution of 30 min, the Attention-based encoder-decoder, Transformer, and Temporal Fusion Transformer (TFT) models are trained and tested to predict over 24 steps ahead and compared with other forecasting models. According to our findings, adding the attention mechanism significantly increased prediction accuracy and TFT has shown to be more precise than the rest of the algorithms in terms of accuracy and robustness. The obtained mean square error (MSE), the mean absolute error (MAE), and the coefficient of determination (R2) values for TFT are 0.151, 0.212, and 0.815, respectively. In comparison to the benchmark and sequential models (including the Naive, MLP, and Encoder-Decoder models), TFT has a reduction in the MSE and MAE of 8.4-47.9% and 6.1-22.3%, respectively, while R2 is raised by 2.13-26.16%. The ability to incorporate long-distance dependency increases the predictive power of attention models.
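For reference, the reported evaluation metrics (MSE, MAE, R2) can be computed as below; the arrays are placeholders standing in for 24-step-ahead irradiance forecasts, not the paper's data.

# Minimal sketch of the three metrics with NumPy; inputs are synthetic.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 1, size=(48, 24))   # 48 windows x 24 steps ahead
y_pred = y_true + rng.normal(0, 0.1, size=y_true.shape)
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))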
ABSTRACT
Introduction: Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while others use direct measures such as episomal assays of the enhancer properties of DNA sequences (STARR-seq) and direct measurement of transcription factor binding (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and by secondary processes such as chromatin accessibility, DNA methylation, and bound histone marks. Methods: Here, machine learning models are employed to evaluate how accurately cis-regulatory elements identified by various commonly used sequencing techniques can be predicted from their underlying sequence alone, distinguishing cis-regulatory activity that reflects sequence content from activity that reflects secondary processes. Results and discussion: Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset, and its data are appropriate for training models that detect regulatory activity from sequence alone; that STARR-seq data are best for training enhancer-specific sequence models; and that H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models of cis-regulatory element prediction.
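A hedged sketch of the kind of per-assay comparison described above (not the authors' pipeline): a sequence-only k-mer classifier is trained for each assay and held-out accuracy is compared across assays. The sequences here are random placeholders standing in for assay peaks and matched background regions.

# Illustrative only: k-mer features + logistic regression per assay.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def kmer_counts(seq):
    # Non-overlapping counts for brevity; real pipelines often use overlapping counts.
    return np.array([seq.count(m) for m in KMERS], dtype=float)

rng = np.random.default_rng(1)
def random_seq(n=200):
    return "".join(rng.choice(list("ACGT"), size=n))

# One dataset per assay: positives (peaks) vs negatives (background).
for assay in ["DNase-seq", "STARR-seq", "H3K4me1 ChIP-seq"]:
    X = np.stack([kmer_counts(random_seq()) for _ in range(200)])
    y = np.array([1] * 100 + [0] * 100)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"{assay}: sequence-only CV accuracy = {acc:.2f}")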
Subjects
Drosophila melanogaster, Histones, Animals, Histones/genetics, DNA Sequence Analysis, Chromatin/genetics, Deoxyribonucleases
ABSTRACT
Introduction: Large pretrained language models have recently conquered the field of natural language processing. As an alternative to the predominant masked language modeling introduced in BERT, the T5 model introduced a more general training objective, sequence-to-sequence transformation, which more naturally fits text generation tasks. Monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. Methods: We trained two different-sized T5-type sequence-to-sequence models for the morphologically rich Slovene language with far fewer resources. We analyzed the behavior of the new models on 11 tasks: eight classification tasks (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task) and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, the multilingual mBART-50 model, and four encoder-only BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, the trilingual Croatian-Slovene-English BERT, and the monolingual Slovene RoBERTa model. Results: On the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, they are helpful for generative tasks and provide several useful results. In general, model size matters, and there is currently not enough Slovene training data to pretrain large models successfully. Discussion: While the results were obtained on Slovene, we believe they may generalize to other less-resourced languages for which such models will be built. We make the training and evaluation code, as well as the trained models, publicly available.
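The text-to-text usage pattern the abstract relies on looks roughly as follows with the Hugging Face transformers API; google/mt5-small is used as a stand-in checkpoint, and a trained SloT5 model would be loaded the same way under its own identifier.

# Minimal sketch of T5-style sequence-to-sequence inference.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "google/mt5-small"  # placeholder checkpoint; swap in a SloT5 model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Every task is cast as text in -> text out, e.g. summarization:
inputs = tok("summarize: <document text>", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))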
ABSTRACT
Computational models of memory are often expressed as hierarchical sequence models, but the hierarchies in these models are typically fairly shallow, reflecting the tendency for memories of superordinate sequence states to become increasingly conflated. This article describes a broad-coverage probabilistic sentence processing model that uses a variant of a left-corner parsing strategy to flatten the operations of sentence processing into a similarly shallow hierarchy of learned sequences. The main result of this article is that a broad-coverage model with constraints on hierarchy depth can process large newspaper corpora with the same accuracy as a state-of-the-art parser not defined in terms of sequential working memory operations.
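To make the memory constraint concrete, here is a toy left-corner recognizer with a hard cap on the number of simultaneously pending subgoals; the grammar, lexicon, and bound are illustrative, not the article's broad-coverage model.

# Illustrative sketch: left-corner recognition with a depth bound, so deep
# center-embedding is rejected while ordinary branching stays shallow.
LEXICON = {"the": "D", "dog": "N", "cat": "N", "saw": "V"}
RULES = [("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("VP", ["V", "NP"])]

def complete(cat, goal, toks, depth, bound):
    # A finished left corner `cat` climbs toward `goal` via rules whose
    # first child matches it; each climb may demand further subgoals.
    if cat == goal:
        yield toks
    for lhs, (first, *rest) in RULES:
        if first == cat:
            for toks2 in seq(rest, toks, depth, bound):
                yield from complete(lhs, goal, toks2, depth, bound)

def seq(cats, toks, depth, bound):
    # Recognize a sequence of sought categories left to right.
    if not cats:
        yield toks
        return
    for toks2 in parse(cats[0], toks, depth + 1, bound):
        yield from seq(cats[1:], toks2, depth, bound)

def parse(goal, toks, depth, bound):
    # The bound caps how many constituents may be incomplete at once.
    if depth > bound or not toks or toks[0] not in LEXICON:
        return
    yield from complete(LEXICON[toks[0]], goal, toks[1:], depth, bound)

def recognize(sent, bound=4):
    return any(rest == [] for rest in parse("S", sent.split(), 0, bound))

print(recognize("the dog saw the cat"))  # True within a shallow memory bound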