ABSTRACT
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics. Traditionally, this challenge has proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. The strength of deep learning models lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model: CNN-CBAM+BiGRU. It incorporates a Convolutional Block Attention Module (CBAM) alongside bidirectional gated recurrent units (BiGRUs). CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data, leading to more accurate feature extraction. BiGRUs, a type of Recurrent Neural Network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model integrates the strengths of both CNN-CBAM and BiGRU. This study's findings, validated through experimentation, showcase the effectiveness of this combined approach. For the human dataset, the proposed method outperforms the CNN-BIGRU+ATT model by +1.0% for cellular components, +1.1% for molecular functions, and +0.5% for biological processes. For the yeast dataset, it outperforms the CNN-BIGRU+ATT model by +2.4% for cellular components, +1.2% for molecular functions, and +0.6% for biological processes.
Subject(s)
Computational Biology , Neural Networks, Computer , Proteins , Computational Biology/methods , Humans , Proteins/genetics , Proteins/metabolism , Deep Learning , Databases, Protein , Algorithms , Amino Acid Sequence
ABSTRACT
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNNs are good at capturing short-term relationships, GRUs and LSTMs can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention based model with protein language model embedding that effectively combines the output of the CNN with the output of the BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO by 1.9% for the cellular component prediction task, 3.8% for the molecular function prediction task, and 0.6% for the biological process prediction task on the human dataset, and by 2.4%, 5.2%, and 1.2% for the same three tasks, respectively, on the yeast dataset.
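The abstract reports results as Fmax without defining it. As a rough illustration, the sketch below computes the protein-centric Fmax in the form commonly used in function-prediction benchmarks such as CAFA: the best F-measure over score thresholds, with precision averaged over proteins that receive at least one prediction and recall averaged over all proteins. That the paper uses exactly this variant is an assumption, and the GO term names and scores are made up for the example.

```python
from statistics import mean


def fmax(true_terms, scores, thresholds=None):
    """Protein-centric Fmax: best F-measure over all score thresholds.

    true_terms: list of sets of annotated terms, one (non-empty) set per protein.
    scores:     list of dicts mapping term -> predicted score in [0, 1].
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, pred_scores in zip(true_terms, scores):
            predicted = {term for term, s in pred_scores.items() if s >= t}
            hits = len(predicted & truth)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein.
            recalls.append(hits / len(truth))
        if not precisions:
            continue
        p, r = mean(precisions), mean(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical toy data: two proteins, invented term identifiers.
true_terms = [{"GO:A", "GO:B"}, {"GO:C", "GO:D"}]
scores = [
    {"GO:A": 0.9, "GO:B": 0.4, "GO:C": 0.2},
    {"GO:C": 0.8, "GO:A": 0.3},
]
print(round(fmax(true_terms, scores), 3))  # 0.857
```

In this toy case the maximum is reached for thresholds just above 0.3, where the first protein is predicted perfectly and the second has precision 1 and recall 0.5.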
Subject(s)
Deep Learning , Humans , Saccharomyces cerevisiae/genetics , Amino Acid Sequence , Language , Virion
ABSTRACT
Proteins are represented in various ways, each contributing differently to protein-related tasks. Here, information from each representation (protein sequence, 3D structure, and interaction data) is combined for an efficient protein function prediction task. Recently, uni-modal approaches have produced promising results with state-of-the-art attention mechanisms that learn the relative importance of features, whereas multi-modal approaches have produced promising results by simply concatenating features obtained computationally from different representations, which increases the overall number of trainable parameters. In this paper, we propose a novel, light-weight cross-modal multi-attention (CrMoMulAtt) mechanism that captures the relative contribution of each modality with a lower number of trainable parameters. The proposed mechanism shows a higher contribution from PPI and a lower contribution from structure data. The results obtained from the proposed CrossPredGO mechanism demonstrate an increase in Fmax in the range of +3.29% to +7.20%, with up to 31% fewer trainable parameters compared with DeepGO and MultiPredGO.
ABSTRACT
Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone for predicting protein function, with promising state-of-the-art (SOTA) results. RNNs offer a strong sequential processing mechanism through their built-in ability to (i) focus on past information, (ii) collect both short- and long-range dependency information, and (iii) process sequences bi-directionally. CNNs, however, are confined to focusing on short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and used to develop a protein function prediction framework, Bi-SeqCNN. This is a sub-sequence-based framework. Further, Bi-SeqCNN+ is an ensemble approach that improves the prediction results. To our knowledge, this is the first time bi-directional CNNs are employed for general temporal data analysis and not just for protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. Moreover, it is substantially lighter and attains these results with (0.50-0.70 times) fewer parameters than the SOTA methods.
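One way to read "a bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs" is as a pair of causal convolutions, one scanning left to right and one scanning right to left, so that each output position sees only past (or only future) inputs. The sketch below illustrates that reading in plain Python; it is an interpretation for illustration, not the Bi-SeqCNN architecture itself, and the function names and weights are hypothetical.

```python
def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_j w[j] * x[t - j].

    Positions before the start of the sequence are treated as zero, so
    y[t] never depends on inputs later than t.
    """
    return [
        sum(w[j] * x[t - j] for j in range(len(w)) if t - j >= 0)
        for t in range(len(x))
    ]


def bidirectional_causal_conv1d(x, w_fwd, w_bwd):
    """Pair a left-to-right causal pass with a right-to-left causal pass."""
    forward = causal_conv1d(x, w_fwd)
    # Reverse, convolve causally, reverse back: position t then depends
    # only on x[t], x[t + 1], ...
    backward = causal_conv1d(x[::-1], w_bwd)[::-1]
    return list(zip(forward, backward))


x = [1.0, 2.0, 3.0, 4.0]
out = bidirectional_causal_conv1d(x, [1.0, 1.0], [1.0, 1.0])
print(out)  # [(1.0, 3.0), (3.0, 5.0), (5.0, 7.0), (7.0, 4.0)]
```

With a kernel of two ones, the forward value at each position is the current input plus the previous one, and the backward value is the current input plus the next one.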
ABSTRACT
BACKGROUND: Intravenous tenecteplase increases reperfusion in patients with salvageable brain tissue on perfusion imaging and might have advantages over alteplase as a thrombolytic for ischaemic stroke. We aimed to assess the non-inferiority of tenecteplase versus alteplase on clinical outcomes in patients selected by use of perfusion imaging. METHODS: This international, multicentre, open-label, parallel-group, randomised, clinical non-inferiority trial enrolled patients from 35 hospitals in eight countries. Participants were aged 18 years or older, within 4·5 h of ischaemic stroke onset or last known well, were not being considered for endovascular thrombectomy, and met target mismatch criteria on brain perfusion imaging. Patients were randomly assigned (1:1) by use of a centralised web server with randomly permuted blocks to intravenous tenecteplase (0·25 mg/kg) or alteplase (0·90 mg/kg). The primary outcome was the proportion of patients without disability (modified Rankin Scale 0-1) at 3 months, assessed via masked review in both the intention-to-treat and per-protocol populations. We aimed to recruit 832 participants to yield 90% power (one-sided alpha=0·025) to detect a risk difference of 0·08, with an absolute non-inferiority margin of -0·03. The trial was registered with the Australian New Zealand Clinical Trials Registry, ACTRN12613000243718, and the European Union Clinical Trials Register, EudraCT Number 2015-002657-36, and it is completed. FINDINGS: Recruitment ceased early following the announcement of other trial results showing non-inferiority of tenecteplase versus alteplase. Between March 21, 2014, and Oct 20, 2023, 680 patients were enrolled and randomly assigned to tenecteplase (n=339) and alteplase (n=341), all of whom were included in the intention-to-treat analysis (multiple imputation was used to account for missing primary outcome data for five patients). 
Protocol violations occurred in 74 participants, so the per-protocol population comprised 601 people (295 in the tenecteplase group and 306 in the alteplase group). Participants had a median age of 74 years (IQR 63-82), baseline National Institutes of Health Stroke Scale score of 7 (4-11), and 260 (38%) were female. In the intention-to-treat analysis, the primary outcome occurred in 191 (57%) of 335 participants allocated to tenecteplase and 188 (55%) of 340 participants allocated to alteplase (standardised risk difference [SRD]=0·03 [95% CI -0·033 to 0·10], one-tailed p for non-inferiority=0·031). In the per-protocol analysis, the primary outcome occurred in 173 (59%) of 295 participants allocated to tenecteplase and 171 (56%) of 306 participants allocated to alteplase (SRD 0·05 [-0·02 to 0·12], one-tailed p for non-inferiority=0·01). Nine (3%) of 337 patients in the tenecteplase group and six (2%) of 340 in the alteplase group had symptomatic intracranial haemorrhage (unadjusted risk difference=0·01 [95% CI -0·01 to 0·03]) and 23 (7%) of 335 and 15 (4%) of 340 died within 90 days of starting treatment (SRD 0·02 [95% CI -0·02 to 0·05]). INTERPRETATION: The findings in our study provide further evidence to strengthen the assertion of the non-inferiority of tenecteplase to alteplase, specifically when perfusion imaging has been used to identify reperfusion-eligible stroke patients. Although non-inferiority was achieved in the per-protocol population, it was not reached in the intention-to-treat analysis, possibly due to sample size limitations. Nonetheless, large-scale implementation of perfusion CT to assist in patient selection for intravenous thrombolysis in the early time window was shown to be feasible. FUNDING: Australian National Health and Medical Research Council; Boehringer Ingelheim.
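The non-inferiority conclusions above follow from comparing the lower bound of each reported 95% confidence interval with the stated margin of -0·03. The sketch below applies that conventional decision rule to the figures quoted in the abstract and recomputes the crude outcome percentages. It does not reproduce the trial's standardised risk difference analysis; the function names are illustrative.

```python
MARGIN = -0.03  # absolute non-inferiority margin stated in the abstract


def is_non_inferior(ci_lower, margin=MARGIN):
    """Conventional rule: non-inferior if the CI lower bound lies above the margin."""
    return ci_lower > margin


def percent(events, total):
    """Crude proportion, rounded to a whole percentage."""
    return round(100 * events / total)


# Lower bounds of the reported 95% CIs for the standardised risk difference.
itt_non_inferior = is_non_inferior(-0.033)  # intention-to-treat
pp_non_inferior = is_non_inferior(-0.02)    # per-protocol

print(itt_non_inferior, pp_non_inferior)        # False True
print(percent(191, 335), percent(188, 340))     # 57 55
print(percent(173, 295), percent(171, 306))     # 59 56
```

The intention-to-treat lower bound (-0·033) falls just below the margin, while the per-protocol lower bound (-0·02) lies above it, which matches the interpretation given in the abstract.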
Subject(s)
Fibrinolytic Agents , Ischemic Stroke , Perfusion Imaging , Tenecteplase , Tissue Plasminogen Activator , Humans , Tenecteplase/therapeutic use , Tenecteplase/administration & dosage , Male , Female , Ischemic Stroke/drug therapy , Ischemic Stroke/diagnostic imaging , Tissue Plasminogen Activator/therapeutic use , Tissue Plasminogen Activator/administration & dosage , Aged , Fibrinolytic Agents/therapeutic use , Fibrinolytic Agents/administration & dosage , Middle Aged , Perfusion Imaging/methods , Thrombolytic Therapy/methods , Treatment Outcome , Aged, 80 and over
ABSTRACT
Inferring protein function(s) via protein sub-sequence classification is often obstructed by a lack of knowledge about the function(s) of sub-sequences in the protein sequence. In this regard, we develop a novel "multi-aspect" paradigm to perform sub-sequence classification efficiently by utilizing the information of the parent sequence. The aspects are: (i) Multi-label: independent labelling of sub-sequences with more than one function of the parent sequence, and (ii) Label-relevance: scoring the parent functions to highlight how relevant each function is to the sub-sequence. The multi-aspect paradigm is used to propose the "Multi-Attention Based Multi-Aspect Network" for classifying protein sub-sequences, where multi-attention is a novel approach to processing sub-sequences at word level. Next, the proposed Global-ProtEnc method is a sub-sequence-based approach to encoding protein sequences for the protein function prediction task, which is finally used to develop an ensemble method, Global-ProtEnc-Plus. Evaluations of both the Global-ProtEnc and the Global-ProtEnc-Plus methods on the benchmark CAFA3 dataset delivered outstanding performance. Compared to the state-of-the-art DeepGOPlus, the improvements in Fmax with Global-ProtEnc-Plus are +6.50 percent for the biological process and +1.90 percent for the cellular component.
Subject(s)
Algorithms , Proteins , Proteins/genetics , Proteins/metabolism , Amino Acid Sequence
ABSTRACT
The short- and long-range interactions amongst amino acids in a protein sequence are primarily responsible for the function performed by the protein. Recently, convolutional neural networks (CNNs) have produced promising results on sequential data, including NLP tasks and protein sequences. However, the strength of CNNs lies primarily in capturing short-range interactions; they are less effective at long-range interactions. On the other hand, dilated CNNs are good at capturing both short- and long-range interactions because of their varied (short and long) receptive fields. Further, CNNs are quite light-weight in terms of trainable parameters, whereas most existing deep learning solutions for protein function prediction (PFP) are based on multi-modality and are rather complex and heavily parametrized. In this paper, we propose a simple, light-weight and sequence-only PFP framework, Lite-SeqCNN, based on sub-sequences and dilated CNNs. By varying dilation rates, Lite-SeqCNN efficiently captures both short- and long-range interactions and has (0.50-0.75 times) fewer trainable parameters than its contemporary deep learning models. Further, Lite-SeqCNN+ is an ensemble of three Lite-SeqCNNs developed with different segment sizes that produces even better results than the individual models. The proposed architecture produced improvements of up to 5% over the state-of-the-art approaches Global-ProtEnc Plus, DeepGOPlus, and GOLabeler on three different prominent datasets curated from the UniProt database.
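To make the point about dilation rates concrete, the sketch below computes the receptive field of a stack of stride-1 1-D convolutions and applies a single dilated convolution in plain Python. The kernel size and dilation rates are illustrative choices and are not taken from Lite-SeqCNN.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(x, w, dilation):
    """'Valid' dilated 1-D convolution (cross-correlation form, no padding)."""
    span = (len(w) - 1) * dilation
    return [
        sum(w[j] * x[t + j * dilation] for j in range(len(w)))
        for t in range(len(x) - span)
    ]


# Three layers with kernel size 3: plain versus exponentially dilated.
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15

# A kernel of three ones with dilation 2 sums every other residue position.
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))  # [9, 12, 15]
```

With the same number of weights, the dilated stack covers 15 positions instead of 7, which is how dilation extends the reach of a convolution without adding trainable parameters.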
Subject(s)
Neural Networks, Computer , Proteins , Proteins/genetics , Amino Acid Sequence , Databases, Factual , Amino Acids
ABSTRACT
This paper advances the self-attention mechanism in the standard transformer network specific to the modeling of the protein sequences. We introduce a novel context-window based scaled self-attention mechanism for processing protein sequences that is based on the notion of (i) local context and (ii) large contextual pattern. Both notions are essential to building a good representation for protein sequences. The proposed context-window based scaled self-attention mechanism is further used to build the multi context-window based scaled (MCWS) transformer network for the protein function prediction task at the protein sub-sequence level. Overall, the proposed MCWS transformer network produced improved predictive performances, outperforming existing state-of-the-art approaches by substantial margins. With respect to the standard transformer network, the proposed network produced improvements in F1-score of +2.30% and +2.08% on the biological process (BP) and molecular function (MF) datasets, respectively. The corresponding improvements over the state-of-the-art ProtVecGen-Plus+ProtVecGen-Ensemble approach are +3.38% (BP) and +2.86% (MF). Equally important, robust performances were obtained across protein sequences of different lengths.
Subject(s)
Amino Acid Sequence , Proteins , Software Design , Proteins/chemistry
ABSTRACT
This paper explores the use of variants of tf-idf-based descriptors, namely length-normalized-tf-idf and log-normalized-tf-idf, combined with a segmentation technique, for efficient modeling of variable-length protein sequences. The proposed solution, ProtVecGen-Ensemble, is an ensemble of three models trained on differently segmented datasets constructed from an input dataset containing complete protein sequences. Evaluations using biological process (BP) and molecular function (MF) datasets demonstrate that the proposed feature set is not only superior to its contemporaries but also produces more consistent results with respect to variation in sequence lengths. Improvements of +6.07% (BP) and +7.56% (MF) over the state-of-the-art tf-idf-based MLDA feature set were obtained. The best results were achieved when ProtVecGen-Ensemble was combined with ProtVecGen-Plus - the state-of-the-art method for protein function prediction - resulting in improvements of +8.90% (BP) and +11.28% (MF) over MLDA and +1.49% (BP) and +2.07% (MF) over ProtVecGen-Plus+MLDA. To capture the performance consistency with respect to sequence lengths, we have defined a variance-based metric, with lower values indicating better performance. On this metric, the proposed ProtVecGen-Ensemble+ProtVecGen-Plus framework resulted in reductions of 56.85 percent (BP) and 56.08 percent (MF) over MLDA and 10.37 percent (BP) and 26.48 percent (MF) over ProtVecGen-Plus+MLDA.
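The abstract names two tf-idf variants without giving formulas. The sketch below uses common textbook definitions over overlapping k-mers: a length-normalized term frequency (count divided by the number of k-mers in the sequence) and a log-normalized term frequency (1 + ln(count)), each multiplied by idf = ln(N / df). These definitions, the choice of k = 3, and the toy sequences are assumptions for illustration and may differ from the descriptors used in the paper.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Overlapping k-mers ('words') of a protein sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_descriptors(seqs, k=3, variant="length"):
    """Return one {k-mer: weight} dict per sequence.

    variant="length": tf = count / number of k-mers in the sequence
    variant="log":    tf = 1 + ln(count)
    idf = ln(N / df) in both cases (assumed, see lead-in).
    """
    docs = [Counter(kmers(s, k)) for s in seqs]
    n_docs = len(docs)
    # Document frequency: in how many sequences each k-mer occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vec = {}
        for term, count in doc.items():
            if variant == "length":
                tf = count / total
            elif variant == "log":
                tf = 1 + math.log(count)
            else:
                raise ValueError(f"unknown variant: {variant}")
            vec[term] = tf * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors


seqs = ["MKVMKV", "MKVLLA"]  # hypothetical toy sequences
length_vecs = tfidf_descriptors(seqs, variant="length")
log_vecs = tfidf_descriptors(seqs, variant="log")
print(round(length_vecs[0]["KVM"], 4))  # 0.1733
print(round(log_vecs[0]["KVM"], 4))     # 0.6931
```

The k-mer "MKV" occurs in both toy sequences, so its idf and therefore its weight is zero under either variant.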
Subject(s)
Algorithms , Biological Phenomena , Amino Acid Sequence , Proteins/genetics
ABSTRACT
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improved overall accuracy over the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. Combining the two models raises the improvement further to +7.41 and +9.21 percent, respectively.
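The segment-based feature construction described above starts by cutting each protein sequence into fixed-size pieces. The sketch below shows one plausible way to do this for a single segment size and for several sizes at once. The stride, the handling of the final partial window, and the function names are assumptions, since the abstract does not specify them.

```python
def segment(seq, size, stride=None):
    """Split a sequence into fixed-size segments.

    stride defaults to size (non-overlapping). If the sequence length is
    not covered exactly, a final window aligned to the end is added so no
    residue is dropped (an assumed convention).
    """
    stride = stride or size
    if len(seq) <= size:
        return [seq]
    starts = list(range(0, len(seq) - size + 1, stride))
    if starts[-1] + size < len(seq):
        starts.append(len(seq) - size)
    return [seq[s:s + size] for s in starts]


def multi_size_segments(seq, sizes):
    """Segment the same sequence at several segment sizes."""
    return {size: segment(seq, size) for size in sizes}


seq = "ABCDEFGHIJ"  # placeholder letters standing in for residues
print(segment(seq, 4))                   # ['ABCD', 'EFGH', 'GHIJ']
print(multi_size_segments(seq, [4, 5]))  # {4: ['ABCD', 'EFGH', 'GHIJ'], 5: ['ABCDE', 'FGHIJ']}
```

Each segment would then be fed to a sequence model, with the single-sized approach using one value of `size` and the multi-sized approach using several.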
Subject(s)
Deep Learning , Proteins , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology , Discriminant Analysis , Protein Conformation , Proteins/chemistry , Proteins/classification , Proteins/physiology
ABSTRACT
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. 
It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
Subject(s)
Biology/methods , Databases, Genetic , Trees/classification , Trees/genetics , User-Computer Interface
ABSTRACT
BACKGROUND: A common problem in phylogenetic analysis is to identify frequent patterns in a collection of phylogenetic trees. The goal is, roughly, to find a subset of the species (taxa) on which all or some significant subset of the trees agree. One popular method to do so is through maximum agreement subtrees (MASTs). MASTs are also used, among other things, as a metric for comparing phylogenetic trees, computing congruence indices and to identify horizontal gene transfer events. RESULTS: We give algorithms and experimental results for two approaches to identify common patterns in a collection of phylogenetic trees, one based on agreement subtrees, called maximal agreement subtrees, the other on frequent subtrees, called maximal frequent subtrees. These approaches can return subtrees on larger sets of taxa than MASTs, and can reveal new common phylogenetic relationships not present in either MASTs or the majority rule tree (a popular consensus method). Our current implementation is available on the web at https://code.google.com/p/mfst-miner/. CONCLUSIONS: Our computational results confirm that maximal agreement subtrees and all maximal frequent subtrees can reveal a more complete phylogenetic picture of the common patterns in collections of phylogenetic trees than maximum agreement subtrees; they are also often more resolved than the majority rule tree. Further, our experiments show that enumerating maximal frequent subtrees is considerably more practical than enumerating ordinary (not necessarily maximal) frequent subtrees.
ABSTRACT
BACKGROUND: A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. RESULTS: We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation among MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved. CONCLUSIONS: Our measure of conflict-free information content based on quartets is simple and topologically appealing. In the experiments, the maximally reduced form is often much smaller than the original tree, yet retains most of the taxa. The reduction algorithm is quadratic in the number of leaves and its complexity is unaffected by the multiplicity of leaf labels or the degree of the nodes.
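As a small illustration of the MUL-tree definition given in the background, the sketch below represents a tree as nested tuples with string leaves and checks whether any leaf label occurs more than once. It covers only the definition; the quartet-based information content and the reduction algorithm described in the abstract are not implemented here, and the representation is an assumption.

```python
from collections import Counter


def leaf_labels(tree):
    """Collect leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaf_labels(child)]


def repeated_labels(tree):
    """Labels carried by two or more leaves, with their multiplicities."""
    counts = Counter(leaf_labels(tree))
    return {label: n for label, n in counts.items() if n > 1}


def is_mul_tree(tree):
    """A tree is multi-labeled if at least one label is shared by several leaves."""
    return bool(repeated_labels(tree))


mul = (("a", "b"), ("a", ("c", "d")))  # label "a" appears on two leaves
single = (("a", "b"), ("c", "d"))      # every label appears once

print(repeated_labels(mul))  # {'a': 2}
print(is_mul_tree(single))   # False
```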