1.
Appl Microbiol Biotechnol ; 108(1): 415, 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-38990377

ABSTRACT

Currently, the main α-amylase family GH13 is divided into 47 subfamilies in CAZy, with new subfamilies regularly emerging. The present in silico study was performed to highlight two groups, represented by the maltogenic amylase from Thermotoga neapolitana and the α-amylase from Haloarcula japonica, that merit their own new GH13 subfamilies. This extends functional annotation and thus allows more precise prediction of the function of putative proteins. Interestingly, the two groups share certain sequence features, e.g. the highly conserved cysteine in the second conserved sequence region (CSR-II) directly preceding the catalytic nucleophile, or the well-preserved GQ character of the end of CSR-VII. On the other hand, the two groups also bear specific and highly conserved positions that distinguish them not only from each other but also from representatives of the remaining GH13 subfamilies established so far. For the T. neapolitana maltogenic amylase group, it is the stretch of residues at the end of CSR-V, highly conserved as L-[DN]. The H. japonica α-amylase group can be characterized by a highly conserved [WY]-[GA] sequence at the end of CSR-II. Other specific sequence features include an almost fully conserved aspartic acid directly preceding the general acid/base in CSR-III or a well-preserved glutamic acid in CSR-IV. The assumption that these two groups represent two mutually related, but simultaneously independent, GH13 subfamilies has been supported by phylogenetic analysis as well as by comparison of tertiary structures. The main α-amylase family GH13 has thus been expanded by two novel subfamilies, GH13_48 and GH13_49. KEY POINTS: • In silico analysis of two groups of family GH13 members with characterized representatives • Identification of certain common, but also some specific, sequence features in seven CSRs • Creation of two novel subfamilies, GH13_48 and GH13_49, within the CAZy database.
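The bracketed motifs quoted in this abstract, L-[DN] and [WY]-[GA], use a PROSITE-like notation that maps directly onto regular expressions. The following is a minimal sketch of that mapping, not the authors' analysis pipeline: the function name, the dictionary keys, and the example segment are invented for illustration, and the patterns match anywhere in the string rather than at a defined CSR position.

```python
import re

# Motifs quoted in the abstract, rewritten as regular expressions.
# Assumption: a bracket means "any one of these residues"; the position of
# each motif within its CSR is NOT encoded here.
MOTIFS = {
    "CSR-V end (T. neapolitana group)": re.compile(r"L[DN]"),
    "CSR-II end (H. japonica group)": re.compile(r"[WY][GA]"),
    "CSR-VII end (both groups)": re.compile(r"GQ"),
}

def find_motifs(segment):
    """Return {motif name: [0-based start positions]} for one sequence segment."""
    return {
        name: [match.start() for match in pattern.finditer(segment)]
        for name, pattern in MOTIFS.items()
    }

# Hypothetical segment, invented for illustration; not a real GH13 sequence.
hits = find_motifs("AAWGKKLDTTGQ")
for name, positions in hits.items():
    print(name, positions)
```

In practice such patterns would be applied to aligned CSR segments rather than to whole sequences, since short motifs like these occur by chance in most proteins.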


Subject(s)
Phylogeny , alpha-Amylases , alpha-Amylases/genetics , alpha-Amylases/metabolism , alpha-Amylases/chemistry , Amino Acid Sequence , Conserved Sequence , Sequence Alignment
2.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38935070

ABSTRACT

Inferring gene regulatory networks (GRNs) is one of the important challenges in systems biology, and many outstanding computational methods have been proposed; however, some challenges remain, especially with real datasets. In this study, we propose a Directed Graph Convolutional neural network-based method for GRN inference (DGCGRN). To better understand and process the directed graph structure data of GRNs, a directed graph convolutional neural network is used, which retains the structural information of the directed graph while also making full use of neighbor node features. A local augmentation strategy is adopted in the graph neural network to address the poor prediction accuracy caused by the large number of low-degree nodes in GRNs. In addition, for real data such as E. coli, sequence features are obtained by extracting hidden features using Bi-GRU and calculating the statistical physicochemical characteristics of the gene sequence. At the training stage, a dynamic update strategy converts the obtained edge prediction scores into edge weights to guide the subsequent training of the model. The results on synthetic benchmark datasets and real datasets show that the prediction performance of DGCGRN is significantly better than that of existing models. Furthermore, case studies on bladder uroepithelial carcinoma and lung cancer cells also illustrate the performance of the proposed model.


Subject(s)
Computational Biology , Gene Regulatory Networks , Neural Networks, Computer , Humans , Computational Biology/methods , Algorithms , Urinary Bladder Neoplasms/genetics , Urinary Bladder Neoplasms/pathology , Escherichia coli/genetics
3.
PeerJ ; 12: e17010, 2024.
Article in English | MEDLINE | ID: mdl-38495766

ABSTRACT

Proteins are indispensable for an organism's viability, reproduction, and other fundamental physiological functions. Conventional biological assays for identifying essential proteins are slow, labor-intensive, and expensive. Computational methods are therefore widely accepted as the fastest and most effective way to identify essential proteins. Although deep learning (DL) is a popular choice in machine learning (ML) applications, it is not suggested for this sequence-feature-based work because high-quality training sets of positive and negative samples are scarce. Some recent DL work does operate on limited data, and this will be the scope of our future work. Conventional ML techniques are thus used here because of their superior performance compared to DL methodologies. Accordingly, a technique called EPI-SF is proposed, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function, so relevant protein sequence features are first extracted from the proteins within the PPIN. These features are then used as input for various ML models, including the XGBoost classifier, the AdaBoost classifier, logistic regression (LR), support vector classification (SVM), a decision tree (DT), a random forest (RF), and a naïve Bayes model (NB), with the objective of detecting the essential proteins within the PPIN. The primary investigation, conducted on yeast, examined the performance of these ML models on the yeast PPIN.
Among these models, the RF model was the most effective, with precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It also outperformed other state-of-the-art methods, both those based on traditional centrality measures such as betweenness centrality (BC) and closeness centrality (CC) and deep learning methods such as DeepEP, as emphasized in the results section. Given its favorable performance, EPI-SF is then employed to predict novel essential proteins in the human PPIN. Because viruses tend to selectively target essential proteins involved in the transmission of diseases within the human PPIN, the probable involvement of these proteins in COVID-19 and other related severe diseases is also investigated.
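The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch follows; the function and variable names are chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for the RF model in the abstract.
rf_f1 = f1_score(0.703, 0.720)
print(round(rf_f1, 3))  # 0.711, matching the reported F1-score
```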


Subject(s)
Protein Interaction Maps , Saccharomyces cerevisiae , Humans , Bayes Theorem , Proteins/chemistry , Machine Learning
4.
RNA Biol ; 21(1): 1-10, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38357904

ABSTRACT

RNA modifications play crucial roles in various biological processes and diseases. Accurate prediction of RNA modification sites is essential for understanding their functions. In this study, we propose a hybrid approach that fuses a pre-trained sequence representation with various sequence features to predict multiple types of RNA modifications in one combined prediction framework. We developed MRM-BERT, a deep learning method that combined the pre-trained DNABERT deep sequence representation module and the convolutional neural network (CNN) exploiting four traditional sequence feature encodings to improve the prediction performance. MRM-BERT was evaluated on multiple datasets of 12 commonly occurring RNA modifications, including m6A, m5C, m1A and so on. The results demonstrate that our hybrid model outperforms other models in terms of area under receiver operating characteristic curve (AUC) for all 12 types of RNA modifications. MRM-BERT is available as an online tool (http://117.122.208.21:8501) or source code (https://github.com/abhhba999/MRM-BERT), which allows users to predict RNA modification sites and visualize the results. Overall, our study provides an effective and efficient approach to predict multiple RNA modifications, contributing to the understanding of RNA biology and the development of therapeutic strategies.


Subject(s)
Neural Networks, Computer , RNA , RNA/genetics , ROC Curve , Software
5.
Comput Methods Programs Biomed ; 244: 107955, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38064959

ABSTRACT

BACKGROUND AND OBJECTIVE: Protein-protein interaction (PPI) is a vital process in all living cells, controlling essential cell functions such as cell cycle regulation, signal transduction, and metabolic processes with broad applications that include antibody therapeutics, vaccines, and drug discovery. The problem of sequence-based PPI prediction has been a long-standing issue in computational biology. METHODS: We introduce MaTPIP, a cutting-edge deep-learning framework for predicting PPI. MaTPIP stands out due to its innovative design, fusing pre-trained Protein Language Model (PLM)-based features with manually curated protein sequence attributes, emphasizing the part-whole relationship by incorporating two-dimensional granular part (amino-acid) level features and one-dimensional whole-level (protein) features. What sets MaTPIP apart is its ability to integrate these features across three different input terminals seamlessly. MaTPIP also includes a distinctive configuration of Convolutional Neural Network (CNN) with Transformer components for concurrent utilization of CNN and sequential characteristics in each iteration and a one-dimensional to two-dimensional converter followed by a unified embedding. The statistical significance of this classifier is validated using McNemar's test. RESULTS: MaTPIP outperformed the existing methods on both the Human PPI benchmark and cross-species PPI testing datasets, demonstrating its immense generalization capability for PPI prediction. We used seven diverse datasets with varying PPI target class distributions. Notably, within the novel PPI scenario, the most challenging category for Human PPI Benchmark, MaTPIP improves the existing state-of-the-art score from 74.1% to 78.6% (measured in Area under ROC Curve), from 23.2% to 32.8% (in average precision) and from 4.9% to 9.5% (in precision at 3% recall) for 50%, 10% and 0.3% target class distributions, respectively. 
In cross-species PPI evaluation, hybrid MaTPIP establishes a new benchmark score (measured in Area Under precision-recall curve) of 81.1% from the previous 60.9% for Mouse, 80.9% from 56.2% for Fly, 78.1% from 55.9% for Worm, 59.9% from 41.7% for Yeast, and 66.2% from 58.8% for E.coli. Our eXplainable AI-based assessment reveals an average contribution of different feature families per prediction on these datasets. CONCLUSIONS: MaTPIP mixes manually curated features with the feature extracted from the pre-trained PLM to predict sequence-based protein-protein association. Furthermore, MaTPIP demonstrates strong generalization capabilities for cross-species PPI predictions.


Subject(s)
Deep Learning , Humans , Animals , Mice , Neural Networks, Computer , Proteins/metabolism , Amino Acid Sequence , ROC Curve
6.
Comput Struct Biotechnol J ; 21: 5544-5560, 2023.
Article in English | MEDLINE | ID: mdl-38034401

ABSTRACT

Thermally stable proteins find extensive applications in industrial production, pharmaceutical development, and serve as a highly evolved starting point in protein engineering. The thermal stability of proteins is commonly characterized by their melting temperature (Tm). However, due to the limited availability of experimentally determined Tm data and the insufficient accuracy of existing computational methods in predicting Tm, there is an urgent need for a computational approach to accurately forecast the Tm values of thermophilic proteins. Here, we present a deep learning-based model, called DeepTM, which exclusively utilizes protein sequences as input and accurately predicts the Tm values of target thermophilic proteins on a dataset consisting of 7790 thermophilic protein entries. On a test set of 1550 samples, DeepTM demonstrates excellent performance with a coefficient of determination (R2) of 0.75, Pearson correlation coefficient (P) of 0.87, and root mean square error (RMSE) of 6.24 ℃. We further analyzed the sequence features that determine the thermal stability of thermophilic proteins and found that dipeptide frequency, optimal growth temperature (OGT) of the host organisms, and the evolutionary information of the protein significantly affect its melting temperature. We compared the performance of DeepTM with recently reported methods, ProTstab2 and DeepSTABp, in predicting the Tm values on two blind test datasets. One dataset comprised 22 PET plastic-degrading enzymes, while the other included 29 thermally stable proteins of broader classification. In the PET plastic-degrading enzyme dataset, DeepTM achieved RMSE of 8.25 ℃. Compared to ProTstab2 (20.05 ℃) and DeepSTABp (20.97 ℃), DeepTM demonstrated a reduction in RMSE of 58.85% and 60.66%, respectively. In the dataset of thermally stable proteins, DeepTM (RMSE=7.66 ℃) demonstrated a 51.73% reduction in RMSE compared to ProTstab2 (RMSE=15.87 ℃). 
DeepTM, with the sole requirement of protein sequence information, accurately predicts the melting temperature and achieves a fully end-to-end prediction process, thus providing enhanced convenience and expediency for further protein engineering.
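The percentage reductions in RMSE quoted in this abstract follow from the stated RMSE values as (baseline - new) / baseline. A short sketch of that arithmetic, with function and variable names chosen here for illustration:

```python
def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline

# RMSE values (in degrees Celsius) as reported in the abstract.
reductions = {
    "PET set, vs ProTstab2": relative_reduction(20.05, 8.25),
    "PET set, vs DeepSTABp": relative_reduction(20.97, 8.25),
    "thermostable set, vs ProTstab2": relative_reduction(15.87, 7.66),
}
for label, value in reductions.items():
    print(f"{label}: {value:.2f}%")
```

The three results round to 58.85%, 60.66% and 51.73%, in agreement with the figures given in the abstract.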

7.
Front Med (Lausanne) ; 10: 1281880, 2023.
Article in English | MEDLINE | ID: mdl-38020152

ABSTRACT

Introduction: Hemagglutinin (HA) facilitates viral entry and infection by promoting fusion between the host membrane and the virus. Given its significance in influenza virus infection, HA has garnered attention as a target for influenza drug and vaccine development. Thus, accurately identifying HA is crucial for the development of targeted vaccines and drugs. However, in-silico methods for identifying HA are still lacking. This study aims to design a computational model to identify HA. Methods: A benchmark dataset comprising 106 HA and 106 non-HA sequences was obtained from UniProt. Various sequence-based features were used to formulate samples. By performing feature optimization and feeding the optimized features into four kinds of machine learning methods, we constructed an integrated classifier model using the stacking algorithm. Results and discussion: The model achieved an accuracy of 95.85% and an area under the receiver operating characteristic (ROC) curve of 0.9863 in 5-fold cross-validation. In the independent test, the model achieved an accuracy of 93.18% and an area under the ROC curve of 0.9793. The code is available at https://github.com/Zouxidan/HA_predict.git. The proposed model has excellent prediction performance and should be a convenient tool for researchers studying HA.

8.
Brief Bioinform ; 24(5), 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37643374

ABSTRACT

Silencers are noncoding DNA sequence fragments located on the genome that suppress gene expression. The variation of silencers in specific cells is closely related to gene expression and cancer development. Computational approaches that exclusively rely on DNA sequence information for silencer identification fail to account for the cell specificity of silencers, resulting in diminished accuracy. Despite the discovery of several transcription factors and epigenetic modifications associated with silencers on the genome, there is still no definitive biological signal or combination thereof to fully characterize silencers, posing challenges in selecting suitable biological signals for their identification. Therefore, we propose a sophisticated deep learning framework called DeepICSH, which is based on multiple biological data sources. Specifically, DeepICSH leverages a deep convolutional neural network to automatically capture biologically relevant signal combinations strongly associated with silencers, originating from a diverse array of biological signals. Furthermore, the utilization of attention mechanisms facilitates the scoring and visualization of these signal combinations, whereas the employment of skip connections facilitates the fusion of multilevel sequence features and signal combinations, thereby empowering the accurate identification of silencers within specific cells. Extensive experiments on HepG2 and K562 cell line data sets demonstrate that DeepICSH outperforms state-of-the-art methods in silencer identification. Notably, we introduce for the first time a deep learning framework based on multi-omics data for classifying strong and weak silencers, achieving favorable performance. In conclusion, DeepICSH shows great promise for advancing the study and analysis of silencers in complex diseases. The source code is available at https://github.com/lyli1013/DeepICSH.


Subject(s)
Deep Learning , Genome, Human , Humans , Cell Line , Epigenesis, Genetic , Multiomics
9.
Comput Biol Med ; 163: 107143, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37339574

ABSTRACT

Non-coding RNA (ncRNA) is a functional RNA molecule that plays a key role in various fundamental biological processes, such as gene regulation. Therefore, studying the connection between ncRNA and proteins holds significant importance in exploring the function of ncRNA. Although many efficient and accurate methods have been developed by modern biological scientists, accurate predictions still pose a major challenge for various issues. In our approach, we utilize a multi-head attention mechanism to merge residual connections, allowing for the automatic learning of ncRNA and protein sequence features. Specifically, the proposed method projects node features into multiple spaces based on multi-head attention mechanism, thereby obtaining different feature interaction patterns in these spaces. By stacking interaction layers, higher-order interaction modes can be derived, while still preserving the initial feature information through the residual connection. This strategy effectively leverages the sequence information of ncRNA and protein, enabling the capture of hidden high-order features. The final experimental results demonstrate the effectiveness of our method, with AUC values of 97.4%, 98.5%, and 94.8% achieved on the NPInter v2.0, RPI807, and RPI488 datasets, respectively. These impressive results solidify our method as a powerful tool for exploring the connection between ncRNAs and proteins. We have uploaded the implementation code on GitHub: https://github.com/ZZCrazy00/MHAM-NPI.


Subject(s)
Proteins , RNA, Untranslated , RNA, Untranslated/genetics , RNA, Untranslated/metabolism , Proteins/metabolism
10.
Biomolecules ; 13(4), 2023 Apr 03.
Article in English | MEDLINE | ID: mdl-37189388

ABSTRACT

CRISPR/Cas9 technology is capable of precisely editing genomes and is at the heart of various recent scientific and medical advances. Advances in biomedical research are hindered by the inadvertent burden placed on the genome when genome editors are employed: the off-target effects. Although experimental screens to detect off-targets have improved understanding of Cas9 activity, that knowledge remains incomplete, as the rules do not extrapolate well to new target sequences. Because the rules that drive Cas9 activity are not fully understood, recently developed off-target prediction tools have increasingly relied on machine learning and deep learning techniques to assess the full threat of likely off-targets. In this study, we present a count-based as well as a deep-learning-based approach to derive sequence features that are important in determining Cas9 activity at a sequence. There are two major challenges in off-target determination: identifying a likely site of Cas9 activity and predicting the extent of Cas9 activity at that site. The hybrid multitask CNN-biLSTM model developed here, named CRISP-RCNN, simultaneously predicts off-targets and the extent of activity on off-targets. Using integrated gradients and weighting kernels for feature importance approximation, we analyzed nucleotide and position preference and mismatch tolerance.


Subject(s)
CRISPR-Cas Systems , Machine Learning , CRISPR-Cas Systems/genetics , Genome
11.
Sensors (Basel) ; 23(7), 2023 Apr 06.
Article in English | MEDLINE | ID: mdl-37050832

ABSTRACT

To solve the problem of low accuracy of pavement crack detection caused by natural environment interference, this paper designed a lightweight detection framework named PCDETR (Pavement Crack DEtection TRansformer) network, based on the fusion of the convolution features with the sequence features and proposed an efficient pavement crack detection method. Firstly, the scalable Swin-Transformer network and the residual network are used as two parallel channels of the backbone network to extract the long-sequence global features and the underlying visual local features of the pavement cracks, respectively, which are concatenated and fused to enrich the extracted feature information. Then, the encoder and decoder of the transformer detection framework are optimized; the location and category information of the pavement cracks can be obtained directly using the set prediction, which provided a low-code method to reduce the implementation complexity. The research result shows that the highest AP (Average Precision) of this method reaches 45.8% on the COCO dataset, which is significantly higher than that of DETR and its variants model Conditional DETR where the AP values are 36.9% and 42.8%, respectively. On the self-collected pavement crack dataset, the AP of the proposed method reaches 45.6%, which is 3.8% higher than that of Mask R-CNN (Region-based Convolution Neural Network) and 8.8% higher than that of Faster R-CNN. Therefore, this method is an efficient pavement crack detection algorithm.

12.
Comput Biol Med ; 151(Pt A): 106268, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36370585

ABSTRACT

DNA-binding proteins (DBPs) protect DNA from nuclease hydrolysis, inhibit the action of RNA polymerase, and prevent replication and transcription from occurring simultaneously on a piece of DNA. Most conventional methods for detecting DBPs are biochemical, but their time cost is high. In recent years, a variety of machine learning-based methods have been used for large-scale screening of DBPs. To improve the prediction performance for DBPs, we propose a random Fourier features-based sparse representation classifier (RFF-SRC), which randomly maps the features into a high-dimensional space to solve nonlinear classification problems. An L2,1 matrix norm is introduced to obtain a sparse solution of the model. To evaluate performance, our model is tested on several benchmark data sets of DBPs and 8 UCI data sets. RFF-SRC achieves better performance in experimental results.
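The two ingredients named in this abstract, random Fourier features and the L2,1 matrix norm, can be illustrated independently of the classifier itself. The sketch below uses the standard random Fourier feature construction for an RBF kernel and the usual definition of the L2,1 norm (sum of row-wise Euclidean norms). It is not the RFF-SRC implementation: the kernel choice, the gamma value, the feature count, and all names are assumptions made here for illustration.

```python
import math
import random

def make_rff_map(input_dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map approximating exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # For this RBF kernel the random frequencies are drawn from N(0, 2 * gamma).
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return transform

def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def l21_norm(matrix):
    """L2,1 norm: sum of the Euclidean norms of the rows."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

# Illustrative inputs and hyperparameters (assumptions, not from the paper).
transform = make_rff_map(input_dim=3, n_features=2000, gamma=0.5)
x, y = [0.1, 0.2, 0.3], [0.0, 0.4, 0.1]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = rbf_kernel(x, y, gamma=0.5)
print(f"approximate kernel: {approx:.3f}, exact kernel: {exact:.3f}")
```

The inner product of the mapped vectors approximates the kernel value, which is what lets a linear model in the mapped space handle a nonlinear problem. Penalizing the L2,1 norm of a coefficient matrix pushes whole rows toward zero, which is one common way to obtain a sparse solution.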


Subject(s)
Algorithms , DNA-Binding Proteins , Machine Learning , DNA
13.
Fish Shellfish Immunol ; 130: 79-85, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36087818

ABSTRACT

Mammalian evolutionarily conserved signaling intermediate in Toll pathways (ECSIT) is an important intracellular protein involved in innate immunity, embryogenesis, and the assembly or stability of mitochondrial complex I. In the present study, ECSIT was characterized in soiny mullet (Liza haematocheila). The full-length cDNA of mullet ECSIT was 1860 bp, encoding 449 amino acids. Mullet ECSIT shared 60.4% to 78.2% sequence identity with its teleost counterparts. Two conserved protein domains, the ECSIT domain and the C-terminal domain, were found in mullet ECSIT. Real-time qPCR analysis revealed that mullet ECSIT was expressed in all examined tissues, with high expression in spleen, head kidney (HK) and gill. Further analysis showed that mullet ECSIT in spleen was up-regulated from 6 h to 48 h after Streptococcus dysgalactiae infection. In addition, a co-immunoprecipitation (co-IP) assay confirmed that mullet ECSIT could interact with tumor necrosis factor receptor-associated factor 6 (TRAF6). Molecular docking revealed that polar and hydrophobic interactions play crucial roles in the formation of the ECSIT-TRAF6 complex. The residues of mullet ECSIT involved in the interaction with TRAF6 were Arg107, Glu113, Phe114, Glu124, Lys120 and Lys121, which are mainly located in the ECSIT domain. Our results demonstrate that mullet ECSIT is involved in the immune defense against bacteria and in the regulation of the TLR signaling pathway through interaction with TRAF6. To the best of our knowledge, this is the first report on ECSIT in soiny mullet, which deepens the understanding of ECSIT and its functions in the immune response of teleosts.


Subject(s)
Smegmamorpha , Streptococcal Infections , Amino Acids/metabolism , Animals , DNA, Complementary/genetics , Immunity, Innate/genetics , Mammals/genetics , Mammals/metabolism , Molecular Docking Simulation , Phylogeny , Signal Transduction , Streptococcal Infections/veterinary , TNF Receptor-Associated Factor 6/genetics
14.
Brief Bioinform ; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36094083

ABSTRACT

Short open reading frames (sORFs) are small nucleic acid fragments no longer than 303 nt that probably encode small peptides. To date, translatable sORFs have been found in both the untranslated regions of messenger ribonucleic acids (RNAs; mRNAs) and long non-coding RNAs (lncRNAs), playing vital roles in a myriad of biological processes. As not all sORFs are translated or essentially translatable, it is important to develop a highly accurate computational tool for characterizing the coding potential of sORFs, thereby facilitating discovery of novel functional peptides. In light of this, we designed a series of ensemble models by integrating Efficient-CapsNet and LightGBM, collectively termed csORF-finder, to differentiate coding sORFs (csORFs) from non-coding sORFs in Homo sapiens, Mus musculus and Drosophila melanogaster, respectively. To improve the performance of csORF-finder, we introduced a novel feature encoding scheme named trinucleotide deviation from expected mean (TDE) and computed all types of in-frame sequence-based features, such as i-framed-3mer, i-framed-CKSNAP and i-framed-TDE. Benchmarking results showed that these features could significantly boost the performance compared to the original 3-mer, CKSNAP and TDE features. Our performance comparisons showed that csORF-finder outperformed the state-of-the-art methods for csORF prediction on multi-species and non-ATG initiation independent test datasets. Furthermore, we applied csORF-finder to screen lncRNA datasets to identify potential csORFs. The resulting data serve as an important computational repository for further experimental validation. We hope that csORF-finder can be exploited as a powerful platform for high-throughput identification of csORFs and functional characterization of the peptides these csORFs encode.
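The abstract contrasts in-frame features with the original sliding-window ones without defining them. One plausible reading of "i-framed-3mer" is that trinucleotides are counted only at codon positions (steps of 3 from the start of the ORF) rather than at every offset. The sketch below implements that reading; it is an assumption about the encoding, not the csORF-finder code, and all names are hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotides in a fixed order, so every ORF maps to a 64-dim vector.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def in_frame_3mer_frequencies(orf):
    """Frequency of each trinucleotide read in frame 0 (codon positions only).

    Assumption: the ORF string starts at its first codon. Triplets containing
    characters other than A, C, G, T are skipped.
    """
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]

# Toy ORF invented for illustration: codons ATG, GCC, TAA.
freqs = in_frame_3mer_frequencies("ATGGCCTAA")
print(freqs[CODONS.index("ATG")])
```

A sliding-window 3-mer count over the same toy ORF would also include out-of-frame triplets such as TGG and GGC; restricting to frame 0 keeps only the triplets that would actually be read as codons.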


Subject(s)
Open Reading Frames , RNA, Long Noncoding , Animals , Mice , Drosophila melanogaster/genetics , Machine Learning , Peptides/genetics , RNA, Long Noncoding/genetics , RNA, Messenger/genetics , Humans
15.
Biochim Biophys Acta Gene Regul Mech ; 1865(5): 194844, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35870788

ABSTRACT

Meiotic recombination is a driver of evolution, and aberrant recombination is a major contributor to aneuploidy in mammals. The mechanism of recombination remains elusive. Here, we present a computational analysis to explore recombination-related dynamics of chromatin accessibility in mouse primordial germ cells (PGCs). Our data reveal that: (1) recombination hotspots that become accessible at meiosis-specific DNase I-hypersensitive sites (DHSs) only when PGCs enter meiosis are located preferentially in intronic and distal intergenic regions; (2) stable DHSs maintained across PGC differentiation are enriched in CTCF motifs and CTCF binding and mediate chromatin loop formation; (3) compared with the specific DHSs arising at the meiotic stage, stable DHSs are largely encoded in DNA sequence and are also enriched in epigenetic marks; (4) PRDM9 is likely to target nucleosome-occupied hotspot regions and remodel local chromatin structure to make them accessible to the recombination machinery; and (5) cells undergoing meiotic recombination are deficient in TAD structure, and chromatin loop arrays are organized regularly along the axis formed between homologous chromosomes. Taken together, by analyzing DHS-related DNA features, epigenetic marks and 3D genome structure, we reveal some specific roles of chromatin accessibility in recombination, which expands our understanding of the recombination mechanism.


Subject(s)
Chromatin , Meiosis , Animals , Chromatin/genetics , DNA Breaks, Double-Stranded , Germ Cells/metabolism , Histone-Lysine N-Methyltransferase/metabolism , Mammals/genetics , Mice , Nucleosomes/genetics
16.
Acta Crystallogr D Struct Biol ; 78(Pt 5): 553-559, 2022 May 01.
Article in English | MEDLINE | ID: mdl-35503204

ABSTRACT

Crystallographers have an array of search-model options for structure solution by molecular replacement (MR). The well established options of homologous experimental structures and regular secondary-structure elements or motifs are increasingly supplemented by computational modelling. Such modelling may be carried out locally or may use pre-calculated predictions retrieved from databases such as the EBI AlphaFold database. MrParse is a new pipeline to help to streamline the decision process in MR by consolidating bioinformatic predictions in one place. When reflection data are provided, MrParse can rank any experimental homologues found using eLLG, which indicates the likelihood that a given search model will work in MR. Inbuilt displays of predicted secondary structure, coiled-coil and transmembrane regions further inform the choice of MR protocol. MrParse can also identify and rank homologues in the EBI AlphaFold database, a function that will also interest other structural biologists and bioinformaticians.


Subject(s)
Proteins , Databases, Protein , Models, Molecular , Protein Domains , Protein Structure, Secondary , Proteins/chemistry
17.
Front Genet ; 13: 877409, 2022.
Article in English | MEDLINE | ID: mdl-35419029

ABSTRACT

MicroRNAs (miRNAs) play vital roles in the regulation of gene expression. Identification of essential miRNAs is of fundamental importance in understanding their cellular functions. Experimental methods for identifying essential miRNAs are costly and time-consuming, so computational methods are considered as alternative approaches. Currently, only a handful of studies focus on predicting essential miRNAs. In this work, we propose to predict essential miRNAs using the XGBoost framework with CART (Classification and Regression Trees) on various types of sequence-based features. We named this method XGEM (XGBoost for essential miRNAs). The prediction performance of XGEM is promising. In comparison with other state-of-the-art methods, XGEM performed the best, indicating its potential in identifying essential miRNAs.

18.
Comput Biol Chem ; 98: 107662, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35288360

ABSTRACT

S-Adenosyl methionine (SAM), a universal methyl group donor, plays a vital role in biosynthesis and acts as an inhibitor of many enzymes. Because of its protein interaction-dependent biological role, SAM has become a favored target in various therapeutic and clinical studies, such as treating cancer, Alzheimer's, epilepsy, and neurological disorders. Therefore, identifying SAM-interacting proteins and their interaction sites is a biologically significant problem. However, wet-lab techniques for identifying SAM interactions and interaction sites, though accurate, are tedious and costly. Efficient and accurate computational methods are therefore vital for designing and assisting such wet-lab experiments. In this study, we present machine learning-based models to predict SAM-interacting proteins and their interaction sites using only the primary structures of proteins. We modeled SAM interaction prediction through whole-protein sequence features along with different classifiers, whereas we modeled SAM interaction site prediction through overlapping sequence windows and ranking with multiple instance learning, which allows handling of imprecisely annotated SAM interaction sites. Through a series of simulation studies along with biologically significant evaluation, we showed that our proposed models give state-of-the-art performance for both SAM interaction and interaction site prediction. Through data mining in this study, we have also identified various characteristics of amino acid sub-sequences and their relative positions that help to effectively locate interaction sites in a SAM-interacting protein. Python code for training and evaluating our proposed models, together with a webserver implementation named SIP (Sam Interaction Predictor), is available at the URL: https://sites.google.com/view/wajidarshad/software.


Subject(s)
Proteins , S-Adenosylmethionine , Amino Acid Sequence , Computer Simulation , Machine Learning , Proteins/metabolism , S-Adenosylmethionine/chemistry , S-Adenosylmethionine/metabolism
19.
Genes (Basel) ; 12(11), 2021 Oct 24.
Article in English | MEDLINE | ID: mdl-34828296

ABSTRACT

Long noncoding RNA (lncRNA) plays a crucial role in many critical biological processes and participates in complex human diseases through interaction with proteins. Considering that identifying lncRNA-protein interactions through experimental methods is expensive and time-consuming, we propose a novel deep learning-based method, called LGFC-CNN, that combines raw sequence composition features, hand-designed features and structure features to predict lncRNA-protein interactions. Two sequence preprocessing methods and two CNN modules (GloCNN and LocCNN) are used to extract global and local features from the raw sequences. Meanwhile, we select hand-designed features by comparing the predictive effect of different combinations of lncRNA and protein features. Furthermore, we obtain the structure features and unify their dimensions through Fourier transform. In the end, the four types of features are integrated to comprehensively predict the lncRNA-protein interactions. Compared with other state-of-the-art methods on three lncRNA-protein interaction datasets, LGFC-CNN achieves the best performance, with an accuracy of 94.14% on RPI21850, 92.94% on RPI7317, and 98.19% on RPI1847. The results show that LGFC-CNN can effectively predict lncRNA-protein interactions by combining raw sequence composition features, hand-designed features and structure features.
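
The abstract states that structure features are brought to a common dimension through a Fourier transform but gives no details. One plausible reading, sketched here under that assumption, is to treat a per-position structural profile as a signal of arbitrary length and keep the magnitudes of its first k discrete Fourier coefficients. The function name `fourier_features`, the default `k`, and the normalization by signal length are all hypothetical choices.

```python
import cmath


def fourier_features(signal: list[float], k: int = 10) -> list[float]:
    """Map a variable-length numeric profile to k Fourier magnitudes.

    Assumed scheme: magnitudes of the first k DFT coefficients, scaled by
    the signal length so that profiles of different lengths are comparable.
    If k exceeds the signal length, the higher coefficients repeat earlier
    ones (DFT periodicity), but the output length is still exactly k.
    """
    n = len(signal)
    if n == 0:
        return [0.0] * k
    feats = []
    for f in range(k):
        # Naive O(n) sum per coefficient; adequate for a short illustration.
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(signal)
        )
        feats.append(abs(coeff) / n)
    return feats
```

A profile of 3 values and a profile of 50 values both yield a vector of length `k`, so lncRNA and protein structure profiles of any length can be concatenated with the other feature types.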


Subject(s)
Deep Learning , Gene Regulatory Networks/physiology , Protein Interaction Maps/physiology , RNA, Long Noncoding/metabolism , RNA-Binding Proteins/metabolism , Animals , Computational Biology/instrumentation , Computational Biology/methods , Datasets as Topic , Humans , Neural Networks, Computer , RNA, Long Noncoding/genetics , RNA-Binding Proteins/genetics
20.
Molecules ; 26(18), 2021 Sep 21.
Article in English | MEDLINE | ID: mdl-34577174

ABSTRACT

This study presents a detailed bioinformatics analysis of fungal and chloride-dependent α-amylases from the family GH13. Overall, 268 α-amylase sequences were retrieved from subfamilies GH13_1 (39 sequences), GH13_5 (35 sequences), GH13_15 (28 sequences), GH13_24 (23 sequences), GH13_32 (140 sequences) and GH13_42 (3 sequences). Eight conserved sequence regions (CSRs) characteristic of the family GH13 were identified in all sequences, and the respective sequence logos were analysed in an effort to identify unique sequence features of each subfamily. The main emphasis was placed on the subfamily GH13_32, since it contains both fungal α-amylases and their bacterial chloride-activated counterparts. In addition to the in silico analysis focused on the potential ability to bind the chloride anion, a property typical mainly of animal α-amylases from subfamilies GH13_15 and GH13_24, attention was also paid to the potential presence of the so-called secondary surface-binding sites (SBSs) identified in complexed crystal structures of some particular α-amylases from the studied subfamilies. The α-amylases from Aspergillus niger (GH13_1), Bacillus halmapalus, Bacillus paralicheniformis and Halothermothrix orenii (all from GH13_5) and Homo sapiens (saliva; GH13_24) were used as template enzymes with experimentally determined SBSs. Evolutionary relationships between GH13 fungal and chloride-dependent α-amylases were demonstrated by two evolutionary trees: one based on the alignment of the sequence segment spanning almost the entire catalytic TIM-barrel domain, and the other based on the alignment of the eight extracted CSRs. Both trees gave similar results, indicating a closer evolutionary relatedness of subfamily GH13_1 with GH13_42 (and, in a wider sense, also with GH13_5) as well as among subfamilies GH13_32, GH13_15 and GH13_24; nevertheless, some subtle differences in the clustering of particular α-amylases may be observed.
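
Sequence logos such as those analysed for the CSRs are typically drawn from the per-column information content of an alignment. The sketch below computes that quantity for a set of aligned protein segments, as a minimal illustration rather than the pipeline used in the study. The function name `column_information` is hypothetical, gaps are simply ignored, and no small-sample correction is applied.

```python
import math
from collections import Counter


def column_information(aligned: list[str], alphabet_size: int = 20) -> list[float]:
    """Information content (bits) of each alignment column.

    Uses log2(alphabet_size) minus the Shannon entropy of the residues in
    the column. Gap characters ("-") are skipped; this is an assumed,
    simplified treatment without small-sample correction.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned segments must have the same length")
    info = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            info.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in Counter(residues).values()
        )
        info.append(math.log2(alphabet_size) - entropy)
    return info
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column with mixed residues scores lower, which is how subfamily-specific conserved positions stand out in a logo.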


Subject(s)
Chlorides/chemistry , Fungal Proteins/chemistry , alpha-Amylases/chemistry , Amino Acid Sequence , Animals , Aspergillus niger/chemistry , Bacillus/chemistry , Binding Sites , Catalytic Domain , Computational Biology , Computer Simulation , Evolution, Molecular , Firmicutes/chemistry , Humans , Protein Binding , Sequence Alignment , Surface Properties