Results 1 - 20 of 272
1.
Cell ; 187(13): 3357-3372.e19, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38866018

ABSTRACT

Microbial hydrogen (H2) cycling underpins the diversity and functionality of diverse anoxic ecosystems. Among the three evolutionarily distinct hydrogenase superfamilies responsible, [FeFe] hydrogenases were thought to be restricted to bacteria and eukaryotes. Here, we show that anaerobic archaea encode diverse, active, and ancient lineages of [FeFe] hydrogenases through combining analysis of existing and new genomes with extensive biochemical experiments. [FeFe] hydrogenases are encoded by genomes of nine archaeal phyla and expressed by H2-producing Asgard archaeon cultures. We report an ultraminimal hydrogenase in DPANN archaea that binds the catalytic H-cluster and produces H2. Moreover, we identify and characterize remarkable hybrid complexes formed through the fusion of [FeFe] and [NiFe] hydrogenases in ten other archaeal orders. Phylogenetic analysis and structural modeling suggest a deep evolutionary history of hybrid hydrogenases. These findings reveal new metabolic adaptations of archaea, streamlined H2 catalysts for biotechnological development, and a surprisingly intertwined evolutionary history between the two major H2-metabolizing enzymes.


Subject(s)
Archaea , Hydrogen , Hydrogenase , Phylogeny , Archaea/genetics , Archaea/enzymology , Archaeal Proteins/metabolism , Archaeal Proteins/chemistry , Archaeal Proteins/genetics , Genome, Archaeal , Hydrogen/metabolism , Hydrogenase/metabolism , Hydrogenase/genetics , Hydrogenase/chemistry , Iron-Sulfur Proteins/metabolism , Iron-Sulfur Proteins/genetics , Iron-Sulfur Proteins/chemistry , Models, Molecular , Protein Structure, Tertiary
2.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38261340

ABSTRACT

The recent advances of single-cell RNA sequencing (scRNA-seq) have enabled reliable profiling of gene expression at the single-cell level, providing opportunities for accurate inference of gene regulatory networks (GRNs) on scRNA-seq data. Most methods for inferring GRNs suffer from the inability to eliminate transitive interactions or necessitate expensive computational resources. To address these, we present a novel method, termed GMFGRN, for accurate graph neural network (GNN)-based GRN inference from scRNA-seq data. GMFGRN employs GNN for matrix factorization and learns representative embeddings for genes. For transcription factor-gene pairs, it utilizes the learned embeddings to determine whether they interact with each other. The extensive suite of benchmarking experiments encompassing eight static scRNA-seq datasets alongside several state-of-the-art methods demonstrated mean improvements of 1.9 and 2.5% over the runner-up in area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). In addition, across four time-series datasets, maximum enhancements of 2.4 and 1.3% in AUROC and AUPRC were observed in comparison to the runner-up. Moreover, GMFGRN requires significantly less training time and memory consumption, with time and memory consumed <10% compared to the second-best method. These findings underscore the substantial potential of GMFGRN in the inference of GRNs. It is publicly available at https://github.com/Lishuoyy/GMFGRN.
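
The AUROC figures quoted in this abstract can be read as a ranking statistic: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. The following is a minimal sketch of that definition, not GMFGRN's evaluation code; the function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the rank-based, Mann-Whitney view of AUROC).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical TF-gene pairs: label 1 = interaction, 0 = no interaction.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6]
# 4 of the 6 positive/negative pairs are ranked in the right order.
toy_auroc = auroc(toy_labels, toy_scores)
```

The pairwise loop is quadratic and only suitable for small examples; a sorted-rank implementation would be the usual choice for real benchmark sizes.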


Subject(s)
Benchmarking , Gene Regulatory Networks , Area Under Curve , Learning , Neural Networks, Computer
3.
Nucleic Acids Res ; 52(D1): D732-D737, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37870467

ABSTRACT

ICEberg 3.0 (https://tool2-mml.sjtu.edu.cn/ICEberg3/) is an upgraded database that provides comprehensive insights into bacterial integrative and conjugative elements (ICEs). In comparison to the previous version, three key enhancements were introduced: First, through text mining and manual curation, it now encompasses details of 2065 ICEs, 607 IMEs and 275 CIMEs, including 430 with experimental support. Secondly, ICEberg 3.0 systematically categorizes cargo gene functions of ICEs into six groups based on literature curation and predictive analysis, providing a profound understanding of ICEs' diverse biological traits. The cargo gene prediction pipeline is integrated into the online tool ICEfinder 2.0. Finally, ICEberg 3.0 aids the analysis and exploration of ICEs from the human microbiome. Extracted and manually curated from 2405 distinct human microbiome samples, the database comprises 1386 putative ICEs, offering insights into the complex dynamics of Bacteria-ICE-Cargo networks within the human microbiome. With the recent updates, ICEberg 3.0 enhances its capability to unravel the intricacies of ICE biology, particularly in the characterization and understanding of cargo gene functions and ICE interactions within the microbiome. This enhancement may facilitate the investigation of the dynamic landscape of ICE biology and its implications for microbial communities.


Subject(s)
Bacteria , Conjugation, Genetic , Databases, Genetic , Humans , Bacteria/genetics , Databases, Factual , DNA Transposable Elements , Microbiota
4.
Nucleic Acids Res ; 52(D1): D784-D790, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37897352

ABSTRACT

TADB 3.0 (https://bioinfo-mml.sjtu.edu.cn/TADB3/) is an updated database that provides comprehensive information on bacterial types I to VIII toxin-antitoxin (TA) loci. Compared with the previous version, three major improvements are introduced: First, with the aid of text mining and manual curation, it records the details of 536 TA loci with experimental support, including 102, 403, 8, 14, 1, 1, 3 and 4 TA loci of types I to VIII, respectively; Second, by leveraging the upgraded TA prediction tool TAfinder 2.0 with a stringent strategy, TADB 3.0 collects 211 697 putative types I to VIII TA loci predicted in 34 789 completely sequenced prokaryotic genomes, providing researchers with a large-scale dataset for further follow-up analysis and characterization; Third, based on their genomic locations, relationships of 69 019 TA loci and 60 898 mobile genetic elements (MGEs) are visualized by interactive networks accessible through the user-friendly web page. With the recent updates, TADB 3.0 may provide improved in silico support for comprehending the biological roles of TA pairs in prokaryotes and their functional associations with MGEs.


Subject(s)
Bacterial Proteins , Databases, Genetic , Interspersed Repetitive Sequences , Toxin-Antitoxin Systems , Bacterial Proteins/genetics , Genome, Bacterial , Toxin-Antitoxin Systems/genetics , Genetic Loci
5.
Nucleic Acids Res ; 52(D1): D562-D571, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953313

ABSTRACT

Single-cell proteomics enables the direct quantification of protein abundance at single-cell resolution, providing valuable insights into cellular phenotypes beyond what can be inferred from transcriptome analysis alone. However, insufficient large-scale integrated databases hinder researchers from accessing and exploring single-cell proteomics, impeding the advancement of this field. To fill this deficiency, we present a comprehensive database, namely Single-cell Proteomic DataBase (SPDB, https://scproteomicsdb.com/), for general single-cell proteomic data, including antibody-based or mass spectrometry-based single-cell proteomics. Equipped with standardized data processing and a user-friendly web interface, SPDB provides unified data formats for convenient interaction with downstream analysis, and offers not only dataset-level but also protein-level data search and exploration capabilities. To enable detailed exhibition of single-cell proteomic data, SPDB also provides a module for visualizing data from the perspectives of cell metadata or protein features. The current version of SPDB encompasses 133 antibody-based single-cell proteomic datasets involving more than 300 million cells and over 800 marker/surface proteins, and 10 mass spectrometry-based single-cell proteomic datasets involving more than 4000 cells and over 7000 proteins. Overall, SPDB is envisioned to be explored as a useful resource that will benefit the wider research community by providing detailed insights into proteomics from the single-cell perspective.


Subject(s)
Proteins , Proteomics , Antibodies , Knowledge Bases , Mass Spectrometry , Humans , Animals , Single-Cell Analysis
6.
Brief Bioinform ; 24(3), 2023 May 19.
Article in English | MEDLINE | ID: mdl-37080771

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.


Subject(s)
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Cluster Analysis
7.
Brief Bioinform ; 24(6), 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37874948

ABSTRACT

Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualized, developed, tested and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.


Subject(s)
Machine Learning , Peptide Hydrolases , Peptide Hydrolases/metabolism , Substrate Specificity , Algorithms
8.
Brief Bioinform ; 24(4), 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37291763

ABSTRACT

BACKGROUND: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for synthesizing the gene-encoded products by bacteria to grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters with limited predictive performance. RESULTS: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with the attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.


Subject(s)
Bacteria , Neural Networks, Computer , Bacteria/genetics , Bacteria/metabolism , DNA-Directed RNA Polymerases/genetics , DNA-Directed RNA Polymerases/metabolism , Base Sequence , Promoter Regions, Genetic
9.
Brief Bioinform ; 24(3), 2023 May 19.
Article in English | MEDLINE | ID: mdl-37150785

ABSTRACT

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.


Subject(s)
Drosophila melanogaster , RNA Editing , Animals , Mice , Humans , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , RNA/genetics , Adenosine/genetics , Adenosine/metabolism , Inosine/genetics , Inosine/metabolism
10.
Brief Bioinform ; 24(1), 2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36528806

ABSTRACT

Determining the pathogenicity and functional impact (i.e. gain-of-function; GOF or loss-of-function; LOF) of a variant is vital for unraveling the genetic level mechanisms of human diseases. To provide a 'one-stop' framework for the accurate identification of pathogenicity and functional impact of variants, we developed a two-stage deep-learning-based computational solution, termed VPatho, which was trained using a total of 9619 pathogenic GOF/LOF and 138 026 neutral variants curated from various databases. A total number of 138 variant-level, 262 protein-level and 103 genome-level features were extracted for constructing the models of VPatho. The development of VPatho consists of two stages: (i) a random under-sampling multi-scale residual neural network (ResNet) with a newly defined weighted-loss function (RUS-Wg-MSResNet) was proposed to predict variants' pathogenicity on the gnomAD_NV + GOF/LOF dataset; and (ii) an XGBOD model was constructed to predict the functional impact of the given variants. Benchmarking experiments demonstrated that RUS-Wg-MSResNet achieved the highest prediction performance with the weights calculated based on the ratios of neutral versus pathogenic variants. Independent tests showed that both RUS-Wg-MSResNet and XGBOD achieved outstanding performance. Moreover, assessed using variants from the CAGI6 competition, RUS-Wg-MSResNet achieved superior performance compared to state-of-the-art predictors. The fine-trained XGBOD models were further used to blind test the whole LOF data downloaded from gnomAD and accordingly, we identified 31 nonLOF variants that were previously labeled as LOF/uncertain variants. As an implementation of the developed approach, a webserver of VPatho is made publicly available at http://csbio.njust.edu.cn/bioinf/vpatho/ to facilitate community-wide efforts for profiling and prioritizing the query variants with respect to their pathogenicity and functional impact.
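
The abstract states that the loss weights were calculated from the ratio of neutral to pathogenic variants and that the majority class was randomly under-sampled, but it does not give the exact formulas. The sketch below shows one common way to do both, using the class counts quoted above; the function names, the choice of binary cross-entropy, and the specific weighting rule are assumptions, not the published RUS-Wg-MSResNet implementation.

```python
import math
import random

# Class counts quoted in the abstract.
N_PATHOGENIC = 9619
N_NEUTRAL = 138026


def class_weights(n_positive, n_negative):
    """Assumed rule: up-weight the minority (positive) class by the class ratio."""
    return {1: n_negative / n_positive, 0: 1.0}


def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each example."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)


def random_under_sample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


weights = class_weights(N_PATHOGENIC, N_NEUTRAL)  # positive weight is about 14.35
```

With these counts a misclassified pathogenic variant costs roughly 14 times as much as a misclassified neutral one, which is the intuition behind ratio-based weighting on an imbalanced dataset.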


Subject(s)
Deep Learning , Humans , Gain of Function Mutation , Genome
11.
Brief Bioinform ; 24(1), 2023 Jan 19.
Article in English | MEDLINE | ID: mdl-36567255

ABSTRACT

Underlying medical conditions, such as cancer, kidney disease and heart failure, are associated with a higher risk for severe COVID-19. Accurate classification of COVID-19 patients with underlying medical conditions is critical for personalized treatment decision and prognosis estimation. In this study, we propose an interpretable artificial intelligence model termed VDJMiner to mine the underlying medical conditions and predict the prognosis of COVID-19 patients according to their immune repertoires. In a cohort of more than 1400 COVID-19 patients, VDJMiner accurately identifies multiple underlying medical conditions, including cancers, chronic kidney disease, autoimmune disease, diabetes, congestive heart failure, coronary artery disease, asthma and chronic obstructive pulmonary disease, with an average area under the receiver operating characteristic curve (AUC) of 0.961. Meanwhile, in this same cohort, VDJMiner achieves an AUC of 0.922 in predicting severe COVID-19. Moreover, VDJMiner achieves an accuracy of 0.857 in predicting the response of COVID-19 patients to tocilizumab treatment on the leave-one-out test. Additionally, VDJMiner interpretively mines and scores V(D)J gene segments of the T-cell receptors that are associated with the disease. The identified associations between single-cell V(D)J gene segments and COVID-19 are highly consistent with previous studies. The source code of VDJMiner is publicly accessible at https://github.com/TencentAILabHealthcare/VDJMiner. The web server of VDJMiner is available at https://gene.ai.tencent.com/VDJMiner/.


Subject(s)
Asthma , COVID-19 , Humans , Artificial Intelligence , ROC Curve , Software
12.
Brief Bioinform ; 24(2), 2023 Mar 19.
Article in English | MEDLINE | ID: mdl-36880172

ABSTRACT

Lysine 2-hydroxyisobutylation (Khib), which was first reported in 2014, has been shown to play vital roles in a myriad of biological processes including gene transcription, regulation of chromatin functions, purine metabolism, pentose phosphate pathway and glycolysis/gluconeogenesis. Identification of Khib sites in protein substrates represents an initial but crucial step in elucidating the molecular mechanisms underlying protein 2-hydroxyisobutylation. Experimental identification of Khib sites mainly depends on the combination of liquid chromatography and mass spectrometry. However, experimental approaches for identifying Khib sites are often time-consuming and expensive compared with computational approaches. Previous studies have shown that Khib sites may have distinct characteristics for different cell types of the same species. Several tools have been developed to identify Khib sites, which exhibit high diversity in their algorithms, encoding schemes and feature selection techniques. However, to date, there are no tools designed for predicting cell type-specific Khib sites. Therefore, it is highly desirable to develop an effective predictor for cell type-specific Khib site prediction. Inspired by the residual connection of ResNet, we develop a deep learning-based approach, termed ResNetKhib, which leverages both the one-dimensional convolution and transfer learning to enable and improve the prediction of cell type-specific 2-hydroxyisobutylation sites. ResNetKhib is capable of predicting Khib sites for four human cell types, mouse liver cells and three rice cell types. Its performance is benchmarked against the commonly used random forest (RF) predictor on both 10-fold cross-validation and independent tests. The results show that ResNetKhib achieves the area under the receiver operating characteristic curve values ranging from 0.807 to 0.901, depending on the cell type and species, which performs better than RF-based predictors and other currently available Khib site prediction tools. We also implement an online web server of the proposed ResNetKhib algorithm together with all the curated datasets and trained model for the wider research community to use, which is publicly accessible at https://resnetkhib.erc.monash.edu/.


Subject(s)
Lysine , Protein Processing, Post-Translational , Animals , Mice , Humans , Lysine/metabolism , Proteins/metabolism , Algorithms , Machine Learning
13.
Brief Bioinform ; 24(4), 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37369638

ABSTRACT

Antimicrobial peptides (AMPs) are short peptides that play crucial roles in diverse biological processes and have various functional activities against target organisms. Due to the abuse of chemical antibiotics and microbial pathogens' increasing resistance to antibiotics, AMPs have the potential to be alternatives to antibiotics. As such, the identification of AMPs has become a widely discussed topic. A variety of computational approaches have been developed to identify AMPs based on machine learning algorithms. However, most of them are not capable of predicting the functional activities of AMPs, and those predictors that can specify activities only focus on a few of them. In this study, we first surveyed 10 predictors that can identify AMPs and their functional activities in terms of the features they employed and the algorithms they utilized. Then, we constructed comprehensive AMP datasets and proposed a new deep learning-based framework, iAMPCN (identification of AMPs based on CNNs), to identify AMPs and their related 22 functional activities. Our experiments demonstrate that iAMPCN significantly improved the prediction performance of AMPs and their corresponding functional activities based on four types of sequence features. Benchmarking experiments on the independent test datasets showed that iAMPCN outperformed a number of state-of-the-art approaches for predicting AMPs and their functional activities. Furthermore, we analyzed the amino acid preferences of different AMP activities and evaluated the model on datasets of varying sequence redundancy thresholds. To facilitate the community-wide identification of AMPs and their corresponding functional types, we have made the source codes of iAMPCN publicly available at https://github.com/joy50706/iAMPCN/tree/master. We anticipate that iAMPCN can be explored as a valuable tool for identifying potential AMPs with specific functional activities for further experimental validation.


Subject(s)
Antimicrobial Cationic Peptides , Deep Learning , Antimicrobial Cationic Peptides/pharmacology , Antimicrobial Peptides , Anti-Bacterial Agents , Algorithms
14.
Brief Bioinform ; 24(6), 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37950905

ABSTRACT

Cancer genomics is dedicated to elucidating the genes and pathways that contribute to cancer progression and development. Identifying cancer genes (CGs) associated with the initiation and progression of cancer is critical for characterization of molecular-level mechanisms in cancer research. In recent years, the growing availability of high-throughput molecular data and advancements in deep learning technologies have enabled the modelling of complex interactions and topological information within genomic data. Nevertheless, because of the limited labelled data, pinpointing CGs from a multitude of potential mutations remains an exceptionally challenging task. To address this, we propose a novel deep learning framework, termed self-supervised masked graph learning (SMG), which comprises SMG reconstruction (pretext task) and task-specific fine-tuning (downstream task). In the pretext task, the nodes of multi-omic featured protein-protein interaction (PPI) networks are randomly substituted with a defined mask token. The PPI networks are then reconstructed using the graph neural network (GNN)-based autoencoder, which explores the node correlations in a self-prediction manner. In the downstream tasks, the pre-trained GNN encoder embeds the input networks into feature graphs, whereas a task-specific layer proceeds with the final prediction. To assess the performance of the proposed SMG method, benchmarking experiments are performed on three node-level tasks (identification of CGs, essential genes and healthy driver genes) and one graph-level task (identification of disease subnetwork) across eight PPI networks. Benchmarking experiments and performance comparison with existing state-of-the-art methods demonstrate the superiority of SMG on multi-omic feature engineering.
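
The pretext task described in this abstract, replacing randomly chosen node features with a mask token and scoring how well they are reconstructed, can be sketched without any graph library. The example below is an illustration under stated assumptions: the mask rate, the mean-squared-error score restricted to masked nodes, and the names `mask_nodes` and `reconstruction_error` are hypothetical, and the GNN autoencoder itself is left out.

```python
import random


def mask_nodes(features, mask_rate, mask_token, seed=0):
    """Return a copy of the node features with a random subset replaced by mask_token."""
    if not 0.0 < mask_rate <= 1.0:
        raise ValueError("mask_rate must be in (0, 1]")
    rng = random.Random(seed)
    n_nodes = len(features)
    n_masked = max(1, round(mask_rate * n_nodes))
    masked_ids = sorted(rng.sample(range(n_nodes), n_masked))
    masked = [list(row) for row in features]  # copy so the input is not modified
    for i in masked_ids:
        masked[i] = list(mask_token)
    return masked, masked_ids


def reconstruction_error(original, reconstructed, masked_ids):
    """Mean squared error computed over the masked nodes only (an assumed objective)."""
    if not masked_ids:
        raise ValueError("no masked nodes to score")
    total, count = 0.0, 0
    for i in masked_ids:
        for a, b in zip(original[i], reconstructed[i]):
            total += (a - b) ** 2
            count += 1
    return total / count


# Hypothetical multi-omic features for four genes (nodes) of a PPI network.
node_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked_features, masked_ids = mask_nodes(node_features, mask_rate=0.5, mask_token=[0.0, 0.0])
```

In a full pipeline the masked features and the PPI adjacency would be passed to the encoder, and `reconstruction_error` (or whichever loss is actually used) would compare the decoder output against `node_features` at `masked_ids`.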


Subject(s)
Neoplasms , Oncogenes , Mutation , Benchmarking , Genes, Essential , Genomics , Neoplasms/genetics
15.
Brief Bioinform ; 25(1), 2023 Nov 22.
Article in English | MEDLINE | ID: mdl-38152979

ABSTRACT

The identification and characterization of essential genes are central to our understanding of the core biological functions in eukaryotic organisms, and have important implications for the treatment of diseases caused by, for example, cancers and pathogens. Given the major constraints in testing the functions of genes of many organisms in the laboratory, due to the absence of in vitro cultures and/or gene perturbation assays for most metazoan species, there has been a need to develop in silico tools for the accurate prediction or inference of essential genes to underpin systems biological investigations. Major advances in machine learning approaches provide unprecedented opportunities to overcome these limitations and accelerate the discovery of essential genes on a genome-wide scale. Here, we developed and evaluated a large language model- and graph neural network (LLM-GNN)-based approach, called 'Bingo', to predict essential protein-coding genes in the metazoan model organisms Caenorhabditis elegans and Drosophila melanogaster as well as in Mus musculus and Homo sapiens (a HepG2 cell line) by integrating LLM and GNNs with adversarial training. Bingo predicts essential genes under two 'zero-shot' scenarios with transfer learning, showing promise to compensate for a lack of high-quality genomic and proteomic data for non-model organisms. In addition, the attention mechanisms and GNNExplainer were employed to highlight the functional sites and structural domains that contribute most to essentiality. In conclusion, Bingo provides the prospect of being able to accurately infer the essential genes of little- or under-studied organisms of interest, and provides a biological explanation for gene essentiality.


Subject(s)
Drosophila Proteins , Genes, Essential , Mice , Animals , Proteomics , Drosophila melanogaster/genetics , Workflow , Neural Networks, Computer , Proteins/genetics , Microfilament Proteins/genetics , Drosophila Proteins/genetics
16.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38552307

ABSTRACT

MOTIVATION: Cell-type clustering is a crucial first step for single-cell RNA-seq data analysis. However, existing clustering methods often provide different results on cluster assignments with respect to their own data pre-processing, choice of distance metrics, and strategies of feature extraction, thereby limiting their practical applications. RESULTS: We propose Cross-Tabulation Ensemble Clustering (CTEC) method that formulates two re-clustering strategies (distribution- and outlier-based) via cross-tabulation. Benchmarking experiments on five scRNA-Seq datasets illustrate that the proposed CTEC method offers significant improvements over the individual clustering methods. Moreover, CTEC-DB outperforms the state-of-the-art ensemble methods for single-cell data clustering, with 45.4% and 17.1% improvement over the single-cell aggregated from ensemble clustering method (SAFE) and the single-cell aggregated clustering via Mixture model ensemble method (SAME), respectively, on the two-method ensemble test. AVAILABILITY AND IMPLEMENTATION: The source code of the benchmark in this work is available at the GitHub repository https://github.com/LWCHN/CTEC.git.
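
Cross-tabulation, the step on which this method is built, counts how often each label from one clustering co-occurs with each label from another. The sketch below shows that table for two toy clusterings and flags cells that fall in sparsely populated table entries as candidates for re-clustering. The threshold rule and the names `cross_tabulate` and `low_agreement_cells` are hypothetical; the abstract does not describe CTEC's actual distribution- or outlier-based criteria.

```python
from collections import Counter


def cross_tabulate(labels_a, labels_b):
    """Count co-occurrences of (label from clustering A, label from clustering B)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same cells")
    return Counter(zip(labels_a, labels_b))


def low_agreement_cells(labels_a, labels_b, min_count):
    """Indices of cells whose (A, B) label pair occurs fewer than min_count times."""
    table = cross_tabulate(labels_a, labels_b)
    return [
        i
        for i, pair in enumerate(zip(labels_a, labels_b))
        if table[pair] < min_count
    ]


# Two hypothetical clusterings of the same six cells.
clustering_a = [0, 0, 0, 1, 1, 1]
clustering_b = ["x", "x", "y", "y", "y", "y"]
table = cross_tabulate(clustering_a, clustering_b)
candidates = low_agreement_cells(clustering_a, clustering_b, min_count=2)
```

Here the two methods agree on five cells, and only the third cell (index 2) lands in a table entry on its own, so it is the one a re-clustering step would revisit.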


Subject(s)
Algorithms , Single-Cell Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Cluster Analysis , Data Analysis , Gene Expression Profiling/methods
17.
BMC Bioinformatics ; 25(1): 13, 2024 Jan 09.
Article in English | MEDLINE | ID: mdl-38195423

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are a class of non-coding RNAs that play a pivotal role as gene expression regulators. These miRNAs are typically approximately 20 to 25 nucleotides long. The maturation of miRNAs requires Dicer cleavage at specific sites within the precursor miRNAs (pre-miRNAs). Recent advances in machine learning-based approaches for cleavage site prediction, such as PHDcleav and LBSizeCleav, have been reported. ReCGBM, a gradient boosting-based model, demonstrates superior performance compared with existing methods. Nonetheless, ReCGBM operates solely as a binary classifier despite the presence of two cleavage sites in a typical pre-miRNA. Previous approaches have focused on utilizing only a fraction of the structural information in pre-miRNAs, often overlooking comprehensive secondary structure information. There is a compelling need for the development of a novel model to address these limitations. RESULTS: In this study, we developed a deep learning model for predicting the presence of a Dicer cleavage site within a pre-miRNA segment. This model was enhanced by an autoencoder that learned the secondary structure embeddings of pre-miRNA. Benchmarking experiments demonstrated that the performance of our model was comparable to that of ReCGBM in the binary classification tasks. In addition, our model excelled in multi-class classification tasks, making it a more versatile and practical solution than ReCGBM. CONCLUSIONS: Our proposed model exhibited superior performance compared with the current state-of-the-art model, underscoring the effectiveness of a deep learning approach in predicting Dicer cleavage sites. Furthermore, our model could be trained using only sequence and secondary structure information. Its capacity to accommodate multi-class classification tasks has enhanced the practical utility of our model.


Subject(s)
Deep Learning , MicroRNAs , Humans , Benchmarking , Machine Learning , Nucleotides
18.
BMC Med ; 22(1): 188, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38715068

ABSTRACT

BACKGROUND: Floods are the most frequent weather-related disaster, causing significant health impacts worldwide. Limited studies have examined the long-term consequences of flooding exposure. METHODS: Flood data were retrieved from the Dartmouth Flood Observatory and linked with health data from 499,487 UK Biobank participants. To calculate the annual cumulative flooding exposure, we multiplied the duration and severity of each flood event and then summed these values for each year. We conducted a nested case-control analysis to evaluate the long-term effect of flooding exposure on all-cause and cause-specific mortality. Each case was matched with eight controls. Flooding exposure was modelled using a distributed lag non-linear model to capture its nonlinear and lagged effects. RESULTS: The risk of all-cause mortality increased by 6.7% (odds ratio (OR): 1.067, 95% confidence interval (CI): 1.063-1.071) for every unit increase in flood index after confounders had been controlled for. The mortality risk from neurological and mental diseases was negligible in the current year, but strongest in the lag years 3 and 4. By contrast, the risk of mortality from suicide was the strongest in the current year (OR: 1.018, 95% CI: 1.008-1.028), and attenuated to lag year 5. Participants with higher levels of education and household income had a higher estimated risk of death from most causes whereas the risk of suicide-related mortality was higher among participants who were obese, had lower household income, engaged in less physical activity, were non-moderate alcohol consumers, and those living in more deprived areas. CONCLUSIONS: Long-term exposure to floods is associated with an increased risk of mortality. The health consequences of flooding exposure would vary across different periods after the event, with different profiles of vulnerable populations identified for different causes of death. These findings contribute to a better understanding of the long-term impacts of flooding exposure.
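
The annual cumulative flooding exposure described in METHODS (duration multiplied by severity for each event, summed within each year) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `annual_flood_index`, the `(year, duration_days, severity)` record layout, and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict


def annual_flood_index(events):
    """Sum duration * severity over the flood events in each year.

    `events` is an iterable of (year, duration_days, severity) tuples.
    This record layout is hypothetical, not the study's data format.
    """
    index = defaultdict(float)
    for year, duration_days, severity in events:
        index[year] += duration_days * severity
    return dict(index)


# Made-up events, for illustration only (not study data)
events = [(2007, 10, 2.0), (2007, 3, 1.0), (2012, 5, 1.5)]
print(annual_flood_index(events))  # {2007: 23.0, 2012: 7.5}
```

Under this reading, the reported odds ratio of 1.067 corresponds to a 6.7% increase in the odds of all-cause mortality per one-unit increase in such an index.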


Subject(s)
Floods , Humans , Floods/mortality , Case-Control Studies , United Kingdom/epidemiology , Male , Female , Aged , Middle Aged , Adult , Cause of Death , Risk Factors
19.
Small ; 20(6): e2305052, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37798622

ABSTRACT

The rapid increase and spread of Gram-negative bacteria resistant to many or all existing treatments threaten a return to the preantibiotic era. The presence of bacterial polysaccharides that impede the penetration of many antimicrobials and protect them from the innate immune system contributes to resistance and pathogenicity. No currently approved antibiotics target the polysaccharide regions of microbes. Here, monolaurin-based niosomes, the first lipid nanoparticles that can eliminate bacterial polysaccharides from hypervirulent Klebsiella pneumoniae, are described. Their combination with polymyxin B shows no cytotoxicity in vitro and is highly effective in combating K. pneumoniae infection in vivo. Comprehensive mechanistic studies have revealed that antimicrobial activity proceeds via a multimodal mechanism. Initially, lipid nanoparticles disrupt polysaccharides, then outer and inner membranes are destabilized and destroyed by polymyxin B, resulting in synergistic cell lysis. This novel lipidic nanoparticle system shows tremendous promise as a highly effective antimicrobial treatment targeting multidrug-resistant Gram-negative pathogens.


Subject(s)
Nanoparticles , Polymyxin B , Polymyxin B/pharmacology , Liposomes/pharmacology , Anti-Bacterial Agents/pharmacology , Gram-Negative Bacteria , Klebsiella pneumoniae , Polysaccharides, Bacterial/pharmacology , Microbial Sensitivity Tests , Drug Resistance, Multiple, Bacterial
20.
Nat Methods ; 18(5): 520-527, 2021 May.
Article in English | MEDLINE | ID: mdl-33859439

ABSTRACT

Despite the availability of methods for analyzing protein complexes, systematic analysis of complexes under multiple conditions remains challenging. Approaches based on biochemical fractionation of intact, native complexes and correlation of protein profiles have shown promise. However, most approaches for interpreting cofractionation datasets to yield complex composition and rearrangements between samples depend considerably on protein-protein interaction inference. We introduce PCprophet, a toolkit built on size exclusion chromatography-sequential window acquisition of all theoretical mass spectrometry (SEC-SWATH-MS) data to predict protein complexes and characterize their changes across experimental conditions. We demonstrate improved performance of PCprophet over state-of-the-art approaches and introduce a Bayesian approach to analyze altered protein-protein interactions across conditions. We provide both command-line and graphical interfaces to support the application of PCprophet to any cofractionation MS dataset, independent of separation or quantitative liquid chromatography-MS workflow, for the detection and quantitative tracking of protein complexes and their physiological dynamics.


Subject(s)
Machine Learning , Proteins/chemistry , Proteomics , Software , Bayes Theorem , Chromatography, Gel , Databases, Protein , Protein Conformation