1.
Nucleic Acids Res ; 52(D1): D419-D425, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37889074

ABSTRACT

Anti-prokaryotic immune system (APIS) proteins, typically encoded by phages, prophages, and plasmids, inhibit prokaryotic immune systems (e.g. restriction modification, toxin-antitoxin, CRISPR-Cas). A growing number of APIS genes have been characterized, but they are dispersed across the literature. Here we developed dbAPIS (https://bcb.unl.edu/dbAPIS) as the first literature-curated data repository for experimentally verified APIS genes and their associated protein families. The key features of dbAPIS include: (i) experimentally verified APIS genes with their protein sequences, functional annotation, PDB or AlphaFold predicted structures, genomic context, and sequence and structural homologs from different microbiome/virome databases; (ii) classification of APIS proteins into sequence-based families and construction of hidden Markov models (HMMs); (iii) a user-friendly web interface for data browsing by the inhibited immune system types or by the hosts, and functions for searching and batch downloading of pre-computed data; (iv) inclusion of all types of APIS proteins (except for anti-CRISPRs) that inhibit a variety of prokaryotic defense systems (e.g. RM, TA, CBASS, Thoeris, Gabija). The current release of dbAPIS contains 41 verified APIS proteins and ∼4400 sequence homologs of 92 families and 38 clans. dbAPIS will facilitate the discovery of novel anti-defense genes and genomic islands in phages by providing a user-friendly data repository and a web resource for easy homology searches against known APIS proteins.


Subject(s)
CRISPR-Associated Proteins , DNA Restriction-Modification Enzymes , Databases, Genetic , Toxin-Antitoxin Systems , Bacteriophages/genetics , Genome , Genomics , DNA Restriction-Modification Enzymes/classification , DNA Restriction-Modification Enzymes/genetics , Toxin-Antitoxin Systems/genetics , CRISPR-Associated Proteins/classification , CRISPR-Associated Proteins/genetics , Internet Use
2.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37930028

ABSTRACT

Technological advances have now made it possible to simultaneously profile epigenomic, transcriptomic and proteomic changes at the single-cell level, allowing a more unified view of cellular phenotypes and heterogeneities. However, current computational tools for single-cell multi-omics data integration are mainly tailored for bi-modality data, so new tools are urgently needed to integrate tri-modality data with complex associations. To this end, we develop scMHNN to integrate single-cell multi-omics data based on a hypergraph neural network. After modeling the complex data associations among various modalities, scMHNN performs message passing on the multi-omics hypergraph, which can capture the high-order data relationships and integrate the multiple heterogeneous features. Subsequently, scMHNN learns discriminative cell representations via a dual-contrastive loss in a self-supervised manner. Based on the pretrained hypergraph encoder, we further introduce the pre-training and fine-tuning paradigm, which allows more accurate cell-type annotation with only a small number of labeled cells as reference. Benchmarking results on real and simulated single-cell tri-modality datasets indicate that scMHNN outperforms other competing methods on both cell clustering and cell-type annotation tasks. In addition, we demonstrate that scMHNN facilitates various downstream tasks, such as cell marker detection and enrichment analysis.


Subject(s)
Epigenomics , Transcriptome , Proteomics , Gene Expression Profiling , Neural Networks, Computer
3.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38439545

ABSTRACT

MOTIVATION: Removal of batch effects between multiple datasets from different experimental platforms has become an urgent problem as single-cell RNA sequencing (scRNA-seq) techniques have developed rapidly. Although some methods have been proposed for this problem, most of them still face the challenge of under-correction or over-correction. Specifically, handling batch effects in highly nonlinear scRNA-seq data requires a more powerful model to address under-correction. Meanwhile, some previous methods focus too much on removing differences between batches, which may disturb the biological signal heterogeneity of datasets generated from different experiments, thereby leading to over-correction. RESULTS: In this article, we propose BERMAD, a novel multi-layer adaptation autoencoder with a dual-channel framework, to address the under-correction and over-correction problems in batch effect removal. BERMAD can achieve better results in scRNA-seq data integration and joint analysis. First, we design a multi-layer adaptation architecture to model the distribution differences between batches at different feature granularities. Distribution matching on various layers of the autoencoder with different feature dimensions can yield a more accurate batch correction outcome. Second, we propose a dual-channel framework, in which the deep autoencoder processing each single dataset is trained independently. Hence, the heterogeneous information that is not shared between different batches can be retained more completely, which can alleviate over-correction. Comprehensive experiments on multiple scRNA-seq datasets demonstrate the effectiveness and superiority of our method over the state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: The code implemented in Python and the data used for experiments have been released on GitHub (https://github.com/zhanglabNKU/BERMAD) and Zenodo (https://zenodo.org/records/10695073) with detailed instructions.


Subject(s)
Single-Cell Analysis , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Cluster Analysis
4.
Nucleic Acids Res ; 51(W1): W115-W121, 2023 07 05.
Article in English | MEDLINE | ID: mdl-37125649

ABSTRACT

Carbohydrate active enzymes (CAZymes) are made by various organisms for complex carbohydrate metabolism. Genome mining of CAZymes has become a routine data analysis in (meta-)genome projects, owing to the importance of CAZymes in bioenergy, microbiome, nutrition, agriculture, and global carbon recycling. In 2012, dbCAN was provided as an online web server for automated CAZyme annotation. dbCAN2 (https://bcb.unl.edu/dbCAN2) was further developed in 2018 as a meta server to combine multiple tools for improved CAZyme annotation. dbCAN2 also included CGC-Finder, a tool for identifying CAZyme gene clusters (CGCs) in (meta-)genomes. We have updated the meta server to dbCAN3 with the following new functions and components: (i) dbCAN-sub as a profile Hidden Markov Model database (HMMdb) for substrate prediction at the CAZyme subfamily level; (ii) searching against experimentally characterized polysaccharide utilization loci (PULs) with known glycan substrates in the dbCAN-PUL database for substrate prediction at the CGC level; (iii) a majority voting method that considers all CAZymes with substrates predicted by dbCAN-sub for substrate prediction at the CGC level; (iv) improved data browsing and visualization of substrate prediction results on the website. In summary, dbCAN3 not only inherits all the functions of dbCAN2, but also integrates three new methods for glycan substrate prediction.
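
The majority-voting idea in item (iii) can be illustrated with a minimal Python sketch. This is not dbCAN3's implementation: the function name, the input format (one predicted substrate, or None, per CAZyme in a CGC) and the rule that ties produce no call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_cgc_substrate(cazyme_substrates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent substrate among a CGC's CAZymes, or None.

    Hypothetical helper (not part of dbCAN3): CAZymes without a
    prediction (None) do not vote, and a tie for first place yields
    None instead of an arbitrary pick.
    """
    votes = Counter(s for s in cazyme_substrates if s is not None)
    if not votes:
        return None  # no CAZyme in this cluster had a predicted substrate
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single majority substrate
    return ranked[0][0]
```

For example, a cluster whose CAZymes are labelled xylan, xylan, pectin and "no prediction" would be assigned xylan under these assumed rules.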


Subject(s)
Carbohydrates , Microbiota , Carbohydrate Metabolism/genetics , Polysaccharides , Databases, Factual
5.
Nucleic Acids Res ; 51(D1): D557-D563, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36399503

ABSTRACT

Carbohydrate Active EnZymes (CAZymes) are significantly important for microbial communities to thrive in carbohydrate rich environments such as animal guts, agricultural soils, forest floors, and ocean sediments. Since 2017, microbiome sequencing and assembly have produced numerous metagenome assembled genomes (MAGs). We have updated our dbCAN-seq database (https://bcb.unl.edu/dbCAN_seq) to include the following new data and features: (i) ∼498 000 CAZymes and ∼169 000 CAZyme gene clusters (CGCs) from 9421 MAGs of four ecological (human gut, human oral, cow rumen, and marine) environments; (ii) Glycan substrates for 41 447 (24.54%) CGCs inferred by two novel approaches (dbCAN-PUL homology search and eCAMI subfamily majority voting) (the two approaches agreed on 4183 CGCs for substrate assignments); (iii) A redesigned CGC page to include the graphical display of CGC gene compositions, the alignment of query CGC and subject PUL (polysaccharide utilization loci) of dbCAN-PUL, and the eCAMI subfamily table to support the predicted substrates; (iv) A statistics page to organize all the data for easy CGC access according to substrates and taxonomic phyla; and (v) A batch download page. In summary, this updated dbCAN-seq database highlights glycan substrates predicted for CGCs from microbiomes. Future work will implement the substrate prediction function in our dbCAN2 web server.


Subject(s)
Microbiota , Animals , Humans , Carbohydrates , Metagenome/genetics , Microbiota/genetics , Multigene Family , Polysaccharides/metabolism , Enzymes/genetics , Bacteria/enzymology , Environmental Microbiology
6.
Small ; 20(16): e2307627, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38063849

ABSTRACT

The high freezing point of polybromides, the charging products, is a significant obstacle to the rapid development of zinc-bromine flow batteries (Zn-Br2 FBs). Here, a choline-based complexing agent (CCA) is constructed to liquefy the polybromides at low temperatures. Owing to its quaternary ammonium group, choline can effectively complex with polybromide anions and form a dense oil phase that has excellent antifreezing properties. Benefiting from the indispensable strong ion-ion interaction, a highly selectively compatible CCA consisting of choline and N-methyl-N-ethyl-morpholinium salts (CCA-M) can be achieved to further enhance the bromine-fixing ability. Interestingly, the polybromides formed with CCA-M remain liquid even at -40 °C. CCA-M endows Zn-Br2 FBs at 40 mA cm⁻² with an unprecedentedly long cycle life and high Coulombic efficiency (CE) not only at -20 °C (over 150 cycles, average CE ≈98.8%) but also at room temperature (over 1200 cycles, average CE ≈94.7%). The CCA shows promising application prospects and should be extended to other antifreezing bromine-based energy storage systems.

7.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36124775

ABSTRACT

Pan-genome analyses of metagenome-assembled genomes (MAGs) may suffer from the known issues with MAGs: fragmentation, incompleteness and contamination. Here, we conducted a critical assessment of pan-genomics of MAGs by comparing pan-genome analysis results of complete bacterial genomes and simulated MAGs. We found that incompleteness led to significant core gene (CG) loss. The CG loss remained when using different pan-genome analysis tools (Roary, BPGA, Anvi'o) and when using a mixture of MAGs and complete genomes. Contamination had little effect on core genome size (except for Roary, due to its gene clustering issue) but had a major influence on accessory genomes. Importantly, the CG loss was partially alleviated by lowering the CG threshold and using gene prediction algorithms that consider fragmented genes, but to a lesser degree when incompleteness was higher than 5%. The CG loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees. Our main findings were supported by a study of real MAG-isolate genome data. We conclude that lowering the CG threshold and predicting genes in metagenome mode (as Anvi'o does with Prodigal) are necessary in pan-genome analysis of MAGs. Development of new pan-genome analysis tools specifically for MAGs is needed in future studies.
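
To make the effect of the CG threshold concrete, the following Python sketch calls a gene family "core" when it is present in at least a given fraction of genomes. It is a toy illustration only: the data layout (genome name mapped to a set of gene family IDs) and the function name are assumptions, and it does not reproduce how Roary, BPGA or Anvi'o cluster genes.

```python
from typing import Dict, Set


def core_genes(presence: Dict[str, Set[str]], threshold: float = 1.0) -> Set[str]:
    """Gene families present in at least `threshold` (0-1) of the genomes.

    `presence` maps a genome name to the set of gene family IDs found in
    it (hypothetical input format chosen for this sketch).
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        return set()
    counts: Dict[str, int] = {}
    for families in presence.values():
        for fam in families:
            counts[fam] = counts.get(fam, 0) + 1
    return {fam for fam, c in counts.items() if c / n_genomes >= threshold}


# Toy data: famB is missing from one incomplete MAG.
toy = {
    "genome1": {"famA", "famB", "famC"},
    "genome2": {"famA", "famB"},
    "genome3": {"famA", "famB"},
    "mag4": {"famA"},
}
strict_core = core_genes(toy, threshold=1.0)    # only famA survives
relaxed_core = core_genes(toy, threshold=0.75)  # famA and famB
```

With a strict 100% threshold, a single incomplete MAG removes famB from the core; relaxing the threshold to 75% recovers it, which is the kind of mitigation the abstract describes.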


Subject(s)
Genome, Bacterial , Metagenome , Phylogeny , Genomics , Sequence Analysis, DNA/methods , Metagenomics/methods
8.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34889446

ABSTRACT

In biomedical networks, molecular associations are important to understand biological processes and functions. Many computational methods, such as link prediction methods based on graph neural networks (GNNs), have been successfully applied in discovering molecular relationships with biological significance. However, it remains a challenge to explore a method that relies on representation learning of links for accurately predicting molecular associations. In this paper, we present a novel GNN based on link representation (LR-GNN) to identify potential molecular associations. LR-GNN applies a graph convolutional network (GCN)-encoder to obtain node embeddings. To represent associations between molecules, we design a propagation rule that captures the node embeddings of each GCN-encoder layer to construct the LR. Furthermore, the LRs of all layers are fused in the output by a designed layer-wise fusing rule, which enables LR-GNN to output more accurate results. Experiments on four biomedical network datasets, including lncRNA-disease association, miRNA-disease association, protein-protein interaction and drug-drug interaction, show that LR-GNN outperforms state-of-the-art methods and achieves robust performance. Case studies are also presented on two datasets to verify the ability to predict unknown associations. Finally, we validate the effectiveness of the LR by visualization.


Subject(s)
Computational Biology/methods , Neural Networks, Computer , Algorithms , Biomedical Technology/methods , Cell Communication , Deep Learning , Drug Interactions , Humans , MicroRNAs , Protein Interaction Domains and Motifs , RNA, Long Noncoding , Research Design
9.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35947989

ABSTRACT

In recent years, a number of computational approaches have been proposed to effectively integrate multiple heterogeneous biological networks, and have shown impressive performance for inferring gene function. However, the previous methods do not fully represent the critical neighborhood relationship between genes during the feature learning process. Furthermore, it is difficult to accurately estimate the contributions of different views for multi-view integration. In this paper, we propose MGEGFP, a multi-view graph embedding method based on adaptive estimation with a Graph Convolutional Network (GCN), to learn high-quality gene representations among multiple interaction networks for function prediction. First, we design a dual-channel GCN encoder to disentangle the view-specific information and the consensus pattern across diverse networks. With the aid of the disentangled representations, we develop a multi-gate module to adaptively estimate the contributions of different views during each reconstruction process and make full use of the multiplexity advantages, where a diversity preservation constraint is designed to prevent the over-fitting problem. To validate the effectiveness of our model, we conduct experiments on networks from the STRING database for both yeast and human datasets, and compare the performance with seven state-of-the-art methods on five evaluation metrics. Moreover, the ablation study demonstrates the important contribution of the designed dual-channel encoder, multi-gate module and diversity preservation constraint in MGEGFP. The experimental results confirm the superiority of our proposed method and suggest that MGEGFP can be a useful tool for gene function prediction.


Subject(s)
Computational Biology , Gene Regulatory Networks , Humans , Saccharomyces cerevisiae/genetics
10.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37158576

ABSTRACT

MOTIVATION: Encoded by (pro-)viruses, anti-CRISPR (Acr) proteins inhibit the CRISPR-Cas immune system of their prokaryotic hosts. As a result, Acr proteins can be employed to develop more controllable CRISPR-Cas genome editing tools. Recent studies revealed that known acr genes often coexist with other acr genes and with phage structural genes within the same operon. For example, we found that 47 of 98 known acr genes (or their homologs) co-exist in the same operons. None of the current Acr prediction tools have considered this important genomic context feature. We have developed a new software tool AOminer to facilitate the improved discovery of new Acrs by fully exploiting the genomic context of known acr genes and their homologs. RESULTS: AOminer is the first machine learning based tool focused on the discovery of Acr operons (AOs). A two-state HMM (hidden Markov model) was trained to learn the conserved genomic context of operons that contain known acr genes or their homologs, and the learnt features could distinguish AOs and non-AOs. AOminer allows automated mining for potential AOs from query genomes or operons. AOminer outperformed all existing Acr prediction tools with an accuracy = 0.85. AOminer will facilitate the discovery of novel anti-CRISPR operons. AVAILABILITY AND IMPLEMENTATION: The webserver is available at: http://aca.unl.edu/AOminer/AOminer_APP/. The python program is at: https://github.com/boweny920/AOminer.
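
As a rough illustration of how a two-state HMM can label a stretch of genes, the sketch below runs Viterbi decoding in log space. Everything specific in it is an assumption for illustration: the two states (0 = non-AO, 1 = AO), the two observation symbols (0 = other gene, 1 = acr-like gene) and all probabilities are invented, and the abstract does not say which features or decoding procedure AOminer actually uses.

```python
import math
from typing import List, Sequence


def viterbi(obs: Sequence[int],
            start: Sequence[float],
            trans: Sequence[Sequence[float]],
            emit: Sequence[Sequence[float]]) -> List[int]:
    """Most probable hidden-state path for `obs` under a discrete HMM."""
    if not obs:
        return []
    n_states = len(start)
    # Log-score of the best path ending in each state at the first position.
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]
    back: List[List[int]] = []
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s.
            best = max(range(n_states), key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(emit[s][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Invented toy parameters (NOT AOminer's trained model).
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist
EMIT = [[0.9, 0.1], [0.2, 0.8]]    # state 1 mostly emits acr-like genes
```

Calling `viterbi([0, 0, 0, 1, 1, 1, 0, 0, 0], START, TRANS, EMIT)` labels the three acr-like genes in the middle as state 1 and the flanking genes as state 0 under these toy parameters.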


Subject(s)
Bacteriophages , Viral Proteins , Viral Proteins/genetics , CRISPR-Cas Systems/genetics , Gene Editing , Operon , Bacteriophages/genetics , Machine Learning
11.
Plant Physiol ; 192(1): 205-221, 2023 05 02.
Article in English | MEDLINE | ID: mdl-36756926

ABSTRACT

Flowering time is one of the most important agronomic traits affecting the adaptation and yield of rice (Oryza sativa). Heading date 1 (Hd1) is a key factor in the photoperiodic control of flowering time. In this study, two basic helix-loop-helix (bHLH) transcription factors, Hd1 Binding Protein 1 (HBP1) and Partner of HBP1 (POH1) were identified as transcriptional regulators of Hd1. We generated knockout mutants of HBP1 and ectopically expressed transgenic lines of the two bHLH transcription factors and used these lines to investigate the roles of these two factors in regulating flowering time. HBP1 physically associated with POH1 forming homo- or heterodimers to perform their functions. Both HBP1 and POH1 bound directly to the cis-acting elements located in the promoter of Hd1 to activate its expression. CRISPR/Cas9-generated knockout mutations of HBP1, but not POH1 mutations, promoted earlier flowering time; conversely, HBP1 and POH1 overexpression delayed flowering time in rice under long-day and short-day conditions by activating the expression of Hd1 and suppressing the expression of Early heading date 1 (Ehd1), Heading date 3a (Hd3a), and Rice Flowering locus T 1 (RFT1), thus controlling flowering time in rice. Our findings revealed a mechanism for flowering time control through transcriptional regulation of Hd1 and laid theoretical and practical foundations for improving the growth period, adaptation, and yield of rice.


Subject(s)
Flowers , Oryza , Oryza/metabolism , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Photoperiod , Phenotype , Plant Proteins/genetics , Plant Proteins/metabolism , Gene Expression Regulation, Plant
12.
Bioinformatics ; 38(5): 1295-1303, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34864918

ABSTRACT

MOTIVATION: With the development of single-cell RNA sequencing (scRNA-seq) techniques, increasingly more large-scale gene expression datasets become available. However, to analyze datasets produced by different experiments, batch effects among different datasets must be considered. Although several methods have been recently published to remove batch effects in scRNA-seq data, two problems remain challenging and not completely solved: (i) how to reduce the distribution differences of different batches more accurately; and (ii) how to align samples from different batches to recover the cell type clusters. RESULTS: We proposed a novel deep-learning approach, a hierarchical distribution-matching framework assisted with contrastive learning, to address these two problems. Firstly, we design a hierarchical framework for distribution matching based on a deep autoencoder. This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a maximum mean discrepancy-based loss. For local matching, we divide cells in each batch into clusters and develop a contrastive learning mechanism to simultaneously align similar cluster pairs and keep noisy pairs apart from each other. This allows us to obtain clusters with all cells of the same type (true positives), and avoid clusters with cells of different types (false positives). We demonstrate the effectiveness of our method on both simulated and real datasets. Results show that our new method significantly outperforms the state-of-the-art methods and has the ability to prevent overcorrection. AVAILABILITY AND IMPLEMENTATION: The Python code to generate results and figures in this article is available at https://github.com/zhanglabNKU/HDMC; the data underlying this article are also available at this GitHub repository. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
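
The maximum mean discrepancy (MMD) mentioned above measures how far apart two samples are in a kernel feature space. The following is a minimal pure-Python sketch of the standard biased estimate of squared MMD with a Gaussian (RBF) kernel. The kernel choice, the `gamma` bandwidth parameter and the function names are assumptions for illustration; the abstract does not specify how the HDMC loss is defined.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two non-empty samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy embeddings for two batches (invented values).
batch_a = [[0.0, 0.0], [1.0, 0.0]]
batch_b_shifted = [[5.0, 5.0], [6.0, 5.0]]
```

The estimate is zero when both samples are identical and grows as the batches drift apart, which is why minimizing it pulls batch distributions together.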


Subject(s)
Deep Learning , Sequence Analysis, RNA/methods , Single-Cell Gene Expression Analysis , Single-Cell Analysis/methods , Exome Sequencing , Gene Expression Profiling/methods
13.
Nucleic Acids Res ; 49(D1): D523-D528, 2021 01 08.
Article in English | MEDLINE | ID: mdl-32941621

ABSTRACT

PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) Batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) Annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s) and protein sequence information, (iii) Links to external annotation pages for CAZymes (CAZy), transporters (UniProt) and other genes, (iv) Display of homologous gene clusters in GenBank sequences via integrated MultiGeneBlast tool and (v) An integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq).


Subject(s)
Bacteroidetes/genetics , Databases, Genetic , Enzymes/metabolism , Genetic Loci , Multigene Family , Polysaccharides/metabolism , Molecular Sequence Annotation , Substrate Specificity
14.
Nucleic Acids Res ; 49(D1): D622-D629, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33068435

ABSTRACT

CRISPR-Cas is an anti-viral mechanism of prokaryotes that has been widely adopted for genome editing. To make CRISPR-Cas genome editing more controllable and safer to use, anti-CRISPR proteins have been recently exploited to prevent excessive/prolonged Cas nuclease cleavage. Anti-CRISPR (Acr) proteins are encoded by (pro)phages/(pro)viruses, and have the ability to inhibit their host's CRISPR-Cas systems. We have built an online database AcrDB (http://bcb.unl.edu/AcrDB) by scanning ∼19 000 genomes of prokaryotes and viruses with AcrFinder, a recently developed Acr-Aca (Acr-associated regulator) operon prediction program. Proteins in Acr-Aca operons were further processed by two machine learning-based programs (AcRanker and PaCRISPR) to obtain numerical scores/ranks. Compared to other anti-CRISPR databases, AcrDB has the following unique features: (i) It is a genome-scale database with the largest collection of data (39 799 Acr-Aca operons containing Aca or Acr homologs); (ii) It offers a user-friendly web interface with various functions for browsing, graphically viewing, searching, and batch downloading Acr-Aca operons; (iii) It focuses on the genomic context of Acr and Aca candidates instead of individual Acr protein family and (iv) It collects data with three independent programs each having a unique data mining algorithm for cross validation. AcrDB will be a valuable resource to the anti-CRISPR research community.


Subject(s)
CRISPR-Cas Systems/genetics , Databases, Genetic , Operon/genetics , Prokaryotic Cells/metabolism , Viruses/metabolism , Internet
15.
Appl Environ Microbiol ; 88(3): e0185121, 2022 02 08.
Article in English | MEDLINE | ID: mdl-34851722

ABSTRACT

Dietary polyphenols can significantly benefit human health, but their bioavailability is metabolically controlled by human gut microbiota. To facilitate the study of polyphenol metabolism for human gut health, we have manually curated experimentally characterized polyphenol utilization proteins (PUPs) from published literature. This resulted in 60 experimentally characterized PUPs (named seeds) with various metadata, such as species and substrate. Further database search found 107,851 homologs of the seeds from UniProt and UHGP (unified human gastrointestinal protein) databases. All PUP seeds and homologs were classified into protein classes, families, and subfamilies based on Enzyme Commission (EC) numbers, Pfam (protein family) domains, and sequence similarity networks. By locating PUP homologs in the genomes of UHGP, we have identified 1,074 physically linked PUP gene clusters (PGCs), which are potentially involved in polyphenol metabolism in the human gut. The gut microbiome of Africans was consistently ranked the top in terms of the abundance and prevalence of PUP homologs and PGCs among all geographical continents. This reflects the fact that dietary polyphenols are consumed by the African population more commonly than by other populations, such as Europeans and North Americans. A case study of the Hadza hunter-gatherer microbiome verified the feasibility of using dbPUP to profile metagenomic data for biologically meaningful discovery, suggesting an association between diet and PUP abundance. A Pfam domain enrichment analysis of PGCs identified a number of putatively novel PUP families. Lastly, a user-friendly web interface (https://bcb.unl.edu/dbpup/) provides all the data online to facilitate the research of polyphenol metabolism for improved human health. IMPORTANCE Long-term consumption of polyphenol-rich foods has been shown to lower the risk of various human diseases, such as cardiovascular diseases, cancers, and metabolic diseases. 
Raw polyphenols are often enzymatically processed by gut microbiome, which contains various polyphenol utilization proteins (PUPs) to produce metabolites with much higher bioaccessibility to gastrointestinal cells. This study delivered dbPUP as an online database for experimentally characterized PUPs and their homologs in human gut microbiome. This work also performed a systematic classification of PUPs into enzyme classes, families, and subfamilies. The signature Pfam domains were identified for PUP families, enabling conserved domain-based PUP annotation. This standardized sequence similarity-based PUP classification system offered a guideline for the future inclusion of new experimentally characterized PUPs and the creation of new PUP families. An in-depth data analysis was further conducted on PUP homologs and physically linked PUP gene clusters (PGCs) in gut microbiomes of different human populations.


Subject(s)
Gastrointestinal Microbiome , Microbiota , Gastrointestinal Tract/metabolism , Humans , Metagenome , Polyphenols/metabolism
16.
Nucleic Acids Res ; 48(W1): W358-W365, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32402073

ABSTRACT

Anti-CRISPR (Acr) proteins encoded by (pro)phages/(pro)viruses have a great potential to enable a more controllable genome editing. However, genome mining new Acr proteins is challenging due to the lack of a conserved functional domain and the low sequence similarity among experimentally characterized Acr proteins. We introduce here AcrFinder, a web server (http://bcb.unl.edu/AcrFinder) that combines three well-accepted ideas used by previous experimental studies to pre-screen genomic data for Acr candidates. These ideas include homology search, guilt-by-association (GBA), and CRISPR-Cas self-targeting spacers. Compared to existing bioinformatics tools, AcrFinder has the following unique functions: (i) it is the first online server specifically mining genomes for Acr-Aca operons; (ii) it provides a most comprehensive Acr and Aca (Acr-associated regulator) database (populated by GBA-based Acr and Aca datasets); (iii) it combines homology-based, GBA-based, and self-targeting approaches in one software package; and (iv) it provides a user-friendly web interface to take both nucleotide and protein sequence files as inputs, and output a result page with graphic representation of the genomic contexts of Acr-Aca operons. The leave-one-out cross-validation on experimentally characterized Acr-Aca operons showed that AcrFinder had a 100% recall. AcrFinder will be a valuable web resource to help experimental microbiologists discover new Anti-CRISPRs.


Subject(s)
Bacteriophages/genetics , CRISPR-Cas Systems , Operon , Software , Viral Proteins/genetics , Databases, Genetic , Genome, Archaeal , Genome, Bacterial , Genomics/methods
17.
BMC Bioinformatics ; 22(1): 136, 2021 Mar 21.
Article in English | MEDLINE | ID: mdl-33745450

ABSTRACT

BACKGROUND: Numerous studies have demonstrated that long non-coding RNAs (lncRNAs) are associated with many human diseases. Therefore, it is crucial to predict potential lncRNA-disease associations for disease prognosis, diagnosis and therapy. Dozens of machine learning and deep learning algorithms have been applied to this problem, yet it is still challenging to learn efficient low-dimensional representations from high-dimensional features of lncRNAs and diseases to predict unknown lncRNA-disease associations accurately. RESULTS: We proposed an end-to-end model, VGAELDA, which integrates variational inference and graph autoencoders for lncRNA-disease association prediction. VGAELDA contains two kinds of graph autoencoders. Variational graph autoencoders (VGAE) infer representations from features of lncRNAs and diseases respectively, while graph autoencoders propagate labels via known lncRNA-disease associations. These two kinds of autoencoders are trained alternately using a variational expectation maximization algorithm. The integration of both the VGAE for graph representation learning and the alternate training via variational inference strengthens the capability of VGAELDA to capture efficient low-dimensional representations from high-dimensional features, and hence improves the robustness and precision of predicting unknown lncRNA-disease associations. Further analysis shows that the designed co-training framework of lncRNA and disease for VGAELDA solves a geometric matrix completion problem for capturing efficient low-dimensional representations via a deep learning approach. CONCLUSION: Cross validations and numerical experiments illustrate that VGAELDA outperforms the current state-of-the-art methods in lncRNA-disease association prediction. Case studies indicate that VGAELDA is capable of detecting potential lncRNA-disease associations. The source code and data are available at https://github.com/zhanglabNKU/VGAELDA.


Subject(s)
Computational Biology , RNA, Long Noncoding , Algorithms , Humans , Machine Learning , RNA, Long Noncoding/genetics , Software
18.
Bioinformatics ; 36(14): 4211-4213, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32386292

ABSTRACT

SUMMARY: We developed GDASC, a GPU-accelerated web version of our earlier DASC algorithm. It provides a user-friendly web interface for detecting batch factors. Building on the good performance of the DASC algorithm, it is able to give the most accurate results. For the two steps of DASC, data-adaptive shrinkage and semi-non-negative matrix factorization, we designed parallelization strategies for the convex clustering solution and the decomposition process. It runs more than 50 times faster than the original version on the representative RNA sequencing quality control dataset. With its accuracy and high speed, this server will be a useful tool for batch effects analysis. AVAILABILITY AND IMPLEMENTATION: http://bioinfo.nankai.edu.cn/gdasc.php. CONTACT: zhanghan@nankai.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Computers
19.
Bioinformatics ; 36(7): 2068-2075, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31794006

ABSTRACT

MOTIVATION: Carbohydrate-active enzymes (CAZymes) are extremely important to bioenergy, human gut microbiome, and plant pathogen research and industry. Here we developed a new amino acid k-mer-based CAZyme classification, motif identification and genome annotation tool using a bipartite network algorithm. Using this tool, we classified 390 CAZyme families into thousands of subfamilies, each with distinguishing k-mer peptides. These k-mers represented the characteristic motifs (in the form of a collection of conserved short peptides) of each subfamily, and thus were further used to annotate new genomes for CAZymes. This idea was also generalized to extract characteristic k-mer peptides for all the Swiss-Prot enzymes classified by the EC (enzyme commission) numbers and applied to enzyme EC prediction. RESULTS: This new tool was implemented as a Python package named eCAMI. Benchmark analysis of eCAMI against the state-of-the-art tools on CAZyme and enzyme EC datasets found that: (i) eCAMI has the best performance in terms of accuracy and memory use for CAZyme and enzyme EC classification and annotation; (ii) the k-mer-based tools (including PPR-Hotpep, CUPP and eCAMI) perform better than homology-based tools and deep-learning tools in enzyme EC prediction. Lastly, we confirmed that the k-mer-based tools have the unique ability to identify the characteristic k-mer peptides in the predicted enzymes. AVAILABILITY AND IMPLEMENTATION: https://github.com/yinlabniu/eCAMI and https://github.com/zhanglabNKU/eCAMI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
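
The notion of characteristic k-mer peptides can be illustrated with a short Python sketch that extracts amino acid k-mers and keeps those shared by every sequence in a group. This shows only the basic k-mer bookkeeping; it is not the eCAMI bipartite network algorithm, and the function names and toy sequences are assumptions made for illustration.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct length-k substrings (k-mers) of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs: Iterable[str], k: int) -> Set[str]:
    """k-mers present in every sequence of a group (hypothetical helper)."""
    sets = [kmers(s, k) for s in seqs]
    return set.intersection(*sets) if sets else set()


# Toy peptides (invented): both contain the conserved stretch "KTAY".
conserved = shared_kmers(["MKTAYL", "GGKTAYQ"], k=3)
```

Here `conserved` contains the two 3-mers `KTA` and `TAY`, the short peptides common to both toy sequences.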


Subject(s)
Algorithms , Software , Carbohydrates , Databases, Protein , Genome , Humans
20.
Phys Chem Chem Phys ; 23(46): 26070-26084, 2021 Dec 01.
Article in English | MEDLINE | ID: mdl-34787128

ABSTRACT

Zinc-bromine batteries (ZBBs) receive wide attention in distributed energy storage because of their high theoretical energy density and low cost. However, their large-scale application still faces some obstacles. Therefore, in-depth research on and advancement of the structure, electrolyte, anode, cathode and membrane are of great significance and urgency. Herein, we review past and present investigations on ZBBs, discuss the key problems and technical challenges, and propose perspectives for the future, with a focus on materials and chemistry. This perspective will provide valuable information for the further development of ZBBs.
