Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 3.791
Filter
Add more filters

Publication year range
1.
J Cell Sci ; 137(20)2024 10 15.
Article in English | MEDLINE | ID: mdl-38738286

ABSTRACT

Plant protoplasts provide starting material for of inducing pluripotent cell masses that are competent for tissue regeneration in vitro, analogous to animal induced pluripotent stem cells (iPSCs). Dedifferentiation is associated with large-scale chromatin reorganisation and massive transcriptome reprogramming, characterised by stochastic gene expression. How this cellular variability reflects on chromatin organisation in individual cells and what factors influence chromatin transitions during culturing are largely unknown. Here, we used high-throughput imaging and a custom supervised image analysis protocol extracting over 100 chromatin features of cultured protoplasts. The analysis revealed rapid, multiscale dynamics of chromatin patterns with a trajectory that strongly depended on nutrient availability. Decreased abundance in H1 (linker histones) is hallmark of chromatin transitions. We measured a high heterogeneity of chromatin patterns indicating intrinsic entropy as a hallmark of the initial cultures. We further measured an entropy decline over time, and an antagonistic influence by external and intrinsic factors, such as phytohormones and epigenetic modifiers, respectively. Collectively, our study benchmarks an approach to understand the variability and evolution of chromatin patterns underlying plant cell reprogramming in vitro.


Subject(s)
Chromatin , Entropy , Induced Pluripotent Stem Cells , Chromatin/metabolism , Chromatin/genetics , Induced Pluripotent Stem Cells/metabolism , Induced Pluripotent Stem Cells/cytology , Protoplasts/metabolism , Cellular Reprogramming/genetics , Histones/metabolism , Histones/genetics , Plant Cells/metabolism , Epigenesis, Genetic
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38605640

ABSTRACT

Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acids (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.


Subject(s)
RNA Splicing , Vertebrates , Animals , Humans , Base Sequence , Vertebrates/genetics , RNA , Supervised Machine Learning
3.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38982642

ABSTRACT

Inferring cell type proportions from bulk transcriptome data is crucial in immunology and oncology. Here, we introduce guided LDA deconvolution (GLDADec), a bulk deconvolution method that guides topics using cell type-specific marker gene names to estimate topic distributions for each sample. Through benchmarking using blood-derived datasets, we demonstrate its high estimation performance and robustness. Moreover, we apply GLDADec to heterogeneous tissue bulk data and perform comprehensive cell type analysis in a data-driven manner. We show that GLDADec outperforms existing methods in estimation performance and evaluate its biological interpretability by examining enrichment of biological processes for topics. Finally, we apply GLDADec to The Cancer Genome Atlas tumor samples, enabling subtype stratification and survival analysis based on estimated cell type proportions, thus proving its practical utility in clinical settings. This approach, utilizing marker gene names as partial prior information, can be applied to various scenarios for bulk data deconvolution. GLDADec is available as an open-source Python package at https://github.com/mizuno-group/GLDADec.


Subject(s)
Software , Humans , Gene Expression Profiling/methods , Algorithms , Transcriptome , Computational Biology/methods , Neoplasms/genetics , Biomarkers, Tumor/genetics , Genetic Markers
4.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38985929

ABSTRACT

Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple 'omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As non-integrative control, random forest was performed on concatenated and separated data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed better or equally well than their non-integrative counterpart. By contrast, DIABLO and the four random forest alternatives outperform the others across the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail as well as guidelines for future applications.


Subject(s)
Computational Biology , Humans , Computational Biology/methods , Algorithms , Genomics/methods , Genomics/statistics & numerical data , Multiomics
5.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38801702

ABSTRACT

Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are usually limited in many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on one modality of molecular data, and the complementary information of two important modalities, SMILES and graph, is not fully explored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graph. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained by a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between these two modalities. Experimental results show that our framework achieves state-of-the-art performance in a series of molecular property prediction tasks, and a detailed ablation study demonstrates efficacy of the multi-modality framework and the masking strategy.


Subject(s)
Supervised Machine Learning , Algorithms , Computational Biology/methods
6.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38426327

ABSTRACT

Cluster assignment is vital to analyzing single-cell RNA sequencing (scRNA-seq) data to understand high-level biological processes. Deep learning-based clustering methods have recently been widely used in scRNA-seq data analysis. However, existing deep models often overlook the interconnections and interactions among network layers, leading to the loss of structural information within the network layers. Herein, we develop a new self-supervised clustering method based on an adaptive multi-scale autoencoder, called scAMAC. The self-supervised clustering network utilizes the Multi-Scale Attention mechanism to fuse the feature information from the encoder, hidden and decoder layers of the multi-scale autoencoder, which enables the exploration of cellular correlations within the same scale and captures deep features across different scales. The self-supervised clustering network calculates the membership matrix using the fused latent features and optimizes the clustering network based on the membership matrix. scAMAC employs an adaptive feedback mechanism to supervise the parameter updates of the multi-scale autoencoder, obtaining a more effective representation of cell features. scAMAC not only enables cell clustering but also performs data reconstruction through the decoding layer. Through extensive experiments, we demonstrate that scAMAC is superior to several advanced clustering and imputation methods in both data clustering and reconstruction. In addition, scAMAC is beneficial for downstream analysis, such as cell trajectory inference. Our scAMAC model codes are freely available at https://github.com/yancy2024/scAMAC.


Subject(s)
Data Analysis , Single-Cell Gene Expression Analysis , Cluster Analysis , Sequence Analysis, RNA , Gene Expression Profiling , Algorithms
7.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38426320

ABSTRACT

Protein subcellular localization (PSL) is very important in order to understand its functions, and its movement between subcellular niches within cells plays fundamental roles in biological process regulation. Mass spectrometry-based spatio-temporal proteomics technologies can help provide new insights of protein translocation, but bring the challenge in identifying reliable protein translocation events due to the noise interference and insufficient data mining. We propose a semi-supervised graph convolution network (GCN)-based framework termed TransGCN that infers protein translocation events from spatio-temporal proteomics. Based on expanded multiple distance features and joint graph representations of proteins, TransGCN utilizes the semi-supervised GCN to enable effective knowledge transfer from proteins with known PSLs for predicting protein localization and translocation. Our results demonstrate that TransGCN outperforms current state-of-the-art methods in identifying protein translocations, especially in coping with batch effects. It also exhibited excellent predictive accuracy in PSL prediction. TransGCN is freely available on GitHub at https://github.com/XuejiangGuo/TransGCN.


Subject(s)
Coping Skills , Proteomics , Data Mining , Mass Spectrometry , Protein Transport
8.
Proc Natl Acad Sci U S A ; 120(1): e2214972120, 2023 01 03.
Article in English | MEDLINE | ID: mdl-36580592

ABSTRACT

Regression learning is one of the long-standing problems in statistics, machine learning, and deep learning (DL). We show that writing this problem as a probabilistic expectation over (unknown) feature probabilities - thus increasing the number of unknown parameters and seemingly making the problem more complex-actually leads to its simplification, and allows incorporating the physical principle of entropy maximization. It helps decompose a very general setting of this learning problem (including discretization, feature selection, and learning multiple piece-wise linear regressions) into an iterative sequence of simple substeps, which are either analytically solvable or cheaply computable through an efficient second-order numerical solver with a sublinear cost scaling. This leads to the computationally cheap and robust non-DL second-order Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, that can be efficiently applied to problems with millions of feature dimensions on a commodity laptop, when the state-of-the-art learning tools would require supercomputers. SPARTAn is compared to a range of commonly used regression learning tools on synthetic problems and on the prediction of the El Niño Southern Oscillation, the dominant interannual mode of tropical climate variability. The obtained SPARTAn learners provide more predictive, sparse, and physically explainable data descriptions, clearly discerning the important role of ocean temperature variability at the thermocline in the equatorial Pacific. SPARTAn provides an easily interpretable description of the timescales by which these thermocline temperature features evolve and eventually express at the surface, thereby enabling enhanced predictability of the key drivers of the interannual climate.


Subject(s)
El Nino-Southern Oscillation , Tropical Climate , Entropy , Temperature , Algorithms
9.
Proc Natl Acad Sci U S A ; 120(32): e2300558120, 2023 08 08.
Article in English | MEDLINE | ID: mdl-37523562

ABSTRACT

While sensory representations in the brain depend on context, it remains unclear how such modulations are implemented at the biophysical level, and how processing layers further in the hierarchy can extract useful features for each possible contextual state. Here, we demonstrate that dendritic N-Methyl-D-Aspartate spikes can, within physiological constraints, implement contextual modulation of feedforward processing. Such neuron-specific modulations exploit prior knowledge, encoded in stable feedforward weights, to achieve transfer learning across contexts. In a network of biophysically realistic neuron models with context-independent feedforward weights, we show that modulatory inputs to dendritic branches can solve linearly nonseparable learning problems with a Hebbian, error-modulated learning rule. We also demonstrate that local prediction of whether representations originate either from different inputs, or from different contextual modulations of the same input, results in representation learning of hierarchical feedforward weights across processing layers that accommodate a multitude of contexts.


Subject(s)
Models, Neurological , N-Methylaspartate , Learning/physiology , Neurons/physiology , Perception
10.
Circulation ; 149(24): e1313-e1410, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38743805

ABSTRACT

AIM: The "2024 ACC/AHA/AACVPR/APMA/ABC/SCAI/SVM/SVN/SVS/SIR/VESS Guideline for the Management of Lower Extremity Peripheral Artery Disease" provides recommendations to guide clinicians in the treatment of patients with lower extremity peripheral artery disease across its multiple clinical presentation subsets (ie, asymptomatic, chronic symptomatic, chronic limb-threatening ischemia, and acute limb ischemia). METHODS: A comprehensive literature search was conducted from October 2020 to June 2022, encompassing studies, reviews, and other evidence conducted on human subjects that was published in English from PubMed, EMBASE, the Cochrane Library, CINHL Complete, and other selected databases relevant to this guideline. Additional relevant studies, published through May 2023 during the peer review process, were also considered by the writing committee and added to the evidence tables where appropriate. STRUCTURE: Recommendations from the "2016 AHA/ACC Guideline on the Management of Patients With Lower Extremity Peripheral Artery Disease" have been updated with new evidence to guide clinicians. In addition, new recommendations addressing comprehensive care for patients with peripheral artery disease have been developed.


Subject(s)
American Heart Association , Lower Extremity , Peripheral Arterial Disease , Humans , Peripheral Arterial Disease/therapy , Peripheral Arterial Disease/diagnosis , Lower Extremity/blood supply , United States , Cardiology/standards
11.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37769630

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a widely used technique for characterizing individual cells and studying gene expression at the single-cell level. Clustering plays a vital role in grouping similar cells together for various downstream analyses. However, the high sparsity and dimensionality of large scRNA-seq data pose challenges to clustering performance. Although several deep learning-based clustering algorithms have been proposed, most existing clustering methods have limitations in capturing the precise distribution types of the data or fully utilizing the relationships between cells, leaving a considerable scope for improving the clustering performance, particularly in detecting rare cell populations from large scRNA-seq data. We introduce DeepScena, a novel single-cell hierarchical clustering tool that fully incorporates nonlinear dimension reduction, negative binomial-based convolutional autoencoder for data fitting, and a self-supervision model for cell similarity enhancement. In comprehensive evaluation using multiple large-scale scRNA-seq datasets, DeepScena consistently outperformed seven popular clustering tools in terms of accuracy. Notably, DeepScena exhibits high proficiency in identifying rare cell populations within large datasets that contain large numbers of clusters. When applied to scRNA-seq data of multiple myeloma cells, DeepScena successfully identified not only previously labeled large cell types but also subpopulations in CD14 monocytes, T cells and natural killer cells, respectively.


Subject(s)
Single-Cell Analysis , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis , Gene Expression Profiling/methods
12.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36653906

ABSTRACT

Spatially resolved transcriptomics technologies enable comprehensive measurement of gene expression patterns in the context of intact tissues. However, existing technologies suffer from either low resolution or shallow sequencing depth. Here, we present DIST, a deep learning-based method that imputes the gene expression profiles on unmeasured locations and enhances the gene expression for both original measured spots and imputed spots by self-supervised learning and transfer learning. We evaluate the performance of DIST for imputation, clustering, differential expression analysis and functional enrichment analysis. The results show that DIST can impute the gene expression accurately, enhance the gene expression for low-quality data, help detect more biological meaningful differentially expressed genes and pathways, therefore allow for deeper insights into the biological processes.


Subject(s)
Deep Learning , Transcriptome , Gene Expression Profiling/methods , Cluster Analysis
13.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36790856

ABSTRACT

Potential miRNA-disease associations (MDA) play an important role in the discovery of complex human disease etiology. Therefore, MDA prediction is an attractive research topic in the field of biomedical machine learning. Recently, several models have been proposed for this task, but their performance limited by over-reliance on relevant network information with noisy graph structure connections. However, the application of self-supervised graph structure learning to MDA tasks remains unexplored. Our study is the first to use multi-view self-supervised contrastive learning (MSGCL) for MDA prediction. Specifically, we generated a learner view without association labels of miRNAs and diseases as input, and utilized the known association network to generate an anchor view that provides guiding signals for the learner view. The graph structure was optimized by designing a contrastive loss to maximize the consistency between the anchor and learner views. Our model is similar to a pre-trained model that continuously optimizes upstream tasks for high-quality association graph topology, thereby enhancing the latent representation of association predictions. The experimental results show that our proposed method outperforms state-of-the-art methods by 2.79$\%$ and 3.20$\%$ in area under the receiver operating characteristic curve (AUC) and area under the precision/recall curve (AUPR), respectively.


Subject(s)
Machine Learning , MicroRNAs , Humans , Area Under Curve , MicroRNAs/genetics , ROC Curve
14.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36869836

ABSTRACT

The rapid development of single-cell RNA sequencing (scRNA-seq) technology allows us to study gene expression heterogeneity at the cellular level. Cell annotation is the basis for subsequent downstream analysis in single-cell data mining. As more and more well-annotated scRNA-seq reference data become available, many automatic annotation methods have sprung up in order to simplify the cell annotation process on unlabeled target data. However, existing methods rarely explore the fine-grained semantic knowledge of novel cell types absent from the reference data, and they are usually susceptible to batch effects on the classification of seen cell types. Taking into consideration the limitations above, this paper proposes a new and practical task called generalized cell type annotation and discovery for scRNA-seq data whereby target cells are labeled with either seen cell types or cluster labels, instead of a unified 'unassigned' label. To accomplish this, we carefully design a comprehensive evaluation benchmark and propose a novel end-to-end algorithmic framework called scGAD. Specifically, scGAD first builds the intrinsic correspondences on seen and novel cell types by retrieving geometrically and semantically mutual nearest neighbors as anchor pairs. Together with the similarity affinity score, a soft anchor-based self-supervised learning module is then designed to transfer the known label information from reference data to target data and aggregate the new semantic knowledge within target data in the prediction space. To enhance the inter-type separation and intra-type compactness, we further propose a confidential prototype self-supervised learning paradigm to implicitly capture the global topological structure of cells in the embedding space. Such a bidirectional dual alignment mechanism between embedding space and prediction space can better handle batch effect and cell type shift. Extensive results on massive simulation datasets and real datasets demonstrate the superiority of scGAD over various state-of-the-art clustering and annotation methods. We also implement marker gene identification to validate the effectiveness of scGAD in clustering novel cell types and their biological significance. To the best of our knowledge, we are the first to introduce this new and practical task and propose an end-to-end algorithmic framework to solve it. Our method scGAD is implemented in Python using the Pytorch machine-learning library, and it is freely available at https://github.com/aimeeyaoyao/scGAD.


Subject(s)
Algorithms , Gene Expression Profiling , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Computer Simulation , Cluster Analysis , Sequence Analysis, RNA/methods
15.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36464489

ABSTRACT

Viruses are the most ubiquitous and diverse entities in the biome. Due to the rapid growth of newly identified viruses, there is an urgent need for accurate and comprehensive virus classification, particularly for novel viruses. Here, we present PhaGCN2, which can rapidly classify the taxonomy of viral sequences at the family level and supports the visualization of the associations of all families. We evaluate the performance of PhaGCN2 and compare it with the state-of-the-art virus classification tools, such as vConTACT2, CAT and VPF-Class, using the widely accepted metrics. The results show that PhaGCN2 largely improves the precision and recall of virus classification, increases the number of classifiable virus sequences in the Global Ocean Virome dataset (v2.0) by four times and classifies more than 90% of the Gut Phage Database. PhaGCN2 makes it possible to conduct high-throughput and automatic expansion of the database of the International Committee on Taxonomy of Viruses. The source code is freely available at https://github.com/KennthShang/PhaGCN2.0.


Subject(s)
Viruses , Viruses/genetics , Genome, Viral , Databases, Factual , Software , Genomics
16.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37771003

ABSTRACT

A microbial community maintains its ecological dynamics via metabolite crosstalk. Hence, knowledge of the metabolome, alongside its populace, would help us understand the functionality of a community and also predict how it will change in atypical conditions. Methods that employ low-cost metagenomic sequencing data can predict the metabolic potential of a community, that is, its ability to produce or utilize specific metabolites. These, in turn, can potentially serve as markers of biochemical pathways that are associated with different communities. We developed MMIP (Microbiome Metabolome Integration Platform), a web-based analytical and predictive tool that can be used to compare the taxonomic content, diversity variation and the metabolic potential between two sets of microbial communities from targeted amplicon sequencing data. MMIP is capable of highlighting statistically significant taxonomic, enzymatic and metabolic attributes as well as learning-based features associated with one group in comparison with another. Furthermore, MMIP can predict linkages among species or groups of microbes in the community, specific enzyme profiles, compounds or metabolites associated with such a group of organisms. With MMIP, we aim to provide a user-friendly, online web server for performing key microbiome-associated analyses of targeted amplicon sequencing data, predicting metabolite signature, and using learning-based linkage analysis, without the need for initial metabolomic analysis, and thereby helping in hypothesis generation.


Subject(s)
Metabolome , Microbiota , Metabolomics/methods , Internet
17.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36592051

ABSTRACT

MOTIVATION: Molecular property prediction is a significant requirement in AI-driven drug design and discovery, aiming to predict the molecular property information (e.g. toxicity) based on the mined biomolecular knowledge. Although graph neural networks have been proven powerful in predicting molecular property, unbalanced labeled data and poor generalization capability for new-synthesized molecules are always key issues that hinder further improvement of molecular encoding performance. RESULTS: We propose a novel self-supervised representation learning scheme based on a Cascaded Attention Network and Graph Contrastive Learning (CasANGCL). We design a new graph network variant, designated as cascaded attention network, to encode local-global molecular representations. We construct a two-stage contrast predictor framework to tackle the label imbalance problem of training molecular samples, which is an integrated end-to-end learning scheme. Moreover, we utilize the information-flow scheme for training our network, which explicitly captures the edge information in the node/graph representations and obtains more fine-grained knowledge. Our model achieves an 81.9% ROC-AUC average performance on 661 tasks from seven challenging benchmarks, showing better portability and generalizations. Further visualization studies indicate our model's better representation capacity and provide interpretability.


Subject(s)
Benchmarking , Learning , Drug Design , Neural Networks, Computer
18.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37088980

ABSTRACT

Immunofluorescence patterns of anti-nuclear antibodies (ANAs) on human epithelial cell (HEp-2) substrates are important biomarkers for the diagnosis of autoimmune diseases. There are growing clinical requirements for an automatic readout and classification of ANA immunofluorescence patterns for HEp-2 images following the taxonomy recommended by the International Consensus on Antinuclear Antibody Patterns (ICAP). In this study, a comprehensive collection of HEp-2 specimen images covering a broad range of ANA patterns was established and manually annotated by experienced laboratory experts. By utilizing a supervised learning methodology, an automatic immunofluorescence pattern classification framework for HEp-2 specimen images was developed. The framework consists of a module for HEp-2 cell detection and cell-level feature extraction, followed by an image-level classifier that is capable of recognizing all 14 classes of ANA immunofluorescence patterns as recommended by ICAP. Performance analysis indicated an accuracy of 92.05% on the validation dataset and 87% on an independent test dataset, which has surpassed the performance of human examiners on the same test dataset. The proposed framework is expected to contribute to the automatic ANA pattern recognition in clinical laboratories to facilitate efficient and precise diagnosis of autoimmune diseases.


Subject(s)
Antibodies, Antinuclear , Autoimmune Diseases , Humans , Fluorescent Antibody Technique , Antibodies, Antinuclear/analysis , Autoimmune Diseases/diagnosis , Epithelial Cells , Supervised Machine Learning
19.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37930026

ABSTRACT

Artificial intelligence-based molecular property prediction plays a key role in molecular design such as bioactive molecules and functional materials. In this study, we propose a self-supervised pretraining deep learning (DL) framework, called functional group bidirectional encoder representations from transformers (FG-BERT), pertained based on ~1.45 million unlabeled drug-like molecules, to learn meaningful representation of molecules from function groups. The pretrained FG-BERT framework can be fine-tuned to predict molecular properties. Compared to state-of-the-art (SOTA) machine learning and DL methods, we demonstrate the high performance of FG-BERT in evaluating molecular properties in tasks involving physical chemistry, biophysics and physiology across 44 benchmark datasets. In addition, FG-BERT utilizes attention mechanisms to focus on FG features that are critical to the target properties, thereby providing excellent interpretability for downstream training tasks. Collectively, FG-BERT does not require any artificially crafted features as input and has excellent interpretability, providing an out-of-the-box framework for developing SOTA models for a variety of molecule (especially for drug) discovery tasks.


Subject(s)
Algorithms , Artificial Intelligence , Benchmarking , Machine Learning
20.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37950905

ABSTRACT

Cancer genomics is dedicated to elucidating the genes and pathways that contribute to cancer progression and development. Identifying cancer genes (CGs) associated with the initiation and progression of cancer is critical for characterization of molecular-level mechanism in cancer research. In recent years, the growing availability of high-throughput molecular data and advancements in deep learning technologies has enabled the modelling of complex interactions and topological information within genomic data. Nevertheless, because of the limited labelled data, pinpointing CGs from a multitude of potential mutations remains an exceptionally challenging task. To address this, we propose a novel deep learning framework, termed self-supervised masked graph learning (SMG), which comprises SMG reconstruction (pretext task) and task-specific fine-tuning (downstream task). In the pretext task, the nodes of multi-omic featured protein-protein interaction (PPI) networks are randomly substituted with a defined mask token. The PPI networks are then reconstructed using the graph neural network (GNN)-based autoencoder, which explores the node correlations in a self-prediction manner. In the downstream tasks, the pre-trained GNN encoder embeds the input networks into feature graphs, whereas a task-specific layer proceeds with the final prediction. To assess the performance of the proposed SMG method, benchmarking experiments are performed on three node-level tasks (identification of CGs, essential genes and healthy driver genes) and one graph-level task (identification of disease subnetwork) across eight PPI networks. Benchmarking experiments and performance comparison with existing state-of-the-art methods demonstrate the superiority of SMG on multi-omic feature engineering.


Subject(s)
Neoplasms , Oncogenes , Mutation , Benchmarking , Genes, Essential , Genomics , Neoplasms/genetics
SELECTION OF CITATIONS
SEARCH DETAIL