Search | BVS Violência e Saúde

1.

Semi-supervised machine learning method for predicting homogeneous ancestry groups to assess Hardy-Weinberg equilibrium in diverse whole-genome sequencing studies.

Shyr, Derek; Dey, Rounak; Li, Xihao; Zhou, Hufeng; Boerwinkle, Eric; Buyske, Steve; Daly, Mark; Gibbs, Richard A; Hall, Ira; Matise, Tara; Reeves, Catherine; Stitziel, Nathan O; Zody, Michael; Neale, Benjamin M; Lin, Xihong.

Am J Hum Genet ; 2024 Sep 04.

Article in English | MEDLINE | ID: mdl-39270648

ABSTRACT

Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity for genetic research. Before performing association analyses, assessing Hardy-Weinberg equilibrium (HWE) is a crucial step in quality control procedures to remove low quality variants and ensure valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Therefore, directly applying these to the whole dataset can yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries and the results aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries based on the genotype. This method provides a convenient tool for estimating HWE in the presence of population structure and missing self-reported race and ethnicities in diverse WGS studies. In addition, assessing HWE within the homogeneous ancestries provides reliable HWE estimates that will directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method on the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
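The point about heterogeneous samples can be illustrated with a small sketch. It is not the authors' pipeline: the genotype counts and group names below are hypothetical, and a plain 1-degree-of-freedom chi-square goodness-of-fit test stands in for whichever HWE test a study actually uses. Two groups that are each in equilibrium, but with different allele frequencies, look strongly out of equilibrium once pooled.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical genotype counts (AA, AB, BB) for two predicted ancestry groups.
groups = {
    "group_1": (810, 180, 10),  # allele A frequency 0.9, in equilibrium
    "group_2": (10, 180, 810),  # allele A frequency 0.1, in equilibrium
}

# Test within each homogeneous group, then on the pooled sample for contrast.
per_group = {name: hwe_chi2_pvalue(*counts) for name, counts in groups.items()}
pooled_counts = tuple(sum(c) for c in zip(*groups.values()))
pooled = hwe_chi2_pvalue(*pooled_counts)

print(per_group)  # both p-values close to 1
print(pooled)     # extremely small p-value caused only by population structure
```

How the per-group results are aggregated at each variant is not specified in the abstract, so the sketch stops at the per-group p-values.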

2.

Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction.

Chen, Ken; Zhou, Yue; Ding, Maolin; Wang, Yu; Ren, Zhixiang; Yang, Yuedong.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38605640

ABSTRACT

Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
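As a rough illustration of the masked language modeling objective mentioned above, the sketch below hides a random subset of nucleotides and keeps the originals as prediction targets. It is not SpliceBERT's code: the single-nucleotide tokenization, the `[MASK]` token string, the 15% masking rate (the usual BERT convention) and the example sequence are all assumptions.

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def mask_sequence(seq, mask_prob=0.15, seed=0):
    """Replace a random subset of nucleotides with MASK.

    Returns (tokens, labels): labels hold the original nucleotide at masked
    positions and None elsewhere, so a model is scored only where it had to guess.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    labels = [None] * len(tokens)
    for i, nucleotide in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = nucleotide
            tokens[i] = MASK
    return tokens, labels

seq = "AUGGUAAGUCCUAGGCAUCG"  # made-up primary RNA fragment
tokens, labels = mask_sequence(seq)
print(tokens)
print(labels)
```

During pretraining, a model would receive `tokens` and be trained to recover the nucleotides stored in `labels`.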

Subjects

RNA Splicing , Vertebrates , Animals , Humans , Base Sequence , Vertebrates/genetics , RNA , Supervised Machine Learning

3.

GLDADec: marker-gene guided LDA modeling for bulk gene expression deconvolution.

Azuma, Iori; Mizuno, Tadahaya; Kusuhara, Hiroyuki.

Brief Bioinform ; 25(4), 2024 May 23.

Article in English | MEDLINE | ID: mdl-38982642

ABSTRACT

Inferring cell type proportions from bulk transcriptome data is crucial in immunology and oncology. Here, we introduce guided LDA deconvolution (GLDADec), a bulk deconvolution method that guides topics using cell type-specific marker gene names to estimate topic distributions for each sample. Through benchmarking using blood-derived datasets, we demonstrate its high estimation performance and robustness. Moreover, we apply GLDADec to heterogeneous tissue bulk data and perform comprehensive cell type analysis in a data-driven manner. We show that GLDADec outperforms existing methods in estimation performance and evaluate its biological interpretability by examining enrichment of biological processes for topics. Finally, we apply GLDADec to The Cancer Genome Atlas tumor samples, enabling subtype stratification and survival analysis based on estimated cell type proportions, thus proving its practical utility in clinical settings. This approach, utilizing marker gene names as partial prior information, can be applied to various scenarios for bulk data deconvolution. GLDADec is available as an open-source Python package at https://github.com/mizuno-group/GLDADec.

Subjects

Software , Humans , Gene Expression Profiling/methods , Algorithms , Transcriptome , Computational Biology/methods , Neoplasms/genetics , Tumor Biomarkers/genetics , Genetic Markers

4.

Complementary multi-modality molecular self-supervised learning via non-overlapping masking for property prediction.

Shen, Ao; Yuan, Mingzhi; Ma, Yingfan; Du, Jie; Wang, Manning.

Brief Bioinform ; 25(4), 2024 May 23.

Article in English | MEDLINE | ID: mdl-38801702

ABSTRACT

Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are usually limited in many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on one modality of molecular data, and the complementary information of two important modalities, SMILES and graph, is not fully explored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graph. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained by a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between these two modalities. Experimental results show that our framework achieves state-of-the-art performance in a series of molecular property prediction tasks, and a detailed ablation study demonstrates efficacy of the multi-modality framework and the masking strategy.
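One way to read the non-overlapping masking idea is that an atom hidden in one modality stays visible in the other, so each view can help reconstruct what the other is missing. The sketch below is only that reading, not the authors' implementation: it assumes SMILES tokens and graph nodes are aligned atom by atom, and the function name and masking ratio are hypothetical.

```python
import random

def non_overlapping_masks(num_atoms, mask_ratio=0.25, seed=0):
    """Pick two disjoint sets of atom indices, one to mask per modality."""
    if not 0.0 <= mask_ratio <= 0.5:
        raise ValueError("mask_ratio must be in [0, 0.5] for disjoint masks")
    rng = random.Random(seed)
    k = int(num_atoms * mask_ratio)
    order = list(range(num_atoms))
    rng.shuffle(order)
    smiles_masked = set(order[:k])       # atoms hidden in the SMILES view
    graph_masked = set(order[k:2 * k])   # atoms hidden in the graph view
    return smiles_masked, graph_masked

smiles_masked, graph_masked = non_overlapping_masks(num_atoms=12)
print(sorted(smiles_masked), sorted(graph_masked))
```

Because both sets are sliced from the same shuffled order, no atom is masked in both views.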

Subjects

Supervised Machine Learning , Algorithms , Computational Biology/methods

5.

TransGCN: a semi-supervised graph convolution network-based framework to infer protein translocations in spatio-temporal proteomics.

Wang, Bing; Zhang, Xiangzheng; Han, Xudong; Hao, Bingjie; Li, Yan; Guo, Xuejiang.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38426320

ABSTRACT

Knowledge of a protein's subcellular localization (PSL) is essential for understanding its functions, and protein movement between subcellular niches plays fundamental roles in the regulation of biological processes. Mass spectrometry-based spatio-temporal proteomics technologies can provide new insights into protein translocation, but identifying reliable protein translocation events remains challenging because of noise interference and insufficient data mining. We propose a semi-supervised graph convolution network (GCN)-based framework termed TransGCN that infers protein translocation events from spatio-temporal proteomics. Based on expanded multiple distance features and joint graph representations of proteins, TransGCN utilizes the semi-supervised GCN to enable effective knowledge transfer from proteins with known PSLs for predicting protein localization and translocation. Our results demonstrate that TransGCN outperforms current state-of-the-art methods in identifying protein translocations, especially in coping with batch effects. It also exhibited excellent predictive accuracy in PSL prediction. TransGCN is freely available on GitHub at https://github.com/XuejiangGuo/TransGCN.

Subjects

Coping Skills , Proteomics , Data Mining , Mass Spectrometry , Protein Transport

6.

On cheap entropy-sparsified regression learning.

Horenko, Illia; Vecchi, Edoardo; Kardos, Juraj; Wächter, Andreas; Schenk, Olaf; O'Kane, Terence J; Gagliardini, Patrick; Gerber, Susanne.

Proc Natl Acad Sci U S A ; 120(1): e2214972120, 2023 01 03.

Article in English | MEDLINE | ID: mdl-36580592

ABSTRACT

Regression learning is one of the long-standing problems in statistics, machine learning, and deep learning (DL). We show that writing this problem as a probabilistic expectation over (unknown) feature probabilities, which increases the number of unknown parameters and seemingly makes the problem more complex, actually leads to its simplification and allows incorporating the physical principle of entropy maximization. It helps decompose a very general setting of this learning problem (including discretization, feature selection, and learning multiple piece-wise linear regressions) into an iterative sequence of simple substeps, which are either analytically solvable or cheaply computable through an efficient second-order numerical solver with a sublinear cost scaling. This leads to the computationally cheap and robust non-DL second-order Sparse Probabilistic Approximation for Regression Task Analysis (SPARTAn) algorithm, which can be efficiently applied to problems with millions of feature dimensions on a commodity laptop, where state-of-the-art learning tools would require supercomputers. SPARTAn is compared to a range of commonly used regression learning tools on synthetic problems and on the prediction of the El Niño Southern Oscillation, the dominant interannual mode of tropical climate variability. The obtained SPARTAn learners provide more predictive, sparse, and physically explainable data descriptions, clearly discerning the important role of ocean temperature variability at the thermocline in the equatorial Pacific. SPARTAn provides an easily interpretable description of the timescales by which these thermocline temperature features evolve and eventually express at the surface, thereby enabling enhanced predictability of the key drivers of the interannual climate.

Subjects

El Niño-Southern Oscillation , Tropical Climate , Entropy , Temperature , Algorithms

7.

NMDA-driven dendritic modulation enables multitask representation learning in hierarchical sensory processing pathways.

Wybo, Willem A M; Tsai, Matthias C; Tran, Viet Anh Khoa; Illing, Bernd; Jordan, Jakob; Morrison, Abigail; Senn, Walter.

Proc Natl Acad Sci U S A ; 120(32): e2300558120, 2023 08 08.

Article in English | MEDLINE | ID: mdl-37523562

ABSTRACT

While sensory representations in the brain depend on context, it remains unclear how such modulations are implemented at the biophysical level, and how processing layers further in the hierarchy can extract useful features for each possible contextual state. Here, we demonstrate that dendritic N-Methyl-D-Aspartate spikes can, within physiological constraints, implement contextual modulation of feedforward processing. Such neuron-specific modulations exploit prior knowledge, encoded in stable feedforward weights, to achieve transfer learning across contexts. In a network of biophysically realistic neuron models with context-independent feedforward weights, we show that modulatory inputs to dendritic branches can solve linearly nonseparable learning problems with a Hebbian, error-modulated learning rule. We also demonstrate that local prediction of whether representations originate either from different inputs, or from different contextual modulations of the same input, results in representation learning of hierarchical feedforward weights across processing layers that accommodate a multitude of contexts.

Subjects

Neurological Models , N-Methylaspartate , Learning/physiology , Neurons/physiology , Perception

8.

DIST: spatial transcriptomics enhancement using deep learning.

Zhao, Yanping; Wang, Kui; Hu, Gang.

Brief Bioinform ; 24(2), 2023 03 19.

Article in English | MEDLINE | ID: mdl-36653906

ABSTRACT

Spatially resolved transcriptomics technologies enable comprehensive measurement of gene expression patterns in the context of intact tissues. However, existing technologies suffer from either low resolution or shallow sequencing depth. Here, we present DIST, a deep learning-based method that imputes the gene expression profiles on unmeasured locations and enhances the gene expression for both original measured spots and imputed spots by self-supervised learning and transfer learning. We evaluate the performance of DIST for imputation, clustering, differential expression analysis and functional enrichment analysis. The results show that DIST can impute the gene expression accurately, enhance the gene expression for low-quality data, and help detect more biologically meaningful differentially expressed genes and pathways, thereby allowing for deeper insights into the biological processes.

Subjects

Deep Learning , Transcriptome , Gene Expression Profiling/methods , Cluster Analysis

9.

scGAD: a new task and end-to-end framework for generalized cell type annotation and discovery.

Zhai, Yuyao; Chen, Liang; Deng, Minghua.

Brief Bioinform ; 24(2), 2023 03 19.

Article in English | MEDLINE | ID: mdl-36869836

ABSTRACT

The rapid development of single-cell RNA sequencing (scRNA-seq) technology allows us to study gene expression heterogeneity at the cellular level. Cell annotation is the basis for subsequent downstream analysis in single-cell data mining. As more and more well-annotated scRNA-seq reference data become available, many automatic annotation methods have sprung up in order to simplify the cell annotation process on unlabeled target data. However, existing methods rarely explore the fine-grained semantic knowledge of novel cell types absent from the reference data, and they are usually susceptible to batch effects on the classification of seen cell types. Taking into consideration the limitations above, this paper proposes a new and practical task called generalized cell type annotation and discovery for scRNA-seq data whereby target cells are labeled with either seen cell types or cluster labels, instead of a unified 'unassigned' label. To accomplish this, we carefully design a comprehensive evaluation benchmark and propose a novel end-to-end algorithmic framework called scGAD. Specifically, scGAD first builds the intrinsic correspondences on seen and novel cell types by retrieving geometrically and semantically mutual nearest neighbors as anchor pairs. Together with the similarity affinity score, a soft anchor-based self-supervised learning module is then designed to transfer the known label information from reference data to target data and aggregate the new semantic knowledge within target data in the prediction space. To enhance the inter-type separation and intra-type compactness, we further propose a confidential prototype self-supervised learning paradigm to implicitly capture the global topological structure of cells in the embedding space. Such a bidirectional dual alignment mechanism between embedding space and prediction space can better handle batch effect and cell type shift. 
Extensive results on massive simulation datasets and real datasets demonstrate the superiority of scGAD over various state-of-the-art clustering and annotation methods. We also implement marker gene identification to validate the effectiveness of scGAD in clustering novel cell types and their biological significance. To the best of our knowledge, we are the first to introduce this new and practical task and propose an end-to-end algorithmic framework to solve it. Our method scGAD is implemented in Python using the PyTorch machine-learning library, and it is freely available at https://github.com/aimeeyaoyao/scGAD.
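The anchor-pair step can be pictured with a minimal mutual-nearest-neighbor sketch. This is not scGAD's implementation, which combines geometric and semantic neighbors in learned spaces. Here the embeddings are made-up 2-D points, plain Euclidean distance is used, and only a single nearest neighbor is considered on each side.

```python
import math

def nearest(query, candidates):
    """Index of the candidate closest to query (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))

def mutual_nearest_pairs(reference, target):
    """Pairs (i, j) where reference[i] and target[j] are each other's nearest neighbor."""
    ref_to_tgt = [nearest(r, target) for r in reference]
    tgt_to_ref = [nearest(t, reference) for t in target]
    return [(i, j) for i, j in enumerate(ref_to_tgt) if tgt_to_ref[j] == i]

# Hypothetical 2-D embeddings of reference (labeled) and target (unlabeled) cells.
reference = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
target = [(0.2, 0.1), (5.1, 4.8), (20.0, 20.0)]

pairs = mutual_nearest_pairs(reference, target)
print(pairs)  # [(0, 0), (1, 1)]
```

The third target cell ends up in no pair, which is the kind of cell a discovery method might treat as a candidate novel type rather than force into a seen label.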

Subjects

Algorithms , Gene Expression Profiling , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Computer Simulation , Cluster Analysis , RNA Sequence Analysis/methods

10.

Microbiome Metabolome Integration Platform (MMIP): a web-based platform for microbiome and metabolome data integration and feature identification.

Gautam, Anupam; Bhowmik, Debaleena; Basu, Sayantani; Zeng, Wenhuan; Lahiri, Abhishake; Huson, Daniel H; Paul, Sandip.

Brief Bioinform ; 24(6), 2023 09 22.

Article in English | MEDLINE | ID: mdl-37771003

ABSTRACT

A microbial community maintains its ecological dynamics via metabolite crosstalk. Hence, knowledge of the metabolome, alongside its populace, would help us understand the functionality of a community and also predict how it will change in atypical conditions. Methods that employ low-cost metagenomic sequencing data can predict the metabolic potential of a community, that is, its ability to produce or utilize specific metabolites. These, in turn, can potentially serve as markers of biochemical pathways that are associated with different communities. We developed MMIP (Microbiome Metabolome Integration Platform), a web-based analytical and predictive tool that can be used to compare the taxonomic content, diversity variation and the metabolic potential between two sets of microbial communities from targeted amplicon sequencing data. MMIP is capable of highlighting statistically significant taxonomic, enzymatic and metabolic attributes as well as learning-based features associated with one group in comparison with another. Furthermore, MMIP can predict linkages among species or groups of microbes in the community, specific enzyme profiles, compounds or metabolites associated with such a group of organisms. With MMIP, we aim to provide a user-friendly, online web server for performing key microbiome-associated analyses of targeted amplicon sequencing data, predicting metabolite signature, and using learning-based linkage analysis, without the need for initial metabolomic analysis, and thereby helping in hypothesis generation.

Subjects

Metabolome , Microbiota , Metabolomics/methods , Internet

11.

FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction.

Li, Biaoshun; Lin, Mujie; Chen, Tiegen; Wang, Ling.

Brief Bioinform ; 24(6), 2023 09 22.

Article in English | MEDLINE | ID: mdl-37930026

ABSTRACT

Artificial intelligence-based molecular property prediction plays a key role in the design of molecules such as bioactive compounds and functional materials. In this study, we propose a self-supervised pretraining deep learning (DL) framework, called functional group bidirectional encoder representations from transformers (FG-BERT), pretrained on ~1.45 million unlabeled drug-like molecules, to learn meaningful representations of molecules from functional groups. The pretrained FG-BERT framework can be fine-tuned to predict molecular properties. Compared to state-of-the-art (SOTA) machine learning and DL methods, we demonstrate the high performance of FG-BERT in evaluating molecular properties in tasks involving physical chemistry, biophysics and physiology across 44 benchmark datasets. In addition, FG-BERT utilizes attention mechanisms to focus on FG features that are critical to the target properties, thereby providing excellent interpretability for downstream training tasks. Collectively, FG-BERT does not require any artificially crafted features as input and has excellent interpretability, providing an out-of-the-box framework for developing SOTA models for a variety of molecule (especially drug) discovery tasks.

Subjects

Algorithms , Artificial Intelligence , Benchmarking , Machine Learning

12.

BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation.

Wang, Zhen; Feng, Zheng; Li, Yanjun; Li, Bowen; Wang, Yongrui; Sha, Chulin; He, Min; Li, Xiaolin.

Brief Bioinform ; 25(1), 2023 11 22.

Article in English | MEDLINE | ID: mdl-38033291

ABSTRACT

Although substantial efforts have been made using graph neural networks (GNNs) for artificial intelligence (AI)-driven drug discovery, effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular property prediction, drug-drug interaction and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.

Subjects

Artificial Intelligence , Benchmarking , Drug Delivery Systems , Drug Discovery , Computer Neural Networks

13.

SMG: self-supervised masked graph learning for cancer gene identification.

Cui, Yan; Wang, Zhikang; Wang, Xiaoyu; Zhang, Yiwen; Zhang, Ying; Pan, Tong; Zhang, Zhe; Li, Shanshan; Guo, Yuming; Akutsu, Tatsuya; Song, Jiangning.

Brief Bioinform ; 24(6), 2023 09 22.

Article in English | MEDLINE | ID: mdl-37950905

ABSTRACT

Cancer genomics is dedicated to elucidating the genes and pathways that contribute to cancer progression and development. Identifying cancer genes (CGs) associated with the initiation and progression of cancer is critical for characterization of molecular-level mechanisms in cancer research. In recent years, the growing availability of high-throughput molecular data and advancements in deep learning technologies have enabled the modelling of complex interactions and topological information within genomic data. Nevertheless, because of the limited labelled data, pinpointing CGs from a multitude of potential mutations remains an exceptionally challenging task. To address this, we propose a novel deep learning framework, termed self-supervised masked graph learning (SMG), which comprises SMG reconstruction (pretext task) and task-specific fine-tuning (downstream task). In the pretext task, the nodes of multi-omic featured protein-protein interaction (PPI) networks are randomly substituted with a defined mask token. The PPI networks are then reconstructed using the graph neural network (GNN)-based autoencoder, which explores the node correlations in a self-prediction manner. In the downstream tasks, the pre-trained GNN encoder embeds the input networks into feature graphs, whereas a task-specific layer proceeds with the final prediction. To assess the performance of the proposed SMG method, benchmarking experiments are performed on three node-level tasks (identification of CGs, essential genes and healthy driver genes) and one graph-level task (identification of disease subnetwork) across eight PPI networks. Benchmarking experiments and performance comparison with existing state-of-the-art methods demonstrate the superiority of SMG on multi-omic feature engineering.

Subjects

Neoplasms , Oncogenes , Mutation , Benchmarking , Essential Genes , Genomics , Neoplasms/genetics

14.

An automatic immunofluorescence pattern classification framework for HEp-2 image based on supervised learning.

Fang, Kechi; Li, Chuan; Wang, Jing.

Brief Bioinform ; 24(3), 2023 05 19.

Article in English | MEDLINE | ID: mdl-37088980

ABSTRACT

Immunofluorescence patterns of anti-nuclear antibodies (ANAs) on human epithelial cell (HEp-2) substrates are important biomarkers for the diagnosis of autoimmune diseases. There are growing clinical requirements for an automatic readout and classification of ANA immunofluorescence patterns for HEp-2 images following the taxonomy recommended by the International Consensus on Antinuclear Antibody Patterns (ICAP). In this study, a comprehensive collection of HEp-2 specimen images covering a broad range of ANA patterns was established and manually annotated by experienced laboratory experts. By utilizing a supervised learning methodology, an automatic immunofluorescence pattern classification framework for HEp-2 specimen images was developed. The framework consists of a module for HEp-2 cell detection and cell-level feature extraction, followed by an image-level classifier that is capable of recognizing all 14 classes of ANA immunofluorescence patterns as recommended by ICAP. Performance analysis indicated an accuracy of 92.05% on the validation dataset and 87% on an independent test dataset, which has surpassed the performance of human examiners on the same test dataset. The proposed framework is expected to contribute to the automatic ANA pattern recognition in clinical laboratories to facilitate efficient and precise diagnosis of autoimmune diseases.

Subjects

Antinuclear Antibodies , Autoimmune Diseases , Humans , Immunofluorescence , Antinuclear Antibodies/analysis , Autoimmune Diseases/diagnosis , Epithelial Cells , Supervised Machine Learning

15.

CasANGCL: pre-training and fine-tuning model based on cascaded attention network and graph contrastive learning for molecular property prediction.

Zheng, Zixi; Tan, Yanyan; Wang, Hong; Yu, Shengpeng; Liu, Tianyu; Liang, Cheng.

Brief Bioinform ; 24(1), 2023 01 19.

Article in English | MEDLINE | ID: mdl-36592051

ABSTRACT

MOTIVATION: Molecular property prediction is a significant requirement in AI-driven drug design and discovery, aiming to predict the molecular property information (e.g. toxicity) based on the mined biomolecular knowledge. Although graph neural networks have been proven powerful in predicting molecular properties, unbalanced labeled data and poor generalization capability for newly synthesized molecules are always key issues that hinder further improvement of molecular encoding performance. RESULTS: We propose a novel self-supervised representation learning scheme based on a Cascaded Attention Network and Graph Contrastive Learning (CasANGCL). We design a new graph network variant, designated as cascaded attention network, to encode local-global molecular representations. We construct a two-stage contrast predictor framework to tackle the label imbalance problem of training molecular samples, which is an integrated end-to-end learning scheme. Moreover, we utilize the information-flow scheme for training our network, which explicitly captures the edge information in the node/graph representations and obtains more fine-grained knowledge. Our model achieves an 81.9% ROC-AUC average performance on 661 tasks from seven challenging benchmarks, showing better portability and generalization. Further visualization studies indicate our model's better representation capacity and provide interpretability.

Subjects

Benchmarking , Learning , Drug Design , Computer Neural Networks

16.

Breaking the barriers of data scarcity in drug-target affinity prediction.

Pei, Qizhi; Wu, Lijun; Zhu, Jinhua; Xia, Yingce; Xie, Shufang; Qin, Tao; Liu, Haiguang; Liu, Tie-Yan; Yan, Rui.

Brief Bioinform ; 24(6), 2023 09 22.

Article in English | MEDLINE | ID: mdl-37903413

ABSTRACT

Accurate prediction of drug-target affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the Semi-Supervised Multi-task training (SSM) framework for DTA prediction, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes.

Subjects

Benchmarking , Drug Discovery , Drug Delivery Systems

17.

scGAAC: A graph attention autoencoder for clustering single-cell RNA-sequencing data.

Zhang, Lin; Xiang, Haiping; Wang, Feng; Chen, Zepeng; Shen, Mo; Ma, Jiani; Liu, Hui; Zheng, Hongdang.

Methods ; 229: 115-124, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38950719

ABSTRACT

Single-cell RNA-sequencing (scRNA-seq) enables the investigation of intricate mechanisms governing cell heterogeneity and diversity. Clustering analysis remains a pivotal tool in scRNA-seq for discerning cell types. However, persistent challenges arise from noise, high dimensionality, and dropout in single-cell data. Despite the proliferation of scRNA-seq clustering methods, these often focus on extracting representations from individual cell expression data, neglecting potential intercellular relationships. To overcome this limitation, we introduce scGAAC, a novel clustering method based on an attention-based graph convolutional autoencoder. By leveraging structural information between cells through a graph attention autoencoder, scGAAC uncovers latent relationships while extracting representation information from single-cell gene expression patterns. An attention fusion module amalgamates the learned features of the graph attention autoencoder and the autoencoder through attention weights. Ultimately, a self-supervised learning policy guides model optimization. scGAAC, a hypothesis-free framework, performs better on four real scRNA-seq datasets than most state-of-the-art methods. The scGAAC implementation is publicly available on GitHub at: https://github.com/labiip/scGAAC.

Subjects

RNA Sequence Analysis , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Cluster Analysis , RNA Sequence Analysis/methods , RNA-Seq/methods , Algorithms , Software

18.

PhosBERT: A self-supervised learning model for identifying phosphorylation sites in SARS-CoV-2-infected human cells.

Li, Yong; Gao, Ru; Liu, Shan; Zhang, Hongqi; Lv, Hao; Lai, Hongyan.

Methods ; 230: 140-146, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39179191

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus that mainly causes respiratory and enteric diseases and is responsible for the outbreak of coronavirus disease 2019 (COVID-19). Numerous studies have demonstrated that SARS-CoV-2 infection leads to significant dysregulation of the protein post-translational modification profile in human cells. Accurate recognition of phosphorylation sites in host cells will contribute to a deeper understanding of the pathogenic mechanisms of SARS-CoV-2 and help to screen drugs and compounds with antiviral potential. Therefore, there is a need to develop cost-effective and high-precision computational strategies for specifically identifying phosphorylation sites in SARS-CoV-2-infected cells. In this work, we first implemented a custom neural network model (named PhosBERT) based on the pre-trained protein language model ProtBert, a self-supervised learning approach built on the Bidirectional Encoder Representations from Transformers (BERT) architecture. PhosBERT was then trained and validated with 5-fold cross-validation on a serine (S) and threonine (T) phosphorylation dataset and on a tyrosine (Y) phosphorylation dataset. Independent validation results showed that PhosBERT could identify S/T phosphorylation sites with an accuracy of 81.9% and an AUC (area under the receiver operating characteristic curve) of 0.896. For Y phosphorylation sites, the prediction accuracy and AUC reached 87.1% and 0.902. These results indicate that the proposed model has good predictive ability and stability and could provide a new approach for studying SARS-CoV-2 phosphorylation sites.
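
For readers unfamiliar with the two metrics reported in the abstract above, the sketch below shows how accuracy and AUC can be computed from binary labels and predicted scores. It is not the PhosBERT evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions chosen only for illustration.

```python
import numpy as np

def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    labels = np.asarray(labels)
    preds = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    return float((preds == labels).mean())

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as one half.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical toy predictions: 1 = phosphorylation site, 0 = non-site.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(toy_labels, toy_scores), roc_auc(toy_labels, toy_scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why abstracts often report both.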

Subjects

COVID-19, Neural Networks (Computer), SARS-CoV-2, Supervised Machine Learning, Humans, Phosphorylation, SARS-CoV-2/metabolism, COVID-19/virology, COVID-19/metabolism, Post-Translational Protein Processing, Computational Biology/methods, Coronavirus Spike Glycoprotein/metabolism

19.

Segmentation of supragranular and infragranular layers in ultra-high-resolution 7T ex vivo MRI of the human cerebral cortex.

Zeng, Xiangrui; Puonti, Oula; Sayeed, Areej; Herisse, Rogeny; Mora, Jocelyn; Evancic, Kathryn; Varadarajan, Divya; Balbastre, Yael; Costantini, Irene; Scardigli, Marina; Ramazzotti, Josephine; DiMeo, Danila; Mazzamuto, Giacomo; Pesce, Luca; Brady, Niamh; Cheli, Franco; Saverio Pavone, Francesco; Hof, Patrick R; Frost, Robert; Augustinack, Jean; van der Kouwe, André; Eugenio Iglesias, Juan; Fischl, Bruce.

Cereb Cortex ; 34(9)2024 Sep 03.

Article in English | MEDLINE | ID: mdl-39264753

ABSTRACT

Accurate labeling of specific layers in the human cerebral cortex is crucial for advancing our understanding of neurodevelopmental and neurodegenerative disorders. Building on recent advancements in ultra-high-resolution ex vivo MRI, we present a novel semi-supervised segmentation model capable of identifying supragranular and infragranular layers in ex vivo MRI with unprecedented precision. On a dataset consisting of 17 whole-hemisphere ex vivo scans at 120 μm, we propose a Multi-resolution U-Nets framework that integrates global and local structural information, achieving reliable segmentation maps of the entire hemisphere, with Dice scores over 0.8 for supra- and infragranular layers. This enables surface modeling, atlas construction, anomaly detection in disease states, and cross-modality validation while also paving the way for finer layer segmentation. Our approach offers a powerful tool for comprehensive neuroanatomical investigations and holds promise for advancing our mechanistic understanding of progression of neurodegenerative diseases.
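
The Dice score quoted in the abstract above measures overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it per label on a tiny label map. The label encoding (1 for supragranular, 2 for infragranular) and the toy arrays are assumptions for illustration, not the authors' data or code.

```python
import numpy as np

def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two boolean masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    denom = pred_mask.sum() + true_mask.sum()
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred_mask, true_mask).sum() / denom)

# Hypothetical label maps: 0 = background, 1 = supragranular, 2 = infragranular.
pred = np.array([[0, 1, 1, 2],
                 [0, 1, 2, 2]])
truth = np.array([[0, 1, 1, 2],
                  [0, 1, 1, 2]])

for label, name in [(1, "supragranular"), (2, "infragranular")]:
    print(name, dice_score(pred == label, truth == label))
```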

Subjects

Cerebral Cortex, Magnetic Resonance Imaging, Humans, Magnetic Resonance Imaging/methods, Cerebral Cortex/diagnostic imaging, Computer-Assisted Image Processing/methods, Female, Male, Aged, Middle Aged, Adult

20.

Cooperative learning for multiview analysis.

Ding, Daisy Yi; Li, Shuangning; Narasimhan, Balasubramanian; Tibshirani, Robert.

Proc Natl Acad Sci U S A ; 119(38): e2202113119, 2022 Sep 20.

Article in English | MEDLINE | ID: mdl-36095183

ABSTRACT

We propose a method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. "Cooperative learning" combines the usual squared-error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
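
To make the idea of a squared-error loss plus an agreement penalty concrete, the sketch below fits an unregularized linear version with two views. The objective used, 0.5 * ||y - X bx - Z bz||^2 + 0.5 * rho * ||X bx - Z bz||^2, is an assumed form consistent with the abstract above rather than a quotation from it, and the lasso penalty mentioned in the abstract is omitted. The function name and the simulated data are hypothetical.

```python
import numpy as np

def cooperative_least_squares(X, Z, y, rho):
    """Two-view least squares with an agreement penalty of weight rho.

    Solved as one ordinary least-squares problem on augmented data:
    the extra rows penalize the gap between the two views' predictions.
    """
    n = X.shape[0]
    s = np.sqrt(rho)
    A = np.vstack([np.hstack([X, Z]),
                   np.hstack([-s * X, s * Z])])
    b = np.concatenate([y, np.zeros(n)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:X.shape[1]], coef[X.shape[1]:]

# Simulated views that share a latent signal u (toy data, not from the paper).
rng = np.random.default_rng(0)
n = 200
u = rng.normal(size=n)
X = u[:, None] * np.array([1.0, 0.5, 0.0]) + rng.normal(size=(n, 3))
Z = u[:, None] * np.array([0.8, 0.0, 0.3]) + rng.normal(size=(n, 3))
y = u + 0.5 * rng.normal(size=n)

for rho in (0.0, 1.0, 10.0):
    bx, bz = cooperative_least_squares(X, Z, y, rho)
    gap = np.mean((X @ bx - Z @ bz) ** 2)
    print(f"rho={rho}: mean squared disagreement = {gap:.4f}")
```

With rho = 0 the fit is plain least squares on the concatenated features, which corresponds to early fusion; larger values of rho pull the two views' predictions toward each other.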

Subjects

Genomics, Neural Networks (Computer), Supervised Machine Learning, Genomics/methods
