Search | BVS CLAP/SMR-OPAS/OMS

1.

Detecting novel cell type in single-cell chromatin accessibility data via open-set domain adaptation.

Lin, Yuefan; Pan, Zixiang; Zeng, Yuansong; Yang, Yuedong; Dai, Zhiming.

Brief Bioinform ; 25(5)2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39073828

ABSTRACT

Recent advances in single-cell technologies enable the rapid growth of multi-omics data. Cell type annotation is a common task in analyzing single-cell data. A key challenge is that some cell types in the testing set may not be present in the training set (i.e. unknown cell types). Most scATAC-seq cell type annotation methods assign each cell in the testing set to one known type in the training set but neglect unknown cell types. Here, we present OVAAnno, an automatic cell type annotation method that uses open-set domain adaptation to detect unknown cell types in scATAC-seq data. Comprehensive experiments show that OVAAnno successfully identifies known and unknown cell types. Further experiments demonstrate that OVAAnno also performs well on scRNA-seq data. Our code is available online at https://github.com/lisaber/OVAAnno/tree/master.

Subjects

Chromatin; Single-Cell Analysis; Single-Cell Analysis/methods; Chromatin/metabolism; Chromatin/genetics; Humans; Software; Computational Biology/methods; Algorithms; Animals

2.

Comprehensive single-cell RNA-seq analysis using deep interpretable generative modeling guided by biological hierarchy knowledge.

Chen, Hegang; Lu, Yuyin; Dai, Zhiming; Yang, Yuedong; Li, Qing; Rao, Yanghui.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-38960404

ABSTRACT

Recent advances in microfluidics and sequencing technologies allow researchers to explore cellular heterogeneity at single-cell resolution. In recent years, deep learning frameworks, such as generative models, have brought great changes to the analysis of transcriptomic data. Nevertheless, relying on the potential space of these generative models alone is insufficient to generate biological explanations. In addition, most of the previous work based on generative models is limited to shallow neural networks with one to three layers of latent variables, which may limit the capabilities of the models. Here, we propose a deep interpretable generative model called d-scIGM for single-cell data analysis. d-scIGM combines sawtooth connectivity techniques and residual networks, thereby constructing a deep generative framework. In addition, d-scIGM incorporates hierarchical prior knowledge of biological domains to enhance the interpretability of the model. We show that d-scIGM achieves excellent performance in a variety of fundamental tasks, including clustering, visualization, and pseudo-temporal inference. Through topic pathway studies, we found that d-scIGM-learned topics are better enriched for biologically meaningful pathways compared to the baseline models. Furthermore, the analysis of drug response data shows that d-scIGM can capture drug response patterns in large-scale experiments, which provides a promising way to elucidate the underlying biological mechanisms. Lastly, in the melanoma dataset, d-scIGM accurately identified different cell types and revealed multiple melanin-related driver genes and key pathways, which are critical for understanding disease mechanisms and drug development.

Subjects

Deep Learning; RNA-Seq; Single-Cell Gene Expression Analysis; Humans; Algorithms; Computational Biology/methods; Neural Networks, Computer; RNA-Seq/methods; Single-Cell Gene Expression Analysis/methods

3.

Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction.

Chen, Ken; Zhou, Yue; Ding, Maolin; Wang, Yu; Ren, Zhixiang; Yang, Yuedong.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38605640

ABSTRACT

Language models pretrained by self-supervised learning (SSL) have been widely used to study protein sequences, whereas few models have been developed for genomic sequences, and those were limited to a single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlights the importance of pretraining genomic language models on a diverse range of species and suggests that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.

Subjects

RNA Splicing; Vertebrates; Animals; Humans; Base Sequence; Vertebrates/genetics; RNA; Supervised Machine Learning

4.

GPSFun: geometry-aware protein sequence function predictions with language models.

Yuan, Qianmu; Tian, Chong; Song, Yidong; Ou, Peihua; Zhu, Mingming; Zhao, Huiying; Yang, Yuedong.

Nucleic Acids Res ; 52(W1): W248-W255, 2024 Jul 05.

Article in English | MEDLINE | ID: mdl-38738636

ABSTRACT

Knowledge of protein function is essential for elucidating disease mechanisms and discovering new drug targets. However, there is a widening gap between the exponential growth of protein sequences and their limited function annotations. In our prior studies, we have developed a series of methods including GraphPPIS, GraphSite, LMetalSite and SPROF-GO for protein function annotations at residue or protein level. To further enhance their applicability and performance, we now present GPSFun, a versatile web server for Geometry-aware Protein Sequence Function annotations, which equips our previous tools with language models and geometric deep learning. Specifically, GPSFun employs large language models to efficiently predict 3D conformations of the input protein sequences and extract informative sequence embeddings. Subsequently, geometric graph neural networks are utilized to capture the sequence and structure patterns in the protein graphs, facilitating various downstream predictions including protein-ligand binding sites, gene ontologies, subcellular locations and protein solubility. Notably, GPSFun achieves superior performance to state-of-the-art methods across diverse tasks without requiring multiple sequence alignments or experimental protein structures. GPSFun is freely available to all users at https://bio-web1.nscc-gz.cn/app/GPSFun with user-friendly interfaces and rich visualizations.

Subjects

Proteins; Software; Proteins/chemistry; Proteins/metabolism; Protein Conformation; Sequence Analysis, Protein; Deep Learning; Binding Sites; Molecular Sequence Annotation; Neural Networks, Computer; Amino Acid Sequence; Humans; Internet

5.

EVLncRNAs 3.0: an updated comprehensive database for manually curated functional long non-coding RNAs validated by low-throughput experiments.

Zhou, Bailing; Ji, Baohua; Shen, Congcong; Zhang, Xia; Yu, Xue; Huang, Pingping; Yu, Ru; Zhang, Hongmei; Dou, Xianghua; Chen, Qingshuai; Zeng, Qiangcheng; Wang, Xiaoxin; Cao, Zanxia; Hu, Guodong; Xu, Shicai; Zhao, Huiying; Yang, Yuedong; Zhou, Yaoqi; Wang, Jihua.

Nucleic Acids Res ; 52(D1): D98-D106, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37953349

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as crucial regulators across diverse biological processes and diseases. While high-throughput sequencing has enabled lncRNA discovery, functional characterization remains limited. The EVLncRNAs database is the first and exclusive repository for all experimentally validated functional lncRNAs from various species. After previous releases in 2018 and 2021, this update marks a major expansion through exhaustive manual curation of nearly 25 000 publications from 15 May 2020 to 15 May 2023. It incorporates substantial growth across all categories: a 154% increase in functional lncRNAs, 160% in associated diseases, 186% in lncRNA-disease associations, 235% in interactions, 138% in structures, 234% in circular RNAs, 235% in resistant lncRNAs and 4724% in exosomal lncRNAs. More importantly, it incorporates additional information, including functional classifications, detailed interaction pathways, homologous lncRNAs, lncRNA locations, and COVID-19, phase-separation and organoid-related lncRNAs. The web interface was substantially improved for browsing, visualization, and searching. ChatGPT was tested for information extraction and functional overview, with its limitations noted. EVLncRNAs 3.0 represents the most extensive curated resource of experimentally validated functional lncRNAs and will serve as an indispensable platform for unravelling emerging lncRNA functions. The updated database is freely available at https://www.sdklab-biophysics-dzu.net/EVLncRNAs3/.

Subjects

Databases, Nucleic Acid; RNA, Long Noncoding; Data Management; Information Storage and Retrieval; RNA, Long Noncoding/genetics

6.

Subgraph extraction and graph representation learning for single cell Hi-C imputation and clustering.

Zheng, Jiahao; Yang, Yuedong; Dai, Zhiming.

Brief Bioinform ; 25(1)2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38040494

ABSTRACT

Single-cell Hi-C (scHi-C) technology enables the investigation of 3D chromatin structure variability across individual cells. However, the analysis of scHi-C data is challenged by a large number of missing values. Here, we present HiC-SGL, a scHi-C data imputation model based on subgraph extraction and graph representation learning. HiC-SGL can also learn informative low-dimensional embeddings of cells. We demonstrate that our method surpasses existing methods in imputation accuracy and clustering performance across various metrics.

Subjects

Chromatin; Chromatin/genetics; Cluster Analysis

7.

Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures.

Song, Yidong; Yuan, Qianmu; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 24(6)2023 Sep 22.

Article in English | MEDLINE | ID: mdl-37824738

ABSTRACT

The interactions between nucleic acids and proteins are important in diverse biological processes. The high-quality prediction of nucleic-acid-binding sites continues to pose a significant challenge. Presently, the predictive efficacy of sequence-based methods is constrained by their exclusive consideration of sequence context information, whereas structure-based methods are unsuitable for proteins lacking known tertiary structures. Though protein structures predicted by AlphaFold2 could be used, the extensive computing requirement of AlphaFold2 hinders its use for genome-wide applications. Based on the recent breakthrough of ESMFold for fast prediction of protein structures, we have developed GLMSite, which accurately identifies DNA- and RNA-binding sites using geometric graph learning on ESMFold-predicted structures. Here, the predicted protein structures are employed to construct a protein structural graph with residues as nodes and spatially neighboring residue pairs as edges. The node representations are further enhanced through the pre-trained language model ProtTrans. The network was trained using a geometric vector perceptron, and the geometric embeddings were subsequently fed into a common network to acquire common binding characteristics. Finally, these characteristics were input into two fully connected layers to predict binding sites with DNA and RNA, respectively. Through comprehensive tests on DNA/RNA benchmark datasets, GLMSite was shown to surpass the latest sequence-based methods and be comparable with structure-based methods. Moreover, the prediction was shown to be useful for inferring nucleic-acid-binding proteins, demonstrating its potential for protein function discovery. The datasets, code, and trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
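
The graph construction described in this abstract (residues as nodes, spatially neighboring residue pairs as edges) can be illustrated with a minimal sketch. It is not the GLMSite implementation: the function name `build_residue_graph`, the use of one 3D coordinate per residue, and the 10 Å cutoff are assumptions chosen only for illustration.

```python
import math

def build_residue_graph(coords, cutoff=10.0):
    """Return undirected edges (i, j) for residue pairs closer than `cutoff`.

    coords: one (x, y, z) tuple per residue (assumed, e.g. C-alpha atoms).
    cutoff: distance threshold in angstroms (hypothetical value).
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                edges.append((i, j))
    return edges

# Toy chain: three residues spaced 3.8 angstroms apart plus one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = build_residue_graph(toy_coords)
```

In this toy input the first three residues are mutually connected and the distant fourth residue stays isolated. A real pipeline would attach per-residue features (such as language model embeddings) to the nodes before any graph learning step.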

Subjects

Neural Networks, Computer; Proteins; Binding Sites; Proteins/chemistry; RNA/metabolism; DNA; Language

8.

Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion.

Yuan, Qianmu; Xie, Junjie; Xie, Jiancong; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 24(3)2023 May 19.

Article in English | MEDLINE | ID: mdl-36964722

ABSTRACT

Protein function prediction is an essential task in bioinformatics which benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further advanced by exploiting the homology information and accounting for the overlapping communities of proteins with related functions through the label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5, 27.3 and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well on non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source code and trained models of SPROF-GO are available at https://github.com/biomed-AI/SPROF-GO. The SPROF-GO web server is freely available at http://bio-web1.nscc-gz.cn/app/sprof-go.

Subjects

Proteins; Software; Proteins/metabolism; Algorithms; Computational Biology/methods; Gene Ontology

9.

Fast and accurate protein intrinsic disorder prediction by using a pretrained language model.

Song, Yidong; Yuan, Qianmu; Chen, Sheng; Chen, Ken; Zhou, Yaoqi; Yang, Yuedong.

Brief Bioinform ; 24(4)2023 Jul 20.

Article in English | MEDLINE | ID: mdl-37204193

ABSTRACT

Determining intrinsically disordered regions of proteins is essential for elucidating protein biological functions and the mechanisms of their associated diseases. As the gap between the number of experimentally determined protein structures and the number of protein sequences continues to grow exponentially, there is a need for an accurate and computationally efficient disorder predictor. However, current single-sequence-based methods are of low accuracy, while evolutionary profile-based methods are computationally intensive. Here, we propose LMDisorder, a fast and accurate protein disorder predictor that employs embeddings generated by unsupervised pretrained language models as features. We show that LMDisorder performs best among all single-sequence-based methods and is comparable to or better than another language-model-based technique in four independent test sets. Furthermore, LMDisorder shows equivalent or even better performance than the state-of-the-art profile-based technique SPOT-Disorder2. In addition, the high computational efficiency of LMDisorder enabled proteome-scale analysis of the human proteome, showing that proteins with high predicted disorder content are associated with specific biological functions. The datasets, the source code, and the trained model are available at https://github.com/biomed-AI/LMDisorder.

Subjects

Proteome; Software; Humans; Amino Acid Sequence; Biological Evolution

10.

Identifying spatial domain by adapting transcriptomics with histology through contrastive learning.

Zeng, Yuansong; Yin, Rui; Luo, Mai; Chen, Jianing; Pan, Zixiang; Lu, Yutong; Yu, Weijiang; Yang, Yuedong.

Brief Bioinform ; 24(2)2023 Mar 19.

Article in English | MEDLINE | ID: mdl-36781228

ABSTRACT

Recent advances in spatial transcriptomics have enabled measurements of gene expression at cell/spot resolution while retaining both the spatial information and the histology images of the tissues. Accurately identifying the spatial domains of spots is a vital step for various downstream tasks in spatial transcriptomics analysis. To remove noise in gene expression, several methods have been developed that incorporate histopathological images into the analysis of spatial transcriptomics data. However, these methods either use the image only for the spatial relations of spots, or learn the embeddings of the gene expression and image separately without fully coupling the information. Here, we propose a novel method, ConGI, to accurately identify spatial domains by adapting gene expression with histopathological images through contrastive learning. Specifically, we designed three contrastive loss functions within and between two modalities (the gene expression and image data) to learn the common representations. The learned representations are then used to cluster the spatial domains on both tumor and normal spatial transcriptomics datasets. ConGI was shown to outperform existing methods for spatial domain identification. In addition, the learned representations have also been shown to be powerful for various downstream tasks, including trajectory inference, clustering, and visualization.
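
The abstract does not spell out the three contrastive losses, so the following is only a generic sketch of one cross-modal contrastive objective (an InfoNCE-style loss), not the ConGI formulation. The function names, the cosine similarity, and the temperature of 0.1 are assumptions. Each spot's gene-expression embedding is pulled toward the image embedding of the same spot and pushed away from the image embeddings of other spots.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(gene_emb, image_emb, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be the same spot."""
    losses = []
    for i, g in enumerate(gene_emb):
        logits = [cosine(g, img) / temperature for img in image_emb]
        # Numerically stable log-sum-exp over all candidate image embeddings.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

gene = [(1.0, 0.0), (0.0, 1.0)]
aligned = info_nce(gene, [(1.0, 0.0), (0.0, 1.0)])
swapped = info_nce(gene, [(0.0, 1.0), (1.0, 0.0)])
```

With matched embeddings the loss is close to zero, while swapping the image embeddings between spots makes it large, which is the behavior a contrastive objective is meant to reward.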

Subjects

Learning; Transcriptome; Gene Expression Profiling; Cluster Analysis; Data Analysis

11.

From intuition to AI: evolution of small molecule representations in drug discovery.

McGibbon, Miles; Shave, Steven; Dong, Jie; Gao, Yumiao; Houston, Douglas R; Xie, Jiancong; Yang, Yuedong; Schwaller, Philippe; Blay, Vincent.

Brief Bioinform ; 25(1)2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38033290

ABSTRACT

Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.

Subjects

Drug Discovery; Intuition; Humans; Learning

12.

EVlncRNA-Dpred: improved prediction of experimentally validated lncRNAs by deep learning.

Zhou, Bailing; Ding, Maolin; Feng, Jing; Ji, Baohua; Huang, Pingping; Zhang, Junye; Yu, Xue; Cao, Zanxia; Yang, Yuedong; Zhou, Yaoqi; Wang, Jihua.

Brief Bioinform ; 24(1)2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36573492

ABSTRACT

Long non-coding RNAs (lncRNAs) play essential roles in nearly every biological process and disease. Many algorithms have been developed to distinguish lncRNAs from mRNAs in transcriptomic data and have facilitated the discovery of more than 600 000 lncRNAs. However, only a tiny fraction (<1%) of lncRNA transcripts (~4000) have been further validated by low-throughput experiments (EVlncRNAs). Given the cost and labor-intensive nature of experimental validation, it is necessary to develop computational tools to prioritize potentially functional lncRNAs, because many lncRNAs from high-throughput sequencing (HTlncRNAs) could result from transcriptional noise. Here, we employed deep learning algorithms to separate EVlncRNAs from HTlncRNAs and mRNAs. To overcome the challenge of small datasets, we employed a three-layer deep-learning neural network (DNN) with a K-mer feature as the input and a small convolutional neural network (CNN) with one-hot encoding as the input. Three separate models were trained for human (h), mouse (m) and plant (p), respectively. The final concatenated models (EVlncRNA-Dpred (h), EVlncRNA-Dpred (m) and EVlncRNA-Dpred (p)) provided substantial improvement over a previous model based on support-vector-machines (EVlncRNA-pred). For example, EVlncRNA-Dpred (h) achieved 0.896 for the area under the receiver-operating characteristic curve, compared with 0.582 given by the sequence-based EVlncRNA-pred model. The models developed here should be useful for screening lncRNA transcripts for experimental validation. EVlncRNA-Dpred is available as a web server at https://www.sdklab-biophysics-dzu.net/EVlncRNA-Dpred/index.html, and the data and source code are freely available along with the web server.
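
The two input encodings named in this abstract, a K-mer feature and one-hot encoding, can be sketched as follows. This is a generic illustration, not the EVlncRNA-Dpred code: the function names, the RNA alphabet `ACGU`, the choice of k, and the skipping of K-mers with ambiguous bases are all assumptions.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; ambiguous bases are ignored

def kmer_frequencies(seq, k=3):
    """Return normalized counts of all 4**k K-mers in a fixed (lexicographic) order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing non-ACGU characters
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def one_hot(seq):
    """Encode each base as a 4-element indicator row (all zeros if unknown)."""
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq]

freqs = kmer_frequencies("ACGUAC", k=2)
encoded = one_hot("AC")
```

The K-mer vector has a fixed length regardless of transcript length, which suits a fully connected network, whereas the one-hot matrix keeps positional information and suits a convolutional network.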

Subjects

Deep Learning; RNA, Long Noncoding; Humans; Animals; Mice; RNA, Long Noncoding/genetics; Computational Biology/methods; Software; Algorithms; RNA, Messenger/genetics

13.

An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome.

Chai, Hua; Lin, Siyin; Lin, Junqi; He, Minfan; Yang, Yuedong; OuYang, Yongzhong; Zhao, Huiying.

BMC Bioinformatics ; 25(1): 88, 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-38418940

ABSTRACT

BACKGROUND: Predicting the outcome of breast cancer is important for selecting appropriate treatments and prolonging the survival of patients. Recently, various deep learning-based methods have been designed for cancer outcome prediction. However, the application of these methods is still limited by their lack of interpretability. In this study, we proposed a novel multitask deep neural network called UISNet to predict the outcome of breast cancer. UISNet is able to interpret the importance of features for the prediction model via an uncertainty-based integrated gradients algorithm. UISNet improved the prediction by introducing prior biological pathway knowledge and utilizing patient heterogeneity information. RESULTS: The model was tested on seven public breast cancer datasets and showed better performance (average C-index = 0.691) than the state-of-the-art methods (average C-index = 0.650, ranging from 0.619 to 0.677). Importantly, UISNet identified 20 genes as associated with breast cancer, among which 11 have been shown to be associated with breast cancer by previous studies, and the others are novel findings of this study. CONCLUSIONS: Our proposed method is accurate and robust in predicting breast cancer outcomes, and it is an effective way to identify breast cancer-associated genes. The code is available at https://github.com/chh171/UISNet.
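
The C-index reported in this abstract is a concordance measure for survival predictions. The sketch below shows one common definition (Harrell's concordance index) under stated assumptions; it is not taken from the UISNet code, and the function name and argument layout are hypothetical. A pair of patients is comparable when the one with the shorter follow-up time had an observed event, and it is concordant when that patient also received the higher predicted risk.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have an observed event before patient j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

perfect = concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0])
inverted = concordance_index([1, 2, 3], [1, 1, 1], [1.0, 2.0, 3.0])
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ordered incorrectly.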

Subjects

Breast Neoplasms; Deep Learning; Humans; Female; Breast Neoplasms/genetics; Uncertainty; Neural Networks, Computer; Algorithms

14.

Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting.

Ding, Maolin; Chen, Ken; Yang, Yuedong; Zhao, Huiying.

Hum Genet ; 2024 Apr 04.

Article in English | MEDLINE | ID: mdl-38575818

ABSTRACT

Genetic diseases are mostly associated with genetic variants, including missense, synonymous, non-sense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. It remains essential but challenging to understand the functional consequences of these genetic variants, especially the noncoding ones, due to the lack of corresponding annotations. While many computational methods have been proposed to identify risk variants, most of them have only curated DNA-level and protein-level annotations to predict the pathogenicity of the variants, and others have been restricted to missense variants exclusively. In this study, we have curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions, where the features of protein sequences and protein structures have been shown to be essential for analyzing missense variants in coding regions, while the features related to RNA splicing and RBP binding are significant for variants in noncoding regions and synonymous variants in coding regions. Through the integration of these features, we have formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using the gradient boosting tree. The method has been trained on more than 400,000 variants in the Sherloc training set from the 6th critical assessment of genome interpretation with superior performance. The method is one of the two best-performing predictors on the blind test in the Sherloc assessment, and is further confirmed by another independent test dataset of de novo variants.

15.

Genome-wide association and Mendelian randomization analysis provide insights into the shared genetic architecture between high-dimensional electrocardiographic features and ischemic heart disease.

Wang, Xinfeng; Qi, Mengling; Zhang, Haoyang; Yang, Yuedong; Zhao, Huiying.

Hum Genet ; 143(1): 49-58, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38180560

ABSTRACT

Observational studies have revealed that ischemic heart disease (IHD) has a unique manifestation on the electrocardiogram (ECG). However, the genetic relationships between IHD and ECG remain unclear. We used 12-lead ECG features as phenotypes to conduct genome-wide association studies (GWAS) for 41,960 samples from UK-Biobank (UKB). By leveraging large-scale GWAS summary statistics of ECG and IHD (downloaded from the FinnGen database), we performed LD score regression (LDSC), Mendelian randomization (MR), and polygenic risk score (PRS) regression to explore genetic relationships between IHD and ECG. Finally, we constructed an XGBoost model to predict IHD by integrating PRS and ECG. The GWAS identified 114 independent SNPs significantly (P value < 5 × 10^-8/800, where 800 denotes the number of ECG features) associated with ECG. LDSC analysis indicated significant (P value < 0.05) genetic correlations between 39 ECG features and IHD. MR analysis performed by five approaches showed a putative causal effect of IHD on four S wave related ECG features at lead III. Integrating PRS for these ECG features with age and gender, the XGBoost model achieved an area under the curve (AUC) of 0.72 in predicting IHD. Here, we provide genetic evidence supporting the use of S wave related ECG features at lead III to monitor IHD risk, and open up a unique approach to integrating ECG with genetic factors for early warning of IHD.
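
The significance threshold quoted in this abstract divides the conventional genome-wide level of 5 × 10^-8 by the 800 ECG features tested, which is a Bonferroni-style correction. The arithmetic is shown below as a short sketch; the constant and function names are illustrative and not from the paper.

```python
GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance level
N_ECG_FEATURES = 800      # number of ECG features stated in the abstract

# Corrected per-test threshold: 5e-8 / 800 = 6.25e-11
threshold = GENOME_WIDE_ALPHA / N_ECG_FEATURES

def is_significant(p_value):
    """Return True if a SNP-feature association passes the corrected threshold."""
    return p_value < threshold
```

Under this threshold, an association with P = 1 × 10^-9 would pass the usual genome-wide cutoff but would not count as significant in the multi-feature analysis.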

Subjects

Genome-Wide Association Study; Myocardial Ischemia; Humans; Mendelian Randomization Analysis/methods; Myocardial Ischemia/genetics; Polymorphism, Single Nucleotide; Phenotype; Genetic Risk Score

16.

VCAT: an integrated variant function annotation tools.

Huang, Bi; Fan, Cong; Chen, Ken; Rao, Jiahua; Ou, Peihua; Tian, Chong; Yang, Yuedong; Cooper, David N; Zhao, Huiying.

Hum Genet ; 2024 Aug 27.

Article in English | MEDLINE | ID: mdl-39192052

ABSTRACT

The development of sequencing technology has promoted the discovery of variants in the human genome. Identifying the functions of these variants is important for linking genotype to phenotype and for diagnosing diseases. However, this usually requires researchers to visit multiple databases. Here, we present a one-stop webserver for variant function annotation (VCAT, https://biomed.nscc-gz.cn/zhaolab/VCAT/) that is the first to connect variants to functions via the epigenome, protein, drug and RNA. VCAT is also the first to visualize all annotations in interactive charts or molecular structures. VCAT allows users to upload data in VCF format and download results via a URL. Moreover, VCAT has annotated a huge number (1,262,041,068) of variants collected from dbSNP, the 1000 Genomes Project, gnomAD, ICGC, TCGA, and the HPRC Pangenome project. For these variants, users are able to search for their functions, related diseases and drugs in VCAT. In summary, VCAT provides a one-stop webserver to explore the potential functions of human genomic variants, including their relationship with diseases and drugs.

17.

Predicting the effects of mutations on protein solubility using graph convolution network and protein language model representation.

Wang, Jing; Chen, Sheng; Yuan, Qianmu; Chen, Jianwen; Li, Danping; Wang, Lei; Yang, Yuedong.

J Comput Chem ; 45(8): 436-445, 2024 Mar 30.

Article in English | MEDLINE | ID: mdl-37933773

ABSTRACT

Solubility is one of the most important properties of proteins. Protein solubility can be greatly changed by single amino acid mutations, and reduced protein solubility can lead to diseases. Since experimental methods to determine solubility are time-consuming and expensive, in-silico methods have been developed to predict the protein solubility changes caused by mutations, mostly through protein evolution information. However, these methods are slow since it takes a long time to obtain evolution information through multiple sequence alignment. In addition, these methods are of low performance because they do not fully utilize protein 3D structures due to a lack of experimental structures for most proteins. Here, we proposed a sequence-based method, DeepMutSol, to predict solubility change from residue mutations based on the Graph Convolutional Neural Network (GCN), where the protein graph was initiated according to the protein structure predicted by AlphaFold2, and the nodes (residues) were represented by protein language embeddings. To circumvent the small amount of data on solubility changes, we further pretrained the model on absolute protein solubility. DeepMutSol was shown to outperform state-of-the-art methods in benchmark tests. In addition, we applied the method to clinically relevant genes from the ClinVar database, and the predicted solubility changes were shown to be able to separate pathogenic mutations. All of the data sets and the source code are available at https://github.com/biomed-AI/DeepMutSol.

Subjects

Amino Acids; Benchmarking; Solubility; Mutation; Language

18.

Capturing large genomic contexts for accurately predicting enhancer-promoter interactions.

Chen, Ken; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 23(2)2022 Mar 10.

Article in English | MEDLINE | ID: mdl-35062021

ABSTRACT

Enhancer-promoter interaction (EPI) is a key mechanism underlying gene regulation. EPI prediction has always been a challenging task because enhancers could regulate promoters of distant target genes. Although many machine learning models have been developed, they leverage only the features in enhancers and promoters, or simply add the average genomic signals in the regions between enhancers and promoters, without utilizing detailed features between or outside enhancers and promoters. Due to a lack of large-scale features, existing methods could achieve only moderate performance, especially for predicting EPIs in different cell types. Here, we present a Transformer-based model, TransEPI, for EPI prediction by capturing large genomic contexts. TransEPI was developed based on EPI datasets derived from Hi-C or ChIA-PET data in six cell lines. To avoid over-fitting, we evaluated the TransEPI model by testing it on independent test datasets where the cell line and chromosome are different from the training data. TransEPI not only achieved consistent performance across the cross-validation and test datasets from different cell types but also outperformed the state-of-the-art machine learning and deep learning models. In addition, we found that the improved performance of TransEPI was attributed to the integration of large genomic contexts. Lastly, TransEPI was extended to study the non-coding mutations associated with brain disorders or neural diseases, and we found that TransEPI was also useful for predicting the target genes of non-coding mutations.

Subjects

Genetic Enhancer Elements , Genomics , Cell Line , Genomics/methods , Machine Learning , Genetic Promoter Regions

19.

AlphaFold2-aware protein-DNA binding site prediction using graph transformer.

Yuan, Qianmu; Chen, Sheng; Rao, Jiahua; Zheng, Shuangjia; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 23(2)2022 03 10.

Article in English | MEDLINE | ID: mdl-35039821

ABSTRACT

Protein-DNA interactions play crucial roles in biological systems, and identifying protein-DNA binding sites is the first step toward a mechanistic understanding of various biological activities (such as transcription and repair) and toward designing novel drugs. Accurately identifying DNA-binding residues from protein sequence alone remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which limits their ability to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and a graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient, respectively. We provide the datasets, the predicted structures and the source code, along with the pre-trained models of GraphSite, at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.
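One of the two metrics reported in this abstract, the Matthews correlation coefficient (MCC), can be computed from per-residue binary labels as in the minimal sketch below. The function name `matthews_corrcoef` and the toy labels are illustrative assumptions; the sketch uses the standard MCC formula and does not reproduce the evaluation code of GraphSite.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = binding residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy per-residue labels: two binding residues, one of which is missed.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
mcc = matthews_corrcoef(y_true, y_pred)
```

MCC is often preferred for binding-site prediction because binding residues are usually a small minority, and the coefficient accounts for all four cells of the confusion matrix.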

Subjects

Algorithms , Proteins , Binding Sites , DNA/metabolism , Protein Binding , Protein Domains , Proteins/chemistry

20.

Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning.

Yuan, Qianmu; Chen, Sheng; Wang, Yu; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 23(6)2022 11 19.

Article in English | MEDLINE | ID: mdl-36274238

ABSTRACT

More than one-third of the proteins in the Protein Data Bank contain metal ions. Correct identification of metal ion-binding residues is important for understanding protein functions and designing novel drugs. Due to the small size and high versatility of metal ions, it remains challenging to computationally predict their binding sites from protein sequence. Existing sequence-based methods have low accuracy due to the lack of structural information, and are time-consuming owing to the use of multiple sequence alignment. Here, we propose LMetalSite, an alignment-free sequence-based predictor for binding sites of the four most frequently seen metal ions in BioLiP (Zn2+, Ca2+, Mg2+ and Mn2+). LMetalSite leverages a pretrained language model to rapidly generate informative sequence representations and employs a transformer to capture long-range dependencies. Multi-task learning is adopted to compensate for the scarcity of training data and capture the intrinsic similarities between different metal ions. LMetalSite was shown to surpass state-of-the-art structure-based methods by more than 19.7, 14.4, 36.8 and 12.6% in area under the precision-recall curve on the four independent tests, respectively. Further analyses indicated that the self-attention modules are effective at learning the structural contexts of residues from protein sequence. We provide the data sets, source code and trained models of LMetalSite at https://github.com/biomed-AI/LMetalSite.
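The multi-task setup mentioned in this abstract can be illustrated with a masked loss in which each metal ion is a separate binary task and only the ions annotated for a given protein contribute. This is a sketch under assumptions, not the LMetalSite training code: the abstract does not specify the loss, and the names `IONS`, `bce` and `multitask_loss`, as well as the dictionary layout, are hypothetical.

```python
import math

# The four ion types named in the abstract, used here as task keys.
IONS = ("ZN", "CA", "MG", "MN")


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one residue, with clipping for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(probs, labels):
    """Average per-residue loss over the ion tasks that have annotations.

    probs:  {ion: [predicted binding probability per residue]} for every ion
    labels: {ion: [0/1 per residue]} only for ions annotated for this protein
    """
    total, count = 0.0, 0
    for ion in IONS:
        if ion not in labels:
            continue  # unannotated task: masked out of the loss
        for prob, label in zip(probs[ion], labels[ion]):
            total += bce(prob, label)
            count += 1
    return total / count if count else 0.0


# Toy protein with three residues, annotated for zinc only.
probs = {ion: [0.5, 0.5, 0.5] for ion in IONS}
labels = {"ZN": [1, 0, 0]}
loss = multitask_loss(probs, labels)
```

Masking lets proteins annotated for different ions be trained together, so a shared encoder sees all the data while each task head is updated only by its own labels.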

Subjects

Language , Proteins , Protein Conformation , Protein Binding , Binding Sites , Proteins/chemistry , Metals/chemistry , Metals/metabolism , Ions/chemistry
