Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 15 de 15
Filtrar
1.
Hum Genet ; 143(11): 1311-1322, 2024 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-39192052

RESUMO

The development of sequencing technology has promoted discovery of variants in the human genome. Identifying functions of these variants is important for us to link genotype to phenotype, and to diagnose diseases. However, it usually requires researchers to visit multiple databases. Here, we presented a one-stop webserver for variant function annotation tools (VCAT, https://biomed.nscc-gz.cn/zhaolab/VCAT/ ) that is the first one connecting variant to functions via the epigenome, protein, drug and RNA. VCAT is also the first one to make all annotations visualized in interactive charts or molecular structures. VCAT allows users to upload data in VCF format, and download results via a URL. Moreover, VCAT has annotated a huge number (1,262,041,068) of variants collected from dbSNP, 1000 Genomes projects, gnomAD, ICGC, TCGA, and HPRC Pangenome project. For these variants, users are able to searcher their functions, related diseases and drugs from VCAT. In summary, VCAT provides a one-stop webserver to explore the potential functions of human genomic variants including their relationship with diseases and drugs.


Assuntos
Genoma Humano , Anotação de Sequência Molecular , Software , Humanos , Variação Genética , Bases de Dados Genéticas , Biologia Computacional/métodos , Genômica/métodos
2.
Brief Bioinform ; 23(2)2022 03 10.
Artigo em Inglês | MEDLINE | ID: mdl-35039821

RESUMO

Protein-DNA interactions play crucial roles in the biological systems, and identifying protein-DNA binding sites is the first step for mechanistic understanding of various biological activities (such as transcription and repair) and designing novel drugs. How to accurately identify DNA-binding residues from only protein sequence remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which are limited to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient, respectively. We provide the datasets, the predicted structures and the source codes along with the pre-trained models of GraphSite at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.


Assuntos
Algoritmos , Proteínas , Sítios de Ligação , DNA/metabolismo , Ligação Proteica , Domínios Proteicos , Proteínas/química
3.
J Chem Inf Model ; 64(6): 1945-1954, 2024 03 25.
Artigo em Inglês | MEDLINE | ID: mdl-38484468

RESUMO

Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.


Assuntos
Aprendizado de Máquina , Reconhecimento Automatizado de Padrão
4.
Brief Bioinform ; 22(4)2021 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-33341877

RESUMO

Biomedical knowledge graphs (KGs), which can help with the understanding of complex biological systems and pathologies, have begun to play a critical role in medical practice and research. However, challenges remain in their embedding and use due to their complex nature and the specific demands of their construction. Existing studies often suffer from problems such as sparse and noisy datasets, insufficient modeling methods and non-uniform evaluation metrics. In this work, we established a comprehensive KG system for the biomedical field in an attempt to bridge the gap. Here, we introduced PharmKG, a multi-relational, attributed biomedical KG, composed of more than 500 000 individual interconnections between genes, drugs and diseases, with 29 relation types over a vocabulary of ~8000 disambiguated entities. Each entity in PharmKG is attached with heterogeneous, domain-specific information obtained from multi-omics data, i.e. gene expression, chemical structure and disease word embedding, while preserving the semantic and biomedical features. For baselines, we offered nine state-of-the-art KG embedding (KGE) approaches and a new biological, intuitive, graph neural network-based KGE method that uses a combination of both global network structure and heterogeneous domain features. Based on the proposed benchmark, we conducted extensive experiments to assess these KGE models using multiple evaluation metrics. Finally, we discussed our observations across various downstream biological tasks and provide insights and guidelines for how to use a KG in biomedicine. We hope that the unprecedented quality and diversity of PharmKG will lead to advances in biomedical KG construction, embedding and application.


Assuntos
Pesquisa Biomédica , Mineração de Dados , Redes Neurais de Computação , Semântica , Software , Benchmarking , Humanos
5.
Bioinformatics ; 36(17): 4576-4582, 2020 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-32467966

RESUMO

MOTIVATION: RNA secondary structure plays a vital role in fundamental cellular processes, and identification of RNA secondary structure is a key step to understand RNA functions. Recently, a few experimental methods were developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage. RESULTS: Here, we have developed a new method for the prediction of genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves predictions with areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and AUC of 0.888 by another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those by the CROSS method recently developed based on a shallow neural network. Further analysis on the 1000 Genome Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous, non-synonymous mutations, and mutations in untranslated regions, which were higher than those generated by RNAplfold. Moreover, the prediction over all human mRNA indicated a consistent result with previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such model trained on genome-wide experimental data might be an alternative for analytical methods. AVAILABILITY AND IMPLEMENTATION: The GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP. SUPPLEMENTARY INFORMATION: Supplementary data are available online.


Assuntos
Infecção por Zika virus , Zika virus , Sequência de Bases , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Redes Neurais de Computação , RNA/genética , Software
6.
J Chem Inf Model ; 60(1): 47-55, 2020 01 27.
Artigo em Inglês | MEDLINE | ID: mdl-31825611

RESUMO

Synthesis planning is the process of recursively decomposing target molecules into available precursors. Computer-aided retrosynthesis can potentially assist chemists in designing synthetic routes; however, at present, it is cumbersome and cannot provide satisfactory results. In this study, we have developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis using transformer neural networks. In the method, the retrosynthesis planning was converted to a machine translation problem from the products to molecular linear notations of the reactants. By coupling with a neural network-based syntax corrector, our method achieved an accuracy of 59.0% on a standard benchmark data set, which outperformed other deep learning methods by >21% and template-based methods by >6%. More importantly, our method was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.


Assuntos
Técnicas de Química Sintética/métodos , Redes Neurais de Computação , Conjuntos de Dados como Assunto
7.
Comput Biol Med ; 170: 108048, 2024 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38310804

RESUMO

Illuminating associations between diseases and genes can help reveal the pathogenesis of syndromes and contribute to treatments, but a large number of associations remained unexplored. To identify novel disease-gene associations, many computational methods have been developed using disease and gene-related prior knowledge. However, these methods remain of relatively inferior performance due to the limited external data sources and the inevitable noise among the prior knowledge. In this study, we have developed a new method, Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs. The noises within the collaborative graphs were eliminated by maximizing the mutual information between nodes and neighbors through a graphical mutual infomax layer. In parallel, the node interactions were strengthened by a novel informative message passing layer to improve the learning ability of graph neural network. The extensive experiments showed that our model achieved performance improvement over the state-of-art method by more than 8 % on AUC. The datasets, source codes and trained models of MiGCN are available at https://github.com/biomed-AI/MiGCN.


Assuntos
Aprendizagem , Redes Neurais de Computação , Humanos , Software , Síndrome
8.
Nat Commun ; 15(1): 4476, 2024 May 25.
Artigo em Inglês | MEDLINE | ID: mdl-38796523

RESUMO

Protein functions are characterized by interactions with proteins, drugs, and other biomolecules. Understanding these interactions is essential for deciphering the molecular mechanisms underlying biological processes and developing new therapeutic strategies. Current computational methods mostly predict interactions based on either molecular network or structural information, without integrating them within a unified multi-scale framework. While a few multi-view learning methods are devoted to fusing the multi-scale information, these methods tend to rely intensively on a single scale and under-fitting the others, likely attributed to the imbalanced nature and inherent greediness of multi-scale learning. To alleviate the optimization imbalance, we present MUSE, a multi-scale representation learning framework based on a variant expectation maximization to optimize different scales in an alternating procedure over multiple iterations. This strategy efficiently fuses multi-scale information between atomic structure and molecular network scale through mutual supervision and iterative optimization. MUSE outperforms the current state-of-the-art models not only in molecular interaction (protein-protein, drug-protein, and drug-drug) tasks but also in protein interface prediction at the atomic structure scale. More importantly, the multi-scale learning framework shows potential for extension to other scales of computational drug discovery.


Assuntos
Biologia Computacional , Proteínas , Proteínas/química , Proteínas/metabolismo , Biologia Computacional/métodos , Algoritmos , Preparações Farmacêuticas/química , Preparações Farmacêuticas/metabolismo , Aprendizado de Máquina , Interações Medicamentosas , Humanos , Ligação Proteica
9.
Adv Sci (Weinh) ; 11(32): e2404845, 2024 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-39031820

RESUMO

Constructing discriminative representations of molecules lies at the core of a number of domains such as drug discovery, chemistry, and medicine. State-of-the-art methods employ graph neural networks and self-supervised learning (SSL) to learn unlabeled data for structural representations, which can then be fine-tuned for downstream tasks. Albeit powerful, these methods are pre-trained solely on molecular structures and thus often struggle with tasks involved in intricate biological processes. Here, it is proposed to assist the learning of molecular representation by using the perturbed high-content cell microscopy images at the phenotypic level. To incorporate the cross-modal pre-training, a unified framework is constructed to align them through multiple types of contrastive loss functions, which is proven effective in the formulated novel tasks to retrieve the molecules and corresponding images mutually. More importantly, the model can infer functional molecules according to cellular images generated by genetic perturbations. In parallel, the proposed model can transfer non-trivially to molecular property predictions, and has shown great improvement over clinical outcome predictions. These results suggest that such cross-modality learning can bridge molecules and phenotype to play important roles in drug discovery.


Assuntos
Redes Neurais de Computação , Humanos , Processamento de Imagem Assistida por Computador/métodos , Microscopia/métodos , Algoritmos , Aprendizado de Máquina
10.
Nat Comput Sci ; 2024 Nov 06.
Artigo em Inglês | MEDLINE | ID: mdl-39506080

RESUMO

RNAs represent a class of programmable biomolecules capable of performing diverse biological functions. Recent studies have developed accurate RNA three-dimensional structure prediction methods, which may enable new RNAs to be designed in a structure-guided manner. Here, we develop a structure-to-sequence deep learning platform for the de novo generative design of RNA aptamers. We show that our approach can design RNA aptamers that are predicted to be structurally similar, yet sequence dissimilar, to known light-up aptamers that fluoresce in the presence of small molecules. We experimentally validate several generated RNA aptamers to have fluorescent activity, show that these aptamers can be optimized for activity in silico, and find that they exhibit a mechanism of fluorescence similar to that of known light-up aptamers. Our results demonstrate how structural predictions can guide the targeted and resource-efficient design of new RNA sequences.

11.
iScience ; 26(9): 107663, 2023 Sep 15.
Artigo em Inglês | MEDLINE | ID: mdl-37670781

RESUMO

The medicinal effect of a drug acts through a series of genes, and the pathological mechanism of a disease is also related to genes with certain biological functions. However, the complex information between drug or disease and a series of genes is neglected by traditional message passing methods. In this study, we proposed a new framework using two different strategies for gene-drug/disease and drug-disease networks, respectively. We employ long short-term memory (LSTM) network to extract the flow of message from series of genes (gene path) to drug/disease. Incorporating the resulting information of gene paths into drug-disease network, we utilize graph convolutional network (GCN) to predict drug-disease associations. Experimental results showed that our method GeneDR (gene-based drug repurposing) makes better use of the information in gene paths, and performs better in predicting drug-disease associations.

12.
Patterns (N Y) ; 3(12): 100653, 2022 Dec 09.
Artigo em Inglês | MEDLINE | ID: mdl-36569549

RESUMO

Jiahua Rao and Shuangjia Zheng are Ph.D. students in Prof. Yang's lab (Supercomputing And AI for Life science, SAIL Lab) at Sun Yat-sen University. They recently developed an interpretable framework to quantitatively assess the interpretability of Graph Neural Network (GNN) and made comparison with medicinal chemists. Their meaningful benchmarking and rigorous framework would greatly benefit development of new interpretable methods in GNNs.

13.
Patterns (N Y) ; 3(12): 100628, 2022 Dec 09.
Artigo em Inglês | MEDLINE | ID: mdl-36569553

RESUMO

Graph neural networks (GNNs) have received increasing attention because of their expressive power on topological data, but they are still criticized for their lack of interpretability. To interpret GNN models, explainable artificial intelligence (XAI) methods have been developed. However, these methods are limited to qualitative analyses without quantitative assessments from the real-world datasets due to a lack of ground truths. In this study, we have established five XAI-specific molecular property benchmarks, including two synthetic and three experimental datasets. Through the datasets, we quantitatively assessed six XAI methods on four GNN models and made comparisons with seven medicinal chemists of different experience levels. The results demonstrated that XAI methods could deliver reliable and informative answers for medicinal chemists in identifying the key substructures. Moreover, the identified substructures were shown to complement existing classical fingerprints to improve molecular property predictions, and the improvements increased with the growth of training data.

14.
iScience ; 24(5): 102393, 2021 May 21.
Artigo em Inglês | MEDLINE | ID: mdl-33997678

RESUMO

Single-cell RNA sequencing technology promotes the profiling of single-cell transcriptomes at an unprecedented throughput and resolution. However, in scRNA-seq studies, only a low amount of sequenced mRNA in each cell leads to missing detection for a portion of mRNA molecules, i.e. the dropout problem which hinders various downstream analyses. Therefore, it is necessary to develop robust and effective imputation methods for the increasing scRNA-seq data. In this study, we have developed an imputation method (GraphSCI) to impute the dropout events in scRNA-seq data based on the graph convolution networks. Extensive experiments demonstrated that GraphSCI outperforms other state-of-the-art methods for imputation on both simulated and real scRNA-seq data. Meanwhile, GraphSCI is able to accurately infer gene-to-gene relationships and the inferred gene-to-gene relationships could also provide powerful assistance for imputation dynamically during the training process, which is a key promotion of GraphSCI compared with other imputation algorithms.

15.
Comput Biol Med ; 134: 104481, 2021 07.
Artigo em Inglês | MEDLINE | ID: mdl-33989895

RESUMO

BACKGROUND: Genomic information is nowadays widely used for precise cancer treatments. Since the individual type of omics data only represents a single view that suffers from data noise and bias, multiple types of omics data are required for accurate cancer prognosis prediction. However, it is challenging to effectively integrate multi-omics data due to the large number of redundant variables but relatively small sample size. With the recent progress in deep learning techniques, Autoencoder was used to integrate multi-omics data for extracting representative features. Nevertheless, the generated model is fragile from data noises. Additionally, previous studies usually focused on individual cancer types without making comprehensive tests on pan-cancer. Here, we employed the denoising Autoencoder to get a robust representation of the multi-omics data, and then used the learned representative features to estimate patients' risks. RESULTS: By applying to 15 cancers from The Cancer Genome Atlas (TCGA), our method was shown to improve the C-index values over previous methods by 6.5% on average. Considering the difficulty to obtain multi-omics data in practice, we further used only mRNA data to fit the estimated risks by training XGboost models, and found the models could achieve an average C-index value of 0.627. As a case study, the breast cancer prognosis prediction model was independently tested on three datasets from the Gene Expression Omnibus (GEO), and shown able to significantly separate high-risk patients from low-risk ones (C-index>0.6, p-values<0.05). Based on the risk subgroups divided by our method, we identified nine prognostic markers highly associated with breast cancer, among which seven genes have been proved by literature review. CONCLUSION: Our comprehensive tests indicated that we have constructed an accurate and robust framework to integrate multi-omics data for cancer prognosis prediction. Moreover, it is an effective way to discover cancer prognosis-related genes.


Assuntos
Neoplasias da Mama , Aprendizado Profundo , Neoplasias da Mama/genética , Feminino , Perfilação da Expressão Gênica , Genômica , Humanos , Oncogenes
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA