1.
Bioinformatics ; 39(39 Suppl 1): i465-i474, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387160

ABSTRACT

MOTIVATION: Despite the advances in sequencing technology, a massive number of proteins with known sequences remain functionally unannotated. Biological network alignment (NA), which aims to find the node correspondence between species' protein-protein interaction (PPI) networks, has been a popular strategy to uncover missing annotations by transferring functional knowledge across species. Traditional NA methods assumed that topologically similar proteins in PPIs are functionally similar. However, it was recently reported that functionally unrelated proteins can be as topologically similar as functionally related pairs, and a new data-driven or supervised NA paradigm has been proposed, which uses protein function data to discern which topological features correspond to functional relatedness. RESULTS: Here, we propose GraNA, a deep learning framework for the supervised NA paradigm for the pairwise NA problem. Employing graph neural networks, GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence between proteins across species. A major strength of GraNA is its flexibility to integrate multi-faceted non-functional relationship data, such as sequence similarity and ortholog relationships, as anchor links to guide the mapping of functionally related proteins across species. Evaluating GraNA on a benchmark dataset composed of several NA tasks between different pairs of species, we observed that GraNA accurately predicted the functional relatedness of proteins and robustly transferred functional annotations across species, outperforming a number of existing NA methods. When applied to a case study on a humanized yeast network, GraNA also successfully discovered functionally replaceable human-yeast protein pairs that were documented in previous studies. AVAILABILITY AND IMPLEMENTATION: The code of GraNA is available at https://github.com/luo-group/GraNA.


Subjects
Benchmarking , Fungal Proteins , Humans , Neural Networks, Computer , Protein Interaction Maps
2.
Elife ; 12, 2023 05 02.
Article in English | MEDLINE | ID: mdl-37129358

ABSTRACT

Hearing and vision sensory systems are tuned to the natural statistics of acoustic and electromagnetic energy on earth and are evolved to be sensitive in ethologically relevant ranges. But what are the natural statistics of odors, and how do olfactory systems exploit them? Dissecting an accurate machine learning model (Lee et al., 2022) for human odor perception, we find a computable representation for odor at the molecular level that can predict the odor-evoked receptor, neural, and behavioral responses of nearly all terrestrial organisms studied in olfactory neuroscience. Using this olfactory representation (principal odor map [POM]), we find that odorous compounds with similar POM representations are more likely to co-occur within a substance and be metabolically closely related; metabolic reaction sequences (Caspi et al., 2014) also follow smooth paths in POM despite large jumps in molecular structure. Just as the brain's visual representations have evolved around the natural statistics of light and shapes, the natural statistics of metabolism appear to shape the brain's representation of the olfactory world.


Subjects
Olfactory Perception , Receptors, Odorant , Humans , Olfactory Perception/physiology , Olfactory Pathways/physiology , Smell/physiology , Odorants
3.
Science ; 379(6639): 1358-1363, 2023 03 31.
Article in English | MEDLINE | ID: mdl-36996195

ABSTRACT

Enzyme function annotation is a fundamental challenge, and numerous computational tools have been developed. However, most of these tools cannot accurately predict functional annotations, such as enzyme commission (EC) number, for less-studied proteins or those with previously uncharacterized functions or multiple activities. We present a machine learning algorithm named CLEAN (contrastive learning-enabled enzyme annotation) to assign EC numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp. The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers, functions that we demonstrate by systematic in silico and in vitro experiments. We anticipate that this tool will be widely used for predicting the functions of uncharacterized enzymes, thereby advancing many fields, such as genomics, synthetic biology, and biocatalysis.


Subjects
Enzymes , Machine Learning , Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Algorithms , Computational Biology , Enzymes/chemistry , Genomics , Proteins/chemistry , Reproducibility of Results , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Biocatalysis
4.
Pac Symp Biocomput ; 28: 109-120, 2023.
Article in English | MEDLINE | ID: mdl-36540969

ABSTRACT

Although protein sequence data is growing at an ever-increasing rate, the protein universe is still sparsely annotated with functional and structural annotations. Computational approaches have become efficient solutions to infer annotations for unlabeled proteins by transferring knowledge from proteins with experimental annotations. Despite the increasing availability of protein structure data and the high coverage of high-quality predicted structures, e.g., by AlphaFold, many existing computational tools still only rely on sequence data to predict structural or functional annotations, including alignment algorithms such as BLAST and several sequence-based deep learning models. Here, we develop PenLight, a general deep learning framework for protein structural and functional annotations. PenLight uses a graph neural network (GNN) to integrate 3D protein structure data and protein language model representations. In addition, PenLight applies a contrastive learning strategy to train the GNN for learning protein representations that reflect similarities beyond sequence identity, such as semantic similarities in the function or structure space. We benchmarked PenLight on a structural classification task and a functional annotation task, where PenLight achieved higher prediction accuracy and coverage than state-of-the-art methods.


Subjects
Computational Biology , Neural Networks, Computer , Humans , Algorithms , Proteins/chemistry
5.
Nat Comput Sci ; 3(2): 124-125, 2023 Feb.
Article in English | MEDLINE | ID: mdl-38177630
6.
Bioinformatics ; 38(6): 1607-1614, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34999749

ABSTRACT

MOTIVATION: Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell-type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps for single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most of the existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this article, we proposed scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell-type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset. RESULTS: We evaluated scPretrain on 60 diverse datasets from different technologies, species and organs, and obtained a significant improvement on both cell-type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers, such as random forest, logistic regression and support-vector machines. scPretrain is able to effectively utilize the massive amount of unlabeled data and be applied to annotating increasingly generated scRNA-seq datasets. AVAILABILITY AND IMPLEMENTATION: The data and code underlying this article are available in scPretrain: Multi-task self-supervised learning for cell type classification, at https://github.com/ruiyi-zhang/scPretrain and https://zenodo.org/record/5802306. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Random Forest , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Support Vector Machine
8.
Nat Commun ; 12(1): 5743, 2021 09 30.
Article in English | MEDLINE | ID: mdl-34593817

ABSTRACT

Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. We show that ECNet predicts the sequence-function relationship more accurately as compared to existing machine learning algorithms by using ~50 deep mutational scanning and random mutagenesis datasets. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.


Subjects
Deep Learning , Evolution, Molecular , Protein Engineering/methods , Amino Acid Sequence/genetics , Datasets as Topic , Genetic Fitness , High-Throughput Screening Assays , Mutation , Sequence Homology, Amino Acid , beta-Lactam Resistance/genetics , beta-Lactamases/genetics
9.
PLoS Comput Biol ; 17(8): e1009284, 2021 08.
Article in English | MEDLINE | ID: mdl-34347784

ABSTRACT

Modeling the impact of amino acid mutations on protein-protein interaction plays a crucial role in protein engineering and drug design. In this study, we develop GeoPPI, a novel structure-based deep-learning framework to predict the change of binding affinity upon mutations. Based on the three-dimensional structure of a protein, GeoPPI first learns a geometric representation that encodes topology features of the protein structure via a self-supervised learning scheme. These representations are then used as features for training gradient-boosting trees to predict the changes of protein-protein binding affinity upon mutations. We find that GeoPPI is able to learn meaningful features that characterize interactions between atoms in protein structures. In addition, through extensive experiments, we show that GeoPPI achieves new state-of-the-art performance in predicting the binding affinity changes upon both single- and multi-point mutations on six benchmark datasets. Moreover, we show that GeoPPI can accurately estimate the difference of binding affinities between a few recently identified SARS-CoV-2 antibodies and the receptor-binding domain (RBD) of the S protein. These results demonstrate the potential of GeoPPI as a powerful and useful computational tool in protein design and engineering. Our code and datasets are available at: https://github.com/Liuxg16/GeoPPI.


Subjects
Amino Acid Substitution , Models, Chemical , Proteins/metabolism , Point Mutation , Protein Binding , Proteins/chemistry , Proteins/genetics
10.
Nat Cancer ; 2(2): 233-244, 2021 02.
Article in English | MEDLINE | ID: mdl-34223192

ABSTRACT

Cell-line screens create expansive datasets for learning predictive markers of drug response, but these models do not readily translate to the clinic with its diverse contexts and limited data. In the present study, we apply a recently developed technique, few-shot machine learning, to train a versatile neural network model in cell lines that can be tuned to new contexts using few additional samples. The model quickly adapts when switching among different tissue types and in moving from cell-line models to clinical contexts, including patient-derived tumor cells and patient-derived xenografts. It can also be interpreted to identify the molecular features most important to a drug response, highlighting critical roles for RB1 and SMAD4 in the response to CDK inhibition and RNF8 and CHD4 in the response to ATM inhibition. The few-shot learning framework provides a bridge from the many samples surveyed in high-throughput screens (n-of-many) to the distinctive contexts of individual patients (n-of-one).


Subjects
Machine Learning , Neural Networks, Computer , DNA-Binding Proteins , Humans , Ubiquitin-Protein Ligases
11.
Signal Transduct Target Ther ; 6(1): 165, 2021 04 24.
Article in English | MEDLINE | ID: mdl-33895786

ABSTRACT

The global spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) creates an urgent need to find effective therapeutics for the treatment of coronavirus disease 2019 (COVID-19). In this study, we developed an integrative drug repositioning framework, which takes full advantage of machine learning and statistical analysis approaches to systematically integrate and mine large-scale knowledge graph, literature and transcriptome data to discover potential drug candidates against SARS-CoV-2. Our in silico screening followed by wet-lab validation indicated that a poly-ADP-ribose polymerase 1 (PARP1) inhibitor, CVL218, currently in a Phase I clinical trial, may be repurposed to treat COVID-19. Our in vitro assays revealed that CVL218 can exhibit effective inhibitory activity against SARS-CoV-2 replication without obvious cytopathic effect. In addition, we showed that CVL218 can interact with the nucleocapsid (N) protein of SARS-CoV-2 and is able to suppress the LPS-induced production of several inflammatory cytokines that are highly relevant to the prevention of immunopathology induced by SARS-CoV-2 infection.


Subjects
Antiviral Agents/therapeutic use , COVID-19 Drug Treatment , COVID-19/metabolism , Computer Simulation , Drug Repositioning , Models, Biological , SARS-CoV-2/metabolism , Humans
12.
PLoS Comput Biol ; 15(9): e1007283, 2019 09.
Article in English | MEDLINE | ID: mdl-31483777

ABSTRACT

Predicting RNA-binding protein (RBP) specificity is important for understanding gene expression regulation and RNA-mediated enzymatic processes. It is widely believed that RBP binding specificity is determined by both the sequence and structural contexts of RNAs. Existing approaches, including traditional machine learning algorithms and, more recently, deep learning models, have been extensively applied to integrate RNA sequence and its predicted or experimental RNA structural probabilities for improving the accuracy of RBP binding prediction. Such models were trained mostly on large-scale in vitro datasets, such as the RNAcompete dataset. However, in RNAcompete, most synthetic RNAs are unstructured, which prevents machine learning methods from effectively extracting RBP-binding structural preferences. Furthermore, RNA structure may be variable or multi-modal according to both theoretical and experimental evidence. In this work, we propose ThermoNet, a thermodynamic prediction model that integrates a new sequence-embedding convolutional neural network model over a thermodynamic ensemble of RNA secondary structures. First, the sequence-embedding convolutional neural network generalizes the existing k-mer based methods by jointly learning convolutional filters and k-mer embeddings to represent RNA sequence contexts. Second, the thermodynamic average of deep-learning predictions is able to explore structural variability and improves the prediction, especially for structured RNAs. Extensive experiments demonstrate that our method significantly outperforms existing approaches, including RCK, DeepBind and several other recent state-of-the-art methods for predictions on both in vitro and in vivo data. The implementation of ThermoNet is available at https://github.com/suyufeng/ThermoNet.


Subjects
Computational Biology/methods , Gene Expression Regulation/genetics , RNA-Binding Proteins , RNA , Sequence Analysis, RNA/methods , Algorithms , Deep Learning , Humans , Protein Binding/genetics , RNA/chemistry , RNA/genetics , RNA/metabolism , RNA-Binding Proteins/chemistry , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Thermodynamics
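
The "thermodynamic average of deep-learning predictions" mentioned in the ThermoNet abstract above can be illustrated with a minimal sketch. It assumes that each candidate secondary structure comes with a free energy and a per-structure binding score, and that structures are weighted by Boltzmann probabilities; the function names, the temperature default, and the direct use of free energies are illustrative assumptions, not the published ThermoNet implementation (which may, for example, sample structures from the ensemble instead).

```python
import math

# Gas constant in kcal/(mol*K), the unit commonly used for RNA folding energies.
GAS_CONSTANT = 0.0019872


def boltzmann_weights(energies_kcal, temperature_k=310.15):
    """Return normalized Boltzmann weights for a list of free energies.

    Hypothetical helper: lower (more negative) energy means higher weight.
    """
    rt = GAS_CONSTANT * temperature_k
    # Subtract the minimum energy first so the exponentials stay numerically stable.
    e_min = min(energies_kcal)
    unnormalized = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def thermodynamic_average(predictions, energies_kcal, temperature_k=310.15):
    """Average per-structure predictions, weighted by Boltzmann probability."""
    if len(predictions) != len(energies_kcal):
        raise ValueError("need exactly one energy per prediction")
    weights = boltzmann_weights(energies_kcal, temperature_k)
    return sum(w * p for w, p in zip(weights, predictions))


# Made-up scores for three alternative structures of one RNA (assumed inputs).
print(thermodynamic_average([0.9, 0.4, 0.1], [-7.2, -6.8, -3.1]))
```

The weighted average lets a prediction reflect several plausible structures rather than only the single minimum-free-energy fold.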
13.
Bioinformatics ; 35(2): 219-226, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30010790

ABSTRACT

Motivation: Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results: On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges. Availability and implementation: Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Genome, Microbial , Metagenomics , Software , Computational Biology , Sequence Analysis, DNA
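
The idea of low-density, even-coverage hashing from the Opal abstract above can be sketched as follows. Each hash function reads only a few positions of a k-mer (low density), and the position sets are chosen so that every position is used equally often (even coverage). This sketch gets even coverage by partitioning shuffled permutations of the positions, which is a simple stand-in and not the Gallager-code construction the authors describe; all function and parameter names are hypothetical.

```python
import random


def even_coverage_position_sets(k, width, rounds, seed=0):
    """Build position subsets of size `width` over a k-mer of length `k`.

    Each round shuffles all k positions and cuts them into k // width
    disjoint chunks, so every position is used exactly once per round.
    """
    if k % width != 0:
        raise ValueError("width must divide k in this simplified sketch")
    rng = random.Random(seed)
    position_sets = []
    for _ in range(rounds):
        permutation = list(range(k))
        rng.shuffle(permutation)
        for start in range(0, k, width):
            position_sets.append(tuple(sorted(permutation[start:start + width])))
    return position_sets


def lsh_features(kmer, position_sets):
    """Hash a k-mer into one short string per position set."""
    return ["".join(kmer[i] for i in positions) for positions in position_sets]


sets = even_coverage_position_sets(k=8, width=2, rounds=3, seed=1)
read_a = lsh_features("ACGTACGT", sets)
read_b = lsh_features("ACGTACGA", sets)  # differs from read_a only at the last base
shared = sum(x == y for x, y in zip(read_a, read_b))
print(f"{shared} of {len(sets)} features agree")
```

Because each feature looks at only a few positions, a single mismatch changes only the features that cover that position, so near-identical k-mers still share most of their features.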
14.
Pac Symp Biocomput ; 23: 44-55, 2018.
Article in English | MEDLINE | ID: mdl-29218868

ABSTRACT

A variety of large-scale pharmacogenomic data, such as perturbation experiments and sensitivity profiles, enable the systematic identification of drug mechanisms of action (MoAs), which is a crucial task in the era of precision medicine. However, integrating these complementary pharmacogenomic datasets is inherently challenging due to the wide heterogeneity, high dimensionality and noisy nature of these datasets. In this work, we develop Mania, a novel method for the scalable integration of large-scale pharmacogenomic data. Mania first constructs a drug-drug similarity network through integrating multiple heterogeneous data sources, including drug sensitivity, drug chemical structure, and perturbation assays. It then learns a compact vector representation for each drug to simultaneously encode its structural and pharmacogenomic properties. Extensive experiments demonstrate that Mania achieves substantially improved performance in both MoA and target prediction, compared to predictions based on individual data sources as well as a state-of-the-art integrative method. Moreover, Mania identifies drugs that target frequently mutated cancer genes, which provides novel insights into drug repurposing.


Subjects
Pharmacogenetics/statistics & numerical data , Algorithms , Computational Biology/methods , Databases, Pharmaceutical/statistics & numerical data , Drug Repositioning/statistics & numerical data , Drug Screening Assays, Antitumor/statistics & numerical data , Humans , Molecular Structure , Precision Medicine , Systems Integration
15.
Nat Commun ; 8(1): 573, 2017 09 18.
Article in English | MEDLINE | ID: mdl-28924171

ABSTRACT

The emergence of large-scale genomic, chemical and pharmacological data provides new opportunities for drug discovery and repositioning. In this work, we develop a computational pipeline, called DTINet, to predict novel drug-target interactions from a constructed heterogeneous network, which integrates diverse drug-related information. DTINet focuses on learning a low-dimensional vector representation of features, which accurately explains the topological properties of individual nodes in the heterogeneous network, and then makes prediction based on these representations via a vector space projection scheme. DTINet achieves substantial performance improvement over other state-of-the-art methods for drug-target interaction prediction. Moreover, we experimentally validate the novel interactions between three drugs and the cyclooxygenase proteins predicted by DTINet, and demonstrate the new potential applications of these identified cyclooxygenase inhibitors in preventing inflammatory diseases. These results indicate that DTINet can provide a practically useful tool for integrating heterogeneous information to predict new drug-target interactions and repurpose existing drugs. Network-based data integration for drug-target prediction is a promising avenue for drug repositioning, but performance is wanting. Here, the authors introduce DTINet, whose performance is enhanced in the face of noisy, incomplete and high-dimensional biological data by learning low-dimensional vector representations.


Subjects
Algorithms , Computational Biology/methods , Drug Repositioning/methods , Pharmaceutical Preparations/metabolism , Proteins/metabolism , Animals , Celecoxib/chemistry , Celecoxib/metabolism , Cells, Cultured , Cyclooxygenase 2/chemistry , Cyclooxygenase 2/metabolism , Humans , Mice, Inbred C57BL , Models, Molecular , Pharmaceutical Preparations/chemistry , Protein Binding , Protein Domains , Proteins/chemistry
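
The "vector space projection scheme" in the DTINet abstract above can be sketched in a few lines. This sketch assumes that a learned matrix maps a drug's low-dimensional vector into the target vector space and that the interaction score is the inner product with each target vector; the bilinear form, the function names, and the toy numbers are illustrative assumptions, not the authors' code.

```python
def project(drug_vec, projection):
    """Map a drug vector into target space: result[c] = sum_r drug[r] * P[r][c]."""
    n_cols = len(projection[0])
    return [
        sum(drug_vec[r] * projection[r][c] for r in range(len(drug_vec)))
        for c in range(n_cols)
    ]


def interaction_score(drug_vec, projection, target_vec):
    """Inner product between the projected drug vector and a target vector."""
    projected = project(drug_vec, projection)
    return sum(p * t for p, t in zip(projected, target_vec))


def rank_targets(drug_vec, projection, targets):
    """Return target names ordered from highest to lowest predicted score."""
    return sorted(
        targets,
        key=lambda name: interaction_score(drug_vec, projection, targets[name]),
        reverse=True,
    )


# Toy inputs (hypothetical): one 2-d drug embedding, a 2x2 projection, two targets.
drug = [1.0, 0.0]
projection = [[0.0, 1.0], [1.0, 0.0]]
targets = {"target_A": [1.0, 0.0], "target_B": [0.0, 1.0]}
print(rank_targets(drug, projection, targets))
```

In such a scheme the projection matrix would be fitted to known drug-target pairs, so that projected drug vectors land close to the vectors of the targets they bind.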