ABSTRACT
MOTIVATION: Predictive models of DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, is highly influential in the chromatin profile. Deep neural networks have achieved state of the art performance on chromatin profile prediction by using short windows of DNA sequences independently. These methods, however, ignore the long-range dependencies when predicting the chromatin profiles because modeling the 3D genome is challenging. RESULTS: In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction by fusing both local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent and identically distributed assumption of local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the modeling, allowing us to identify and interpret those important long-range dependencies in influencing chromatin profiles. We show experimentally that by fusing sequential and 3D genome data using ChromeGCN, we get a significant improvement over the state-of-the-art deep learning methods as indicated by three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in those DNA windows that have a high degree of interactions with other DNA windows. AVAILABILITY AND IMPLEMENTATION: https://github.com/QData/ChromeGCN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
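The fusion of local window features with long-range contacts described above can be illustrated with a generic graph-convolution step. This is a minimal sketch of a standard normalized-adjacency layer, not ChromeGCN's actual implementation; the function names, the toy contact graph, and the identity weight matrix are all hypothetical and chosen only for illustration.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so each window keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(A_norm @ features @ weight)."""
    a_norm = normalize_adjacency(adj)
    mixed = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in mixed]

# Hypothetical toy input: three DNA windows, where windows 0 and 2 are in
# 3D contact and window 1 has no long-range contacts.
adj = [[0, 0, 1],
       [0, 0, 0],
       [1, 0, 0]]
features = [[1.0, 0.0],   # per-window features, e.g. from a sequence model
            [0.0, 1.0],
            [0.0, 2.0]]
weight = [[1.0, 0.0],     # identity weights keep the example easy to follow
          [0.0, 1.0]]

out = gcn_layer(adj, features, weight)
```

In this toy case the two windows in contact end up with the same averaged representation, while the isolated window keeps its own features unchanged, which is the sense in which the independence assumption between windows is relaxed.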
Subject(s)
Genome , Neural Networks, Computer , Chromatin/genetics , Epigenesis, Genetic , Epigenomics
ABSTRACT
MOTIVATION: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size. RESULTS: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. AVAILABILITY AND IMPLEMENTATION: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
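The idea of decomposing a gapped k-mer kernel into independent counting operations, and then sampling those operations, can be sketched as follows. This is an illustrative toy under stated assumptions, not the FastSK package's API or algorithm: all function names are hypothetical, the kernel is unnormalized, and the sum runs over which k of the g positions are kept (equivalently, which g - k positions are treated as gaps).

```python
from collections import Counter
from itertools import combinations
import math
import random

def gmers(seq, g):
    """All contiguous substrings of length g."""
    return [seq[i:i + g] for i in range(len(seq) - g + 1)]

def count_for_positions(x, y, g, positions):
    """Count pairs of g-mers (one from x, one from y) that agree
    at the kept positions; the remaining positions are ignored."""
    cx = Counter(tuple(s[p] for p in positions) for s in gmers(x, g))
    cy = Counter(tuple(s[p] for p in positions) for s in gmers(y, g))
    return sum(n * cy[key] for key, n in cx.items())

def gapped_kmer_kernel(x, y, g, k):
    """Exact value: one independent counting pass per choice of k kept positions."""
    return sum(count_for_positions(x, y, g, pos)
               for pos in combinations(range(g), k))

def sampled_gapped_kmer_kernel(x, y, g, k, n_samples, seed=0):
    """Monte Carlo estimate: average over randomly drawn position subsets,
    rescaled by the total number of subsets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        pos = tuple(sorted(rng.sample(range(g), k)))
        total += count_for_positions(x, y, g, pos)
    return total / n_samples * math.comb(g, k)

exact = gapped_kmer_kernel("ACGT", "ACGA", g=3, k=2)
estimate = sampled_gapped_kmer_kernel("ACGT", "ACGA", g=3, k=2, n_samples=200)
```

Because each position subset is counted independently, the exact sum can be replaced by an average over a random sample of subsets, which is what makes an approximation of this kind cheap when the number of subsets grows large.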
Subject(s)
Sequence Analysis, Protein , Support Vector Machine , Algorithms , Proteins , Software
ABSTRACT
Motivation: Computational methods that predict differential gene expression from histone modification signals are highly desirable for understanding how histone modifications control the functional heterogeneity of cells through influencing differential gene regulation. Recent studies either failed to capture combinatorial effects on differential prediction or focused primarily on cell type-specific analysis. In this paper we develop a novel attention-based deep learning architecture, DeepDiff, that provides a unified and end-to-end solution to model and to interpret how dependencies among histone modifications control the differential patterns of gene regulation. DeepDiff uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the spatial structure of input signals and to model automatically how various histone modifications cooperate. We introduce and train two levels of attention jointly with the target prediction, enabling DeepDiff to attend differentially to relevant modifications and to locate important genome positions for each modification. Additionally, DeepDiff introduces a novel deep-learning based multi-task formulation to use the cell-type-specific gene expression predictions as auxiliary tasks, encouraging richer feature embeddings in our primary task of differential expression prediction. Results: Using data from the Roadmap Epigenomics Project (REMC) for ten different pairs of cell types, we show that DeepDiff significantly outperforms the state-of-the-art baselines for differential gene expression prediction. The learned attention weights are validated by observations from previous studies about how epigenetic mechanisms connect to differential gene expression. Availability and implementation: Code and results are available at deepchrome.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Gene Expression , Histones/metabolism , Machine Learning , Histone Code , Humans , Protein Processing, Post-Translational , Software
ABSTRACT
Gene fusions and their products (RNA and protein) were once thought to be features unique to cancer. However, chimeric RNAs can also be found in normal cells. Here, we performed, curated and analyzed nearly 300 RNA-Seq libraries covering 30 different non-neoplastic human tissues and cells as well as 15 mouse tissues. A large number of fusion transcripts were found. Most fusions were detected only once, while 291 were seen in more than one sample. We focused on the recurrent fusions and performed RNA and protein level validations on a subset. We characterized these fusions based on various features of the fusions and their parental genes. Recurrent fusions tend to be expressed at higher levels relative to their parental genes than non-recurrent ones. Over half of the recurrent fusions involve neighboring genes transcribed in the same direction. A few sequence motifs were found to be enriched close to the fusion junction sites. We performed functional analyses on a few widely expressed fusions, and found that silencing them resulted in a dramatic reduction in normal cell growth and/or motility. Most chimeras use canonical splicing sites, and are thus likely products of 'intergenic splicing'. We also explored the implications of these non-pathological fusions in cancer and in evolution.
Subject(s)
Fibroblasts/metabolism , Gene Fusion , Mesenchymal Stem Cells/metabolism , RNA Splicing , RNA, Messenger/genetics , Animals , Astrocytes/cytology , Astrocytes/metabolism , Base Sequence , Cell Line, Transformed , Computational Biology , Evolution, Molecular , Fibroblasts/cytology , Gene Library , Gene Silencing , High-Throughput Nucleotide Sequencing , Humans , Mesenchymal Stem Cells/cytology , Mice , Molecular Sequence Data , Primary Cell Culture , RNA, Messenger/antagonists & inhibitors , RNA, Messenger/metabolism , RNA, Small Interfering/genetics , RNA, Small Interfering/metabolism , Sequence Analysis, RNA , Species Specificity
ABSTRACT
MOTIVATION: Histone modifications are among the most important factors that control gene regulation. Computational methods that predict gene expression from histone modification signals are highly desirable for understanding their combinatorial effects in gene regulation. This knowledge can help in developing 'epigenetic drugs' for diseases like cancer. Previous studies quantifying the relationship between histone modifications and gene expression levels either failed to capture combinatorial effects or relied on multiple methods that separate predictions and combinatorial analysis. This paper develops a unified discriminative framework using a deep convolutional neural network to classify gene expression using histone modification data as input. Our system, called DeepChrome, allows automatic extraction of complex interactions among important features. To simultaneously visualize the combinatorial interactions among histone modifications, we propose a novel optimization-based technique that generates feature pattern maps from the learnt deep model. This provides an intuitive description of underlying epigenetic mechanisms that regulate genes. RESULTS: We show that DeepChrome outperforms state-of-the-art models like Support Vector Machines and Random Forests on the gene expression classification task across 56 different cell types from the REMC database. The output of our visualization technique not only validates previous observations but also allows novel insights about combinatorial interactions among histone modification marks, some of which have recently been observed by experimental studies. AVAILABILITY AND IMPLEMENTATION: Code and results are available at www.deepchrome.org. CONTACT: yanjun@virginia.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Gene Expression Regulation , Histone Code , Support Vector Machine , Cluster Analysis , Computational Biology , Epigenesis, Genetic , Gene Regulatory Networks , Humans , Neural Networks, Computer
ABSTRACT
The CRISPR system has become a powerful biological tool with a wide range of applications. However, improving targeting specificity and accurately predicting potential off-targets remain significant goals. Here, we introduce a web-based CRISPR/Cas9 Off-target Prediction and Identification Tool (CROP-IT) that performs improved off-target binding and cleavage site predictions. Unlike existing prediction programs that solely use DNA sequence information, CROP-IT integrates whole-genome-level biological information from existing Cas9 binding and cleavage data sets. Utilizing whole-genome chromatin state information from 125 human cell types further enhances its computational prediction power. Comparative analyses on experimentally validated datasets show that CROP-IT outperforms existing computational algorithms in predicting both Cas9 binding and cleavage sites. With a user-friendly web interface, CROP-IT outputs a scored and ranked list of potential off-targets that enables improved guide RNA design and more accurate prediction of Cas9 binding or cleavage sites.
Subject(s)
CRISPR-Associated Proteins/metabolism , CRISPR-Cas Systems , Chromatin/metabolism , Deoxyribonucleases/metabolism , Software , Algorithms , Binding Sites , DNA Cleavage , Humans , Sequence Analysis, DNA , Sequence Analysis, RNA/methods
ABSTRACT
OBJECTIVE: To observe the changes in proliferation and angiogenesis of residual tumor in rabbit lung after radiofrequency ablation (RFA). METHODS: A model of VX2 tumor in rabbit lung was established by injection of a tissue block suspension. 64 New Zealand White rabbits bearing VX2 tumors were assigned randomly to the control group (n = 10) and the RFA group (n = 48). During the RFA procedure, residual tumors were obtained by controlling the extent of electrode expansion, the output power and the treatment time. At several time points, the Ki-67 labeling index (Ki-67LI) and microvessel density (MVD) of the residual tumors were determined by immunohistochemistry. RESULTS: Ki-67LI of the control group was 45.3% ± 2.1%. Ki-67LI of the RFA group on days 1, 3 and 5 was 56.4% ± 3.4%, 60.1% ± 4.1% and 59.8% ± 2.4%, respectively, significantly higher than that of the control group; on days 7, 9, 14 and 21, however, it was 45.4% ± 2.0%, 46.2% ± 3.4%, 45.1% ± 4.4% and 47.8% ± 3.9%, respectively, not significantly different from the control group. MVD of the control group was 28.9 ± 2.9. MVD of the RFA group on days 3, 5 and 7 was 36.8 ± 2.6, 55.6 ± 4.8 and 51.5 ± 2.8, respectively, significantly higher than that of the control group; on days 1, 9, 14 and 21, however, it was 27 ± 2.8, 29.2 ± 3.2, 30 ± 2.8 and 28.8 ± 3.1, respectively, not significantly different from the control group. CONCLUSIONS: The proliferation and angiogenesis of residual pulmonary tumor show a transient increase after RFA.
Subject(s)
Cell Proliferation , Lung Neoplasms/blood supply , Lung Neoplasms/pathology , Neovascularization, Pathologic , Animals , Catheter Ablation , Electrodes , Neoplasm, Residual , Rabbits
ABSTRACT
Computational prediction of molecule-protein interactions has been key for developing new molecules to interact with a target protein for therapeutics development. Previous work includes two independent streams of approaches: (1) predicting protein-protein interactions (PPIs) between naturally occurring proteins and (2) predicting binding affinities between proteins and small-molecule ligands [also known as drug-target interaction (DTI)]. Studying the two problems in isolation has limited the ability of these computational models to generalize across the PPI and DTI tasks, both of which ultimately involve noncovalent interactions with a protein target. In this work, we developed Equivariant Graph of Graphs neural Network (EGGNet), a geometric deep learning (GDL) framework, for molecule-protein binding predictions that can handle three types of molecules for interacting with a target protein: (1) small molecules, (2) synthetic peptides, and (3) natural proteins. EGGNet leverages a graph of graphs (GoG) representation constructed from the molecular structures at atomic resolution and utilizes a multiresolution equivariant graph neural network to learn from such representations. In addition, EGGNet leverages the underlying biophysics and makes use of both atom- and residue-level interactions, which improve EGGNet's ability to rank candidate poses from blind docking. EGGNet achieves competitive performance on both a public protein-small-molecule binding affinity prediction task (80.2% top 1 success rate on CASF-2016) and a synthetic protein interface prediction task (88.4% area under the precision-recall curve). We envision that the proposed GDL framework can generalize to many other protein interaction prediction problems, such as binding site prediction and molecular docking, helping accelerate protein engineering and structure-based drug development.
ABSTRACT
MOTIVATION: Protein-protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers have suggested using supervised learning for the task of classifying pairs of proteins as interacting or not. However, its performance is largely restricted by the availability of truly interacting proteins (labeled). Meanwhile, there exists a considerable number of protein pairs for which an association between the two partners is apparent, but there is not enough experimental evidence to support it as a direct interaction (partially labeled). RESULTS: We propose a semi-supervised multi-task framework for predicting PPIs from not only labeled, but also partially labeled reference sets. The basic idea is to perform multi-task learning on a supervised classification task and a semi-supervised auxiliary task. The supervised classifier trains a multi-layer perceptron network for PPI predictions from labeled examples. The semi-supervised auxiliary task shares network layers of the supervised classifier and trains with partially labeled examples. Semi-supervision can be utilized in multiple ways. We tried three approaches in this article: (i) classification (to distinguish partial positives from negatives); (ii) ranking (to rate partial positives as more likely than negatives); (iii) embedding (to make data clusters get similar labels). We applied this framework to improve the identification of interacting pairs between HIV-1 and human proteins. Our method improved upon the state-of-the-art method for this task, indicating the benefits of semi-supervised multi-task learning using auxiliary information. AVAILABILITY: http://www.cs.cmu.edu/~qyj/HIVsemi.
Subject(s)
Artificial Intelligence , Computational Biology/methods , HIV-1/physiology , Human Immunodeficiency Virus Proteins/metabolism , Protein Interaction Mapping/methods , Proteins/metabolism , Algorithms , Data Interpretation, Statistical , Humans , Models, Statistical
ABSTRACT
After emerging in China in late 2019, the novel coronavirus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spread worldwide, and as of mid-2021, it remains a significant threat globally. Only a few coronaviruses are known to infect humans, and only two cause infections similar in severity to SARS-CoV-2: Severe acute respiratory syndrome-related coronavirus, a species closely related to SARS-CoV-2 that emerged in 2002, and Middle East respiratory syndrome-related coronavirus, which emerged in 2012. Unlike the current pandemic, previous epidemics were controlled rapidly through public health measures, but the body of research investigating severe acute respiratory syndrome and Middle East respiratory syndrome has proven valuable for identifying approaches to treating and preventing novel coronavirus disease 2019 (COVID-19). Building on this research, the medical and scientific communities have responded rapidly to the COVID-19 crisis and identified many candidate therapeutics. The approaches used to identify candidates fall into four main categories: adaptation of clinical approaches to diseases with related pathologies, adaptation based on virological properties, adaptation based on host response, and data-driven identification (ID) of candidates based on physical properties or on pharmacological compendia. To date, a small number of therapeutics have already been authorized by regulatory agencies such as the Food and Drug Administration (FDA), while most remain under investigation. The scale of the COVID-19 crisis offers a rare opportunity to collect data on the effects of candidate therapeutics. This information provides insight not only into the management of coronavirus diseases but also into the relative success of different approaches to identifying candidate therapeutics against an emerging disease. IMPORTANCE The COVID-19 pandemic is a rapidly evolving crisis. 
With the worldwide scientific community shifting focus onto the SARS-CoV-2 virus and COVID-19, a large number of possible pharmaceutical approaches for treatment and prevention have been proposed. What was known about each of these potential interventions evolved rapidly throughout 2020 and 2021. This fast-paced area of research provides important insight into how the ongoing pandemic can be managed and also demonstrates the power of interdisciplinary collaboration to rapidly understand a virus and match its characteristics with existing or novel pharmaceuticals. As illustrated by the continued threat of viral epidemics during the current millennium, a rapid and strategic response to emerging viral threats can save lives. In this review, we explore how different modes of identifying candidate therapeutics have borne out during COVID-19.
ABSTRACT
After emerging in China in late 2019, the novel coronavirus SARS-CoV-2 spread worldwide and as of mid-2021 remains a significant threat globally. Only a few coronaviruses are known to infect humans, and only two cause infections similar in severity to SARS-CoV-2: Severe acute respiratory syndrome-related coronavirus, a closely related species of SARS-CoV-2 that emerged in 2002, and Middle East respiratory syndrome-related coronavirus, which emerged in 2012. Unlike the current pandemic, previous epidemics were controlled rapidly through public health measures, but the body of research investigating severe acute respiratory syndrome and Middle East respiratory syndrome has proven valuable for identifying approaches to treating and preventing novel coronavirus disease 2019 (COVID-19). Building on this research, the medical and scientific communities have responded rapidly to the COVID-19 crisis to identify many candidate therapeutics. The approaches used to identify candidates fall into four main categories: adaptation of clinical approaches to diseases with related pathologies, adaptation based on virological properties, adaptation based on host response, and data-driven identification of candidates based on physical properties or on pharmacological compendia. To date, a small number of therapeutics have already been authorized by regulatory agencies such as the Food and Drug Administration (FDA), while most remain under investigation. The scale of the COVID-19 crisis offers a rare opportunity to collect data on the effects of candidate therapeutics. This information provides insight not only into the management of coronavirus diseases, but also into the relative success of different approaches to identifying candidate therapeutics against an emerging disease.
ABSTRACT
The novel coronavirus SARS-CoV-2, which emerged in late 2019, has since spread around the world and infected hundreds of millions of people with coronavirus disease 2019 (COVID-19). While this viral species was unknown prior to January 2020, its similarity to other coronaviruses that infect humans has allowed for rapid insight into the mechanisms that it uses to infect human hosts, as well as the ways in which the human immune system can respond. Here, we contextualize SARS-CoV-2 among other coronaviruses and identify what is known and what can be inferred about its behavior once inside a human host. Because the genomic content of coronaviruses, which specifies the virus's structure, is highly conserved, early genomic analysis provided a significant head start in predicting viral pathogenesis and in understanding potential differences among variants. The pathogenesis of the virus offers insights into symptomatology, transmission, and individual susceptibility. Additionally, prior research into interactions between the human immune system and coronaviruses has identified how these viruses can evade the immune system's protective mechanisms. We also explore systems-level research into the regulatory and proteomic effects of SARS-CoV-2 infection and the immune response. Understanding the structure and behavior of the virus serves to contextualize the many facets of the COVID-19 pandemic and can influence efforts to control the virus and treat the disease. IMPORTANCE COVID-19 involves a number of organ systems and can present with a wide range of symptoms. From how the virus infects cells to how it spreads between people, the available research suggests that these patterns are very similar to those seen in the closely related viruses SARS-CoV-1 and possibly Middle East respiratory syndrome-related CoV (MERS-CoV). Understanding the pathogenesis of the SARS-CoV-2 virus also contextualizes how the different biological systems affected by COVID-19 connect. 
Exploring the structure, phylogeny, and pathogenesis of the virus therefore helps to guide interpretation of the broader impacts of the virus on the human body and on human populations. For this reason, an in-depth exploration of viral mechanisms is critical to a robust understanding of SARS-CoV-2 and, potentially, future emergent human CoVs (HCoVs).
ABSTRACT
Membrane receptor-activated signal transduction pathways are integral to cellular functions and disease mechanisms in humans. Identification of the full set of proteins interacting with membrane receptors by high-throughput experimental means is difficult because methods to directly identify protein interactions are largely not applicable to membrane proteins. Unlike prior approaches that attempted to predict the global human interactome, we used a computational strategy that focused only on discovering the interacting partners of human membrane receptors, leading to improved results for these proteins. We predict specific interactions based on statistical integration of biological data containing highly informative direct and indirect evidence together with feedback from experts. The predicted membrane receptor interactome provides a system-wide view, and generates new biological hypotheses regarding interactions between membrane receptors and other proteins. We have experimentally validated a number of these interactions. The results suggest that a framework of systematically integrating computational predictions, global analyses, biological experimentation and expert feedback is a feasible strategy to study the human membrane receptor interactome.
Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Receptors, Cell Surface/analysis , Receptors, Cell Surface/metabolism , ErbB Receptors/analysis , ErbB Receptors/metabolism , Humans , Proteome/analysis , Proteome/metabolism , Proteomics/methods , Signal Transduction , Systems Biology/methods
ABSTRACT
MOTIVATION: Protein complexes integrate multiple gene products to coordinate many biological functions. Given a graph representing pairwise protein interaction data, one can search for subgraphs representing protein complexes. Previous methods for performing such a search relied on the assumption that complexes form a clique in that graph. While this assumption is true for some complexes, it does not hold for many others. New algorithms are required in order to recover complexes with other types of topological structure. RESULTS: We present an algorithm for inferring protein complexes from weighted interaction graphs. By using graph topological patterns and biological properties as features, we model each complex subgraph by a probabilistic Bayesian network (BN). We use a training set of known complexes to learn the parameters of this BN model. The log-likelihood ratio derived from the BN is then used to score subgraphs in the protein interaction graph and identify new complexes. We applied our method to protein interaction data in yeast. As we show, our algorithm achieved a considerable improvement over clique-based algorithms in terms of its ability to recover known complexes. We discuss some of the new complexes predicted by our algorithm and determine that they likely represent true complexes. AVAILABILITY: A Matlab implementation is available on the supporting website: www.cs.cmu.edu/~qyj/SuperComplex.
Subject(s)
Algorithms , Cluster Analysis , Models, Biological , Protein Interaction Mapping/methods , Proteome/metabolism , Signal Transduction/physiology , Computer Simulation
ABSTRACT
PURPOSE: Mortality from head and neck squamous cell carcinoma (HNSCC) is usually associated with locoregional invasion of the tumor into vital organs, including the airway. Understanding the signaling mechanisms that abrogate HNSCC invasion may reveal novel therapeutic targets for intervention. The purpose of this study was to investigate the efficacy of combined inhibition of c-Src and PLCgamma-1 in the abrogation of HNSCC invasion. EXPERIMENTAL DESIGN: PLCgamma-1 and c-Src inhibition was achieved by a combination of small molecule inhibitors and dominant negative approaches. The effect of inhibition of PLCgamma-1 and c-Src on invasion of HNSCC cells was assessed in an in vitro Matrigel-coated transwell invasion assay. In addition, immunoprecipitation reactions and in silico database mining were used to examine the interactions between PLCgamma-1 and c-Src. RESULTS: Here, we show that inhibition of PLCgamma-1 or c-Src with the PLC inhibitor U73122 or the Src family inhibitor AZD0530, or using dominant-negative constructs, attenuated epidermal growth factor (EGF)-stimulated HNSCC invasion. Furthermore, EGF stimulation increased the association between PLCgamma-1 and c-Src in HNSCC cells. Combined inhibition of PLCgamma-1 and c-Src resulted in further attenuation of HNSCC cell invasion in vitro. CONCLUSIONS: These cumulative results suggest that PLCgamma-1 and c-Src activation contribute to HNSCC invasion downstream of the EGF receptor and that targeting these pathways may be a novel strategy to prevent tumor invasion in HNSCC.
Subject(s)
Antineoplastic Agents/pharmacology , Carcinoma, Squamous Cell/metabolism , ErbB Receptors/metabolism , Gene Expression Regulation, Neoplastic , Head and Neck Neoplasms/metabolism , Phospholipase C gamma/antagonists & inhibitors , src-Family Kinases/antagonists & inhibitors , Cell Line, Tumor , Collagen/chemistry , Drug Combinations , Enzyme Inhibitors/pharmacology , Genes, Dominant , Humans , Laminin/chemistry , Models, Biological , Neoplasm Invasiveness , Neoplasm Metastasis , Proteoglycans/chemistry
ABSTRACT
Through sequence-based classification, this paper tries to accurately predict the DNA binding sites of transcription factors (TFs) in an unannotated cellular context. Related methods in the literature fail to perform such predictions accurately, since they do not consider sample distribution shift of sequence segments from an annotated (source) context to an unannotated (target) context. We, therefore, propose a method called "Transfer String Kernel" (TSK) that achieves improved prediction of transcription factor binding site (TFBS) using knowledge transfer via cross-context sample adaptation. TSK maps sequence segments to a high-dimensional feature space using a discriminative mismatch string kernel framework. In this high-dimensional space, labeled examples of the source context are re-weighted so that the revised sample distribution matches the target context more closely. We have experimentally verified TSK for TFBS identifications on 14 different TFs under a cross-organism setting. We find that TSK consistently outperforms the state-of-the-art TFBS tools, especially when working with TFs whose binding sequences are not conserved across contexts. We also demonstrate the generalizability of TSK by showing its cutting-edge performance on a different set of cross-context tasks for the MHC peptide binding predictions.
Subject(s)
Computational Biology/methods , DNA-Binding Proteins , DNA , Machine Learning , Models, Statistical , Algorithms , Animals , DNA/chemistry , DNA/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Humans , Mice , Protein Binding
ABSTRACT
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Subject(s)
Biomedical Research/trends , Biomedical Technology/trends , Deep Learning/trends , Algorithms , Biomedical Research/methods , Decision Making , Delivery of Health Care/methods , Delivery of Health Care/trends , Disease/genetics , Drug Design , Electronic Health Records/trends , Humans , Terminology as Topic
ABSTRACT
BACKGROUND: High-throughput methods can directly detect the set of interacting proteins in model species, but the results are often incomplete and exhibit high false positive and false negative rates. A number of researchers have recently presented methods for integrating direct and indirect data for predicting interactions. These methods utilize a common classifier for all pairs. However, due to missing data and high redundancy among the features used, different protein pairs may benefit from different features based on the set of attributes available. In addition, in many cases it is hard to directly determine which of the data sources contributed to a prediction. This information is important for biologists using these predictions in the design of new experiments. RESULTS: To address these challenges we propose a Mixture-of-Feature-Experts method for protein-protein interaction prediction. We split the features into roughly homogeneous sets of feature experts. The individual experts use logistic regression and their scores are combined using another logistic regression. When combining the scores, the weighting of each expert depends on the set of input attributes available for that pair. Thus, different experts will have different influence on the prediction depending on the available features. CONCLUSION: We applied our method to predict the set of interacting proteins in yeast and human cells. Our method improved upon the best previous methods for this task. In addition, the weighting of the experts provides a means to evaluate the prediction based on the high-scoring features.
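The combination step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the expert names, the fixed coefficients, the use of `None` to mark a missing feature group, and the renormalisation of gate weights over the available experts are all assumptions made for illustration, and no training is shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expert_score(features, weights, bias):
    """Logistic-regression score of one expert over its own feature group."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def combine(pair, experts, gate_weights, gate_bias):
    """Combine expert scores with a second logistic regression.

    Experts whose feature group is missing (None) are skipped, and the
    remaining gate weights are renormalised to sum to 1, so each expert's
    influence depends on which attributes are available for this pair.
    This gating rule is an assumption, not the published one.
    """
    available = [name for name in experts if pair.get(name) is not None]
    if not available:
        return sigmoid(gate_bias), {}
    total = sum(gate_weights[name] for name in available)
    shares = {name: gate_weights[name] / total for name in available}
    z = gate_bias
    for name in available:
        weights, bias = experts[name]
        z += shares[name] * expert_score(pair[name], weights, bias)
    return sigmoid(z), shares

# Hypothetical experts: (coefficients, bias) per feature group.
experts = {
    "expression": ([1.2, -0.4], -0.1),
    "sequence": ([0.8], 0.0),
}
gate_weights = {"expression": 2.0, "sequence": 1.0}

full = {"expression": [0.9, 0.1], "sequence": [0.5]}
partial = {"expression": None, "sequence": [0.5]}  # expression data missing

p_full, shares_full = combine(full, experts, gate_weights, gate_bias=-0.5)
p_partial, shares_partial = combine(partial, experts, gate_weights, gate_bias=-0.5)
print(p_full, shares_full)
print(p_partial, shares_partial)
```

The returned shares show which experts drove a given prediction, which is the kind of per-source attribution the abstract says is useful when designing follow-up experiments.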
Subject(s)
Protein Interaction Mapping/methods , Amino Acid Sequence , Databases, Protein , Humans , Predictive Value of Tests , Saccharomyces cerevisiae/genetics
ABSTRACT
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard), which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method finds a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, because recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, which indicate the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs and the dependencies among them.
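The saliency-map idea mentioned above (first-order derivatives of the prediction with respect to each nucleotide) can be sketched as follows. This is an illustration under stated assumptions, not the DeMo Dashboard code: `toy_model` is a hypothetical logistic stand-in for a trained network, the gradient is estimated by central finite differences rather than backpropagation, and taking the gradient at the observed base (gradient times one-hot input) as the per-position score is one common convention assumed here.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def toy_model(x, weights, bias=0.0):
    """Hypothetical stand-in for a trained network: logistic score of a one-hot input."""
    z = bias + sum(w * v for wrow, xrow in zip(weights, x) for w, v in zip(wrow, xrow))
    return 1.0 / (1.0 + math.exp(-z))

def saliency(seq, model, eps=1e-5):
    """Per-nucleotide saliency: d(score)/d(input) at the observed base.

    The derivative is approximated by central finite differences, so any
    callable that maps a one-hot matrix to a scalar score can be used.
    """
    x = one_hot(seq)
    scores = []
    for row in x:
        total = 0.0
        for c in range(len(ALPHABET)):
            original = row[c]
            row[c] = original + eps
            up = model(x)
            row[c] = original - eps
            down = model(x)
            row[c] = original  # restore the input before moving on
            total += (up - down) / (2 * eps) * original
        scores.append(total)
    return scores

# Hypothetical weights: a strong preference for G at the second position.
weights = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.2],
]
seq = "AGCT"
scores = saliency(seq, lambda x: toy_model(x, weights))
for base, s in zip(seq, scores):
    print(base, round(s, 4))
```

With a real network, an automatic-differentiation framework would supply the gradient in a single backward pass; finite differences are used here only to keep the sketch dependency-free.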