ABSTRACT
BACKGROUND: Bistable systems, i.e., systems that exhibit two stable steady states, are of particular interest in biology. They can implement binary cellular decision making, e.g., in pathways for cellular differentiation and cell cycle regulation. The onset of cancer, prion diseases, and neurodegenerative diseases is known to be associated with malfunctioning bistable systems. Exploring and characterizing the parameter regions in which bistable systems retain or lose bistability is central to much therapeutic research, such as cancer pharmacology. RESULTS: We use eigenvalue sensitivity analysis and stable state separation sensitivity analysis to understand bistable system behaviors and to characterize the most sensitive parameters of a bistable system. While eigenvalue sensitivity analysis is an established technique in engineering disciplines, it has rarely been used to study biological systems. We demonstrate the utility of these approaches on a published bistable system and illustrate the scalability and generalizability of these methods to larger bistable systems. CONCLUSIONS: Eigenvalue sensitivity analysis and separation sensitivity analysis prove to be promising tools for defining parameter design rules that govern switching between either stable steady state of a bistable system and the corresponding monostable state after bifurcation. These rules were applied to the smallest two-component bistable system, and the results were validated analytically. We showed that, for multiple parameter settings of the same bistable system, switching to a desired state, with bistability retained or lost, can be designed by varying the most sensitive parameter according to our perturbation recommendations. We propose eigenvalue and stable state separation sensitivity analyses as a framework for evaluating large and complex bistable systems.
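As a minimal illustration of what eigenvalue sensitivity analysis involves (a sketch, not the published model or the authors' workflow), the snippet below takes a generic two-gene toggle switch, locates each stable steady state, and estimates by finite differences how the leading eigenvalue of the Jacobian responds to a parameter perturbation; the model, parameter values, and step sizes are illustrative assumptions.

```python
# Sketch only: a generic two-gene toggle switch, not the paper's published model.
# Parameters (a, n) and the finite-difference steps are illustrative assumptions.
import numpy as np
from scipy.optimize import fsolve

def rhs(state, a, n):
    x, y = state
    return np.array([a / (1 + y**n) - x,
                     a / (1 + x**n) - y])

def jacobian(state, a, n, h=1e-6):
    # numerical Jacobian of the vector field at `state`
    J = np.zeros((2, 2))
    for j in range(2):
        dp = np.zeros(2); dp[j] = h
        J[:, j] = (rhs(state + dp, a, n) - rhs(state - dp, a, n)) / (2 * h)
    return J

def leading_eigenvalue(a, n, guess):
    ss = fsolve(rhs, guess, args=(a, n))          # steady state near `guess`
    eigs = np.linalg.eigvals(jacobian(ss, a, n))
    return ss, eigs[np.argmax(eigs.real)]         # eigenvalue closest to instability

a, n = 4.0, 3.0
for guess in ([3.5, 0.2], [0.2, 3.5]):            # the two stable branches
    ss, lam = leading_eigenvalue(a, n, guess)
    da = 1e-4                                     # finite-difference sensitivity w.r.t. a
    _, lam_hi = leading_eigenvalue(a + da, n, ss)
    _, lam_lo = leading_eigenvalue(a - da, n, ss)
    print(ss.round(3), lam.real, (lam_hi.real - lam_lo.real) / (2 * da))
```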
Subjects
Computational Biology, Biological Models
ABSTRACT
BACKGROUND: Over the past few decades, numerous forecasting methods have been proposed in the field of epidemic forecasting. Such methods can be classified into different categories, such as deterministic vs. probabilistic or comparative vs. generative, and so on. In some of the more popular comparative methods, researchers compare observed epidemiological data from the early stages of an outbreak with the output of proposed models to forecast the future trend and prevalence of the outbreak. A significant problem in this area is the lack of standard, well-defined evaluation measures for selecting the best algorithm among several candidates, as well as the best possible configuration for a particular algorithm. RESULTS: In this paper we present an evaluation framework which allows for combining different features, error measures, and ranking schema to evaluate forecasts. We describe the various epidemic features (Epi-features) included to characterize the output of forecasting methods and provide suitable error measures that could be used to evaluate the accuracy of the methods with respect to these Epi-features. We focus on long-term predictions rather than short-term forecasting and demonstrate the utility of the framework by evaluating six forecasting methods for predicting influenza in the United States. Our results demonstrate that different error measures lead to different rankings even for a single Epi-feature. Further, our experimental analyses show that no single method dominates the rest in predicting all Epi-features when evaluated across error measures. As an alternative, we provide various Consensus Ranking schema that summarize individual rankings, thus accounting for different error measures. Since each Epi-feature presents a different aspect of the epidemic, multiple methods need to be combined to provide a comprehensive forecast. Thus we call for a more nuanced approach to evaluating epidemic forecasts, and we believe that a comprehensive evaluation framework, as presented in this paper, will add value to the computational epidemiology community.
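To make the evaluation idea concrete, here is a small sketch (with hypothetical method names, Epi-feature values, and error measures, not the framework's actual components) that scores forecasts of two Epi-features under two error measures and combines the resulting rankings with a simple mean-rank consensus.

```python
# Sketch only: illustrative Epi-features, error measures, and a mean-rank
# consensus; the framework's actual Epi-features, measures, and schema are richer.
import numpy as np

observed = {"peak_week": 6, "peak_intensity": 5.9}           # hypothetical ground truth
forecasts = {                                                 # hypothetical method outputs
    "MethodA": {"peak_week": 5, "peak_intensity": 5.1},
    "MethodB": {"peak_week": 8, "peak_intensity": 6.2},
    "MethodC": {"peak_week": 6, "peak_intensity": 7.5},
}

def abs_error(pred, obs):  return abs(pred - obs)
def ape(pred, obs):        return abs(pred - obs) / abs(obs)  # absolute percentage error

errors = {m: {f: {"AE": abs_error(v[f], observed[f]), "APE": ape(v[f], observed[f])}
              for f in observed}
          for m, v in forecasts.items()}

# Rank methods per (Epi-feature, error measure), then take the mean rank as a consensus.
ranks = []
for f in observed:
    for e in ("AE", "APE"):
        order = sorted(forecasts, key=lambda m: errors[m][f][e])
        ranks.append({m: order.index(m) + 1 for m in order})
consensus = {m: np.mean([r[m] for r in ranks]) for m in forecasts}
print(sorted(consensus.items(), key=lambda kv: kv[1]))
```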
Subjects
Algorithms, Human Influenza/epidemiology, Age Factors, Disease Outbreaks, Forecasting, Humans, Theoretical Models, Pandemics, Stochastic Processes, United States
ABSTRACT
Biological processes such as circadian rhythms, cell division, metabolism, and development occur as ordered sequences of events. The synchronization of these coordinated events is essential for proper cell function, and hence the determination of critical time points in biological processes is an important component of all biological investigations. In particular, such critical time points establish logical ordering constraints on subprocesses, impose prerequisites on temporal regulation and spatial compartmentalization, and situate dynamic reorganization of functional elements in preparation for subsequent stages. Thus, building temporal phenomenological representations of biological processes from genome-wide datasets is relevant to formulating biological hypotheses about how processes are mechanistically regulated, how these regulations vary on an evolutionary scale, and how their inadvertent dysregulation leads to a diseased state or fatality. This paper presents a general framework (GOALIE) to reconstruct temporal models of cellular processes from time-course gene expression data. We mathematically formulate the problem as one of optimally segmenting datasets into a succession of "informative" windows such that time points within a window expose concerted clusters of gene action whereas time points straddling window boundaries constitute points of significant restructuring. We illustrate here how GOALIE successfully brings out the interplay between multiple yeast processes, inferred from combined experimental datasets for the cell cycle and the metabolic cycle.
Subjects
Cell Physiological Phenomena, Biological Phenomena, Cell Cycle/genetics, Cell Division, Cluster Analysis, Gene Expression, Saccharomyces cerevisiae/genetics
ABSTRACT
Forecasting societal events such as civil unrest, mass protests, and violent conflicts is a challenging problem with several important real-world applications in planning and policy making. While traditional forecasting approaches have typically relied on historical time series for generating such forecasts, recent research has focused on using open source surrogate data for more accurate and timely forecasts. Furthermore, leveraging such data can also help to identify precursors of those events that can be used to gain insights into the generated forecasts. The key challenge is to develop a unified framework for forecasting and precursor identification that can deal with missing historical data. Other challenges include sufficient flexibility in handling different types of events and providing interpretable representations of identified precursors. Although existing methods exhibit promising performance for predictive modeling in event detection, these models do not adequately address the above challenges. Here, we propose a unified framework based on an attention-based long short-term memory (LSTM) model to simultaneously forecast events with sequential text datasets as well as identify precursors at different granularities, such as documents and document excerpts. The key idea is to leverage word context in sequential and time-stamped documents such as news articles and blogs for learning a rich set of precursors. We validate the proposed framework by conducting extensive experiments with two real-world datasets: military action and violent conflicts in the Middle East, and mass protests in Latin America. Our results show that overall, the proposed approach generates more accurate forecasts compared to the existing state-of-the-art methods, while at the same time producing a rich set of precursors for the forecasted events.
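The sketch below shows the general shape of an attention-weighted LSTM over a sequence of daily document embeddings, where the learned attention weights can be read as document-level precursor scores; the dimensions, single binary output, and layer choices are illustrative assumptions, not the paper's architecture.

```python
# Sketch only: attention over LSTM states of a document sequence; the attention
# weights act as document-level precursor scores. Dimensions are illustrative.
import torch
import torch.nn as nn

class AttnLSTMForecaster(nn.Module):
    def __init__(self, doc_dim=300, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(doc_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)      # scores each time step
        self.out = nn.Linear(hidden, 1)       # event / no-event logit

    def forward(self, docs):                  # docs: (batch, days, doc_dim)
        states, _ = self.lstm(docs)           # (batch, days, hidden)
        weights = torch.softmax(self.attn(states).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * states).sum(dim=1)
        return self.out(context).squeeze(-1), weights  # logit, precursor weights

model = AttnLSTMForecaster()
docs = torch.randn(8, 10, 300)                # 8 sequences of 10 daily doc embeddings
logit, precursor_weights = model(docs)
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(8))
loss.backward()
```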
ABSTRACT
Causality visualization can help people understand temporal chains of events, such as messages sent in a distributed system, cause and effect in a historical conflict, or the interplay between political actors over time. However, as the scale and complexity of these event sequences grows, even these visualizations can become overwhelming to use. In this paper, we propose the use of textual narratives as a data-driven storytelling method to augment causality visualization. We first propose a design space for how textual narratives can be used to describe causal data. We then present results from a crowdsourced user study where participants were asked to recover causality information from two causality visualizations (causal graphs and Hasse diagrams), with and without an associated textual narrative. Finally, we describe Causeworks, a causality visualization system for understanding how specific interventions influence a causal model. The system incorporates an automatic textual narrative mechanism based on our design space. We validate Causeworks through interviews with experts who used the system for understanding complex events.
ABSTRACT
CMGSDB (Database for Computational Modeling of Gene Silencing) is an integration of heterogeneous data sources about Caenorhabditis elegans with capabilities for compositional data mining (CDM) across diverse domains. Besides gene, protein, and functional annotations, CMGSDB currently unifies information about 531 RNAi phenotypes obtained from heterogeneous databases using a hierarchical scheme. A phenotype browser at the CMGSDB website serves this hierarchy and relates phenotypes to other biological entities. The application of CDM to CMGSDB produces 'chains' of relationships in the data by finding two-way connections between sets of biological entities. Chains can, for example, relate the knockdown of a set of genes during an RNAi experiment to the disruption of a pathway or specific gene expression through another set of genes not directly related to the former set. The web interface for CMGSDB is available at https://bioinformatics.cs.vt.edu/cmgs/CMGSDB/, and serves individual biological entity information as well as details of all chains computed by CDM.
Subjects
Caenorhabditis elegans Proteins/genetics, Caenorhabditis elegans/genetics, Genetic Databases, RNA Interference, Animals, Caenorhabditis elegans Proteins/antagonists & inhibitors, Caenorhabditis elegans Proteins/metabolism, Computational Biology, Internet, Phenotype, Signal Transduction, Systems Integration, User-Computer Interface
ABSTRACT
In recent times, sequence-to-sequence (seq2seq) models have gained considerable popularity and provide state-of-the-art performance in a wide variety of tasks, such as machine translation, headline generation, text summarization, speech-to-text conversion, and image caption generation. The underlying framework for all these models is usually a deep neural network comprising an encoder and a decoder. Although simple encoder-decoder models produce competitive results, many researchers have proposed additional improvements over these seq2seq models, e.g., using an attention-based model over the input, pointer-generation models, and self-attention models. However, such seq2seq models suffer from two common problems: 1) exposure bias and 2) inconsistency between the training objective and the test-time evaluation measure. Recently, a completely novel point of view has emerged in addressing these two problems in seq2seq models, leveraging methods from reinforcement learning (RL). In this survey, we consider seq2seq problems from the RL point of view and provide a formulation that combines the decision-making power of RL methods with seq2seq models capable of remembering long-term dependencies. We present some of the most recent frameworks that combine concepts from RL and deep neural networks. Our work aims to provide insights into some of the problems that inherently arise with current approaches and how we can address them with better RL models. We also provide source code implementing most of the RL models discussed in this paper for the complex task of abstractive text summarization, along with targeted experiments for these RL models in terms of both performance and training time.
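As one concrete instance of the RL formulation discussed here, the following sketch shows a self-critical policy-gradient loss for a seq2seq decoder: the sampled sequence's log-probabilities are weighted by the advantage of its reward (e.g., ROUGE) over the reward of the greedy decode. The log-probabilities and rewards below are stand-ins for what a real decoder and metric would produce.

```python
# Sketch only: a self-critical policy-gradient step for a seq2seq decoder.
# The sampled log-probabilities and rewards are placeholders for a real decoder
# and a metric such as ROUGE; only the loss construction is the point here.
import torch

def policy_gradient_loss(sample_logprobs, sample_reward, greedy_reward):
    # sample_logprobs: (batch, steps) log-probs of the sampled tokens
    # advantage = sampled reward minus the greedy (baseline) reward
    advantage = sample_reward - greedy_reward                    # (batch,)
    return -(advantage.unsqueeze(1) * sample_logprobs).sum(dim=1).mean()

batch, steps = 4, 12
sample_logprobs = torch.log(torch.rand(batch, steps)).requires_grad_(True)
sample_reward = torch.tensor([0.42, 0.31, 0.55, 0.47])           # e.g. ROUGE of samples
greedy_reward = torch.tensor([0.40, 0.35, 0.50, 0.47])           # e.g. ROUGE of greedy output

loss = policy_gradient_loss(sample_logprobs, sample_reward, greedy_reward)
loss.backward()   # gradients flow only through the sampled log-probabilities
```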
ABSTRACT
Physics-based simulations are often used to model and understand complex physical systems in domains such as fluid dynamics. Such simulations, although used frequently, often suffer from inaccurate or incomplete representations, either due to their high computational costs or due to a lack of complete physical knowledge of the system. In such situations, it is useful to employ machine learning (ML) to fill the gap by learning a model of the complex physical process directly from simulation data. However, as data generation through simulations is costly, we need to develop models that are cognizant of data-paucity issues. In such scenarios, it helps if the rich physical knowledge of the application domain is incorporated into the architectural design of ML models. We can also use information from physics-based simulations to guide the learning process through aggregate supervision that favorably constrains learning. In this article, we propose PhyNet, a deep learning model that uses physics-guided structural priors and physics-guided aggregate supervision to model the drag forces acting on each particle in a computational fluid dynamics-discrete element method (CFD-DEM) simulation. We conduct extensive experiments in the context of drag force prediction and showcase the usefulness of including physics knowledge in our deep learning formulation. PhyNet has been compared with several state-of-the-art models and achieves a significant performance improvement of 7.09% on average. The source code has been made available.
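The flavor of combining per-particle supervision with an aggregate, physics-motivated constraint can be sketched as below; the network, the aggregate target (matching a simulation-level mean drag), and the weighting are illustrative assumptions and do not reproduce PhyNet's actual priors or loss.

```python
# Sketch only: per-particle drag-force loss plus an aggregate supervision term;
# the aggregate target and weighting are illustrative, not PhyNet's formulation.
import torch
import torch.nn as nn

class DragNet(nn.Module):
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, particle_features):
        return self.net(particle_features).squeeze(-1)   # per-particle drag

def physics_guided_loss(pred, target, aggregate_target, lam=0.1):
    data_loss = nn.functional.mse_loss(pred, target)          # per-particle supervision
    aggregate_loss = (pred.mean() - aggregate_target) ** 2    # simulation-level constraint
    return data_loss + lam * aggregate_loss

model = DragNet()
feats = torch.randn(500, 6)            # e.g. velocity and neighborhood stats per particle
target = torch.randn(500)              # per-particle drag from a CFD-DEM simulation
loss = physics_guided_loss(model(feats), target, aggregate_target=target.mean())
loss.backward()
```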
Subjects
Computer Simulation, Deep Learning, Hydrodynamics, Data Mining, Neural Networks (Computer)
ABSTRACT
There is considerable interest in networked social science experiments for understanding human behavior at scale. Significant effort is required to perform data analytics on experimental outputs and for computational modeling of custom experiments. Moreover, experiments and modeling are often performed in a cycle, enabling iterative experimental refinement and data modeling to uncover interesting insights and to generate/refute hypotheses about social behaviors. The current practice for social analysts is to develop tailor-made computer programs and analytical scripts for experiments and modeling. This often leads to inefficiencies and duplication of effort. In this work, we propose a pipeline framework to take a significant step towards overcoming these challenges. Our contribution is to describe the design and implementation of a software system to automate many of the steps involved in analyzing social science experimental data, building models to capture the behavior of human subjects, and providing data to test hypotheses. The proposed pipeline framework consists of formal models, formal algorithms, and theoretical models as the basis for the design and implementation. We propose a formal data model such that, if an experiment can be described in terms of this model, our pipeline software can be used to analyze its data efficiently. The merits of the proposed pipeline framework are illustrated through several case studies of networked social science experiments.
Subjects
Electronic Data Processing, Theoretical Models, Social Behavior, Social Sciences/methods, Software, Algorithms, Humans
ABSTRACT
Protein-protein interactions are mediated by complementary amino acids defining complementary surfaces. Typically not all members of a family of related proteins interact equally well with all members of a partner family; thus analysis of the sequence record can reveal the complementary amino acid partners that confer interaction specificity. This article develops methods for learning and using probabilistic graphical models of such residue "cross-coupling" constraints between interacting protein families, based on multiple sequence alignments and information about which pairs of proteins are known to interact. Our models generalize traditional consensus sequence binding motifs, and provide a probabilistic semantics enabling sound evaluation of the plausibility of new possible interactions. Furthermore, predictions made by the models can be explained in terms of the underlying residue interactions. Our approach supports different levels of prior knowledge regarding interactions, including both one-to-one (e.g., pairs of proteins from the same organism) and many-to-many (e.g., experimentally identified interactions), and we present a technique to account for possible bias in the represented interactions. We apply our approach in studies of PDZ domains and their ligands, fundamental building blocks in a number of protein assemblies. Our algorithms are able to identify biologically interesting cross-coupling constraints, to successfully identify known interactions, and to make explainable predictions about novel interactions.
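As a much simpler stand-in for the residue cross-coupling idea (not the probabilistic graphical models the article actually learns), one can score pairs of alignment columns across two interaction-paired families by the mutual information of their amino acid types over known interacting pairs; the toy sequences below are hypothetical.

```python
# Sketch only: mutual information between alignment columns of two interacting
# families, computed over known interacting pairs; a crude stand-in for the
# graphical models the article actually learns.
from collections import Counter
from math import log2

# hypothetical aligned sequences of known interacting pairs (family A, family B)
pairs = [("ARNDC", "KLMFP"), ("ARNEC", "KLMYP"), ("GRNDC", "KIMFP"), ("ARQDC", "KLMFP")]

def column_mi(pairs, i, j):
    xs = [a[i] for a, _ in pairs]
    ys = [b[j] for _, b in pairs]
    n = len(pairs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

scores = {(i, j): column_mi(pairs, i, j)
          for i in range(len(pairs[0][0])) for j in range(len(pairs[0][1]))}
print(max(scores, key=scores.get))   # the most strongly cross-coupled column pair
```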
Subjects
PDZ Domains, Protein Interaction Mapping/methods, Proteins/genetics, Proteins/metabolism, Sequence Alignment/methods, Computer Simulation, Ligands, Biological Models, Molecular Models, Proteins/chemistry
ABSTRACT
We present a new approach to segmenting multiple time series by analyzing the dynamics of cluster formation and rearrangement around putative segment boundaries. This approach finds application in distilling large numbers of gene expression profiles into temporal relationships underlying biological processes. By directly minimizing information-theoretic measures of segmentation quality derived from Kullback-Leibler (KL) divergences, our formulation reveals clusters of genes along with a segmentation such that clusters show concerted behavior within segments but exhibit significant regrouping across segmentation boundaries. The results of the segmentation algorithm can be summarized as Gantt charts revealing temporal dependencies in the ordering of key biological processes. Applications to the yeast metabolic cycle and the yeast cell cycle are described.
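One ingredient of such a KL-divergence-based objective can be illustrated on synthetic data: cluster the series on each side of a candidate boundary and measure, via the KL divergence of the joint label distribution from the product of its marginals, how strongly the two clusterings are associated; weak association suggests significant regrouping at that boundary. This is a toy illustration under those assumptions, not the paper's actual algorithm or data.

```python
# Sketch only: a KL-based association between clusterings on either side of a
# candidate boundary (KL divergence of the joint cluster-label distribution
# from the product of its marginals). Low association suggests regrouping.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
expr = rng.normal(size=(150, 16))                     # genes x time points (synthetic)

def kl_association(expr, boundary, k=3):
    left = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr[:, :boundary])
    right = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr[:, boundary:])
    joint = np.zeros((k, k))
    for a, b in zip(left, right):
        joint[a, b] += 1
    joint /= joint.sum()
    outer = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / outer[mask])).sum())

print({b: round(kl_association(expr, b), 3) for b in range(4, 13, 2)})
```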
Subjects
Algorithms, Cluster Analysis, Gene Expression Profiling/methods, Biological Models, Computer Simulation
ABSTRACT
Just as complex electronic circuits are built from simple Boolean gates, diverse biological functions, including signal transduction, differentiation, and stress response, frequently use biochemical switches as a functional module. A relatively small number of such switches have been described in the literature, and these exhibit considerable diversity in chemical topology. We asked if biochemical switches are indeed rare and if there are common chemical motifs and family relationships among such switches. We performed a systematic exploration of chemical reaction space by generating all possible stoichiometrically valid chemical configurations up to 3 molecules and 6 reactions and up to 4 molecules and 3 reactions. We used Monte Carlo sampling of parameter space for each such configuration to generate specific models and checked each model for switching properties. We found nearly 4,500 reaction topologies, or about 10% of our tested configurations, that demonstrate switching behavior. Commonly accepted topological features such as feedback were poor predictors of bistability, and we identified new reaction motifs that were likely to be found in switches. Furthermore, the discovered switches were related in that most of the larger configurations were derived from smaller ones by addition of one or more reactions. To explore even larger configurations, we developed two tools: the "bistabilizer," which converts almost-bistable systems into bistable ones, and frequent motif mining, which helps rank untested configurations. Both of these tools increased the coverage of our library of bistable systems. Thus, our systematic exploration of chemical reaction space has produced a valuable resource for investigating the key signaling motif of bistability.
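A bare-bones version of the switching test described here, checking one hand-written positive-feedback model rather than the enumerated configurations, is to integrate the system from many random initial conditions and count the distinct steady states it settles into; the model, parameter values, and tolerance below are illustrative.

```python
# Sketch only: test a candidate reaction model for bistability by integrating
# from many random initial conditions and counting distinct steady states.
# The toy model (positive feedback with first-order decay) is illustrative;
# the paper enumerates and samples far more configurations.
import numpy as np
from scipy.integrate import solve_ivp

def model(t, y, k_basal=0.05, k_fb=1.0, K=1.0, n=4, k_deg=0.5):
    x = y[0]
    return [k_basal + k_fb * x**n / (K**n + x**n) - k_deg * x]

def count_steady_states(n_starts=50, t_end=500.0, tol=0.05):
    rng = np.random.default_rng(0)
    finals = []
    for x0 in rng.uniform(0.0, 3.0, size=n_starts):
        sol = solve_ivp(model, (0.0, t_end), [x0], rtol=1e-8, atol=1e-10)
        finals.append(sol.y[0, -1])
    states = []                         # merge end points that lie within `tol`
    for f in sorted(finals):
        if not states or abs(f - states[-1]) > tol:
            states.append(f)
    return states

print(count_steady_states())   # two distinct values => bistable for these parameters
```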
Subjects
Biochemistry/methods, Computational Biology/methods, Signal Transduction/physiology, Energy Transfer/physiology, Physiological Feedback/physiology, Kinetics, Chemical Models, Systems Biology/methods
ABSTRACT
Exploring coordinated relationships (e.g., shared relationships between two sets of entities) is an important analytics task in a variety of real-world applications, such as discovering similarly behaved genes in bioinformatics, detecting malware collusions in cyber security, and identifying product bundles in marketing analysis. Coordinated relationships can be formalized as biclusters. To support visual exploration of biclusters, bipartite-graph-based visualizations have been proposed, and edge bundling is used to show biclusters. However, edge bundling suffers from edge crossings due to possible overlaps of biclusters, and its impact on users exploring biclusters in bipartite graphs is not well understood. To address these issues, we propose a novel bicluster-based seriation technique that reduces edge crossings in bipartite graph drawings, and we conducted a user experiment to study the effect of edge bundling and the proposed technique on visualizing biclusters in bipartite graphs. We found that both had an impact on reducing entity visits for users exploring biclusters, and that edge bundles helped them find more justified answers. Moreover, we identified four key trade-offs that inform the design of future bicluster visualizations. The study results suggest that edge bundling is critical for exploring biclusters in bipartite graphs, as it helps to reduce low-level perceptual problems and to support high-level inferences.
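Edge-crossing reduction itself can be made concrete with the classic barycenter heuristic for two-layer bipartite drawings, sketched below on a tiny hypothetical graph; this is a generic baseline, not the paper's bicluster-based seriation technique.

```python
# Sketch only: the classic barycenter heuristic for reducing edge crossings in a
# two-layer bipartite drawing; the paper's bicluster-based seriation is a
# different, bicluster-aware ordering, shown here only to make the goal concrete.
edges = [("a", 2), ("a", 3), ("b", 0), ("b", 3), ("c", 1), ("d", 0), ("d", 2)]
left = ["a", "b", "c", "d"]                      # fixed ordering of the left layer
right = sorted({r for _, r in edges})

def barycenter_order(edges, left_order, right_nodes):
    pos = {v: i for i, v in enumerate(left_order)}
    def barycenter(r):
        nbrs = [pos[l] for l, rr in edges if rr == r]
        return sum(nbrs) / len(nbrs) if nbrs else 0.0
    return sorted(right_nodes, key=barycenter)

print(barycenter_order(edges, left, right))       # right-layer order with fewer crossings
```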
ABSTRACT
Since 2013, the Centers for Disease Control and Prevention (CDC) has hosted an annual influenza season forecasting challenge. The 2015-2016 challenge consisted of weekly probabilistic forecasts of multiple targets and included fourteen models submitted by eleven teams. Forecast skill was evaluated using a modified logarithmic score. We averaged submitted forecasts into a mean ensemble model and compared them against predictions based on historical trends. Forecast skill was highest for seasonal peak intensity and short-term forecasts, while forecast skill for timing of season onset and peak week was generally low. Higher forecast skill was associated with team participation in previous influenza forecasting challenges and utilization of ensemble forecasting techniques. The mean ensemble consistently performed well and outperformed historical trend predictions. CDC and contributing teams will continue to advance influenza forecasting and work to improve the accuracy and reliability of forecasts to facilitate increased incorporation into public health response efforts.
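A toy version of the mean-ensemble construction mentioned here simply averages, bin by bin, the probabilities the individual teams assign to a forecast target and renormalizes; the bins, team count, and probabilities below are hypothetical.

```python
# Sketch only: a mean ensemble of probabilistic forecasts over discrete bins
# (e.g., candidate peak weeks). Team names and probabilities are hypothetical.
import numpy as np

bins = ["wk50", "wk51", "wk52", "wk01", "wk02"]
team_forecasts = np.array([
    [0.10, 0.30, 0.40, 0.15, 0.05],   # team 1
    [0.05, 0.20, 0.50, 0.20, 0.05],   # team 2
    [0.20, 0.25, 0.30, 0.15, 0.10],   # team 3
])
ensemble = team_forecasts.mean(axis=0)
ensemble /= ensemble.sum()             # guard against rounding drift
print(dict(zip(bins, ensemble.round(3))))
```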
Subjects
Human Influenza/epidemiology, Statistical Models, Centers for Disease Control and Prevention (U.S.), Disease Outbreaks, Humans, Human Influenza/mortality, Morbidity, Seasons, United States/epidemiology
ABSTRACT
Many statistical measures and algorithmic techniques have been proposed for studying residue coupling in protein families. Generally speaking, two residue positions are considered coupled if, in the sequence record, some of their amino acid type combinations are significantly more common than others. While the proposed approaches have proven useful in finding and describing coupling, a significant missing component is a formal probabilistic model that explicates and compactly represents the coupling, integrates information about sequence, structure, and function, and supports inferential procedures for analysis, diagnosis, and prediction. We present an approach to learning and using probabilistic graphical models of residue coupling. These models capture significant conservation and coupling constraints observable in a multiply-aligned set of sequences. Our approach can place a structural prior on considered couplings, so that all identified relationships have direct mechanistic explanations. It can also incorporate information about functional classes, and thereby learn a differential graphical model that distinguishes constraints common to all classes from those unique to individual classes. Such differential models separately account for class-specific conservation and family-wide coupling, two different sources of sequence covariation. They are then able to perform interpretable functional classification of new sequences, explaining classification decisions in terms of the underlying conservation and coupling constraints. We apply our approach in studies of both G protein-coupled receptors and PDZ domains, identifying and analyzing family-wide and class-specific constraints, and performing functional classification. The results demonstrate that graphical models of residue coupling provide a powerful tool for uncovering, representing, and utilizing significant sequence-structure-function relationships in protein families.
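To illustrate the classification-with-explanation idea in a drastically simplified form (independent per-position frequencies rather than the differential graphical models the paper learns), the sketch below scores a new sequence under per-class models, picks the best class, and reports per-position log-odds contributions as the explanation; the sequences and classes are hypothetical.

```python
# Sketch only: classify a new sequence by scoring it under simple per-class
# models (independent position frequencies, a crude stand-in for the paper's
# differential graphical models) and explain the decision with per-position
# log-odds contributions. Sequences and classes are hypothetical.
import numpy as np
from collections import Counter

classes = {
    "class1": ["ARND", "ARNE", "ARQD", "GRND"],
    "class2": ["KLMF", "KIMF", "KLMY", "KLVF"],
}

def position_logprobs(seqs, pseudocount=0.5, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    table = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = len(seqs) + pseudocount * len(alphabet)
        table.append({a: np.log((counts[a] + pseudocount) / total) for a in alphabet})
    return table

models = {c: position_logprobs(seqs) for c, seqs in classes.items()}

def classify(seq):
    scores = {c: sum(m[i][a] for i, a in enumerate(seq)) for c, m in models.items()}
    best, worst = max(scores, key=scores.get), min(scores, key=scores.get)
    contrib = [models[best][i][a] - models[worst][i][a]
               for i, a in enumerate(seq)]        # per-position evidence for `best`
    return best, scores, contrib

print(classify("ARNF"))
```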
Subjects
Computer Simulation, Molecular Models, Proteins/chemistry, Amino Acid Sequence, Animals, Artificial Intelligence, Cattle, Computer Graphics, Humans, Likelihood Functions, Statistical Models, PDZ Domains/genetics, Proteins/classification, Proteins/genetics, G Protein-Coupled Receptors/chemistry, G Protein-Coupled Receptors/classification, G Protein-Coupled Receptors/genetics, Rhodopsin/chemistry, Sequence Alignment
ABSTRACT
Dimension reduction algorithms and clustering algorithms are both frequently used techniques in visual analytics. Both families of algorithms assist analysts in performing related tasks regarding the similarity of observations and finding groups in datasets. Though these algorithms were initially used independently, recent work has incorporated algorithms from each family into the same visualization systems. However, these algorithmic combinations are often ad hoc or disconnected, working independently and in parallel rather than integrating some degree of interdependence. A number of design decisions must be addressed when employing dimension reduction and clustering algorithms concurrently in a visualization system, including the selection of each algorithm, the order in which they are processed, and how to present and interact with the resulting projection. This paper contributes an overview of combining dimension reduction and clustering into a visualization system, discussing the challenges inherent in developing a visualization system that makes use of both families of algorithms.
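One of the design decisions discussed here, the order in which the two algorithm families are applied, can be made concrete with PCA and k-means as representatives: cluster in the original space and project only for display, or project first and cluster in the 2-D view the user sees. The snippet below contrasts the two orderings on synthetic data; the algorithm choices and data are illustrative.

```python
# Sketch only: one concrete instance of the ordering decision, using PCA and
# k-means as representatives of the two algorithm families.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                          # synthetic observations

# Option A: cluster first (full space), then project only for display
labels_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
proj_a = PCA(n_components=2).fit_transform(X)           # what the user would see

# Option B: project first, then cluster in the 2-D view the user actually sees
proj_b = PCA(n_components=2).fit_transform(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(proj_b)

# The two orderings can disagree; that disagreement is itself a design signal.
print(adjusted_rand_score(labels_a, labels_b))
```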
ABSTRACT
Accurate forecasts could enable more informed public health decisions. Since 2013, CDC has worked with external researchers to improve influenza forecasts by coordinating seasonal challenges for the United States and the 10 Health and Human Services (HHS) Regions. Forecasted targets for the 2014-15 challenge were the onset week, peak week, and peak intensity of the season and the weekly percent of outpatient visits due to influenza-like illness (ILI) 1-4 weeks in advance. We used a logarithmic scoring rule to score the weekly forecasts, averaged the scores over an evaluation period, and then exponentiated the resulting average logarithmic score. Poor forecasts had a score near 0, and perfect forecasts a score of 1. Five teams submitted forecasts from seven different models. At the national level, the team scores for onset week ranged from <0.01 to 0.41, peak week ranged from 0.08 to 0.49, and peak intensity ranged from <0.01 to 0.17. The scores for predictions of ILI 1-4 weeks in advance ranged from 0.02 to 0.38 and were highest 1 week ahead. Forecast skill varied by HHS region. Forecasts can predict epidemic characteristics that inform public health actions. CDC, state and local health officials, and researchers are working together to improve forecasts.
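The scoring procedure described here, in toy form: take the probability each weekly forecast assigned to the eventually observed bin, apply the logarithmic score, average over the evaluation period, and exponentiate the result into a skill between 0 and 1. The probability floor below is an assumption standing in for the challenge's exact handling of zero or near-zero probabilities, and the weekly values are hypothetical.

```python
# Sketch only: the forecast-skill computation described (log score of the
# probability assigned to the true bin, averaged over the evaluation period,
# then exponentiated). The probability floor is an illustrative assumption.
import numpy as np

# hypothetical: probability a team assigned to the eventually observed bin, per week
p_true_bin = np.array([0.35, 0.50, 0.10, 0.45, 0.30])

log_scores = np.log(np.clip(p_true_bin, 1e-10, None))
forecast_skill = float(np.exp(log_scores.mean()))     # near 0 = poor, 1 = perfect
print(round(forecast_skill, 3))
```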
Subjects
Human Influenza/epidemiology, Seasons, Cooperative Behavior, Data Collection/statistics & numerical data, Data Collection/trends, Epidemics/statistics & numerical data, Forecasting, Humans, Public Health/statistics & numerical data, Public Health/trends, United States/epidemiology
ABSTRACT
Earlier work rigorously derived a general probabilistic model for the PCR process that includes as a special case the Velikanov-Kapral model, where all nucleotide reaction rates are the same. In this model, the probability of binding of deoxynucleoside triphosphate (dNTP) molecules with template strands is derived from the microscopic chemical kinetics. A recursive solution for the probability function of binding of dNTPs is developed for a single cycle and is used to calculate expected yield for a multicycle PCR. The model is able to reproduce important features of the PCR amplification process quantitatively. With a set of favorable reaction conditions, the amplification of the target sequence is fast enough to rapidly outnumber all side products. Furthermore, the final yield of the target sequence in a multicycle PCR run always approaches an asymptotic limit that is less than one. The amplification process itself is highly sensitive to initial concentrations and the reaction rates of addition to the template strand of each type of dNTP in the solution. This paper extends the earlier Saha model with a physics-based model of the dependence of the reaction rates on temperature, and estimates parameters in this new model by nonlinear regression. The calibrated model is validated using RT-PCR data.
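The two ingredients named in the extension, a temperature-dependent (Arrhenius-type) nucleotide-addition rate and a per-cycle amplification whose efficiency stays below one, can be sketched as follows; the rate constants, concentrations, and efficiency mapping are illustrative assumptions, not the calibrated model's fitted parameters or its full recursion.

```python
# Sketch only: an Arrhenius-type temperature dependence for the nucleotide-
# addition rate and a per-cycle yield recursion with efficiency below one.
# Constants are illustrative, not the paper's fitted parameters.
import numpy as np

R = 8.314  # gas constant, J/(mol*K)

def arrhenius(T, A=5e11, Ea=60e3):
    """Nucleotide-addition rate constant at absolute temperature T (illustrative A, Ea)."""
    return A * np.exp(-Ea / (R * T))

def per_cycle_efficiency(T, dntp=2e-4, tau=30.0):
    # probability that extension completes within the extension time tau;
    # a simple first-order approximation, not the paper's recursion
    return 1.0 - np.exp(-arrhenius(T) * dntp * tau)

def expected_yield(n_cycles, T_extension=345.0, initial_copies=1.0):
    eff = per_cycle_efficiency(T_extension)
    copies = initial_copies
    for _ in range(n_cycles):
        copies *= 1.0 + eff            # each cycle multiplies copies by (1 + efficiency)
    return copies, eff

print(expected_yield(30))
```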