ABSTRACT
Reconstructing the topology of gene regulatory networks from gene expression data has been extensively studied. With the abundance of functional transcriptomic data now available, it is feasible to systematically decipher regulatory interaction dynamics in a logic form such as a Boolean network (BN) framework, which qualitatively indicates how multiple regulators aggregate to affect a common target gene. However, inferring both the network topology and gene interaction dynamics simultaneously remains challenging, since gene expression data are typically noisy and data discretization is prone to information loss. We propose a new method for BN inference from time-series transcriptional profiles, called LogicGep. LogicGep formulates the identification of Boolean functions as a symbolic regression problem that learns Boolean function expressions, and solves it efficiently through multi-objective optimization using an improved gene expression programming algorithm. To avoid overemphasizing dynamic characteristics at the expense of topological ones, as traditional methods often do, a set of promising Boolean formulas for each target gene is first evolved, and a feed-forward neural network trained on continuous expression data is subsequently employed to pick out the final solution. We validated the efficacy of LogicGep using multiple datasets including both synthetic and real-world experimental data. The results show that LogicGep infers accurate BN models, outperforming other representative BN inference algorithms in both network topology reconstruction and the identification of Boolean functions. Moreover, LogicGep runs hundreds of times faster than other methods, especially in the case of large network inference.
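The abstract describes scoring candidate Boolean functions against discretized time-series data. A minimal sketch of that core idea — the candidate rules, the example series, and the scoring function here are all hypothetical illustrations, not LogicGep's actual evolved formulas or fitness measure:

```python
# Hypothetical candidate Boolean rules for one target gene; a gene expression
# programming algorithm would evolve such formulas rather than enumerate them.
candidates = [
    lambda a, b, c: a and not b,
    lambda a, b, c: a or c,
    lambda a, b, c: (a or b) and not c,
]

def dynamics_consistency(rule, series, target_idx, regulator_idx):
    """Fraction of transitions where rule(state[t]) matches the target's
    state at t+1 -- a simple dynamics-based fitness for a Boolean rule."""
    hits = 0
    for t in range(len(series) - 1):
        inputs = [series[t][i] for i in regulator_idx]
        if rule(*inputs) == series[t + 1][target_idx]:
            hits += 1
    return hits / (len(series) - 1)

# Binarized expression time series: rows = time points, columns = genes a, b, c.
series = [
    (1, 0, 0),
    (1, 0, 1),
    (0, 1, 1),
    (0, 1, 0),
    (0, 0, 0),
]

# Score each candidate as a rule for gene a (column 0), with a, b, c as inputs.
scores = [dynamics_consistency(r, series, 0, (0, 1, 2)) for r in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
```

In LogicGep the ties among high-scoring formulas would then be broken by the neural network trained on the continuous data, rather than by score order as here.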
Subject(s)
Algorithms , Gene Expression Profiling , Gene Regulatory Networks , Gene Expression Profiling/methods , Humans , Transcriptome , Software , Computational Biology/methods , Neural Networks, Computer
ABSTRACT
Big data and large-scale machine learning have had a profound impact on science and engineering, particularly in fields focused on forecasting and prediction. Yet, it is still not clear how we can use the superior pattern-matching abilities of machine learning models for scientific discovery. This is because the goals of machine learning and science are generally not aligned. In addition to being accurate, scientific theories must also be causally consistent with the underlying physical process and allow for human analysis, reasoning, and manipulation to advance the field. In this paper, we present a case study on discovering a symbolic model for oceanic rogue waves from data using causal analysis, deep learning, parsimony-guided model selection, and symbolic regression. We train an artificial neural network on causal features from an extensive dataset of observations from wave buoys, while selecting for predictive performance and causal invariance. We apply symbolic regression to distill this black-box model into a mathematical equation that retains the neural network's predictive capabilities, while allowing for interpretation in the context of existing wave theory. The resulting model reproduces known behavior, generates well-calibrated probabilities, and achieves better predictive scores on unseen data than current theory. This showcases how machine learning can facilitate inductive scientific discovery and paves the way for more accurate rogue wave forecasting.
ABSTRACT
Efficiently finding covariate model structures that minimize the need for random effects to describe pharmacological data is challenging. The standard approach focuses on identification of relevant covariates, and present methodology lacks tools for automatic identification of covariate model structures. Although neural networks could potentially be used to approximate covariate-parameter relationships, such approximations are not human-readable and come at the risk of poor generalizability due to high model complexity. In the present study, a novel methodology for the simultaneous selection of covariate model structure and optimization of its parameters is proposed. It is based on symbolic regression, posed as an optimization problem with a smooth loss function. This enables training of the model through back-propagation using efficient gradient computations. Feasibility and effectiveness are demonstrated by application to a clinical pharmacokinetic data set for propofol, containing infusion and blood sample time series from 1031 individuals. The resulting model is compared to a published state-of-the-art model for the same data set. Our methodology finds a covariate model structure and corresponding parameter values with a slightly better fit, while relying on notably fewer covariates than the state-of-the-art model. Unlike contemporary practice, finding the covariate model structure is achieved without an iterative procedure involving manual interactions.
Subject(s)
Neural Networks, Computer , Propofol , Humans , Time Factors
ABSTRACT
Reproducibility is important for having confidence in evolutionary machine learning algorithms. Although the focus of reproducibility is usually to recreate an aggregate prediction error score using fixed random seeds, this is not sufficient. Firstly, multiple runs of an algorithm, without a fixed random seed, should ideally return statistically equivalent results. Secondly, it should be confirmed whether the expected behaviour of an algorithm matches its actual behaviour, in terms of how an algorithm targets a reduction in prediction error. Confirming the behaviour of an algorithm is not possible when using a total error aggregate score. Using an error decomposition framework as a methodology for improving the reproducibility of results in evolutionary computation addresses both of these factors. By estimating decomposed error using multiple runs of an algorithm and multiple training sets, the framework provides a greater degree of certainty about the prediction error. Also, decomposing error into bias, variance due to the algorithm (internal variance), and variance due to the training data (external variance) more fully characterises evolutionary algorithms. This allows the behaviour of an algorithm to be confirmed. Applying the framework to a number of evolutionary algorithms shows that their expected behaviour can be different to their actual behaviour. Identifying a behaviour mismatch is important in terms of understanding how to further refine an algorithm as well as how to effectively apply an algorithm to a problem.
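The decomposition described above — bias, internal variance from algorithm randomness, and external variance from the training data — can be estimated from a grid of predictions over multiple training sets and multiple runs. A simplified sketch for a single test point, using illustrative definitions (the framework in the abstract may define the terms differently):

```python
from statistics import mean, pvariance

def decompose(preds, y_true):
    """Decompose prediction error at one test point.

    preds[d][r]: prediction from run r of the algorithm trained on data set d.
    Returns (bias_sq, internal_var, external_var) under simple illustrative
    definitions: squared bias of the grand mean, mean within-data-set variance
    across runs, and variance of per-data-set means.
    """
    per_dataset_mean = [mean(runs) for runs in preds]
    grand_mean = mean(per_dataset_mean)
    bias_sq = (grand_mean - y_true) ** 2
    internal_var = mean(pvariance(runs) for runs in preds)  # algorithm randomness
    external_var = pvariance(per_dataset_mean)              # training-data variation
    return bias_sq, internal_var, external_var

# Two training sets, three runs each (hypothetical predictions).
preds = [[1.0, 1.2, 1.1], [0.9, 1.1, 1.0]]
b, iv, ev = decompose(preds, y_true=1.05)
```

Comparing how an algorithmic change shifts `iv` versus `ev` is what lets expected behaviour be checked against actual behaviour, rather than relying on a single aggregate error.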
Subject(s)
Algorithms , Machine Learning , Reproducibility of Results
ABSTRACT
The identification of key material parameters correlated with performance can accelerate the development of heterogeneous catalysts and unveil the relevant underlying physical processes. However, the analysis of correlations is often hindered by inconsistent data. Besides, nontrivial, yet unknown relationships may be important, and the intricacy of the various processes may be significant. Here, we tackle these challenges for the CO oxidation catalyzed by perovskites using a combination of rigorous experiments and artificial intelligence. A series of 13 ABO3 (A = La, Pr, Nd, Sm; B = Cr, Mn, Fe, Co) perovskites was synthesized, characterized, and tested in catalysis. To the resulting dataset, we applied the symbolic-regression SISSO approach. We identified an analytical expression correlated with the activity that contains the normalized unit-cell volume, the Pauling electronegativity of the elements A and B, and the ionization energy of the element B. Therefore, the activity is described by crystallographic distortions and by the chemical nature of the A and B elements. The generalizability of the identified descriptor is confirmed by the good quality of the predictions for 3 additional ABO3 and for 16 chemically more complex AMn(1-x)B'xO3 (A = La, Pr, Nd; B' = Fe, Co, Ni, Cu, Zn) perovskites.
ABSTRACT
It is common practice in the early drug discovery process to conduct in vitro screening experiments using liver microsomes in order to obtain an initial assessment of test compound metabolic stability. Compounds which bind to liver microsomes are unavailable for interaction with the drug metabolizing enzymes. As such, assessment of the unbound fraction of compound available for biotransformation is an important factor for interpretation of in vitro experimental results and to improve prediction of the in vivo metabolic clearance. Various in silico methods have been proposed for the prediction of test compound binding to microsomes, from various simple lipophilicity-based models with moderate performance to sophisticated machine learning models which demonstrate superior performance at the cost of increased complexity and higher data requirements. In this work, we attempt to strike a middle ground by developing easily implementable equations with improved predictive performance. We employ a symbolic regression approach based on a medium-size in-house data set of fraction unbound in human liver microsomes measurements allowing the identification of novel equations with improved performance. We validate the model performance on an in-house held-out test set and an external validation set.
Subject(s)
Microsomes, Liver , Humans , Microsomes, Liver/metabolism , Kinetics , Biotransformation , Metabolic Clearance Rate , Pharmaceutical Preparations/metabolism
ABSTRACT
Machine learning (ML) models were developed for understanding the root uptake of per- and polyfluoroalkyl substances (PFASs) under complex PFAS-crop-soil interactions. Three hundred root concentration factor (RCF) data points and 26 features associated with PFAS structures, crop properties, soil properties, and cultivation conditions were used for the model development. The optimal ML model, obtained by stratified sampling, Bayesian optimization, and 5-fold cross-validation, was explained by permutation feature importance, individual conditional expectation plots, and 3D interaction plots. The results showed that soil organic carbon content, pH, chemical logP, soil PFAS concentration, root protein content, and exposure time greatly affected the root uptake of PFASs, with relative importances of 0.43, 0.25, 0.10, 0.05, 0.05, and 0.05, respectively. Furthermore, these factors exhibited key threshold ranges favoring PFAS uptake. Carbon-chain length was identified as the critical molecular structure affecting root uptake of PFASs, with a relative importance of 0.12, based on the extended connectivity fingerprints. A user-friendly model was established with symbolic regression for accurately predicting RCF values of the PFASs (including branched PFAS isomerides). The present study provides a novel approach for profound insight into the uptake of PFASs by crops under complex PFAS-crop-soil interactions, aiming to ensure food safety and human health.
Subject(s)
Fluorocarbons , Water Pollutants, Chemical , Humans , Soil/chemistry , Carbon , Bayes Theorem , Fluorocarbons/analysis , Machine Learning , Water Pollutants, Chemical/analysis
ABSTRACT
In many engineering fields and scientific disciplines, the results of experiments are in the form of time series, which can be quite problematic to interpret and model. Genetic programming tools are quite powerful in extracting knowledge from data. In this work, several upgrades and refinements are proposed and tested to improve the explorative capabilities of symbolic regression (SR) via genetic programming (GP) for the investigation of time series, with the objective of extracting mathematical models directly from the available signals. The main task is not simply prediction but consists of identifying interpretable equations, reflecting the nature of the mechanisms generating the signals. The implemented improvements involve almost all aspects of GP, from the knowledge representation and the genetic operators to the fitness function. The unique capabilities of genetic programming, to accommodate prior information and knowledge, are also leveraged effectively. The proposed upgrades cover the most important applications of empirical modeling of time series, ranging from the identification of autoregressive systems and partial differential equations to the search of models in terms of dimensionless quantities and appropriate physical units. Particularly delicate systems to identify, such as those showing hysteretic behavior or governed by delayed differential equations, are also addressed. The potential of the developed tools is substantiated with both a battery of systematic numerical tests with synthetic signals and with applications to experimental data.
Subject(s)
Algorithms , Models, Theoretical , Time Factors
ABSTRACT
BACKGROUND: Heart failure is a clinical syndrome characterised by a reduced ability of the heart to pump blood. Patients with heart failure have a high mortality rate, and physicians need reliable prognostic predictions to make informed decisions about the appropriate application of devices, transplantation, medications, and palliative care. In this study, we demonstrate that combining symbolic regression with the Cox proportional hazards model improves the ability to predict death due to heart failure compared to using the Cox proportional hazards model alone. METHODS: We used a newly invented symbolic regression method called the QLattice to analyse a data set of medical records for 299 Pakistani patients diagnosed with heart failure. The QLattice identified non-linear mathematical transformations of the available covariates, which we then used in a Cox model to predict survival. RESULTS: An exponential function of age, the inverse of ejection fraction, and the inverse of serum creatinine were identified as the best risk factors for predicting heart failure deaths. A Cox model fitted on these transformed covariates had improved predictive performance compared with a Cox model on the same covariates without mathematical transformations. CONCLUSION: Symbolic regression is a way to find transformations of covariates from patients' medical records which can improve the performance of survival regression models. At the same time, these simple functions are intuitive and easy to apply in clinical settings. The direct interpretability of the simple forms may help researchers gain new insights into the actual causal pathways leading to deaths.
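The transformations named in the results — an exponential of age and inverses of ejection fraction and serum creatinine — amount to simple feature engineering before an ordinary Cox fit. A sketch of that preprocessing step; the age scaling constant is a hypothetical illustration, not the published QLattice fit:

```python
import math

def transform_covariates(age, ejection_fraction, serum_creatinine):
    """Covariate transformations of the kind identified by symbolic
    regression: exp of age, inverse ejection fraction, inverse creatinine.
    The age divisor of 50 is illustrative only."""
    return {
        "exp_age": math.exp(age / 50.0),         # hypothetical scale
        "inv_ef": 1.0 / ejection_fraction,
        "inv_creatinine": 1.0 / serum_creatinine,
    }

# These features would replace the raw covariates in a standard Cox
# proportional hazards fit (e.g. with a survival-analysis library).
features = transform_covariates(age=60, ejection_fraction=0.35,
                                serum_creatinine=1.2)
```

Because each transformed feature is a closed-form function of one clinical measurement, the fitted hazard ratios remain directly interpretable at the bedside.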
Subject(s)
Heart Failure , Humans , Proportional Hazards Models , Regression Analysis , Risk Factors , Stroke Volume
ABSTRACT
The indoor localization of people is the key to realizing "smart city" applications, such as smart homes, elderly care, and an energy-saving grid. Localization based on electrostatic information is a passive, label-free technique with a good balance of localization accuracy, system power consumption, privacy protection, and environmental friendliness. However, the physical configuration of each application scenario differs, so the transfer function from the human electrostatic potential to the sensor signal is not unique, which limits the generality of this method. Therefore, this study proposed an indoor localization method based on on-site measured electrostatic signals and symbolic regression machine learning algorithms. A remote, non-contact human electrostatic potential sensor was designed and implemented, and a prototype test system was built. Indoor localization of moving people was achieved in a 5 m × 5 m space with 80% positioning accuracy and a median absolute error of 0.4-0.6 m. This method achieved on-site calibration without requiring physical information about the actual scene. It has the advantages of low computational complexity and requires only a small amount of training data.
Subject(s)
Algorithms , Wireless Technology , Aged , Humans , Machine Learning , Movement , Static Electricity
ABSTRACT
This study designs a simple current controller employing deep symbolic regression (DSR) in a surface-mounted permanent magnet synchronous machine (SPMSM). A novel DSR-based optimal current control scheme is proposed which, after proper training and fitting, generates an analytical dynamic numerical expression that characterizes the data. This creates an understandable model and has the potential to estimate data that have not been seen before. The goal of this study was to overcome the traditional linear proportional-integral (PI) current controller, because the performance of the PI is highly dependent on the system model. Moreover, the outer speed control loop gains are tuned using the cuckoo search algorithm, which yields optimal gain values. To demonstrate the efficacy of the proposed design, we apply the control design to different test cases, that is, varied speed and load conditions as well as a sinusoidal speed reference, and compare the results with those of a traditional vector control design. Compared with traditional control approaches, we deduce that the DSR-based control design can be extrapolated far beyond the training dataset, laying the foundation for the use of deep learning techniques in power conversion applications.
ABSTRACT
China's sulfur dioxide emissions have remained substantial in recent years. To reduce them further, the key is to find the leading factors affecting sulfur dioxide emission and then take measures to control it accordingly. To investigate the influential factors across provinces, data on the sulfur dioxide emission of 30 provinces in China from 2001 to 2020 were collected. We established a symbolic regression model to explore the relationship between GDP (x1), total population (x2), total energy consumption (x3), thermal power installed capacity (x4), and sulfur dioxide emission (the dependent variable) for each province. The results show that China's total sulfur dioxide emission, and the emissions of most provinces, follow the environmental Kuznets curve (EKC). In descending order of influence, the factors affecting China's sulfur dioxide emission are GDP, total energy consumption, thermal power installed capacity, and total population. The provinces where GDP is the primary factor have the lowest average total energy consumption and average thermal power installed capacity, and their average sulfur dioxide emissions are also relatively low. These provinces show no obvious geographical pattern, whereas the provinces where total energy consumption is the primary factor are all located in southern China. Based on these results, some control measures are also put forward.
Subject(s)
Carbon Dioxide , Sulfur Dioxide , Sulfur Dioxide/analysis , Carbon Dioxide/analysis , Environmental Monitoring , China , Economic Development , Carbon
ABSTRACT
We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce, for example, monotonicity of the function over selected inputs. The aim is to find models which conform to expected behavior and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: (i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and (ii) a two-population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints, which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on both the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.
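Interval arithmetic, as mentioned above, lets a shape constraint be checked over a whole input box rather than at sampled points. A minimal sketch with a hand-written model and derivative; a real implementation would derive these automatically from the evolved expression tree:

```python
class Interval:
    """Minimal interval arithmetic for bounding a model and its
    partial derivatives over a box of inputs."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __mul__(self, other):
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

# Example model f(x, y) = x*x + x*y; its partial derivative wrt x is 2x + y.
# Monotonic increase in x over the box holds if the derivative bound is >= 0.
def dfdx_bounds(x, y):
    two = Interval(2.0, 2.0)
    return two * x + y

x_box, y_box = Interval(1.0, 3.0), Interval(0.0, 2.0)
d = dfdx_bounds(x_box, y_box)
is_monotone_increasing = d.lo >= 0.0
```

Because interval bounds are conservative, a model flagged infeasible this way is guaranteed to be rejected safely, at the cost of occasionally discarding a model that actually satisfies the constraint.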
Subject(s)
Algorithms , Biological Evolution
ABSTRACT
Understanding the dynamics of complex ecosystems is a necessary step to maintain and control them. Yet, reverse-engineering ecological dynamics remains challenging, largely due to the very broad class of dynamics that ecosystems may take. Here, this challenge is tackled through symbolic regression, a machine learning method that automatically reverse-engineers both the model structure and parameters from temporal data. Combining symbolic regression with a "dictionary" of possible ecological functional responses is shown to open the door to correctly reverse-engineering ecosystem dynamics, even in the case of poorly informative data. This strategy is validated using both synthetic and experimental data, and is found to be promising for the systematic modeling of complex ecological systems.
Subject(s)
Ecology , Models, Theoretical , Ecosystem
ABSTRACT
Interaction-Transformation (IT) is a new representation for Symbolic Regression that reduces the space of solutions to a set of expressions that follow a specific structure. The potential of this representation was illustrated in prior work with the algorithm called SymTree. This algorithm starts with a simple linear model and incrementally introduces new transformed features until a stop criterion is met. While the results obtained by this algorithm were competitive with the literature, it had the drawback of not scaling well with the problem dimension. This article introduces a mutation-only Evolutionary Algorithm, called ITEA, capable of evolving a population of IT expressions. One advantage of this algorithm is that it enables the user to specify the maximum number of terms in an expression. In order to verify the competitiveness of this approach, ITEA is compared to linear, nonlinear, and Symbolic Regression models from the literature. The results indicate that ITEA is capable of finding approximations equal to or better than those of other Symbolic Regression models, while being competitive with state-of-the-art nonlinear models. Additionally, since this representation follows a specific structure, it is possible to extract the importance of each original feature of a data set as an analytical function, enabling us to automate the explanation of any prediction. In conclusion, ITEA is competitive when compared to regression models, with the additional benefit of automating the extraction of additional information from the generated models.
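The "specific structure" of an IT expression is commonly described as a weighted sum of transformation functions applied to monomial interactions of the inputs. A sketch of evaluating such an expression — the two-term model below is a hypothetical example, not one evolved by ITEA:

```python
import math

def it_predict(x, terms, bias=0.0):
    """Evaluate an Interaction-Transformation expression:
    f(x) = bias + sum_i w_i * g_i(prod_j x_j ** k_ij).
    Each term is a (weight, transform, exponents) triple."""
    total = bias
    for weight, transform, exponents in terms:
        interaction = 1.0
        for xi, k in zip(x, exponents):
            interaction *= xi ** k
        total += weight * transform(interaction)
    return total

# Hypothetical 2-term IT model over inputs (x0, x1):
#   f(x) = 1.5 * id(x0^2 * x1) + 0.5 * log(x0 * x1^2)
terms = [
    (1.5, lambda z: z, (2, 1)),
    (0.5, math.log, (1, 2)),
]
y = it_predict((2.0, 1.0), terms)
```

The fixed structure is what makes the feature-importance extraction mentioned in the abstract tractable: each input's contribution is an analytical function of its exponents and the term weights.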
Subject(s)
Algorithms , Nonlinear Dynamics , Biological Evolution
ABSTRACT
The Gene-pool Optimal Mixing Evolutionary Algorithm (GOMEA) is a model-based EA framework that has been shown to perform well in several domains, including Genetic Programming (GP). Differently from traditional EAs where variation acts blindly, GOMEA learns a model of interdependencies within the genotype, that is, the linkage, to estimate what patterns to propagate. In this article, we study the role of Linkage Learning (LL) performed by GOMEA in Symbolic Regression (SR). We show that the non-uniformity in the distribution of the genotype in GP populations negatively biases LL, and propose a method to correct for this. We also propose approaches to improve LL when ephemeral random constants are used. Furthermore, we adapt a scheme of interleaving runs to alleviate the burden of tuning the population size, a crucial parameter for LL, to SR. We run experiments on 10 real-world datasets, enforcing a strict limitation on solution size, to enable interpretability. We find that the new LL method outperforms the standard one, and that GOMEA outperforms both traditional and semantic GP. We also find that the small solutions evolved by GOMEA are competitive with tuned decision trees, making GOMEA a promising new approach to SR.
Subject(s)
Algorithms , Biological Evolution , Genetic Linkage , Semantics
ABSTRACT
Lexicase selection is a parent selection method that considers training cases individually, rather than in aggregate, when performing parent selection. Whereas previous work has demonstrated the ability of lexicase selection to solve difficult problems in program synthesis and symbolic regression, the central goal of this article is to develop the theoretical underpinnings that explain its performance. To this end, we derive an analytical formula that gives the expected probabilities of selection under lexicase selection, given a population and its behavior. In addition, we expand upon the relation of lexicase selection to many-objective optimization methods to describe the behavior of lexicase selection, which is to select individuals on the boundaries of Pareto fronts in high-dimensional space. We show analytically why lexicase selection performs more poorly for certain sizes of population and training cases, and why it performs more poorly in continuous error spaces. To address this last concern, we propose new variants of ε-lexicase selection, a method that modifies the pass condition in lexicase selection to allow near-elite individuals to pass cases, thereby improving selection performance with continuous errors. We show that ε-lexicase outperforms several diversity-maintenance strategies on a number of real-world and synthetic regression problems.
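The selection procedure described above can be sketched compactly: filter the population case by case in a random order, and (in the ε variant) let individuals within ε of the elite error on each case survive. This is a minimal illustration, not the authors' implementation:

```python
import random

def lexicase_select(population, errors, epsilon=0.0, rng=random):
    """Select one parent by (epsilon-)lexicase selection.

    errors[i][c]: error of individual i on training case c. With epsilon == 0
    this is standard lexicase; with epsilon > 0, individuals within epsilon
    of the elite error on each case also survive the filtering step."""
    candidates = list(range(len(population)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    for c in cases:
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] <= best + epsilon]
        if len(candidates) == 1:
            break
    return population[rng.choice(candidates)]

pop = ["A", "B", "C"]
errs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
parent = lexicase_select(pop, errs, epsilon=0.0)
```

With ε = 0 in continuous error spaces, the elite on the first case almost always stands alone, which is the degenerate behavior the ε variant is designed to avoid.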
Subject(s)
Computational Biology/methods , Linguistics/statistics & numerical data , Models, Statistical , Algorithms , Humans , Regression Analysis , Search Engine/statistics & numerical data , Semantics
ABSTRACT
In genetic programming (GP), computer programs are often coevolved with training data subsets that are known as fitness predictors. In order to maximize performance of GP, it is important to find the most suitable parameters of coevolution, particularly the fitness predictor size. This is a very time-consuming process as the predictor size depends on a given application, and many experiments have to be performed to find its suitable size. A new method is proposed which enables us to automatically adapt the predictor and its size for a given problem and thus to reduce not only the time of evolution, but also the time needed to tune the evolutionary algorithm. The method was implemented in the context of Cartesian genetic programming and evaluated using five symbolic regression problems and three image filter design problems. In comparison with three different CGP implementations, the time required by CGP search was reduced while the quality of results remained unaffected.
Subject(s)
Algorithms , Biological Evolution , Software , Computational Biology/methods , Computer Simulation , Genetic Fitness , Humans , Image Enhancement/methods , Image Processing, Computer-Assisted/methods , Regression Analysis , Signal-To-Noise Ratio
ABSTRACT
The problem of the creation of numerical constants has haunted the Genetic Programming (GP) community for a long time and is still considered one of the principal open research issues. Many problems tackled by GP include finding mathematical formulas, which often contain numerical constants. It is, however, a great challenge for GP to create highly accurate constants, as their values are normally continuous while GP is intrinsically suited for combinatorial optimization. The prevailing attempts to resolve this issue either employ separate real-valued local optimizers or special numeric mutations. While the former yield better accuracy than the latter, they add to implementation complexity and significantly increase computational cost. In this paper, we propose a special numeric crossover operator for use with Robust Gene Expression Programming (RGEP). RGEP is a type of genotype/phenotype evolutionary algorithm closely related to GP, but employing linear chromosomes. Using normalized least squares error as a fitness measure, we show that the proposed operator is significantly better at finding highly accurate solutions than the existing numeric mutation operators on several symbolic regression problems. Two further advantages of the proposed operator are that it is extremely simple to implement and comes at no additional computational cost. The latter is true because the operator is integrated into an existing crossover operator and does not call for an additional cost function evaluation.
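The key property claimed above — recombining constants inside an existing crossover, with no extra fitness evaluation — can be illustrated with a blend-style crossover on a pair of constants. The blend formula below is a common generic choice, not the exact RGEP operator from the paper:

```python
import random

def numeric_constant_crossover(c1, c2, rng=random):
    """Blend two parents' numeric constants into a child constant.
    Because this runs inside the ordinary crossover step, it adds no
    extra cost function evaluation. Illustrative formula only."""
    alpha = rng.random()
    return alpha * c1 + (1.0 - alpha) * c2

rng = random.Random(0)  # seeded for reproducibility
child = numeric_constant_crossover(3.0, 5.0, rng)
```

The child always lies between the parent constants, so repeated recombination over generations can home in on a continuous value that pure combinatorial search would struggle to construct.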
ABSTRACT
A novel framework is proposed that utilizes symbolic regression via genetic programming to identify free-form partial differential equations from scarce and noisy data. The framework successfully identified ground truth models for four synthetic systems (an isothermal plug flow reactor, a continuously stirred tank reactor, a nonisothermal reactor, and viscous flow governed by Burgers' equation) from time-variant data collected at one location. A comparative analysis against the so-called weak Sparse Identification of Nonlinear Dynamics (SINDy) demonstrated the proposed framework's superior ability to identify meaningful partial differential equation (PDE) models when data was scarce. The framework was further tested for robustness to noise and scarcity, showing successful model recovery from as few as eight time series data points collected at a single point in space with 50% noise. These results emphasize the potential of the proposed framework for the discovery of PDE models when data collection is expensive or otherwise difficult.