ABSTRACT
Background: The generalized relevance network approach to network inference reconstructs network links based on the strength of associations between data in individual network nodes. It can reconstruct undirected networks, i.e., relevance networks, sensu stricto, as well as directed networks, referred to as causal relevance networks. The generalized approach allows the use of an arbitrary measure of pairwise association between nodes, an arbitrary scoring scheme that transforms the associations into weights of the network links, and a method for inferring the directions of the links. While this makes the approach powerful and flexible, it introduces the challenge of finding a combination of components that would perform well on a given inference task. Results: We address this challenge by performing an extensive empirical analysis of the performance of 114 variants of the generalized relevance network approach on 47 tasks of gene network inference from time-series data and 39 tasks of gene network inference from steady-state data. We compare the different variants in a multi-objective manner, considering their ranking in terms of different performance metrics. The results suggest a set of recommendations that provide guidance for selecting an appropriate variant of the approach in different data settings. Conclusions: The association measures based on correlation, combined with a particular scoring scheme of asymmetric weighting, lead to optimal performance of the relevance network approach in the general case. In the two special cases of inference tasks involving short time-series data and/or large networks, association measures based on identifying qualitative trends in the time series are more appropriate.
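As a rough illustration of the simplest (undirected) case described above, the following sketch scores every pair of nodes with an association measure and keeps the pairs whose score passes a cutoff. It is a minimal sketch, not one of the 114 variants evaluated in the study: the choice of absolute Pearson correlation, the function names, the threshold of 0.9, and the toy expression values are all assumptions made for illustration, and no direction inference or asymmetric weighting is attempted.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def relevance_network(data, threshold):
    """Return {(node_a, node_b): weight} for pairs whose |correlation| >= threshold.

    `data` maps a node name to its measurements. The scoring scheme here is
    simply the absolute correlation (an illustrative assumption).
    """
    edges = {}
    for a, b in combinations(sorted(data), 2):
        weight = abs(pearson(data[a], data[b]))
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical toy measurements for three genes (not data from the study).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
network = relevance_network(expression, threshold=0.9)
print(network)
```

With these toy values only the pair (g1, g2) is strongly correlated, so it is the only link that survives the cutoff. Swapping `pearson` for another pairwise measure changes the variant without touching the rest of the sketch.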
Subject(s)
Computational Biology/methods , Gene Expression Regulation , Gene Regulatory Networks , Computational Biology/standards , Databases, Genetic , Escherichia coli/genetics , ROC Curve
ABSTRACT
The pollution of ground and surface waters with pesticides is a serious ecological issue that requires adequate treatment. Most of the existing water pollution models are mechanistic mathematical models. While they have made a significant contribution to understanding the transfer processes, they face the problem of validation because of their complexity, the user subjectivity in their parameterization, and the lack of empirical data for validation. In addition, the data describing water pollution with pesticides are, in most cases, very imbalanced. This is due to strict regulations for pesticide applications, which lead to only a few pollution events. In this study, we propose the use of data mining to build models for assessing the risk of water pollution by pesticides in field-drained outflow water. Unlike the mechanistic models, the models generated by data mining are based on easily obtainable empirical data, while the parameterization of the models is not influenced by the subjectivity of ecological modelers. We used empirical data from field trials at the La Jaillière experimental site in France and applied the random forests algorithm to build predictive models that predict "risky" and "not-risky" pesticide application events. To address the problems of the imbalanced classes in the data, cost-sensitive learning and different measures of predictive performance were used. Despite the high imbalance between risky and not-risky application events, we managed to build predictive models that make reliable predictions. The proposed modeling approach can be easily applied to other ecological modeling problems where we encounter empirical data with highly imbalanced classes.
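The point about imbalanced classes can be made concrete with a small sketch of why plain accuracy is a poor guide here and why cost-weighted measures are used instead. Everything in it is an assumption for illustration: the 95/5 class split, the two hypothetical classifiers, and the 10:1 cost ratio between a missed risky event and a false alarm. It does not reproduce the study's random forest models or its data.

```python
def confusion(y_true, y_pred, positive="risky"):
    """Return (tp, fp, fn, tn) counts, treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    """Fraction of truly risky events that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_cost(tp, fp, fn, tn, cost_fn=10.0, cost_fp=1.0):
    """Average misclassification cost; the cost values are illustrative assumptions."""
    return (fn * cost_fn + fp * cost_fp) / (tp + fp + fn + tn)


# Hypothetical labels: 95 not-risky and 5 risky application events.
y_true = ["not-risky"] * 95 + ["risky"] * 5

# Classifier A never predicts "risky".
pred_a = ["not-risky"] * 100
# Classifier B raises 8 false alarms but catches 4 of the 5 risky events.
pred_b = ["risky"] * 8 + ["not-risky"] * 87 + ["risky"] * 4 + ["not-risky"]

for name, pred in (("A", pred_a), ("B", pred_b)):
    counts = confusion(y_true, pred)
    print(name, accuracy(*counts), recall(*counts), mean_cost(*counts))
```

Classifier A has the higher accuracy yet misses every risky event, while classifier B has lower accuracy but much better recall and a lower mean cost. Cost-sensitive learning builds this kind of asymmetric penalty into training rather than applying it only at evaluation time.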
Subject(s)
Pesticides/analysis , Water Pollutants, Chemical/analysis , Agriculture , Data Analysis , France , Models, Theoretical , Risk
ABSTRACT
The task of gene regulatory network reconstruction from high-throughput data has received increasing attention in recent years. As a consequence, many inference methods for solving this task have been proposed in the literature. It has been recently observed, however, that no single inference method performs optimally across all datasets. It has also been shown that the integration of predictions from multiple inference methods is more robust and shows high performance across diverse datasets. Inspired by this research, in this paper, we propose a machine learning solution which learns to combine predictions from multiple inference methods. While this approach adds complexity to the inference process, we expect it would also carry substantial benefits. These would come from the automatic adaptation to patterns in the outputs of individual inference methods, so that it is possible to identify regulatory interactions more reliably when these patterns occur. This article demonstrates the benefits (in terms of accuracy of the reconstructed networks) of the proposed method, which exploits an iterative, semi-supervised ensemble-based algorithm. The algorithm learns to combine the interactions predicted by many different inference methods in the multi-view learning setting. The empirical evaluation of the proposed algorithm on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae) clearly shows improved performance over the state-of-the-art methods. The results indicate that gene regulatory network reconstruction for the real datasets is more difficult for S. cerevisiae than for E. coli. The software, all the datasets used in the experiments and all the results are available for download at the following link: http://figshare.com/articles/Semi_supervised_Multi_View_Learning_for_Gene_Network_Reconstruction/1604827.
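To make the idea of integrating predictions from several inference methods concrete, the sketch below uses plain average-rank aggregation. This is a deliberately simpler, unsupervised stand-in and not the semi-supervised multi-view algorithm proposed in the paper; the method names, edge scores, and the rule for edges a method did not score are all illustrative assumptions.

```python
def rank_edges(scores):
    """Map each edge to its rank (1 = most confident) within one method's output."""
    ordered = sorted(scores, key=lambda edge: (-scores[edge], edge))
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def average_rank(predictions):
    """Combine per-method edge scores by averaging the ranks of each edge.

    `predictions` maps a method name to {edge: confidence score}. An edge that
    a method did not score receives the worst possible rank (an assumption).
    """
    all_edges = set()
    for scores in predictions.values():
        all_edges.update(scores)
    worst = len(all_edges)
    ranks = [rank_edges(scores) for scores in predictions.values()]
    return {
        edge: sum(r.get(edge, worst) for r in ranks) / len(ranks)
        for edge in all_edges
    }


# Hypothetical confidence scores from three inference methods.
predictions = {
    "method_1": {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5},
    "method_2": {("A", "B"): 0.7, ("A", "C"): 0.6, ("B", "C"): 0.1},
    "method_3": {("A", "B"): 0.4, ("A", "C"): 0.1, ("B", "C"): 0.8},
}
combined = average_rank(predictions)
consensus = sorted(combined, key=lambda edge: (combined[edge], edge))
print(consensus)
```

Ranks are averaged instead of raw scores because different methods report confidence on different scales. A learned combiner, as proposed in the paper, replaces the fixed average with a model trained on the patterns in the individual outputs.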
Subject(s)
Escherichia coli/physiology , Gene Regulatory Networks/physiology , Genes, Bacterial/physiology , Genes, Fungal/physiology , Machine Learning , Saccharomyces cerevisiae/physiology , Software
ABSTRACT
The estimation of the pollution risk of surface and ground water with plant protection products applied on fields depends strongly on the reliable prediction of the water outflows over (surface runoff) and through (discharge through sub-surface drainage systems) the soil. In previous studies, water movement through the soil has been simulated mainly using physically-based models. The most frequently used models for predicting soil water movement are MACRO, HYDRUS-1D/2D and the Root Zone Water Quality Model (RZWQM). However, these models are difficult to apply to a small portion of land because they require information about the soil and climate that is difficult to obtain for each plot separately. In this paper, we focus on improving the performance and applicability of water outflow modeling by using a modeling approach based on machine learning techniques. It allows us to overcome the major drawbacks of physically-based models, e.g., the complexity and difficulty of obtaining the information necessary for calibration and validation, by learning models from data collected from experimental fields that are representative of a wider area (region). We evaluate the proposed approach on data obtained from the La Jaillière experimental site, located in Western France. This experimental site represents one of the ten scenarios contained in the MACRO system. Our study focuses on two types of water outflows: discharge through sub-surface drainage systems and surface runoff. The results show that the proposed modeling approach successfully extracts knowledge from the collected data, avoiding the need to provide the information for calibration and validation of physically-based models. In addition, we compare the overall performance of the learned models with the performance of the existing models MACRO and RZWQM. The comparison shows overall improvement in the prediction of discharge through sub-surface drainage systems, and partial improvement in the prediction of the surface runoff, in years with intensive rainfall.