ABSTRACT
Air pollution by particulate matter (PM) is one of the largest threats to human health. To understand the causes of PM pollution and enact suitable countermeasures, reliable predictions of future PM concentrations are required. The scientific literature offers many methods for machine learning (ML)-based PM prediction, but their quality is difficult to compare because, among other things, they use different data sets and evaluate the resulting predictions differently. For a new data set, it is not apparent which of the existing prediction methods is best suited. To ease the assessment of such models, we present evalPM, a framework to easily create, evaluate, and compare different ML models for immission-based PM prediction. To achieve this, the framework provides flexibility regarding data sets, input features, target variables, model types, hyperparameters, and model evaluation. It has a modular design consisting of several components, each providing at least one of the required flexibilities. The capabilities and advantages of the framework are demonstrated using 16 different models from the related literature for temporal prediction of PM concentrations on four European data sets. The results show that the framework allows fast creation and evaluation of ML-based PM prediction models.
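The comparison problem described in this abstract (different data sets, different evaluation procedures) can be illustrated with a minimal sketch in which every model is scored on the same temporal split with the same metric. This is not the evalPM API: the function names (`rmse`, `persistence`, `train_mean`, `evaluate`) are hypothetical, the two baseline models are stand-ins for real ML models, and the PM series is synthetic.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def persistence(train, test):
    """Hypothetical baseline: predict each step with the previously observed value."""
    return [train[-1]] + test[:-1]


def train_mean(train, test):
    """Hypothetical baseline: always predict the mean of the training period."""
    mean = sum(train) / len(train)
    return [mean] * len(test)


def evaluate(models, series, train_fraction=0.75):
    """Score every model on the same chronological split with the same metric."""
    split = int(len(series) * train_fraction)
    train, test = series[:split], series[split:]
    return {name: rmse(test, model(train, test)) for name, model in models.items()}


# Synthetic daily PM concentrations, for illustration only.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]

scores = evaluate({"persistence": persistence, "train_mean": train_mean}, series)
best = min(scores, key=scores.get)
print(scores, best)
```

Because the split and the metric are fixed in one place, adding another model only requires another entry in the dictionary, which is the kind of like-for-like comparison the abstract argues is missing from the literature.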
Subjects
Air Pollutants , Air Pollution , Humans , Particulate Matter/analysis , Air Pollutants/analysis , Environmental Monitoring/methods , Air Pollution/analysis , Machine Learning
ABSTRACT
Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because such layouts support projection and selection operations and thus allow less data to be read from disk when required. However, distributed processing frameworks (e.g., Hadoop, Spark) do not exploit this well when analytical queries are posed. These frameworks divide the data into multiple partitions and process each partition in a separate task, consequently creating tasks based on the total file size rather than the actual size of the data to be read. This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and wastes significant computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1× speedup compared with default solutions.
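The gap between "tasks by file size" and "tasks by data actually read" can be made concrete with a small sketch. It is not the authors' cost model or their multi-objective optimizer: it assumes a simplified hybrid layout in which each row group stores per-column byte sizes and min/max statistics, and it derives the task count from a single hypothetical `bytes_per_task` budget. All names and numbers are illustrative.

```python
import math

MB = 1024 * 1024


def estimate_bytes_read(row_groups, projected, predicate=None):
    """Estimate bytes scanned when only some columns and row groups are needed.

    row_groups: list of dicts mapping column name -> {"bytes", "min", "max"}.
    projected:  column names the query returns (projection).
    predicate:  optional (column, low, high) range filter (selection).
    """
    needed = set(projected)
    if predicate is not None:
        needed.add(predicate[0])  # the filter column must be read as well
    total = 0
    for group in row_groups:
        if predicate is not None:
            column, low, high = predicate
            stats = group[column]
            if stats["max"] < low or stats["min"] > high:
                continue  # min/max statistics prove no row can match
        total += sum(group[name]["bytes"] for name in needed)
    return total


def tasks_for(num_bytes, bytes_per_task):
    """Number of tasks needed to cover num_bytes, with at least one task."""
    return max(1, math.ceil(num_bytes / bytes_per_task))


# Four illustrative row groups of 128 MB each, with disjoint id ranges.
row_groups = [
    {
        "id": {"bytes": 10 * MB, "min": i * 100, "max": i * 100 + 99},
        "price": {"bytes": 20 * MB, "min": 0, "max": 1000},
        "comment": {"bytes": 98 * MB, "min": "", "max": "~"},
    }
    for i in range(4)
]

file_bytes = sum(col["bytes"] for group in row_groups for col in group.values())
read_bytes = estimate_bytes_read(row_groups, ["price"], predicate=("id", 0, 99))

file_tasks = tasks_for(file_bytes, 128 * MB)  # sizing by total file size
read_tasks = tasks_for(read_bytes, 128 * MB)  # sizing by estimated data read
print(file_tasks, read_tasks)
```

In this toy setting, sizing by file size launches four tasks, while the query only touches 30 MB and fits in one; the method in the abstract additionally trades off execution time against resource usage rather than using a fixed per-task budget.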
Subjects
Big Data , Data Management/methods , Information Storage and Retrieval , Software , Algorithms
ABSTRACT
Mass media reports attribute the occurrence of decomposed or mummified corpses in a domestic setting mainly to the increasing social isolation of elderly people. Not much is known about the demographic and medical conditions under which individuals are found months or even years after death in their homes. For this study, autopsy reports of individuals found dead and mummified or decomposed between 1993 and 1997 were retrospectively analyzed and compared with those from 1963 to 1967. Between 1993 and 1997, a total of 320 individuals were found decomposed at home, compared with 412 such cases between 1963 and 1967. The proportion of individuals older than 64 years was significantly higher during the 1990s study period. Furthermore, the proportion of deaths attributable to natural causes was significantly lower during the 1990s, whereas the rate of suicides was nearly three times higher.