ABSTRACT
Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, further empirical evidence is needed to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted dataset and (b) the expert-labelled dataset, finding an F1-score of about 71% for the 10-class data science step prediction problem.
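To make the transition-extraction idea concrete, the sketch below estimates first-order Markov transition probabilities from per-cell step annotations. This is an illustrative sketch, not the authors' implementation: the step labels and the two toy "notebooks" are hypothetical, and each notebook is assumed to be an ordered list of step labels, one per code cell.

```python
from collections import Counter

def transition_probabilities(notebooks):
    """Estimate first-order Markov transition probabilities between
    data science steps from ordered per-cell step annotations."""
    counts = Counter()   # (src_step, dst_step) -> number of transitions
    totals = Counter()   # src_step -> number of outgoing transitions
    for steps in notebooks:
        for src, dst in zip(steps, steps[1:]):
            counts[(src, dst)] += 1
            totals[src] += 1
    # Normalise counts per source step to obtain probabilities
    return {(s, d): c / totals[s] for (s, d), c in counts.items()}

# Two hypothetical annotated notebooks (labels are illustrative):
notebooks = [
    ["load", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "visualise", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
```

Iterative patterns, such as returning from visualisation to preprocessing, show up directly as non-zero "backward" transition probabilities in the resulting dictionary.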
ABSTRACT
Data science is an exploratory and iterative process that often leads to complex and unstructured code. This code is usually poorly documented and, consequently, hard for a third party to understand. In this paper, we first collect empirical evidence for the non-linearity of data science code from real-world Jupyter notebooks, confirming the need for new approaches that aid interaction with and comprehension of data science code. Second, we propose a visualisation method that elucidates implicit workflow information in data science code and assists data scientists in navigating the so-called garden of forking paths in non-linear code. The visualisation also provides information such as the rationale behind the code and the data science pipeline step it belongs to, based on cell annotations. We conducted a user experiment with data scientists to evaluate the proposed method, assessing the influence of (i) different workflow visualisations and (ii) cell annotations on code comprehension. Our results show that visualising the exploration helps users obtain an overview of the notebook, significantly improving code comprehension. Furthermore, our qualitative analysis provides further insights into the difficulties faced during data science code comprehension.
ABSTRACT
Epidemic spreading is a widely studied process due to its importance and potentially grave consequences for society. While the classical context of epidemic spreading refers to pathogens transmitted among humans or animals, similar ideas apply readily to the spread of information (e.g., a rumor) or of computer viruses. This paper addresses the question of how to optimally select nodes for monitoring in a network of timestamped contact events between individuals. We consider three optimization objectives: the detection likelihood, the time until detection, and the population affected by an outbreak. The optimization approach we use is based on a simple greedy algorithm and was proposed in a seminal paper focusing on information spreading and water contamination. We extend this work to the setting of disease spreading and present its application on two example networks: a timestamped network of sexual contacts and a network of animal transports between farms. We apply the optimization procedure to a large set of outbreak scenarios that we generate with a susceptible-infectious-recovered (SIR) model. We find that simple heuristics that select nodes with high degree or many contacts compare well, in terms of outbreak detection performance, with the (greedily) optimal set of nodes. Furthermore, we observe that nodes optimized on past periods may not be optimal for outbreak detection in future periods; however, seasonal effects may help in determining which past period generalizes well to a given future period. Finally, we demonstrate that detection performance depends on the simulation settings: if we force the simulator to generate larger outbreaks, detection performance improves, as larger outbreaks tend to occur in the more connected part of the network, where the top monitoring nodes are typically located. A natural progression of this work is to analyze how a representative set of outbreak scenarios can be generated, possibly taking into account more realistic propagation models.
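As an illustrative sketch of the greedy selection idea (not the paper's actual implementation), the function below greedily picks k monitoring nodes that maximise the number of outbreak scenarios detected, where each scenario is represented simply as the set of nodes it infects. The function name, the scenario representation, and the toy data are assumptions for illustration only; the real objectives also include time until detection and affected population.

```python
def greedy_monitor_selection(scenarios, candidates, k):
    """Greedily select k monitoring nodes, at each step adding the node
    that detects the most outbreak scenarios not yet covered.
    scenarios: list of sets, each the nodes infected in one simulated outbreak."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -1
        for node in candidates:
            if node in selected:
                continue
            # Marginal gain: newly detected scenarios if `node` is added
            gain = sum(1 for s in scenarios
                       if node in s and not any(m in s for m in selected))
            if gain > best_gain:
                best, best_gain = node, gain
        selected.append(best)
    return selected
```

Because detection likelihood is a coverage-type (submodular) objective, this simple greedy procedure comes with the well-known (1 - 1/e) approximation guarantee.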
ABSTRACT
The topology of animal transport networks contributes substantially to how fast and to what extent a disease can transmit between animal holdings. Therefore, public authorities in many countries mandate livestock holdings to report all movements of animals. However, the reported data often do not contain information about the exact sequence of transports, making it impossible to assess the effect of truck sharing and truck contamination on disease transmission. The aim of this study was to analyze the topology of the Swiss pig transport network by means of social network analysis and to assess the implications for disease transmission between animal holdings. In particular, we studied how additional information about transport sequences changes the topology of the contact network. The study is based on the official animal movement database in Switzerland and a sample of transport data from one transport company. The results show that the Swiss pig transport network is highly fragmented, which mitigates the risk of a large-scale disease outbreak. By considering the time sequence of transports, we found that even in the worst case, only 0.34% of all farm pairs were connected within one month. However, both network connectivity and individual connectedness of farms increased if truck sharing and especially truck contamination were considered. Therefore, the extent to which a disease may be transmitted between animal holdings may be underestimated if we only consider data from the official animal movement database. Our results highlight the need for a comprehensive analysis of contacts between farms that includes indirect contacts due to truck sharing and contamination. As the nature of animal transport networks is inherently temporal, we strongly suggest the use of temporal network measures to evaluate individual and overall risk of disease transmission through animal transportation.
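The temporal aspect emphasised above can be sketched as a time-respecting reachability computation: a disease can only spread from one farm to another along a chain of transports whose times are non-decreasing. The event format and farm names below are hypothetical, and the sketch ignores truck sharing and contamination for simplicity.

```python
def temporal_reachable(events, source, t_start):
    """Farms reachable from `source` via time-respecting transport chains.
    events: list of (time, from_farm, to_farm), sorted by time."""
    reached = {source: t_start}  # farm -> earliest arrival time
    for t, u, v in events:
        # A transport can carry infection only if the origin farm was
        # already reached at or before the transport time.
        if u in reached and reached[u] <= t:
            if v not in reached or t < reached[v]:
                reached[v] = t
    return set(reached)
```

Note how ordering matters: a transport B→C that occurs before A→B cannot connect A to C, even though a static network analysis would count A, B, and C as one connected component.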
Subjects
Communicable Diseases/transmission, Swine Diseases/transmission, Transportation, Animal Husbandry, Animals, Communicable Diseases/epidemiology, Disease Outbreaks/prevention & control, Farms, Humans, Livestock, Risk Factors, Swine, Swine Diseases/epidemiology, Switzerland/epidemiology
ABSTRACT
Big Data approaches offer potential benefits for improving animal health, but they have not been broadly implemented in livestock production systems. Privacy issues, the large number of stakeholders, and the competitive environment all make data sharing and integration a challenge in livestock production systems. The Swiss pig production industry illustrates these and other Big Data issues: it is a highly decentralized and fragmented network made up of many small, independent actors collecting large amounts of heterogeneous data. Transdisciplinary approaches hold promise for overcoming some of the barriers to implementing Big Data approaches in livestock production systems. The purpose of our paper is to describe the use of a transdisciplinary approach in a Big Data research project in the Swiss pig industry. We provide a brief overview of the research project, named "Pig Data", describing the structure of the project, the tools developed for collaboration and knowledge transfer, the data received, and some of the challenges encountered. Our experience provides insight and direction for researchers looking to use similar approaches in livestock production system research.