ABSTRACT
The concept of augmented reality (AR) assistants has captured the human imagination for decades, becoming a staple of modern science fiction. To pursue this goal, it is necessary to develop artificial intelligence (AI)-based methods that simultaneously perceive the 3D environment, reason about physical tasks, and model the performer, all in real time. Within this framework, a wide variety of sensors are needed to generate data across different modalities, such as audio, video, depth, speech, and time-of-flight. The required sensors are typically part of the AR headset, providing performer sensing and interaction through visual, audio, and haptic feedback. AI assistants not only record the performer as they perform activities, but also rely on machine learning (ML) models to understand and assist the performer as they interact with the physical world. Developing such assistants is therefore a challenging task. We propose ARGUS, a visual analytics system to support the development of intelligent AR assistants. Our system was designed as part of a multi-year collaboration between visualization researchers and ML and AR experts. This co-design process has led to advances in the visualization of ML in AR. Our system allows for online visualization of object, action, and step detection as well as offline analysis of previously recorded AR sessions. It visualizes not only the multimodal sensor data streams but also the output of the ML models. This allows developers to gain insights into the performer's activities as well as the ML models, helping them troubleshoot, improve, and fine-tune the components of the AR assistant.
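To make the kind of data such a system handles more concrete, the sketch below shows one possible way to log per-frame model outputs (object, action, and step predictions) alongside sensor timestamps so that a live session can later be replayed offline. The JSON-lines format and all field names are illustrative assumptions, not the actual ARGUS data schema.

```python
# Hypothetical sketch: logging per-frame model outputs with sensor timestamps
# so a recorded AR session can be analyzed offline. Field names and the
# JSON-lines layout are assumptions for illustration only.
import json
import time
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class FrameRecord:
    timestamp: float                                      # capture time of the sensor frame
    objects: List[dict] = field(default_factory=list)     # object detections for this frame
    actions: List[dict] = field(default_factory=list)     # action predictions for this frame
    step: str = ""                                        # current task-step estimate

def log_frame(record: FrameRecord, path: str = "session.jsonl") -> None:
    # Append one record per frame; the resulting file can be replayed later.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: logging one frame's worth of (hypothetical) model output.
log_frame(FrameRecord(
    timestamp=time.time(),
    objects=[{"label": "mug", "confidence": 0.91}],
    actions=[{"label": "pour water", "confidence": 0.78}],
    step="step 3: add hot water",
))
```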
ABSTRACT
Limited access to high-quality data is an important barrier in the digital analysis of urban settings, including applications within computer vision and urban design. Diverse forms of data collected from sensors in areas of high activity in the urban environment, particularly at street intersections, are valuable resources for researchers interpreting the dynamics between vehicles, pedestrians, and the built environment. In this paper, we present a high-resolution audio, video, and LiDAR dataset of three urban intersections in Brooklyn, New York, totaling almost eight hours of unique data. The data were collected with custom Reconfigurable Environmental Intelligence Platform (REIP) sensors that were designed with the ability to accurately synchronize multiple video and audio inputs. The resulting data are novel in that they are multimodal, multi-angular, high-resolution, and synchronized. We demonstrate four ways the data can be used: (1) to discover and locate occluded objects using multiple sensors and modalities, (2) to associate audio events with their respective visual representations using both the video and audio modalities, (3) to track the number of each type of object in a scene over time, and (4) to measure pedestrian speed using multiple synchronized camera views. In addition to these use cases, our data are available for other researchers to carry out analyses related to applying machine learning to understanding the urban environment, for which existing datasets may be inadequate, such as pedestrian-vehicle interaction modeling and pedestrian attribute recognition. Such analyses can help inform decisions made in the context of urban sensing and smart cities, including accessibility-aware urban design and Vision Zero initiatives.
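As a concrete illustration of use case (3), the sketch below counts the number of each object type per sampled frame of a single camera view using an off-the-shelf detector. The video file name, sampling stride, and score threshold are illustrative assumptions and are not part of the released dataset or REIP tooling.

```python
# Hypothetical sketch of use case (3): counting object types over time in one
# camera view with a pretrained detector. File name, stride, and threshold are
# assumptions for illustration only.
from collections import Counter
import cv2
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import to_tensor

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
labels = weights.meta["categories"]          # COCO class names

counts_over_time = []                        # one Counter of class names per sampled frame
cap = cv2.VideoCapture("intersection_view.mp4")  # assumed file name
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:                  # sample roughly once per second at 30 fps
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            det = model([to_tensor(rgb)])[0]
        keep = det["scores"] > 0.6           # drop low-confidence detections
        names = [labels[i] for i in det["labels"][keep].tolist()]
        counts_over_time.append(Counter(names))
    frame_idx += 1
cap.release()
```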
ABSTRACT
Recognizing the importance of road infrastructure to promote human health and economic development, actors around the globe are regularly investing in both new roads and road improvements. However, in many contexts there is a sparsity, or complete lack, of accurate information regarding existing road infrastructure, making it difficult to identify where investments should be made. Previous literature has focused on overcoming this gap through the use of satellite imagery to detect and map roads. In this paper, we extend this literature by leveraging satellite imagery to estimate road quality and concomitant information about travel speed. We adopt a transfer learning approach in which a convolutional neural network architecture is first trained on data collected in the United States (where data is readily available), and then "fine-tuned" on an independent, smaller dataset collected from Nigeria. We test and compare eight different convolutional neural network architectures using a dataset of 53,686 images of 2,400 kilometers of roads in the United States, in which each road segment is labeled as "low", "middle", or "high" quality using an open, cellphone-based measuring platform. Using satellite imagery to estimate these classes, we achieve an accuracy of 80.0%, with 99.4% of predictions falling within the actual or an adjacent class. The highest-performing base model was then applied to a preliminary case study in Nigeria, using a dataset of 1,000 images of paved and unpaved roads. By fine-tuning our US model on this Nigeria-specific data, we were able to achieve an accuracy of 94.0% in predicting the quality of Nigerian roads. A continuous estimate also showed the ability, on average, to predict road quality to within 0.32 on a 0-to-3 scale (with higher values indicating higher levels of quality).
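The two-stage transfer-learning procedure described above can be sketched as follows. The backbone (a torchvision ResNet-50 used as a stand-in for the eight candidate architectures), the directory layout, and all hyperparameters are assumptions for illustration; they are not the configuration reported in the study.

```python
# Hypothetical sketch of the two-stage transfer-learning setup:
# train on US road-quality data, then fine-tune on a smaller Nigeria dataset.
# Backbone, paths, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_CLASSES = 3  # "low", "middle", "high" road quality

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def make_loader(root: str) -> DataLoader:
    # Expects an ImageFolder layout: root/low, root/middle, root/high
    return DataLoader(datasets.ImageFolder(root, transform=preprocess),
                      batch_size=32, shuffle=True)

def train(model: nn.Module, loader: DataLoader, epochs: int, lr: float) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: train on the (large) US road-quality dataset.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace ImageNet head
model = train(model, make_loader("us_roads/"), epochs=10, lr=1e-4)

# Stage 2: fine-tune on the smaller Nigeria dataset with a lower learning rate.
model = train(model, make_loader("nigeria_roads/"), epochs=5, lr=1e-5)
```

Freezing the earlier convolutional layers during stage 2 is another common variant when the target dataset is small; the sketch fine-tunes all weights for simplicity.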