RESUMO
BACKGROUND: In recent years, researchers have made significant strides in understanding the heterogeneity of breast cancer and its various subtypes. However, the wealth of genomic and proteomic data available today necessitates efficient frameworks, instruments, and computational tools for meaningful analysis. Despite its success as a prognostic tool, the PAM50 gene signature's reliance on many genes presents challenges in terms of cost and complexity. Consequently, there is a need for more efficient methods to classify breast cancer subtypes using a reduced gene set accurately. RESULTS: This study explores the potential of achieving precise breast cancer subtype categorization using a reduced gene set derived from the PAM50 gene signature. By employing a "Few-Shot Genes Selection" method, we randomly select smaller subsets from PAM50 and evaluate their performance using metrics and a linear model, specifically the Support Vector Machine (SVM) classifier. In addition, we aim to assess whether a more compact gene set can maintain performance while simplifying the classification process. Our findings demonstrate that certain reduced gene subsets can perform comparable or superior to the full PAM50 gene signature. CONCLUSIONS: The identified gene subsets, with 36 genes, have the potential to contribute to the development of more cost-effective and streamlined diagnostic tools in breast cancer research and clinical settings.
Assuntos
Neoplasias da Mama , Humanos , Feminino , Neoplasias da Mama/genética , Neoplasias da Mama/diagnóstico , Biomarcadores Tumorais/genética , Proteômica , Perfilação da Expressão Gênica/métodos , Técnicas GenéticasRESUMO
Breast cancer is the second most common cancer type and is the leading cause of cancer-related deaths worldwide. Since it is a heterogeneous disease, subtyping breast cancer plays an important role in performing a specific treatment. Gene expression data is a viable alternative to be employed on cancer subtype classification, as they represent the state of a cell at the molecular level, but generally has a relatively small number of samples compared to a large number of genes. Gene selection is a promising approach that addresses this uneven high-dimensional matrix of genes versus samples and plays an important role in the development of efficient cancer subtype classification. In this work, an innovative outlier-based gene selection (OGS) method is proposed to select relevant genes for efficiently and effectively classify breast cancer subtypes. Experiments show that our strategy presents an F1 score of 1.0 for basal and 0.86 for her 2, the two subtypes with the worst prognoses, respectively. Compared to other methods, our proposed method outperforms in the F1 score using 80% less genes. In general, our method selects only a few highly relevant genes, speeding up the classification, and significantly improving the classifier's performance.
Assuntos
Técnicas Genéticas , Neoplasias , Feminino , HumanosRESUMO
Understanding the interpretation of machine learning (ML) models has been of paramount importance when making decisions with societal impacts, such as transport control, financial activities, and medical diagnosis. While local explanation techniques are popular methods to interpret ML models on a single instance, they do not scale to the understanding of a model's behavior on the whole dataset. In this article, we outline the challenges and needs of visually analyzing local explanations and propose SUBPLEX, a visual analytics approach to help users understand local explanations with subpopulation visual analysis. SUBPLEX provides steerable clustering and projection visualization techniques that allow users to derive interpretable subpopulations of local explanations with users' expertise. We evaluate our approach through two use cases and experts' feedback.
Assuntos
Aprendizado de Máquina , Análise por ConglomeradosRESUMO
Extracting and analyzing crime patterns in big cities is a challenging spatiotemporal problem. The hardness of the problem is linked to two main factors, the sparse nature of the crime activity and its spread in large spatial areas. Sparseness hampers most time series (crime time series) comparison methods from working properly, while the handling of large urban areas tends to render the computational costs of such methods impractical. Visualizing different patterns hidden in crime time series data is another issue in this context, mainly due to the number of patterns that can show up in the time series analysis. In this article, we present a new methodology to deal with the issues above, enabling the analysis of spatiotemporal crime patterns in a street-level of detail. Our approach is made up of two main components designed to handle the spatial sparsity and spreading of crimes in large areas of the city. The first component relies on a stochastic mechanism from which one can visually analyze probable×intensive crime hotspots. Such analysis reveals important patterns that can not be observed in the typical intensity-based hotspot visualization. The second component builds upon a deep learning mechanism to embed crime time series in Cartesian space. From the embedding, one can identify spatial locations where the crime time series have similar behavior. The two components have been integrated into a web-based analytical tool called CriPAV (Crime Pattern Analysis and Visualization), which enables global as well as a street-level view of crime patterns. Developed in close collaboration with domain experts, CriPAV has been validated through a set of case studies with real crime data in São Paulo - Brazil. The provided experiments and case studies reveal the effectiveness of CriPAV in identifying patterns such as locations where crimes are not intense but highly probable to occur as well as locations that are far apart from each other but bear similar crime patterns.
Assuntos
Gráficos por Computador , Crime , Brasil , Cidades , Fatores de TempoRESUMO
Background The trapezius muscle is often utilized as a muscle or nerve donor for repairing shoulder function in those with brachial plexus birth palsy (BPBP). To evaluate the native role of the trapezius in the affected limb, we demonstrate use of the Motion Browser, a novel visual analytics system to assess an adolescent with BPBP. Method An 18-year-old female with extended upper trunk (C5-6-7) BPBP underwent bilateral upper extremity three-dimensional motion analysis with Motion Browser. Surface electromyography (EMG) from eight muscles in each limb which was recorded during six upper extremity movements, distinguishing between upper trapezius (UT) and lower trapezius (LT). The Motion Browser calculated active range of motion (AROM), compiled the EMG data into measures of muscle activity, and displayed the results in charts. Results All movements, excluding shoulder abduction, had similar AROM in affected and unaffected limbs. In the unaffected limb, LT was more active in proximal movements of shoulder abduction, and shoulder external and internal rotations. In the affected limb, LT was more active in distal movements of forearm pronation and supination; UT was more active in shoulder abduction. Conclusion In this female with BPBP, Motion Browser demonstrated that the native LT in the affected limb contributed to distal movements. Her results suggest that sacrificing her trapezius as a muscle or nerve donor may affect her distal functionality. Clinicians should exercise caution when considering nerve transfers in children with BPBP and consider individualized assessment of functionality before pursuing surgery.
RESUMO
Boundary detection has long been a fundamental tool for image processing and computer vision, supporting the analysis of static and time-varying data. In this work, we built upon the theory of Graph Signal Processing to propose a novel boundary detection filter in the context of graphs, having as main application scenario the visual analysis of spatio-temporal data. More specifically, we propose the equivalent for graphs of the so-called Laplacian of Gaussian edge detection filter, which is widely used in image processing. The proposed filter is able to reveal interesting spatial patterns while still enabling the definition of entropy of time slices. The entropy reveals the degree of randomness of a time slice, helping users to identify expected and unexpected phenomena over time. The effectiveness of our approach appears in applications involving synthetic and real data sets, which show that the proposed methodology is able to uncover interesting spatial and temporal phenomena. The provided examples and case studies make clear the usefulness of our approach as a mechanism to support visual analytic tasks involving spatio-temporal data.
RESUMO
São Paulo is the largest city in South America, with crime rates that reflect its size. The number and type of crimes vary considerably around the city, assuming different patterns depending on urban and social characteristics of each particular location. Previous works have mostly focused on the analysis of crimes with the intent of uncovering patterns associated to social factors, seasonality, and urban routine activities. Therefore, those studies and tools are more global in the sense that they are not designed to investigate specific regions of the city such as particular neighborhoods, avenues, or public areas. Tools able to explore specific locations of the city are essential for domain experts to accomplish their analysis in a bottom-up fashion, revealing how urban features related to mobility, passersby behavior, and presence of public infrastructures (e.g., terminals of public transportation and schools) can influence the quantity and type of crimes. In this paper, we present CrimAnalyzer, a visual analytic tool that allows users to study the behavior of crimes in specific regions of a city. The system allows users to identify local hotspots and the pattern of crimes associated to them, while still showing how hotspots and corresponding crime patterns change over time. CrimAnalyzer has been developed from the needs of a team of experts in criminology and deals with three major challenges: i) flexibility to explore local regions and understand their crime patterns, ii) identification of spatial crime hotspots that might not be the most prevalent ones in terms of the number of crimes but that are important enough to be investigated, and iii) understand the dynamic of crime patterns over time. The effectiveness and usefulness of the proposed system are demonstrated by qualitative and quantitative comparisons as well as by case studies run by domain experts involving real data. The experiments show the capability of CrimAnalyzer in identifying crime-related phenomena.
RESUMO
Dataflow visualization systems enable flexible visual data exploration by allowing the user to construct a dataflow diagram that composes query and visualization modules to specify system functionality. However learning dataflow diagram usage presents overhead that often discourages the user. In this work we design FlowSense, a natural language interface for dataflow visualization systems that utilizes state-of-the-art natural language processing techniques to assist dataflow diagram construction. FlowSense employs a semantic parser with special utterance tagging and special utterance placeholders to generalize to different datasets and dataflow diagrams. It explicitly presents recognized dataset and diagram special utterances to the user for dataflow context awareness. With FlowSense the user can expand and adjust dataflow diagrams more conveniently via plain English. We apply FlowSense to the VisFlow subset-flow visualization system to enhance its usability. We evaluate FlowSense by one case study with domain experts on a real-world data analysis problem and a formal user study.
RESUMO
Geographical maps encoded with rainbow color scales are widely used by climate scientists. Despite a plethora of evidence from the visualization and vision sciences literature about the shortcomings of the rainbow color scale, they continue to be preferred over perceptually optimal alternatives. To study and analyze this mismatch between theory and practice, we present a web-based user study that compares the effect of color scales on performance accuracy for climate-modeling tasks. In this study, we used pairs of continuous geographical maps generated using climatological metrics for quantifying pairwise magnitude difference and spatial similarity. For each pair of maps, 39 scientist-observers judged: i) the magnitude of their difference, ii) their degree of spatial similarity, and iii) the region of greatest dissimilarity between them. Besides the rainbow color scale, two other continuous color scales were chosen such that all three of them covaried two dimensions (luminance monotonicity and hue banding), hypothesized to have an impact on task performance. We also analyzed subjective performance measures, such as user confidence, perceived accuracy, preference, and familiarity in using the different color scales. We found that monotonic luminance scales produced significantly more accurate judgments of magnitude difference but were not superior in spatial comparison tasks, and that hue banding had differential effects based on the task and conditions. Scientists expressed the highest preference and perceived confidence and accuracy with the rainbow, despite its poor performance on the magnitude comparison tasks. We also report on interesting interactions among stimulus conditions, tasks, and color scales, that lead to open research questions.
Assuntos
Percepção de Cores/fisiologia , Gráficos por Computador , Análise de Dados , Mapas como Assunto , Adulto , Idoso , Mudança Climática , Cor , Feminino , Humanos , Masculino , Pessoa de Meia-Idade , Análise e Desempenho de Tarefas , Adulto JovemRESUMO
The brachial plexus is a complex network of peripheral nerves that enables sensing from and control of the movements of the arms and hand. Nowadays, the coordination between the muscles to generate simple movements is still not well understood, hindering the knowledge of how to best treat patients with this type of peripheral nerve injury. To acquire enough information for medical data analysis, physicians conduct motion analysis assessments with patients to produce a rich dataset of electromyographic signals from multiple muscles recorded with joint movements during real-world tasks. However, tools for the analysis and visualization of the data in a succinct and interpretable manner are currently not available. Without the ability to integrate, compare, and compute multiple data sources in one platform, physicians can only compute simple statistical values to describe patient's behavior vaguely, which limits the possibility to answer clinical questions and generate hypotheses for research. To address this challenge, we have developed MOTION BROWSER, an interactive visual analytics system which provides an efficient framework to extract and compare muscle activity patterns from the patient's limbs and coordinated views to help users analyze muscle signals, motion data, and video information to address different tasks. The system was developed as a result of a collaborative endeavor between computer scientists and orthopedic surgery and rehabilitation physicians. We present case studies showing physicians can utilize the information displayed to understand how individuals coordinate their muscles to initiate appropriate treatment and generate new hypotheses for future research.
RESUMO
Large scale shadows from buildings in a city play an important role in determining the environmental quality of public spaces. They can be both beneficial, such as for pedestrians during summer, and detrimental, by impacting vegetation and by blocking direct sunlight. Determining the effects of shadows requires the accumulation of shadows over time across different periods in a year. In this paper, we propose a simple yet efficient class of approach that uses the properties of sun movement to track the changing position of shadows within a fixed time interval. We use this approach to extend two commonly used shadow techniques, shadow maps and ray tracing, and demonstrate the efficiency of our approach. Our technique is used to develop an interactive visual analysis system, Shadow Profiler, targeted at city planners and architects that allows them to test the impact of shadows for different development scenarios. We validate the usefulness of this system through case studies set in Manhattan, a dense borough of New York City.
RESUMO
Dynamic networks naturally appear in a multitude of applications from different fields. Analyzing and exploring dynamic networks in order to understand and detect patterns and phenomena is challenging, fostering the development of new methodologies, particularly in the field of visual analytics. In this work, we propose a novel visual analytics methodology for dynamic networks, which relies on the spectral graph wavelet theory. We enable the automatic analysis of a signal defined on the nodes of the network, making viable the robust detection of network properties. Specifically, we use a fast approximation of a graph wavelet transform to derive a set of wavelet coefficients, which are then used to identify activity patterns on large networks, including their temporal recurrence. The coefficients naturally encode the spatial and temporal variations of the signal, leading to an efficient and meaningful representation. This methodology allows for the exploration of the structural evolution of the network and their patterns over time. The effectiveness of our approach is demonstrated using usage scenarios and comparisons involving real dynamic networks.
RESUMO
Data flow systems allow the user to design a flow diagram that specifies the relations between system components which process, filter or visually present the data. Visualization systems may benefit from user-defined data flows as an analysis typically consists of rendering multiple plots on demand and performing different types of interactive queries across coordinated views. In this paper, we propose VisFlow, a web-based visualization framework for tabular data that employs a specific type of data flow model called the subset flow model. VisFlow focuses on interactive queries within the data flow, overcoming the limitation of interactivity from past computational data flow systems. In particular, VisFlow applies embedded visualizations and supports interactive selections, brushing and linking within a visualization-oriented data flow. The model requires all data transmitted by the flow to be a data item subset (i.e. groups of table rows) of some original input table, so that rendering properties can be assigned to the subset unambiguously for tracking and comparison. VisFlow features the analysis flexibility of a flow diagram, and at the same time reduces the diagram complexity and improves usability. We demonstrate the capability of VisFlow on two case studies with domain experts on real-world datasets showing that VisFlow is capable of accomplishing a considerable set of visualization and analysis tasks. The VisFlow system is available as open source on GitHub.
RESUMO
Cities are inherently dynamic. Interesting patterns of behavior typically manifest at several key areas of a city over multiple temporal resolutions. Studying these patterns can greatly help a variety of experts ranging from city planners and architects to human behavioral experts. Recent technological innovations have enabled the collection of enormous amounts of data that can help in these studies. However, techniques using these data sets typically focus on understanding the data in the context of the city, thus failing to capture the dynamic aspects of the city. The goal of this work is to instead understand the city in the context of multiple urban data sets. To do so, we define the concept of an "urban pulse" which captures the spatio-temporal activity in a city across multiple temporal resolutions. The prominent pulses in a city are obtained using the topology of the data sets, and are characterized as a set of beats. The beats are then used to analyze and compare different pulses. We also design a visual exploration framework that allows users to explore the pulses within and across multiple cities under different conditions. Finally, we present three case studies carried out by experts from two different domains that demonstrate the utility of our framework.
RESUMO
Major League Baseball (MLB) has a long history of providing detailed, high-quality data, leading to a tremendous surge in sports analytics research in recent years. In 2015, MLB.com released the StatCast spatiotemporal data-tracking system, which has been used in approximately 2,500 games since its inception to capture player and ball locations as well as semantically meaningful game events. This article presents a visualization and analytics infrastructure to help query and facilitate the analysis of this new tracking data. The goal is to go beyond descriptive statistics of individual plays, allowing analysts to study diverse collections of games and game events. The proposed system enables the exploration of the data using a simple querying interface and a set of flexible interactive visualization tools.
RESUMO
Public transportation schedules are designed by agencies to optimize service quality under multiple constraints. However, real service usually deviates from the plan. Therefore, transportation analysts need to identify, compare and explain both eventual and systemic performance issues that must be addressed so that better timetables can be created. The purely statistical tools commonly used by analysts pose many difficulties due to the large number of attributes at trip- and station-level for planned and real service. Also challenging is the need for models at multiple scales to search for patterns at different times and stations, since analysts do not know exactly where or when relevant patterns might emerge and need to compute statistical summaries for multiple attributes at different granularities. To aid in this analysis, we worked in close collaboration with a transportation expert to design TR-EX, a visual exploration tool developed to identify, inspect and compare spatio-temporal patterns for planned and real transportation service. TR-EX combines two new visual encodings inspired by Marey's Train Schedule: Trips Explorer for trip-level analysis of frequency, deviation and speed; and Stops Explorer for station-level study of delay, wait time, reliability and performance deficiencies such as bunching. To tackle overplotting and to provide a robust representation for a large numbers of trips and stops at multiple scales, the system supports variable kernel bandwidths to achieve the level of detail required by users for different tasks. We justify our design decisions based on specific analysis needs of transportation analysts. We provide anecdotal evidence of the efficacy of TR-EX through a series of case studies that explore NYC subway service, which illustrate how TR-EX can be used to confirm hypotheses and derive new insights through visual exploration.
RESUMO
Evaluation methodologies in visualization have mostly focused on how well the tools and techniques cater to the analytical needs of the user. While this is important in determining the effectiveness of the tools and advancing the state-of-the-art in visualization research, a key area that has mostly been overlooked is how well established visualization theories and principles are instantiated in practice. This is especially relevant when domain experts, and not visualization researchers, design visualizations for analysis of their data or for broader dissemination of scientific knowledge. There is very little research on exploring the synergistic capabilities of cross-domain collaboration between domain experts and visualization researchers. To fill this gap, in this paper we describe the results of an exploratory study of climate data visualizations conducted in tight collaboration with a pool of climate scientists. The study analyzes a large set of static climate data visualizations for identifying their shortcomings in terms of visualization design. The outcome of the study is a classification scheme that categorizes the design problems in the form of a descriptive taxonomy. The taxonomy is a first attempt for systematically categorizing the types, causes, and consequences of design problems in visualizations created by domain experts. We demonstrate the use of the taxonomy for a number of purposes, such as, improving the existing climate data visualizations, reflecting on the impact of the problems for enabling domain experts in designing better visualizations, and also learning about the gaps and opportunities for future visualization research. We demonstrate the applicability of our taxonomy through a number of examples and discuss the lessons learnt and implications of our findings.
RESUMO
We propose an approach for verification of volume rendering correctness based on an analysis of the volume rendering integral, the basis of most DVR algorithms. With respect to the most common discretization of this continuous model (Riemann summation), we make assumptions about the impact of parameter changes on the rendered results and derive convergence curves describing the expected behavior. Specifically, we progressively refine the number of samples along the ray, the grid size, and the pixel size, and evaluate how the errors observed during refinement compare against the expected approximation errors. We derive the theoretical foundations of our verification approach, explain how to realize it in practice, and discuss its limitations. We also report the errors identified by our approach when applied to two publicly available volume rendering packages.
RESUMO
Elucidation of transcriptional regulatory networks (TRNs) is a fundamental goal in biology, and one of the most important components of TRNs are transcription factors (TFs), proteins that specifically bind to gene promoter and enhancer regions to alter target gene expression patterns. Advances in genomic technologies as well as advances in computational biology have led to multiple large regulatory network models (directed networks) each with a large corpus of supporting data and gene-annotation. There are multiple possible biological motivations for exploring large regulatory network models, including: validating TF-target gene relationships, figuring out co-regulation patterns, and exploring the coordination of cell processes in response to changes in cell state or environment. Here we focus on queries aimed at validating regulatory network models, and on coordinating visualization of primary data and directed weighted gene regulatory networks. The large size of both the network models and the primary data can make such coordinated queries cumbersome with existing tools and, in particular, inhibits the sharing of results between collaborators. In this work, we develop and demonstrate a web-based framework for coordinating visualization and exploration of expression data (RNA-seq, microarray), network models and gene-binding data (ChIP-seq). Using specialized data structures and multiple coordinated views, we design an efficient querying model to support interactive analysis of the data. Finally, we show the effectiveness of our framework through case studies for the mouse immune system (a dataset focused on a subset of key cellular functions) and a model bacteria (a small genome with high data-completeness).
Assuntos
Biologia Computacional/métodos , Gráficos por Computador , Bases de Dados Genéticas , Redes Reguladoras de Genes , Internet , Animais , Camundongos , Fatores de Transcrição , Interface Usuário-ComputadorRESUMO
Visual data analysis often requires grouping of data objects based on their similarity. In many application domains researchers use algorithms and techniques like clustering and multidimensional scaling to extract groupings from data. While extracting these groups using a single similarity criteria is relatively straightforward, comparing alternative criteria poses additional challenges. In this paper we define visual reconciliation as the problem of reconciling multiple alternative similarity spaces through visualization and interaction. We derive this problem from our work on model comparison in climate science where climate modelers are faced with the challenge of making sense of alternative ways to describe their models: one through the output they generate, another through the large set of properties that describe them. Ideally, they want to understand whether groups of models with similar spatio-temporal behaviors share similar sets of criteria or, conversely, whether similar criteria lead to similar behaviors. We propose a visual analytics solution based on linked views, that addresses this problem by allowing the user to dynamically create, modify and observe the interaction among groupings, thereby making the potential explanations apparent. We present case studies that demonstrate the usefulness of our technique in the area of climate science.