Results 1 - 10 of 10
1.
Adv Neural Inf Process Syst ; 36: 56673-56699, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38623077

ABSTRACT

In real applications, interaction between machine learning models and domain experts is critical; however, the classical machine learning paradigm that usually produces only a single model does not facilitate such interaction. Approximating and exploring the Rashomon set, i.e., the set of all near-optimal models, addresses this practical challenge by providing the user with a searchable space containing a diverse set of models from which domain experts can choose. We present algorithms to efficiently and accurately approximate the Rashomon set of sparse, generalized additive models with ellipsoids for fixed support sets and use these ellipsoids to approximate Rashomon sets for many different support sets. The approximated Rashomon set serves as a cornerstone to solve practical challenges such as (1) studying the variable importance for the model class; (2) finding models under user-specified constraints (monotonicity, direct editing); and (3) investigating sudden changes in the shape functions. Experiments demonstrate the fidelity of the approximated Rashomon set and its effectiveness in solving practical challenges.
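The abstract does not spell out the construction, but the general idea of approximating a Rashomon set for a fixed support set with an ellipsoid can be sketched as a second-order Taylor expansion of the loss around the fitted coefficients. The sketch below is an assumption-laden illustration (plain logistic loss, gradient-descent fit, illustrative names), not the authors' implementation.

```python
# Hypothetical sketch: approximate the Rashomon set of a model with a fixed
# support set by an ellipsoid from a second-order expansion of the loss around
# the optimum. All names and the fitting routine are illustrative assumptions.
import numpy as np

def logistic_loss(w, X, y):
    """Mean logistic loss; y in {0, 1}."""
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z)

def fit_logistic(X, y, iters=200, lr=0.5):
    """Plain gradient descent on the fixed support (illustrative only)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def rashomon_ellipsoid(X, y, eps=0.05):
    """Return (w_star, H, radius) describing
    { w : (w - w_star)^T H (w - w_star) / 2 <= radius },
    a quadratic approximation of { w : loss(w) <= (1 + eps) * loss(w_star) }."""
    w_star = fit_logistic(X, y)
    p = 1.0 / (1.0 + np.exp(-(X @ w_star)))
    H = (X * (p * (1 - p))[:, None]).T @ X / len(y)   # Hessian of the mean loss
    radius = eps * logistic_loss(w_star, X, y)
    return w_star, H, radius

def in_rashomon_set(w, w_star, H, radius):
    d = w - w_star
    return 0.5 * d @ H @ d <= radius
```

A membership test like `in_rashomon_set` is the kind of primitive that lets a user search the ellipsoid for models satisfying extra constraints (e.g., monotonicity) without refitting from scratch.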

2.
Proc AAAI Conf Artif Intell ; 37(9): 11270-11279, 2023 Jun.
Article in English | MEDLINE | ID: mdl-38650922

ABSTRACT

Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on the optimal solution to the k-means clustering problem on one-dimensional data. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly correlated features.
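The intuition behind the k-means lower bound can be sketched as follows: a regression tree with at most k leaves partitions any subset of labels into at most k groups, each predicted by its mean, so its training SSE can never beat the optimal 1-D k-means objective on those labels. The dynamic program below computes that objective exactly; it is a minimal sketch with assumed names, not the paper's code.

```python
# Hypothetical sketch: exact 1-D k-means within-cluster SSE via dynamic
# programming, usable as a lower bound on the SSE of any k-leaf regression tree.
import numpy as np

def kmeans_1d_sse(values, k):
    """Minimum within-cluster SSE for 1-D data. Optimal clusters of sorted 1-D
    data are contiguous, so an O(k * n^2) dynamic program suffices."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))    # prefix sums of squares

    def sse(i, j):  # SSE of x[i:j] predicted by its mean (j exclusive)
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    dp = np.full((k + 1, n + 1), np.inf)
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            dp[c][j] = min(dp[c - 1][i] + sse(i, j) for i in range(c - 1, j))
    return dp[k][n]

# Any regression tree with at most k leaves trained on labels y has training
# SSE >= kmeans_1d_sse(y, k), which can prune branches in a bounded search.
```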

3.
Proc AAAI Conf Artif Intell ; 36(9): 9604-9613, 2022.
Article in English | MEDLINE | ID: mdl-36051654

ABSTRACT

Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960s, breakthroughs have been made only within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. The guesses come from knowledge gleaned from black box models. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.
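One plausible instantiation of the "guess how to bin continuous features" idea is to harvest the split thresholds used by a boosted tree ensemble and binarize each feature at those thresholds before running the optimal-tree solver. The sketch below assumes scikit-learn and illustrative function names; it is not the authors' code.

```python
# Minimal sketch (assumed details): extract split thresholds from a gradient
# boosting model and use them to binarize continuous features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def guess_thresholds(X, y, n_estimators=40, max_depth=1):
    """Return {feature_index: sorted thresholds} used by the fitted ensemble."""
    gbm = GradientBoostingClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth).fit(X, y)
    guesses = {}
    for est in gbm.estimators_.ravel():
        t = est.tree_
        for f, thr in zip(t.feature, t.threshold):
            if f >= 0:                      # internal (non-leaf) node
                guesses.setdefault(f, set()).add(thr)
    return {f: np.sort(list(v)) for f, v in guesses.items()}

def binarize(X, guesses):
    """Encode each guessed threshold as an indicator column X[:, f] <= thr."""
    cols = [(X[:, f] <= thr).astype(int)
            for f, thrs in guesses.items() for thr in thrs]
    return np.column_stack(cols)
```

The binarized matrix is much smaller than the full set of all possible thresholds, which is where the run-time savings would come from under this reading.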

4.
Proc Mach Learn Res ; 151: 9304-9333, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35601052

ABSTRACT

We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
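To illustrate the claim that the exponential loss permits an analytical line search, the sketch below performs coordinate descent with {0,1} features and {-1,+1} labels, where the optimal step along one coordinate has the familiar closed form 0.5·log(W+/W-). It omits the sparsity penalty, surrogate cuts, and priority queue described above; names and parameters are assumptions.

```python
# Minimal sketch (an assumed reading of the abstract, not the authors' code):
# coordinate descent on the exponential loss with a closed-form line search.
import numpy as np

def coordinate_descent_exp_loss(X, y, n_passes=20, eps=1e-12):
    """X: (n, d) binary features in {0, 1}; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    margins = np.zeros(n)                     # y_i * f(x_i)
    for _ in range(n_passes):
        for j in range(d):
            on = X[:, j] == 1
            r = np.exp(-margins[on])          # current sample weights
            w_pos = np.sum(r[y[on] == 1])
            w_neg = np.sum(r[y[on] == -1])
            # Closed-form minimizer of w_pos * e^{-delta} + w_neg * e^{delta}
            delta = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
            w[j] += delta
            margins[on] += y[on] * delta
    return w
```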

5.
CEUR Workshop Proc ; 3318, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36970634

ABSTRACT

Sparse decision trees are one of the most common forms of interpretable models. While recent advances have produced algorithms that fully optimize sparse decision trees for prediction, that work does not address policy design, because the algorithms cannot handle weighted data samples. Specifically, they rely on the discreteness of the loss function, which means that real-valued weights cannot be directly used. For example, none of the existing techniques produce policies that incorporate inverse propensity weighting on individual data points. We present three algorithms for efficient sparse weighted decision tree optimization. The first approach directly optimizes the weighted loss function; however, it tends to be computationally inefficient for large datasets. Our second approach, which scales more efficiently, transforms weights to integer values and uses data duplication to transform the weighted decision tree optimization problem into an unweighted (but larger) counterpart. Our third algorithm, which scales to much larger datasets, uses a randomized procedure that samples each data point with a probability proportional to its weight. We present theoretical bounds on the error of the two fast methods and show experimentally that these methods can be two orders of magnitude faster than the direct optimization of the weighted loss, without losing significant accuracy.
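The two faster approaches reduce weighted optimization to the unweighted case by transforming the dataset itself. A minimal sketch of both transformations is below; the scaling rule, parameter names, and sampling details are assumptions for illustration.

```python
# Sketch of the two dataset transformations described above (illustrative only).
import numpy as np

def duplicate_by_weight(X, y, w, scale=10):
    """Approach 2: scale weights, round to integers, and duplicate each row
    that many times, yielding an unweighted (but larger) dataset."""
    counts = np.maximum(np.rint(np.asarray(w) * scale).astype(int), 0)
    idx = np.repeat(np.arange(len(y)), counts)
    return X[idx], y[idx]

def sample_by_weight(X, y, w, n_samples, seed=0):
    """Approach 3: draw n_samples points with probability proportional to
    their weights, so an unweighted learner sees the weighted distribution
    in expectation."""
    rng = np.random.default_rng(seed)
    p = np.asarray(w, dtype=float)
    idx = rng.choice(len(y), size=n_samples, replace=True, p=p / p.sum())
    return X[idx], y[idx]
```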

6.
Adv Neural Inf Process Syst ; 35: 14071-14084, 2022.
Article in English | MEDLINE | ID: mdl-37786624

ABSTRACT

In any given machine learning problem, there might be many models that explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what could be expressed by a loss function. The Rashomon set is the set of all such almost-optimal models. Rashomon sets can be large and complicated in structure, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the dataset. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.
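The first application (set-level variable importance) can be sketched independently of the enumeration machinery: given an already-enumerated collection of candidate trees and their objective values, keep those within a factor (1 + eps) of the best and count how often each feature is used. The tree representation below is an assumption, not the paper's specialized data structure.

```python
# Hypothetical sketch of one downstream use of an enumerated Rashomon set.
from dataclasses import dataclass, field

@dataclass
class Tree:
    objective: float                              # regularized training loss
    features_used: set = field(default_factory=set)

def rashomon_subset(trees, eps=0.05):
    """Trees whose objective is within (1 + eps) of the best objective."""
    best = min(t.objective for t in trees)
    return [t for t in trees if t.objective <= (1 + eps) * best]

def feature_frequency(trees, n_features, eps=0.05):
    """Fraction of near-optimal trees that split on each feature: a simple
    set-level notion of variable importance."""
    rset = rashomon_subset(trees, eps)
    return [sum(f in t.features_used for t in rset) / len(rset)
            for f in range(n_features)]
```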

7.
Patterns (N Y) ; 1(2): 100016, 2020 May 08.
Article in English | MEDLINE | ID: mdl-33205093

ABSTRACT

Data provenance is a machine-readable summary of the collection and computational history of a dataset. Data provenance adds value to a dataset, helps reproduce computational analyses, and helps validate scientific conclusions. The End-to-End Provenance Project is a community of professionals who have developed software tools to collect and use data provenance.

8.
Learn Health Syst ; 4(3): e10217, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32685685

ABSTRACT

PROBLEM: Smartphone applications are an increasingly useful part of patients' self-management of chronic health conditions. Asthma is a common chronic health condition for which good self-management by patients is very helpful in maintaining stability. User-centered design and intelligent systems that learn are steps toward building applications that are more effective in providing quality care that is scalable and tailored to each patient. METHODS: We conducted a literature and application-store search to review historic and current asthma smartphone applications. User-centered design is a methodology that involves all stakeholders of a proposed system from the beginning of the design phase through installation. As part of this user-centered approach, we conducted focus groups with patients and health care providers to determine which features they want in such applications and to create a model for building smart infrastructure for a learning health care system. We then designed and built a simple prototype asthma smartphone application with basic functionality. OUTCOMES: Only one publication in the literature review of asthma smartphone applications describes both user-centered design and intelligent learning systems. We present a set of user-desired attributes for a smart health care application and a possible data-flow diagram of information for a learning system. The prototype, a simple user-centered asthma smartphone application that better assists patients in their care, illustrates the value of the proposed architecture. DISCUSSION: Our user-centered approach helped us design and implement a learning prototype smartphone application that helps patients better manage their asthma and provides information to clinical care providers. Although popular in other industries, user-centered design has been slow to gain adoption in health care. Its popularity is increasing, however, and will hopefully result in mobile applications that better meet the needs of both patients and their care providers.

9.
Sci Data ; 4: 170114, 2017 Sep 05.
Article in English | MEDLINE | ID: mdl-28872630

ABSTRACT

In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of reproducibility. Although there are many dimensions to this issue, we believe that there is a lack of formalism in describing the end-to-end path from data source to analysis to final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications, and researchers.

10.
IEEE Trans Vis Comput Graph ; 19(12): 2476-85, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24051814

ABSTRACT

Effective visualizations of filesystem provenance data are valuable for understanding its complex hierarchical structure. The most common visual representation of provenance data is the node-link diagram. While effective for understanding local activity, the node-link diagram fails to offer a high-level summary of activity and inter-relationships within the data. We present a new tool, InProv, which displays filesystem provenance with an interactive radial tree layout. The tool also uses a new time-based hierarchical node grouping method that we developed for filesystem provenance data to match users' mental models and make data exploration more intuitive. We compared InProv to a conventional node-link tool, Orbiter, in a quantitative evaluation with real users of filesystem provenance data, including provenance data experts, IT professionals, and computational scientists; in the same evaluation, we also compared our new node grouping method to a conventional method. The results demonstrate that InProv yields higher accuracy than Orbiter in identifying system activity on large, complex data sets. They also show that the new time-based hierarchical node grouping method improves performance in both tools, and participants found both tools significantly easier to use with it. Subjective measures show that participants found InProv to require less mental activity, less physical activity, and less work, and to be less stressful to use. Our study also reveals one of the first cases of gender differences in visualization: both genders had comparable performance with InProv, but women had a significantly lower average accuracy (56%) with Orbiter compared to men (70%).
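The abstract does not describe the grouping rules in detail, but the basic idea of time-based hierarchical grouping can be sketched as bucketing provenance events into nested time intervals that a radial layout can then summarize. The bucket sizes and data shapes below are assumptions, not InProv's actual method.

```python
# Very loose sketch of time-based hierarchical grouping of provenance events.
from collections import defaultdict
from datetime import datetime

def group_by_time(events, levels=("%Y-%m-%d", "%H", "%M")):
    """events: list of (timestamp: datetime, node_id). Returns a nested dict
    keyed by day, then hour, then minute, suitable for a hierarchical layout."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for ts, node in events:
        day, hour, minute = (ts.strftime(fmt) for fmt in levels)
        tree[day][hour][minute].append(node)
    return tree

# Example:
# group_by_time([(datetime(2013, 6, 1, 9, 30), "proc:42")])
# -> {"2013-06-01": {"09": {"30": ["proc:42"]}}}
```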


Subject(s)
Computer Graphics; Databases, Factual; Information Storage and Retrieval/methods; Multimodal Imaging/methods; Pattern Recognition, Visual/physiology; Software; User-Computer Interface; Adult; Algorithms; Artificial Intelligence; Female; Humans; Image Enhancement/methods; Male; Task Performance and Analysis