Results 1 - 6 of 6
1.
Data Min Knowl Discov ; 37(1): 434-475, 2023.
Article in English | MEDLINE | ID: mdl-36618773

ABSTRACT

Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree grows in size, the model's interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they produce small trees. In this paper, we justify the use of a general family of impurity functions, including the popular entropy and Gini index, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting in which the objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity; this approximation factor is tight up to a constant under mild assumptions. The algorithm recursively selects the test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third, an impurity-reduction component, encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.
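As a rough illustration of the greedy criterion described in the abstract, the sketch below scores a candidate binary split as a weighted sum of a balance term, a cost-efficiency term, and an information-gain term. The component definitions and the weights are illustrative assumptions, not the paper's exact formulas.

```python
import math

def entropy(class_counts):
    """Shannon entropy of a label distribution (the impurity term)."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def greedy_split_score(n_left, n_right, counts_left, counts_right,
                       test_cost, w_balance=1.0, w_cost=1.0, w_impurity=1.0):
    """Score a split as a weighted sum of three components: balance,
    cost-efficiency, and impurity reduction (information gain).
    The component definitions here are illustrative guesses, not the
    paper's exact criterion."""
    n = n_left + n_right
    parent_counts = [l + r for l, r in zip(counts_left, counts_right)]
    balance = min(n_left, n_right) / n          # prefer even splits
    cost_eff = 1.0 / test_cost                  # prefer cheap tests
    gain = entropy(parent_counts) - (n_left / n * entropy(counts_left)
                                     + n_right / n * entropy(counts_right))
    return w_balance * balance + w_cost * cost_eff + w_impurity * gain

# Example: a 6/4 split of 10 objects produced by a test of cost 2.
print(greedy_split_score(6, 4, [5, 1], [0, 4], test_cost=2.0))
```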

2.
Data Min Knowl Discov ; 36(1): 448-476, 2022.
Article in English | MEDLINE | ID: mdl-35125932

ABSTRACT

Online social networks provide a forum where people make new connections, learn more about the world, get exposed to different points of view, and access information that was previously inaccessible. It is natural to assume that content-delivery algorithms in social networks should not only aim to maximize user engagement but also offer opportunities for increasing connectivity, enabling social networks to achieve their full potential. Our aim is to develop methods that foster the creation of new connections and, subsequently, improve the flow of information in the network. To achieve this goal, we propose to leverage the strong triadic closure principle and to treat violations of this principle as opportunities for creating more social links. We formalize this idea as an algorithmic problem related to the densest-k-subgraph problem. For this new problem, we establish hardness results and propose approximation algorithms. We identify two special cases of the problem that admit a constant-factor approximation. Finally, we experimentally evaluate our proposed algorithm on real-world social networks, and we additionally evaluate some simpler but more scalable algorithms.
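To make the strong-triadic-closure idea concrete, here is a minimal sketch that enumerates open wedges of strong ties, i.e., pairs of nodes that share a strong-tie neighbor but are not themselves connected; each such pair is a candidate new link. This covers only the violation-detection step, not the paper's densest-k-subgraph formulation or its approximation algorithms.

```python
from itertools import combinations

def stc_violations(adj, strong):
    """Enumerate strong triadic closure violations as link suggestions.

    adj:    dict mapping each node to the set of its neighbors
    strong: set of frozensets {u, v} marking edges labeled as strong ties

    Under strong triadic closure, two nodes with a common strong-tie
    neighbor should themselves be connected, so every open wedge of two
    strong edges yields a candidate new link."""
    suggestions = set()
    for u, neighbors in adj.items():
        strong_nb = [v for v in neighbors if frozenset((u, v)) in strong]
        for v, w in combinations(strong_nb, 2):
            if w not in adj[v]:          # wedge is open: the v-w edge is missing
                suggestions.add(frozenset((v, w)))
    return suggestions

# Example: a-b and a-c are strong ties, but b and c are not connected.
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
strong = {frozenset(("a", "b")), frozenset(("a", "c"))}
print(stc_violations(adj, strong))  # {frozenset({'b', 'c'})}
```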

3.
Data Min Knowl Discov ; 36(3): 1197-1218, 2022.
Article in English | MEDLINE | ID: mdl-35601821

ABSTRACT

Submodular maximization is the backbone of many important machine-learning problems, with applications to viral marketing, diversification, sensor placement, and more. However, the study of maximizing submodular functions has mainly been restricted to selecting a set of items. Many real-world applications, on the other hand, require a solution that is a ranking over a set of items. Ranking in the context of submodular-function maximization has been considered before, but to a much lesser extent than item-selection formulations. In this paper, we explore a novel formulation for ranking items with submodular valuations and budget constraints, which we call max-submodular ranking (MSR). In more detail, given a set of items and a set of non-decreasing submodular functions, where each function is associated with a budget, we aim to find a ranking of the items that maximizes the sum of values achieved by all functions under their budget constraints. For the MSR problem with cardinality- and knapsack-type budget constraints, we propose practical algorithms with approximation guarantees. In addition, we perform an empirical evaluation that demonstrates the superior performance of the proposed algorithms against strong baselines.
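The sketch below illustrates the flavor of a greedy approach to MSR with cardinality budgets, under the (assumed) reading that each function accrues value from the prefix of the ranking up to its budget: at each position, append the item with the largest total marginal gain over the functions whose budgets still cover that position. The helper names and tie-breaking are assumptions; the paper's algorithms and guarantees are more involved.

```python
def greedy_msr_ranking(items, functions, budgets):
    """Greedy sketch for max-submodular ranking with cardinality budgets.

    functions: list of set functions f(S), assumed monotone submodular
    budgets:   budgets[i] = prefix length after which function i stops
               accruing value"""
    ranking, chosen = [], set()
    remaining = set(items)
    while remaining:
        pos = len(ranking)  # next prefix position (0-indexed)
        active = [i for i, b in enumerate(budgets) if pos < b]
        def total_gain(x):
            return sum(functions[i](chosen | {x}) - functions[i](chosen)
                       for i in active)
        best = max(remaining, key=total_gain)
        ranking.append(best)
        chosen.add(best)
        remaining.remove(best)
    return ranking

# Example: two coverage functions with different budgets.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
f1 = lambda S: len(set().union(*(cover[x] for x in S)) & {1, 2, 3}) if S else 0
f2 = lambda S: len(set().union(*(cover[x] for x in S)) & {3, 4}) if S else 0
print(greedy_msr_ranking(["a", "b", "c"], [f1, f2], budgets=[2, 3]))
```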

4.
Data Min Knowl Discov ; 36(2): 709-738, 2022.
Article in English | MEDLINE | ID: mdl-35401029

ABSTRACT

When searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore the different concepts present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works that strive for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper, we study the problem of finding a diverse set of items when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions, we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement, and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method on several benchmark datasets and show that it consistently outperforms the greedy approaches commonly used in the literature.
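A minimal sketch of the independent-rounding idea, assuming a fractional solution whose marginals sum to the cardinality k: each item is kept independently with its marginal probability, and the sample is then repaired to size exactly k. The repair rule below is an illustrative assumption; the paper's analysis via Poisson-Binomial density ratios is considerably sharper than anything this sketch guarantees.

```python
import random

def independent_rounding(marginals, k, seed=None):
    """Independent randomized rounding under a cardinality constraint.

    marginals: fractional solution y with sum(y) == k (e.g. from a
               relaxation of the diversification objective)
    k:         number of items to select"""
    rng = random.Random(seed)
    picked = [i for i, y in enumerate(marginals) if rng.random() < y]
    # Repair: trim or pad (by descending marginal) to exactly k items.
    if len(picked) > k:
        picked = sorted(picked, key=lambda i: marginals[i], reverse=True)[:k]
    elif len(picked) < k:
        rest = sorted((i for i in range(len(marginals)) if i not in picked),
                      key=lambda i: marginals[i], reverse=True)
        picked += rest[:k - len(picked)]
    return sorted(picked)

print(independent_rounding([0.9, 0.6, 0.3, 0.2], k=2, seed=7))
```

Because each item is sampled independently, the rounding step parallelizes trivially across items, which is the practical advantage the abstract highlights over dependent rounding.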

5.
Big Data ; 8(5): 335-362, 2020 Oct.
Article in English | MEDLINE | ID: mdl-33017173

ABSTRACT

We study a family of pattern-detection problems in vertex-colored temporal graphs. In particular, given a vertex-colored temporal graph and a multiset of colors as a query, we search for temporal paths in the graph that contain the colors specified in the query. These problems have several applications, for example, in recommending tours for tourists or in detecting abnormal behavior in a network of financial transactions. For the family of pattern-detection problems we consider, we establish complexity results and design an algebraic-algorithmic framework based on constrained multilinear sieving. Despite the problems being NP-hard, we demonstrate that our solution scales to massive graphs with up to one billion edges for a multiset query with 5 colors, and up to 100 million edges for a multiset query with 10 colors. Our implementation, which is publicly available, exhibits practical edge-linear scalability and is highly optimized. For instance, on a real-world graph dataset with more than 6 million edges and a multiset query with 10 colors, we can extract an optimal solution in under 8 minutes on a four-core Haswell desktop.
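For intuition about the query semantics, the sketch below is a brute-force baseline that searches for a time-respecting path whose vertex colors match the query multiset. It runs in exponential time and is emphatically not the paper's constrained multilinear sieving framework, which is what achieves the reported scalability; the edge representation and the non-decreasing-timestamp rule are simplifying assumptions.

```python
from collections import Counter

def find_colorful_temporal_path(edges, colors, query):
    """Brute-force baseline: find a time-respecting path whose vertex
    colors match the color multiset `query` exactly."""
    out = {}
    for u, v, t in edges:
        out.setdefault(u, []).append((v, t))

    def dfs(path, used, last_t):
        if used == query:                 # all query colors consumed
            return list(path)
        for w, t in out.get(path[-1], []):
            cw = colors[w]
            if t >= last_t and w not in path and used[cw] < query[cw]:
                used[cw] += 1
                found = dfs(path + [w], used, t)
                used[cw] -= 1
                if found:
                    return found
        return None

    for s in colors:                      # try every admissible start vertex
        if query[colors[s]] > 0:
            res = dfs([s], Counter({colors[s]: 1}), float("-inf"))
            if res:
                return res
    return None

edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]
colors = {"a": "red", "b": "blue", "c": "red"}
print(find_colorful_temporal_path(edges, colors, Counter(red=2, blue=1)))
```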


Subjects
Algorithms, Computer Graphics, Automated Pattern Recognition, Software
6.
PLoS One ; 3(3): e1742, 2008 Mar 05.
Article in English | MEDLINE | ID: mdl-18320060

ABSTRACT

The ability to analyze and classify three-dimensional (3D) biological morphology has lagged behind the analysis of other biological data types, such as gene sequences. Here, we introduce data-mining techniques to the study of 3D biological shapes, bringing the analysis of phenomes closer to the efficiency with which genomes are studied. We compiled five training sets of highly variable morphologies of mammalian teeth from the MorphoBrowser database. Samples were labeled either by dietary class or by conventional dental type (e.g., carnassial, selenodont). We automatically extracted a multitude of topological attributes using Geographic Information Systems (GIS)-like procedures, which were then used in several combinations of feature-selection schemes and probabilistic classification models to build and optimize classifiers for predicting the labels of the training sets. In terms of classification accuracy, computational time, and the size of the feature sets used, non-repeated best-first search combined with a 1-nearest-neighbor classifier was the best approach. However, several other classification models combined with the same search scheme proved practical. The current study represents a first step in the automatic analysis of 3D phenotypes, which will become increasingly valuable as 3D morphology and phenomics databases grow.
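A minimal sketch of the pipeline described in the abstract, using scikit-learn. Forward sequential feature selection stands in for the paper's non-repeated best-first search, and random data stands in for the GIS-derived topological attributes of the MorphoBrowser teeth; only the feature-selection-plus-1-NN structure mirrors the paper.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))          # 120 specimens, 30 shape attributes
y = rng.integers(0, 4, size=120)        # 4 dietary classes

knn = KNeighborsClassifier(n_neighbors=1)
pipe = make_pipeline(
    # Greedy forward selection of 8 features, scored by the 1-NN model.
    SequentialFeatureSelector(knn, n_features_to_select=8, direction="forward"),
    knn,
)
scores = cross_val_score(pipe, X, y, cv=5)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```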


Subjects
Diet, Electronic Data Processing, Automated Pattern Recognition, Phenotype, Tooth/anatomy & histology, Tooth/physiology, Algorithms, Computational Biology, Computer Simulation, Database Management Systems