ABSTRACT
A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik-Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n ≫ 1/ε, where ε is the approximation error, improving, e.g., both [ICML'17] and its applications to principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement in practice as well. Open source code is provided.
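To make the vector summarization goal concrete, the following is a minimal illustrative sketch, not the paper's algorithm: it approximates the sum of n input vectors by a uniformly sampled, inverse-probability-weighted subset and reports the error normalized by the sum of the input norms. The function name, sample size, and this particular normalization are our own choices.

```python
import numpy as np

def uniform_sum_estimate(P, m, rng=None):
    """Approximate the sum of the rows of P by m uniformly sampled rows,
    each reweighted by the inverse sampling probability n/m."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(P)
    idx = rng.choice(n, size=m, replace=True)
    weights = np.full(m, n / m)
    return weights @ P[idx]

P = np.random.default_rng(1).normal(size=(100_000, 10))
exact = P.sum(axis=0)
approx = uniform_sum_estimate(P, m=1_000)
err = np.linalg.norm(exact - approx) / np.linalg.norm(P, axis=1).sum()
print(f"normalized error of the sampled summary: {err:.5f}")
```

A coreset in the sense of this abstract replaces the uniform sample with a carefully chosen weighted subset whose size depends only on ε, but the quantity being approximated is the same.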
ABSTRACT
A coreset of a dataset is a small weighted set such that querying the coreset provably yields a (1 + ε)-factor approximation to the original (full) dataset, for a given family of queries. This paper suggests accurate coresets (ε = 0) that are subsets of the input for fundamental optimization problems. These coresets enabled us to implement a "Guardian Angel" system that computes pose estimation at a rate of more than 20 frames per second. It tracks a toy quadcopter that guides guests in a supermarket, hospital, mall, airport, and so on. We prove that any set of n matrices in R^{d×d} whose sum is a matrix S of rank r has a coreset whose sum has the same left and right singular vectors as S, and that this coreset consists of O(dr) = O(d²) matrices, independent of n. This implies the first (exact, weighted subset) coreset of O(d²) points for problems such as linear regression, PCA/SVD, and Wahba's problem, with corresponding streaming, dynamic, and distributed versions. Our main tool is a novel use of the Caratheodory Theorem for coresets: an algorithm that computes this set in time linear in its cardinality. Extensive experimental results on both synthetic and real data, a companion video of our system, and open source code are provided.
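As a rough illustration of the reduction behind this claim (our own sketch, not the paper's construction): flattening each d × d matrix into a d²-dimensional vector lets a Caratheodory-type routine, such as the classic one sketched after the next abstract, return at most d² + 1 matrices whose weighted sum equals S exactly; the paper's coreset is stronger, with only O(dr) matrices and matching singular vectors.

```python
import numpy as np

def matrix_coreset(matrices, weights):
    """Reduce n weighted d x d matrices to at most d*d + 1 of them with the same
    weighted sum, by flattening each matrix to a d^2-vector and applying a
    Caratheodory construction (`caratheodory` is the helper sketched after the
    next abstract; it returns selected indices and their new weights)."""
    n, d, _ = matrices.shape
    flat = matrices.reshape(n, d * d)              # each matrix as a d^2-dimensional point
    sub_idx, sub_w = caratheodory(flat, weights)   # indices of <= d^2 + 1 points, new weights
    return matrices[sub_idx], sub_w
```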
ABSTRACT
Least-mean-squares (LMS) solvers such as linear/ridge regression and SVD not only solve fundamental machine learning problems, but are also the building blocks of a variety of other methods, such as matrix factorizations. We suggest an algorithm that gets a finite set of n d-dimensional real vectors and returns a subset of d+1 vectors with positive weights whose weighted sum is exactly the same as that of the input. The constructive proof of Caratheodory's Theorem computes such a subset in O(n²d²) time and is thus not used in practice. Our algorithm computes this subset in O(nd + d⁴ log n) time, using O(log n) calls to Caratheodory's construction on small but "smart" subsets. This is based on a novel paradigm of fusion between different data summarization techniques, known as sketches and coresets. For large values of d, we suggest a faster construction that takes O(nd) time and returns a weighted subset of O(d) sparsified input points. Here, a sparsified point means that some of its entries were set to zero. As an application, we show how to boost the performance of existing LMS solvers, such as those in the scikit-learn library, by up to x100. Generalization to streaming and distributed data is trivial. Extensive experimental results and open source code are provided.
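For reference, below is a compact sketch of the classic constructive proof of Caratheodory's Theorem (our own implementation, shown only to illustrate the slow baseline that motivates the paper's faster booster; names and tolerances are ours). At each step it finds an affine dependence among the remaining points, shifts the weights along it until one weight hits zero, and repeats until at most d + 1 points remain, all without changing the weighted sum.

```python
import numpy as np

def caratheodory(P, u, tol=1e-10):
    """Classic constructive Caratheodory: return indices of at most d + 1 rows of P
    and non-negative weights w with w @ P[idx] == u @ P and w.sum() == u.sum().
    This direct implementation scales poorly with n, which is exactly what the
    paper's O(nd + d^4 log n) booster avoids."""
    P = np.asarray(P, dtype=float)
    u = np.asarray(u, dtype=float).copy()
    idx = np.flatnonzero(u > tol)
    d = P.shape[1]
    while len(idx) > d + 1:
        # Affine dependence: v with sum(v) = 0 and v @ P[idx] = 0.
        A = P[idx[1:]] - P[idx[0]]              # (m-1) x d, and m-1 > d, so A^T has a null space
        mu = np.linalg.svd(A.T)[2][-1]          # null-space vector of A^T
        v = np.concatenate(([-mu.sum()], mu))
        if not np.any(v > tol):                 # make sure some entry is positive
            v = -v
        pos = v > tol
        ratios = np.full(len(idx), np.inf)
        ratios[pos] = u[idx][pos] / v[pos]
        j = np.argmin(ratios)                   # largest step keeping all weights >= 0
        u[idx] -= ratios[j] * v
        u[idx[j]] = 0.0                         # the minimizer is removed exactly
        idx = idx[u[idx] > tol]
    return idx, u[idx]

rng = np.random.default_rng(0)
P = rng.normal(size=(200, 5))
u = np.full(200, 1.0)
idx, w = caratheodory(P, u)
print(len(idx), np.allclose(w @ P[idx], u @ P))  # at most 6 points, same weighted sum
```

The paper's speedup comes from calling such a construction only O(log n) times on small, carefully chosen subsets rather than once on all n points; the resulting exact weighted subset can then be fed to any off-the-shelf LMS solver (e.g., scikit-learn estimators that accept sample_weight).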
ABSTRACT
INTRODUCTION: The hospitalization rate is a morbidity indicator in hemodialysis (HD) patients. This study aimed to evaluate hospitalization patterns in a large HD cohort. METHODS: All DaVita-KSA HD patients from October 2014 to December 2019 were included. Demographic and clinical characteristics and hospitalization data were recorded. Admissions of less than 24 h were excluded. Overall and cause-specific hospitalization rates were calculated. RESULTS: During the follow-up period, among 3982 patients with a mean age of 52.5 ± 16.8 years, 2667 hospitalizations were recorded in 34.1% of the patients, and 45.6% of those had repeated admissions. Infectious causes accounted for 26.6% of all recorded causes vs. 15.6% for cardiovascular complications. The median length of hospital stay was 11 days, while the overall annual hospitalization rate was 34.9% and the annual hospitalization duration was 3.7 days per patient. Hospitalized patients had a higher risk of mortality (p < 0.001). CONCLUSION: Infectious complications were the leading cause of hospitalization and were associated with the longest hospital stays.