IST Lunch Bunch
One of the great challenges in computational theory is extracting
patterns from massive, high-dimensional data sets, for example via
clustering. Geometric clustering is an extensively studied field with a
wide spectrum of real-world applications.
A "coreset" for a given clustering optimization problem is a
"compressed" representation of the data at hand, in the sense that a
solution for the clustering problem with the (small) coreset as input
would yield an approximate solution to the problem with the original
(larger) input. Using traditional techniques, a coreset usually
implies provable linear time algorithms for the corresponding
clustering problem.
In this talk I will present a unified and abstract framework that
yields the efficient construction of succinct coresets for several
clustering problems. Representing the data by a set F of positive
functions over a ground set X, our framework forges a link between the
combinatorial complexity of the function family F at hand (measured in
terms of classical VC dimension) and the paradigm of coresets. Our
coresets are obtained via randomized sampling according to a
delicately designed sampling distribution. Problems to which our
framework has been applied successfully include k-median, k-line
median, projective clustering, linear regression, low-rank
approximation, and subspace approximation.
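
To give a concrete flavor of the sampling idea (not the exact construction from the talk), the following is a minimal Python sketch of importance sampling for a k-median-style coreset: each point is sampled with probability proportional to a crude upper bound on its importance derived from a rough approximate solution, and sampled points are reweighted so that cost estimates stay unbiased. The helper names (sensitivity_sampling_coreset, rough_centers) and the specific importance bound are illustrative assumptions, not the framework's actual distribution.

import numpy as np

def sensitivity_sampling_coreset(points, rough_centers, m, rng=None):
    # Build a weighted coreset of size m by importance sampling.
    # rough_centers: any crude approximate solution; distances to it
    # serve as a rough upper bound on each point's importance.
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    # Distance from each point to its nearest rough center.
    dists = np.min(
        np.linalg.norm(points[:, None, :] - rough_centers[None, :, :], axis=2),
        axis=1,
    )
    # Crude importance score: far-away points matter more; the uniform
    # 1/n term keeps every point sample-able and bounds the variance.
    scores = dists / max(dists.sum(), 1e-12) + 1.0 / n
    probs = scores / scores.sum()
    idx = rng.choice(n, size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])  # reweight so cost estimates stay unbiased
    return points[idx], weights

def weighted_kmedian_cost(points, weights, centers):
    # Weighted sum of distances to the nearest center (k-median objective).
    d = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return float(np.sum(weights * d))

On the full data the same objective is computed with unit weights; in the framework described above, the coreset size needed for a good approximation is governed by the combinatorial (VC-type) complexity of the function family and the total importance, rather than by the number of input points.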
This talk is based on joint work with Dan Feldman and Leonard Schulman.