1 code implementation • 2 Apr 2024 • Andrew Draganov, David Saulpic, Chris Schwiegelshohn
We study the theoretical and practical runtime limits of $k$-means and $k$-median clustering on large datasets.
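For context, a minimal sketch of the textbook Lloyd heuristic for $k$-means, whose $O(nkd)$ cost per iteration is the kind of runtime at stake on large datasets; this is a generic baseline under assumed NumPy conventions, not the paper's algorithm:

```python
import numpy as np

def lloyd_kmeans(X, k, iters=10, seed=0):
    """Textbook Lloyd heuristic; each iteration costs O(n * k * d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```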
no code implementations • 27 Feb 2024 • Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model.
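As a point of reference, a hedged sketch of the naive baseline: pick a uniform random subset and reweight it so the subset's weighted loss is an unbiased estimate of the full-data loss (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def uniform_data_selection(X, y, m, seed=0):
    """Naive baseline: keep m uniformly random points, each with weight
    n/m, so weighted sums over the subset match the full data in
    expectation."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)
    weights = np.full(m, n / m)  # inverse sampling probability
    return X[idx], y[idx], weights
```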
no code implementations • 7 Jul 2023 • Max Dupré la Tour, Monika Henzinger, David Saulpic
We consider the problem of privately clustering a dataset in $\mathbb{R}^d$ that undergoes both insertions and deletions of points.
no code implementations • 15 Nov 2022 • Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar
Given a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) of every point to its closest center is minimized.
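In symbols, for an input set $A$ and a center set $C$ with $|C| = k$, the two objectives read:

$$\mathrm{cost}_{k\text{-means}}(A, C) = \sum_{a \in A} \min_{c \in C} \|a - c\|^2, \qquad \mathrm{cost}_{k\text{-median}}(A, C) = \sum_{a \in A} \min_{c \in C} \|a - c\|.$$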
1 code implementation • 17 Jun 2022 • Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii
We study the private $k$-median and $k$-means clustering problems in $d$-dimensional Euclidean space.
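To make the privacy constraint concrete, here is a minimal sketch of one standard ingredient, releasing a single cluster mean via the Gaussian mechanism; the clipping radius R and noise scale sigma are illustrative assumptions, and this is not the paper's algorithm:

```python
import numpy as np

def noisy_cluster_mean(points, R, sigma, seed=0):
    """Clip points to norm R so the sum has bounded sensitivity, then add
    Gaussian noise; for suitable sigma this yields (eps, delta)-differential
    privacy. The cluster size is treated as public here for simplicity."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    clipped = points * np.minimum(1.0, R / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * R, size=points.shape[1])
    return noisy_sum / len(points)
```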
no code implementations • 25 Feb 2022 • Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn
Given a set of points in a metric space, the $(k, z)$-clustering problem consists of finding a set of $k$ points, called centers, such that the sum over all data points of the distance to the closest center, raised to the power $z$, is minimized.
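Read as code, the objective looks as follows; $z = 1$ recovers $k$-median and $z = 2$ recovers $k$-means (a sketch for the Euclidean case, with illustrative names):

```python
import numpy as np

def kz_cost(X, centers, z):
    """(k, z)-clustering cost: distance of each point to its closest
    center, raised to the power z, summed over all points."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.sum(dists.min(axis=1) ** z)
```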
no code implementations • NeurIPS 2021 • Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn
Given a set $A$ of points in $\mathbb{R}^d$ and an exponent $z$, the power mean $m$ of $A$ is the point minimizing the sum of distances to the points of $A$, each raised to the power $z$. Special cases of this problem include the well-known Fermat-Weber problem -- or geometric median problem -- where $z = 1$, the mean or centroid where $z = 2$, and the Minimum Enclosing Ball problem, where $z = \infty$. We consider these problems in the big data regime, where we are interested in sampling as few points as possible such that we can accurately estimate $m$. More specifically, we consider sublinear algorithms as well as coresets for these problems. Sublinear algorithms have random query access to $A$, and the goal is to minimize the number of queries. Here, we show that $\tilde{O}(\varepsilon^{-z-3})$ samples are sufficient to achieve a $(1+\varepsilon)$ approximation, generalizing the results of Cohen, Lee, Miller, Pachocki, and Sidford [STOC '16] and Inaba, Katoh, and Imai [SoCG '94] to arbitrary $z$.
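A minimal sketch of the sampling idea behind such sublinear bounds: estimate the cost of a candidate point from a uniform sample, rescaled so the estimate is unbiased (the sample size is left as a parameter; the $\tilde{O}(\varepsilon^{-z-3})$ bound above is the paper's result, not something this sketch proves):

```python
import numpy as np

def estimate_power_cost(A, m_point, z, num_samples, seed=0):
    """Unbiased estimate of sum_{a in A} ||a - m_point||^z from a
    uniform sample of the rows of A, rescaled by n / num_samples."""
    rng = np.random.default_rng(seed)
    n = len(A)
    S = A[rng.choice(n, size=num_samples, replace=True)]
    return (n / num_samples) * np.sum(np.linalg.norm(S - m_point, axis=1) ** z)
```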
no code implementations • NeurIPS 2020 • Vincent Cohen-Addad, Adrian Kosowski, Frederik Mallmann-Trenn, David Saulpic
A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected.
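One way to make "densely connected inside, loosely connected across" concrete is to compare edge densities within and between the sets of a given partition (a hypothetical helper, not the paper's objective):

```python
import numpy as np

def partition_densities(adj, labels):
    """Fraction of present edges among within-set pairs vs. across-set
    pairs, for an adjacency matrix adj and integer set labels."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = adj[same & off_diag].mean()  # density inside the sets
    across = adj[~same].mean()            # density between the sets
    return within, across
```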
1 code implementation • NeurIPS 2019 • Vincent Cohen-Addad, Niklas Oskar D. Hjuler, Nikos Parotsidis, David Saulpic, Chris Schwiegelshohn
This improves over the naive algorithm, which recomputes a solution at each time step and can take up to $O(n^2)$ update time and $O(n^2)$ total recourse.
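For intuition, a sketch of that naive baseline, counting recourse as the number of centers that change between consecutive recomputations (`solve` is a hypothetical black-box clustering routine):

```python
def naive_total_recourse(snapshots, solve):
    """Recompute a solution from scratch after every update (expensive)
    and count total recourse: the size of the symmetric difference
    between consecutive center sets."""
    prev, recourse = set(), 0
    for points in snapshots:           # one snapshot per insertion/deletion
        centers = set(solve(points))   # full recomputation at each step
        recourse += len(centers ^ prev)
        prev = centers
    return recourse
```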