no code implementations • 1 Dec 2024 • Maryam Aliakbarpour, Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal
We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal.
no code implementations • 30 Oct 2024 • Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Haike Xu
In particular, if an algorithm uses $O(n/\log^c k)$ samples for some constant $c>0$ and polynomial space, then the query time of the data structure must be at least $k^{1-O(1)/\log \log k}$, i.e., close to linear in the number of distributions $k$.
no code implementations • 15 Jun 2024 • Haike Xu, Zongyu Lin, Yizhou Sun, Kai-Wei Chang, Piotr Indyk
Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval, where it achieves accuracy improvements of more than 30% on MSMARCO and HotpotQA across different model architectures, but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval.
1 code implementation • 5 Jun 2024 • Haike Xu, Sandeep Silwal, Piotr Indyk
In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric.
1 code implementation • NeurIPS 2023 • Piotr Indyk, Haike Xu
Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees.
no code implementations • 6 Jul 2023 • Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten
For any two point sets $A, B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A, B)=\sum_{a \in A} \min_{b \in B} d_X(a, b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance).
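A brute-force reference implementation of this definition (a minimal sketch, assuming Euclidean $d_X$; the paper's contribution is a much faster approximation, not this $O(|A|\,|B|\,d)$ scan):

```python
import numpy as np

def chamfer_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Exact Chamfer distance CH(A, B) = sum_{a in A} min_{b in B} ||a - b||_2.

    Brute-force O(|A| * |B| * d) reference; the paper's algorithm
    approximates this in near-linear time.
    """
    diffs = A[:, None, :] - B[None, :, :]   # shape (|A|, |B|, d)
    dists = np.linalg.norm(diffs, axis=2)   # pairwise Euclidean distances
    return float(dists.min(axis=1).sum())   # nearest b for each a, then sum

# Example: two small 2-D point sets.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [2.0, 0.0]])
print(chamfer_distance(A, B))  # min dists are 1.0 and 1.0 => 2.0
```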
1 code implementation • 20 Jun 2023 • Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal
We study statistical/computational tradeoffs for the following density estimation problem: given $k$ distributions $v_1, \ldots, v_k$ over a discrete domain of size $n$, and sampling access to a distribution $p$, identify $v_i$ that is "close" to $p$.
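A naive baseline for this problem, as a sketch (names and the total-variation criterion are our illustrative choices): build the empirical distribution of the samples from $p$ and return the $v_i$ closest to it in TV distance. The paper studies how to trade off samples, space, and query time beyond this.

```python
import numpy as np

def closest_distribution(samples, V):
    """Pick the v_i closest in total variation distance to the empirical
    distribution of the samples drawn from p.

    V: array of shape (k, n); each row is a distribution over {0,...,n-1}.
    """
    n = V.shape[1]
    empirical = np.bincount(samples, minlength=n) / len(samples)
    tv = 0.5 * np.abs(V - empirical).sum(axis=1)  # TV distance to each v_i
    return int(tv.argmin())

rng = np.random.default_rng(0)
V = rng.dirichlet(np.ones(100), size=5)    # k=5 distributions over n=100
samples = rng.choice(100, size=2000, p=V[3])  # p coincides with v_3 here
print(closest_distribution(samples, V))    # expected: 3
```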
no code implementations • 15 Apr 2023 • Nicholas Schiefer, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Tal Wagner
An $\varepsilon$-approximate quantile sketch over a stream of $n$ inputs approximates the rank of any query point $q$ - that is, the number of input points less than $q$ - up to an additive error of $\varepsilon n$, typically with probability at least $1 - 1/\mathrm{poly}(n)$, while consuming $o(n)$ space.
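To make the additive $\varepsilon n$ rank guarantee concrete, here is a toy uniform-sampling sketch (illustrative only; real quantile sketches such as GK or KLL achieve the same guarantee in far less space):

```python
import random

class SampledQuantileSketch:
    """Keep a uniform random sample of the stream; answer rank queries by
    scaling sample ranks back up. Error concentrates around eps * n."""
    def __init__(self, sample_prob: float):
        self.p = sample_prob
        self.sample = []

    def insert(self, x):
        if random.random() < self.p:
            self.sample.append(x)

    def rank(self, q):
        # Estimated number of stream elements strictly less than q.
        below = sum(1 for x in self.sample if x < q)
        return below / self.p

sketch = SampledQuantileSketch(sample_prob=0.05)
for x in range(100_000):
    sketch.insert(x)
print(sketch.rank(50_000))  # close to 50_000, up to additive eps * n error
```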
no code implementations • 1 Dec 2022 • Ainesh Bakshi, Piotr Indyk, Praneeth Kacham, Sandeep Silwal, Samson Zhou
We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix.
no code implementations • 6 Nov 2022 • Anders Aamand, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Nicholas Schiefer, Sandeep Silwal, Tal Wagner
However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$.
no code implementations • 16 Jun 2022 • Peter Bartlett, Piotr Indyk, Tal Wagner
Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods.
no code implementations • ICLR 2022 • Justin Y. Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David P. Woodruff, Michael Zhang
We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature.
no code implementations • NeurIPS 2021 • Piotr Indyk, Tal Wagner, David Woodruff
Recently, data-driven and learning-based algorithms for low rank matrix approximation were shown to outperform classical data-oblivious algorithms by wide margins in terms of accuracy.
1 code implementation • CVPR 2022 • Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio Feris, Piotr Indyk, Dina Katabi
This forces all classes, including minority classes, to maintain a uniform distribution in the feature space, improves class boundaries, and provides better generalization even in the presence of long-tail data.
Ranked #24 on Long-tail Learning on CIFAR-10-LT (ρ=100)
no code implementations • 19 Nov 2021 • Talya Eden, Piotr Indyk, Haike Xu
In particular, we consider heuristics induced by norm embeddings and distance labeling schemes, and provide lower bounds for the tradeoffs between the number of dimensions or bits used to represent each graph node, and the running time of the A* algorithm.
no code implementations • 21 Oct 2021 • Anders Aamand, Justin Y. Chen, Piotr Indyk
For the bipartite version of a stochastic graph model due to Chung, Lu, and Vu where the expected values of the offline degrees are known and used as predictions, we show that MinPredictedDegree stochastically dominates any other online algorithm, i.e., it is optimal for graphs drawn from this model.
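A sketch of the MinPredictedDegree rule as described in the abstract (data layout and names are ours): when an online vertex arrives, match it to the free offline neighbor with the smallest predicted degree, saving high-degree offline vertices for later arrivals.

```python
def min_predicted_degree_matching(online_neighbors, predicted_degree):
    """Greedy online matching guided by predicted offline degrees.

    online_neighbors: list over arriving vertices; each entry is the list
                      of offline neighbors of that vertex.
    predicted_degree: dict mapping each offline vertex to its predicted degree.
    """
    matched = {}  # offline vertex -> online vertex
    for u, neighbors in enumerate(online_neighbors):
        free = [v for v in neighbors if v not in matched]
        if free:
            # Prefer the free neighbor with the smallest predicted degree.
            v = min(free, key=lambda w: predicted_degree[w])
            matched[v] = u
    return matched

# Offline vertices 'a', 'b'; 'a' has the lower predicted degree.
print(min_predicted_degree_matching(
    online_neighbors=[['a', 'b'], ['b']],
    predicted_degree={'a': 1, 'b': 2},
))  # vertex 0 takes 'a', leaving 'b' for vertex 1 -> {'a': 0, 'b': 1}
```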
no code implementations • 5 Jul 2021 • Shyam Narayanan, Sandeep Silwal, Piotr Indyk, Or Zamir
Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems.
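A minimal example of the tool in question, a random Gaussian (Johnson-Lindenstrauss style) projection that approximately preserves pairwise Euclidean distances; dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 1_000, 100                  # m ~ O(log n / eps^2) in theory

X = rng.normal(size=(n, d))                # high-dimensional points
G = rng.normal(size=(d, m)) / np.sqrt(m)   # random Gaussian projection
Y = X @ G                                  # reduced to m dimensions

# Pairwise distances are preserved up to small multiplicative distortion.
i, j = 3, 17
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(orig, proj, proj / orig)             # ratio close to 1
```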
no code implementations • ICLR 2021 • Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements.
no code implementations • 16 Feb 2021 • Arturs Backurs, Piotr Indyk, Cameron Musco, Tal Wagner
In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector.
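For intuition about the quantity being estimated, here is a naive unbiased Monte Carlo estimator of the kernel-matrix entry sum (a sketch with a Gaussian kernel; the paper obtains stronger guarantees via kernel density estimation, not via this estimator):

```python
import numpy as np

def estimate_kernel_sum(X, bandwidth, num_samples, rng):
    """Unbiased estimate of sum_{i,j} k(x_i, x_j) for the Gaussian kernel:
    sample random index pairs and rescale by n^2."""
    n = len(X)
    i = rng.integers(n, size=num_samples)
    j = rng.integers(n, size=num_samples)
    sq = ((X[i] - X[j]) ** 2).sum(axis=1)
    vals = np.exp(-sq / (2 * bandwidth ** 2))
    return n * n * vals.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
est = estimate_kernel_sum(X, bandwidth=1.0, num_samples=20_000, rng=rng)
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2).sum()
print(est, exact)  # the estimate concentrates around the exact sum
```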
no code implementations • 17 Dec 2020 • Tianhong Li, Lijie Fan, Yuan Yuan, Hao He, Yonglong Tian, Rogerio Feris, Piotr Indyk, Dina Katabi
However, contrastive learning is susceptible to feature suppression, i.e., it may discard important information relevant to the task of interest, and learn irrelevant features.
no code implementations • 9 Jun 2020 • Piotr Indyk, Frederik Mallmann-Trenn, Slobodan Mitrović, Ronitt Rubinfeld
In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$.
1 code implementation • NeurIPS 2019 • Arturs Backurs, Piotr Indyk, Tal Wagner
We instantiate our framework with the Laplacian and Exponential kernels, two popular kernels which possess the aforementioned property.
no code implementations • NeurIPS 2019 • Jayadev Acharya, Sourbh Bhadane, Piotr Indyk, Ziteng Sun
We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited.
no code implementations • NeurIPS 2019 • Piotr Indyk, Ali Vakilian, Yang Yuan
Our experiments show that, for multiple types of data sets, a learned sketch matrix can substantially reduce the approximation loss compared to a random matrix $S$, sometimes by one order of magnitude.
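A sketch of the standard sketch-and-solve pipeline that such a matrix $S$ plugs into (project $A$ onto the row space of $SA$, then truncate to rank $k$); with a random $S$ this is the classical data-oblivious method, and whether $S$ is random or learned only changes how $S$ is produced:

```python
import numpy as np

def sketch_low_rank(A, S, k):
    """Rank-k approximation of A restricted to the row space of S @ A."""
    # Orthonormal basis (rows of Vt) for the row space of the sketch SA.
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    AV = A @ Vt.T
    # Best rank-k approximation of A inside that subspace.
    U, sig, Wt = np.linalg.svd(AV, full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Wt[:k] @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 80)) @ rng.normal(size=(80, 120))  # rank-80 matrix
S = rng.normal(size=(20, 300))   # random sketch; the paper learns its entries
A_k = sketch_low_rank(A, S, k=10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
opt = np.linalg.norm(A - (U[:, :10] * s[:10]) @ Vt[:10])
print(np.linalg.norm(A - A_k), opt)  # sketch error vs. optimal rank-10 error
```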
1 code implementation • ICML 2020 • Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner
Our extensive experiments, on real-world text and image datasets, show that Flowtree improves over various baselines and existing methods in either running time or accuracy.
no code implementations • 25 Sep 2019 • Xiyuan Zhang, Yang Yuan, Piotr Indyk
The edit distance between two sequences is an important metric with many applications.
no code implementations • 6 Jul 2019 • Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, Alireza Rezaei
In this work, we first provide a theoretical approximation guarantee of $O(C^{k^2})$ for the Greedy algorithm in the context of composable core-sets. We then propose a Local Search based algorithm that, while still practical, achieves a nearly optimal approximation bound of $O(k)^{2k}$. Finally, we implement all three algorithms and show the effectiveness of our proposed algorithm on standard data sets.
no code implementations • 2 Jun 2019 • Piotr Indyk, Ali Vakilian, Tal Wagner, David Woodruff
Recent work by Bakshi and Woodruff (NeurIPS 2018) showed it is possible to compute a rank-$k$ approximation of a distance matrix in time $O((n+m)^{1+\gamma}) \cdot \mathrm{poly}(k, 1/\epsilon)$, where $\epsilon>0$ is an error parameter and $\gamma>0$ is an arbitrarily small constant.
no code implementations • ICLR 2019 • Chen-Yu Hsu, Piotr Indyk, Dina Katabi, Ali Vakilian
Estimating the frequencies of elements in a data stream is a fundamental task in data analysis and machine learning.
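For reference, a classical Count-Min sketch, the kind of data-oblivious baseline that learning-augmented frequency estimators build on (a minimal sketch; hashing details are our illustrative choices):

```python
import random

class CountMinSketch:
    """d hash rows of width w; each row only overcounts, so the minimum
    over rows is an upper bound on the true frequency."""
    def __init__(self, width: int, depth: int, seed: int = 0):
        rnd = random.Random(seed)
        self.w = width
        self.seeds = [rnd.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row: int, x) -> int:
        return hash((self.seeds[row], x)) % self.w

    def add(self, x):
        for r in range(len(self.table)):
            self.table[r][self._cell(r, x)] += 1

    def estimate(self, x) -> int:
        return min(self.table[r][self._cell(r, x)]
                   for r in range(len(self.table)))

cms = CountMinSketch(width=1024, depth=4)
for token in ["a"] * 100 + ["b"] * 5:
    cms.add(token)
print(cms.estimate("a"), cms.estimate("b"))  # >= true counts 100 and 5
```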
1 code implementation • 10 Feb 2019 • Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, Tal Wagner
In the fair variant of $k$-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an "approximately equal" number of points of each color.
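To make the objective and constraint concrete, here is a small sketch of the k-median cost together with one common way to formalize "approximately equal" color counts (a ratio bound; the exact fairness notion varies across the literature, and the slack parameter here is our illustrative choice):

```python
import numpy as np
from collections import Counter

def kmedian_cost(points, centers, assignment):
    """Sum-of-distances (k-median) objective for a given assignment."""
    return sum(np.linalg.norm(points[i] - centers[assignment[i]])
               for i in range(len(points)))

def is_balanced(assignment, colors, k, slack=1.5):
    """Within every cluster, no color may outnumber another by more than
    a factor of `slack`."""
    for c in range(k):
        counts = Counter(colors[i] for i in range(len(colors))
                         if assignment[i] == c)
        if counts and max(counts.values()) > slack * min(counts.values()):
            return False
    return True

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
colors = ["red", "blue", "red", "blue"]
centers = np.array([[0, 0.5], [5, 5.5]])
assignment = [0, 0, 1, 1]
print(kmedian_cost(points, centers, assignment),
      is_balanced(assignment, colors, k=2))  # 2.0 True
```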
1 code implementation • ICLR 2020 • Yihe Dong, Piotr Indyk, Ilya Razenshteyn, Tal Wagner
Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms.
no code implementations • 31 Jul 2018 • Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, Alireza Rezaei
We show that for many objective functions one can use a spectral spanner, independent of the underlying functions, as a core-set and obtain almost optimal composable core-sets.
no code implementations • 26 Jun 2018 • Alexandr Andoni, Piotr Indyk, Ilya Razenshteyn
The nearest neighbor problem is defined as follows: Given a set $P$ of $n$ points in some metric space $(X, D)$, build a data structure that, given any point $q$, returns a point in $P$ that is closest to $q$ (its "nearest neighbor" in $P$).
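The exact problem can be solved by a linear scan, shown below as a baseline; the survey concerns data structures (e.g., locality-sensitive hashing) that answer approximate queries in sublinear time after preprocessing.

```python
import numpy as np

def nearest_neighbor(P: np.ndarray, q: np.ndarray) -> int:
    """Exact nearest neighbor by linear scan: O(n * d) per query."""
    return int(np.linalg.norm(P - q, axis=1).argmin())

P = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]])
print(nearest_neighbor(P, np.array([0.9, 1.2])))  # index 2, the point (1, 1)
```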
no code implementations • NeurIPS 2017 • Piotr Indyk, Ilya Razenshteyn, Tal Wagner
We introduce a new distance-preserving compact representation of multi-dimensional point-sets.
no code implementations • NeurIPS 2017 • Arturs Backurs, Piotr Indyk, Ludwig Schmidt
We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.
no code implementations • NeurIPS 2016 • Chinmay Hegde, Piotr Indyk, Ludwig Schmidt
We address the problem of recovering a high-dimensional but structured vector from linear observations in a general setting where the vector can come from an arbitrary union of subspaces.
1 code implementation • NeurIPS 2015 • Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt
Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.
no code implementations • 28 Apr 2015 • Mahdi Cheraghchi, Piotr Indyk
Moreover, we design a deterministic and non-adaptive $\ell_1/\ell_1$ compressed sensing scheme based on general lossless condensers that is equipped with a fast reconstruction algorithm running in time $k^{1+\alpha} (\log N)^{O(1)}$ (for the GUV-based condenser) and is of independent interest.