1 code implementation • 20 Jan 2025 • Michał Dereziński, Deanna Needell, Elizaveta Rebrova, Jiaming Yang
In this paper, we introduce Kaczmarz++, an accelerated randomized block Kaczmarz algorithm that exploits outlying singular values in the input to attain a fast Krylov-style convergence.
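As a hedged illustration of the basic method being accelerated, here is a minimal randomized block Kaczmarz iteration (not the Kaczmarz++ algorithm itself), assuming a consistent system $Ax=b$ and uniform block sampling; all names and parameters are illustrative:

```python
import numpy as np

def randomized_block_kaczmarz(A, b, block_size=10, iters=500, seed=0):
    """Plain randomized block Kaczmarz: at each step, project the iterate
    onto the solution set of a uniformly sampled block of equations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        idx = rng.choice(m, size=block_size, replace=False)
        A_blk, b_blk = A[idx], b[idx]
        # Orthogonal projection of x onto {z : A_blk z = b_blk}.
        x += np.linalg.pinv(A_blk) @ (b_blk - A_blk @ x)
    return x

# Usage on a synthetic consistent system.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 50))
x_true = rng.standard_normal(50)
x_hat = randomized_block_kaczmarz(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))  # should be small
```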
no code implementations • 13 Nov 2024 • Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong
An oblivious subspace embedding is a random $m\times n$ matrix $\Pi$ such that, for any $d$-dimensional subspace, with high probability $\Pi$ preserves the norms of all vectors in that subspace within a $1\pm\epsilon$ factor.
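As a quick, hedged illustration of this definition (using a dense Gaussian sketch rather than the sparse embeddings analyzed in the paper), one can check empirically how well a random $\Pi$ preserves norms of vectors drawn from a fixed $d$-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 20, 400            # ambient dim, subspace dim, sketch size

# A random d-dimensional subspace of R^n, via an orthonormal basis U.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

# Gaussian sketching matrix, scaled so that E[||Pi x||^2] = ||x||^2.
Pi = rng.standard_normal((m, n)) / np.sqrt(m)

# Distortion over many vectors from the subspace.
X = U @ rng.standard_normal((d, 1000))
ratios = np.linalg.norm(Pi @ X, axis=0) / np.linalg.norm(X, axis=0)
print(ratios.min(), ratios.max())  # both concentrate around 1 as m grows
```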
no code implementations • 17 Jun 2024 • Michał Dereziński, Michael W. Mahoney
Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives.
no code implementations • 3 Jun 2024 • Ruichen Jiang, Michał Dereziński, Aryan Mokhtari
In this paper, we propose a novel stochastic Newton proximal extragradient method that improves these bounds, achieving a faster global linear rate and reaching the same fast superlinear rate in $\tilde{O}(\kappa)$ iterations.
no code implementations • 9 May 2024 • Michał Dereziński, Christopher Musco, Jiaming Yang
Our methods are based on constructing a low-rank Nyström approximation to $A$ using sparse random sketching.
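For context, a minimal sketch-based Nyström approximation of a PSD matrix $A$ looks as follows (this uses a dense Gaussian test matrix as an assumption; the paper employs sparse random sketching):

```python
import numpy as np

def nystrom_approx(A, rank, seed=0):
    """Standard sketch-based Nystrom approximation of a PSD matrix A:
    A_hat = (A Omega) (Omega^T A Omega)^+ (A Omega)^T."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, rank))   # Gaussian test matrix (assumption)
    Y = A @ Omega                            # sketch of A
    core = Omega.T @ Y                       # Omega^T A Omega
    return Y @ np.linalg.pinv(core) @ Y.T

# Usage: approximate a PSD matrix with a fast-decaying spectrum.
rng = np.random.default_rng(1)
G = rng.standard_normal((300, 300))
A = G @ np.diag(0.9 ** np.arange(300)) @ G.T
A_hat = nystrom_approx(A, rank=40)
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))  # small relative error
```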
no code implementations • 9 May 2024 • Michał Dereziński, Daniel LeJeune, Deanna Needell, Elizaveta Rebrova
While effective in practice, iterative methods for solving large systems of linear equations can be significantly affected by problem-dependent condition number quantities.
no code implementations • 8 May 2024 • Sachin Garg, Kevin Tan, Michał Dereziński
Matrix sketching is a powerful tool for reducing the size of large data matrices.
no code implementations • 23 Apr 2024 • Sachin Garg, Albert S. Berahas, Michał Dereziński
We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches.
no code implementations • 26 Mar 2024 • Yongyi Yang, Jiaming Yang, Wei Hu, Michał Dereziński
In this paper, we propose HERTA: a High-Efficiency and Rigorous Training Algorithm for Unfolded GNNs that accelerates the whole training process, achieving a nearly-linear time worst-case training guarantee.
no code implementations • 14 Dec 2023 • Michał Dereziński, Jiaming Yang
We give a stochastic optimization algorithm that solves a dense $n\times n$ real-valued linear system $Ax=b$, returning $\tilde x$ such that $\|A\tilde x-b\|\leq \epsilon\|b\|$ in time $$\tilde O((n^2+nk^{\omega-1})\log(1/\epsilon)),$$ where $k$ is the number of singular values of $A$ larger than $O(1)$ times its smallest positive singular value, $\omega < 2.372$ is the matrix multiplication exponent, and $\tilde O$ hides a factor poly-logarithmic in $n$.
no code implementations • 17 Nov 2023 • Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, Mark Rudelson
We use this to construct the first oblivious subspace embedding with $O(d)$ embedding dimension that can be applied faster than current matrix multiplication time, and to obtain an optimal single-pass algorithm for least squares regression.
1 code implementation • 30 Aug 2023 • Younghyun Cho, James W. Demmel, Michał Dereziński, Haoyun Li, Hengrui Luo, Michael W. Mahoney, Riley J. Murray
Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees.
no code implementations • 20 Aug 2022 • Michał Dereziński, Elizaveta Rebrova
Sketch-and-project is a framework which unifies many known iterative methods for solving linear systems and their variants, as well as further extensions to non-linear optimization problems.
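A minimal instance of the framework, assuming the standard update $x \leftarrow x + A^\top S\,(S^\top A A^\top S)^{+} S^\top (b - Ax)$ with Gaussian sketches $S$ (one of many special cases the framework covers), might look like this:

```python
import numpy as np

def sketch_and_project(A, b, sketch_size=10, iters=500, seed=0):
    """Sketch-and-project with Gaussian sketches S:
    x <- x + A^T S (S^T A A^T S)^+ S^T (b - A x)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        S = rng.standard_normal((m, sketch_size))
        SA = S.T @ A                          # sketched system
        x += SA.T @ np.linalg.pinv(SA @ SA.T) @ (S.T @ b - SA @ x)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((400, 60))
x_true = rng.standard_normal(60)
print(np.linalg.norm(sketch_and_project(A, A @ x_true) - x_true))
```

With a one-row sketch $S = e_i$ this recovers the classical Kaczmarz method; other choices within the framework recover coordinate descent and related solvers.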
no code implementations • 21 Jun 2022 • Michał Dereziński
Algorithmic Gaussianization is a phenomenon that can arise when randomized sketching or sampling methods are used to produce smaller representations of large datasets: for certain tasks, these sketched representations have been observed to exhibit many of the robust performance characteristics known to occur when a data sample comes from a sub-gaussian random design, a powerful statistical model of data distributions.
1 code implementation • 6 Jun 2022 • Michał Dereziński
Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization.
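As a reference point for the first-order setting mentioned here (not the paper's method), a minimal SVRG-style variance-reduced gradient loop for a finite-sum least-squares objective could look like this; the step size and problem sizes are illustrative assumptions:

```python
import numpy as np

def svrg_least_squares(A, b, lr=0.01, epochs=20, seed=0):
    """Minimal SVRG on f(x) = (1/2n) ||Ax - b||^2: each inner step uses
    grad_i(x) - grad_i(x_ref) + full_grad(x_ref) as the search direction."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = A.T @ (A @ x_ref - b) / n      # anchor gradient at x_ref
        for _ in range(n):
            i = rng.integers(n)
            g_i = A[i] * (A[i] @ x - b[i])         # component gradient at x
            g_ref = A[i] * (A[i] @ x_ref - b[i])   # same component at x_ref
            x -= lr * (g_i - g_ref + full_grad)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
b = A @ rng.standard_normal(20)
print(np.linalg.norm(A @ svrg_least_squares(A, b) - b))  # near zero
```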
1 code implementation • 20 Apr 2022 • Sen Na, Michał Dereziński, Michael W. Mahoney
Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, while still exhibiting a superlinear convergence rate that nearly matches (up to a logarithmic factor) the rate of uniform Hessian averaging.
1 code implementation • NeurIPS 2021 • Michał Dereziński, Jonathan Lacotte, Mert Pilanci, Michael W. Mahoney
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration.
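As a hedged illustration of one common remedy (a generic sketched Newton step, not the paper's specific estimator), the Hessian $A^\top A$ of a least-squares objective can be replaced by the sketched estimate $(SA)^\top(SA)$:

```python
import numpy as np

def sketched_newton_step(A, b, x, sketch_size=100, seed=0):
    """One Newton step for f(x) = 0.5 ||Ax - b||^2 with the Hessian A^T A
    replaced by its sketched estimate (SA)^T (SA)."""
    rng = np.random.default_rng(seed)
    m, _ = A.shape
    S = rng.standard_normal((sketch_size, m)) / np.sqrt(sketch_size)
    SA = S @ A
    grad = A.T @ (A @ x - b)            # exact gradient
    H_sketch = SA.T @ SA                # sketched Hessian estimate
    return x - np.linalg.solve(H_sketch, grad)

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 50))
b = A @ rng.standard_normal(50)
x = np.zeros(50)
for _ in range(5):                      # fresh sketch at every iteration
    x = sketched_newton_step(A, b, x, seed=rng.integers(10**6))
print(np.linalg.norm(A.T @ (A @ x - b)))  # gradient norm shrinks rapidly
```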
no code implementations • 3 Feb 2021 • Xue Chen, Michał Dereziński
An important example is least absolute deviation regression ($\ell_1$ regression) which enjoys superior robustness to outliers compared to least squares.
no code implementations • 21 Nov 2020 • Michał Dereziński, Zhenyu Liao, Edgar Dobriban, Michael W. Mahoney
For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$.
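A quick numerical illustration of this inversion bias, using Gaussian sketches (the paper analyzes how to characterize and correct it):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 500, 10, 50, 1000

A = rng.standard_normal((n, d))
target = np.linalg.inv(A.T @ A)

# Average the sketched inverse covariance over many independent sketches.
est = np.zeros((d, d))
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    As = S @ A
    est += np.linalg.inv(As.T @ As) / trials

# For Gaussian sketches the estimate is inflated by roughly m / (m - d - 1).
print(np.trace(est) / np.trace(target))   # noticeably larger than 1
print(m / (m - d - 1))                    # the inverse-Wishart inflation factor
```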
no code implementations • NeurIPS 2020 • Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney
In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data.
no code implementations • 30 Jun 2020 • Daniele Calandriello, Michał Dereziński, Michal Valko
Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more.
no code implementations • NeurIPS 2020 • Michał Dereziński, Feynman Liang, Zhenyu Liao, Michael W. Mahoney
It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace.
no code implementations • 7 May 2020 • Michał Dereziński, Michael W. Mahoney
For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nyström method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP.
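A hedged sketch of plain leverage score sampling for a tall $n\times d$ matrix (the RandNLA technique mentioned above; the full DPP connection is beyond a few lines of code):

```python
import numpy as np

def leverage_score_sample(A, k, seed=0):
    """Sample k rows of A i.i.d. proportionally to their leverage scores,
    with the standard 1/sqrt(k * p_i) reweighting."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(A)              # orthonormal basis for range(A)
    lev = np.sum(Q**2, axis=1)          # leverage scores (sum to d)
    p = lev / lev.sum()
    idx = rng.choice(A.shape[0], size=k, p=p)
    weights = 1.0 / np.sqrt(k * p[idx])
    return A[idx] * weights[:, None], idx

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))
A_sub, _ = leverage_score_sample(A, k=200)
# The subsampled Gram matrix approximates A^T A.
print(np.linalg.norm(A_sub.T @ A_sub - A.T @ A) / np.linalg.norm(A.T @ A))
```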
no code implementations • 21 Feb 2020 • Michał Dereziński, Rajiv Khanna, Michael W. Mahoney
The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.
no code implementations • NeurIPS 2020 • Michał Dereziński, Feynman Liang, Michael W. Mahoney
We provide the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator.
no code implementations • 25 Oct 2019 • Mojmír Mutný, Michał Dereziński, Andreas Krause
We analyze the convergence rate of the randomized Newton-like method introduced by Qu et al.
no code implementations • 8 Jul 2019 • Michał Dereziński, Manfred K. Warmuth, Daniel Hsu
We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum.
1 code implementation • 10 Jun 2019 • Michał Dereziński, Feynman Liang, Michael W. Mahoney
In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task.
2 code implementations • NeurIPS 2019 • Michał Dereziński, Daniele Calandriello, Michal Valko
For this purpose, we propose a new algorithm which, given access to $\mathbf{L}$, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is $n \cdot \text{poly}(k)$, i.e., sublinear in the size of $\mathbf{L}$, and (2) its sampling cost is $\text{poly}(k)$, i.e., independent of the size of $\mathbf{L}$.
no code implementations • NeurIPS 2019 • Michał Dereziński, Michael W. Mahoney
In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum.
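A two-matrix toy example of this inversion bias (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
H1 = rng.standard_normal((3, 3)); H1 = H1 @ H1.T + np.eye(3)
H2 = rng.standard_normal((3, 3)); H2 = H2 @ H2.T + np.eye(3)

avg_of_inverses = (np.linalg.inv(H1) + np.linalg.inv(H2)) / 2
inverse_of_avg = np.linalg.inv((H1 + H2) / 2)
print(np.linalg.norm(avg_of_inverses - inverse_of_avg))  # not zero
```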
no code implementations • 4 Feb 2019 • Michał Dereziński, Kenneth L. Clarkson, Michael W. Mahoney, Manfred K. Warmuth
In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d.
no code implementations • 8 Nov 2018 • Michał Dereziński
To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$.
no code implementations • 4 Oct 2018 • Michał Dereziński, Manfred K. Warmuth, Daniel Hsu
Without any assumptions on the noise, the linear least squares solution for any i.i.d.
no code implementations • 6 Jun 2018 • Michał Dereziński, Manfred K. Warmuth
We can only afford to obtain the responses for a small subset of the points, which are then used to construct linear predictions for all points in the dataset.
no code implementations • NeurIPS 2018 • Michał Dereziński, Manfred K. Warmuth, Daniel Hsu
We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability.
no code implementations • 14 Oct 2017 • Michał Dereziński, Manfred K. Warmuth
However, when labels are expensive, we are forced to select only a small subset of vectors $\mathbf{x}_i$ for which we obtain the labels $y_i$.
no code implementations • NeurIPS 2017 • Michał Dereziński, Manfred K. Warmuth
The pseudoinverse plays an important role in solving the linear least squares problem, where we try to predict a label for each column of $X$.
no code implementations • 22 Apr 2017 • Michał Dereziński, Dhruv Mahajan, S. Sathiya Keerthi, S. V. N. Vishwanathan, Markus Weimer
We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset.