Search Results for author: Michał Dereziński

Found 31 papers, 5 papers with code

Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

no code implementations23 Apr 2024 Sachin Garg, Albert S. Berahas, Michał Dereziński

We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches.
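
As background for this setting, a minimal NumPy sketch of a plain SVRG-style variance-reduced gradient loop on a least-squares finite sum; it omits the partial second-order information that the paper adds, and the mini-batch size and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_grad(w):
    # Gradient of the finite-sum objective (1/2n) * sum_i (x_i^T w - y_i)^2.
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
for epoch in range(20):
    w_ref, g_ref = w.copy(), full_grad(w)        # snapshot point and its full gradient
    for _ in range(n // 64):
        idx = rng.integers(0, n, size=64)        # mini-batch of size 64 (assumed)
        Xb, yb = X[idx], y[idx]
        # Variance-reduced gradient: mini-batch gradient corrected by the snapshot.
        g = Xb.T @ (Xb @ w - yb) / 64 - Xb.T @ (Xb @ w_ref - yb) / 64 + g_ref
        w -= 0.05 * g                            # assumed step size
print("final objective:", 0.5 * np.mean((X @ w - y) ** 2))
```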

HERTA: A High-Efficiency and Rigorous Training Algorithm for Unfolded Graph Neural Networks

no code implementations26 Mar 2024 Yongyi Yang, Jiaming Yang, Wei Hu, Michał Dereziński

In this paper, we propose HERTA: a High-Efficiency and Rigorous Training Algorithm for Unfolded GNNs that accelerates the whole training process, achieving a nearly-linear time worst-case training guarantee.

Solving Dense Linear Systems Faster than via Preconditioning

no code implementations14 Dec 2023 Michał Dereziński, Jiaming Yang

We give a stochastic optimization algorithm that solves a dense $n\times n$ real-valued linear system $Ax=b$, returning $\tilde x$ such that $\|A\tilde x-b\|\leq \epsilon\|b\|$ in time: $$\tilde O((n^2+nk^{\omega-1})\log 1/\epsilon),$$ where $k$ is the number of singular values of $A$ larger than $O(1)$ times its smallest positive singular value, $\omega < 2.372$ is the matrix multiplication exponent, and $\tilde O$ hides a factor that is poly-logarithmic in $n$.

Stochastic Optimization
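
To make the quantity $k$ in this bound concrete, a small NumPy sketch that counts the singular values exceeding a constant multiple of the smallest positive one; the threshold constant `C`, standing in for the unspecified $O(1)$ factor, is an assumption:

```python
import numpy as np

def count_large_singular_values(A, C=2.0):
    """Count singular values of A larger than C times its smallest positive one."""
    s = np.linalg.svd(A, compute_uv=False)
    s_pos = s[s > 1e-12 * s.max()]              # drop numerically zero singular values
    return int(np.sum(s_pos > C * s_pos.min()))

rng = np.random.default_rng(0)
n = 500
# A test matrix with 5 large singular values on top of a flat spectrum.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.ones(n)
s[:5] = 100.0
A = (U * s) @ V.T
print("k =", count_large_singular_values(A))    # prints 5
```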

Optimal Embedding Dimension for Sparse Subspace Embeddings

no code implementations17 Nov 2023 Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, Mark Rudelson

We use this to construct the first oblivious subspace embedding with $O(d)$ embedding dimension that can be applied faster than current matrix multiplication time, and to obtain an optimal single-pass algorithm for least squares regression.
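
For context, a minimal sketch-and-solve least-squares example using a classical CountSketch-style sparse embedding (one nonzero per column); the embedding dimension `m = 10*d` is an illustrative assumption, not the optimal $O(d)$ dimension established in the paper:

```python
import numpy as np

def countsketch_apply(M, m, rng):
    """Apply an m x n CountSketch-style embedding (one +/-1 entry per column) to M."""
    n = M.shape[0]
    rows = rng.integers(0, m, size=n)          # hash each row of M to a sketch row
    signs = rng.choice([-1.0, 1.0], size=n)
    SM = np.zeros((m, M.shape[1]))
    np.add.at(SM, rows, signs[:, None] * M)    # accumulate signed rows (duplicates handled)
    return SM

rng = np.random.default_rng(0)
n, d = 100_000, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

m = 10 * d                                     # embedding dimension (assumed, not the paper's O(d))
SAb = countsketch_apply(np.column_stack([A, b]), m, rng)
x_sk, *_ = np.linalg.lstsq(SAb[:, :d], SAb[:, d], rcond=None)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print("cost ratio:", np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_ls - b))
```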

Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems

no code implementations30 Aug 2023 Younghyun Cho, James W. Demmel, Michał Dereziński, Haoyun Li, Hengrui Luo, Michael W. Mahoney, Riley J. Murray

Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees.

regression

Sharp Analysis of Sketch-and-Project Methods via a Connection to Randomized Singular Value Decomposition

no code implementations20 Aug 2022 Michał Dereziński, Elizaveta Rebrova

Sketch-and-project is a framework which unifies many known iterative methods for solving linear systems and their variants, as well as further extensions to non-linear optimization problems.
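
A minimal NumPy sketch of one member of this framework, the block-Kaczmarz special case, in which each iteration projects the current iterate onto the solution set of a randomly sketched subsystem; the uniform row blocks and block size are assumptions, and the paper's analysis covers far more general sketches:

```python
import numpy as np

def sketch_and_project(A, b, block_size=20, iters=2000, seed=0):
    """Block-Kaczmarz-style sketch-and-project iteration for a consistent system Ax = b."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        idx = rng.choice(A.shape[0], size=block_size, replace=False)
        A_S, b_S = A[idx], b[idx]
        # Project x onto the affine set {z : A_S z = b_S}.
        x -= np.linalg.pinv(A_S) @ (A_S @ x - b_S)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 100))
x_star = rng.standard_normal(100)
b = A @ x_star                       # consistent system with a unique solution
x = sketch_and_project(A, b)
print("relative error:", np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```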

Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

no code implementations21 Jun 2022 Michał Dereziński

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions.

Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

1 code implementation6 Jun 2022 Michał Dereziński

Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization.

Second-order methods

Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

1 code implementation20 Apr 2022 Sen Na, Michał Dereziński, Michael W. Mahoney

Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
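
A minimal NumPy sketch of the uniform (running-mean) Hessian averaging idea inside a subsampled Newton loop, on regularized logistic regression; the batch size, iteration count, and use of exact gradients are illustrative assumptions, and the paper's weighted schemes differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 20, 1e-3
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ rng.standard_normal(d)))).astype(float)

def full_grad(w):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / n + lam * w

def subsampled_hessian(w, batch):
    idx = rng.integers(0, n, size=batch)
    p = 1 / (1 + np.exp(-X[idx] @ w))
    D = p * (1 - p)
    return (X[idx].T * D) @ X[idx] / batch + lam * np.eye(d)

w = np.zeros(d)
H_avg = np.zeros((d, d))
for t in range(1, 51):
    H_avg += (subsampled_hessian(w, batch=200) - H_avg) / t   # uniform (running-mean) averaging
    w -= np.linalg.solve(H_avg, full_grad(w))                  # Newton step with the averaged Hessian
print("final gradient norm:", np.linalg.norm(full_grad(w)))
```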

Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

1 code implementation NeurIPS 2021 Michał Dereziński, Jonathan Lacotte, Mert Pilanci, Michael W. Mahoney

In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration.

Query Complexity of Least Absolute Deviation Regression via Robust Uniform Convergence

no code implementations3 Feb 2021 Xue Chen, Michał Dereziński

An important example is least absolute deviation regression ($\ell_1$ regression), which enjoys superior robustness to outliers compared to least squares.

Learning Theory regression +1

Sparse sketches with small inversion bias

no code implementations21 Nov 2020 Michał Dereziński, Zhenyu Liao, Edgar Dobriban, Michael W. Mahoney

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$.

Distributed Optimization
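
A quick Monte Carlo illustration of this inversion bias, using a dense Gaussian sketch rather than the sparse sketches analyzed in the paper; the dimensions and trial count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 1000, 10, 50, 1000
A = rng.standard_normal((n, d))
target = np.linalg.inv(A.T @ A)

est = np.zeros((d, d))
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)     # dense Gaussian sketch (assumed)
    A_tilde = S @ A
    est += np.linalg.inv(A_tilde.T @ A_tilde) / trials

# For a Gaussian sketch the inflation factor is known in closed form: m / (m - d - 1).
print("trace ratio:", np.trace(est) / np.trace(target), "expected approx.", m / (m - d - 1))
```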

Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

no code implementations NeurIPS 2020 Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data.

Point Processes Second-order methods

Sampling from a $k$-DPP without looking at all items

no code implementations30 Jun 2020 Daniele Calandriello, Michał Dereziński, Michal Valko

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more.

Active Learning Point Processes +1

Determinantal Point Processes in Randomized Numerical Linear Algebra

no code implementations7 May 2020 Michał Dereziński, Michael W. Mahoney

For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nystr\"om method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP.

Point Processes
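
As a small companion to the leverage-score remark above, a NumPy snippet that computes leverage scores as the squared row norms of an orthonormal basis of the column span; under the corresponding projection DPP these are exactly the marginal inclusion probabilities (the random matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
Q, _ = np.linalg.qr(X)                   # orthonormal basis of the column span of X
leverage_scores = (Q ** 2).sum(axis=1)   # tau_i = ||Q_i||^2, the i-th leverage score
print(leverage_scores.sum())             # sums to d = 10
```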

Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method

no code implementations21 Feb 2020 Michał Dereziński, Rajiv Khanna, Michael W. Mahoney

The Column Subset Selection Problem (CSSP) and the Nystr\"om method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.
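
For reference, a minimal NumPy sketch of the Nyström approximation built from a column subset; uniform sampling of the subset is an assumption here, whereas the paper analyzes much stronger subset-selection rules:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
# RBF kernel matrix over the rows of X.
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

k = 50
S = rng.choice(K.shape[0], size=k, replace=False)            # uniformly sampled column subset
K_nys = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]  # Nystrom approximation
print("relative error:", np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```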

Unbiased estimators for random design regression

no code implementations8 Jul 2019 Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum.

regression

Bayesian experimental design using regularized determinantal point processes

1 code implementation10 Jun 2019 Michał Dereziński, Feynman Liang, Michael W. Mahoney

In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task.

Experimental Design Point Processes
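
A simple greedy D-optimal baseline for this selection problem, added for illustration only; it is not the regularized-DPP procedure proposed in the paper, and the ridge parameter is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 500, 10, 25, 1e-3
X = rng.standard_normal((n, d))

selected, A = [], lam * np.eye(d)
for _ in range(k):
    # Gain of adding x: log det(A + x x^T) - log det(A) = log(1 + x^T A^{-1} x).
    A_inv = np.linalg.inv(A)
    gains = np.einsum("ij,jk,ik->i", X, A_inv, X)
    gains[selected] = -np.inf                      # do not pick the same point twice
    i = int(np.argmax(gains))
    selected.append(i)
    A += np.outer(X[i], X[i])
print("selected rows:", sorted(selected))
```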

Exact sampling of determinantal point processes with sublinear time preprocessing

2 code implementations NeurIPS 2019 Michał Dereziński, Daniele Calandriello, Michal Valko

For this purpose, we propose a new algorithm which, given access to $\mathbf{L}$, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is $n \cdot \text{poly}(k)$, i.e., sublinear in the size of $\mathbf{L}$, and (2) its sampling cost is $\text{poly}(k)$, i.e., independent of the size of $\mathbf{L}$.

Point Processes
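
For reference, the classical exact sampler whose full $n^3$ eigendecomposition preprocessing this paper replaces with a sublinear step: a self-contained NumPy sketch of the spectral (HKPV) algorithm for L-ensembles, applied to an illustrative low-rank kernel:

```python
import numpy as np

def sample_dpp(L, rng):
    """Exact L-ensemble DPP sampler via the classical spectral (HKPV) algorithm."""
    eigvals, eigvecs = np.linalg.eigh(L)
    # Phase 1: include eigenvector i independently with probability lambda_i / (1 + lambda_i).
    keep = rng.random(len(eigvals)) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    sample = []
    while V.shape[1] > 0:
        # Phase 2: pick an item with probability proportional to its squared row norm in V.
        probs = (V ** 2).sum(axis=1)
        i = int(rng.choice(L.shape[0], p=probs / probs.sum()))
        sample.append(i)
        # Condition on item i: drop one column, zero out row i, re-orthonormalize.
        j = int(np.argmax(np.abs(V[i])))         # a column with a nonzero entry in row i
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        if V.shape[1] == 0:
            break
        V = V - np.outer(Vj, V[i] / Vj[i])       # zero out row i in the remaining columns
        V, _ = np.linalg.qr(V)                   # re-orthonormalize (same column span)
    return sorted(sample)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
L = X @ X.T          # a rank-5 PSD likelihood kernel, so samples contain at most 5 items
print(sample_dpp(L, rng))
```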

Distributed estimation of the inverse Hessian by determinantal averaging

no code implementations NeurIPS 2019 Michał Dereziński, Michael W. Mahoney

In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum.

Distributed Optimization Uncertainty Quantification
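
A two-matrix numeric illustration of that inversion bias (the matrices are arbitrary SPD examples, not anything from the paper):

```python
import numpy as np

H1 = np.array([[2.0, 0.0], [0.0, 1.0]])
H2 = np.array([[1.0, 0.0], [0.0, 3.0]])
avg_of_inverses = (np.linalg.inv(H1) + np.linalg.inv(H2)) / 2
inverse_of_avg = np.linalg.inv((H1 + H2) / 2)
print(avg_of_inverses)   # diag(0.75, 0.6667)
print(inverse_of_avg)    # diag(0.6667, 0.5) -- not the same matrix
```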

Fast determinantal point processes via distortion-free intermediate sampling

no code implementations8 Nov 2018 Michał Dereziński

To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$.

Data Summarization Point Processes +1

Reverse iterative volume sampling for linear regression

no code implementations6 Jun 2018 Michał Dereziński, Manfred K. Warmuth

We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset.

BIG-bench Machine Learning regression

Leveraged volume sampling for linear regression

no code implementations NeurIPS 2018 Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability.

Point Processes regression

Subsampling for Ridge Regression via Regularized Volume Sampling

no code implementations14 Oct 2017 Michał Dereziński, Manfred K. Warmuth

However, when labels are expensive, we are forced to select only a small subset of vectors $\mathbf{x}_i$ for which we obtain the labels $y_i$.

regression

Unbiased estimates for linear regression via volume sampling

no code implementations NeurIPS 2017 Michał Dereziński, Manfred K. Warmuth

The pseudoinverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$.

regression

Batch-Expansion Training: An Efficient Optimization Framework

no code implementations22 Apr 2017 Michał Dereziński, Dhruv Mahajan, S. Sathiya Keerthi, S. V. N. Vishwanathan, Markus Weimer

We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset.
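
A minimal sketch of the batch-expansion idea: run a deterministic optimizer on a growing prefix of the data, warm-starting each stage from the previous solution; the doubling schedule, inner solver, and iteration counts here are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8192, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def gd_least_squares(Xb, yb, w0, steps=100):
    """Plain gradient descent on 0.5 * ||Xb w - yb||^2 with a 1/L step size."""
    w = w0.copy()
    lr = 1.0 / np.linalg.norm(Xb, 2) ** 2        # step size from the spectral norm
    for _ in range(steps):
        w -= lr * Xb.T @ (Xb @ w - yb)
    return w

w = np.zeros(d)
batch = 256
while batch <= n:
    w = gd_least_squares(X[:batch], y[:batch], w)   # warm start on the expanded batch
    batch *= 2                                      # assumed doubling schedule
print("final loss:", np.mean((X @ w - y) ** 2))
```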
