Search Results for author: Michał Dereziński

Found 31 papers, 5 papers with code

Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

no code implementations23 Apr 2024 Sachin Garg, Albert S. Berahas, Michał Dereziński

We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches.
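
As background for this setting, a minimal NumPy sketch of a plain SVRG-style variance-reduced gradient loop on a least-squares finite sum; it omits the partial second-order information that the paper adds, and the mini-batch size and step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_grad(w):
    # Gradient of the finite-sum objective (1/2n) * sum_i (x_i^T w - y_i)^2.
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
for epoch in range(20):
    w_ref, g_ref = w.copy(), full_grad(w)        # snapshot point and its full gradient
    for _ in range(n // 64):
        idx = rng.integers(0, n, size=64)        # mini-batch of size 64 (assumed)
        Xb, yb = X[idx], y[idx]
        # Variance-reduced gradient: mini-batch gradient corrected by the snapshot.
        g = Xb.T @ (Xb @ w - yb) / 64 - Xb.T @ (Xb @ w_ref - yb) / 64 + g_ref
        w -= 0.05 * g                            # assumed step size
print("final objective:", 0.5 * np.mean((X @ w - y) ** 2))
```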

HERTA: A High-Efficiency and Rigorous Training Algorithm for Unfolded Graph Neural Networks

no code implementations26 Mar 2024 Yongyi Yang, Jiaming Yang, Wei Hu, Michał Dereziński

In this paper, we propose HERTA: a High-Efficiency and Rigorous Training Algorithm for Unfolded GNNs that accelerates the whole training process, achieving a nearly-linear time worst-case training guarantee.

Solving Dense Linear Systems Faster than via Preconditioning

no code implementations14 Dec 2023 Michał Dereziński, Jiaming Yang

We give a stochastic optimization algorithm that solves a dense $n\times n$ real-valued linear system $Ax=b$, returning $\tilde x$ such that $\|A\tilde x-b\|\leq \epsilon\|b\|$ in time: $$\tilde O((n^2+nk^{\omega-1})\log 1/\epsilon),$$ where $k$ is the number of singular values of $A$ larger than $O(1)$ times its smallest positive singular value, $\omega < 2.372$ is the matrix multiplication exponent, and $\tilde O$ hides a factor that is poly-logarithmic in $n$.

Stochastic Optimization
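
To make the quantity $k$ in this bound concrete, a small NumPy sketch that counts the singular values exceeding a constant multiple of the smallest positive one; the threshold constant `C`, standing in for the unspecified $O(1)$ factor, is an assumption:

```python
import numpy as np

def count_large_singular_values(A, C=2.0):
    """Count singular values of A larger than C times its smallest positive one."""
    s = np.linalg.svd(A, compute_uv=False)
    s_pos = s[s > 1e-12 * s.max()]              # drop numerically zero singular values
    return int(np.sum(s_pos > C * s_pos.min()))

rng = np.random.default_rng(0)
n = 500
# A test matrix with 5 large singular values on top of a flat spectrum.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.ones(n)
s[:5] = 100.0
A = (U * s) @ V.T
print("k =", count_large_singular_values(A))    # prints 5
```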

Optimal Embedding Dimension for Sparse Subspace Embeddings

no code implementations17 Nov 2023 Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, Mark Rudelson

We use this to construct the first oblivious subspace embedding with $O(d)$ embedding dimension that can be applied faster than current matrix multiplication time, and to obtain an optimal single-pass algorithm for least squares regression.
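
For context, a minimal sketch-and-solve least-squares example using a classical CountSketch-style sparse embedding (one nonzero per column); the embedding dimension `m = 10*d` is an illustrative assumption, not the optimal $O(d)$ dimension established in the paper:

```python
import numpy as np

def countsketch_apply(M, m, rng):
    """Apply an m x n CountSketch-style embedding (one +/-1 entry per column) to M."""
    n = M.shape[0]
    rows = rng.integers(0, m, size=n)          # hash each row of M to a sketch row
    signs = rng.choice([-1.0, 1.0], size=n)
    SM = np.zeros((m, M.shape[1]))
    np.add.at(SM, rows, signs[:, None] * M)    # accumulate signed rows (duplicates handled)
    return SM

rng = np.random.default_rng(0)
n, d = 100_000, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

m = 10 * d                                     # embedding dimension (assumed, not the paper's O(d))
SAb = countsketch_apply(np.column_stack([A, b]), m, rng)
x_sk, *_ = np.linalg.lstsq(SAb[:, :d], SAb[:, d], rcond=None)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print("cost ratio:", np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_ls - b))
```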

Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems

no code implementations30 Aug 2023 Younghyun Cho, James W. Demmel, Michał Dereziński, Haoyun Li, Hengrui Luo, Michael W. Mahoney, Riley J. Murray

Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees.

regression

Sharp Analysis of Sketch-and-Project Methods via a Connection to Randomized Singular Value Decomposition

no code implementations20 Aug 2022 Michał Dereziński, Elizaveta Rebrova

Sketch-and-project is a framework which unifies many known iterative methods for solving linear systems and their variants, as well as further extensions to non-linear optimization problems.
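
A minimal NumPy sketch of one member of this framework, the block-Kaczmarz special case, in which each iteration projects the current iterate onto the solution set of a randomly sketched subsystem; the uniform row blocks and block size are assumptions, and the paper's analysis covers far more general sketches:

```python
import numpy as np

def sketch_and_project(A, b, block_size=20, iters=2000, seed=0):
    """Block-Kaczmarz-style sketch-and-project iteration for a consistent system Ax = b."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        idx = rng.choice(A.shape[0], size=block_size, replace=False)
        A_S, b_S = A[idx], b[idx]
        # Project x onto the affine set {z : A_S z = b_S}.
        x -= np.linalg.pinv(A_S) @ (A_S @ x - b_S)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 100))
x_star = rng.standard_normal(100)
b = A @ x_star                       # consistent system with a unique solution
x = sketch_and_project(A, b)
print("relative error:", np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```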

Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

no code implementations21 Jun 2022 Michał Dereziński

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions.

Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

1 code implementation6 Jun 2022 Michał Dereziński

Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization.

Second-order methods

Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

1 code implementation20 Apr 2022 Sen Na, Michał Dereziński, Michael W. Mahoney

Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
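
A minimal NumPy sketch of the uniform (running-mean) Hessian averaging idea inside a subsampled Newton loop, on regularized logistic regression; the batch size, iteration count, and use of exact gradients are illustrative assumptions, and the paper's weighted schemes differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 20, 1e-3
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ rng.standard_normal(d)))).astype(float)

def full_grad(w):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / n + lam * w

def subsampled_hessian(w, batch):
    idx = rng.integers(0, n, size=batch)
    p = 1 / (1 + np.exp(-X[idx] @ w))
    D = p * (1 - p)
    return (X[idx].T * D) @ X[idx] / batch + lam * np.eye(d)

w = np.zeros(d)
H_avg = np.zeros((d, d))
for t in range(1, 51):
    H_avg += (subsampled_hessian(w, batch=200) - H_avg) / t   # uniform (running-mean) averaging
    w -= np.linalg.solve(H_avg, full_grad(w))                  # Newton step with the averaged Hessian
print("final gradient norm:", np.linalg.norm(full_grad(w)))
```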

Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

1 code implementation NeurIPS 2021 Michał Dereziński, Jonathan Lacotte, Mert Pilanci, Michael W. Mahoney

In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration.

Query Complexity of Least Absolute Deviation Regression via Robust Uniform Convergence

no code implementations3 Feb 2021 Xue Chen, Michał Dereziński

An important example is least absolute deviation regression ($\ell_1$ regression), which enjoys superior robustness to outliers compared to least squares.

Learning Theory regression +1

Sparse sketches with small inversion bias

no code implementations21 Nov 2020 Michał Dereziński, Zhenyu Liao, Edgar Dobriban, Michael W. Mahoney

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$.

Distributed Optimization
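
A quick Monte Carlo illustration of this inversion bias, using a dense Gaussian sketch rather than the sparse sketches analyzed in the paper; the dimensions and trial count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 1000, 10, 50, 1000
A = rng.standard_normal((n, d))
target = np.linalg.inv(A.T @ A)

est = np.zeros((d, d))
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)     # dense Gaussian sketch (assumed)
    A_tilde = S @ A
    est += np.linalg.inv(A_tilde.T @ A_tilde) / trials

# For a Gaussian sketch the inflation factor is known in closed form: m / (m - d - 1).
print("trace ratio:", np.trace(est) / np.trace(target), "expected approx.", m / (m - d - 1))
```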

Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

no code implementations NeurIPS 2020 Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data.

Point Processes Second-order methods

Sampling from a $k$-DPP without looking at all items

no code implementations30 Jun 2020 Daniele Calandriello, Michał Dereziński, Michal Valko

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more.

Active Learning Point Processes +1

Determinantal Point Processes in Randomized Numerical Linear Algebra

no code implementations7 May 2020 Michał Dereziński, Michael W. Mahoney

For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nystr\"om method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP.

Point Processes
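
As a small companion to the leverage-score remark above, a NumPy snippet that computes leverage scores as the squared row norms of an orthonormal basis of the column span; under the corresponding projection DPP these are exactly the marginal inclusion probabilities (the random matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
Q, _ = np.linalg.qr(X)                   # orthonormal basis of the column span of X
leverage_scores = (Q ** 2).sum(axis=1)   # tau_i = ||Q_i||^2, the i-th leverage score
print(leverage_scores.sum())             # sums to d = 10
```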

Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method

no code implementations21 Feb 2020 Michał Dereziński, Rajiv Khanna, Michael W. Mahoney

The Column Subset Selection Problem (CSSP) and the Nystr\"om method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.
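
For reference, a minimal NumPy sketch of the Nyström approximation built from a column subset; uniform sampling of the subset is an assumption here, whereas the paper analyzes much stronger subset-selection rules:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
# RBF kernel matrix over the rows of X.
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

k = 50
S = rng.choice(K.shape[0], size=k, replace=False)            # uniformly sampled column subset
K_nys = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]  # Nystrom approximation
print("relative error:", np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```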

Unbiased estimators for random design regression

no code implementations8 Jul 2019 Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum.

regression

Bayesian experimental design using regularized determinantal point processes

1 code implementation10 Jun 2019 Michał Dereziński, Feynman Liang, Michael W. Mahoney

In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task.

Experimental Design Point Processes
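
A simple greedy D-optimal baseline for this selection problem, added for illustration only; it is not the regularized-DPP procedure proposed in the paper, and the ridge parameter is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 500, 10, 25, 1e-3
X = rng.standard_normal((n, d))

selected, A = [], lam * np.eye(d)
for _ in range(k):
    # Gain of adding x: log det(A + x x^T) - log det(A) = log(1 + x^T A^{-1} x).
    A_inv = np.linalg.inv(A)
    gains = np.einsum("ij,jk,ik->i", X, A_inv, X)
    gains[selected] = -np.inf                      # do not pick the same point twice
    i = int(np.argmax(gains))
    selected.append(i)
    A += np.outer(X[i], X[i])
print("selected rows:", sorted(selected))
```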

Exact sampling of determinantal point processes with sublinear time preprocessing

2 code implementations NeurIPS 2019 Michał Dereziński, Daniele Calandriello, Michal Valko

For this purpose, we propose a new algorithm which, given access to $\mathbf{L}$, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is $n \cdot \text{poly}(k)$, i.e., sublinear in the size of $\mathbf{L}$, and (2) its sampling cost is $\text{poly}(k)$, i.e., independent of the size of $\mathbf{L}$.

Point Processes
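
For reference, the classical exact sampler whose full $n^3$ eigendecomposition preprocessing this paper replaces with a sublinear step: a self-contained NumPy sketch of the spectral (HKPV) algorithm for L-ensembles, applied to an illustrative low-rank kernel:

```python
import numpy as np

def sample_dpp(L, rng):
    """Exact L-ensemble DPP sampler via the classical spectral (HKPV) algorithm."""
    eigvals, eigvecs = np.linalg.eigh(L)
    # Phase 1: include eigenvector i independently with probability lambda_i / (1 + lambda_i).
    keep = rng.random(len(eigvals)) < eigvals / (1.0 + eigvals)
    V = eigvecs[:, keep]
    sample = []
    while V.shape[1] > 0:
        # Phase 2: pick an item with probability proportional to its squared row norm in V.
        probs = (V ** 2).sum(axis=1)
        i = int(rng.choice(L.shape[0], p=probs / probs.sum()))
        sample.append(i)
        # Condition on item i: drop one column, zero out row i, re-orthonormalize.
        j = int(np.argmax(np.abs(V[i])))         # a column with a nonzero entry in row i
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        if V.shape[1] == 0:
            break
        V = V - np.outer(Vj, V[i] / Vj[i])       # zero out row i in the remaining columns
        V, _ = np.linalg.qr(V)                   # re-orthonormalize (same column span)
    return sorted(sample)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
L = X @ X.T          # a rank-5 PSD likelihood kernel, so samples contain at most 5 items
print(sample_dpp(L, rng))
```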

Distributed estimation of the inverse Hessian by determinantal averaging

no code implementations NeurIPS 2019 Michał Dereziński, Michael W. Mahoney

In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum.

Distributed Optimization Uncertainty Quantification
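
A two-matrix numeric illustration of that inversion bias (the matrices are arbitrary SPD examples, not anything from the paper):

```python
import numpy as np

H1 = np.array([[2.0, 0.0], [0.0, 1.0]])
H2 = np.array([[1.0, 0.0], [0.0, 3.0]])
avg_of_inverses = (np.linalg.inv(H1) + np.linalg.inv(H2)) / 2
inverse_of_avg = np.linalg.inv((H1 + H2) / 2)
print(avg_of_inverses)   # diag(0.75, 0.6667)
print(inverse_of_avg)    # diag(0.6667, 0.5) -- not the same matrix
```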

Fast determinantal point processes via distortion-free intermediate sampling

no code implementations8 Nov 2018 Michał Dereziński

To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$.

Data Summarization Point Processes +1

Reverse iterative volume sampling for linear regression

no code implementations6 Jun 2018 Michał Dereziński, Manfred K. Warmuth

We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset.

BIG-bench Machine Learning regression

Leveraged volume sampling for linear regression

no code implementations NeurIPS 2018 Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size $k=O(d\log d + d/\epsilon)$ suffices to guarantee total loss at most $1+\epsilon$ times the minimum with high probability.

Point Processes regression

Subsampling for Ridge Regression via Regularized Volume Sampling

no code implementations14 Oct 2017 Michał Dereziński, Manfred K. Warmuth

However, when labels are expensive, we are forced to select only a small subset of vectors $\mathbf{x}_i$ for which we obtain the labels $y_i$.

regression

Unbiased estimates for linear regression via volume sampling

no code implementations NeurIPS 2017 Michał Dereziński, Manfred K. Warmuth

The pseudoinverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$.

regression

Batch-Expansion Training: An Efficient Optimization Framework

no code implementations22 Apr 2017 Michał Dereziński, Dhruv Mahajan, S. Sathiya Keerthi, S. V. N. Vishwanathan, Markus Weimer

We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset.
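
A minimal sketch of the batch-expansion idea: run a deterministic optimizer on a growing prefix of the data, warm-starting each stage from the previous solution; the doubling schedule, inner solver, and iteration counts here are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8192, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def gd_least_squares(Xb, yb, w0, steps=100):
    """Plain gradient descent on 0.5 * ||Xb w - yb||^2 with a 1/L step size."""
    w = w0.copy()
    lr = 1.0 / np.linalg.norm(Xb, 2) ** 2        # step size from the spectral norm
    for _ in range(steps):
        w -= lr * Xb.T @ (Xb @ w - yb)
    return w

w = np.zeros(d)
batch = 256
while batch <= n:
    w = gd_least_squares(X[:batch], y[:batch], w)   # warm start on the expanded batch
    batch *= 2                                      # assumed doubling schedule
print("final loss:", np.mean((X @ w - y) ** 2))
```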
