Search Results for author: Michael W. Mahoney

Found 164 papers, 63 papers with code

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

1 code implementation22 Mar 2024 Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data.
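
A minimal Python sketch of that loop is below; fine_tune, get_wrong_examples, and teacher_augment are hypothetical stand-ins for the student fine-tuning, evaluation, and teacher-generation steps, not the paper's implementation.

```python
# Hypothetical sketch of the LLM2LLM-style loop described above.
def fine_tune(model, data):
    # stand-in: a real version would run supervised fine-tuning here
    return {"base": model, "seen": len(data)}

def get_wrong_examples(model, data):
    # stand-in: pretend every third example is still answered incorrectly
    return [ex for i, ex in enumerate(data) if i % 3 == 0]

def teacher_augment(wrong_examples, per_example=2):
    # stand-in: a teacher LLM would generate new variants of each hard example
    return [f"{ex} (synthetic variant {j})"
            for ex in wrong_examples for j in range(per_example)]

def llm2llm_loop(student, seed_data, rounds=3):
    data = list(seed_data)
    for _ in range(rounds):
        student = fine_tune(student, data)         # (1) fine-tune on current data
        wrong = get_wrong_examples(student, data)  # (2) extract failure cases
        data += teacher_augment(wrong)             # (3) add teacher-generated data
    return student, data

model, augmented = llm2llm_loop("student-llm",
                                [f"seed question {i}" for i in range(9)])
print(len(augmented))
```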

Data Augmentation GSM8K +1

AI and Memory Wall

no code implementations21 Mar 2024 Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs.

Using Uncertainty Quantification to Characterize and Improve Out-of-Domain Learning for PDEs

1 code implementation15 Mar 2024 S. Chandra Mouli, Danielle C. Maddix, Shima Alizadeh, Gaurav Gupta, Andrew Stuart, Michael W. Mahoney, Yuyang Wang

Existing work in scientific machine learning (SciML) has shown that data-driven learning of solution operators can provide a fast approximate alternative to classical numerical partial differential equation (PDE) solvers.

Uncertainty Quantification

Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning

no code implementations24 Feb 2024 Wuyang Chen, Jialin Song, Pu Ren, Shashank Subramanian, Dmitriy Morozov, Michael W. Mahoney

To reduce the need for training data with simulated solutions, we pretrain neural operators on unlabeled PDE data using reconstruction-based proxy tasks.

In-Context Learning Operator learning

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

1 code implementation31 Jan 2024 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.

Quantization

SALSA: Sequential Approximate Leverage-Score Algorithm with Application in Analyzing Big Time Series Data

no code implementations30 Dec 2023 Ali Eshragh, Luke Yerbury, Asef Nazari, Fred Roosta, Michael W. Mahoney

We demonstrate that, with high probability, the accuracy of SALSA's approximations is within $(1 + O({\varepsilon}))$ of the true leverage scores.
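
For reference, the exact leverage scores that SALSA approximates can be computed from a thin QR factorization; the brute-force numpy baseline below illustrates the quantity itself, not the SALSA algorithm.

```python
import numpy as np

# Exact statistical leverage scores of a tall matrix A (n >> d): the i-th score
# is the squared norm of the i-th row of Q from a thin QR factorization.
# SALSA approximates these to within a (1 + O(eps)) factor; this is the
# brute-force baseline, not the sequential approximate algorithm.
rng = np.random.default_rng(0)
n, d = 10_000, 5
A = rng.standard_normal((n, d))

Q, _ = np.linalg.qr(A, mode="reduced")
leverage = np.sum(Q**2, axis=1)

print(leverage[:5])
print(leverage.sum())  # leverage scores always sum to rank(A) = d
```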

Time Series

An LLM Compiler for Parallel Function Calling

1 code implementation7 Dec 2023 Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls.
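
The underlying idea, running independent function calls concurrently rather than one after another, can be illustrated with plain asyncio; this generic sketch is not the LLMCompiler API.

```python
import asyncio

# Generic illustration of parallel function calling (not the LLMCompiler API):
# tool calls with no data dependency can be awaited concurrently.
async def get_weather(city: str) -> str:
    await asyncio.sleep(1.0)        # stand-in for a slow tool/API call
    return f"weather({city}) = sunny"

async def get_population(city: str) -> str:
    await asyncio.sleep(1.0)
    return f"population({city}) = 1M"

async def main():
    # Both calls run in parallel: ~1s of wall-clock time instead of ~2s.
    results = await asyncio.gather(
        get_weather("Berkeley"),
        get_population("Berkeley"),
    )
    print(results)

asyncio.run(main())
```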

Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training

1 code implementation NeurIPS 2023 Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W. Mahoney, Yaoqing Yang

In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training.

Scheduling

A PAC-Bayesian Perspective on the Interpolating Information Criterion

no code implementations13 Nov 2023 Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney

Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice.

Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels

1 code implementation9 Oct 2023 Da Long, Wei W. Xing, Aditi S. Krishnapriyan, Robert M. Kirby, Shandian Zhe, Michael W. Mahoney

To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra to enable efficient computation and optimization.
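
The mesh construction rests on the standard Kronecker identity (K1 ⊗ K2) vec(V) = vec(K2 V K1^T), which lets one multiply by the full mesh kernel without ever forming it; a small numpy check of that identity (an illustration, not the paper's code).

```python
import numpy as np

# Kernel values on a 2D mesh factor as a Kronecker product K = K1 ⊗ K2, so a
# matrix-vector product with the full (n1*n2 x n1*n2) kernel reduces to two
# small matrix products via (K1 ⊗ K2) vec(V) = vec(K2 V K1^T).
rng = np.random.default_rng(0)

def rbf(x):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2)

x1, x2 = rng.uniform(size=6), rng.uniform(size=5)
K1, K2 = rbf(x1), rbf(x2)                             # 1D kernels per mesh axis
V = rng.standard_normal((K2.shape[0], K1.shape[0]))   # values on the 5 x 6 mesh

full = np.kron(K1, K2) @ V.flatten(order="F")         # naive: forms the full kernel
fast = (K2 @ V @ K1.T).flatten(order="F")             # Kronecker trick: small matmuls

print(np.allclose(full, fast))  # True
```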

regression Uncertainty Quantification

Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs

no code implementations4 Oct 2023 Ilan Naiman, N. Benjamin Erichson, Pu Ren, Michael W. Mahoney, Omri Azencot

In this work, we introduce Koopman VAE (KVAE), a new generative framework that is based on a novel design for the model prior, and that can be optimized for either regular or irregular training data.

Irregular Time Series Time Series +1

Robustifying State-space Models for Long Sequences via Approximate Diagonalization

no code implementations2 Oct 2023 Annan Yu, Arnur Nigmetov, Dmitriy Morozov, Michael W. Mahoney, N. Benjamin Erichson

An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework.

Computational Efficiency

Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems

no code implementations30 Aug 2023 Younghyun Cho, James W. Demmel, Michał Dereziński, Haoyun Li, Hengrui Luo, Michael W. Mahoney, Riley J. Murray

Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees.

regression

Probabilistic Forecasting with Coherent Aggregation

no code implementations19 Jul 2023 Geoffrey Négiar, Ruijun Ma, O. Nangba Meetei, Mengfei Cao, Michael W. Mahoney

Our model uses a convolutional neural network to produce parameters for the factors, their loadings and base-level distributions; it produces samples which can be differentiated with respect to the model's parameters; and it can therefore optimize for any sample-based loss function, including the Continuous Ranked Probability Score and quantile losses.
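
For example, the Continuous Ranked Probability Score has the standard sample-based estimator CRPS(F, y) ~ mean|X - y| - 0.5 * mean|X - X'|, which can be computed (and differentiated) directly from model samples; an illustrative numpy version follows.

```python
import numpy as np

# Monte-Carlo estimate of the Continuous Ranked Probability Score from samples:
#   CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,   X, X' ~ F independent.
# With differentiable samples, the same expression can serve as a training loss.
def crps_from_samples(samples: np.ndarray, y: float) -> float:
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
forecast_samples = rng.normal(loc=0.0, scale=1.0, size=2000)
print(crps_from_samples(forecast_samples, y=0.3))
```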

energy management Management

The Interpolating Information Criterion for Overparameterized Models

no code implementations15 Jul 2023 Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney

The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset.

Model Selection

GEANN: Scalable Graph Augmentations for Multi-Horizon Time Series Forecasting

no code implementations7 Jul 2023 Sitan Yang, Malcolm Wolff, Shankar Ramasubramanian, Vincent Quenneville-Belair, Ronak Metha, Michael W. Mahoney

Encoder-decoder deep neural networks have been increasingly studied for multi-horizon time series forecasting, especially in real-world applications.

Data Augmentation Time Series +1

SuperBench: A Super-Resolution Benchmark Dataset for Scientific Machine Learning

1 code implementation24 Jun 2023 Pu Ren, N. Benjamin Erichson, Shashank Subramanian, Omer San, Zarija Lukic, Michael W. Mahoney

Super-Resolution (SR) techniques aim to enhance data resolution, enabling the retrieval of finer details, and improving the overall quality and fidelity of the data representation.

Retrieval Super-Resolution

SqueezeLLM: Dense-and-Sparse Quantization

2 code implementations13 Jun 2023 Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement.

Quantization

Constrained Optimization via Exact Augmented Lagrangian and Randomized Iterative Sketching

1 code implementation28 May 2023 Ilgee Hong, Sen Na, Michael W. Mahoney, Mladen Kolar

Our method adaptively controls the accuracy of the randomized solver and the penalty parameters of the exact augmented Lagrangian, to ensure that the inexact Newton direction is a descent direction of the exact augmented Lagrangian.

A Three-regime Model of Network Pruning

1 code implementation28 May 2023 Yefan Zhou, Yaoqing Yang, Arin Chang, Michael W. Mahoney

Our approach uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance.

Efficient Neural Network Hyperparameter Optimization +1

End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs

no code implementations13 Apr 2023 Javier Campos, Zhen Dong, Javier Duarte, Amir Gholami, Michael W. Mahoney, Jovan Mitrevski, Nhan Tran

We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware.

Quantization

Full Stack Optimization of Transformer Inference: a Survey

no code implementations27 Feb 2023 Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search.

Neural Architecture Search Scheduling

Learning Physical Models that Can Respect Conservation Laws

1 code implementation21 Feb 2023 Derek Hansen, Danielle C. Maddix, Shima Alizadeh, Gaurav Gupta, Michael W. Mahoney

We provide a detailed analysis of ProbConserv on learning with the Generalized Porous Medium Equation (GPME), a widely-applicable parameterized family of PDEs that illustrates the qualitative properties of both easier and harder PDEs.

Uncertainty Quantification

Speculative Decoding with Big Little Decoder

1 code implementation NeurIPS 2023 Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications.

Machine Translation Text Generation

Gated Recurrent Neural Networks with Weighted Time-Delay Feedback

no code implementations1 Dec 2022 N. Benjamin Erichson, Soon Hoe Lim, Michael W. Mahoney

We prove the existence and uniqueness of solutions for the continuous-time model, and we demonstrate that the proposed feedback mechanism can help improve the modeling of long-term dependencies.

Human Activity Recognition speech-recognition +4

Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems

1 code implementation29 Nov 2022 Yuchen Fang, Sen Na, Michael W. Mahoney, Mladen Kolar

We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints.

Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes

no code implementations14 Oct 2022 Liam Hodgkinson, Chris van der Heide, Fred Roosta, Michael W. Mahoney

One prominent issue is the curse of dimensionality: it is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions.

Gaussian Processes Uncertainty Quantification

Learning differentiable solvers for systems with hard constraints

no code implementations18 Jul 2022 Geoffrey Négiar, Michael W. Mahoney, Aditi S. Krishnapriyan

Our method leverages differentiable optimization and the implicit function theorem to effectively enforce physical constraints.

Dictionary Learning

Adaptive Self-supervision Algorithms for Physics-informed Neural Networks

1 code implementation8 Jul 2022 Shashank Subramanian, Robert M. Kirby, Michael W. Mahoney, Amir Gholami

We find that training vanilla PINNs for these problems can result in up to 70% prediction error in the solution, especially in the regime of low collocation points.

Neurotoxin: Durable Backdoors in Federated Learning

2 code implementations12 Jun 2022 Zhengming Zhang, Ashwinee Panda, Linyue Song, Yaoqing Yang, Michael W. Mahoney, Joseph E. Gonzalez, Kannan Ramchandran, Prateek Mittal

In this type of attack, the goal of the attacker is to use poisoned updates to implant so-called backdoors into the learned model such that, at test time, the model's outputs can be fixed to a given target for certain inputs.

Backdoor Attack Federated Learning +1

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

4 code implementations2 Jun 2022 Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes.

Automatic Speech Recognition Automatic Speech Recognition (ASR)

Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming

1 code implementation27 May 2022 Sen Na, Michael W. Mahoney

To reduce dominant computational cost of the method, we inexactly solve the quadratic program in each iteration by employing an iterative sketching solver.

Second-order methods Stochastic Optimization

Fat-Tailed Variational Inference with Anisotropic Tail Adaptive Flows

no code implementations16 May 2022 Feynman Liang, Liam Hodgkinson, Michael W. Mahoney

While fat-tailed densities commonly arise as posterior and marginal distributions in robust models and scale mixtures, they present challenges when Gaussian-based variational inference fails to capture tail decay accurately.

Variational Inference

Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

1 code implementation20 Apr 2022 Sen Na, Michał Dereziński, Michael W. Mahoney

Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.

A Fast Post-Training Pruning Framework for Transformers

2 code implementations29 Mar 2022 Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining.

Learning continuous models for continuous physics

no code implementations17 Feb 2022 Aditi S. Krishnapriyan, Alejandro F. Queiruga, N. Benjamin Erichson, Michael W. Mahoney

Dynamical systems that evolve continuously over time are ubiquitous throughout science and engineering.

Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data

1 code implementation6 Feb 2022 Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez, Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney

Our analyses consider (I) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size and the optimization hyperparameters, (II) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including GPT2, BERT, etc., and (III) a total of 28 existing and novel generalization metrics.

Model Selection

NoisyMix: Boosting Model Robustness to Common Corruptions

no code implementations2 Feb 2022 N. Benjamin Erichson, Soon Hoe Lim, Winnie Xu, Francisco Utrera, Ziang Cao, Michael W. Mahoney

For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic.

Data Augmentation

Noisy Feature Mixup

2 code implementations ICLR 2022 Soon Hoe Lim, N. Benjamin Erichson, Francisco Utrera, Winnie Xu, Michael W. Mahoney

We introduce Noisy Feature Mixup (NFM), an inexpensive yet effective method for data augmentation that combines the best of interpolation based training and noise injection schemes.
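
A minimal sketch of the idea on a batch of input features, assuming the usual mixup-then-noise ordering (the paper also injects noise at hidden layers):

```python
import numpy as np

# Minimal sketch of Noisy Feature Mixup on a batch:
# (1) mixup: convex combination of shuffled pairs of examples and labels,
# (2) noise injection: multiplicative and additive noise on the mixed features.
def noisy_feature_mixup(x, y, alpha=1.0, add_std=0.1, mult_std=0.1, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    x_mix = x_mix * (1 + mult_std * rng.standard_normal(x_mix.shape))
    x_mix = x_mix + add_std * rng.standard_normal(x_mix.shape)
    return x_mix, y_mix

x = np.random.default_rng(0).standard_normal((8, 4))   # batch of 8 feature vectors
y = np.eye(3)[np.arange(8) % 3]                        # one-hot labels
x_nfm, y_nfm = noisy_feature_mixup(x, y, rng=np.random.default_rng(1))
print(x_nfm.shape, y_nfm.shape)
```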

Data Augmentation

What's Hidden in a One-layer Randomly Weighted Transformer?

1 code implementation8 Sep 2021 Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney

Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.

Machine Translation Translation

Characterizing possible failure modes in physics-informed neural networks

2 code implementations NeurIPS 2021 Aditi S. Krishnapriyan, Amir Gholami, Shandian Zhe, Robert M. Kirby, Michael W. Mahoney

We provide evidence that the soft regularization in PINNs, which involves PDE-based differential operators, can introduce a number of subtle problems, including making the problem more ill-conditioned.

Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers

no code implementations2 Aug 2021 Liam Hodgkinson, Umut Şimşekli, Rajiv Khanna, Michael W. Mahoney

Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic non-convex settings is still poorly understood.

Generalization Bounds Stochastic Optimization

Taxonomizing local versus global structure in neural network loss landscapes

1 code implementation NeurIPS 2021 Yaoqing Yang, Liam Hodgkinson, Ryan Theisen, Joe Zou, Joseph E. Gonzalez, Kannan Ramchandran, Michael W. Mahoney

Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper.

Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

1 code implementation NeurIPS 2021 Michał Dereziński, Jonathan Lacotte, Mert Pilanci, Michael W. Mahoney

In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration.

Stateful ODE-Nets using Basis Function Expansions

3 code implementations NeurIPS 2021 Alejandro Queiruga, N. Benjamin Erichson, Liam Hodgkinson, Michael W. Mahoney

The recently-introduced class of ordinary differential equation networks (ODE-Nets) establishes a fruitful connection between deep learning and dynamical systems.

Image Classification Sentence

Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics

no code implementations1 Jun 2021 Charles H. Martin, Michael W. Mahoney

Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models.

Learning Theory

LEAP: Learnable Pruning for Transformer-based Models

1 code implementation30 May 2021 Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Moreover, in order to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty adaptively.

QQP

A Survey of Quantization Methods for Efficient Neural Network Inference

no code implementations25 Mar 2021 Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks.

Efficient Neural Network Quantization

Hessian Eigenspectra of More Realistic Nonlinear Models

no code implementations NeurIPS 2021 Zhenyu Liao, Michael W. Mahoney

Given an optimization problem, the Hessian matrix and its eigenspectrum can be used in many ways, ranging from designing more efficient second-order algorithms to performing model analysis and regression diagnostics.

A Differential Geometry Perspective on Orthogonal Recurrent Models

no code implementations18 Feb 2021 Omri Azencot, N. Benjamin Erichson, Mirela Ben-Chen, Michael W. Mahoney

In this work, we employ tools and insights from differential geometry to offer a novel perspective on orthogonal RNNs.

Noisy Recurrent Neural Networks

1 code implementation NeurIPS 2021 Soon Hoe Lim, N. Benjamin Erichson, Liam Hodgkinson, Michael W. Mahoney

We provide a general framework for studying recurrent neural networks (RNNs) trained by injecting noise into hidden states.

General Classification

I-BERT: Integer-only BERT Quantization

4 code implementations5 Jan 2021 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks.

Natural Language Inference Natural Language Understanding +1

Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nystrom method

no code implementations NeurIPS 2020 Michal Derezinski, Rajiv Khanna, Michael W. Mahoney

The Column Subset Selection Problem (CSSP) and the Nystrom method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.

Sparse sketches with small inversion bias

no code implementations21 Nov 2020 Michał Dereziński, Zhenyu Liao, Edgar Dobriban, Michael W. Mahoney

For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$.
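
The bias is easy to see numerically: for a dense Gaussian sketch, the sketched inverse covariance is inflated by roughly a factor m/(m - d - 1); a small Monte-Carlo check (illustration only, not the sparse sketches analyzed in the paper).

```python
import numpy as np

# For a Gaussian sketch S (m x n, entries N(0, 1/m)), E[((SA)^T (SA))^{-1}]
# is not (A^T A)^{-1}: it is inflated by roughly m / (m - d - 1).
rng = np.random.default_rng(0)
n, d, m, trials = 500, 4, 20, 2000

A = rng.standard_normal((n, d))
exact = np.linalg.inv(A.T @ A)

avg = np.zeros((d, d))
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    At = S @ A
    avg += np.linalg.inv(At.T @ At)
avg /= trials

print(np.trace(avg) / np.trace(exact))  # empirically close to m / (m - d - 1)
print(m / (m - d - 1))                  # = 1.333... for m = 20, d = 4
```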

Distributed Optimization

HAWQV3: Dyadic Neural Network Quantization

1 code implementation20 Nov 2020 Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer

Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.

Model Compression Quantization

Sparse Quantized Spectral Clustering

no code implementations ICLR 2021 Zhenyu Liao, Romain Couillet, Michael W. Mahoney

Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models.

Clustering Quantization

Improving Semi-supervised Federated Learning by Reducing the Gradient Diversity of Models

1 code implementation26 Aug 2020 Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E. Gonzalez, Michael W. Mahoney

Replacing BN with the recently-proposed Group Normalization (GN) can reduce gradient diversity and improve test accuracy.

Federated Learning

Continuous-in-Depth Neural Networks

4 code implementations5 Aug 2020 Alejandro F. Queiruga, N. Benjamin Erichson, Dane Taylor, Michael W. Mahoney

We first show that ResNets fail to be meaningful dynamical integrators in this richer sense.

Numerical Integration

Noise-Response Analysis of Deep Neural Networks Quantifies Robustness and Fingerprints Structural Malware

no code implementations31 Jul 2020 N. Benjamin Erichson, Dane Taylor, Qixuan Wu, Michael W. Mahoney

The ubiquity of deep neural networks (DNNs), cloud-based training, and transfer learning is giving rise to a new cybersecurity frontier in which unsecure DNNs have 'structural malware' (i.e., compromised weights and activation pathways).

Transfer Learning

Boundary thickness and robustness in learning models

1 code implementation NeurIPS 2020 Yaoqing Yang, Rajiv Khanna, Yaodong Yu, Amir Gholami, Kurt Keutzer, Joseph E. Gonzalez, Kannan Ramchandran, Michael W. Mahoney

Using these observations, we show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms.

Adversarial Defense Data Augmentation

Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

no code implementations NeurIPS 2020 Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data.

Point Processes Second-order methods

Lipschitz Recurrent Neural Networks

1 code implementation ICLR 2021 N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, Michael W. Mahoney

Viewing recurrent neural networks (RNNs) as continuous-time dynamical systems, we propose a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity.
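
A forward-Euler sketch of dynamics of that form, dh/dt = A h + tanh(W h + U x + b), with random (not spectrally constrained) matrices; the paper parameterizes A and W to control their spectra.

```python
import numpy as np

# Forward-Euler integration of Lipschitz-RNN-style hidden dynamics
#   dh/dt = A h + tanh(W h + U x + b):
# a well-understood linear part plus a Lipschitz (tanh) nonlinearity.
rng = np.random.default_rng(0)
d_hidden, d_in, steps, dt = 16, 3, 50, 0.1

A = 0.1 * rng.standard_normal((d_hidden, d_hidden))   # random here; structured in the paper
W = 0.1 * rng.standard_normal((d_hidden, d_hidden))
U = rng.standard_normal((d_hidden, d_in))
b = np.zeros(d_hidden)

h = np.zeros(d_hidden)
for t in range(steps):
    x_t = rng.standard_normal(d_in)                    # input at step t
    h = h + dt * (A @ h + np.tanh(W @ h + U @ x_t + b))

print(np.linalg.norm(h))
```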

Language Modelling Sequential Image Classification

Good Classifiers are Abundant in the Interpolating Regime

no code implementations22 Jun 2020 Ryan Theisen, Jason M. Klusowski, Michael W. Mahoney

Inspired by the statistical mechanics approach to learning, we formally define and develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers from several model classes.

Learning Theory

Multiplicative noise and heavy tails in stochastic optimization

no code implementations11 Jun 2020 Liam Hodgkinson, Michael W. Mahoney

Although stochastic optimization is central to modern machine learning, the precise mechanisms underlying its success, and in particular, the precise role of the stochasticity, still remain unclear.

Stochastic Optimization

A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double Descent

no code implementations NeurIPS 2020 Zhenyu Liao, Romain Couillet, Michael W. Mahoney

This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the dimension of feature space $N$ are all large and comparable.

regression

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

3 code implementations1 Jun 2020 Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney

We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.
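
The curvature signal behind the method is a Hutchinson-style estimate of the Hessian diagonal, diag(H) ~ E[z * (Hz)] with Rademacher z; a small numpy check on a quadratic with a known Hessian (not the paper's PyTorch implementation, where Hz comes from backprop).

```python
import numpy as np

# Hutchinson-style estimate of the Hessian diagonal: diag(H) ~ E[ z * (H z) ]
# for Rademacher z.  Checked here on a quadratic with an explicitly known H;
# in the optimizer, H z would be a Hessian-vector product through autograd.
rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T / d                               # known symmetric Hessian

est = np.zeros(d)
num_samples = 5000
for _ in range(num_samples):
    z = rng.choice([-1.0, 1.0], size=d)       # Rademacher probe vector
    est += z * (H @ z)
est /= num_samples

print(np.diag(H))   # true Hessian diagonal
print(est)          # Hutchinson estimate, close to the true diagonal
```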

BIG-bench Machine Learning Second-order methods +1

Determinantal Point Processes in Randomized Numerical Linear Algebra

no code implementations7 May 2020 Michał Dereziński, Michael W. Mahoney

For example, random sampling with a DPP leads to new kinds of unbiased estimators for least squares, enabling more refined statistical and inferential understanding of these algorithms; a DPP is, in some sense, an optimal randomized algorithm for the Nyström method; and a RandNLA technique called leverage score sampling can be derived as the marginal distribution of a DPP.

Point Processes

PowerNorm: Rethinking Batch Normalization in Transformers

1 code implementation ICML 2020 Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass.

Machine Translation

Error Estimation for Sketched SVD via the Bootstrap

no code implementations10 Mar 2020 Miles E. Lopes, N. Benjamin Erichson, Michael W. Mahoney

In order to compute fast approximations to the singular value decompositions (SVD) of very large matrices, randomized sketching algorithms have become a leading approach.

Forecasting Sequential Data using Consistent Koopman Autoencoders

1 code implementation ICML 2020 Omri Azencot, N. Benjamin Erichson, Vanessa Lin, Michael W. Mahoney

Recurrent neural networks are widely used on time series data, yet such models often ignore the underlying physical structures in such sequences.

Time Series Time Series Analysis

Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms

no code implementations24 Feb 2020 Ping Ma, Xinlian Zhang, Xin Xing, Jingyi Ma, Michael W. Mahoney

In this article, we develop an asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem.

Two-sample testing

Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method

no code implementations21 Feb 2020 Michał Dereziński, Rajiv Khanna, Michael W. Mahoney

The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.

Stochastic Normalizing Flows

no code implementations NeurIPS 2020 Liam Hodgkinson, Chris van der Heide, Fred Roosta, Michael W. Mahoney

We introduce stochastic normalizing flows, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs).

Variational Inference

Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

1 code implementation17 Feb 2020 Charles H. Martin, Tongsu Peng, Michael W. Mahoney

We find that norm based metrics correlate well with reported test accuracies for well-trained models, but that they often cannot distinguish well-trained versus poorly-trained models.

ZeroQ: A Novel Zero Shot Quantization Framework

3 code implementations CVPR 2020 Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of one epoch training time of ResNet50 on ImageNet).

 Ranked #1 on Data Free Quantization on CIFAR10 (CIFAR-10 W8A8 Top-1 Accuracy metric)

Data Free Quantization Neural Network Compression

ANODEV2: A Coupled Neural ODE Framework

1 code implementation NeurIPS 2019 Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, George Biros

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).

LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data

no code implementations27 Nov 2019 Ali Eshragh, Fred Roosta, Asef Nazari, Michael W. Mahoney

We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes.

Time Series Time Series Analysis

Limit theorems for out-of-sample extensions of the adjacency and Laplacian spectral embeddings

no code implementations29 Sep 2019 Keith Levin, Fred Roosta, Minh Tang, Michael W. Mahoney, Carey E. Priebe

In both cases, we prove that when the underlying graph is generated according to a latent space model called the random dot product graph, which includes the popular stochastic block model as a special case, an out-of-sample extension based on a least-squares objective obeys a central limit theorem about the true latent position of the out-of-sample vertex.

Dimensionality Reduction Graph Embedding +1

Geometric Rates of Convergence for Kernel-based Sampling Algorithms

no code implementations19 Jul 2019 Rajiv Khanna, Liam Hodgkinson, Michael W. Mahoney

The rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to some target probability measure, is investigated.

Statistical guarantees for local graph clustering

no code implementations11 Jun 2019 Wooseok Ha, Kimon Fountoulakis, Michael W. Mahoney

In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the l1-regularized PageRank method (Fountoulakis et al.).

Clustering Graph Clustering

Bayesian experimental design using regularized determinantal point processes

1 code implementation10 Jun 2019 Michał Dereziński, Feynman Liang, Michael W. Mahoney

In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task.

Experimental Design Point Processes

Residual Networks as Nonlinear Systems: Stability Analysis using Linearization

no code implementations31 May 2019 Kai Rothauge, Zhewei Yao, Zixi Hu, Michael W. Mahoney

We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images.

Distributed estimation of the inverse Hessian by determinantal averaging

no code implementations NeurIPS 2019 Michał Dereziński, Michael W. Mahoney

In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum.

Distributed Optimization Uncertainty Quantification

Physics-informed Autoencoders for Lyapunov-stable Fluid Flow Prediction

no code implementations26 May 2019 N. Benjamin Erichson, Michael Muehlebach, Michael W. Mahoney

In addition to providing high-profile successes in computer vision and natural language processing, neural networks also provide an emerging set of techniques for scientific problems.

Traditional and Heavy Tailed Self Regularization in Neural Network Models

no code implementations ICLR 2019 Charles H. Martin, Michael W. Mahoney

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet.

JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks

1 code implementation7 Apr 2019 N. Benjamin Erichson, Zhewei Yao, Michael W. Mahoney

To complement these approaches, we propose a very simple and inexpensive strategy which can be used to "retrofit" a previously-trained network to improve its resilience to adversarial attacks.

OverSketched Newton: Fast Convex Optimization for Serverless Systems

1 code implementation21 Mar 2019 Vipul Gupta, Swanand Kadhe, Thomas Courtade, Michael W. Mahoney, Kannan Ramchandran

Motivated by recent developments in serverless systems for large-scale computation as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale convex optimization problems in serverless systems.

Distributed Optimization

Inefficiency of K-FAC for Large Batch Size Training

no code implementations14 Mar 2019 Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney

In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch.

Stochastic Optimization

Shallow Neural Networks for Fluid Flow Reconstruction with Limited Sensors

1 code implementation20 Feb 2019 N. Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L. Brunton, Michael W. Mahoney, J. Nathan Kutz

In many applications, it is important to reconstruct a fluid flow field, or some other high-dimensional state, from limited measurements and limited data.

Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks

no code implementations24 Jan 2019 Charles H. Martin, Michael W. Mahoney

In this paper, we show how to use a new Theory of Heavy-Tailed Self-Regularization (HT-SR) to answer this.

Traditional and Heavy-Tailed Self Regularization in Neural Network Models

2 code implementations24 Jan 2019 Charles H. Martin, Michael W. Mahoney

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet.

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

no code implementations30 Nov 2018 Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique.

Image Classification Image Segmentation +2

A Short Introduction to Local Graph Clustering Methods and Software

1 code implementation17 Oct 2018 Kimon Fountoulakis, David F. Gleich, Michael W. Mahoney

Scalability problems led to the development of local graph clustering algorithms that come with a variety of theoretical guarantees.

Social and Information Networks

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning

3 code implementations2 Oct 2018 Charles H. Martin, Michael W. Mahoney

Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet.

Newton-MR: Inexact Newton Method With Minimum Residual Sub-problem Solver

no code implementations30 Sep 2018 Fred Roosta, Yang Liu, Peng Xu, Michael W. Mahoney

We consider a variant of inexact Newton Method, called Newton-MR, in which the least-squares sub-problems are solved approximately using Minimum Residual method.

Newton-ADMM: A Distributed GPU-Accelerated Optimizer for Multiclass Classification Problems

1 code implementation18 Jul 2018 Chih-Hao Fang, Sudhir B. Kylasa, Fred Roosta, Michael W. Mahoney, Ananth Grama

First-order optimization methods, such as stochastic gradient descent (SGD) and its variants, are widely used in machine learning applications due to their simplicity and low per-iteration costs.

General Classification

Error Estimation for Randomized Least-Squares Algorithms via the Bootstrap

no code implementations ICML 2018 Miles E. Lopes, Shusen Wang, Michael W. Mahoney

As a more practical alternative, we propose a bootstrap method to compute a posteriori error estimates for randomized LS algorithms.
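
A simplified numpy sketch of the procedure: resample rows of the already-sketched least-squares problem, re-solve, and use the spread of the resampled solutions as an a posteriori error estimate (illustrative, not the paper's exact estimator).

```python
import numpy as np

# Bootstrap error estimation for a sketched least-squares solve (simplified):
# resample rows of the small sketched problem with replacement, re-solve, and
# use the spread of the bootstrap solutions as an error estimate.
rng = np.random.default_rng(0)
n, d, m = 20_000, 10, 500

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

idx = rng.integers(n, size=m)                        # uniform row-sampling sketch
As, bs = A[idx], b[idx]
x_sketch = np.linalg.lstsq(As, bs, rcond=None)[0]    # fast approximate solution

boot_errs = []
for _ in range(200):
    j = rng.integers(m, size=m)                      # bootstrap resample of sketched rows
    x_boot = np.linalg.lstsq(As[j], bs[j], rcond=None)[0]
    boot_errs.append(np.linalg.norm(x_boot - x_sketch))

x_full = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.quantile(boot_errs, 0.95))                  # bootstrap error estimate
print(np.linalg.norm(x_sketch - x_full))             # actual error, for comparison
```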

GPU Accelerated Sub-Sampled Newton's Method

no code implementations26 Feb 2018 Sudhir B. Kylasa, Farbod Roosta-Khorasani, Michael W. Mahoney, Ananth Grama

In particular, in convex settings, we consider variants of classical Newton's method in which the Hessian and/or the gradient are randomly sub-sampled.

Second-order methods

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

6 code implementations NeurIPS 2018 Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney

Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably higher Hessian spectrum.

Out-of-sample extension of graph adjacency spectral embedding

no code implementations ICML 2018 Keith Levin, Farbod Roosta-Khorasani, Michael W. Mahoney, Carey E. Priebe

Many popular dimensionality reduction procedures have out-of-sample extensions, which allow a practitioner to apply a learned embedding to observations not seen in the initial training sample.

Dimensionality Reduction Position

Lectures on Randomized Numerical Linear Algebra

1 code implementation24 Dec 2017 Petros Drineas, Michael W. Mahoney

This chapter is based on lectures on Randomized Numerical Linear Algebra from the 2016 Park City Mathematics Institute summer school on The Mathematics of Data.

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

no code implementations17 Dec 2017 Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML).

A Berkeley View of Systems Challenges for AI

no code implementations15 Dec 2017 Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph E. Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel

With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production.

Machine Translation speech-recognition +1

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

no code implementations ICLR 2018 Charles H. Martin, Michael W. Mahoney

Using this model, we describe how a very simple application of ideas from the statistical mechanics theory of generalization provides a strong qualitative description of recently-observed empirical results regarding the inability of deep neural networks not to overfit training data, discontinuous learning and sharp transitions in the generalization properties of learning algorithms, etc.

LASAGNE: Locality And Structure Aware Graph Node Embedding

no code implementations17 Oct 2017 Evgeniy Faerman, Felix Borutta, Kimon Fountoulakis, Michael W. Mahoney

For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks.

Link Prediction Multi-Label Classification

GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

no code implementations NeurIPS 2018 Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, Michael W. Mahoney

For distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method.

Distributed Computing Distributed Optimization

Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study

no code implementations25 Aug 2017 Peng Xu, Farbod Roosta-Khorasani, Michael W. Mahoney

While first-order optimization methods such as stochastic gradient descent (SGD) are popular in machine learning (ML), they come with well-known deficiencies, including relatively-slow convergence, sensitivity to the settings of hyper-parameters such as learning rate, stagnation at high training errors, and difficulty in escaping flat regions and saddle points.

BIG-bench Machine Learning Second-order methods

Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information

no code implementations23 Aug 2017 Peng Xu, Fred Roosta, Michael W. Mahoney

In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and cubic regularization methods.

Vocal Bursts Type Prediction

A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication

no code implementations6 Aug 2017 Miles E. Lopes, Shusen Wang, Michael W. Mahoney

In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems.

Dimensionality Reduction

Skip-Gram − Zipf + Uniform = Vector Additivity

no code implementations ACL 2017 Alex Gittens, Dimitris Achlioptas, Michael W. Mahoney

An unexpected "side-effect" of such models is that their vectors often exhibit compositionality, i.e., adding two word-vectors results in a vector that is only a small angle away from the vector of a word representing the semantic composite of the original words, e.g., "man" + "royal" = "king".

Caption Generation Dimensionality Reduction +1

Capacity Releasing Diffusion for Speed and Locality

no code implementations19 Jun 2017 Di Wang, Kimon Fountoulakis, Monika Henzinger, Michael W. Mahoney, Satish Rao

Thus, our CRD Process is the first local graph clustering algorithm that is not subject to the well-known quadratic Cheeger barrier.

Clustering Graph Clustering

Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds

no code implementations9 Jun 2017 Shusen Wang, Alex Gittens, Michael W. Mahoney

This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}{\epsilon} (1 + o(1))$ features constructed using a so-called rank-restricted Nyström approximation results in cluster assignments that satisfy a $1 + \epsilon$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nyström method.
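
The basic Nyström feature construction underlying this result maps each point to K(x, landmarks) K_mm^{-1/2}, after which ordinary linear k-means can be run; a small numpy sketch of that construction, without the rank restriction the paper analyzes.

```python
import numpy as np

# Nyström feature map: phi(x) = K(x, landmarks) @ K_mm^{-1/2}, so that
# phi @ phi.T approximates the full kernel matrix; linear k-means can then
# be run directly on phi.
rng = np.random.default_rng(0)

def rbf_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

n, m, dim = 400, 40, 5
X = rng.standard_normal((n, dim))
landmarks = X[rng.choice(n, size=m, replace=False)]

K_nm = rbf_kernel(X, landmarks)
K_mm = rbf_kernel(landmarks, landmarks)
w, V = np.linalg.eigh(K_mm)
inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
features = K_nm @ inv_sqrt                    # n x m explicit feature map

rel_err = (np.linalg.norm(features @ features.T - rbf_kernel(X, X))
           / np.linalg.norm(rbf_kernel(X, X)))
print(rel_err)
```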

Clustering

Mapping the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data

no code implementations13 Sep 2016 David Lawlor, Tamás Budavári, Michael W. Mahoney

This technique permits us to characterize empirically the natural variations in observed spectra data, and we illustrate how this approach can be used in an exploratory manner to highlight both large-scale global as well as small-scale local structure in Sloan Digital Sky Survey (SDSS) data.

Lecture Notes on Spectral Graph Methods

no code implementations17 Aug 2016 Michael W. Mahoney

These are lecture notes that are based on the lectures from a class I taught on the topic of Spectral Graph Methods at UC Berkeley during the Spring 2015 semester.

Lecture Notes on Randomized Linear Algebra

4 code implementations16 Aug 2016 Michael W. Mahoney

These are lecture notes that are based on the lectures from a class I taught on the topic of Randomized Linear Algebra (RLA) at UC Berkeley during the Fall 2013 semester.

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

1 code implementation5 Jul 2016 Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms.

Distributed, Parallel, and Cluster Computing G.1.3; C.2.4

Sub-sampled Newton Methods with Non-uniform Sampling

no code implementations NeurIPS 2016 Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, Michael W. Mahoney

As second-order methods prove to be effective in finding the minimizer to high precision, in this work, we propose randomized Newton-type algorithms that exploit non-uniform sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity.

Second-order methods

FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods

no code implementations26 May 2016 Xiang Cheng, Farbod Roosta-Khorasani, Stefan Palombo, Peter L. Bartlett, Michael W. Mahoney

We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions.

Sub-Sampled Newton Methods II: Local Convergence Rates

no code implementations18 Jan 2016 Farbod Roosta-Khorasani, Michael W. Mahoney

In such problems, sub-sampling as a way to reduce $n$ can offer a great amount of computational efficiency.

Computational Efficiency Second-order methods

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

no code implementations18 Jan 2016 Farbod Roosta-Khorasani, Michael W. Mahoney

As a remedy, for all of our algorithms, we also give global convergence results for the case of inexact updates where such linear system is solved only approximately.

Fast Randomized Kernel Ridge Regression with Statistical Guarantees

no code implementations NeurIPS 2015 Ahmed Alaoui, Michael W. Mahoney

One approach to improving the running time of kernel-based methods is to build a small sketch of the kernel matrix and use it in lieu of the full matrix in the machine learning task of interest.

regression

A Local Perspective on Community Structure in Multilayer Networks

no code implementations18 Oct 2015 Lucas G. S. Jeub, Michael W. Mahoney, Peter J. Mucha, Mason A. Porter

The analysis of multilayer networks is among the most active areas of network science, and there are now several methods to detect dense "communities" of nodes in multilayer networks.

Social and Information Networks Probability Adaptation and Self-Organizing Systems Data Analysis, Statistics and Probability Physics and Society

Optimal Subsampling Approaches for Large Sample Linear Regression

no code implementations17 Sep 2015 Rong Zhu, Ping Ma, Michael W. Mahoney, Bin Yu

For the unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent for the full-sample OLS estimator.

regression

Block Basis Factorization for Scalable Kernel Matrix Evaluation

no code implementations3 May 2015 Ruoxi Wang, Yingzhou Li, Michael W. Mahoney, Eric Darve

Kernel methods are widespread in machine learning; however, they are limited by the quadratic complexity of the construction, application, and storage of kernel matrices.

BIG-bench Machine Learning

Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

no code implementations12 Feb 2015 Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney

We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., $\ell_2$ and $\ell_1$ regression problems.

regression

Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments

no code implementations10 Feb 2015 Jiyan Yang, Xiangrui Meng, Michael W. Mahoney

and demonstrate that $\ell_1$ and $\ell_2$ regression problems can be solved to low, medium, or high precision in existing distributed systems on up to terabyte-sized data.

regression

Fast Randomized Kernel Methods With Statistical Guarantees

no code implementations2 Nov 2014 Ahmed El Alaoui, Michael W. Mahoney

By extending the notion of statistical leverage scores to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the effective dimensionality of the problem.
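
Concretely, the ridge leverage scores are the diagonal entries of K (K + n λ I)^{-1} (up to the paper's exact scaling convention), and their sum is the effective dimensionality; a small numpy illustration.

```python
import numpy as np

# Ridge leverage scores for kernel ridge regression:
#   l_i(lam) = [ K (K + n * lam * I)^{-1} ]_{ii},
# whose sum is the effective dimensionality governing how many columns a
# Nyström-style sketch needs.  Scaling conventions vary; this is illustrative.
rng = np.random.default_rng(0)
n, lam = 300, 1e-2

X = rng.standard_normal((n, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)                                  # RBF kernel matrix

scores = np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))
d_eff = scores.sum()                                   # effective dimensionality
probs = scores / scores.sum()                          # importance-sampling distribution

print(d_eff)
print(probs[:5])
```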

Random Laplace Feature Maps for Semigroup Kernels on Histograms

no code implementations CVPR 2014 Jiyan Yang, Vikas Sindhwani, Quanfu Fan, Haim Avron, Michael W. Mahoney

With the goal of accelerating the training and testing complexity of nonlinear kernel methods, several recent papers have proposed explicit embeddings of the input data into low-dimensional feature spaces, where fast linear methods can instead be used to generate approximate solutions.

Event Detection Image Classification

Think Locally, Act Locally: The Detection of Small, Medium-Sized, and Large Communities in Large Networks

1 code implementation15 Mar 2014 Lucas G. S. Jeub, Prakash Balachandran, Mason A. Porter, Peter J. Mucha, Michael W. Mahoney

In this paper, we adopt a complementary perspective that "communities" are associated with bottlenecks of locally-biased dynamical processes that begin at seed sets of nodes, and we employ several different community-identification procedures (using diffusion-based and geodesic-based dynamics) to investigate community quality as a function of community size.

Social and Information Networks Disordered Systems and Neural Networks Combinatorics Adaptation and Self-Organizing Systems Physics and Society

A Statistical Perspective on Algorithmic Leveraging

no code implementations23 Jun 2013 Ping Ma, Michael W. Mahoney, Bin Yu

A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets.

Computational Efficiency

Quantile Regression for Large-scale Applications

no code implementations1 May 2013 Jiyan Yang, Xiangrui Meng, Michael W. Mahoney

Our empirical evaluation illustrates that our algorithm is competitive with the best previous work on small to medium-sized problems, and that in addition it can be implemented in MapReduce-like environments and applied to terabyte-sized problems.

regression

Semi-supervised Eigenvectors for Large-scale Locally-biased Learning

no code implementations28 Apr 2013 Toke J. Hansen, Michael W. Mahoney

For example, one might be interested in the clustering structure of a data graph near a prespecified "seed set" of nodes, or one might be interested in finding partitions in an image that are near a prespecified "ground truth" set of pixels.

BIG-bench Machine Learning Clustering

Revisiting the Nystrom Method for Improved Large-Scale Machine Learning

no code implementations7 Mar 2013 Alex Gittens, Michael W. Mahoney

Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices.

BIG-bench Machine Learning

Semi-supervised Eigenvectors for Locally-biased Learning

no code implementations NeurIPS 2012 Toke Hansen, Michael W. Mahoney

In many applications, one has information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks nearby that pre-specified target region.

BIG-bench Machine Learning

The Fast Cauchy Transform and Faster Robust Linear Regression

no code implementations19 Jul 2012 Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, David P. Woodruff

We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\mathbb{R}^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_p$ to the same problem with input matrix $\tilde A$ of dimension $s \times d$ and corresponding $\tilde b$ of dimension $s\times 1$.

regression

Regularized Laplacian Estimation and Fast Eigenvector Approximation

no code implementations NeurIPS 2011 Patrick O. Perry, Michael W. Mahoney

Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian.

regression

Randomized Dimensionality Reduction for k-means Clustering

no code implementations13 Oct 2011 Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, Petros Drineas

On the other hand, two provably accurate feature extraction methods for $k$-means clustering are known in the literature; one is based on random projections and the other is based on the singular value decomposition (SVD).
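
The random-projection route is simple to use in practice: project the data with a Gaussian random matrix and run ordinary k-means on the projected points; a minimal numpy/scikit-learn sketch (the projection dimension required for the provable guarantees is as analyzed in the paper).

```python
import numpy as np
from sklearn.cluster import KMeans

# Random-projection feature extraction for k-means: project with a Gaussian
# random matrix, then cluster the low-dimensional points.
rng = np.random.default_rng(0)
n, d, k, r = 1500, 50, 3, 10                       # r = projected dimension

centers = 4.0 * rng.standard_normal((k, d))
X = np.vstack([c + rng.standard_normal((n // k, d)) for c in centers])

R = rng.standard_normal((d, r)) / np.sqrt(r)       # Gaussian random projection
X_proj = X @ R

labels_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
labels_proj = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_proj)
print(labels_full[:10])
print(labels_proj[:10])
```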

Clustering Dimensionality Reduction +1

Randomized algorithms for matrices and data

no code implementations29 Apr 2011 Michael W. Mahoney

This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.

Data Structures and Algorithms

CUR from a Sparse Optimization Viewpoint

no code implementations NeurIPS 2010 Jacob Bien, Ya Xu, Michael W. Mahoney

The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X.

Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving

1 code implementation18 May 2010 Petros Drineas, Michael W. Mahoney

Our first and main result is a simple algorithm to approximate the solution to a set of linear equations defined by a Laplacian (for a graph $G$ with $n$ nodes and $m \le n^2$ edges) constraint matrix.

Numerical Analysis

Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters

no code implementations8 Oct 2008 Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, Michael W. Mahoney

A large body of work has been devoted to defining and identifying clusters or communities in social and information networks.

Data Structures and Algorithms Data Analysis, Statistics and Probability Physics and Society
