# Instance-hiding Schemes for Private Distributed Learning

The new ideas in the current paper are: (a) new variants of mixup with negative as well as positive coefficients, and (b) an extension of sample-wise mixup to pixel-wise mixup.

# Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

no code implementations, 26 Nov 2023

Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss.

# One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Despite this, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still necessitates $O( n d)$ space, leading to significant memory usage.

# Revisiting Quantum Algorithms for Linear Regressions: Quadratic Speedups without Data-Dependent Parameters

no code implementations, 24 Nov 2023

However, the running times of these algorithms depend on some quantum linear algebra-related parameters, such as $\kappa(A)$, the condition number of $A$.

# A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer

The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients.

# Fast Heavy Inner Product Identification Between Weights and Inputs in Neural Network Training

In this paper, we consider a heavy inner product identification problem, which generalizes the Light Bulb problem~(\cite{prr89}): Given two sets $A \subset \{-1,+1\}^d$ and $B \subset \{-1,+1\}^d$ with $|A|=|B| = n$, if there are exactly $k$ pairs whose inner product passes a certain threshold, i.e., $\{(a_1, b_1), \cdots, (a_k, b_k)\} \subset A \times B$ such that $\forall i \in [k], \langle a_i, b_i \rangle \geq \rho \cdot d$, for a threshold $\rho \in (0, 1)$, the goal is to identify those $k$ heavy inner products.
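
As a point of reference, the sketch below is the brute-force $O(n^2 d)$ baseline for this problem (not the paper's algorithm): it computes all $n^2$ inner products and thresholds them; the instance sizes and the planted correlation are illustrative assumptions.

```python
import numpy as np

def heavy_pairs_bruteforce(A, B, rho):
    """Return all pairs (i, j) with <A[i], B[j]> >= rho * d via an O(n^2 d) scan."""
    d = A.shape[1]
    inner = A @ B.T                           # n x n matrix of all inner products
    return [tuple(p) for p in np.argwhere(inner >= rho * d)]

# Toy instance: plant k = 3 heavy pairs among n = 200 random sign vectors.
rng = np.random.default_rng(0)
n, d, k, rho = 200, 256, 3, 0.6
A = rng.choice([-1, 1], size=(n, d))
B = rng.choice([-1, 1], size=(n, d))
for i in range(k):                            # make B[i] agree with A[i] on ~90% of coordinates
    mask = rng.random(d) < 0.9
    B[i] = np.where(mask, A[i], -A[i])

print(heavy_pairs_bruteforce(A, B, rho))      # with high probability, exactly the planted pairs
```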

# The Expressibility of Polynomial based Attention Scheme

no code implementations, 30 Oct 2023

In this paper, we offer a theoretical analysis of the expressive capabilities of polynomial attention.

# Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability.

# Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights

In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks.

# Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention

no code implementations, 18 Oct 2023

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks.

# An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent

no code implementations, 17 Oct 2023

The delta-bar-delta algorithm is recognized as a learning rate adaptation technique that enhances the convergence speed of the training process in optimization by dynamically scheduling the learning rate based on the difference between the current and previous weight updates.
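
For context, the sketch below is the classical delta-bar-delta rule of Jacobs (1988) that this line of work builds on, not the paper's automatic schedule: each parameter keeps its own learning rate, which is increased additively when the current gradient agrees in sign with an exponential average of past gradients and decreased multiplicatively when it disagrees. The constants `kappa`, `phi`, and `theta` are illustrative.

```python
import numpy as np

def delta_bar_delta_step(w, grad, lr, bar_delta, kappa=0.01, phi=0.5, theta=0.7):
    """One classical delta-bar-delta update with per-parameter learning rates.

    lr        -- current per-parameter learning rates
    bar_delta -- exponential average of past gradients
    """
    agree = grad * bar_delta                  # sign agreement between current and averaged past gradients
    lr = np.where(agree > 0, lr + kappa,      # same sign: increase the rate additively
         np.where(agree < 0, lr * phi, lr))   # opposite sign: decrease it multiplicatively
    w = w - lr * grad                         # gradient step with per-parameter rates
    bar_delta = (1 - theta) * grad + theta * bar_delta
    return w, lr, bar_delta

# Usage on the toy quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([5.0, -3.0])
lr = np.full_like(w, 0.1)
bar_delta = np.zeros_like(w)
for _ in range(50):
    w, lr, bar_delta = delta_bar_delta_step(w, w, lr, bar_delta)
print(w)                                      # close to the minimizer [0, 0]
```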

# How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

no code implementations, 6 Oct 2023

Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm.

# Fine-tune Language Models to Approximate Unbiased In-context Learning

no code implementations, 5 Oct 2023

To address this issue, we introduce a reweighted algorithm called RICL (Reweighted In-context Learning).

# A Unified Scheme of ResNet and Softmax

no code implementations, 23 Sep 2023

The Hessian is shown to be positive semidefinite, and its structure is characterized as the sum of a low-rank matrix and a diagonal matrix.

# Is Solving Graph Neural Tangent Kernel Equivalent to Training Graph Neural Network?

no code implementations, 14 Sep 2023

A rising trend in theoretical deep learning is to understand why deep learning works through Neural Tangent Kernel (NTK) [jgh18], a kernel method that is equivalent to using gradient descent to train a multi-layer infinitely-wide neural network.

# A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

no code implementations, 14 Sep 2023

Here, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$.

# Solving Attention Kernel Regression Problem via Pre-conditioner

no code implementations, 28 Aug 2023

Large language models have shown impressive performance in many tasks.

# How to Protect Copyright Data in Optimization of Large Language Models?

no code implementations, 23 Aug 2023

Large language models (LLMs) and generative AI have played a transformative role in computer research and applications.

# Clustered Linear Contextual Bandits with Knapsacks

no code implementations, 21 Aug 2023

Thus, maximizing the total reward requires learning not only models about the reward and the resource consumption, but also cluster memberships.

# GradientCoin: A Peer-to-Peer Decentralized Large Language Models

no code implementations, 21 Aug 2023

It is likely that only two types of people would be interested in setting up a practical system for it: $\bullet$ Those who prefer to use decentralized ChatGPT-like software.

# Convergence of Two-Layer Regression with Nonlinear Units

no code implementations, 16 Aug 2023

The softmax unit and the ReLU unit are key structures in attention computation.

# Zero-th Order Algorithm for Softmax Attention Optimization

We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs.

# Fast Quantum Algorithm for Attention Computation

no code implementations, 16 Jul 2023

It is well known that quantum machines have certain computational advantages over classical machines.

# Faster Algorithms for Structured Linear and Kernel Support Vector Machines

no code implementations, 15 Jul 2023

Consequently, we obtain a variety of results for SVMs:
* For linear SVM, where the quadratic constraint matrix has treewidth $\tau$, we can solve the corresponding program in time $\widetilde O(n\tau^{(\omega+1)/2}\log(1/\epsilon))$;
* For linear SVM, where the quadratic constraint matrix admits a low-rank factorization of rank-$k$, we can solve the corresponding program in time $\widetilde O(nk^{(\omega+1)/2}\log(1/\epsilon))$;
* For Gaussian kernel SVM, where the data dimension $d = \Theta(\log n)$ and the squared dataset radius is small, we can solve it in time $O(n^{1+o(1)}\log(1/\epsilon))$.

# Efficient SGD Neural Network Training via Sublinear Activated Neuron Identification

no code implementations, 13 Jul 2023

Deep learning has been widely used in many fields, but the model training process usually consumes massive computational resources and time.

# In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick

no code implementations, 5 Jul 2023

Given matrices $A_1 \in \mathbb{R}^{n \times d}$, $A_2 \in \mathbb{R}^{n \times d}$, and $B \in \mathbb{R}^{n \times n}$, the goal is to solve the following optimization problems: the normalized version $\min_{X} \| D(X)^{-1} \exp(A_1 X A_2^\top) - B \|_F^2$ and the rescaled version $\min_{X} \| \exp(A_1 X A_2^\top) - D(X) \cdot B \|_F^2$.
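
To make the normalized objective concrete, the sketch below evaluates it numerically, assuming (as is standard in this line of work, though not spelled out in the snippet above) that $D(X) = \mathrm{diag}(\exp(A_1 X A_2^\top) {\bf 1}_n)$ is the diagonal matrix of row sums; the dimensions are illustrative.

```python
import numpy as np

def normalized_objective(X, A1, A2, B):
    """|| D(X)^{-1} exp(A1 X A2^T) - B ||_F^2 with D(X) = diag of row sums."""
    M = np.exp(A1 @ X @ A2.T)                 # n x n matrix exp(A1 X A2^T)
    D_inv = 1.0 / M.sum(axis=1)               # diagonal of D(X)^{-1}
    residual = D_inv[:, None] * M - B         # row-normalized matrix minus the target
    return np.sum(residual ** 2)

rng = np.random.default_rng(0)
n, d = 8, 4
A1, A2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
X_true = 0.1 * rng.standard_normal((d, d))

# Build a target B that X_true fits exactly, then evaluate both candidates.
M = np.exp(A1 @ X_true @ A2.T)
B = M / M.sum(axis=1, keepdims=True)
print(normalized_objective(X_true, A1, A2, B))            # ~0
print(normalized_objective(np.zeros((d, d)), A1, A2, B))  # strictly larger
```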

# H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
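
The sketch below is a toy illustration of this kind of eviction policy, not the released H$_2$O implementation: the cache keeps the most recent positions plus the older positions with the largest accumulated attention scores; the 50/50 budget split and the scoring rule are simplifying assumptions.

```python
import numpy as np

def kv_positions_to_keep(acc_attention, num_cached, budget, recent_frac=0.5):
    """Return the cached token positions to keep under a fixed KV-cache budget.

    acc_attention -- accumulated attention score per cached position
    Keeps the most recent recent_frac * budget positions and fills the rest
    with the older positions that have the largest accumulated scores.
    """
    if num_cached <= budget:
        return np.arange(num_cached)
    n_recent = int(budget * recent_frac)
    recent = np.arange(num_cached - n_recent, num_cached)
    older = np.arange(num_cached - n_recent)
    heavy = older[np.argsort(acc_attention[older])[::-1][: budget - n_recent]]
    return np.sort(np.concatenate([heavy, recent]))

# Example: 12 cached tokens, budget of 6, position 2 is a heavy hitter.
scores = np.array([0.1, 0.2, 5.0, 0.1, 0.3, 0.2, 0.1, 0.4, 0.1, 0.2, 0.1, 0.3])
print(kv_positions_to_keep(scores, num_cached=12, budget=6))  # keeps position 2 plus the recent tail
```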

# Efficient Alternating Minimization with Applications to Weighted Low Rank Approximation

For weighted low rank approximation, this improves the runtime of [LLR16] from $n^2 k^2$ to $n^2k$.

# Query Complexity of Active Learning for Function Family With Nearly Orthogonal Basis

Many machine learning algorithms require large amounts of labeled data to deliver state-of-the-art results.

# A Mathematical Abstraction for Balancing the Trade-off Between Creativity and Reality in Large Language Models

no code implementations, 4 Jun 2023

A model trained on these losses balances the trade-off between the creativity and reality of the model.

# Faster Robust Tensor Power Method for Arbitrary Order

no code implementations, 1 Jun 2023

Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data.

# Federated Empirical Risk Minimization via Second-Order Method

no code implementations, 27 May 2023

Many convex optimization problems with important applications in machine learning are formulated as empirical risk minimization (ERM).

# Fast Submodular Function Maximization

no code implementations, 15 May 2023

We consider both the online and offline versions of the problem: in each iteration, the data set changes incrementally or is not changed, and a user can issue a query to maximize the function on a given subset of the data.

# Fast and Efficient Matching Algorithm with Deadline Instances

no code implementations, 15 May 2023

In the \textsc{FastPostponedGreedy} algorithm, by contrast, the status of each node is unknown at first.

# Efficient Asynchronize Stochastic Gradient Algorithm with Structured Data

no code implementations, 13 May 2023

Deep learning has achieved impressive success in a variety of fields because of its good generalization.

# Differentially Private Attention Computation

no code implementations, 8 May 2023

Inspired by [Vyas, Kakade and Barak 2023], in this work, we provide a provable result showing how to approximate the attention matrix in a differentially private manner.

# An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

no code implementations, 1 May 2023

LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come.

# The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

no code implementations, 26 Apr 2023

Large language models (LLMs) are known for their exceptional performance in natural language processing, making them highly effective in many human life-related or even job-related tasks.

# PVP: Pre-trained Visual Parameter-Efficient Tuning

Large-scale pre-trained transformers have demonstrated remarkable success in various computer vision tasks.

# Attention Scheme Inspired Softmax Regression

no code implementations, 20 Apr 2023

One of the key computations in LLMs is the softmax unit.

# Solving Tensor Low Cycle Rank Approximation

no code implementations, 13 Apr 2023

Low rank approximation under the classical tensor rank, Tucker rank, and train rank has been well studied in [Song, Woodruff, Zhong SODA 2019].

# Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega} )$ time, succeeds with probability $1-\delta$, and chooses $m = O(n \log(n/\delta))$.

# An Over-parameterized Exponential Regression

no code implementations, 29 Mar 2023

Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function.

# Solving Regularized Exp, Cosh and Sinh Regression Problems

no code implementations, 28 Mar 2023

In this paper, we make use of the input sparsity and propose an algorithm that uses $\log ( \|x_0 - x^*\|_2 / \epsilon)$ iterations and $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega} )$ time per iteration to solve the problem.

# A General Algorithm for Solving Rank-one Matrix Sensing

no code implementations, 22 Mar 2023

In this paper, we relax that rank-$k$ assumption and solve a much more general matrix sensing problem.

# A Theoretical Analysis Of Nearest Neighbor Search On Approximate Near Neighbor Graph

The current theoretical literature focuses on greedy search over the exact near neighbor graph, while practitioners use an approximate near neighbor graph (ANN-Graph) to reduce the preprocessing time.

# Streaming Kernel PCA Algorithm With Small Space

The kernel method, which is commonly used in learning algorithms such as Support Vector Machines (SVMs), has also been applied in PCA algorithms.

# Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

no code implementations, 21 Feb 2023

Moreover, our algorithm runs in time $\widetilde O(|\Omega| k)$, which is nearly linear in the time to verify the solution while preserving the sample complexity.

# A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee

One popular approach for solving such an $\ell_2$ regression problem is via sketching: pick a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ for which $SA$ can be quickly computed, and solve the "sketched" regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$.
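
A minimal numerical sketch of sketch-and-solve is below; it uses a dense Gaussian sketch purely for illustration, rather than the fast structured transforms the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 400                       # tall problem: n >> d, sketch size m << n

A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.01 * rng.standard_normal(n)

# Exact least-squares solution on the full problem.
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: pick S in R^{m x n} and solve argmin_x ||SAx - Sb||_2.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

print(np.linalg.norm(x_full - x_star))        # small
print(np.linalg.norm(x_sketch - x_star))      # comparable, from the much smaller sketched system
```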

# Exit options sustain altruistic punishment and decrease the second-order free-riders, but it is not a panacea

Altruistic punishment, where individuals incur personal costs to punish others who have harmed third parties, presents an evolutionary conundrum as it undermines individual fitness.

# Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations

In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data-structure for fast pairwise summation estimation.

# A Faster $k$-means++ Algorithm

We propose a new algorithm \textsc{FastKmeans++} that takes only $\widetilde{O}(nd + nk^2)$ time in total.
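
For reference, the sketch below is the standard $k$-means++ seeding procedure, whose $O(ndk)$ running time is the baseline such results improve on; it is not the paper's \textsc{FastKmeans++} algorithm.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng):
    """Classical k-means++ seeding: D^2 sampling, O(ndk) time for n points in d dimensions."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                          # first center uniformly at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)              # squared distance to the nearest center
    for _ in range(k - 1):
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample proportionally to D^2
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.concatenate([rng.standard_normal((100, 2)) + c for c in ([0, 0], [10, 0], [0, 10])])
print(kmeanspp_seeding(X, k=3, rng=rng))  # typically one seed near each of the three clusters
```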

# A Convergence Theory for Federated Average: Beyond Smoothness

As a leading algorithm in this setting, Federated Averaging (FedAvg), which runs Stochastic Gradient Descent (SGD) in parallel on local devices and averages the iterates only once in a while, has been widely used due to its simplicity and low communication cost.

# Sketching for First Order Method: Efficient Algorithm for Low-Bandwidth Channel and Vulnerability

no code implementations, 15 Oct 2022

In this paper, we propose a novel sketching scheme for the first order method in the large-scale distributed learning setting, such that the communication costs between distributed agents are saved while the convergence of the algorithms is still guaranteed.

# Dynamic Tensor Product Regression

In this work, we initiate the study of \emph{Dynamic Tensor Product Regression}.

# A Sublinear Adversarial Training Algorithm

no code implementations, 10 Aug 2022

For a neural network of width $m$ and $n$ training points in $d$ dimensions, it takes $\Omega(mnd)$ time per training iteration for the forward and backward computation.

# Training Overparametrized Neural Networks in Sublinear Time

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of AI.

# Dynamic Maintenance of Kernel Density Estimation Data Structure: From Practice to Theory

In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries.

# Federated Adversarial Learning: A Framework with Convergence Analysis

no code implementations, 7 Aug 2022

Unlike the convergence analysis in classical centralized training that relies on the gradient direction, it is significantly harder to analyze the convergence in FAL for three reasons: 1) the complexity of min-max optimization, 2) model not updating in the gradient direction due to the multi-local updates on the client-side before aggregation and 3) inter-client heterogeneity.

# Sublinear Time Algorithm for Online Weighted Bipartite Matching

Online bipartite matching is a fundamental problem in online algorithms.

# Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis

A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors.

# Smoothed Online Combinatorial Optimization Using Imperfect Predictions

Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds.

# Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning

We first prove that adding a weighted class-conditional InfoNCE loss to SupCon controls the degree of spread.

# Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time

no code implementations, 14 Dec 2021

We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function.

# On Convergence of Federated Averaging Langevin Dynamics

We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d. data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence.

# Fast Graph Neural Tangent Kernel via Kronecker Sketching

Given a kernel matrix of $n$ graphs, using sketching in solving kernel regression can reduce the running time to $o(n^3)$.

# Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures

In this work, we focus on improving the per iteration cost of CGM.

# Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices.

# Evaluating Gradient Inversion Attacks and Defenses in Federated Learning

Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data.

# Online MAP Inference and Learning for Nonsymmetric Determinantal Point Processes

In this paper, we introduce the online and streaming MAP inference and learning problems for Non-symmetric Determinantal Point Processes (NDPPs) where data points arrive in an arbitrary order and the algorithms are constrained to use a single-pass over the data as well as sub-linear memory.

# Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.

# Does Preprocessing Help Training Over-parameterized Neural Networks?

The classical training method requires paying $\Omega(mnd)$ cost for both forward computation and backward computation, where $m$ is the width of the neural network, and we are given $n$ training points in $d$-dimensional space.
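
To see where the $\Omega(mnd)$ cost comes from, the sketch below runs one forward and backward pass of a two-layer ReLU network; the dominant steps are $n \times d$ by $d \times m$ matrix products (an illustrative baseline only, not the paper's preprocessing method).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1024, 64, 4096                        # n points in d dimensions, hidden width m

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((d, m)) / np.sqrt(d)    # hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m)             # fixed output-layer weights

# Forward pass: the (n x d) @ (d x m) product costs Theta(nmd).
H = np.maximum(X @ W, 0.0)                      # ReLU activations, n x m
pred = H @ a / np.sqrt(m)

# Backward pass for W: another Theta(nmd) matrix product.
grad_pred = (pred - y) / n
grad_H = np.outer(grad_pred, a / np.sqrt(m)) * (H > 0)
grad_W = X.T @ grad_H                           # d x m gradient of the squared loss
print(grad_W.shape)
```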

# InstaHide’s Sample Complexity When Mixing Two Private Images

Inspired by the InstaHide challenge [Huang, Song, Li and Arora'20], [Chen, Song and Zhuo'20] recently provided a mathematical formulation of the InstaHide attack problem under a Gaussian image distribution.

# Iterative Sketching and its Application to Federated Learning

no code implementations, 29 Sep 2021

Though most federated learning frameworks only require clients and the server to send gradient information over the network, they still face the challenges of communication efficiency and data privacy.

# Sample Complexity of Deep Active Learning

no code implementations, 29 Sep 2021

In this paper, we present the first deep active learning algorithm which has a provable sample complexity.

# Provable Federated Adversarial Learning via Min-max Optimization

no code implementations, 29 Sep 2021

Unlike the convergence analysis in centralized training that relies on the gradient direction, it is significantly harder to analyze the convergence in FAL for two reasons: 1) the complexity of min-max optimization, and 2) model not updating in the gradient direction due to the multi-local updates on the client-side before aggregation.

# Fast Sketching of Polynomial Kernels of Polynomial Degree

Recent techniques in oblivious sketching reduce the dependence in the running time on the degree $q$ of the polynomial kernel from exponential to polynomial, which is useful for the Gaussian kernel, for which $q$ can be chosen to be polylogarithmic.

# Scatterbrain: Unifying Sparse and Low-rank Attention

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences.

# Sublinear Least-Squares Value Iteration via Locality Sensitive Hashing

We present the first provable Least-Squares Value Iteration (LSVI) algorithms that have runtime complexity sublinear in the number of actions.

# FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis

no code implementations, 11 May 2021

Nevertheless, training analysis of neural networks in FL is non-trivial for two reasons: first, the objective loss function we are optimizing is non-smooth and non-convex, and second, we are even not updating in the gradient direction.

# Near-Optimal Two-Pass Streaming Algorithm for Sampling Random Walks over Directed Graphs

In addition, we show a similar $\tilde{\Theta}(n \cdot \sqrt{L})$ bound on the space complexity of any algorithm (with any number of passes) for the related problem of sampling an $L$-step random walk from every vertex in the graph.

Data Structures and Algorithms, Computational Complexity

# Symmetric Sparse Boolean Matrix Factorization and Applications

As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation.
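
To make the average-case instance concrete, the sketch below generates $\mathbf{W}$ with $k$-sparse Boolean rows and forms $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ over the Boolean semiring (OR of ANDs); it is only an instance generator with illustrative dimensions, not a recovery algorithm.

```python
import numpy as np

def boolean_factorization_instance(n, m, k, rng):
    """W is an n x m Boolean matrix with exactly k ones per row; M = W W^T over the Boolean semiring."""
    W = np.zeros((n, m), dtype=bool)
    for i in range(n):
        W[i, rng.choice(m, size=k, replace=False)] = True
    M = (W.astype(int) @ W.astype(int).T) > 0   # Boolean product: OR of ANDs
    return W, M

rng = np.random.default_rng(0)
W, M = boolean_factorization_instance(n=8, m=20, k=3, rng=rng)
print(M.astype(int))  # M[i, j] = 1 exactly when rows i and j of W share a common column
```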

# Solving SDP Faster: A Robust IPM Framework and Efficient Implementation

This paper introduces a new robust interior point method analysis for semidefinite programming (SDP).

Optimization and Control, Data Structures and Algorithms

# Minimum Cost Flows, MDPs, and $\ell_1$-Regression in Nearly Linear Time for Dense Instances

In the special case of the minimum cost flow problem on $n$-vertex $m$-edge graphs with integer polynomially-bounded costs and capacities we obtain a randomized method which solves the problem in $\tilde{O}(m+n^{1.5})$ time.

Data Structures and Algorithms, Optimization and Control

# MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training

Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training.

# Graph Neural Network Acceleration via Matrix Dimension Reduction

Theoretically, we present two techniques to speed up GNTK training while preserving the generalization error: (1) We use a novel matrix decoupling method to reduce matrix dimensions during the kernel solving.

# What Can Phase Retrieval Tell Us About Private Distributed Learning?

In this work, we examine the security of InstaHide, a scheme recently proposed by \cite{hsla20} for preserving the security of private datasets in the context of distributed learning.

# Oblivious Sketching-based Central Path Method for Solving Linear Programming Problems

no code implementations, 1 Jan 2021

In this work, we propose a sketching-based central path method for solving linear programs, whose running time matches the state-of-the-art results [Cohen, Lee, Song STOC 19; Lee, Song, Zhang COLT 19].

# InstaHide's Sample Complexity When Mixing Two Private Images

Inspired by the InstaHide challenge [Huang, Song, Li and Arora'20], [Chen, Song and Zhuo'20] recently provided a mathematical formulation of the InstaHide attack problem under a Gaussian image distribution.

# On InstaHide, Phase Retrieval, and Sparse Matrix Factorization

In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning.

# Metric Transforms and Low Rank Matrices via Representation Theory of the Real Hyperrectangle

This completes the theory of Manhattan to Manhattan metric transforms initiated by Assouad in 1980.

# Algorithms and Hardness for Linear Algebra on Geometric Graphs

We investigate whether or not it is possible to solve the following problems in $n^{1+o(1)}$ time for a $\mathsf{K}$-graph $G_P$ when $d < n^{o(1)}$:
* Multiply a given vector by the adjacency matrix or Laplacian matrix of $G_P$;
* Find a spectral sparsifier of $G_P$;
* Solve a Laplacian system in $G_P$'s Laplacian matrix.
For each of these problems, we consider all functions of the form $\mathsf{K}(u, v) = f(\|u-v\|_2^2)$ for a function $f:\mathbb{R} \rightarrow \mathbb{R}$.

# MixCon: Adjusting the Separability of Data Representations for Harder Data Recovery

To address the issue that deep neural networks (DNNs) are vulnerable to model inversion attacks, we design an objective function, which adjusts the separability of the hidden data representations, as a way to control the trade-off between data utility and vulnerability to inversion attacks.

# TextHide: Tackling Data Privacy in Language Understanding Tasks

In addition, TextHide fits well with the popular framework of fine-tuning pre-trained language models (e.g., BERT) for any sentence or sentence-pair task.

# InstaHide: Instance-hiding Schemes for Private Distributed Learning

This paper introduces InstaHide, a simple encryption of training images, which can be plugged into existing distributed deep learning pipelines.

# Generalized Leverage Score Sampling for Neural Networks

Leverage score sampling is a powerful technique that originates from theoretical computer science, which can be used to speed up a large number of fundamental questions, e.g., linear regression, linear programming, semi-definite programming, cutting plane method, graph sparsification, maximum matching and max-flow.

# Training (Overparametrized) Neural Networks in Near-Linear Time

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort for developing faster $\mathit{second}$-$\mathit{order}$ optimization algorithms beyond SGD, without compromising the generalization error.

# When is Particle Filtering Efficient for Planning in Partially Observed Linear Dynamical Systems?

Though errors in past actions may affect the future, we are able to bound the number of particles needed so that the long-run reward of the policy based on particle filtering is close to that based on exact inference.

# Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

If the entries are drawn from any distribution $\mu$ for which the $(1+\gamma)$-th moment exists, for an arbitrarily small constant $\gamma > 0$, then it is possible to obtain a $(1+\epsilon)$-approximate column subset selection to the entrywise $\ell_1$-norm in nearly linear time.

# An Improved Cutting Plane Method for Convex Optimization, Convex-Concave Games and its Applications

We propose a new cutting plane algorithm that uses an optimal $O(n \log (\kappa))$ evaluations of the oracle and an additional $O(n^2)$ time per evaluation, where $\kappa = nR/\epsilon$.

# Privacy-preserving Learning via Deep Net Pruning

This paper attempts to answer the question whether neural network pruning can be used as a tool to achieve differential privacy without losing much data utility.

# Sketching Transformed Matrices with Applications to Natural Language Processing

no code implementations, 23 Feb 2020, Yingyu Liang

We show that our approach obtains small error and is efficient in both space and time.

# Meta-learning for mixed linear regression

In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labeled data.

# Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

Our work proves convergence to low robust training loss for \emph{polynomial} width instead of exponential, under natural assumptions and with the ReLU activation.

# Parallel Neural Text-to-Speech

In this work, we first propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.

# Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments

no code implementations, 16 Dec 2019

In this paper, we give the first algorithm for learning an MLR that runs in time which is sub-exponential in $k$.

# WaveFlow: A Compact Flow-based Model for Raw Audio

WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases.

# Provable Non-linear Inductive Matrix Completion

The inductive matrix completion (IMC) method is a standard approach for this problem, where the given query as well as the items are embedded in a common low-dimensional space.

# Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

If the entries are drawn from any distribution $\mu$ for which the $(1+\gamma)$-th moment exists, for an arbitrarily small constant $\gamma > 0$, then it is possible to obtain a $(1+\epsilon)$-approximate column subset selection to the entrywise $\ell_1$-norm in nearly linear time.

# Efficient Symmetric Norm Regression via Linear Sketching

When the loss function is a general symmetric norm, our algorithm produces a $\sqrt{d} \cdot \mathrm{polylog} n \cdot \mathrm{mmc}(\ell)$-approximate solution in input-sparsity time, where $\mathrm{mmc}(\ell)$ is a quantity related to the symmetric norm under consideration.

# Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

For input $\mathcal{A}$ as above, we give $O(\sum_{i=1}^q \text{nnz}(A_i))$ time algorithms, which is much faster than computing $\mathcal{A}$.

# Total Least Squares Regression in Input Sparsity Time

In the total least squares problem, one is given an $m \times n$ matrix $A$, and an $m \times d$ matrix $B$, and one seeks to "correct" both $A$ and $B$, obtaining matrices $\hat{A}$ and $\hat{B}$, so that there exists an $X$ satisfying the equation $\hat{A}X = \hat{B}$.
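
For reference, the sketch below is the classical SVD-based solution to total least squares (Golub and Van Loan); the paper's contribution is an input-sparsity-time algorithm, which this baseline is not.

```python
import numpy as np

def total_least_squares(A, B):
    """Classical TLS via the SVD of [A B]: returns X minimizing ||[dA dB]||_F with (A + dA) X = B + dB."""
    n, d = A.shape[1], B.shape[1]
    _, _, Vt = np.linalg.svd(np.hstack([A, B]), full_matrices=True)
    V = Vt.T
    V12 = V[:n, n:]                       # top-right block of V
    V22 = V[n:, n:]                       # bottom-right block of V (assumed invertible)
    return -V12 @ np.linalg.inv(V22)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
X_true = rng.standard_normal((5, 2))
B = A @ X_true + 0.01 * rng.standard_normal((100, 2))
print(np.linalg.norm(total_least_squares(A, B) - X_true))  # small
```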

# Quadratic Suffices for Over-parametrization via Matrix Chernoff Bound

no code implementations, 9 Jun 2019

We improve the over-parametrization size over two beautiful results [Li and Liang 2018] and [Du, Zhai, Poczos and Singh 2019] in deep learning theory.

# Non-Autoregressive Neural Text-to-Speech

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.

# Solving Empirical Risk Minimization in the Current Matrix Multiplication Time

no code implementations, 11 May 2019

Our result generalizes the very recent result of solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a more broad class of problems.

# Efficient Model-free Reinforcement Learning in Metric Spaces

1 code implementation, 1 May 2019

Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human level performance in applications such as video games [Mnih et al. 15].

# The Limitations of Adversarial Training and the Blind-Spot Attack

In our paper, we shed some light on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on the test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network.

# Towards a Theoretical Understanding of Hashing-Based Neural Nets

no code implementations, 26 Dec 2018

In this paper, we provide provable guarantees on some hashing-based parameter reduction methods in neural nets.

# Algorithmic Theory of ODEs and Sampling from Well-conditioned Logconcave Densities

We apply this to the sampling problem to obtain a nearly linear implementation of HMC for a broad class of smooth, strongly logconcave densities, with the number of iterations (parallel depth) and gradient evaluations being $\mathit{polylogarithmic}$ in the dimension (rather than polynomial as in previous work).

# Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator.

# A Convergence Theory for Deep Learning via Over-Parameterization

In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

# Towards a Zero-One Law for Column Subset Selection

Our approximation algorithms handle functions which are not even scale-invariant, such as the Huber loss function, which we show have very different structural properties than $\ell_p$-norms, e.g., one can show the lack of scale-invariance causes any column subset selection algorithm to provably require a $\sqrt{\log n}$ factor larger number of columns than $\ell_p$-norms; nevertheless we design the first efficient column subset selection algorithms for such error measures.

# On the Convergence Rate of Training Recurrent Neural Networks

In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing.

# Nonlinear Inductive Matrix Completion based on One-layer Neural Networks

A standard approach to modeling this problem is Inductive Matrix Completion where the predicted rating is modeled as an inner product of the user and the item features projected onto a latent space.

# Towards Fast Computation of Certified Robustness for ReLU Networks

Verifying the robustness property of a general Rectified Linear Unit (ReLU) network is an NP-complete problem [Katz, Barrett, Dill, Julian and Kochenderfer CAV17].

# Learning Long Term Dependencies via Fourier Recurrent Units

In this paper we propose a simple recurrent architecture, the Fourier Recurrent Unit (FRU), that stabilizes the gradients that arise in its training while giving us stronger expressive power.

# Nearly Optimal Dynamic $k$-Means Clustering for High-Dimensional Data

We consider the $k$-means clustering problem in the dynamic streaming setting, where points from a discrete Euclidean space $\{1, 2, \ldots, \Delta\}^d$ can be dynamically inserted to or deleted from the dataset.

# Sketching for Kronecker Product Regression and P-splines

That is, TensorSketch only provides input sparsity time for Kronecker product regression with respect to the $2$-norm.
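
The structure such sketching methods exploit is visible in the identity $(A_1 \otimes A_2)\,\mathrm{vec}(X) = \mathrm{vec}(A_2 X A_1^\top)$ (with column-major $\mathrm{vec}$), which lets one apply a Kronecker product to a vector without ever forming it. The small numerical check below illustrates the identity; it is not TensorSketch itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((6, 3))          # A1 has shape m1 x n1
A2 = rng.standard_normal((5, 4))          # A2 has shape m2 x n2
X = rng.standard_normal((4, 3))           # X has shape n2 x n1, so vec(X) has length n1 * n2

# Direct evaluation: explicitly form the (m1 m2) x (n1 n2) Kronecker product.
lhs = np.kron(A1, A2) @ X.flatten(order="F")

# Structured evaluation: (A1 kron A2) vec(X) = vec(A2 X A1^T), never forming the big matrix.
rhs = (A2 @ X @ A1.T).flatten(order="F")

print(np.allclose(lhs, rhs))              # True
```

For regression, this means the residual $\|(A_1 \otimes A_2)x - b\|_2$ can be evaluated in time proportional to the sizes of $A_1$ and $A_2$ rather than to the size of their Kronecker product.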

# Stochastic Multi-armed Bandits in Constant Space

no code implementations, 25 Dec 2017

We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms.

# Scalable Model Selection for Belief Networks

We propose a scalable algorithm for model selection in sigmoid belief networks (SBNs), based on the factorized asymptotic Bayesian (FAB) framework.

# Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels

In this paper, we consider parameter recovery for non-overlapping convolutional neural networks (CNNs) with multiple kernels.

# Recovery Guarantees for One-hidden-layer Neural Networks

For activation functions that are also smooth, we show $\mathit{local~linear~convergence}$ guarantees of gradient descent under a resampling rule.

# Fast Regression with an $\ell_\infty$ Guarantee

no code implementations, 30 May 2017

Our main result is that, when $S$ is the subsampled randomized Fourier/Hadamard transform, the error $x' - x^*$ behaves as if it lies in a "random" direction within this bound: for any fixed direction $a\in \mathbb{R}^d$, we have with $1 - d^{-c}$ probability that $\langle a, x'-x^*\rangle \lesssim \frac{\|a\|_2\|x'-x^*\|_2}{d^{\frac{1}{2}-\gamma}}, \quad (1)$ where $c, \gamma > 0$ are arbitrary constants.

# Relative Error Tensor Low Rank Approximation

Despite the success on obtaining relative error low rank approximations for matrices, no such results were known for tensors.

# Sublinear Time Orthogonal Tensor Decomposition

We show in a number of cases one can achieve the same theoretical guarantees in sublinear time, i.e., even without reading most of the input tensor.

# Linear Feature Encoding for Reinforcement Learning

We then develop a supervised linear feature encoding method that is motivated by insights from linear value function approximation theory, as well as empirical successes from deep RL.

# Low Rank Approximation with Entrywise $\ell_1$-Norm Error

We give the first provable approximation algorithms for $\ell_1$-low rank approximation, showing that it is possible to achieve approximation factor $\alpha = (\log d) \cdot \mathrm{poly}(k)$ in $\mathrm{nnz}(A) + (n+d) \mathrm{poly}(k)$ time, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$.

# A Max-Product EM Algorithm for Reconstructing Markov-tree Sparse Signals from Compressive Samples

no code implementations, 5 Sep 2012

Our signal reconstruction scheme is based on an EM iteration that aims at maximizing the posterior distribution of the signal and its state variables given the noise variance.
