Near-Optimal Mean Estimation with Unknown, Heteroskedastic Variances

Given data drawn from a collection of Gaussian variables with a common mean but different and unknown variances, what is the best algorithm for estimating their common mean?

Testing with Non-identically Distributed Samples

From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance.

One-sided Matrix Completion from Two Observations Per Row

We propose a natural algorithm that involves imputing the missing values of the matrix $X^TX$ and show that even with only two observations per row in $X$, we can provably recover $X^TX$ as long as we have at least $\Omega(r^2 d \log d)$ rows, where $r$ is the rank and $d$ is the number of columns.
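The inverse-probability-weighting idea behind such imputation can be illustrated on a toy instance. The following sketch (an illustration, not the paper's algorithm; all parameters here are hypothetical) estimates $X^TX$ when each row reveals exactly two uniformly random entries, reweighting each observed product by the inverse probability of seeing that column pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a rank-r matrix X with n rows and d columns, where each row
# reveals exactly two uniformly random entries.
n, d, r = 20000, 8, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
pairs = np.array([rng.choice(d, size=2, replace=False) for _ in range(n)])

# Inverse-probability-weighted estimate of X^T X: each unordered column
# pair {j, k} is observed with probability 1 / C(d, 2), and each single
# column j appears in a row's pair with probability 2 / d.
est = np.zeros((d, d))
n_pairs = d * (d - 1) / 2
for i in range(n):
    j, k = pairs[i]
    est[j, k] += n_pairs * X[i, j] * X[i, k]
    est[k, j] += n_pairs * X[i, j] * X[i, k]
    est[j, j] += (d / 2) * X[i, j] ** 2
    est[k, k] += (d / 2) * X[i, k] ** 2
```

Each accumulated term is an unbiased estimate of the corresponding entry of $X^TX$, so with enough rows the estimate concentrates around the truth.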

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class?
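For linear functions, a training prompt of the kind described above can be sampled as follows (a minimal sketch of the data-generation side only; the function name and parameters are illustrative, not from the paper):

```python
import numpy as np

def linear_icl_prompt(d=5, k=10, rng=None):
    """Sample one in-context learning prompt for the class of linear
    functions f_w(x) = w . x: k labeled demonstrations plus a query."""
    if rng is None:
        rng = np.random.default_rng()
    w = rng.standard_normal(d)            # hidden function from the class
    xs = rng.standard_normal((k + 1, d))  # k demonstrations + 1 query
    ys = xs @ w
    # The model sees (x_1, y_1, ..., x_k, y_k, x_query) and is trained
    # to output y_query = w . x_query.
    return xs, ys[:-1], ys[-1]
```

A fresh $w$ is drawn per prompt, so the model must infer the function from the demonstrations at inference time rather than memorize it.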

Efficient Convex Optimization Requires Superlinear Memory

We show that any memory-constrained, first-order algorithm which minimizes $d$-dimensional, $1$-Lipschitz convex functions over the unit ball to $1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - \delta}$ bits of memory must make at least $\tilde{\Omega}(d^{1 + (4/3)\delta})$ first-order queries (for any constant $\delta \in [0, 1/4]$).

On the Statistical Complexity of Sample Amplification

In this work, we place the sample amplification problem on a firm statistical foundation by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions.

Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

We consider the problem of minimizing a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ which is implicitly decomposable as the sum of $m$ unknown non-interacting smooth, strongly convex functions and provide a method which solves this problem with a number of gradient evaluations that scales (up to logarithmic factors) as the product of the square-root of the condition numbers of the components.

Beyond Laurel/Yanny: An Autoencoder-Enabled Search for Polyperceivable Audio

Our results suggest that polyperceivable examples are surprisingly prevalent in natural language, existing for over 2% of English words.

Exponential Weights Algorithms for Selective Learning

29 Jun 2021 (no code implementations)

We study the selective learning problem introduced by Qiao and Valiant (2019), in which the learner observes $n$ labeled data points one at a time.

Sinkhorn Label Allocation: Semi-Supervised Classification via Annealed Self-Training

Self-training is a standard approach to semi-supervised learning where the learner's own predictions on unlabeled data are used as supervision during training.
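The generic self-training loop described above can be sketched in a few lines (an illustrative baseline, not the paper's Sinkhorn-based allocation; the classifier, threshold, and function name are all assumptions for the example, and labels are assumed to be $0, \ldots, K-1$):

```python
import numpy as np

def self_train(x_lab, y_lab, x_unlab, rounds=5, threshold=2.0):
    """Minimal self-training sketch: a nearest-centroid classifier
    repeatedly pseudo-labels the unlabeled points it is most confident
    about, then retrains on the enlarged labeled set."""
    x_lab, y_lab = x_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        centroids = np.stack([x_lab[y_lab == c].mean(axis=0)
                              for c in np.unique(y_lab)])
        dists = np.linalg.norm(x_unlab[:, None] - centroids[None], axis=2)
        pred = dists.argmin(axis=1)
        # Confidence margin: gap between best and second-best centroid.
        margin = np.sort(dists, axis=1)[:, 1] - dists.min(axis=1)
        keep = margin > threshold
        if not keep.any():
            break
        x_lab = np.vstack([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        x_unlab = x_unlab[~keep]
    return x_lab, y_lab
```

The confidence threshold is what keeps early, possibly wrong, pseudo-labels from contaminating training; annealing such a threshold is the knob this line of work tunes.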

On Misspecification in Prediction Problems and Robustness via Improper Learning

We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model.

Stronger Calibration Lower Bounds via Sidestepping

7 Dec 2020 (no code implementations)

In this paper, we prove an $\Omega(T^{0.528})$ bound on the calibration error, which is the first super-$\sqrt{T}$ lower bound for this setting to the best of our knowledge.

On the Generalization Effects of Linear Transformations in Data Augmentation

We validate our proposed scheme on image and text datasets.

Sublinear Optimal Policy Value Estimation in Contextual Bandits

We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting.

Worst-Case Analysis for Randomly Collected Data

Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$.

Making AI Forget You: Data Deletion in Machine Learning

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU's Right To Be Forgotten regulation is an example of this effort.

A Surprising Density of Illusionable Natural Speech

3 Jun 2019 (no code implementations)

Recent work on adversarial examples has demonstrated that most natural inputs can be perturbed to fool even state-of-the-art machine learning systems.

Sample Amplification: Increasing Dataset Size even when Learning is Impossible

In the Gaussian case, we show that an $\left(n, n+\Theta(\frac{n}{\sqrt{d}} )\right)$ amplifier exists, even though learning the distribution to small constant total variation distance requires $\Theta(d)$ samples.
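To see why this is nontrivial, consider the obvious baseline: fit a Gaussian to the data and append fresh draws from the fit (a naive "learn then sample" amplifier, not the paper's procedure; the function name is illustrative). The paper's point is precisely that amplification is possible even when this fit is far from the true distribution:

```python
import numpy as np

def naive_gaussian_amplifier(samples, k, rng=None):
    """Baseline amplifier: append k draws from the empirical Gaussian
    fit and shuffle, hoping the result passes for n + k i.i.d. samples."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = samples.shape
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + 1e-9 * np.eye(d)  # regularize
    extra = rng.multivariate_normal(mean, cov, size=k)
    out = np.vstack([samples, extra])
    return out[rng.permutation(n + k)]
```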

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term: the sum, over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point.

Memory-Sample Tradeoffs for Linear Regression with Small Error

We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints.

A Theory of Selective Prediction

12 Feb 2019 (no code implementations)

The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far.

Maximum Likelihood Estimation for Learning Populations of Parameters

Precisely, for sufficiently large $N$, the MLE achieves the information-theoretically optimal error bound of $\mathcal{O}(\frac{1}{t})$ for $t < c\log{N}$, with respect to the earth mover's distance (between the estimated and true distributions).

Equivariant Transformer Networks

How can prior knowledge on the transformation invariances of a domain be incorporated into the architecture of a neural network?

A Spectral View of Adversarially Robust Features

This connection can be leveraged to provide both robust features, and a lower bound on the robustness of any function that has significant variance across the dataset.

Estimating Learnability in the Sublinear Data Regime

In this setting, we show that with $O(\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data.

Learning Discrete Distributions from Untrusted Batches

22 Nov 2017 (no code implementations)

Specifically, we consider the setting where there is some underlying distribution, $p$, and each data source provides a batch of $\ge k$ samples, with the guarantee that at least a $(1-\epsilon)$ fraction of the sources draw their samples from a distribution with total variation distance at most $\eta$ from $p$.

Learning Overcomplete HMMs

On the other hand, we show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree.

Sketching Linear Classifiers over Data Streams

We introduce a new sub-linear space sketch---the Weight-Median Sketch---for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the model.

Learning Populations of Parameters

Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0, 1]$, and we observe $n$ independent random variables, $X_1,\ldots, X_n$, with $X_i \sim$ Binomial$(t, p_i)$.
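A standard trick in this setting, shown here as an illustration (the parameter choices are hypothetical), is that low-order moments of the hidden $p_i$'s can be estimated without estimating any individual $p_i$: falling-factorial moments of a Binomial are unbiased for powers of $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the model: n entities with hidden p_i, one Binomial(t, p_i)
# observation each (t small, so X_i / t is a noisy read of p_i).
n, t = 50000, 10
p = rng.uniform(0.2, 0.8, size=n)
X = rng.binomial(t, p)

# Unbiased moment estimates via falling factorials:
#   E[X / t] = E[p]   and   E[X (X - 1) / (t (t - 1))] = E[p^2],
# so Var(p) across the population is estimable even though the
# individual p_i are not.
m1 = np.mean(X / t)
m2 = np.mean(X * (X - 1) / (t * (t - 1)))
var_p = m2 - m1 ** 2
```

Here the truth is $E[p] = 0.5$ and $\mathrm{Var}(p) = 0.03$, and the moment estimates recover both to within sampling error.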

A Data Prism: Semi-Verified Learning in the Small-Alpha Regime

This setting can be viewed as an instance of the semi-verified learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of good evaluators.

Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data

What learning algorithms can be run directly on compressively-sensed data?

Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

We introduce a criterion, resilience, which allows properties of a dataset (such as its mean or best low rank approximation) to be robustly computed, even in the presence of a large fraction of arbitrary additional data.

Orthogonalized ALS: A Theoretically Principled Tensor Decomposition Algorithm for Practical Use

The popular Alternating Least Squares (ALS) algorithm for tensor decomposition is efficient and easy to implement, but often converges to poor local optima---particularly when the weights of the factors are non-uniform.

Prediction with a Short Memory

For a Hidden Markov Model with $n$ hidden states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length $O(\log n/\epsilon)$ windows of observations achieves this error, provided the length of the sequence is $d^{\Omega(\log n/\epsilon)}$, where $d$ is the size of the observation alphabet.
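The "trivial prediction algorithm based on empirical frequencies" is simple enough to state in full (a minimal sketch; the function name and the toy sequence are illustrative):

```python
from collections import Counter, defaultdict

def window_predictor(seq, window):
    """Short-memory predictor: tabulate empirical frequencies of the
    next symbol conditioned on the previous `window` observations, and
    predict the most frequent continuation."""
    counts = defaultdict(Counter)
    for i in range(window, len(seq)):
        counts[tuple(seq[i - window:i])][seq[i]] += 1

    def predict(context):
        c = counts.get(tuple(context[-window:]))
        return c.most_common(1)[0][0] if c else None

    return predict

predict = window_predictor("abcabcabcabc", window=2)
```

On this periodic toy sequence, a window of length 2 already determines the next symbol, e.g. `predict("ab")` returns `"c"`.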

Learning from Untrusted Data

For example, given a dataset of $n$ points for which an unknown subset of $\alpha n$ points are drawn from a distribution of interest, and no assumptions are made about the remaining $(1-\alpha)n$ points, is it possible to return a list of $\operatorname{poly}(1/\alpha)$ answers, one of which is correct?

Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction

We consider a crowdsourcing model in which $n$ workers are asked to rate the quality of $n$ items previously generated by other workers.

Recovering Structured Probability Matrices

When can accurate reconstruction be accomplished in the sparse data regime?

Spectrum Estimation from Samples

30 Jan 2016 (1 code implementation)

We consider this fundamental recovery problem in the regime where the number of samples is comparable, or even sublinear in the dimensionality of the distribution in question.

Instance Optimal Learning

21 Apr 2015 (no code implementations)

One conceptual implication of this result is that for large samples, Bayesian assumptions on the "shape" or bounds on the tail probabilities of a distribution over discrete support are not helpful for the task of learning the distribution.

Testing Closeness With Unequal Sized Samples

We consider the problem of closeness testing for two discrete distributions in the practically relevant setting of \emph{unequal} sized samples drawn from each of them.

Estimating the Unseen: Improved Estimators for Entropy and other Properties

Recently, [Valiant and Valiant] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a \emph{sublinear}-sized sample.

Least Squares Revisited: Scalable Approaches for Multi-class Prediction

This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large.

Optimal Algorithms for Testing Closeness of Discrete Distributions

We study the question of closeness testing for two discrete distributions.
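A chi-squared-style statistic used in this line of work (stated here from memory as an illustration, with Poissonized toy counts; the function name and parameters are assumptions) subtracts off the sampling noise so that its expectation is zero when the two distributions are identical:

```python
import numpy as np

def closeness_statistic(x_counts, y_counts):
    """Per-symbol counts from two Poissonized samples; the statistic is
    near zero when both samples come from the same distribution and
    grows with the discrepancy between them."""
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    mask = (x + y) > 0  # empty symbols contribute nothing
    x, y = x[mask], y[mask]
    return np.sum(((x - y) ** 2 - x - y) / (x + y))
```

On a toy domain of 200 symbols with 5000 expected samples per side, identical distributions give a statistic near zero, while a distribution perturbed by constant total variation distance gives a much larger value.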
