In this work, we dissect these performance gains through the lens of data memorization in overparameterized models.
Algorithm- and data-dependent generalization bounds are required to explain the generalization behavior of modern machine learning algorithms.
We present novel bounds for coreset construction, feature selection, and dimensionality reduction for logistic regression.
Modern deep learning models are over-parameterized, and different optima can result in widely varying generalization performance.
We study the fundamental problem of selecting optimal features for model construction.
Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic non-convex settings is still poorly understood.
To enhance practicality, we devise an adaptive scheme for choosing L, and we show that it reduces the number of local iterations performed on worker machines between two model synchronizations as training proceeds, successively refining the model quality at the master.
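As a minimal sketch of the periodic-averaging loop this sentence describes, assuming plain SGD workers; adapt_L below is a hypothetical placeholder for the paper's adaptive rule, shrinking L as training proceeds:

    import numpy as np

    def adapt_L(L, r):
        # Hypothetical placeholder for the paper's adaptive rule:
        # shrink L as training proceeds, as the sentence above describes.
        return max(1, L - 1)

    def local_sgd(grad, x0, num_workers=4, rounds=10, lr=0.1, L0=8):
        # Periodic averaging: each worker runs L local SGD steps, then the
        # master averages the worker models; L is re-chosen every round.
        x = np.asarray(x0, dtype=float)
        L = L0
        for r in range(rounds):
            models = []
            for w in range(num_workers):
                xw = x.copy()
                for _ in range(L):            # L local iterations per worker
                    xw = xw - lr * grad(xw, w)
                models.append(xw)
            x = np.mean(models, axis=0)       # synchronization at the master
            L = adapt_L(L, r)                 # adapt L between rounds
        return x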
The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing.
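For concreteness, a minimal NumPy sketch of the standard Nyström approximation built from a column subset S; uniform sampling here is only an illustrative stand-in for the selection rules these methods study:

    import numpy as np

    def nystrom(K, m, seed=0):
        # Nystrom approximation: K ~= C @ pinv(W) @ C.T, with C = K[:, S]
        # and W = K[S, S] for a sampled index set S of size m.
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        S = rng.choice(n, size=m, replace=False)  # uniform sampling, for illustration
        C = K[:, S]
        W = K[np.ix_(S, S)]
        return C @ np.linalg.pinv(W) @ C.T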
Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains.
Using these observations, we show that noise augmentation on top of mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and out-of-distribution (OOD) transforms.
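A minimal sketch of mixup with additive Gaussian input noise; the paper's exact noise-augmentation scheme may differ, and one-hot or soft labels are assumed:

    import numpy as np

    def noisy_mixup(x1, y1, x2, y2, alpha=1.0, noise_std=0.1, seed=0):
        # Standard mixup: a convex combination of two examples and their
        # (one-hot or soft) labels with lam ~ Beta(alpha, alpha). Gaussian
        # input noise is added on top; the paper's exact noise-augmentation
        # scheme may differ from this sketch.
        rng = np.random.default_rng(seed)
        lam = rng.beta(alpha, alpha)
        x = lam * x1 + (1 - lam) * x2
        x = x + rng.normal(0.0, noise_std, size=np.shape(x))
        y = lam * y1 + (1 - lam) * y2
        return x, y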
Bayesian coresets have emerged as a promising approach for implementing scalable Bayesian inference.
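For context, the standard Bayesian coreset formulation replaces the full-data log-likelihood with a sparse weighted approximation (this is the usual setup, not necessarily the exact variant studied here):

\[
\sum_{n=1}^{N} \mathcal{L}_n(\theta) \;\approx\; \sum_{n=1}^{N} w_n \mathcal{L}_n(\theta),
\qquad w \ge 0, \quad \|w\|_0 \ll N.
\]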
Iterative hard thresholding (IHT) is a projected gradient descent algorithm known to achieve state-of-the-art performance for a wide range of structured estimation problems, such as sparse inference.
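A minimal sketch of IHT for sparse least squares, assuming the objective 0.5 * ||Ax - y||^2 and a projection that keeps the s largest-magnitude entries:

    import numpy as np

    def iht(A, y, s, iters=100):
        # Iterative hard thresholding for sparse least squares: a gradient
        # step on 0.5 * ||A x - y||^2, then projection onto s-sparse vectors
        # by keeping the s largest-magnitude entries.
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / sigma_max(A)^2
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x = x + step * A.T @ (y - A @ x)     # gradient step
            keep = np.argsort(np.abs(x))[-s:]    # s largest entries
            z = np.zeros_like(x)
            z[keep] = x[keep]                    # hard-thresholding projection
            x = z
        return x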
We investigate the rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to a target probability measure.
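As an illustration, a sketch of plain (unweighted) kernel herding over a finite candidate set standing in for the target measure; WKH additionally re-optimizes the quadrature weights, and SBQ selects points by minimizing posterior variance, both omitted here:

    import numpy as np

    def gaussian_gram(X, Y, bw=1.0):
        # Gaussian kernel Gram matrix between rows of X and Y.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bw ** 2))

    def kernel_herding(cands, m, bw=1.0):
        # Greedily pick the candidate whose addition best matches the
        # empirical kernel mean embedding of the candidate set.
        G = gaussian_gram(cands, cands, bw)
        mu = G.mean(axis=1)                      # kernel mean embedding
        chosen = []
        for t in range(m):
            herd = G[:, chosen].sum(axis=1) if chosen else np.zeros(len(cands))
            chosen.append(int(np.argmax(mu - herd / (t + 1))))
        return cands[chosen]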
Research in both machine learning and psychology suggests that salient examples can help humans interpret learning models.
Finally, we present a stopping criterion based on the duality gap from the classic Frank-Wolfe (FW) analyses, together with extensive experiments illustrating the usefulness of our theoretical and algorithmic contributions.
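A minimal sketch of this criterion for Frank-Wolfe over the probability simplex (an illustrative domain choice); the duality gap g(x) = <grad f(x), x - s> upper-bounds the suboptimality f(x) - f(x*):

    import numpy as np

    def frank_wolfe_simplex(grad_f, x0, eps=1e-6, max_iters=1000):
        # Frank-Wolfe over the probability simplex, stopping once the
        # duality gap g(x) = <grad f(x), x - s> falls below eps; the gap
        # upper-bounds the suboptimality f(x) - f(x*).
        x = np.asarray(x0, dtype=float)
        for t in range(max_iters):
            g = grad_f(x)
            s = np.zeros_like(x)
            s[np.argmin(g)] = 1.0                # linear minimization oracle
            gap = g @ (x - s)                    # FW duality gap
            if gap <= eps:                       # the stopping criterion
                break
            x = x + 2.0 / (t + 2.0) * (s - x)    # standard step size
        return x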
Variational inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one.
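Concretely, the approximation q is typically fit by maximizing the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence from q to the posterior:

\[
\log p(x) \;\ge\; \mathbb{E}_{q(z)}\big[\log p(x, z) - \log q(z)\big]
\;=\; \log p(x) - \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big).
\]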
We provide new approximation guarantees for greedy low rank matrix estimation under standard assumptions of restricted strong convexity and smoothness.
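For reference, these standard assumptions require that, over the restricted set \Omega (here, low-rank matrices), the objective f satisfies for all x, y \in \Omega

\[
\frac{m}{2}\,\|y - x\|^2 \;\le\; f(y) - f(x) - \langle \nabla f(x),\, y - x \rangle \;\le\; \frac{M}{2}\,\|y - x\|^2,
\]

with restricted strong convexity constant m and restricted smoothness constant M.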
Furthermore, we show that a bounded submodularity ratio can be used to provide data-dependent bounds that can sometimes be tighter even for submodular functions.
Two of the most fundamental prototypes of greedy optimization are the matching pursuit and Frank-Wolfe algorithms.
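A minimal sketch of matching pursuit, assuming a dictionary D with unit-norm columns (a Frank-Wolfe sketch with its duality-gap stopping rule appears earlier in this section):

    import numpy as np

    def matching_pursuit(D, y, iters=10):
        # Matching pursuit: greedily pick the dictionary atom most
        # correlated with the residual and subtract its contribution.
        # Columns of D are assumed to have unit norm.
        r = y.astype(float).copy()
        coef = np.zeros(D.shape[1])
        for _ in range(iters):
            corr = D.T @ r
            j = int(np.argmax(np.abs(corr)))     # greedy atom selection
            coef[j] += corr[j]
            r = r - corr[j] * D[:, j]            # residual update
        return coef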
Our results extend the work of Das and Kempe (2011) from the setting of linear regression to arbitrary objective functions.
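For reference, the submodularity ratio of Das and Kempe, which quantifies how close a set function f is to submodular (\gamma_{U,k} \ge 1 when f is submodular), is

\[
\gamma_{U,k}(f) \;=\; \min_{L \subseteq U,\; S \cap L = \emptyset,\; |S| \le k}
\frac{\sum_{x \in S} \big( f(L \cup \{x\}) - f(L) \big)}{f(L \cup S) - f(L)}.
\]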
Approximate inference via information projection has recently been introduced as a general-purpose approach for efficient probabilistic inference given sparse variables.
In a recent paper, Levy and Goldberg pointed out an interesting connection between prediction-based word embedding models and count models based on pointwise mutual information.
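Specifically, Levy and Goldberg show that skip-gram with negative sampling (SGNS) implicitly factorizes a shifted pointwise mutual information (PMI) matrix,

\[
\vec{w} \cdot \vec{c} \;=\; \mathrm{PMI}(w, c) - \log k
\;=\; \log \frac{P(w, c)}{P(w)\, P(c)} - \log k,
\]

where k is the number of negative samples.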
In cases where the information projection is intractable, we propose a family of parameterized approximations indexed by subsets of the domain.