Search Results for author: Andrej Risteski

Found 59 papers, 9 papers with code

Progressive distillation induces an implicit curriculum

no code implementations 7 Oct 2024 Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel

Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups.

Knowledge Distillation
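
The abstract above describes progressive distillation, in which the student learns from a sequence of intermediate teacher checkpoints rather than only the final teacher. Below is a minimal sketch of such a training loop under that reading; `student`, `teacher_checkpoints`, and `loader` are placeholder names, and the temperature-scaled KL loss is one standard distillation objective, not necessarily the one used in the paper.

```python
import torch
import torch.nn.functional as F

def progressive_distillation(student, teacher_checkpoints, loader,
                             epochs_per_stage=1, lr=1e-3, temperature=2.0):
    """Train `student` against a sequence of teacher checkpoints (early -> final)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for teacher in teacher_checkpoints:   # assumed ordered by teacher training time
        teacher.eval()
        for _ in range(epochs_per_stage):
            for x, _ in loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                # Soft-label distillation: KL between temperature-scaled distributions.
                loss = F.kl_div(
                    F.log_softmax(s_logits / temperature, dim=-1),
                    F.softmax(t_logits / temperature, dim=-1),
                    reduction="batchmean",
                ) * temperature ** 2
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student
```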

On the Benefits of Memory for Modeling Time-Dependent PDEs

no code implementations 3 Sep 2024 Ricardo Buitrago Ruiz, Tanya Marwah, Albert Gu, Andrej Risteski

Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs).

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

no code implementations NeurIPS 2023 Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns.

LEMMA

Deep Equilibrium Based Neural Operators for Steady-State PDEs

no code implementations NeurIPS 2023 Tanya Marwah, Ashwini Pokle, J. Zico Kolter, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski

Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory.
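
To make the fixed-point idea above concrete, here is a minimal sketch (not the FNO-DEQ implementation) of solving $z^* = f(z^*, x)$ by plain iteration with no intermediate activations stored, then re-attaching the result to the autograd graph with a single differentiable application of $f$. This "one-step" gradient is a common cheap stand-in for exact implicit differentiation; the toy contraction `f` and sizes are placeholders.

```python
import torch

def deq_fixed_point(f, x, z0, max_iter=50, tol=1e-5):
    """Solve z = f(z, x) by fixed-point iteration; O(1) memory in the solver depth."""
    z = z0
    with torch.no_grad():          # the solve itself keeps no computation graph
        for _ in range(max_iter):
            z_next = f(z, x)
            if torch.norm(z_next - z) < tol * (torch.norm(z) + 1e-8):
                z = z_next
                break
            z = z_next
    # Re-attach to autograd with one differentiable application of f
    # (one-step approximation to differentiating through the fixed point).
    return f(z.detach(), x)

# Toy contraction f(z, x) = 0.5 * tanh(z @ W + x):
W = (0.1 * torch.randn(8, 8)).requires_grad_()
f = lambda z, x: 0.5 * torch.tanh(z @ W + x)
x = torch.randn(4, 8)
z_star = deq_fixed_point(f, x, torch.zeros(4, 8))
z_star.sum().backward()            # gradients reach W through the final application
print(W.grad.shape)
```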

Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

no code implementations 7 Nov 2023 Elan Rosenfeld, Andrej Risteski

We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data.

Stochastic Optimization

Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Diffusions

no code implementations 15 Jun 2023 Yilong Qin, Andrej Risteski

Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in $d$ dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covariance -- obviating the Poincaré constant-based lower bounds of the basic score matching loss shown in Koehler et al. 2022.

Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression

no code implementations 1 Jun 2023 Runtian Zhai, Bingbin Liu, Andrej Risteski, Zico Kolter, Pradeep Ravikumar

Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator, suggesting that learning a linear probe atop such a representation can be connected to RKHS regression.

Contrastive Learning, Data Augmentation, +7

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

1 code implementation 7 Mar 2023 Yuchen Li, Yuanzhi Li, Andrej Risteski

While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking.

Neural Network Approximations of PDEs Beyond Linearity: A Representational Perspective

no code implementations 21 Oct 2022 Tanya Marwah, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski

We show that if composing a function with Barron norm $b$ with partial derivatives of $L$ produces a function of Barron norm at most $B_L b^p$, the solution to the PDE can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\left(\left(dB_L\right)^{\max\{p \log(1/ \epsilon), p^{\log(1/\epsilon)}\}}\right)$.

Statistical Efficiency of Score Matching: The View from Isoperimetry

no code implementations 3 Oct 2022 Frederic Koehler, Alexander Heckett, Andrej Risteski

Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant.
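
For reference, the classical (Hyvärinen) score matching objective that the estimator above is based on fits an unnormalized model $p_\theta$ by matching scores; after integration by parts it does not require the partition function:

$$
J(\theta) \;=\; \mathbb{E}_{x \sim p}\!\left[\tfrac{1}{2}\,\big\|\nabla_x \log p_\theta(x) - \nabla_x \log p(x)\big\|^2\right]
\;=\; \mathbb{E}_{x \sim p}\!\left[\tfrac{1}{2}\,\|\nabla_x \log p_\theta(x)\|^2 + \Delta_x \log p_\theta(x)\right] + \text{const.}
$$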

Pitfalls of Gaussians as a noise distribution in NCE

no code implementations 1 Oct 2022 Holden Lee, Chirag Pabbaraju, Anish Sevekari, Andrej Risteski

Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality.
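
As background for the snippet above, here is a minimal sketch of the standard NCE objective: data and noise samples are distinguished by a logistic classifier built from the unnormalized model log-density and the known noise log-density. The Gaussian noise choice is exactly the setting the title warns about; `log_model` and `log_noise_density` are placeholder callables, not functions from the paper.

```python
import numpy as np

def nce_loss(log_model, data, noise, log_noise_density):
    """Binary-classification form of NCE with equal numbers of data and noise samples.

    log_model(x): unnormalized model log-density (normalizing constant is a learned parameter).
    log_noise_density(x): log-density of the known noise distribution, e.g. a Gaussian.
    """
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)

    # Classifier logit is the log-density ratio log p_theta(x) - log q(x).
    logits_data = log_model(data) - log_noise_density(data)
    logits_noise = log_model(noise) - log_noise_density(noise)
    # Data should be classified as "real" (label 1), noise as "fake" (label 0).
    return -(log_sigmoid(logits_data).mean() + log_sigmoid(-logits_noise).mean())
```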

Contrasting the landscape of contrastive and non-contrastive learning

1 code implementation 29 Mar 2022 Ashwini Pokle, Jinjin Tian, Yuchen Li, Andrej Risteski

Some recent works however have shown promising results for non-contrastive learning, which does not require negative samples.

Contrastive Learning

Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions

no code implementations 27 Mar 2022 Binghui Peng, Andrej Risteski

When the features are linear, we design an efficient gradient-based algorithm $\mathsf{DPGD}$, that is guaranteed to perform well on the current environment, as well as avoid catastrophic forgetting.

Continual Learning

Masked prediction tasks: a parameter identifiability view

no code implementations 18 Feb 2022 Bingbin Liu, Daniel Hsu, Pradeep Ravikumar, Andrej Risteski

This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour.

Self-Supervised Learning

Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods

no code implementations 17 Feb 2022 Frederic Koehler, Holden Lee, Andrej Risteski

We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest.

Variational Inference

Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization

2 code implementations 14 Feb 2022 Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.

Domain Generalization, Out-of-Distribution Generalization, +1

Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias

1 code implementation ICLR 2022 Frederic Koehler, Viraj Mehta, Chenghui Zhou, Andrej Risteski

Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold.

Universal Approximation Using Well-Conditioned Normalizing Flows

no code implementations NeurIPS 2021 Holden Lee, Chirag Pabbaraju, Anish Prasad Sevekari, Andrej Risteski

As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows?
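
For context on the object discussed above, an affine coupling layer transforms half of the coordinates with a scale and shift computed from the other half, so the Jacobian is triangular and its log-determinant is the sum of the log-scales. A minimal NumPy sketch follows, with arbitrary small linear maps standing in for the scale and shift networks (these placeholders are assumptions, not the paper's construction).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W_s = 0.1 * rng.standard_normal((d // 2, d // 2))   # toy "scale" network
W_t = 0.1 * rng.standard_normal((d // 2, d // 2))   # toy "shift" network

def coupling_forward(x):
    """y1 = x1,  y2 = x2 * exp(s(x1)) + t(x1); returns y and log|det Jacobian|."""
    x1, x2 = x[: d // 2], x[d // 2 :]
    s, t = np.tanh(W_s @ x1), W_t @ x1
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2]), s.sum()

def coupling_inverse(y):
    y1, y2 = y[: d // 2], y[d // 2 :]
    s, t = np.tanh(W_s @ y1), W_t @ y1
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

x = rng.standard_normal(d)
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)          # exact invertibility
```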

Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation

no code implementations ICLR 2022 Bingbin Liu, Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

Noise-contrastive estimation (NCE) is a statistically consistent method for learning unnormalized probabilistic models.

The Effects of Invertibility on the Representational Complexity of Encoders in Variational Autoencoders

no code implementations ICML Workshop INNF 2021 Divyansh Pareek, Andrej Risteski

Training and using modern neural-network-based latent-variable generative models (like Variational Autoencoders) often require simultaneously training a generative direction along with an inferential (encoding) direction, which approximates the posterior distribution over the latent variables.

Universal Approximation for Log-concave Distributions using Well-conditioned Normalizing Flows

no code implementations ICML Workshop INNF 2021 Holden Lee, Chirag Pabbaraju, Anish Sevekari, Andrej Risteski

As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows?

Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments

no code implementations 18 Jun 2021 Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, Andrej Risteski

Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments.

Domain Generalization

The Limitations of Limited Context for Constituency Parsing

no code implementations ACL 2021 Yuchen Li, Andrej Risteski

Concretely, we ground this question in the sandbox of probabilistic context-free grammars (PCFGs), and identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to when forced to make parsing decisions.

Constituency Parsing, Language Modelling

Contrastive learning of strong-mixing continuous-time stochastic processes

no code implementations 3 Mar 2021 Bingbin Liu, Pradeep Ravikumar, Andrej Risteski

Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.

Contrastive Learning, Time Series, +1
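
As a concrete instance of the "classification task constructed from unlabeled data" mentioned above, here is a minimal InfoNCE-style contrastive loss in NumPy. This is a standard contrastive objective for illustration, not necessarily the exact construction analyzed for continuous-time processes.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor should score its own positive above all others in the batch.

    anchors, positives: arrays of shape (batch, dim), assumed to be paired row-by-row.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # correct pair sits on the diagonal
```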

Parametric Complexity Bounds for Approximating PDEs with Neural Networks

no code implementations NeurIPS 2021 Tanya Marwah, Zachary C. Lipton, Andrej Risteski

Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly escaping the curse of dimensionality.

An Online Learning Approach to Interpolation and Extrapolation in Domain Generalization

no code implementations 25 Feb 2021 Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

A popular assumption for out-of-distribution generalization is that the training data comprises sub-datasets, each drawn from a distinct distribution; the goal is then to "interpolate" these distributions and "extrapolate" beyond them -- this objective is broadly known as domain generalization.

Domain Generalization, Out-of-Distribution Generalization

The Risks of Invariant Risk Minimization

no code implementations ICLR 2021 Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

We furthermore present the very first results in the non-linear regime: we demonstrate that IRM can fail catastrophically unless the test data are sufficiently similar to the training distribution--this is precisely the issue that it was intended to solve.

Out-of-Distribution Generalization

Representational aspects of depth and conditioning in normalizing flows

no code implementations 2 Oct 2020 Frederic Koehler, Viraj Mehta, Andrej Risteski

Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point.

Efficient sampling from the Bingham distribution

no code implementations 30 Sep 2020 Rong Ge, Holden Lee, Jianfeng Lu, Andrej Risteski

We give an algorithm for exact sampling from the Bingham distribution $p(x)\propto \exp(x^\top A x)$ on the sphere $\mathcal S^{d-1}$ with expected runtime of $\operatorname{poly}(d, \lambda_{\max}(A)-\lambda_{\min}(A))$.

On Learning Language-Invariant Representations for Universal Machine Translation

no code implementations ICML 2020 Han Zhao, Junjie Hu, Andrej Risteski

The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for \emph{a small subset} of all pairs of languages.

Machine Translation, Sentence, +1

Fast Convergence for Langevin with Matrix Manifold Structure

no code implementations ICLR Workshop DeepDiffEq 2019 Ankur Moitra, Andrej Risteski

In this paper, we study one aspect of nonconvexity relevant for modern machine learning applications: existence of invariances (symmetries) in the function f, as a result of which the distribution p will have manifolds of points with equal probability.

Bayesian Inference

Fast Convergence for Langevin Diffusion with Manifold Structure

no code implementations 13 Feb 2020 Ankur Moitra, Andrej Risteski

In this paper, we focus on an aspect of nonconvexity relevant for modern machine learning applications: existence of invariances (symmetries) in the function f, as a result of which the distribution p will have manifolds of points with equal probability.

Bayesian Inference

Benefits of Overparameterization in Single-Layer Latent Variable Generative Models

no code implementations 25 Sep 2019 Rares-Darius Buhai, Andrej Risteski, Yoni Halpern, David Sontag

One of the most surprising and exciting discoveries in supervised learning was the benefit of overparameterization (i.e., training a very large model) to improving the optimization landscape of a problem, with minimal effect on statistical performance (i.e., generalization).

Variational Inference

Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models

1 code implementation ICML 2020 Rares-Darius Buhai, Yoni Halpern, Yoon Kim, Andrej Risteski, David Sontag

One of the most surprising and exciting discoveries in supervised learning was the benefit of overparameterization (i.e., training a very large model) to improving the optimization landscape of a problem, with minimal effect on statistical performance (i.e., generalization).

Variational Inference

Sum-of-squares meets square loss: Fast rates for agnostic tensor completion

no code implementations 30 May 2019 Dylan J. Foster, Andrej Risteski

In agnostic tensor completion, we make no assumption on the rank of the unknown tensor, but attempt to predict unknown entries as well as the best rank-$r$ tensor.

Matrix Completion

The Comparative Power of ReLU Networks and Polynomial Kernels in the Presence of Sparse Latent Structure

no code implementations ICLR 2019 Frederic Koehler, Andrej Risteski

We give an almost-tight theoretical analysis of the performance of both neural networks and polynomials for this problem, as well as verify our theory with simulations.

Simulated Tempering Langevin Monte Carlo II: An Improved Proof using Soft Markov Chain Decomposition

no code implementations 29 Nov 2018 Rong Ge, Holden Lee, Andrej Risteski

Previous approaches rely on decomposing the state space as a partition of sets, while our approach can be thought of as decomposing the stationary measure as a mixture of distributions (a "soft partition").

Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective

no code implementations 22 Aug 2018 Vishesh Jain, Frederic Koehler, Andrej Risteski

More precisely, we show that the mean-field approximation is within $O((n\|J\|_{F})^{2/3})$ of the free energy, where $\|J\|_F$ denotes the Frobenius norm of the interaction matrix of the Ising model.

Approximability of Discriminators Implies Diversity in GANs

no code implementations ICLR 2019 Yu Bai, Tengyu Ma, Andrej Risteski

Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL divergence or the Wasserstein distance, indicating that the lack of diversity in GANs may be caused by the sub-optimality in optimization instead of statistical inefficiency.

Diversity

Representational Power of ReLU Networks and Polynomial Kernels: Beyond Worst-Case Analysis

no code implementations 29 May 2018 Frederic Koehler, Andrej Risteski

We give almost-tight bounds on the performance of both neural networks and low degree polynomials for this problem.

Do GANs learn the distribution? Some Theory and Empirics

no code implementations ICLR 2018 Sanjeev Arora, Andrej Risteski, Yi Zhang

Using this, evidence is presented that well-known GAN approaches do learn distributions of fairly low support.

Decoder

Theoretical limitations of Encoder-Decoder GAN architectures

no code implementations 7 Nov 2017 Sanjeev Arora, Andrej Risteski, Yi Zhang

Encoder-decoder GAN architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GAN setup, consisting of a small encoder deep net that maps data points to their succinct encodings.

Decoder

Provable benefits of representation learning

no code implementations 14 Jun 2017 Sanjeev Arora, Andrej Risteski

There is general consensus that learning representations is useful for a variety of reasons, e.g., efficient use of labeled data (semi-supervised learning), transfer learning, and understanding hidden structure of data.

Clustering, Representation Learning, +1

Extending and Improving Wordnet via Unsupervised Word Embeddings

no code implementations 29 Apr 2017 Mikhail Khodak, Andrej Risteski, Christiane Fellbaum, Sanjeev Arora

Our methods require very few linguistic resources, thus being applicable for Wordnet construction in low-resource languages, and may further be applied to sense clustering and other Wordnet improvements.

Clustering, Word Embeddings

Automated WordNet Construction Using Word Embeddings

1 code implementation WS 2017 Mikhail Khodak, Andrej Risteski, Christiane Fellbaum, Sanjeev Arora

To evaluate our method we construct two 600-word testsets for word-to-synset matching in French and Russian using native speakers and evaluate the performance of our method along with several other recent approaches.

Information Retrieval, Machine Translation, +3

On the ability of neural nets to express distributions

no code implementations 22 Feb 2017 Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, Sanjeev Arora

We take a first cut at explaining the expressivity of multilayer nets by giving a sufficient criterion for a function to be approximable by a neural network with $n$ hidden layers.

Provable learning of Noisy-or Networks

no code implementations 28 Dec 2016 Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables.

Tensor Decomposition, Topic Models

Algorithms and matching lower bounds for approximately-convex optimization

no code implementations NeurIPS 2016 Andrej Risteski, Yuanzhi Li

In recent years, a rapidly increasing number of practical applications require solving non-convex objectives, such as training neural networks, learning graphical models, and maximum likelihood estimation.

Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

no code implementations NeurIPS 2016 Yuanzhi Li, Yingyu Liang, Andrej Risteski

Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints.
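
For reference, the classical multiplicative (Lee-Seung) alternating updates for non-negative matrix factorization, minimizing $\|X - WH\|_F^2$. The paper analyzes an alternating-update scheme; this sketch shows the textbook variant rather than the specific algorithm with the stated recovery guarantee.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Factor non-negative X (m x n) into W (m x r) and H (r x n) with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Alternate multiplicative updates; each step keeps the factors non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

X = np.abs(np.random.default_rng(1).standard_normal((20, 15)))
W, H = nmf_multiplicative(X, r=4)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # relative reconstruction error
```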

Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods

no code implementations NeurIPS 2016 Yuanzhi Li, Andrej Risteski

The well-known maximum-entropy principle due to Jaynes, which states that given mean parameters, the maximum entropy distribution matching them is in an exponential family, has been very popular in machine learning due to its "Occam's razor" interpretation.
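
Concretely, the principle referenced above: among all distributions whose sufficient statistics $\phi$ match given mean parameters $\mu$, the entropy maximizer is an exponential family member,

$$
\max_{p:\ \mathbb{E}_p[\phi(x)] = \mu} H(p)
\quad\Longrightarrow\quad
p_\lambda(x) \;\propto\; \exp\big(\langle \lambda, \phi(x) \rangle\big),
$$

with $\lambda$ chosen so that $\mathbb{E}_{p_\lambda}[\phi(x)] = \mu$.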

How to calculate partition functions using convex programming hierarchies: provable bounds for variational methods

no code implementations 11 Jul 2016 Andrej Risteski

We make use of recent tools in combinatorial optimization: the Sherali-Adams and Lasserre convex programming hierarchies, in combination with variational methods to get algorithms for calculating partition functions in these families.

Combinatorial Optimization

Recovery guarantee of weighted low-rank approximation via alternating minimization

no code implementations 6 Feb 2016 Yuanzhi Li, Yingyu Liang, Andrej Risteski

We show that the properties only need to hold in an average sense and can be achieved by the clipping step.

Matrix Completion

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

1 code implementation TACL 2018 Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense.

Information Retrieval, Retrieval, +1

On some provably correct cases of variational inference for topic models

no code implementations NeurIPS 2015 Pranjal Awasthi, Andrej Risteski

The assumptions on the topic priors are related to the well known Dirichlet prior, introduced to the area of topic modeling by (Blei et al., 2003).

Clustering, Dictionary Learning, +2

Label optimal regret bounds for online local learning

no code implementations 7 Mar 2015 Pranjal Awasthi, Moses Charikar, Kevin A. Lai, Andrej Risteski

We resolve an open question from (Christiano, 2014b) posed in COLT'14 regarding the optimal dependency of the regret achievable for online local learning on the size of the label set.

Collaborative Filtering, Open-Ended Question Answering

A Latent Variable Model Approach to PMI-based Word Embeddings

4 code implementations TACL 2016 Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods.

Word Embeddings
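
As background for the PMI connection above, here is a minimal sketch of the standard PMI-based construction: count co-occurrences, form the positive PMI matrix, and take a truncated SVD. This illustrates the PMI objects that the paper's generative model explains, not the paper's own method; the toy counts and dimensions are placeholders.

```python
import numpy as np

def ppmi_embeddings(cooc, dim=50, eps=1e-12):
    """cooc: (V, V) symmetric word co-occurrence counts. Returns (V, dim) embeddings."""
    total = cooc.sum()
    p_ij = cooc / total                              # joint co-occurrence probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)            # marginal word probabilities
    pmi = np.log((p_ij + eps) / (p_i @ p_i.T + eps))
    ppmi = np.maximum(pmi, 0.0)                      # positive PMI
    U, S, _ = np.linalg.svd(ppmi)                    # low-rank factorization
    return U[:, :dim] * np.sqrt(S[:dim])

cooc = np.random.default_rng(0).integers(0, 20, size=(100, 100)).astype(float)
cooc = (cooc + cooc.T) / 2                           # toy symmetric counts
vecs = ppmi_embeddings(cooc, dim=10)
print(vecs.shape)
```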
