1 code implementation • 24 Mar 2025 • Yangjun Ruan, Neil Band, Chris J. Maddison, Tatsunori Hashimoto
We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step.
no code implementations • 14 Feb 2025 • Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective.
no code implementations • 18 Nov 2024 • Ayoub El Hanchi, Chris J. Maddison, Murat A. Erdogdu
Given a collection of feature maps indexed by a set $\mathcal{T}$, we study the performance of empirical risk minimization (ERM) on regression problems with square loss over the union of the linear classes induced by these feature maps.
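A minimal sketch of the procedure studied here, with illustrative names: fit least squares separately under each feature map and keep the hypothesis with the smallest empirical risk over the union.

```python
import numpy as np

def erm_over_union(feature_maps, X, y):
    """ERM with square loss over a union of linear classes: fit least
    squares under each feature map and return the hypothesis with the
    smallest empirical risk across the union."""
    best = None
    for phi in feature_maps:
        Phi = phi(X)                                 # n x d_t design matrix
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit
        risk = np.mean((Phi @ w - y) ** 2)           # empirical square loss
        if best is None or risk < best[0]:
            best = (risk, phi, w)
    return best  # (empirical risk, selected feature map, weights)

# Toy usage: a linear and a quadratic feature map.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)
risk, phi, w = erm_over_union([lambda X: X, lambda X: np.hstack([X, X**2])], X, y)
```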
no code implementations • 9 Jul 2024 • Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison
Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.
1 code implementation • 19 Jun 2024 • Honghua Dong, Qidong Su, Yubo Gao, Zhaoyu Li, Yangjun Ruan, Gennady Pekhimenko, Chris J. Maddison, Xujie Si
Large Language Models (LLMs) have become increasingly capable of handling diverse tasks with the aid of well-crafted prompts and the integration of external tools, but as task complexity rises, workflows involving LLMs can become complicated and thus challenging to implement and maintain.
no code implementations • 17 Jun 2024 • Ayoub El Hanchi, Chris J. Maddison, Murat A. Erdogdu
We study the problem of designing minimax procedures in linear regression under the quantile risk.
no code implementations • 11 Jun 2024 • Leonardo Cotta, Chris J. Maddison
Finally, we introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs.
1 code implementation • 3 Jun 2024 • Anvith Thudi, Chris J. Maddison
In our experiments, we found that MixMax matched or outperformed standard group DRO baselines; in particular, MixMax improved the performance of XGBoost over the only applicable baseline, data balancing, on variations of the ACSIncome and CelebA annotations datasets.
1 code implementation • 17 May 2024 • Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities.
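A toy illustration of the low-dimensional structure this refers to, using stand-in score data (a real analysis would use published benchmark results and compute estimates):

```python
import numpy as np

# Stand-in matrix of model-by-benchmark scores with planted low-rank
# structure; in practice a few principal directions ("capabilities")
# explain most of the variation across model families.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 3))                 # 3 latent capabilities
loadings = rng.normal(size=(3, 10))               # 10 benchmarks
scores = latent @ loadings + 0.1 * rng.normal(size=(30, 10))

centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
capabilities = U[:, :3] * S[:3]                   # low-dimensional capability space
print("variance explained by 3 components:", (S[:3]**2).sum() / (S**2).sum())
```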
no code implementations • 13 Feb 2024 • Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison
Identifying how much a model ${\widehat{p}}_{\theta}(Y|X)$ knows about the stochastic real-world process $p(Y|X)$ it was trained on is important to ensure it avoids producing incorrect or "hallucinated" answers or taking unsafe actions.
1 code implementation • 25 Sep 2023 • Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks.
4 code implementations • 12 Jun 2023 • George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson
In order to address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark.
no code implementations • 4 Oct 2022 • Daniel D. Johnson, Ayoub El Hanchi, Chris J. Maddison
We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.
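A rough sketch of the empirical recipe, assuming scikit-learn is available; the representations, kernel, and dimensions below are stand-ins, not the paper's exact setup.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import Ridge

# Apply Kernel PCA to (stand-in) contrastive representations, then fit a
# linear predictor on the resulting components.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 64))            # stand-in for learned representations
targets = Z[:, 0] + 0.1 * rng.normal(size=500)

kpca = KernelPCA(n_components=16, kernel="rbf", gamma=0.05)
components = kpca.fit_transform(Z)        # approximate top eigenfunctions

probe = Ridge(alpha=1.0).fit(components, targets)
print("linear probe R^2:", probe.score(components, targets))
```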
no code implementations • 27 Jun 2022 • Max B. Paulus, Giulia Zarpellon, Andreas Krause, Laurent Charlin, Chris J. Maddison
Cutting planes are essential for solving mixed-integer linear programs (MILPs), because they facilitate bound improvements on the optimal solution value.
2 code implementations • 4 Mar 2022 • Maxime Gasse, Quentin Cappart, Jonas Charfreitag, Laurent Charlin, Didier Chételat, Antonia Chmiela, Justin Dumouchelle, Ambros Gleixner, Aleksandr M. Kazachkov, Elias Khalil, Pawel Lichocki, Andrea Lodi, Miles Lubin, Chris J. Maddison, Christopher Morris, Dimitri J. Papageorgiou, Augustin Parjadis, Sebastian Pokutta, Antoine Prouvost, Lara Scavuzzo, Giulia Zarpellon, Linxin Yang, Sha Lai, Akang Wang, Xiaodong Luo, Xiang Zhou, Haohan Huang, Shengcheng Shao, Yuanming Zhu, Dong Zhang, Tao Quan, Zixuan Cao, Yang Xu, Zhewei Huang, Shuchang Zhou, Chen Binbin, He Minggui, Hao Hao, Zhang Zhiyu, An Zhiwu, Mao Kun
Combinatorial optimization is a well-established area in operations research and computer science.
1 code implementation • 17 Feb 2022 • Haonan Duan, Pashootan Vaezipoor, Max B. Paulus, Yangjun Ruan, Chris J. Maddison
While typical graph contrastive pre-training uses label-agnostic augmentations, our key insight is that many combinatorial problems have well-studied invariances, which allow for the design of label-preserving augmentations.
1 code implementation • 9 Feb 2022 • Valentin Villecroze, Harry J. Braviner, Panteha Naderian, Chris J. Maddison, Gabriel Loaiza-Ganem
Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours.
2 code implementations • ICLR 2022 • Yangjun Ruan, Yann Dubois, Chris J. Maddison
Machine learning systems often experience a distribution shift between training and testing.
Ranked #38 on Image Classification on ObjectNet (using extra training data)
1 code implementation • NeurIPS 2021 • Guy Lorberbom, Daniel D. Johnson, Chris J. Maddison, Daniel Tarlow, Tamir Hazan
To perform counterfactual reasoning in Structural Causal Models (SCMs), one needs to know the causal mechanisms, which provide factorizations of conditional distributions into noise sources and deterministic functions mapping realizations of noise to samples.
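For intuition, a minimal sketch of one such factorization for a categorical variable, using a Gumbel-Max mechanism; the values are illustrative, and the posterior inference over noise given an observed outcome that counterfactual reasoning requires is not shown.

```python
import numpy as np

# A Gumbel-Max causal mechanism: the conditional distribution of a
# categorical Y is factored into a shared noise source g and a
# deterministic argmax. Reusing the same g under a different condition
# yields a counterfactual outcome.
rng = np.random.default_rng(0)
g = -np.log(-np.log(rng.uniform(size=3)))         # one realization of the noise
log_p_y_given_x0 = np.log([0.7, 0.2, 0.1])        # illustrative conditionals
log_p_y_given_x1 = np.log([0.1, 0.2, 0.7])

factual = np.argmax(log_p_y_given_x0 + g)         # Y under X = x0
counterfactual = np.argmax(log_p_y_given_x1 + g)  # Y had X been x1, same noise
print(factual, counterfactual)
```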
no code implementations • NeurIPS Workshop ICBINB 2021 • Wouter Kool, Chris J. Maddison, Andriy Mnih
Training large-scale mixture of experts models efficiently on modern hardware requires assigning datapoints in a batch to different experts, each with a limited capacity.
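A sketch of the assignment constraint in question: top-1 routing where each expert can take at most `capacity` datapoints and overflow is dropped. Real systems differ (auxiliary load-balancing losses, second choices); names and tie-breaking here are illustrative.

```python
import numpy as np

def route_top1_with_capacity(logits, capacity):
    """Capacity-constrained top-1 expert assignment: each datapoint goes to
    its argmax expert unless that expert is full, in which case it is
    dropped (assignment -1)."""
    batch, n_experts = logits.shape
    load = np.zeros(n_experts, dtype=int)
    assignment = np.full(batch, -1, dtype=int)
    for i in np.argsort(-logits.max(axis=1)):   # most confident datapoints first
        expert = int(np.argmax(logits[i]))
        if load[expert] < capacity:
            assignment[i] = expert
            load[expert] += 1
    return assignment

rng = np.random.default_rng(0)
print(route_top1_with_capacity(rng.normal(size=(8, 2)), capacity=3))
```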
1 code implementation • NeurIPS 2021 • Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, Chris J. Maddison
Most data is automatically collected and only ever "seen" by algorithms.
Ranked #1 on Image Compression on Oxford-IIIT Pet Dataset (using extra training data)
no code implementations • 28 May 2021 • Xuechen Li, Chris J. Maddison, Daniel Tarlow
Source code spends most of its time in a broken or incomplete state during software development.
1 code implementation • ICLR Workshop Neural_Compression 2021 • Yangjun Ruan, Karen Ullrich, Daniel Severo, James Townsend, Ashish Khisti, Arnaud Doucet, Alireza Makhzani, Chris J. Maddison
Naively applied, our schemes would require more initial bits than the standard bits-back coder, but we show how to drastically reduce this additional cost with couplings in the latent space.
1 code implementation • 8 Feb 2021 • Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables.
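A sketch in the spirit of gradient-informed discrete samplers: propose a single bit flip with probabilities shaped by a first-order estimate of each flip's effect, then apply a Metropolis-Hastings correction. Details (and the paper's exact proposal) may differ; the model below is a toy Ising-style energy.

```python
import numpy as np

def gradient_informed_flip(x, f, grad_f, rng):
    """One flip proposal guided by grad_f, with MH correction.
    x is a {0,1} vector and f its unnormalized log-probability."""
    def proposal(x):
        delta = (1.0 - 2.0 * x) * grad_f(x)      # 1st-order flip estimates
        p = np.exp(delta / 2.0 - np.max(delta / 2.0))
        return p / p.sum()
    q = proposal(x)
    i = rng.choice(len(x), p=q)
    x_new = x.copy()
    x_new[i] = 1.0 - x_new[i]                    # flip coordinate i
    q_rev = proposal(x_new)
    accept = np.exp(f(x_new) - f(x)) * q_rev[i] / q[i]
    return x_new if rng.random() < min(1.0, accept) else x

# Toy Ising-style model: f(x) = x^T W x + b^T x on {0,1}^n.
rng = np.random.default_rng(0)
n = 10
W = rng.normal(scale=0.2, size=(n, n)); W = (W + W.T) / 2
b = rng.normal(size=n)
f = lambda x: x @ W @ x + b @ x
grad_f = lambda x: 2 * W @ x + b
x = rng.integers(0, 2, size=n).astype(float)
for _ in range(1000):
    x = gradient_informed_flip(x, f, grad_f, rng)
```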
5 code implementations • ICLR 2021 • Max B. Paulus, Chris J. Maddison, Andreas Krause
Gradient estimation in models with discrete latent variables is a challenging problem, because the simplest unbiased estimators tend to have high variance.
no code implementations • 7 Jul 2020 • Pashootan Vaezipoor, Gil Lederman, Yuhuai Wu, Chris J. Maddison, Roger Grosse, Sanjit A. Seshia, Fahiem Bacchus
In addition to step count improvements, Neuro# can also achieve orders of magnitude wall-clock speedups over the vanilla solver on larger instances in some problem families, despite the runtime overhead of querying the model.
1 code implementation • NeurIPS 2020 • Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, Chris J. Maddison
The Gumbel-Max trick is the basis of many relaxed gradient estimators.
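A minimal NumPy sketch of the trick and the softmax relaxation built on it; shapes and the temperature are illustrative, and the paper generalizes this construction well beyond the unstructured case.

```python
import numpy as np

def sample_gumbel(shape, rng):
    # Standard Gumbel(0, 1) samples via the inverse CDF.
    u = rng.uniform(low=1e-10, high=1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max(logits, rng):
    # Exact categorical sample: argmax of Gumbel-perturbed logits.
    return np.argmax(logits + sample_gumbel(logits.shape, rng))

def gumbel_softmax(logits, tau, rng):
    # Relaxation: replace the argmax with a temperature-controlled softmax,
    # giving a differentiable surrogate sample on the simplex.
    y = (logits + sample_gumbel(logits.shape, rng)) / tau
    e = np.exp(y - y.max())
    return e / e.sum()
```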
2 code implementations • 11 Oct 2019 • Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, George E. Dahl
In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent.
no code implementations • NeurIPS 2020 • Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow
A main benefit of DirPG algorithms is that they allow the insertion of domain knowledge in the form of upper bounds on return-to-go at training time, as in heuristic search, while still directly computing a policy gradient.
4 code implementations • NeurIPS 2019 • Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, Yee Whye Teh
We therefore endow VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space.
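A heavily simplified sketch of one ingredient: sampling in the tangent space at the origin and pushing through the exponential map of the Poincaré ball. The curvature convention follows Ganea et al.; the paper's wrapped normal additionally involves metric scaling and arbitrary means via Möbius addition, which are not shown.

```python
import numpy as np

def exp0(v, c=1.0):
    """Exponential map at the origin of the Poincaré ball with curvature -c
    (one common convention; other papers scale differently)."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def wrapped_sample_at_origin(dim, Sigma, rng, c=1.0):
    """Wrapped-normal-style sample: Gaussian noise in the tangent space at
    the origin, pushed through exp0 (sketch only)."""
    v = rng.multivariate_normal(np.zeros(dim), Sigma)
    return exp0(v, c)
```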
3 code implementations • ICLR 2019 • George Tucker, Dieterich Lawson, Shixiang Gu, Chris J. Maddison
Burda et al. (2015) introduced a multi-sample variational bound, IWAE, that is at least as tight as the standard variational lower bound and becomes increasingly tight as the number of samples increases.
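A minimal sketch of the bound in question, assuming log-weights log w_k = log p(x, z_k) - log q(z_k | x) have already been computed for K posterior samples (names illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound(log_w, axis=-1):
    """K-sample IWAE estimate log((1/K) * sum_k w_k), computed stably in
    log space. At K = 1 this reduces to the standard ELBO estimator, and
    the bound tightens in expectation as K grows."""
    K = log_w.shape[axis]
    return logsumexp(log_w, axis=axis) - np.log(K)

# Per-datapoint bounds from a (batch, K) array of log-weights:
rng = np.random.default_rng(0)
print(iwae_bound(rng.normal(size=(4, 50))).shape)   # (4,)
```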
4 code implementations • 13 Sep 2018 • Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet
Yet, crucially, the kinetic gradient map can be designed to incorporate information about the convex conjugate in a fashion that allows for linear convergence on convex functions that may be non-smooth or non-strongly convex.
18 code implementations • ICML 2018 • Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, S. M. Ali Eslami
Deep neural networks excel at function approximation, yet they are typically trained from scratch for each new function.
3 code implementations • ICML 2018 • Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, Yee Whye Teh
We provide theoretical and empirical evidence that using tighter evidence lower bounds (ELBOs) can be detrimental to the process of learning an inference network by reducing the signal-to-noise ratio of the gradient estimator.
3 code implementations • NeurIPS 2017 • Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, Yee Whye Teh
When used as a surrogate objective for maximum likelihood estimation in latent variable models, the evidence lower bound (ELBO) produces state-of-the-art results.
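A toy check of the ELBO this refers to, in a conjugate Gaussian model where the exact posterior is available, so the bound is tight and can be compared against the true log evidence (a generic sketch, not the paper's filtering objective):

```python
import numpy as np

# Model: z ~ N(0,1), x|z ~ N(z,1); with q(z|x) the exact posterior
# N(x/2, 1/2), the ELBO equals the true log evidence log N(x; 0, 2).
rng = np.random.default_rng(0)
x = 1.3
z = rng.normal(loc=x / 2, scale=np.sqrt(0.5), size=200_000)  # z ~ q(z|x)

log_p_x_given_z = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)
log_p_z = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
log_q = -((z - x / 2) ** 2) - 0.5 * np.log(2 * np.pi * 0.5)

elbo = np.mean(log_p_x_given_z + log_p_z - log_q)
log_evidence = -0.25 * x ** 2 - 0.5 * np.log(2 * np.pi * 2)
print(elbo, log_evidence)   # agree up to Monte Carlo error
```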
3 code implementations • NeurIPS 2017 • George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, Jascha Sohl-Dickstein
Learning in models with discrete latent variables is challenging due to high-variance gradient estimators.
no code implementations • 16 Mar 2017 • Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Arnaud Doucet, Andriy Mnih, Yee Whye Teh
The policy gradients of the expected return objective can react slowly to rare rewards.
5 code implementations • 2 Nov 2016 • Chris J. Maddison, Andriy Mnih, Yee Whye Teh
The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution.
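As a concrete instance of the refactoring described above, here is the standard Gaussian case (a generic sketch; the paper's contribution, the Concrete distribution, applies the same idea to discrete nodes):

```python
import numpy as np

def reparameterized_gaussian(mu, log_sigma, rng):
    """Refactor z ~ N(mu, sigma^2) into a deterministic, differentiable
    function of (mu, log_sigma) plus fixed-distribution noise:
    z = mu + exp(log_sigma) * eps with eps ~ N(0, 1)."""
    eps = rng.normal(size=np.shape(mu))   # randomness with fixed distribution
    return mu + np.exp(log_sigma) * eps
```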
1 code implementation • 20 Dec 2014 • Chris J. Maddison, Aja Huang, Ilya Sutskever, David Silver
The game of Go is more challenging than other board games, due to the difficulty of constructing a position or move evaluation function.
no code implementations • NeurIPS 2014 • Chris J. Maddison, Daniel Tarlow, Tom Minka
The problem of drawing samples from a discrete distribution can be converted into a discrete optimization problem.
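The finite case of this conversion is the Gumbel-Max trick, which the paper generalizes; a quick empirical check that the argmax of Gumbel-perturbed log-probabilities has exactly the target distribution:

```python
import numpy as np

# Perturb each log-probability with independent Gumbel noise and take the
# argmax; the resulting index is distributed according to p.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
log_p = np.log(p)

samples = [
    np.argmax(log_p - np.log(-np.log(rng.uniform(low=1e-12, high=1.0, size=3))))
    for _ in range(100_000)
]
print(np.bincount(samples) / len(samples))  # ~ [0.5, 0.3, 0.2]
```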
no code implementations • 2 Jan 2014 • Chris J. Maddison, Daniel Tarlow
We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans.
no code implementations • NeurIPS 2013 • Roger B. Grosse, Chris J. Maddison, Ruslan R. Salakhutdinov
Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and an intractable target distribution.
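A compact sketch of the AIS recipe the entry refers to, with geometric intermediate distributions and a random-walk Metropolis transition; all names and tuning choices are illustrative, not tuned for hard problems.

```python
import numpy as np

def ais_log_z(log_f0, sample_f0, log_f1, n_steps=200, n_chains=500,
              mh_scale=0.5, rng=None):
    """AIS estimate of log(Z1/Z0) using geometric intermediates
    log f_beta = (1 - beta) log f0 + beta log f1."""
    rng = rng or np.random.default_rng()
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = sample_f0(n_chains, rng)
    log_w = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # importance-weight increment for moving f_{b_prev} -> f_b
        log_w += (b - b_prev) * (log_f1(x) - log_f0(x))
        # one Metropolis step leaving f_b invariant
        prop = x + mh_scale * rng.normal(size=x.shape)
        log_ratio = ((1 - b) * (log_f0(prop) - log_f0(x))
                     + b * (log_f1(prop) - log_f1(x)))
        accept = np.log(rng.uniform(size=x.shape)) < log_ratio
        x = np.where(accept, prop, x)
    # E[w] = Z1/Z0; average the weights stably in log space
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

# Toy check: f0 = N(0,1) density (Z0 = 1) and f1 = exp(-x^2/8), whose true
# normalizer is sqrt(8*pi), so log(Z1/Z0) = 0.5*log(8*pi) ~ 1.612.
rng = np.random.default_rng(0)
log_f0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
log_f1 = lambda x: -x**2 / 8.0
est = ais_log_z(log_f0, lambda n, r: r.normal(size=n), log_f1, rng=rng)
print(est, 0.5 * np.log(8 * np.pi))
```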