no code implementations • 7 Mar 2023 • Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, Tengyu Ma
We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task.
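As a rough illustration of the setup described above, here is a minimal sketch of how an SUL-ICL prompt could be assembled; the toy exemplars and the foo/bar remapping are illustrative, not taken from the paper's data.

```python
# Minimal sketch of building a semantically-unrelated-label ICL (SUL-ICL) prompt.
# The exemplars and the foo/bar remapping below are illustrative only.

label_map = {"negative": "foo", "positive": "bar"}  # semantically unrelated labels

exemplars = [
    ("The movie was a complete waste of time.", "negative"),
    ("An absolute delight from start to finish.", "positive"),
]

def build_sul_icl_prompt(exemplars, query, label_map):
    """Format in-context exemplars with remapped labels, then append the query."""
    lines = []
    for text, label in exemplars:
        lines.append(f"Input: {text}\nLabel: {label_map[label]}")
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

print(build_sul_icl_prompt(exemplars, "I would happily watch it again.", label_map))
```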
no code implementations • 13 Feb 2023 • James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan
In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?".
no code implementations • 10 Feb 2023 • Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
The scaling of Transformers has driven breakthrough capabilities for language models.
Ranked #1 on Zero-Shot Transfer Image Classification on ObjectNet
no code implementations • 23 Nov 2022 • Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W. Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal
We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics.
1 code implementation • 15 Jul 2022 • Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, Balaji Lakshminarayanan
A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures.
2 code implementations • 1 May 2022 • Jeremiah Zhe Liu, Shreyas Padhy, Jie Ren, Zi Lin, Yeming Wen, Ghassen Jerfel, Zack Nado, Jasper Snoek, Dustin Tran, Balaji Lakshminarayanan
The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles.
1 code implementation • 7 Oct 2021 • James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek, Dustin Tran, Carlos Riquelme Ruiz, Rodolphe Jenatton
Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, often exhibit strong performance compared to individual models.
no code implementations • 6 Oct 2021 • Vincent Fortuin, Mark Collier, Florian Wenzel, James Allingham, Jeremiah Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou
Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications.
no code implementations • NeurIPS 2021 • Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael C. Mozer, Becca Roelofs
When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than 1% decrease in accuracy.
1 code implementation • NeurIPS 2021 • Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic
Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks.
3 code implementations • 7 Jun 2021 • Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Qixuan Feng, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Faris Sbahi, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jasper Snoek, Yarin Gal, Dustin Tran
In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
1 code implementation • 14 Mar 2021 • Martin Mladenov, Chih-Wei Hsu, Vihan Jain, Eugene Ie, Christopher Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov, Craig Boutilier
The development of recommender systems that optimize multi-turn interaction with users, and model the interactions of different agents (e.g., users, content providers, vendors) in the recommender ecosystem, has drawn increasing attention in recent years.
no code implementations • Approximate Inference (AABI) Symposium 2021 • Zelda E Mariet, Rodolphe Jenatton, Florian Wenzel, Dustin Tran
We seek to bridge the performance gap between batch ensembles (ensembles of deep networks with shared parameters) and deep ensembles on tasks which require not only predictions, but also uncertainty estimates for these predictions.
no code implementations • NeurIPS Workshop ICBINB 2020 • Jeremy Nixon, Balaji Lakshminarayanan, Dustin Tran
Ensemble methods have consistently reached the state of the art across predictive, uncertainty, and out-of-distribution robustness benchmarks.
no code implementations • ICLR 2021 • Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W. Dusenberry, Jasper Snoek, Balaji Lakshminarayanan, Dustin Tran
Ensemble methods which average over multiple neural network predictions are a simple approach to improve a model's calibration and robustness.
2 code implementations • ICLR 2021 • Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran
Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network.
3 code implementations • NeurIPS 2020 • Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton
Ensembles over neural network weights trained from different random initialization, known as deep ensembles, achieve state-of-the-art accuracy and calibration.
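As a reminder of the recipe the entry above refers to, here is a minimal deep-ensemble sketch; scikit-learn's MLPClassifier and the synthetic dataset are stand-ins for the much larger models and benchmarks used in the paper.

```python
# Minimal deep-ensemble sketch: train M copies of the same network from different
# random initializations and average their predicted class probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = []
for seed in range(5):  # M = 5 ensemble members, differing only in initialization
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    model.fit(X, y)
    ensemble.append(model)

# Predictive distribution = average of member probabilities.
probs = np.mean([m.predict_proba(X) for m in ensemble], axis=0)
preds = probs.argmax(axis=1)
print("ensemble accuracy:", (preds == y).mean())
```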
3 code implementations • NeurIPS 2020 • Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, Balaji Lakshminarayanan
Bayesian neural networks (BNN) and deep ensembles are principled approaches to estimate the predictive uncertainty of a deep learning model.
1 code implementation • ICML 2020 • Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, Dustin Tran
Bayesian neural networks (BNNs) demonstrate promising success in improving the robustness and uncertainty quantification of modern deep learning.
1 code implementation • EMNLP (spnlp) 2020 • Jason Lee, Dustin Tran, Orhan Firat, Kyunghyun Cho
In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared.
5 code implementations • ICLR 2020 • Yeming Wen, Dustin Tran, Jimmy Ba
We also apply BatchEnsemble to lifelong learning, where on Split-CIFAR-100, BatchEnsemble yields comparable performance to progressive neural networks while having much lower computational and memory costs.
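Below is a minimal sketch of the rank-1 weight-sharing idea behind BatchEnsemble for a single dense layer; the shapes and the ±1 initialization of the fast weights are illustrative assumptions.

```python
# Sketch of a BatchEnsemble dense layer: all ensemble members share one weight
# matrix W; member i is defined by the elementwise product W * outer(r_i, s_i),
# so only two small vectors per member are added on top of the shared parameters.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_members = 8, 4, 3

W = rng.normal(size=(d_in, d_out))                     # shared (slow) weights
r = rng.choice([-1.0, 1.0], size=(n_members, d_in))    # fast weights, input side
s = rng.choice([-1.0, 1.0], size=(n_members, d_out))   # fast weights, output side

def member_forward(x, i):
    # Equivalent to x @ (W * np.outer(r[i], s[i])), but computed without
    # materializing the per-member weight matrix.
    return ((x * r[i]) @ W) * s[i]

x = rng.normal(size=(5, d_in))                         # a small batch of inputs
outputs = np.stack([member_forward(x, i) for i in range(n_members)])
print(outputs.shape)  # (n_members, batch, d_out)
```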
no code implementations • 25 Sep 2019 • Marton Havasi, Jasper Snoek, Dustin Tran, Jonathan Gordon, José Miguel Hernández-Lobato
Variational inference (VI) is a popular approach for approximate Bayesian inference that is particularly promising for highly parameterized models such as deep neural networks.
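For reference, the objective that VI maximizes, in standard notation rather than the paper's:

```latex
% Variational inference maximizes the evidence lower bound (ELBO) over a
% tractable family q(z; \lambda), which is equivalent to minimizing
% KL(q(z; \lambda) \,\|\, p(z \mid x)):
\log p(x) \;\ge\; \mathrm{ELBO}(\lambda)
  \;=\; \mathbb{E}_{q(z;\lambda)}\big[\log p(x, z) - \log q(z; \lambda)\big].
```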
1 code implementation • 10 Jun 2019 • Michael W. Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, Andrew M. Dai
We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
2 code implementations • NeurIPS 2019 • Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, Ben Poole
While normalizing flows have led to significant advances in modeling high-dimensional continuous distributions, their applicability to discrete distributions remains unknown.
Ranked #16 on Language Modelling on Text8
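As a toy illustration of one ingredient of discrete flows, the sketch below shows an invertible modular-addition transform on a finite alphabet; the full bipartite and autoregressive flows in the paper compose such transforms with learned shift networks.

```python
# Toy illustration of the invertible transform used in discrete flows:
# y = (x + mu) mod K is a bijection on {0, ..., K-1}, so probability mass is
# permuted rather than reweighted, and no Jacobian term is needed.
import numpy as np

K = 5                                  # vocabulary size
x = np.array([0, 1, 2, 3, 4])
mu = np.array([3, 3, 1, 0, 2])         # shift; in a real flow this comes from a network

y = (x + mu) % K                       # forward transform
x_rec = (y - mu) % K                   # exact inverse
assert np.array_equal(x, x_rec)
print(y)
```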
2 code implementations • 2 Apr 2019 • Jeremy Nixon, Mike Dusenberry, Ghassen Jerfel, Timothy Nguyen, Jeremiah Liu, Linchuan Zhang, Dustin Tran
In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences.
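For context, here is a minimal sketch of the standard top-label, equal-width-bin expected calibration error that the study above takes as its starting point; the binning choices are exactly the knobs the paper varies.

```python
# Minimal expected calibration error (ECE): bin predictions by confidence of the
# top class, then average |accuracy - confidence| weighted by bin counts.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Tiny example with made-up predictions.
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
labels = np.array([0, 1, 1, 0])
print(expected_calibration_error(probs, labels))
```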
1 code implementation • 9 Mar 2019 • Matthew Hoffman, Pavel Sountsov, Joshua V. Dillon, Ian Langmore, Dustin Tran, Srinivas Vasudevan
Hamiltonian Monte Carlo is a powerful algorithm for sampling from difficult-to-normalize posterior distributions.
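A minimal HMC sketch for a standard-normal target, with a hand-written gradient and fixed step size; production samplers such as the one this paper builds on use autodiff and adaptive tuning.

```python
# Minimal Hamiltonian Monte Carlo: leapfrog integration plus a Metropolis correction.
import numpy as np

rng = np.random.default_rng(0)

def neg_log_prob(q):        # target: standard normal, up to a constant
    return 0.5 * q @ q

def grad_neg_log_prob(q):
    return q

def hmc_step(q, step_size=0.1, n_leapfrog=20):
    p = rng.normal(size=q.shape)                           # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * step_size * grad_neg_log_prob(q_new)    # half momentum step
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new -= step_size * grad_neg_log_prob(q_new)
    q_new += step_size * p_new
    p_new -= 0.5 * step_size * grad_neg_log_prob(q_new)    # final half step
    # Accept or reject to correct for discretization error.
    current_h = neg_log_prob(q) + 0.5 * p @ p
    proposed_h = neg_log_prob(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < current_h - proposed_h else q

samples, q = [], np.zeros(2)
for _ in range(1000):
    q = hmc_step(q)
    samples.append(q)
print(np.mean(samples, axis=0), np.var(samples, axis=0))
```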
1 code implementation • NeurIPS 2019 • Dustin Tran, Michael W. Dusenberry, Mark van der Wilk, Danijar Hafner
We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty.
2 code implementations • NeurIPS 2018 • Matthew D. Hoffman, Matthew J. Johnson, Dustin Tran
Deriving conditional and marginal distributions using conjugacy relationships can be time consuming and error prone.
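A standard Beta-Bernoulli example of the kind of conjugacy calculation the paper automates; the notation is generic rather than taken from the paper.

```latex
% Beta-Bernoulli conjugacy: with prior Beta(\alpha, \beta) and n Bernoulli
% observations containing k successes, the posterior stays in the Beta family.
p(\theta \mid x_{1:n})
  \;\propto\; \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}}
  \cdot \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}}
  \;=\; \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1},
\qquad\text{so}\qquad
p(\theta \mid x_{1:n}) = \mathrm{Beta}(\alpha+k,\ \beta+n-k),\quad k=\textstyle\sum_i x_i .
```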
1 code implementation • NeurIPS 2018 • Dustin Tran, Matthew Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous
For both a state-of-the-art VAE on 64x64 ImageNet and an Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips.
1 code implementation • NeurIPS 2018 • Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman
We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model.
Ranked #10 on Language Modelling on One Billion Word
2 code implementations • ICLR 2019 • Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, James Davidson
NCPs are compatible with any model that can output uncertainty estimates, are easy to scale, and yield reliable uncertainty estimates throughout training.
3 code implementations • ICLR 2018 • Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse
Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies.
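Below is a sketch of the Flipout-style perturbation for a single dense layer, written in plain NumPy; the Gaussian perturbation scale and shapes are illustrative assumptions.

```python
# Sketch of the Flipout trick: all examples in a minibatch share one sampled
# weight perturbation delta_W, but each example's perturbation is decorrelated
# by independent rank-1 sign flips, computed without forming per-example matrices.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 6, 3

W_mean = rng.normal(size=(d_in, d_out))
W_std = 0.1 * np.ones((d_in, d_out))
delta_W = W_std * rng.normal(size=(d_in, d_out))   # shared perturbation sample

s = rng.choice([-1.0, 1.0], size=(batch, d_in))    # per-example input-side signs
r = rng.choice([-1.0, 1.0], size=(batch, d_out))   # per-example output-side signs

x = rng.normal(size=(batch, d_in))
# Per-example pseudo-independent perturbation: ((x_n * s_n) @ delta_W) * r_n.
y = x @ W_mean + ((x * s) @ delta_W) * r
print(y.shape)  # (batch, d_out)
```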
no code implementations • 15 Feb 2018 • Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem.
Ranked #6 on Image Generation on ImageNet 32x32 (bpd metric)
no code implementations • ICLR 2018 • Dustin Tran, Yura Burda, Ilya Sutskever
We examine how learning from unaligned data can improve both the data efficiency of supervised tasks as well as enable alignments without any supervision.
no code implementations • NeurIPS 2017 • Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, David Blei
In this paper we propose CHIVI, a black-box variational inference algorithm that minimizes $D_{\chi}(p || q)$, the $\chi$-divergence from $p$ to $q$.
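For reference, the chi-squared divergence and the upper bound (CUBO) that makes its minimization tractable, written in generic notation that may differ from the paper's:

```latex
% Chi-squared divergence between the posterior p(z | x) and the approximation q(z):
D_{\chi^2}(p \,\|\, q) \;=\; \mathbb{E}_{q(z)}\!\left[\left(\frac{p(z \mid x)}{q(z)}\right)^{2}\right] - 1 .
% Since p(z | x) is intractable, one instead optimizes an upper bound on log p(x),
% whose minimization is equivalent to minimizing the divergence above:
\mathrm{CUBO}_2(q) \;=\; \tfrac{1}{2}\,\log \mathbb{E}_{q(z)}\!\left[\left(\frac{p(x, z)}{q(z)}\right)^{2}\right] \;\ge\; \log p(x).
```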
9 code implementations • 28 Nov 2017 • Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, Rif A. Saurous
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation.
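A minimal illustration of the library's style of API, in which distributions are first-class objects and bijectors transform them; the call names reflect current TensorFlow Probability and may differ in detail from the version described in the paper.

```python
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

normal = tfd.Normal(loc=0.0, scale=1.0)
print(normal.sample(3))          # draw samples
print(normal.log_prob(0.0))      # evaluate log density

# A log-normal built by pushing the normal through an Exp bijector.
log_normal = tfd.TransformedDistribution(distribution=normal, bijector=tfb.Exp())
print(log_normal.log_prob(1.0))
```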
no code implementations • ICLR 2018 • Dustin Tran, David M. Blei
For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density.
no code implementations • NeurIPS 2017 • Dustin Tran, Rajesh Ranganath, David M. Blei
Implicit probabilistic models are a flexible class of models defined by a simulation process for data.
no code implementations • 13 Jan 2017 • Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, David M. Blei
By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning.
no code implementations • 31 Oct 2016 • Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, David M. Blei
Probabilistic modeling is a powerful approach for analyzing empirical information.
no code implementations • NeurIPS 2016 • Rajesh Ranganath, Jaan Altosaar, Dustin Tran, David M. Blei
Though this divergence has been widely used, the resultant posterior approximation can suffer from undesirable statistical properties.
no code implementations • 29 Mar 2016 • Dustin Tran, Minjae Kim, Finale Doshi-Velez
Method of moment estimators exhibit appealing statistical properties, such as asymptotic unbiasedness, for nonconvex problems.
4 code implementations • 2 Mar 2016 • Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, David M. Blei
Probabilistic modeling is iterative.
no code implementations • 20 Nov 2015 • Dustin Tran, Rajesh Ranganath, David M. Blei
Variational inference is a powerful tool for approximate inference, and it has been recently applied for representation learning with deep generative models.
1 code implementation • 7 Nov 2015 • Rajesh Ranganath, Dustin Tran, David M. Blei
We study HVMs on a variety of deep discrete latent variable models.
1 code implementation • 22 Sep 2015 • Dustin Tran, Panos Toulis, Edoardo M. Airoldi
When the update is based on a noisy gradient, the stochastic approximation is known as standard stochastic gradient descent, which has been fundamental in modern applications with large data sets.
no code implementations • NeurIPS 2015 • Dustin Tran, David M. Blei, Edoardo M. Airoldi
We develop a general variational inference method that preserves dependency among the latent variables.
no code implementations • 10 May 2015 • Panos Toulis, Dustin Tran, Edoardo M. Airoldi
For statistical efficiency, AI-SGD employs averaging of the iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value.
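The sketch below illustrates only the iterate-averaging (Polyak-Ruppert) component on a toy least-squares problem; the implicit (proximal) update that gives AI-SGD its name is not shown, and the step-size schedule is an arbitrary choice.

```python
# Iterate averaging: run SGD and keep a running average of the iterates,
# reporting the average rather than the last iterate.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_avg = np.zeros(d)
for t in range(n):
    i = rng.integers(n)
    grad = (X[i] @ theta - y[i]) * X[i]         # stochastic gradient of squared error
    theta -= (1.0 / (t + 1) ** 0.6) * grad      # slowly decaying step size
    theta_avg += (theta - theta_avg) / (t + 1)  # running average of iterates

print("last iterate error:    ", np.linalg.norm(theta - theta_true))
print("averaged iterate error:", np.linalg.norm(theta_avg - theta_true))
```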
2 code implementations • 16 Dec 2014 • Aki Vehtari, Andrew Gelman, Tuomas Sivula, Pasi Jylänki, Dustin Tran, Swupnil Sahai, Paul Blomstedt, John P. Cunningham, David Schiminovich, Christian Robert
A common divide-and-conquer approach for Bayesian computation with big data is to partition the data, perform local inference for each piece separately, and combine the results to obtain a global posterior approximation.
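As a simplified illustration of the combine step under Gaussian approximations, the sketch below merges per-shard subposteriors by precision weighting while removing the redundant copies of the prior; the paper's EP-based algorithm iterates between local and global updates and is considerably more elaborate.

```python
# Toy combine step for divide-and-conquer Bayesian inference: if each shard's
# subposterior is approximated by a 1-D Gaussian (fit with the full prior), the
# global approximation multiplies the Gaussians and divides out K-1 priors.
import numpy as np

def combine_gaussians(means, precisions, prior_mean, prior_precision, k):
    """Combine K Gaussian subposteriors, removing K-1 redundant copies of the prior."""
    total_precision = sum(precisions) - (k - 1) * prior_precision
    total_mean = (sum(p * m for p, m in zip(precisions, means))
                  - (k - 1) * prior_precision * prior_mean) / total_precision
    return total_mean, total_precision

means = [1.1, 0.9, 1.05]       # subposterior means from three shards
precisions = [4.0, 5.0, 3.5]   # subposterior precisions (1 / variance)
print(combine_gaussians(means, precisions, prior_mean=0.0, prior_precision=1.0, k=3))
```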
no code implementations • 27 Nov 2014 • Dustin Tran
We develop a robust convex algorithm to select the regularization parameter in model selection.