1 code implementation • 30 Nov 2023 • Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine
Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations of multi-turn interaction, and cover a range of task properties and challenges relevant to improving reinforcement learning algorithms.
no code implementations • 9 Nov 2023 • Joey Hong, Sergey Levine, Anca Dragan
LLMs trained with supervised fine-tuning or "single-step" RL, as in standard RLHF, may struggle with tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction.
no code implementations • 31 Oct 2023 • Joey Hong, Anca Dragan, Sergey Levine
Theoretically, we show that standard offline RL algorithms conditioned on observation histories suffer from poor sample complexity, in accordance with the above intuition.
no code implementations • 26 Jul 2023 • Kensen Shi, Joey Hong, Manzil Zaheer, Pengcheng Yin, Charles Sutton
When writing programs, people can tackle a new complex task by decomposing it into smaller, more familiar subtasks.
no code implementations • 9 Dec 2022 • Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh
We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model.
no code implementations • 9 Dec 2022 • Joey Hong, Kush Bhatia, Anca Dragan
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
no code implementations • 8 Dec 2022 • Joey Hong, Aviral Kumar, Sergey Levine
This approach can be implemented in practice by conditioning the Q-function of existing conservative algorithms on the confidence level. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
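A minimal sketch of this idea in a tabular setting (the function name, the count-based penalty form, and the worst-case initialization are illustrative assumptions, not the paper's construction): the requested confidence level scales a pessimism term subtracted from the fitted Q-values, so higher confidence yields more conservative estimates.

```python
# Illustrative tabular sketch: fitted Q-iteration on offline data with a
# pessimism penalty scaled by the requested confidence level `delta`.
# The penalty sqrt(log(1/(1-delta)) / n) is an assumption for this sketch.
import numpy as np

def confidence_conditioned_q(dataset, n_states, n_actions, delta,
                             gamma=0.99, iters=100):
    counts = np.zeros((n_states, n_actions))
    for s, a, _, _ in dataset:
        counts[s, a] += 1
    # Count-based penalty: grows with confidence, shrinks with more data.
    bonus = np.sqrt(np.log(1.0 / (1.0 - delta)) / np.maximum(counts, 1.0))
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        backup = np.zeros_like(q)
        for s, a, r, s_next in dataset:
            backup[s, a] += r + gamma * q[s_next].max()
        q = backup / np.maximum(counts, 1.0) - bonus
        q[counts == 0] = -1.0 / (1.0 - gamma)  # unseen pairs: worst case
    return q

# Higher confidence should give lower (more conservative) value estimates.
data = [(0, 0, 1.0, 1), (0, 0, 0.5, 1), (1, 1, 0.0, 0)]
q90 = confidence_conditioned_q(data, n_states=2, n_actions=2, delta=0.90)
q99 = confidence_conditioned_q(data, n_states=2, n_actions=2, delta=0.99)
assert q99[0, 0] <= q90[0, 0]
```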
no code implementations • 12 Apr 2022 • Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
To answer this question, we characterize the properties of environments that allow offline RL methods to perform better than BC methods, even when only provided with expert data.
no code implementations • 7 Apr 2022 • Kensen Shi, Joey Hong, Manzil Zaheer, Pengcheng Yin, Charles Sutton
We first characterize several axes along which program synthesis methods should generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data.
no code implementations • 3 Feb 2022 • Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh
We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits.
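A compact simulation of the underlying model (all scales and the per-arm diagonal structure are assumptions for illustration; the paper's analysis of HierTS is more general): per-task arm means are drawn around a shared hyper-mean, and Gaussian conjugacy gives the exact hyper-posterior that the sampler draws from.

```python
# Sketch of hierarchical Thompson sampling in Gaussian bandits: sample the
# shared hyper-mean from its exact posterior, then sample this task's arm
# means conditioned on it, and play the argmax arm. All widths illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, n_tasks, horizon = 5, 20, 200
sigma_q, sigma_0, sigma = 1.0, 0.5, 1.0  # hyper-prior, task, and noise widths

mu_star = rng.normal(0.0, sigma_q, K)    # unknown shared hyper-mean
sums, pulls = np.zeros((n_tasks, K)), np.zeros((n_tasks, K))

for s in range(n_tasks):
    theta = rng.normal(mu_star, sigma_0)  # this task's true arm means
    for t in range(horizon):
        # Exact hyper-posterior per arm: a (task, arm) cell with n pulls and
        # sample mean ybar contributes precision 1 / (sigma_0^2 + sigma^2/n).
        prec = np.full(K, 1.0 / sigma_q**2)
        acc = np.zeros(K)
        for s2 in range(s + 1):
            n = pulls[s2]
            mask = n > 0
            w = 1.0 / (sigma_0**2 + sigma**2 / np.maximum(n, 1.0))
            prec[mask] += w[mask]
            acc[mask] += w[mask] * sums[s2, mask] / n[mask]
        mu = rng.normal(acc / prec, 1.0 / np.sqrt(prec))
        # Conditional posterior of this task's means given the sampled mu.
        n = pulls[s]
        p = 1.0 / sigma_0**2 + n / sigma**2
        m = (mu / sigma_0**2 + sums[s] / sigma**2) / p
        a = int(np.argmax(rng.normal(m, 1.0 / np.sqrt(p))))
        r = rng.normal(theta[a], sigma)
        sums[s, a] += r
        pulls[s, a] += 1
```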
no code implementations • 12 Nov 2021 • Joey Hong, Branislav Kveton, Manzil Zaheer, Mohammad Ghavamzadeh
We provide a unified view of all of these problems as learning to act in a hierarchical Bayesian bandit.
no code implementations • ICLR 2022 • Aviral Kumar, Joey Hong, Anikait Singh, Sergey Levine
In this paper, our goal is to characterize environments and dataset compositions where offline RL leads to better performance than BC.
no code implementations • 10 Jun 2021 • Joey Hong, Branislav Kveton, Manzil Zaheer, Mohammad Ghavamzadeh, Craig Boutilier
We study Thompson sampling (TS) in online decision making, where the uncertain environment is sampled from a mixture distribution.
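One way to make the mixture setting concrete (a toy sketch assuming a two-component Gaussian-mixture prior over arm means; the paper treats general mixtures): the posterior stays a mixture, so each component's Gaussian posterior updates in closed form while the component weights are re-weighted by predictive likelihood.

```python
# Thompson sampling under a Gaussian-mixture prior (toy numbers throughout).
import numpy as np

rng = np.random.default_rng(1)
K, M, sigma, tau = 3, 2, 1.0, 0.3
mix_w = np.array([0.5, 0.5])
mix_mu = np.array([[1.0, 0.0, 0.0],     # component 0: arm 0 is best
                   [0.0, 0.0, 1.0]])    # component 1: arm 2 is best

true_m = rng.choice(M, p=mix_w)
theta = rng.normal(mix_mu[true_m], tau)  # environment drawn from the mixture

log_w = np.log(mix_w)
post_mean, post_var = mix_mu.copy(), np.full((M, K), tau**2)

for t in range(500):
    w = np.exp(log_w - log_w.max())
    m = rng.choice(M, p=w / w.sum())                     # sample a component
    a = int(np.argmax(rng.normal(post_mean[m], np.sqrt(post_var[m]))))
    r = rng.normal(theta[a], sigma)
    # Re-weight components by the predictive likelihood of the reward.
    pred_var = post_var[:, a] + sigma**2
    log_w += -0.5 * ((r - post_mean[:, a])**2 / pred_var + np.log(pred_var))
    # Conjugate (Kalman-style) update of arm a inside every component.
    gain = post_var[:, a] / pred_var
    post_mean[:, a] += gain * (r - post_mean[:, a])
    post_var[:, a] *= 1.0 - gain
```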
no code implementations • 1 Dec 2020 • Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Mohammad Ghavamzadeh, Craig Boutilier
The key idea is to frame this problem as a latent bandit, where the prototypical models of user behavior are learned offline and the latent state of the user is inferred online from interactions with these models.
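A toy version of that online phase (the model table and noise scale are made up for illustration): with the prototypical models fixed, the agent Thompson-samples a latent user state from its belief, plays that state's best item, and updates the belief by Bayes' rule.

```python
# Latent-bandit sketch: offline-learned user models + online state inference.
import numpy as np

rng = np.random.default_rng(2)
# Rows = latent user types (learned offline), cols = mean reward per item.
models = np.array([[0.9, 0.2, 0.1],
                   [0.1, 0.8, 0.3],
                   [0.2, 0.1, 0.9]])
sigma = 0.5
true_state = rng.integers(len(models))
belief = np.ones(len(models)) / len(models)  # uniform prior over states

for t in range(100):
    s = rng.choice(len(models), p=belief)    # Thompson-sample a latent state
    a = int(np.argmax(models[s]))            # best item under that state
    r = rng.normal(models[true_state, a], sigma)
    # Bayes update of the belief using the Gaussian reward likelihood.
    belief *= np.exp(-0.5 * ((r - models[:, a]) / sigma) ** 2)
    belief /= belief.sum()
```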
no code implementations • 1 Dec 2020 • Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer
The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task.
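A compressed PyTorch sketch of that two-stage recipe (the shapes, the single-code bottleneck, and the straight-through estimator are simplifications; the paper's architecture differs):

```python
# Stage 1: train a discrete autoencoder on output sequences; its codes later
# serve as intermediate targets for the end-to-end sequence model (stage 2).
import torch
import torch.nn as nn

V, D, C, L = 100, 64, 16, 8   # vocab, hidden size, codebook size, seq length

class DiscreteAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Embedding(V, D), nn.Flatten(),
                                 nn.Linear(L * D, C))   # logits over codes
        self.dec = nn.Linear(C, L * V)

    def forward(self, y):
        soft = self.enc(y).softmax(-1)
        # Straight-through discretization: hard one-hot forward, soft backward.
        hard = torch.zeros_like(soft).scatter_(
            -1, soft.argmax(-1, keepdim=True), 1.0)
        code = hard + soft - soft.detach()
        return code, self.dec(code).view(-1, L, V)

ae = DiscreteAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
y = torch.randint(0, V, (32, L))          # a batch of output sequences
for step in range(100):                   # self-supervised reconstruction
    code, recon = ae(y)
    loss = nn.functional.cross_entropy(recon.transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (outline): freeze the autoencoder, label each training pair with its
# code index ae(y)[0].argmax(-1), and train the end-to-end model to predict
# that code first, then decode the final output sequence.
```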
no code implementations • NeurIPS 2020 • Joey Hong, Branislav Kveton, Manzil Zaheer, Yin-Lam Chow, Amr Ahmed, Craig Boutilier
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
no code implementations • 15 Jun 2020 • Joey Hong, Branislav Kveton, Manzil Zaheer, Yin-Lam Chow, Amr Ahmed
This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment.
no code implementations • CVPR 2019 • Joey Hong, Benjamin Sapp, James Philbin
We focus on the problem of predicting future states of entities in complex, real-world driving scenarios.
no code implementations • 4 Oct 2016 • Joey Hong, Chris Mattmann, Paul Ramirez
The evolution of the internet has created an abundance of unstructured data on the web, a significant part of which is textual.