1 code implementation • 25 Sep 2023 • Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks.
1 code implementation • 12 Apr 2023 • Silviu Pitis, Michael R. Zhang, Andrew Wang, Jimmy Ba
Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training.
2 code implementations • 3 Nov 2022 • Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers.
1 code implementation • 20 Oct 2022 • Silviu Pitis, Elliot Creager, Ajay Mandlekar, Animesh Garg
To this end, we show that (1) known local structure in the environment transitions is sufficient for an exponential reduction in the sample complexity of training a dynamics model, and (2) a locally factored dynamics model provably generalizes out-of-distribution to unseen states and actions.
2 code implementations • ICML 2020 • Silviu Pitis, Harris Chan, Stephen Zhao, Bradly Stadie, Jimmy Ba
What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks?
1 code implementation • NeurIPS 2020 • Silviu Pitis, Elliot Creager, Animesh Garg
Many dynamic processes, including common scenarios in robotic control and reinforcement learning (RL), involve a set of interacting subprocesses.
2 code implementations • ICLR 2020 • Silviu Pitis, Harris Chan, Kiarash Jamali, Jimmy Ba
When defining distances, the triangle inequality has proven to be a useful constraint, both theoretically--to prove convergence and optimality guarantees--and empirically--as an inductive bias.
1 code implementation • 27 Jan 2020 • Silviu Pitis, Michael R. Zhang
Instead, we assume that votes are independent but not necessarily identically distributed and that our ensembling algorithm has access to certain auxiliary information related to the underlying model governing the noise in each vote.
no code implementations • 9 Sep 2019 • Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves
We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps.
no code implementations • 8 Feb 2019 • Silviu Pitis
Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$.
no code implementations • 8 Feb 2019 • Silviu Pitis
This paper motivates and develops source traces for temporal difference (TD) learning in the tabular setting.