no code implementations • 29 May 2024 • Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot
The canonical element of such a dataset is, for instance, an LLM's response to a user's prompt, followed by the user's feedback such as a thumbs-up/down.
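For concreteness, a minimal sketch of one such dataset element; all field names are illustrative, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class FeedbackExample:
    """One element of a single-trajectory feedback dataset (field names are hypothetical)."""
    prompt: str      # the user's prompt
    response: str    # the LLM's sampled response
    thumbs_up: bool  # binary human feedback on the response

# Example record: a prompt/response pair with a thumbs-up.
example = FeedbackExample(
    prompt="Summarise the attached report in two sentences.",
    response="The report finds ...",
    thumbs_up=True,
)
```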
no code implementations • 24 May 2024 • Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Lior Shani, Ethan Liang, Craig Boutilier
Our embedding-aligned guided language (EAGLE) agent is trained to iteratively steer the LLM's generation towards optimal regions of the latent embedding space, w.r.t.
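The snippet cuts off mid-sentence, but the loop it describes — an agent nudging generation towards better regions of a latent embedding space — might look roughly like the toy sketch below. Every function here is a hypothetical stand-in, not the paper's method:

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text encoder (deterministic toy)."""
    digest = hashlib.sha256(text.encode()).digest()
    return np.frombuffer(digest, dtype=np.uint8).astype(float) / 255.0 - 0.5

def utility(z: np.ndarray) -> float:
    """Hypothetical criterion over the latent embedding space."""
    return -float(np.sum(z ** 2))

def generate(prompt: str, steering: np.ndarray) -> str:
    """Hypothetical LLM call conditioned on a steering signal."""
    return f"{prompt} [steering={np.round(steering, 2).tolist()}]"

# Iteratively steer generation towards higher-utility regions of the
# latent space by hill-climbing over the steering signal.
prompt = "Recommend a movie."
steering = np.zeros(4)
score = utility(embed(generate(prompt, steering)))
for _ in range(50):
    candidate = steering + 0.1 * rng.standard_normal(4)
    cand_score = utility(embed(generate(prompt, candidate)))
    if cand_score > score:
        steering, score = candidate, cand_score
```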
1 code implementation • 23 May 2024 • Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks.
no code implementations • 6 Oct 2023 • Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier
Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format.
no code implementations • 31 May 2023 • Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor
Despite the seeming success of contemporary grounded text generation systems, they often generate text that is factually inconsistent with their input.
Tasks: Abstractive Text Summarization, Natural Language Inference, +3 more
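The general recipe this entry points at — scoring generated summaries with an entailment model and using the score as a learning signal — can be sketched as follows; the scorer here is a toy stand-in, not the paper's model:

```python
def entailment_score(source: str, summary: str) -> float:
    """Hypothetical NLI stub: probability that `source` entails `summary`.

    A real system would call a trained entailment model here; this toy
    proxy just measures word overlap with the source.
    """
    src_words = {w.strip(".,") for w in source.lower().split()}
    words = [w.strip(".,") for w in summary.lower().split()]
    return sum(w in src_words for w in words) / max(len(words), 1)

def consistency_reward(source: str, summary: str) -> float:
    """Reward used to push the generator towards faithful output."""
    return entailment_score(source, summary)

source = "The committee approved the budget on Tuesday."
print(consistency_reward(source, "The budget was approved."))           # mostly grounded
print(consistency_reward(source, "The budget was rejected by voters.")) # less grounded
```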
no code implementations • 4 Feb 2023 • Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutilier
We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes contextual MDPs to non-Markovian settings in which contexts change over time.
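As a sketch of the object being defined (notation mine, not the paper's): a DCMDP augments an MDP with a context that evolves as a function of the history:

```latex
% Transitions and rewards depend on a context c_t that itself evolves
% with the history h_t = (c_0, s_0, a_0, ..., s_t, a_t):
s_{t+1} \sim P(\cdot \mid s_t, a_t, c_t), \qquad
r_t = r(s_t, a_t, c_t), \qquad
c_{t+1} \sim P_c(\cdot \mid h_t).
```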
1 code implementation • 30 May 2022 • Guy Tennenholtz, Nadav Merlis, Lior Shani, Shie Mannor, Uri Shalit, Gal Chechik, Assaf Hallak, Gal Dalal
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
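The snippet does not spell out the bounds' form; as a hedged sketch, parameter confidence sets in this style of analysis are typically self-normalized, along the lines of:

```latex
% With probability at least 1 - \delta, the estimate \hat{\theta}_t of the
% termination parameters \theta^* lies in an ellipsoid shaped by the
% design matrix V_t, for a suitable radius \beta_t(\delta):
\lVert \hat{\theta}_t - \theta^* \rVert_{V_t} \le \beta_t(\delta).
```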
no code implementations • 13 Feb 2021 • Lior Shani, Tom Zahavy, Shie Mannor
Finally, we implement a deep variant of our algorithm which shares some similarities with GAIL (Ho & Ermon, 2016), but where the discriminator is replaced with the costs learned by the OAL problem.
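For context, this is the standard apprenticeship-learning saddle point that such cost learning targets (standard formulation; notation mine):

```latex
% Find a policy whose expected cost matches the expert's against the
% worst-case cost function in a class C, where pi_E is the expert policy:
\min_{\pi} \max_{c \in \mathcal{C}} \;
  \mathbb{E}_{(s,a) \sim \pi}\big[ c(s,a) \big]
  - \mathbb{E}_{(s,a) \sim \pi_E}\big[ c(s,a) \big]
```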
1 code implementation • ICLR 2022 • Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
Overall, MDPO is derived from mirror descent (MD) principles, offers a unified view of a number of popular RL algorithms, and performs better than or on par with TRPO, PPO, and SAC on a number of continuous control tasks.
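For reference, a sketch of the mirror-descent policy update that MDPO instantiates, in its standard on-policy form with KL as the Bregman divergence:

```latex
% Maximise the advantage of the current policy pi_k while staying
% KL-close to it; t_k is the step size at iteration k:
\pi_{k+1} = \arg\max_{\pi} \;
  \mathbb{E}_{s \sim \rho_{\pi_k}} \Big[
    \mathbb{E}_{a \sim \pi}\big[ A^{\pi_k}(s, a) \big]
    - \tfrac{1}{t_k}\, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big)
  \Big]
```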
no code implementations • ICML 2020 • Yonathan Efroni, Lior Shani, Aviv Rosenberg, Shie Mannor
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
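The regret notion such bounds refer to is the standard episodic one (notation mine):

```latex
% Cumulative regret over K episodes: the value gap between the optimal
% policy and the learner's policy pi_k, measured at the initial state:
\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{*}(s_1) - V^{\pi_k}(s_1) \Big)
```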
no code implementations • 6 Sep 2019 • Lior Shani, Yonathan Efroni, Shie Mannor
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL), in which a surrogate problem that restricts consecutive policies to be 'close' to one another is iteratively solved.
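For concreteness, the surrogate problem TRPO solves at each iteration, in its standard statement:

```latex
% Maximise the surrogate advantage subject to an average-KL trust region
% around the current policy pi_k:
\max_{\pi} \; \mathbb{E}_{s \sim \rho_{\pi_k},\, a \sim \pi}\big[ A^{\pi_k}(s, a) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \rho_{\pi_k}}\big[ \mathrm{KL}\big( \pi_k(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \big] \le \delta
```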
no code implementations • 17 Dec 2018 • Mark Kozdoba, Edward Moroshko, Lior Shani, Takuya Takagi, Takashi Katoh, Shie Mannor, Koby Crammer
In the context of Multiple Instance Learning (MIL), we analyze the Single Instance (SI) learning objective.
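As a sketch of the SI objective being analyzed (standard MIL setup; notation mine): every instance inherits its bag's label, and a plain instance-level loss is minimized:

```latex
% Bags B_i with labels y_i; SI trains f as if each instance x in B_i
% carried the bag label y_i:
\min_{f} \; \sum_{i} \sum_{x \in B_i} \ell\big( f(x),\, y_i \big)
```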
1 code implementation • 13 Dec 2018 • Lior Shani, Yonathan Efroni, Shie Mannor
We then analyze properties of exploration-conscious optimal policies and characterize two general approaches to solving such criteria.
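One standard instantiation of such a criterion (a hedged sketch; notation mine) is the ε-greedy exploration-conscious objective: optimize the return of the policy that is actually executed, i.e., the ε-mixture of π with uniform exploration:

```latex
% Maximise return under the exploration-mixed policy that is actually run,
% where u denotes the uniform policy over actions:
\max_{\pi} \; J\big( \pi^{\epsilon} \big), \qquad
\pi^{\epsilon} = (1 - \epsilon)\, \pi + \epsilon\, u
```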