1 code implementation • 22 Apr 2024 • Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu
This paper addresses the issue of text quality within preference datasets by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method.
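For context, a minimal sketch of the standard pairwise DPO loss is shown below; the argument names and the value of beta are illustrative, not the paper's implementation.

```python
# Minimal sketch of the standard pairwise DPO loss (illustrative names and beta).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All arguments are tensors of per-response log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for the preferred response
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for the dispreferred response
    # Push the policy to rank the preferred response above the dispreferred one,
    # implicitly regularized toward the reference model by the log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```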
1 code implementation • 1 Apr 2024 • Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques.
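A rough sketch of such a regularized selection rule is given below; `reward_fn`, `ref_logprob_fn`, and the use of a reference log-probability as the proximity term are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of Best-of-N selection with a proximity term (illustrative form).
def regularized_best_of_n(candidates, reward_fn, ref_logprob_fn, beta=1.0):
    """Pick the candidate maximizing proxy reward plus a reference-proximity bonus."""
    def score(y):
        # The proximity bonus penalizes responses the reference model finds unlikely,
        # mitigating reward hacking on the proxy reward.
        return reward_fn(y) + beta * ref_logprob_fn(y)
    return max(candidates, key=score)
```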
1 code implementation • 22 Feb 2024 • Riku Togashi, Kenshi Abe, Yuta Saito
Typical recommendation and ranking methods aim to optimize user satisfaction, but they are often oblivious to their impact on the items (e.g., products, jobs, news, videos) and their providers.
no code implementations • 6 Feb 2024 • Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra
Decision Transformer (DT) uses supervised learning to optimize a policy that generates actions conditioned on a target return, and is thereby equipped with a mechanism to control the agent via that target return.
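As a rough illustration of return conditioning, the control loop below decrements a target return-to-go as rewards arrive; `model` and `env` are placeholders and do not reflect the paper's interfaces.

```python
# Illustrative control loop for a return-conditioned policy (DT-style).
def rollout(model, env, target_return, horizon):
    state = env.reset()
    states, actions, returns_to_go = [state], [], [target_return]
    for _ in range(horizon):
        # The model predicts the next action conditioned on the remaining target return.
        action = model.predict(states, actions, returns_to_go)
        state, reward, done, _ = env.step(action)
        actions.append(action)
        states.append(state)
        returns_to_go.append(returns_to_go[-1] - reward)  # remaining return to achieve
        if done:
            break
    return states, actions
```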
no code implementations • 15 Nov 2023 • Hakuei Yamada, Junpei Komiyama, Kenshi Abe, Atsushi Iwasaki
This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities.
1 code implementation • 9 Nov 2023 • Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe
MBR decoding selects, from a pool of hypotheses, the hypothesis with the least expected risk under a probability model according to a given utility function.
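A compact sketch of the generic MBR procedure follows; `utility` stands for any pairwise utility function (e.g., a similarity metric) and `references` are samples drawn from the model, so this shows only the selection rule, not the paper's method.

```python
# Generic MBR decoding: pick the hypothesis with maximum expected utility
# (equivalently, minimum expected risk) against model samples.
def mbr_decode(hypotheses, references, utility):
    def expected_utility(h):
        return sum(utility(h, r) for r in references) / len(references)
    return max(hypotheses, key=expected_utility)
```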
no code implementations • 13 Jul 2023 • Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
1 code implementation • 26 May 2023 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Atsushi Iwasaki
This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space and the gradient feedback may contain additive noise.
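The sketch below illustrates the idea on a two-player zero-sum matrix game using entropic mirror descent, with a perturbation that pulls each strategy toward a fixed anchor; the step size, perturbation strength, and keeping the anchor fixed are simplifications rather than the paper's exact algorithm.

```python
# Entropic mirror descent with an anchor-based payoff perturbation (illustrative).
import numpy as np

def perturbed_md(A, steps=10_000, eta=0.05, mu=0.1):
    """A is the row player's payoff matrix; the row player maximizes x @ A @ y."""
    n, m = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # mixed strategies
    anchor_x, anchor_y = x.copy(), y.copy()           # anchor strategies (kept fixed here)
    for _ in range(steps):
        gx = A @ y + mu * (anchor_x - x)              # perturbed payoff gradient, player 1
        gy = -A.T @ x + mu * (anchor_y - y)           # perturbed payoff gradient, player 2
        x = x * np.exp(eta * gx); x /= x.sum()        # entropic MD (multiplicative-weights) step
        y = y * np.exp(eta * gy); y /= y.sum()
    return x, y
```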
no code implementations • 2 May 2023 • Hiroaki Shiino, Kaito Ariu, Kenshi Abe, Riku Togashi
In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration.
no code implementations • 9 Sep 2022 • Riku Togashi, Kenshi Abe
However, the intrinsic nature of fairness destroys the separability of optimisation subproblems for users and items, which is an essential property of conventional scalable algorithms, such as implicit alternating least squares (iALS).
1 code implementation • 21 Aug 2022 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, Atsushi Iwasaki
This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings.
1 code implementation • 18 Jun 2022 • Kenshi Abe, Mitsuki Sakamoto, Atsushi Iwasaki
In this study, we consider a variant of the Follow the Regularized Leader (FTRL) dynamics in two-player zero-sum games.
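For reference, vanilla entropy-regularized FTRL dynamics in a matrix game look like the sketch below, where each player plays the softmax of its cumulative payoffs; the paper analyzes a variant of these dynamics, not this baseline.

```python
# Vanilla entropy-regularized FTRL in a two-player zero-sum matrix game.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ftrl_dynamics(A, steps=10_000, eta=0.05):
    """A is the row player's payoff matrix; the row player maximizes x @ A @ y."""
    n, m = A.shape
    cum_x, cum_y = np.zeros(n), np.zeros(m)   # cumulative payoffs of pure strategies
    for _ in range(steps):
        x, y = softmax(eta * cum_x), softmax(eta * cum_y)
        cum_x += A @ y                        # payoffs to player 1's pure strategies
        cum_y += -A.T @ x                     # payoffs to player 2's pure strategies
    return softmax(eta * cum_x), softmax(eta * cum_y)
```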
no code implementations • 2 Jun 2022 • Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang
In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of Monte Carlo Tree Search (MCTS) to online RL setups.
1 code implementation • 14 Feb 2022 • Kenshi Abe, Junpei Komiyama, Atsushi Iwasaki
Constructing a good search tree representation significantly boosts the performance of the proposed method.
no code implementations • 23 Oct 2020 • Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui
Based on the properties of the evaluation policy, we categorize off-policy evaluation (OPE) situations.
1 code implementation • 22 Oct 2020 • Kaito Ariu, Kenshi Abe, Alexandre Proutière
In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only.
no code implementations • 3 Oct 2020 • Masahiro Kato, Kei Nakagawa, Kenshi Abe, Tetsuro Morimura
To achieve this purpose, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics.
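As a concrete form, a Monte Carlo estimate of the expected quadratic utility objective is sketched below; the risk-aversion parameter `lam` is an illustrative choice, not the paper's setting.

```python
# Monte Carlo estimate of expected quadratic utility E[U(R)], with U(r) = r - (lam / 2) * r^2.
import numpy as np

def expected_quadratic_utility(returns, lam=1.0):
    """Higher lam penalizes the second moment more, i.e. more risk aversion."""
    returns = np.asarray(returns, dtype=float)
    return np.mean(returns - 0.5 * lam * returns ** 2)
```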
no code implementations • 4 Jul 2020 • Kenshi Abe, Yusuke Kaneko
The proposed estimators estimate exploitability, which is often used as a metric for determining how close a policy profile (i.e., a tuple of policies) is to a Nash equilibrium in two-player zero-sum games.
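To illustrate the quantity being estimated, exploitability of a profile (x, y) in a zero-sum normal-form (matrix) game can be computed as below; the paper's contribution is estimators of this quantity, not this direct computation.

```python
# Exploitability (NashConv) of a strategy profile in a zero-sum matrix game.
import numpy as np

def exploitability(A, x, y):
    """A is the row player's payoff matrix; the row player maximizes x @ A @ y."""
    value = x @ A @ y
    best_response_row = np.max(A @ y)   # best the row player could do against y
    best_response_col = np.min(x @ A)   # best the column player could do against x
    return (best_response_row - value) + (value - best_response_col)
```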
no code implementations • 18 Nov 2019 • Masahiro Nomura, Kenshi Abe
The aim of black-box optimization is to optimize an objective function within the constraints of a given evaluation budget.