1 code implementation • 1 Apr 2024 • Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding.
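As a rough orientation, BoN reduces to "sample, score, select". The sketch below assumes a hypothetical LLM sampler `generate` and reward model `score`; neither is taken from the paper's code.

from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Draw n candidate responses and return the one the reward model rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))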
no code implementations • 6 Feb 2024 • Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra
Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as the return.
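For reference, the return of a trajectory $\tau = (s_0, a_0, r_0, s_1, \dots)$ with discount factor $\gamma \in (0, 1]$ is $R(\tau) = \sum_{t \ge 0} \gamma^t r_t$ (the undiscounted cumulative reward is the case $\gamma = 1$).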
1 code implementation • 5 Jan 2024 • Yuu Jinnai, Kaito Ariu
Minimum Bayes-Risk (MBR) decoding is shown to be a powerful alternative to beam search decoding for a wide range of text generation tasks.
no code implementations • 9 Nov 2023 • Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe
MBR decoding selects, from a pool of hypotheses, the hypothesis with the least expected risk under a probability model, according to a given utility function.
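A minimal sketch of this selection rule, assuming the pool itself serves as the Monte Carlo sample from the model and `utility` is some assumed pairwise metric (e.g., a sentence-similarity score):

from typing import Callable, List

def mbr_decode(hypotheses: List[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the hypothesis with the highest expected utility against the pool
    (equivalently, the least expected risk)."""
    def expected_utility(h: str) -> float:
        return sum(utility(h, ref) for ref in hypotheses) / len(hypotheses)
    return max(hypotheses, key=expected_utility)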
no code implementations • 23 Aug 2023 • Po-An Wang, Kaito Ariu, Alexandre Proutière
We prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (this algorithm is referred to as the uniform sampling algorithm) on all instances, and that (ii) strictly outperforms this algorithm on at least one instance.
no code implementations • 18 Jun 2023 • Kaito Ariu, Alexandre Proutière, Se-Young Yun
To this end, we revisit instance-specific lower bounds on the expected number of misclassified items satisfied by any clustering algorithm.
no code implementations • 26 May 2023 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Atsushi Iwasaki
This paper addresses the problem of learning Nash equilibria in monotone games, where the gradient of the payoff functions is monotone in the strategy profile space and may be corrupted by additive noise.
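For concreteness, writing $v(\pi)$ for the concatenation of the players' payoff gradients, one common way to state the monotonicity assumption is $\langle v(\pi) - v(\pi'), \pi - \pi' \rangle \le 0$ for all strategy profiles $\pi, \pi'$ (the sign convention varies across papers).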
no code implementations • 2 May 2023 • Hiroaki Shiino, Kaito Ariu, Kenshi Abe, Riku Togashi
In this paper, we propose a safe online learning to rank (OLTR) algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration.
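Purely as an illustration of the exchange step this sentence describes (not the paper's algorithm; in particular, the safety mechanism is omitted), a one-slot swap looks like:

import random
from typing import List

def propose_exploratory_ranking(ranking: List[str],
                                unranked: List[str]) -> List[str]:
    """Swap one uniformly chosen ranked item for a uniformly chosen unranked item."""
    proposal = list(ranking)
    i = random.randrange(len(proposal))    # slot whose item is exchanged
    proposal[i] = random.choice(unranked)  # bring in an item from outside the ranking
    return proposal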
1 code implementation • 21 Aug 2022 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, Atsushi Iwasaki
This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings.
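For orientation, here is plain MWU on a two-player zero-sum matrix game; the mutation term that gives M2WU its last-iterate guarantee is deliberately not reproduced, since its exact form is the paper's contribution.

import numpy as np

def mwu(payoff: np.ndarray, steps: int = 10_000, eta: float = 0.05):
    """Simultaneous MWU: the row player maximizes x^T A y, the column player minimizes it."""
    m, n = payoff.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    for _ in range(steps):
        x_new = x * np.exp(eta * (payoff @ y))      # exponentiated gradient step for the row player
        y_new = y * np.exp(-eta * (payoff.T @ x))   # mirrored step for the column player
        x, y = x_new / x_new.sum(), y_new / y_new.sum()
    return x, y

Under full feedback, the iterates of this vanilla update are known to cycle around the equilibrium rather than converge to it, which is precisely the failure mode the mutation term is designed to remove.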
no code implementations • 12 Jan 2022 • Masahiro Kato, Kaito Ariu, Masaaki Imaizumi, Masahiro Nomura, Chao Qin
We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small.
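The Neyman allocation rule is simple enough to state in a few lines: sampling proportions are proportional to the arms' outcome standard deviations (two-armed case shown; the standard deviations are assumed known for the sketch).

def neyman_allocation(sigma1: float, sigma2: float) -> tuple:
    """Fraction of the budget given to each of two arms under Neyman allocation."""
    total = sigma1 + sigma2
    return sigma1 / total, sigma2 / total

# Example: sigma1 = 2.0, sigma2 = 1.0 allocates 2/3 and 1/3 of the samples.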
1 code implementation • 18 Nov 2021 • Junpei Komiyama, Kaito Ariu, Masahiro Kato, Chao Qin
We consider best arm identification in the multi-armed bandit problem.
no code implementations • 16 Sep 2021 • Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, Chao Qin
We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design.
1 code implementation • 26 Jun 2021 • Masahiro Kato, Kaito Ariu
We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016).
no code implementations • NeurIPS 2020 • Kaito Ariu, Narae Ryu, Se-Young Yun, Alexandre Proutière
Interestingly, our analysis reveals the relative weights of the different components of the regret: the component due to the constraint of never presenting the same item twice to the same user, the component due to learning the probability that users like items, and the component arising from learning the underlying structure.
no code implementations • 23 Oct 2020 • Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui
Based on the properties of the evaluation policy, we categorize off-policy evaluation (OPE) situations.
1 code implementation • 22 Oct 2020 • Kaito Ariu, Kenshi Abe, Alexandre Proutière
In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on only a few, say $s_0 \ll d$, of these features.
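One standard way to exploit such sparsity, shown purely as an illustration and not as the paper's algorithm, is to fit a Lasso estimate of the reward parameter and act greedily on it:

import numpy as np
from sklearn.linear_model import Lasso

def choose_arm(history_X: np.ndarray,     # past contexts, shape (t, d)
               history_r: np.ndarray,     # past rewards, shape (t,)
               arm_features: np.ndarray,  # current arms' feature vectors, shape (K, d)
               alpha: float = 0.1) -> int:
    """Greedy step of a Lasso-based sparse linear bandit (exploration omitted)."""
    theta_hat = Lasso(alpha=alpha).fit(history_X, history_r).coef_
    return int(np.argmax(arm_features @ theta_hat))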
no code implementations • 14 Oct 2019 • Kaito Ariu, Jungseul Ok, Alexandre Proutière, Se-Young Yun
The objective is to devise an algorithm with a minimal cluster recovery error rate.