Search Results for author: Kaito Ariu

Found 19 papers, 9 papers with code

Matroid Semi-Bandits in Sublinear Time

no code implementations • 28 May 2024 • Ruo-Chun Tzeng, Naoto Ohsaka, Kaito Ariu

We study the matroid semi-bandits problem, where at each round the learner plays a subset of $K$ arms from a feasible set, and the goal is to maximize the expected cumulative linear rewards.

Filtered Direct Preference Optimization

1 code implementation • 22 Apr 2024 • Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method.
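
The DPO objective this work builds on can be sketched as a logistic loss on the beta-scaled margin between the policy-vs-reference log-probability ratios of the chosen and rejected responses. The function below is the standard DPO formulation with toy log-probabilities and an illustrative `beta`, not the paper's filtered variant:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for one preference pair: a logistic loss on the
    beta-scaled margin of policy-vs-reference log-probability ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy log-probabilities where the policy has shifted mass toward the chosen
# response relative to the reference, so the loss falls below log(2),
# its value at zero margin.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.5)
```

Filtering the preference dataset changes which pairs this loss is computed on, not the loss itself.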

Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment

1 code implementation • 1 Apr 2024 • Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe

In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques.

Language Modelling
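
The selection rule can be sketched as maximizing reward minus a scaled proximity penalty over the N sampled candidates. Here `reward_fn`, `proximity_fn`, and `beta` are illustrative stand-ins, not the paper's exact objective (which uses, e.g., proximity to a reference policy):

```python
def rbon_select(candidates, reward_fn, proximity_fn, beta=1.0):
    """Pick the candidate maximizing reward minus a scaled proximity
    penalty, instead of plain Best-of-N's argmax of the reward alone."""
    return max(candidates, key=lambda y: reward_fn(y) - beta * proximity_fn(y))

# Toy setup: a proxy reward that favours long responses (reward hacking),
# countered by a penalty for drifting far from a length-2 reference.
best = rbon_select(
    ["a", "bb", "ccc"],
    reward_fn=len,
    proximity_fn=lambda y: abs(len(y) - 2),
    beta=2.0,
)
```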

Return-Aligned Decision Transformer

no code implementations • 6 Feb 2024 • Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra

Decision Transformer (DT) uses supervised learning to optimize a policy that generates actions conditioned on a target return, which gives it a built-in mechanism for controlling the agent via that target return.

Action Generation

Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding

1 code implementation • 5 Jan 2024 • Yuu Jinnai, Kaito Ariu

Minimum Bayes-Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding for a wide range of text generation tasks.

Image Captioning • Machine Translation • +3
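
The Monte Carlo MBR procedure both of these MBR papers start from can be sketched in a few lines: score each hypothesis by its mean utility against model samples used as pseudo-references, and return the argmax. The word-overlap utility below is a toy stand-in for metrics like BLEU or BLEURT:

```python
def mbr_decode(hypotheses, pseudo_references, utility):
    """Monte Carlo MBR: choose the hypothesis with the highest average
    utility against model samples used as pseudo-references."""
    def expected_utility(h):
        return sum(utility(h, r) for r in pseudo_references) / len(pseudo_references)
    return max(hypotheses, key=expected_utility)

# Toy utility: number of shared words between hypothesis and reference.
overlap = lambda h, r: len(set(h.split()) & set(r.split()))
choice = mbr_decode(
    ["the cat sat", "a dog ran"],
    ["the cat sat down", "the cat slept"],
    overlap,
)
```

The quadratic cost in the number of samples is what the hyperparameter-free speedup above targets.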

Model-Based Minimum Bayes Risk Decoding for Text Generation

2 code implementations • 9 Nov 2023 • Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe

MBR decoding selects, from a pool of hypotheses, the hypothesis with the least expected risk under a probability model according to a given utility function.

Decoder • Text Generation

On Universally Optimal Algorithms for A/B Testing

no code implementations • 23 Aug 2023 • Po-An Wang, Kaito Ariu, Alexandre Proutiere

For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the {\it uniform sampling} algorithm) in all instances, and that (ii) strictly outperforms uniform sampling on at least one instance.

Multi-Armed Bandits

Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model

no code implementations • 18 Jun 2023 • Kaito Ariu, Alexandre Proutiere, Se-Young Yun

To this end, we revisit instance-specific lower bounds on the expected number of misclassified items satisfied by any clustering algorithm.

Clustering • Stochastic Block Model

Adaptively Perturbed Mirror Descent for Learning in Games

1 code implementation • 26 May 2023 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Atsushi Iwasaki

This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space, potentially containing additive noise.

Exploration of Unranked Items in Safe Online Learning to Re-Rank

no code implementations • 2 May 2023 • Hiroaki Shiino, Kaito Ariu, Kenshi Abe, Riku Togashi

In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration.

Learning-To-Rank • Safe Exploration

Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games

1 code implementation • 21 Aug 2022 • Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, Atsushi Iwasaki

This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings.

Multi-agent Reinforcement Learning
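
The idea of stabilizing multiplicative weights with a mutation term can be illustrated with a generic sketch: a standard MWU step followed by a pull toward the uniform strategy. The step size `eta`, mutation rate `mu`, and starting strategies are illustrative, and this generic update is a sketch of the mechanism, not the paper's exact M2WU rule:

```python
import math

def mwu_with_mutation(payoffs, x, y, steps=2000, eta=0.05, mu=0.05):
    """Multiplicative weights with a mutation pull toward uniform in a
    two-player zero-sum normal-form game; payoffs is the row player's
    payoff matrix, x and y the players' initial mixed strategies."""
    n, m = len(payoffs), len(payoffs[0])
    for _ in range(steps):
        # expected payoff of each pure strategy against the opponent
        ux = [sum(payoffs[i][j] * y[j] for j in range(m)) for i in range(n)]
        uy = [-sum(payoffs[i][j] * x[i] for i in range(n)) for j in range(m)]
        x = [xi * math.exp(eta * u) for xi, u in zip(x, ux)]
        y = [yj * math.exp(eta * u) for yj, u in zip(y, uy)]
        sx, sy = sum(x), sum(y)
        # normalize, then mutate toward the uniform strategy
        x = [(1 - mu) * xi / sx + mu / n for xi in x]
        y = [(1 - mu) * yj / sy + mu / m for yj in y]
    return x, y

# Matching pennies: plain MWU cycles around the mixed equilibrium
# (1/2, 1/2); the mutation term damps the cycle toward it.
x, y = mwu_with_mutation([[1.0, -1.0], [-1.0, 1.0]], [0.9, 0.1], [0.2, 0.8])
```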

Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap

no code implementations • 12 Jan 2022 • Masahiro Kato, Kaito Ariu, Masaaki Imaizumi, Masahiro Nomura, Chao Qin

We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small.

Causal Inference
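
The Neyman allocation rule itself is simple: sample each arm in proportion to its reward standard deviation. A minimal sketch, with toy standard deviations and budget:

```python
def neyman_allocation(sigmas, budget):
    """Neyman allocation: split a sampling budget across arms in
    proportion to their reward standard deviations (Neyman, 1934)."""
    total = sum(sigmas)
    return [round(budget * s / total) for s in sigmas]

# The arm with twice the standard deviation receives twice the samples.
alloc = neyman_allocation([2.0, 1.0], 300)
```

In practice the standard deviations are unknown and must be estimated adaptively; the paper analyzes such a strategy in the small-gap regime.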

Rate-optimal Bayesian Simple Regret in Best Arm Identification

1 code implementation • 18 Nov 2021 • Junpei Komiyama, Kaito Ariu, Masahiro Kato, Chao Qin

We consider best arm identification in the multi-armed bandit problem.

Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling

no code implementations • 16 Sep 2021 • Kaito Ariu, Masahiro Kato, Junpei Komiyama, Kenichiro McAlinn, Chao Qin

We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design.

Decision Making • Experimental Design

The Role of Contextual Information in Best Arm Identification

1 code implementation • 26 Jun 2021 • Masahiro Kato, Kaito Ariu

We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016).

Regret in Online Recommendation Systems

no code implementations • NeurIPS 2020 • Kaito Ariu, Narae Ryu, Se-Young Yun, Alexandre Proutière

Interestingly, our analysis reveals the relative weights of the different components of regret: the component due to the constraint of not presenting the same item twice to the same user, the component due to learning the probability that users like items, and finally the component arising when learning the underlying structure.

Recommendation Systems

Thresholded Lasso Bandit

1 code implementation • 22 Oct 2020 • Kaito Ariu, Kenshi Abe, Alexandre Proutière

In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only.
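
The "thresholded" part of the algorithm is a support-screening step: after fitting a Lasso estimate of the reward parameter, keep only the features whose coefficients exceed a threshold in absolute value. The coefficients and threshold below are toy values; the paper's algorithm also adapts the threshold as more contexts are observed:

```python
def thresholded_support(lasso_estimate, threshold):
    """Support-screening step: keep the indices of features whose Lasso
    coefficient exceeds the threshold in absolute value."""
    return {i for i, b in enumerate(lasso_estimate) if abs(b) > threshold}

# d = 6 with true support {0, 3}: the small spurious coefficients that
# the Lasso leaves nonzero are screened out by the threshold.
support = thresholded_support([0.9, 0.02, -0.01, 1.3, 0.0, -0.04], 0.1)
```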
