We aim to guarantee each worker the largest possible share of the utility in her best possible stable matching.

In the strategic multi-armed bandit setting, when arms possess perfect information about the player's behavior, they can establish an equilibrium where (1) they retain almost all of their value and (2) they leave the player with a substantial (linear) regret.

For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{(d+2\beta)/(d+3\beta)})$, where $d$ is the dimension of the context space.

The non-clairvoyant scheduling problem has gained new interest within learning-augmented algorithms, where the decision-maker is equipped with predictions without any quality guarantees.

In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead.

The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments.

We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback.

We consider the problem of online allocation subject to a long-term fairness penalty.

We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, to some relevant pre-defined utility of a machine learning task, of aggregating an individual dataset to others.

This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret upper-bound in $\mathcal{O}(T^{\frac{2}{3}})$.
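
For readers unfamiliar with explore-then-commit, the following is a minimal single-player sketch of the idea. It is not the asynchronous multiplayer algorithm of Dakdouk (2022); the names `explore_then_commit`, `pull` and `n_explore` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def explore_then_commit(
    pull: Callable[[int], float],
    n_arms: int,
    horizon: int,
    n_explore: int,
) -> Tuple[int, float]:
    """Single-player ETC sketch (illustrative, not the multiplayer algorithm).

    `pull(arm)` is an assumed callback returning the reward of one pull.
    Returns the arm committed to and the cumulative reward collected.
    """
    if n_arms * n_explore > horizon:
        raise ValueError("exploration phase does not fit in the horizon")

    totals: List[float] = [0.0] * n_arms
    cumulative = 0.0

    # Exploration phase: pull every arm the same number of times.
    for arm in range(n_arms):
        for _ in range(n_explore):
            reward = pull(arm)
            totals[arm] += reward
            cumulative += reward

    # All arms were pulled equally often, so comparing totals compares means.
    best_arm = max(range(n_arms), key=lambda a: totals[a])

    # Commit phase: play the empirically best arm until the horizon.
    for _ in range(horizon - n_arms * n_explore):
        cumulative += pull(best_arm)

    return best_arm, cumulative
```

The exploration length `n_explore` controls the trade-off: too short and the wrong arm may be committed to, too long and exploration itself dominates the regret.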

Imperfect information games (IIG) are games in which each player only partially observes the current game state.

Due mostly to its application to cognitive radio networks, multiplayer bandits gained a lot of interest in the last decade.

In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters.

We study single-machine scheduling of jobs, each belonging to a job type that determines its duration distribution.

The workhorse of machine learning is stochastic gradient descent.
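
As a reminder of what stochastic gradient descent does, here is a minimal sketch on a one-dimensional least-squares problem. The function name `sgd_least_squares`, the step size and the number of steps are assumptions made for illustration, not choices taken from the paper.

```python
import random
from typing import List, Tuple


def sgd_least_squares(
    data: List[Tuple[float, float]],
    learning_rate: float = 0.05,
    n_steps: int = 200,
    seed: int = 0,
) -> float:
    """Fit y ~ w * x by SGD on the loss 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_steps):
        # Sample one data point uniformly at random (the "stochastic" part).
        x, y = rng.choice(data)
        # Gradient of the per-sample loss with respect to w.
        gradient = (w * x - y) * x
        w -= learning_rate * gradient
    return w
```

Each step uses the gradient of a single sampled loss term rather than of the full objective, which is what makes the method cheap per iteration.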

In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an $\varepsilon$-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.

Contextual bandit algorithms are widely used in domains where it is desirable to provide a personalized service by leveraging contextual information, which may contain sensitive information that needs to be protected.

We consider the problem of online linear regression in the stochastic setting.

In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions.

Finding an optimal matching in a weighted graph is a standard combinatorial problem.

Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model.

We introduce a new procedure to neuralize unsupervised Hidden Markov Models in the continuous case.

Current solutions either solve a behaviour cloning problem (which does not leverage the exploratory data) or a reinforced imitation learning problem (using a fixed cost function that discriminates available exploratory trajectories from expert ones).

In the centralized case, the number of accumulated packets remains bounded (i.e., the system is \textit{stable}) as long as the ratio between service rates and arrival rates is larger than $1$.

The global objective of inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories generated by (approximate) optimal policies.

A critical aspect of bandit methods is that they require observing the contexts (i.e., individual or group-level data) and rewards in order to solve the sequential problem.

If she accepts the proposal, she is busy for the duration of the task and obtains a reward that depends on the task duration.

On the other hand, this heuristic performs reasonably well in practice and it has sublinear, and even near-optimal, regret bounds in some very specific linear contextual and Bayesian bandit models.

We consider the quickest change detection problem where the parameters of both the pre- and post-change distributions are unknown, which prevents the use of classical simple hypothesis testing.

Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long-standing machine learning problem.

We consider the stochastic block model where connections between vertices are perturbed by some latent (and unobserved) random geometric graph.

Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side.

In the simple uni-dimensional and static setting, beliefs about the quality are known to converge to its true value.

In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic in the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family.

We introduce a new stochastic multi-armed bandit setting where arms are grouped inside ``ordered'' categories.

We provide the first algorithm robust to selfish players (a.k.a.

Expanding Non-Markovian Reward Decision Processes (NMRDP) into Markov Decision Processes (MDP) enables the use of state-of-the-art Reinforcement Learning (RL) techniques to identify optimal policies.

Studies on users of massive open online courses (MOOCs) discuss the existence of typical profiles and their impact on the learning process of the students.

By trying to minimize the $\ell^2$-loss $\mathbb{E} [\lVert\hat{\beta}-\beta^{\star}\rVert^2]$ the decision maker is actually minimizing the trace of the covariance matrix of the problem, which corresponds then to online A-optimal design.

We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making.

Strategic information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to increase some utility.

This can be recast as a specific stochastic optimization problem where the objective is to maximize the cumulative reward, or equivalently to minimize the regret.
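
The equivalence between the two objectives can be made concrete with a small sketch of the expected (pseudo-)regret. The benchmark, always playing the arm with the largest mean, is a constant, so maximizing the expected cumulative reward is the same as minimizing the quantity below. The function name `pseudo_regret` and its arguments are hypothetical.

```python
from typing import Sequence


def pseudo_regret(arm_means: Sequence[float], pulled_arms: Sequence[int]) -> float:
    """Expected regret of a sequence of pulls against the best fixed arm.

    `arm_means[a]` is the (assumed known, for illustration) mean reward of
    arm `a`; `pulled_arms` lists the arm chosen at each round.
    """
    best_mean = max(arm_means)
    # Each pull of arm a loses best_mean - arm_means[a] in expectation.
    return sum(best_mean - arm_means[a] for a in pulled_arms)
```

In practice the means are unknown to the learner; the function is only a way to evaluate a strategy after the fact in simulations.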

We improve the efficiency of algorithms for stochastic \emph{combinatorial semi-bandits}.

We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward.
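
The collision rule described above can be sketched for a single round as follows. The function name `collision_rewards` and the representation of choices as a list of arm indices are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def collision_rewards(
    choices: Sequence[int], arm_rewards: Sequence[float]
) -> List[float]:
    """Per-player rewards for one round under the collision model.

    `choices[p]` is the arm pulled by player p and `arm_rewards[a]` is the
    reward drawn by arm a this round. A player receives the arm's reward
    only if no other player pulled the same arm; otherwise she receives 0.
    """
    pulls_per_arm = Counter(choices)
    return [
        arm_rewards[arm] if pulls_per_arm[arm] == 1 else 0.0
        for arm in choices
    ]
```

For instance, with choices `[0, 1, 1, 2]` the two players on arm 1 collide and receive zero, while the players on arms 0 and 2 receive those arms' rewards.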

We consider the stochastic contextual bandit problem with additional regularization.

State-of-the-art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret").

Motivated by cognitive radio networks, we consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and collisions occur if one of them is pulled by several players at the same stage.

We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free.

When $K=2$ in the distribution-dependent case, the hardness of our setting reduces to that of a stochastic $2$-armed bandit: we prove that an upper bound of order $(\log T)/\Delta$ (up to $\log\log$ factors) on the regret can be achieved with no information on the demand curve.

We consider the problem where an agent wants to find a hidden object that is randomly located in some vertex of a directed acyclic graph (DAG) according to a fixed but possibly unknown distribution.

We assume that the probability of conversion associated with each action is unknown while the distribution of the conversion delay is known, distinguishing between the (idealized) case where the conversion events may be observed whatever their delay and the more realistic setting in which late conversions are censored.

In the classical multi-armed bandit problem, $d$ arms are available to the decision maker who pulls them sequentially in order to maximize his cumulative reward.

We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal.
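
For concreteness, the three estimators can be sketched in their textbook forms. The exact definitions used in the study may differ; the function names and the convention that `target_probs[i]` and `behavior_probs[i]` are the probabilities of the logged action under the evaluated and logging policies are assumptions made for illustration.

```python
from typing import List, Sequence


def empirical_average(rewards: Sequence[float]) -> float:
    """Plain mean of logged rewards; ignores the policy mismatch."""
    return sum(rewards) / len(rewards)


def _importance_weights(
    target_probs: Sequence[float], behavior_probs: Sequence[float]
) -> List[float]:
    # Ratio of target to behavior probability for each logged action.
    return [p / q for p, q in zip(target_probs, behavior_probs)]


def basic_importance_sampling(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    behavior_probs: Sequence[float],
) -> float:
    """Reweighted rewards divided by the sample size."""
    weights = _importance_weights(target_probs, behavior_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def normalized_importance_sampling(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    behavior_probs: Sequence[float],
) -> float:
    """Reweighted rewards divided by the sum of the weights."""
    weights = _importance_weights(target_probs, behavior_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)
```

The two importance sampling variants differ only in the normalization: dividing by the sample size versus dividing by the sum of the weights.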

We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback.

We introduce a way to quantify the dependency structure of the problem and design an algorithm that adapts to it.

The third is necessary: if it is not satisfied, the opponent can weakly exclude the target set.

The minimization of convex functions which are only available through partial and noisy information is a key methodological problem in many disciplines.

We demonstrate that, in the classical non-stochastic regret minimization problem with $d$ decisions, gains and losses to be respectively maximized or minimized are fundamentally different.

To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

We show that it is impossible, in general, to approach the best target set in hindsight and propose achievable though ambitious alternative goals.

In this paper, we analyze a generic algorithm scheme for sequential global optimization using Gaussian processes.

In this paper we provide primal conditions on a convex set to be approachable with partial monitoring.

We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate.

