previous theoretical work on contextual multi-armed bandits does not satisfy our technical goals, described below. Goals: In the stochastic multi-armed bandit problem, each arm is associated with an unknown payoff distribution that is fixed throughout the episode. Without the use of context, worst-case...
Multi-armed bandits with episode context can arise naturally, for example in computer Go, where context is used to bias move decisions made by a multi-armed bandit algorithm. The UCB1 algorithm for multi-armed bandits achieves worst-case regret bounded by O((Kn log n)^(1/2)). We seek to ...
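For concreteness, a minimal sketch of the UCB1 algorithm referenced above, which pulls the arm maximizing the empirical mean plus an exploration bonus. The `pull` callback and its reward range of [0, 1] are assumptions for illustration, not part of the original text:

```python
import math

def ucb1(pull, K, n):
    """Run UCB1 for n rounds over K arms; pull(a) returns a reward in [0, 1]."""
    counts = [0] * K   # number of times each arm has been pulled
    sums = [0.0] * K   # cumulative reward observed per arm
    total = 0.0
    for t in range(1, n + 1):
        if t <= K:
            a = t - 1  # initialization: pull each arm once
        else:
            # pick the arm maximizing empirical mean + sqrt(2 ln t / n_a)
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total, counts
```

With deterministic rewards, e.g. `ucb1(lambda a: [0.2, 0.9][a], K=2, n=1000)`, the exploration bonus shrinks as an arm is pulled, so the higher-payoff arm quickly dominates the pull counts; the bonus term is what yields the O((Kn log n)^(1/2)) worst-case regret mentioned above.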
Keywords: dynamic grouping; multi-armed bandits; exploration and exploitation; reinforcement learning; recommendation

1. Introduction

Reinforcement learning is a canonical formalism for studying how an agent learns to take optimal actions by repeated interactions with a stochastic environment [1]. Meanwhile, ...