Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope or internet advertisement, can be modelled as non-stationary bandit problems where the reward distributions change abruptly at unknown time instants. In this paper, we analyze two algorithms designed for this setting: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played. The proof relies on a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and we show that D-UCB and SW-UCB both match this lower bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing softmax-based methods such as EXP3.S.
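For reference, a sketch of the two index policies follows; the notation ($\gamma$, $\tau$, $\xi$, $B$) and constants are indicative only, assuming rewards bounded in $[0,B]$, and the precise definitions are those of the main text. D-UCB with discount factor $\gamma\in(0,1)$ plays the arm maximizing a discounted empirical mean plus an exploration bonus,
\[
\bar X_t(\gamma,i)=\frac{1}{N_t(\gamma,i)}\sum_{s=1}^{t}\gamma^{\,t-s}X_s(i)\,\mathbf{1}\{I_s=i\},\qquad
N_t(\gamma,i)=\sum_{s=1}^{t}\gamma^{\,t-s}\,\mathbf{1}\{I_s=i\},
\]
\[
I_{t+1}\in\arg\max_i\ \bar X_t(\gamma,i)+c_t(\gamma,i),\qquad
c_t(\gamma,i)=2B\sqrt{\frac{\xi\log n_t(\gamma)}{N_t(\gamma,i)}},\quad
n_t(\gamma)=\sum_i N_t(\gamma,i).
\]
SW-UCB is analogous, with the geometric discounting replaced by a hard window of size $\tau$: the sums run over $s=t-\tau+1,\dots,t$ and the logarithmic term becomes $\log\min(t,\tau)$.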