Optimism in Reinforcement Learning and Kullback-Leibler Divergence

Edition Number: 48
Date: September 2010
Place: Urbana-Champaign, USA
PageStart: 115
PageEnd: 122
Abstract: 

We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focusing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on classical benchmarks show a significantly improved behavior, particularly when the MDP has reduced connectivity. To support this observation, we provide elements of comparison between the two algorithms based on geometric considerations.
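
The inner step of KL-optimistic extended value iteration is a linear maximization over transition probability vectors lying within a KL ball around the empirical estimate. The sketch below solves that inner problem with a generic numerical solver (SciPy's SLSQP) rather than the paper's dedicated efficient routine; the function names, the constraint direction KL(p_hat || q), and the toy numbers are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def kl(p, q, eps=1e-12):
        # KL(p || q), restricted to the support of p.
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

    def kl_optimistic_transition(p_hat, values, radius):
        # Maximize q . values over probability vectors q with KL(p_hat || q) <= radius.
        # Generic numerical sketch of the optimism step; not the paper's algorithm.
        n = len(p_hat)
        constraints = [
            {"type": "eq", "fun": lambda q: np.sum(q) - 1.0},        # q is a distribution
            {"type": "ineq", "fun": lambda q: radius - kl(p_hat, q)}, # stay inside the KL ball
        ]
        res = minimize(
            lambda q: -np.dot(q, values),   # maximize the expected value of the successor state
            x0=p_hat.copy(),                # start from the empirical estimate (KL = 0, feasible)
            bounds=[(1e-9, 1.0)] * n,
            constraints=constraints,
            method="SLSQP",
        )
        return res.x

    # Example: optimism shifts probability mass toward the high-value successor state.
    p_hat = np.array([0.6, 0.3, 0.1])    # estimated transition probabilities (hypothetical)
    values = np.array([0.0, 1.0, 5.0])   # current value estimates of successor states
    q_opt = kl_optimistic_transition(p_hat, values, radius=0.05)
    print(q_opt, q_opt @ values)

Within extended value iteration, this maximization is carried out for each state-action pair at every iteration, so the efficiency of the dedicated solver studied in the paper matters in practice; the sketch above is only meant to make the optimization problem concrete.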

Arxiv Number: 1004.5229
Hal Number: 00476116