Planning in MDPs: beyond expectations

Journées MAS - August 2024 - Poitiers

Aurélien Garivier
Ecole Normale Supérieure de Lyon

Beyond Average Return in Markov Decision Processes

Alexandre Marthe ENSL

Claire Vernade Univ. Tübingen

Markov Decision Processes

Reinforcement Learning

  • TD-Gammon [Tesauro ’92-’95]: backgammon world champion
  • KnightCap [Baxter et al. ’98]: chess (2500 ELO)
  • Computer poker [Alberta, ’08…]
  • Computer go [MoGo ’06], [AlphaGo ’15, AlphaZero ’18]
  • Atari, Starcraft, etc. [DeepMind ’10 sqq]
  • Robotics: jugglers, acrobots, … [Schaal and Atkeson ’94 sqq]
  • Navigation: robot guide in the Smithsonian Museum [Thrun et al. ’99]
  • Elevator dispatching [Crites and Barto ’96]
  • Maintenance [Mahadevan et al. ’97]
  • Internet Packet Routing [Boyan and Littman ’93]
  • Task Scheduling [Zhang and Dietterich ’95]
  • Social Networks [Acemoglu and Ozdaglar ’10]
  • Yield Management, pricing [Gosavi ’10]
  • Load forecasting [Meyn ’10]
  • Recommendation systems [Afsar et al. ’21]
  • Protein structure prediction [Jumper et al. ’21]
  • Reinforcement learning with human feedback for LLMs [Ouyang et al. ’23]

Example: the Cliff Environment

(Cliffs of Moher, Ireland)

Example: Retail Store Management

  • during week \(t\), the (random) demand is \(U_t\) units.
  • on Monday morning you may order \(A_t\) additional units: they are delivered immediately, before the shop opens.
  • maintenance cost: \(h(s)\) for \(s\) units left in stock from the previous week
  • ordering cost: \(C(a)\) for \(a\) units ordered
  • sales profit: \(f(q)\) for \(q\) units sold
  • constraints:
    • the warehouse has a maximal capacity of \(M\) units
    • you cannot sell units that are not in stock

Semi-controlled Dynamical System

\(\mathcal{D} = (\mathcal{S}, \mathcal{A}, \mathcal{U}, \psi, r(\cdot, \cdot, \cdot))\)

  • \(\mathcal{S}\) = state space (discrete in this talk)

  • \(\mathcal{A}\) = action space (discrete in this talk)

  • \(\mathcal{U}\) = externality space (uncontrolled)

  • \(\psi(s, a, u)\) = transition function:
       choosing action \(a\) in state \(s\), if the externality is \(u\) then the next state is \(\psi(s, a, u)\)

  • \(r(\cdot, \cdot, \cdot)\) = reward function
      \(r(s, a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)

Session

  • initial state \(S_0=s_0\in\mathcal{S}\)
  • horizon H = total number of steps
  • policy \(\pi = (\pi_t)_{0\leq t<H}\)    where \(\pi_t: \mathcal{S}\to \mathcal{A}\)
  • externalities \(U_1, \dots, U_H\) are \(\mathcal{U}\)-valued random variables
  • Given a policy \(\pi\), for all \(t\in\{0, \dots, H-1\}\) the state evolves as \(S_{t+1} = \psi(S_t, \pi_t(S_t), U_{t+1})\)
  • cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)

Model: Markov Decision Process

\(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P(\cdot|\cdot, \cdot), r(\cdot, \cdot, \cdot))\)

  • \(\mathcal{S}\) = state space

  • \(\mathcal{A}\) = action space

  • \(P(\cdot|\cdot, \cdot)\) = transition kernel:
       \(P(s'|s,a)\) = probability, if you choose action \(a\), to jump from state \(s\) to state \(s'\)

  • \(r(\cdot, \cdot, \cdot)\) = reward function
       \(r(s,a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)

Session

  • initial state \(S_0=s_0\in\mathcal{S}\)
  • horizon H = total number of steps
  • policy \(\pi = (\pi_t)_{0\leq t<H}\)
       At time \(t\), the chosen action is \(\pi_t(S_t)\)
  • given policy \(\pi\), the trajectory \((S_t)_{0\leq t\leq H}\) is a (non-homogeneous) Markov chain with kernels \(P^{\pi_t}(s'|s) = P(s'|s, \pi_t(s))\)
  • (random) cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)
  • We write \(\mathbb{P}^\pi\) (resp. \(\mathbb{E}^\pi\)) for the law of this process (resp. its expectation)
  • stationary policy: \(\pi_t=\pi_0\) for all \(t\)
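To fix ideas, here is a minimal simulation sketch of such a session (an illustration, not part of the original slides), assuming the MDP is stored as numpy arrays `P` of shape `(S, A, S)` with `P[s, a, s'] = P(s'|s, a)`, `r` of shape `(S, A, S)`, and the policy as a list `pi` of `H` state-to-action arrays:

```python
# Minimal sketch: simulate one session and return the cumulated reward W_H.
import numpy as np

def rollout(P, r, pi, s0, H, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_states = P.shape[0]
    s, W = s0, 0.0
    for t in range(H):
        a = pi[t][s]                               # a = pi_t(S_t)
        s_next = rng.choice(n_states, p=P[s, a])   # S_{t+1} ~ P(.|S_t, a)
        W += r[s, a, s_next]                       # accumulate r(S_t, a, S_{t+1})
        s = s_next
    return W                                       # one realisation of W_H
```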

Markov Decision Process

Example: the Cliff Environment

  • random move with probability \(\epsilon\geq 0\)
  • reward: \(1\) per step spent in the goal state

Example: Retail Store Management

  • state = number of units in stock before the week starts \[\mathcal{S}= \{0, \dots, M\}\]

  • action = number of units ordered \[\mathcal{A}= \{0,\dots,M\}\]

  • dynamic: \(\displaystyle{S_{t+1} = \max\big(0, \min(M, S_t + A_t) - U_t \big)}\)

  • reward: \(\displaystyle{r(s, a, s') = - C(a) - h(s) + f\big(\min(M, s+a) - s'\big)}\)

  • possible policy: always order \(\mathbb{E}[U_t]\) units each week (?); see the tabulation sketch below
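As a concrete illustration (with arbitrary placeholder capacity, demand law and cost/profit functions, not values from the talk), this MDP can be tabulated as follows:

```python
# Build the transition kernel P and reward table R of the retail-store MDP.
# M, demand_pmf, h, C, f below are illustrative choices only.
import numpy as np

M = 10                                  # warehouse capacity
demand_pmf = np.full(6, 1 / 6)          # demand U_t uniform on {0, ..., 5}
h = lambda s: 0.1 * s                   # maintenance cost for s units in stock
C = lambda a: 1.0 * a                   # ordering cost for a units
f = lambda q: 2.0 * q                   # profit for q units sold

n = M + 1                               # states and actions: {0, ..., M}
P = np.zeros((n, n, n))
R = np.zeros((n, n, n))
for s in range(n):
    for a in range(n):
        stock = min(M, s + a)           # units available once the order is delivered
        for u, p in enumerate(demand_pmf):
            s_next = max(0, stock - u)  # S_{t+1} = max(0, min(M, S_t + A_t) - U_t)
            P[s, a, s_next] += p
            R[s, a, s_next] = -C(a) - h(s) + f(stock - s_next)
```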

Goals

  • Planning: knowing the model \(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P, r)\), find a policy \(\pi\) maximizing the cumulative reward \(W_H\) in expectation: \[\max_{\pi} \mathbb{E}^\pi\left[\sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)\right] \]
  • Learning: knowing only the structure of the model \(\mathcal{S}, \mathcal{A}\) but not the dynamics \(P\) nor the reward function \(r\), and having access to a simulator, play several sessions so as to find the best policy
    • regret minimization or best policy identification
    • trajectory or transition simulator
    • etc…

Classical Planning

Policy Evaluation

The value function \(\big(V^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\) is defined by \[V^{\pi}_t(s) = \mathbb{E}^\pi\left[\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \right| S_t = s\right]\] and can be computed recursively: for all \(s\in\mathcal{S}\),

  • \(V^\pi_{H}(s) = 0\)

  • \(\displaystyle{V^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s))\big(r(s, \pi_t(s), s') + V^{\pi}_{t+1}(s')\big)}\)
    \(\tiny{\hbox{since }V^{\pi}_{t}(s) = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + \mathbb{E}^{\pi}\big[ \textstyle\sum_{k=t+1}^{H-1} r(S_k, \pi_k(S_k), S_{k+1}) \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s\Big] = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + V^{\pi}_{t+1}(S_{t+1}) \,\Big|\, S_t = s\Big]}\)
    NB: \(V^{\pi}_{0}(s_0) = \mathbb{E}^{\pi}[W_H]\)

Operator formulation

The Bellman operator \(\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) is defined by \[\mathcal{T}^{\pi}v (s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big(r(s, \pi(s),s') + v(s')\big)\]

Backward induction on the expected cumulated reward:

Policy Evaluation Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • \(V \gets \mathcal{T}^\pi V\)
  • return \(V(s_0)\)
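A direct sketch of this backward induction (illustrative code, same tabular conventions as before: arrays `P`, `r` of shape `(S, A, S)`, policy `pi` as a list of `H` state-to-action arrays):

```python
# Backward policy evaluation: V <- T^pi V, repeated H times.
import numpy as np

def policy_evaluation(P, r, pi, H):
    n_states = P.shape[0]
    V = np.zeros(n_states)                        # V_H = 0
    for t in range(H - 1, -1, -1):
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = pi[t][s]
            V_new[s] = P[s, a] @ (r[s, a] + V)    # (T^pi V)(s)
        V = V_new
    return V                                      # V[s0] = E^pi[W_H]
```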

Planning: Dynamic Programming

  • since only the last reward \(r\big(S_{H-1}, \pi_{H-1}(S_{H-1}), S_H\big)\) remains to be collected at time \(H-1\), \[V_{H-1}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) {\color{orange}{\big(}}r(s, a, s') {\color{orange}{+0\big)}}\] by choosing \(\displaystyle{\pi_{H-1}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) r(s, a, s') }\)

  • Thus, the best expected reward over the last two steps is \[V_{H-2}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) \] obtained by choosing \(\displaystyle{\pi_{H-2}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) }\)

Backward induction on the table

Example: Retail Store

Example: the Cliff Environment

The iterates may converge to a fixed point, but this is another story…

Bellman iterations:

  • Greedy action: \(\displaystyle{\mathcal{G}^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)}\)
  • Bellman optimal operator \(\mathcal{T}^*: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) defined by \[\begin{align*}\mathcal{T}^*v (s) &= {\color{orange}{\max_{a\in\mathcal{A}}}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)\\ & \scriptsize= \sum_{s'\in\mathcal{S}}P\big(s'|s,\mathcal{G}^*(s, v)\big) \big(r(s, \mathcal{G}^*(s, v),s') + v(s')\big) \end{align*}\]

Planning by Value Iteration:

Value Iteration Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, V)\)
    • \(V \gets \mathcal{T}^* V\)
  • return \(\pi\) and \(V(s_0)\)

The iterates may converge to a fixed point, but this is another story…
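A minimal sketch of this finite-horizon Value Iteration (illustrative code, same tabular conventions):

```python
# Finite-horizon Value Iteration: record the greedy policy, then apply T*.
import numpy as np

def value_iteration(P, r, H):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        # Q[s, a] = sum_{s'} P(s'|s, a) * (r(s, a, s') + V(s'))
        Q = np.einsum('saj,saj->sa', P, r + V[None, None, :])
        pi[t] = Q.argmax(axis=1)        # greedy action G*(s, V)
        V = Q.max(axis=1)               # (T* V)(s)
    return pi, V
```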

Distributional RL

Reward distribution

  • The expectation of \(W_H\) may not be a good decision criterion

  • Risk-aware RL: optimize other functionals of \(W_H\)

    • quantiles
    • conditional value-at-risk
    • etc…

Backward Induction is pretty general

  • probability of reaching a state
  • time spent in a state
  • higher moments of the total reward
  • etc…

Example: entropic utility

  • Def: If \(X\) is a \([a,b]\)-valued random variable, for all \(\beta\in\mathbb{R}\) \[U_\beta(X) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta X}\right]\]

  • Properties:

    • if \(\beta\leq \beta'\) then \(a\leq U_\beta(X) \leq U_{\beta'}(X) \leq b\)
    • \(\beta \mapsto U_\beta(X)\) is continuous with \(U_0(X) = \mathbb{E}[X]\)
    • Example: if \(X\sim\mathcal{N}(\mu, \sigma^2)\), then \(U_\beta(X) = \mu + \frac{\sigma^2}{2} \beta\)
    • More generally, \(U_\beta(X) = \mathbb{E}[X] + \frac{\mathbb{V}(X)}{2} \beta + o(\beta)\)
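A quick numerical check of these properties on a discrete distribution (the Bernoulli(1/2) example below is an arbitrary choice for illustration):

```python
# Entropic utility of a discrete distribution; U_0 is the expectation (by continuity).
import numpy as np

def entropic_utility(beta, values, probs):
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    if beta == 0.0:
        return float(probs @ values)
    return float(np.log(probs @ np.exp(beta * values)) / beta)

# U_beta is non-decreasing in beta and stays within [min X, max X] = [0, 1]
for beta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(beta, entropic_utility(beta, [0.0, 1.0], [0.5, 0.5]))
```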

\(\beta\)-entropies as risk-aware utilities

Computing entropic utilities of \(W_H\)

The \(\beta\)-value function \(\big(V^{\pi}_{\beta,t}\big)_{0\leq t\leq H}\) of a policy \(\pi\), defined by \[V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \mathbb{E}^\pi\left[\left.\exp\left(\beta \sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big)\right) \right| S_t = s\right],\] can also be computed recursively: for all \(s\in\mathcal{S}\),

  • \(V^{\pi}_{\beta,H}(s) = 0\)

  • \(\displaystyle{V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s)) \exp\Big(\beta\big(r(s, \pi_t(s), s') + V^{\pi}_{\beta,t+1}(s')\big)\Big)}\)
    \(\tiny{\hbox{since }\exp\big(\beta V^{\pi}_{\beta,t}(s)\big) = \mathbb{E}^\pi\Big[ \exp\big(\beta\, r(s, \pi_t(s), S_{t+1}) \big)\; \mathbb{E}^\pi\big[ \exp\big(\beta \textstyle\sum_{k=t+1}^{H-1} r(S_k, \pi_k(S_k), S_{k+1})\big) \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s\Big] = \mathbb{E}^\pi\Big[ \exp\Big(\beta\big( r(s, \pi_t(s), S_{t+1}) + V^{\pi}_{\beta,t+1}(S_{t+1})\big)\Big) \,\Big|\, S_t = s\Big]}\)

\(\beta\)-entropic Value Iteration

[R. A. Howard and J. E. Matheson. Risk-Sensitive Markov Decision Processes. 1972]

  • \(\beta\)-entropic greedy action: \(\displaystyle{\mathcal{G}_\beta^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta}\log}} \sum_{s'\in\mathcal{S}} P(s'|s, a)\; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)

  • \(\beta\)-entropic Bellman optimal operator: \(\displaystyle{\mathcal{T}_\beta^*v (s) = \max_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta} \log}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)

\(\beta\)-entropic Value Iteration Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}_\beta^*(s, V)\)
    • \(V \gets \mathcal{T}_\beta^* V\)
  • return \(\pi\) and \(V(s_0)\)
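A sketch of \(\beta\)-entropic Value Iteration (illustrative code, same tabular conventions); the greedy choice is made on the utility scale \(\frac{1}{\beta}\log(\cdot)\), which matters when \(\beta<0\):

```python
# beta-entropic Value Iteration (beta != 0); same tabular conventions as before.
import numpy as np

def entropic_value_iteration(P, r, H, beta):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        # Q[s, a] = (1/beta) log sum_{s'} P(s'|s, a) exp(beta (r(s, a, s') + V(s')))
        Q = np.log(np.einsum('saj,saj->sa', P,
                             np.exp(beta * (r + V[None, None, :])))) / beta
        pi[t] = Q.argmax(axis=1)        # beta-entropic greedy action
        V = Q.max(axis=1)               # (T_beta^* V)(s)
    return pi, V
```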

What else?

  • Standard VI uses the linearity of expectation \[ V_t^*(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s, a) \Big( r(s, a, s') + V_{t+1}^*(s') \Big)\]

  • entropic VI uses conditional indep. and \(e^{a+b} = e^a\,e^b\)

\[ V_{\beta,t}^*(s) = \max_{a\in\mathcal{A}} \frac{1}{\beta} \log \sum_{s'\in\mathcal{S}} P(s'|s, a) \exp\Big( \beta\big(r(s, a, s') + V_{\beta,t+1}^*(s')\big) \Big)\]

  • the proof does not work for other utilities (e.g., quantiles)

Propagating the distribution

Src: [Bellemare, Dabney and Rowland. Distributional Reinforcement Learning. 2023.]

Propagating the distribution

Distributional value function \(\big(\mathcal{V}^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\): \[\mathcal{V}^{\pi}_t(s) = \mathbb{P}^\pi\left(\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \in \cdot\; \right| S_t = s\right) \in \mathcal{M}_1(\mathbb{R})\]

It can be computed recursively: for all \(s\in\mathcal{S}\),

  • \(\mathcal{V}_{H}(s) = \delta_0\)

  • \(\displaystyle{\mathcal{V}^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s, \pi_t(s)\big)\big(\delta_{r(s, \pi_t(s), s')} * \mathcal{V}^{\pi}_{t+1}(s')\big)}\)
    where \(*\) is the convolution on \(\mathcal{M}_1(\mathbb{R})\)
    \(\mathcal{V}^{\pi}_{t}(s)\) is hence a mixture of Dirac measures

Example: Retail Store

Reference on Distributional RL

Operator formulation

Distributional Bellman operator \(\mathcal{T}^\pi: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) \[\mathcal{T}^{\pi}\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big( \delta_{r(s, \pi(s),s')} * \mathcal{V}(s')\big)\]

The law of the cumulated reward \(W_H\) under policy \(\pi\) can be computed recursively:

Distributional Value Iteration Algorithm

  • \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • \(\mathcal{V}\gets \mathcal{T}^\pi \mathcal{V}\)
  • return \(\mathcal{V}(s_0)\)
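A sketch of this recursion with each \(\mathcal{V}(s)\) stored as a dictionary mapping atoms of \(W\) to their probabilities (illustrative code; the number of atoms can grow quickly, which is why practical distributional RL relies on parametric representations):

```python
# Distributional policy evaluation: each V[s] is a dict {atom: probability}
# representing a mixture of Dirac measures; tabular conventions as before.
import numpy as np
from collections import defaultdict

def distributional_evaluation(P, r, pi, H):
    n_states = P.shape[0]
    V = [{0.0: 1.0} for _ in range(n_states)]           # V_H(s) = delta_0
    for t in range(H - 1, -1, -1):
        V_new = []
        for s in range(n_states):
            a = pi[t][s]
            mix = defaultdict(float)
            for s_next in range(n_states):
                for w, q in V[s_next].items():           # delta_{r(s,a,s')} * V(s')
                    mix[r[s, a, s_next] + w] += P[s, a, s_next] * q
            V_new.append(dict(mix))
        V = V_new
    return V                                             # V[s0] = law of W_H
```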

Idea: optimize any functional

  • The greedy action for the functional \(\Psi\) is defined by \[\mathcal{G}^*(s, \mathcal{V}) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \Psi\left(\sum_{s'\in\mathcal{S}} P(s'|s,a) \big(\delta_{r(s, a,s')} * \mathcal{V}(s')\big) \right)\]
  • The distributional Bellman optimal operator \(\mathcal{T}^*: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) for the functional \(\Psi\) is defined by \[\mathcal{T}^*\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\mathcal{G}^*(s, \mathcal{V})\big) \big(\delta_{r(s, \mathcal{G}^*(s, \mathcal{V}),s')} * \mathcal{V}(s')\big)\]

Distrib-VI

Distributional Value Iteration Algorithm for functional \(\Psi\)

  • \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, \mathcal{V})\)
    • \(\mathcal{V}\gets \mathcal{T}^* \mathcal{V}\)
  • return \(\pi\) and \(\mathcal{V}(s_0)\)
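A sketch of Distrib-VI for a generic functional \(\Psi\) acting on such atom dictionaries (illustrative code); with \(\Psi\) = expectation it recovers standard Value Iteration, and the question below is what it actually optimizes for other choices of \(\Psi\):

```python
# Distrib-VI for a functional Psi; distributions are dicts {atom: probability}.
import numpy as np
from collections import defaultdict

def one_step_mixture(P, r, V, s, a):
    """Law of r(s, a, S') + W' when S' ~ P(.|s, a) and W' ~ V[S']."""
    mix = defaultdict(float)
    for s_next in range(P.shape[0]):
        for w, q in V[s_next].items():
            mix[r[s, a, s_next] + w] += P[s, a, s_next] * q
    return dict(mix)

def distrib_vi(P, r, H, Psi):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = [{0.0: 1.0} for _ in range(n_states)]
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        V_new = []
        for s in range(n_states):
            candidates = [one_step_mixture(P, r, V, s, a) for a in range(n_actions)]
            best = max(range(n_actions), key=lambda a: Psi(candidates[a]))
            pi[t][s] = best                              # Psi-greedy action
            V_new.append(candidates[best])
        V = V_new
    return pi, V

expectation = lambda nu: sum(w * q for w, q in nu.items())   # Psi = E recovers standard VI
```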

Question

What does it optimize??

Understanding what functionals distrib-VI can optimize

  • expectation and \(\beta\)-entropies are optimizable

  • of course, all monotone transforms of optimizable functionals are optimizable

  • some utilities like quantiles are not optimized this way

  • we assume that the decision criterion \(\Psi\) is such that if \(\nu\) is stochastically dominated by \(\nu'\), then \(\Psi(\nu)\leq \Psi(\nu')\).

Ingredient 1 : reduction to utilities

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- "[λ] 0" --> B(11)
  A(00) -- "[λ] 0" --> B(11)
  A -- "[1-λ] 0" --> C(22)
  A -- "[1-λ] 0" --> C(22)

  B -- X --> D(22)
  B -- Y --> E(33)
  C -- Z --> F(44)
  C -- Z --> F(44)
  
  linkStyle 0,2,4,6 stroke:red,stroke-width:4px
  linkStyle 1,3,5,7 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;
  class F allnodes;

action 0
action 1
\(X\sim \nu_1\), \(Y\sim \nu_2\), \(Z\sim \nu'\)

  • For all \(\lambda\in[0,1]\) and all \(\nu_1, \nu_2, \nu'\in\mathcal{M}_1(\mathbb{R})\), \[ \Psi(\nu_1) \leq \Psi(\nu_2) \implies \Psi\big(\lambda \nu_1 + (1-\lambda)\nu'\big) \leq \Psi\big(\lambda \nu_2 + (1-\lambda)\nu' \big)\]

  • Von Neumann - Morgenstern theorem: there exists \(f:\mathbb{R}\to\mathbb{R}\) such that \[\Psi(\nu_1)\leq \Psi(\nu_2) \iff \int f\,d\nu_1 \leq \int f\,d\nu_2\]

\(\implies\) wlog we focus on utilities \(\displaystyle{\Psi(\nu) = \int f\,d\nu}\)

Proof of the VNM utility theorem

  • by induction, if \(\forall i, \Psi(\nu_i)=\Psi(\nu'_i)\) then \(\Psi\big(\sum p_i \nu_i\big) = \Psi\big(\sum p_i \nu'_i\big)\)

  • assume that \(\operatorname{Supp}(\nu) \subset \big[\underline{x}, \overline{x}\big]\).

  • for every \(x\in\big[\underline{x}, \overline{x}\big]\), by the stochastic domination property there exists \(f(x)\in[0,1]\) such that \[\Psi(\delta_x) = \Psi\Big(\big(1-f(x)\big)\delta_{\underline{x}} + f(x)\delta_{\overline{x}}\Big)\]

  • define \(u(\nu) = \int f\,d\nu\)

  • then \(\displaystyle{\Psi\big(\sum_i p_i\delta_{x_i}\big) = \Psi\Big(\sum_i p_i\Big(\big(1-f(x_i)\big)\delta_{\underline{x}} + f(x_i)\delta_{\overline{x}}\Big) \Big) = \Psi\Big(\big(1-u\big(\sum_i p_i\delta_{x_i}\big)\big)\delta_{\underline{x}} + u\big(\sum_i p_i\delta_{x_i}\big)\delta_{\overline{x}}\Big)}\)

  • hence \[\begin{align*} \Psi(\nu)\leq \Psi(\nu') &\iff \Psi\Big(\big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}}\Big)\leq \Psi\Big(\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\Big)\\ &\iff \big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}} \hbox{ is stochastically dominated by }\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\\ & \iff u(\nu)\leq u(\nu') \end{align*}\]

Ingredient 2: invariance by translation

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- x --> B(11)
  A(00) -- x --> B(11)
  B -- X --> C(22)
  B -- X --> D(22)
  B -- Y --> E(33)
  B -- Y --> F(44)
  linkStyle 0,2,3 stroke:red,stroke-width:4px
  linkStyle 1,4,5 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;
  class F allnodes;

action 0
action 1

For all \(x\in\mathbb{R}\) and all random variables \(X,Y\) \[ \Psi(X) \leq \Psi(Y) \implies \Psi(x+X) \leq \Psi(x+Y)\]

Hence, it is necessary that \[ \Psi(X) = \Psi(Y) \implies \Psi(x+X) = \Psi(x+Y)\]

Mixing the ingredients

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- x --> B(11)
  A(00) -- x --> B(11)
  B -- "η(h)" --> C(22)
  B -- "[1/2] h" --> D(33)
  B -- "[1/2] -h" --> E(44)
  linkStyle 0 stroke:magenta,stroke-width:4px,font-size:28pt
  linkStyle 0,2 stroke:red,stroke-width:4px
  linkStyle 1,3,4 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;

action 0
action 1

  • \(\Psi(\nu) = \int fd\nu\) with \(f\in\mathcal{C}^2(\mathbb{R})\) and \(f'(0)\neq 0\)

  • for \(h\) small enough, let \(\eta(h) = f^{-1}\left(\frac{f(h)+f(-h)}{2}\right) = \frac{\beta}{2} h^2 + o(h^2)\quad~\) where \(\beta=\frac{f''(0)}{f'(0)}\)

  • for all \(x\in\mathbb{R}\), \(f\big(\eta(h)\big) = \frac{1}{2}f(-h) + \frac{1}{2}f(h)\) and translation invariance imply \[\begin{align*} f(x + \eta(h)) &= \frac{1}{2} f(x+h) + \frac{1}{2} f(x-h)\\ f(x) + f'(x)\; \frac{\beta}{2} h^2 + o(h^2) &= f(x) + \frac{1}{2}f''(x)\; h^2 + o(h^2) \end{align*}\]

  • hence \(f'' = \beta f'\)

    • if \(\beta=0\), \(f(x) = \alpha x + c\) \(\quad \implies\) expectation
    • otherwise, \(f(x) = \alpha e^{\beta x}+c\) \(\quad \implies\) \(\beta\)-entropy
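A small numerical check of this characterization (illustrative code): for \(f(x)=e^{\beta x}\), the shift \(\eta(h)=f^{-1}\big(\frac{f(h)+f(-h)}{2}\big)\) works uniformly in \(x\), whereas for a generic smooth utility (here \(f(x)=x^3+x\), an arbitrary choice) it does not:

```python
# Check f(x + eta(h)) = (f(x+h) + f(x-h)) / 2 for all x:
# holds exactly for exponential f, fails for a generic smooth utility.
import numpy as np

beta, h = 0.7, 0.3
xs = np.linspace(-2.0, 2.0, 9)

f_exp = lambda x: np.exp(beta * x)
eta_exp = np.log((f_exp(h) + f_exp(-h)) / 2) / beta   # f_exp^{-1} of the midpoint
print(np.max(np.abs(f_exp(xs + eta_exp) - (f_exp(xs + h) + f_exp(xs - h)) / 2)))  # ~ 0

f_gen = lambda x: x**3 + x                            # smooth, f'(0) = 1 != 0
eta_gen = 0.0     # f_gen^{-1}((f_gen(h) + f_gen(-h)) / 2) = f_gen^{-1}(0) = 0
print(np.max(np.abs(f_gen(xs + eta_gen) - (f_gen(xs + h) + f_gen(xs - h)) / 2)))  # > 0
```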

What can DistribRL optimize?

Theorem

The only “smooth” functionals of the cumulated reward that Distrib-VI can optimize in this way are the entropic utilities

…but we have seen that direct DP works for them :-|

Concluding remarks

  • Limit cases

    • \(\beta \to\infty\): maximize the highest cumulated reward reachable with positive probability

    • \(\beta \to -\infty\): maximize the lowest cumulated reward reachable with positive probability

  • Distribution representation: many interesting problems

  • The possibilities offered by Distributional RL are not unlimited

  • Some gains in stability have been experimentally observed

  • \(\beta\)-entropies provide good surrogates for risk-sensitive RL

  • Q-learning for \(\beta\)-entropies works as well, see [V. S. Borkar. Q-learning for risk-sensitive control. 2002.]

  • Ongoing work: identifying the \(\beta\)-optimal policies for all \(\beta\) simultaneously

  • Possible interest in inverse RL