Planning in MDPs: beyond expectations

Journées MAS - August 2024 - Poitiers

Aurélien Garivier
Ecole Normale Supérieure de Lyon

Beyond Average Return in Markov Decision Processes

Alexandre Marthe ENSL

Claire Vernade Univ. Tübingen

Markov Decision Processes

Reinforcement Learning

  • TD-Gammon [Tesauro ’92-’95]: backgammon world champion
  • KnightCap [Baxter et al. ’98]: chess (2500 ELO)
  • Computer poker [Alberta, ’08…]
  • Computer go [MoGo ’06], [AlphaGo ’15, AlphaZero ’18]
  • Atari, Starcraft, etc. [DeepMind ’10 sqq]
  • Robotics: jugglers, acrobots, … [Schaal and Atkeson ’94 sqq]
  • Navigation: robot guide in the Smithsonian Museum [Thrun et al. ’99]
  • Elevator dispatching [Crites and Barto ’96]
  • Maintenance [Mahadevan et al. ’97]
  • Internet Packet Routing [Boyan and Littman ’93]
  • Task Scheduling [Zhang and Dietterich ’95]
  • Social Networks [Acemoglu and Ozdaglar ’10]
  • Yield Management, pricing [Gosavi ’10]
  • Load forecasting [Meyn ’10]
  • Recommendation systems [Afsar et al. ’21]
  • Protein structure prediction [Jumper et al. ’21]
  • Reinforcement learning with human feedback for LLMs [Ouyang et al. ’23]

Example: the Cliff Environment

(Cliffs of Moher, Ireland)

Example: Retail Store Management

  • during week \(t\), the (random) demand is \(U_t\) units.
  • on Monday morning you may order \(A_t\) additional units: they are delivered immediately, before the shop opens.
  • maintenance cost: \(h(s)\) for \(s\) units left in stock from the previous week
  • ordering cost: \(C(a)\) for \(a\) units ordered
  • sales profit: \(f(q)\) for \(q\) units sold
  • constraints:
    • the warehouse has a maximal capacity of \(M\) units
    • you cannot sell units that are not in stock

Semi-controlled Dynamical System

\(\mathcal{D} = (\mathcal{S}, \mathcal{A}, \mathcal{U}, \psi, r(\cdot, \cdot, \cdot))\)

  • \(\mathcal{S}\) = state space (discrete in this talk)

  • \(\mathcal{A}\) = action space (discrete in this talk)

  • \(\mathcal{U}\) = externality space (uncontrolled)

  • \(\psi(s, a, u)\) = transition function:
       choosing action \(a\) in state \(s\), if the externality is \(u\) then the next state is \(\psi(s, a, u)\)

  • \(r(\cdot, \cdot, \cdot)\) = reward function
      \(r(s, a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)

Session

  • initial state \(S_0=s_0\in\mathcal{S}\)
  • horizon H = total number of steps
  • policy \(\pi = (\pi_t)_{0\leq t<H}\)    where \(\pi_t: \mathcal{S}\to \mathcal{A}\)
  • externalities \(U_1, \dots, U_H\) are \(\mathcal{U}\)-valued random variables
  • Given a policy \(\pi\), for all \(t\in\{0, \dots, H-1\}\) the state evolves as \(S_{t+1} = \psi(S_t, \pi_t(S_t), U_{t+1})\)
  • cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)

Model: Markov Decision Process

\(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P(\cdot|\cdot, \cdot), r(\cdot, \cdot, \cdot))\)

  • \(\mathcal{S}\) = state space

  • \(\mathcal{A}\) = action space

  • \(P(\cdot|\cdot, \cdot)\) = transition kernel:
       \(P(s'|s,a)\) = probability, if you choose action \(a\), to jump from state \(s\) to state \(s'\)

  • \(r(\cdot, \cdot, \cdot)\) = reward function
       \(r(s,a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)

Session

  • initial state \(S_0=s_0\in\mathcal{S}\)
  • horizon H = total number of steps
  • policy \(\pi = (\pi_t)_{0\leq t<H}\)
       At time \(t\), the chosen action is \(\pi_t(S_t)\)
  • given policy \(\pi\), the trajectory \((S_t)_{0\leq t\leq H}\) is a (non-homogeneous) Markov chain with kernels \(P^{\pi_t}(s'|s) = P(s'|s, \pi_t(s))\)
  • (random) cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)
  • We write \(\mathbb{P}^\pi\) (resp. \(\mathbb{E}^\pi\)) for the law of this process (resp. its expectation)
  • stationary policy: \(\pi_t=\pi_0\) for all \(t\)
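To fix ideas, here is a minimal simulation sketch of such a session (an illustration, not part of the original slides), assuming the MDP is stored as numpy arrays `P` of shape `(S, A, S)` with `P[s, a, s'] = P(s'|s, a)`, `r` of shape `(S, A, S)`, and the policy as a list `pi` of `H` state-to-action arrays:

```python
# Minimal sketch: simulate one session and return the cumulated reward W_H.
import numpy as np

def rollout(P, r, pi, s0, H, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_states = P.shape[0]
    s, W = s0, 0.0
    for t in range(H):
        a = pi[t][s]                               # a = pi_t(S_t)
        s_next = rng.choice(n_states, p=P[s, a])   # S_{t+1} ~ P(.|S_t, a)
        W += r[s, a, s_next]                       # accumulate r(S_t, a, S_{t+1})
        s = s_next
    return W                                       # one realisation of W_H
```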

Markov Decision Process

Example: the Cliff Environment

  • random move with probability \(\epsilon\geq 0\)
  • reward: \(1\) per step spent in the goal state

Example: Retail Store Management

  • state = number of units in stock before the week starts \[\mathcal{S}= \{0, \dots, M\}\]

  • action = number of units ordered \[\mathcal{A}= \{0,\dots,M\}\]

  • dynamic: \(\displaystyle{S_{t+1} = \max\big(0, \min(M, S_t + A_t) - U_t \big)}\)

  • reward: \(\displaystyle{r(s, a, s') = - C(a) - h(s) + f\big(\min(M, s+a) - s'\big)}\)

  • possible policy: always order \(\mathbb{E}[U_t]\) units each week (?); see the tabulation sketch below
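As a concrete illustration (with arbitrary placeholder capacity, demand law and cost/profit functions, not values from the talk), this MDP can be tabulated as follows:

```python
# Build the transition kernel P and reward table R of the retail-store MDP.
# M, demand_pmf, h, C, f below are illustrative choices only.
import numpy as np

M = 10                                  # warehouse capacity
demand_pmf = np.full(6, 1 / 6)          # demand U_t uniform on {0, ..., 5}
h = lambda s: 0.1 * s                   # maintenance cost for s units in stock
C = lambda a: 1.0 * a                   # ordering cost for a units
f = lambda q: 2.0 * q                   # profit for q units sold

n = M + 1                               # states and actions: {0, ..., M}
P = np.zeros((n, n, n))
R = np.zeros((n, n, n))
for s in range(n):
    for a in range(n):
        stock = min(M, s + a)           # units available once the order is delivered
        for u, p in enumerate(demand_pmf):
            s_next = max(0, stock - u)  # S_{t+1} = max(0, min(M, S_t + A_t) - U_t)
            P[s, a, s_next] += p
            R[s, a, s_next] = -C(a) - h(s) + f(stock - s_next)
```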

Goals

  • Planning: knowing the model \(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P, r)\), find a policy \(\pi\) maximizing the cumulative reward \(W_H\) in expectation: \[\max_{\pi} \mathbb{E}^\pi\left[\sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)\right] \]
  • Learning: knowing only the structure of the model \(\mathcal{S}, \mathcal{A}\) but not the dynamics \(P\) nor the reward function \(r\), and having access to a simulator, play several sessions so as to find the best policy
    • regret minimization or best policy identification
    • trajectory or transition simulator
    • etc…

Classical Planning

Policy Evaluation

The value function \(\big(V^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\) is defined by \[V^{\pi}_t(s) = \mathbb{E}^\pi\left[\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \right| S_t = s\right]\] and can be computed recursively: for all \(s\in\mathcal{S}\),

  • \(V^\pi_{H}(s) = 0\)

  • \(\displaystyle{V^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s))\big(r(s, \pi_t(s), s') + V^{\pi}_{t+1}(s')\big)}\)
    \(\tiny{\hbox{since }V^{\pi}_{t}(s) = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + \mathbb{E}^{\pi}\big[ \textstyle\sum_{k=t+1}^{H-1} r(S_k, \pi_k(S_k), S_{k+1}) \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s\Big] = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + V^{\pi}_{t+1}(S_{t+1}) \,\Big|\, S_t = s\Big]}\)
    NB: \(V^{\pi}_{0}(s_0) = \mathbb{E}^{\pi}[W_H]\)

Operator formulation

The Bellman operator \(\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) is defined by \[\mathcal{T}^{\pi}v (s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big(r(s, \pi(s),s') + v(s')\big)\]

Backward induction on the expected cumulated reward:

Policy Evaluation Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • \(V \gets \mathcal{T}^\pi V\)
  • return \(V(s_0)\)
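A direct sketch of this backward induction (illustrative code, same tabular conventions as before: arrays `P`, `r` of shape `(S, A, S)`, policy `pi` as a list of `H` state-to-action arrays):

```python
# Backward policy evaluation: V <- T^pi V, repeated H times.
import numpy as np

def policy_evaluation(P, r, pi, H):
    n_states = P.shape[0]
    V = np.zeros(n_states)                        # V_H = 0
    for t in range(H - 1, -1, -1):
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = pi[t][s]
            V_new[s] = P[s, a] @ (r[s, a] + V)    # (T^pi V)(s)
        V = V_new
    return V                                      # V[s0] = E^pi[W_H]
```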

Planning: Dynamic Programming

  • since only the last reward \(r\big(S_{H-1}, \pi_{H-1}(S_{H-1}), S_H\big)\) remains to be collected at time \(H-1\), \[V_{H-1}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) {\color{orange}{\big(}}r(s, a, s') {\color{orange}{+0\big)}}\] by choosing \(\displaystyle{\pi_{H-1}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) r(s, a, s') }\)

  • Thus, the best expected reward over the last two steps is \[V_{H-2}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) \] obtained by choosing \(\displaystyle{\pi_{H-2}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) }\)

Backward induction on the table

Example: Retail Store

Example: the Cliff Environment

The iterates may converge to a fixed point, but this is another story…

Bellman iterations:

  • Greedy action: \(\displaystyle{\mathcal{G}^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)}\)
  • Bellman optimal operator \(\mathcal{T}^*: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) defined by \[\begin{align*}\mathcal{T}^*v (s) &= {\color{orange}{\max_{a\in\mathcal{A}}}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)\\ & \scriptsize= \sum_{s'\in\mathcal{S}}P\big(s'|s,\mathcal{G}^*(s, v)\big) \big(r(s, \mathcal{G}^*(s, v),s') + v(s')\big) \end{align*}\]

Planning by Value Iteration:

Value Iteration Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, V)\)
    • \(V \gets \mathcal{T}^* V\)
  • return \(\pi\) and \(V(s_0)\)

The iterates may converge to a fixed point, but this is another story…
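A minimal sketch of this finite-horizon Value Iteration (illustrative code, same tabular conventions):

```python
# Finite-horizon Value Iteration: record the greedy policy, then apply T*.
import numpy as np

def value_iteration(P, r, H):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        # Q[s, a] = sum_{s'} P(s'|s, a) * (r(s, a, s') + V(s'))
        Q = np.einsum('saj,saj->sa', P, r + V[None, None, :])
        pi[t] = Q.argmax(axis=1)        # greedy action G*(s, V)
        V = Q.max(axis=1)               # (T* V)(s)
    return pi, V
```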

Distributional RL

Reward distribution

  • The expectation of \(W_H\) may not be a good decision criterion

  • Risk-aware RL: optimize other functionals of \(W_H\)

    • quantiles
    • conditional value-at-risk
    • etc…

Backward Induction is pretty general

  • probability of reaching a state
  • time spent in a state
  • higher moments of the total reward
  • etc…

Example: entropic utility

  • Def: If \(X\) is a \([a,b]\)-valued random variable, for all \(\beta\in\mathbb{R}\) \[U_\beta(X) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta X}\right]\]

  • Properties:

    • if \(\beta\leq \beta'\) then \(a\leq U_\beta(X) \leq U_{\beta'}(X) \leq b\)
    • \(\beta \mapsto U_\beta(X)\) is continuous with \(U_0(X) = \mathbb{E}[X]\)
    • Example: if \(X\sim\mathcal{N}(\mu, \sigma^2)\), then \(U_\beta(X) = \mu + \frac{\sigma^2}{2} \beta\)
    • More generally, \(U_\beta(X) = \mathbb{E}[X] + \frac{\mathbb{V}(X)}{2} \beta + o(\beta)\)
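A quick numerical check of these properties on a discrete distribution (the Bernoulli(1/2) example below is an arbitrary choice for illustration):

```python
# Entropic utility of a discrete distribution; U_0 is the expectation (by continuity).
import numpy as np

def entropic_utility(beta, values, probs):
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    if beta == 0.0:
        return float(probs @ values)
    return float(np.log(probs @ np.exp(beta * values)) / beta)

# U_beta is non-decreasing in beta and stays within [min X, max X] = [0, 1]
for beta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(beta, entropic_utility(beta, [0.0, 1.0], [0.5, 0.5]))
```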

\(\beta\)-entropies as risk-aware utilities

Computing entropic utilities of \(W_H\)

The \(\beta\)-value function \(\big(V^{\pi}_{\beta,t}\big)_{0\leq t\leq H}\) of a policy \(\pi\), defined by \[V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \mathbb{E}^\pi\left[\left.\exp\left(\beta \sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big)\right) \right| S_t = s\right],\] can also be computed recursively: for all \(s\in\mathcal{S}\),

  • \(V^{\pi}_{\beta,H}(s) = 0\)

  • \(\displaystyle{V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s)) \exp\Big(\beta\big(r(s, \pi_t(s), s') + V^{\pi}_{\beta,t+1}(s')\big)\Big)}\)
    \(\tiny{\hbox{since }\exp\big(\beta V^{\pi}_{\beta,t}(s)\big) = \mathbb{E}^\pi\Big[ \exp\big(\beta\, r(s, \pi_t(s), S_{t+1}) \big)\; \mathbb{E}^\pi\big[ \exp\big(\beta \textstyle\sum_{k=t+1}^{H-1} r(S_k, \pi_k(S_k), S_{k+1})\big) \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s\Big] = \mathbb{E}^\pi\Big[ \exp\Big(\beta\big( r(s, \pi_t(s), S_{t+1}) + V^{\pi}_{\beta,t+1}(S_{t+1})\big)\Big) \,\Big|\, S_t = s\Big]}\)

\(\beta\)-entropic Value Iteration

[R. A. Howard and J. E. Matheson. Risk-Sensitive Markov Decision Processes. 1972]

  • \(\beta\)-entropic greedy action: \(\displaystyle{\mathcal{G}_\beta^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta}\log}} \sum_{s'\in\mathcal{S}} P(s'|s, a)\; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)

  • \(\beta\)-entropic Bellman optimal operator: \(\displaystyle{\mathcal{T}_\beta^*v (s) = \max_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta} \log}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)

\(\beta\)-entropic Value Iteration Algorithm

  • \(V \gets 0^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}_\beta^*(s, V)\)
    • \(V \gets \mathcal{T}_\beta^* V\)
  • return \(\pi\) and \(V(s_0)\)
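A sketch of \(\beta\)-entropic Value Iteration (illustrative code, same tabular conventions); the greedy choice is made on the utility scale \(\frac{1}{\beta}\log(\cdot)\), which matters when \(\beta<0\):

```python
# beta-entropic Value Iteration (beta != 0); same tabular conventions as before.
import numpy as np

def entropic_value_iteration(P, r, H, beta):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        # Q[s, a] = (1/beta) log sum_{s'} P(s'|s, a) exp(beta (r(s, a, s') + V(s')))
        Q = np.log(np.einsum('saj,saj->sa', P,
                             np.exp(beta * (r + V[None, None, :])))) / beta
        pi[t] = Q.argmax(axis=1)        # beta-entropic greedy action
        V = Q.max(axis=1)               # (T_beta^* V)(s)
    return pi, V
```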

What else?

  • Standard VI uses the linearity of expectation \[ V_t^*(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s, a) \Big( r(s, a, s') + V_{t+1}^*(s') \Big)\]

  • entropic VI uses conditional indep. and \(e^{a+b} = e^a\,e^b\)

\[ V_{\beta,t}^*(s) = \max_{a\in\mathcal{A}} \frac{1}{\beta} \log \sum_{s'\in\mathcal{S}} P(s'|s, a) \exp\Big( \beta\big(r(s, a, s') + V_{\beta,t+1}^*(s')\big) \Big)\]

  • the proof does not work for other utilities (e.g., quantiles)

Propagating the distribution

Src: [Bellemare, Dabney and Rowland. Distributional Reinforcement Learning. 2023.]

Propagating the distribution

Distributional value function \(\big(\mathcal{V}^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\): \[\mathcal{V}^{\pi}_t(s) = \mathbb{P}^\pi\left(\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \in \cdot\; \right| S_t = s\right) \in \mathcal{M}_1(\mathbb{R})\]

It can be computed recursively: for all \(s\in\mathcal{S}\),

  • \(\mathcal{V}_{H}(s) = \delta_0\)

  • \(\displaystyle{\mathcal{V}^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s, \pi_t(s)\big)\big(\delta_{r(s, \pi_t(s), s')} * \mathcal{V}^{\pi}_{t+1}(s')\big)}\)
    where \(*\) is the convolution on \(\mathcal{M}_1(\mathbb{R})\)
    \(\mathcal{V}^{\pi}_{t}(s)\) is hence a mixture of Dirac measures

Example: Retail Store

Reference on Distributional RL

Operator formulation

Distributional Bellman operator \(\mathcal{T}^\pi: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) \[\mathcal{T}^{\pi}\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big( \delta_{r(s, \pi(s),s')} * \mathcal{V}(s')\big)\]

The law of the cumulated reward \(W_H\) under policy \(\pi\) can be computed recursively:

Distributional Value Iteration Algorithm

  • \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • \(\mathcal{V}\gets \mathcal{T}^\pi \mathcal{V}\)
  • return \(\mathcal{V}(s_0)\)
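A sketch of this recursion with each \(\mathcal{V}(s)\) stored as a dictionary mapping atoms of \(W\) to their probabilities (illustrative code; the number of atoms can grow quickly, which is why practical distributional RL relies on parametric representations):

```python
# Distributional policy evaluation: each V[s] is a dict {atom: probability}
# representing a mixture of Dirac measures; tabular conventions as before.
import numpy as np
from collections import defaultdict

def distributional_evaluation(P, r, pi, H):
    n_states = P.shape[0]
    V = [{0.0: 1.0} for _ in range(n_states)]           # V_H(s) = delta_0
    for t in range(H - 1, -1, -1):
        V_new = []
        for s in range(n_states):
            a = pi[t][s]
            mix = defaultdict(float)
            for s_next in range(n_states):
                for w, q in V[s_next].items():           # delta_{r(s,a,s')} * V(s')
                    mix[r[s, a, s_next] + w] += P[s, a, s_next] * q
            V_new.append(dict(mix))
        V = V_new
    return V                                             # V[s0] = law of W_H
```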

Idea: optimize any functional

  • The greedy action for the functional \(\Psi\) is defined by \[\mathcal{G}^*(s, \mathcal{V}) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \Psi\left(\sum_{s'\in\mathcal{S}} P(s'|s,a) \big(\delta_{r(s, a,s')} * \mathcal{V}(s')\big) \right)\]
  • The distributional Bellman optimal operator \(\mathcal{T}^*: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) for the functional \(\Psi\) is defined by \[\mathcal{T}^*\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\mathcal{G}^*(s, \mathcal{V})\big) \big(\delta_{r(s, \mathcal{G}^*(s, \mathcal{V}),s')} * \mathcal{V}(s')\big)\]

Distrib-VI

Distributional Value Iteration Algorithm for functional \(\Psi\)

  • \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
  • for \(t=H-1\) down to \(0\):
    • for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, \mathcal{V})\)
    • \(\mathcal{V}\gets \mathcal{T}^* \mathcal{V}\)
  • return \(\pi\) and \(\mathcal{V}(s_0)\)
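A sketch of Distrib-VI for a generic functional \(\Psi\) acting on such atom dictionaries (illustrative code); with \(\Psi\) = expectation it recovers standard Value Iteration, and the question below is what it actually optimizes for other choices of \(\Psi\):

```python
# Distrib-VI for a functional Psi; distributions are dicts {atom: probability}.
import numpy as np
from collections import defaultdict

def one_step_mixture(P, r, V, s, a):
    """Law of r(s, a, S') + W' when S' ~ P(.|s, a) and W' ~ V[S']."""
    mix = defaultdict(float)
    for s_next in range(P.shape[0]):
        for w, q in V[s_next].items():
            mix[r[s, a, s_next] + w] += P[s, a, s_next] * q
    return dict(mix)

def distrib_vi(P, r, H, Psi):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = [{0.0: 1.0} for _ in range(n_states)]
    pi = [np.zeros(n_states, dtype=int) for _ in range(H)]
    for t in range(H - 1, -1, -1):
        V_new = []
        for s in range(n_states):
            candidates = [one_step_mixture(P, r, V, s, a) for a in range(n_actions)]
            best = max(range(n_actions), key=lambda a: Psi(candidates[a]))
            pi[t][s] = best                              # Psi-greedy action
            V_new.append(candidates[best])
        V = V_new
    return pi, V

expectation = lambda nu: sum(w * q for w, q in nu.items())   # Psi = E recovers standard VI
```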

Question

What does it optimize??

Understanding what functionals distrib-VI can optimize

  • expectation and \(\beta\)-entropies are optimizable

  • of course, all monotone transforms of optimizable functionals are optimizable

  • some utilities like quantiles are not optimized this way

  • we assume that the decision criterion \(\Psi\) is such that if \(\nu\) is stochastically dominated by \(\nu'\), then \(\Psi(\nu)\leq \Psi(\nu')\).

Ingredient 1 : reduction to utilities

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- "[λ] 0" --> B(11)
  A(00) -- "[λ] 0" --> B(11)
  A -- "[1-λ] 0" --> C(22)
  A -- "[1-λ] 0" --> C(22)

  B -- X --> D(22)
  B -- Y --> E(33)
  C -- Z --> F(44)
  C -- Z --> F(44)
  
  linkStyle 0,2,4,6 stroke:red,stroke-width:4px
  linkStyle 1,3,5,7 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;
  class F allnodes;

action 0
action 1
\(X\sim \nu_1\), \(Y\sim \nu_2\), \(Z\sim \nu'\)

  • For all \(\lambda\in[0,1]\) and all \(\nu_1, \nu_2, \nu'\in\mathcal{M}_1(\mathbb{R})\), \[ \Psi(\nu_1) \leq \Psi(\nu_2) \implies \Psi\big(\lambda \nu_1 + (1-\lambda)\nu'\big) \leq \Psi\big(\lambda \nu_2 + (1-\lambda)\nu' \big)\]

  • Von Neumann - Morgenstern theorem: there exists \(f:\mathbb{R}\to\mathbb{R}\) such that \[\Psi(\nu_1)\leq \Psi(\nu_2) \iff \int f\,d\nu_1 \leq \int f\,d\nu_2\]

\(\implies\) wlog we focus on utilities \(\displaystyle{\Psi(\nu) = \int f\,d\nu}\)

Proof of the VNM utility theorem

  • by induction, if \(\forall i, \Psi(\nu_i)=\Psi(\nu'_i)\) then \(\Psi\big(\sum p_i \nu_i\big) = \Psi\big(\sum p_i \nu'_i\big)\)

  • assume that \(\operatorname{Supp}(\nu) \subset \big[\underline{x}, \overline{x}\big]\).

  • for every \(x\in\big[\underline{x}, \overline{x}\big]\), by the stochastic domination property there exists \(f(x)\in[0,1]\) such that \[\Psi(\delta_x) = \Psi\Big(\big(1-f(x)\big)\delta_{\underline{x}} + f(x)\delta_{\overline{x}}\Big)\]

  • define \(u(\nu) = \int f\,d\nu\)

  • then \(\displaystyle{\Psi\big(\sum_i p_i\delta_{x_i}\big) = \Psi\Big(\sum_i p_i\Big(\big(1-f(x_i)\big)\delta_{\underline{x}} + f(x_i)\delta_{\overline{x}}\Big) \Big) = \Psi\Big(\big(1-u\big(\sum_i p_i\delta_{x_i}\big)\big)\delta_{\underline{x}} + u\big(\sum_i p_i\delta_{x_i}\big)\delta_{\overline{x}}\Big)}\)

  • hence \[\begin{align*} \Psi(\nu)\leq \Psi(\nu') &\iff \Psi\Big(\big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}}\Big)\leq \Psi\Big(\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\Big)\\ &\iff \big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}} \hbox{ is stochastically dominated by }\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\\ & \iff u(\nu)\leq u(\nu') \end{align*}\]

Ingredient 2: invariance by translation

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- x --> B(11)
  A(00) -- x --> B(11)
  B -- X --> C(22)
  B -- X --> D(22)
  B -- Y --> E(33)
  B -- Y --> F(44)
  linkStyle 0,2,3 stroke:red,stroke-width:4px
  linkStyle 1,4,5 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;
  class F allnodes;

action 0
action 1

For all \(x\in\mathbb{R}\) and all random variables \(X,Y\) \[ \Psi(X) \leq \Psi(Y) \implies \Psi(x+X) \leq \Psi(x+Y)\]

Hence, it is necessary that \[ \Psi(X) = \Psi(Y) \implies \Psi(x+X) = \Psi(x+Y)\]

Mixing the ingredients

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '28px'}}}%%
flowchart LR
  A(00) -- x --> B(11)
  A(00) -- x --> B(11)
  B -- "η(h)" --> C(22)
  B -- "[1/2] h" --> D(33)
  B -- "[1/2] -h" --> E(44)
  linkStyle 0 stroke:magenta,stroke-width:4px,font-size:28pt
  linkStyle 0,2 stroke:red,stroke-width:4px
  linkStyle 1,3,4 stroke:blue,stroke-width:4px
  classDef allnodes 0 fill:#131c37,stroke:#131c37,stroke-width:4px,font-size:0pt
  class A allnodes;
  class B allnodes;
  class C allnodes;
  class D allnodes;
  class E allnodes;

action 0
action 1

  • \(\Psi(\nu) = \int fd\nu\) with \(f\in\mathcal{C}^2(\mathbb{R})\) and \(f'(0)\neq 0\)

  • for \(h\) small enough, let \(\eta(h) = f^{-1}\left(\frac{f(h)+f(-h)}{2}\right) = \frac{\beta}{2} h^2 + o(h^2)\quad~\) where \(\beta=\frac{f''(0)}{f'(0)}\)

  • for all \(x\in\mathbb{R}\), \(f\big(\eta(h)\big) = \frac{1}{2}f(-h) + \frac{1}{2}f(h)\) and translation invariance imply \[\begin{align*} f(x + \eta(h)) &= \frac{1}{2} f(x+h) + \frac{1}{2} f(x-h)\\ f(x) + f'(x)\; \frac{\beta}{2} h^2 + o(h^2) &= f(x) + \frac{1}{2}f''(x)\; h^2 + o(h^2) \end{align*}\]

  • hence \(f'' = \beta f'\)

    • if \(\beta=0\), \(f(x) = \alpha x + c\) \(\quad \implies\) expectation
    • otherwise, \(f(x) = \alpha e^{\beta x}+c\) \(\quad \implies\) \(\beta\)-entropy
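A small numerical check of this characterization (illustrative code): for \(f(x)=e^{\beta x}\), the shift \(\eta(h)=f^{-1}\big(\frac{f(h)+f(-h)}{2}\big)\) works uniformly in \(x\), whereas for a generic smooth utility (here \(f(x)=x^3+x\), an arbitrary choice) it does not:

```python
# Check f(x + eta(h)) = (f(x+h) + f(x-h)) / 2 for all x:
# holds exactly for exponential f, fails for a generic smooth utility.
import numpy as np

beta, h = 0.7, 0.3
xs = np.linspace(-2.0, 2.0, 9)

f_exp = lambda x: np.exp(beta * x)
eta_exp = np.log((f_exp(h) + f_exp(-h)) / 2) / beta   # f_exp^{-1} of the midpoint
print(np.max(np.abs(f_exp(xs + eta_exp) - (f_exp(xs + h) + f_exp(xs - h)) / 2)))  # ~ 0

f_gen = lambda x: x**3 + x                            # smooth, f'(0) = 1 != 0
eta_gen = 0.0     # f_gen^{-1}((f_gen(h) + f_gen(-h)) / 2) = f_gen^{-1}(0) = 0
print(np.max(np.abs(f_gen(xs + eta_gen) - (f_gen(xs + h) + f_gen(xs - h)) / 2)))  # > 0
```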

What can DistribRL optimize?

Theorem

The only “smooth” functionals of the cumulated reward that Distrib-VI can optimize in this way are the entropic utilities

…but we have seen that direct DP works for them :-|

Concluding remarks

  • Limit cases

    • \(\beta \to\infty\): maximize the highest cumulated reward reachable with positive probability

    • \(\beta \to -\infty\): maximize the lowest cumulated reward reachable with positive probability

  • Distribution representation: many interesting problems

  • The possibilities offered by Distributional RL are not unlimited

  • Some gains in stability have been experimentally observed

  • \(\beta\)-entropies provide good surrogates for risk-sensitive RL

  • Q-learning for \(\beta\)-entropies works as well, see [V. S. Borkar. Q-learning for risk-sensitive control. 2002.]

  • Ongoing work: identifying the \(\beta\)-optimal policies for all \(\beta\) simultaneously

  • Possible interest in inverse RL