Journées MAS - August 2024 - Poitiers
\(\mathcal{D} = (\mathcal{S}, \mathcal{A}, \mathcal{U}, \psi, r(\cdot, \cdot, \cdot))\)
\(\mathcal{S}\) = state space (discrete in this talk)
\(\mathcal{A}\) = action space (discrete in this talk)
\(\mathcal{U}\) = externality space (uncontrolled)
\(\psi(s, a, u)\) = transition function:
choosing action \(a\) in state \(s\), if the externality is \(u\) then the next state is \(\psi(s, a, u)\)
\(r(\cdot, \cdot, \cdot)\) = reward function
\(r(s, a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)
\(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P(\cdot|\cdot, \cdot), r(\cdot, \cdot, \cdot))\)
\(\mathcal{S}\) = state space
\(\mathcal{A}\) = action space
\(P(\cdot|\cdot, \cdot)\) = transition kernel:
\(P(s'|s,a)\) = probability, if you choose action \(a\), to jump from state \(s\) to state \(s'\)
\(r(\cdot, \cdot, \cdot)\) = reward function
\(r(s,a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)
state = number of units in stock before the week starts \[\mathcal{S}= \{0, \dots, M\}\]
action = number of units ordered \[\mathcal{A}= \{0,\dots,M\}\]
dynamic: \(\displaystyle{S_{t+1} = \max\big(0, \min(M, S_t + A_t) - U_t \big)}\)
reward: \(\displaystyle{r(s, a, s') = - C(a) - h(s) + f\big(\min(M, s+a) - s'\big)}\)
possible policy: always buy \(\mathbb{E}[U_t]\) bikes (?)
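A minimal sketch (with assumptions not in the talk: Poisson demand for \(U_t\), and hypothetical cost, holding and sales functions `C`, `h`, `f`) of how the externality model \((\psi, U_t)\) turns into an MDP kernel \(P(s'|s,a)\) and reward table \(r(s,a,s')\):

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical parameters, for illustration only
M, demand_mean = 20, 4
C = lambda a: 2.0 * a        # ordering cost
h = lambda s: 0.1 * s        # holding cost
f = lambda sold: 5.0 * sold  # sales revenue

# Demand law, truncated with the tail mass lumped on the last atom
u_max = 60
pu = poisson.pmf(np.arange(u_max + 1), demand_mean)
pu[-1] += 1.0 - pu.sum()

P = np.zeros((M + 1, M + 1, M + 1))  # P[s, a, s'] = P(s'|s,a)
R = np.zeros((M + 1, M + 1, M + 1))  # R[s, a, s'] = r(s, a, s')
for s in range(M + 1):
    for a in range(M + 1):
        stock = min(M, s + a)                  # stock after delivery
        for u, p in enumerate(pu):
            s_next = max(0, stock - u)         # psi(s, a, u)
            P[s, a, s_next] += p
            R[s, a, s_next] = -C(a) - h(s) + f(stock - s_next)
```

Each row `P[s, a, :]` sums to 1, so the uncontrolled-externality description and the MDP kernel define the same transition law.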
The value function \(\big(V^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\) is defined by \[V^{\pi}_t(s) = \mathbb{E}^\pi\left[\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \right| S_t = s\right]\] and can be computed recursively: for all \(s\in\mathcal{S}\),
\(V^\pi_{H}(s) = 0\)
\(\displaystyle{V^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s))\big(r(s, \pi_t(s), s') + V^{\pi}_{t+1}(s')\big)}\)
\(\tiny{\hbox{since, by the tower property, }V^{\pi}_{t}(s) = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + V^{\pi}_{t+1}(S_{t+1}) \Big| S_t = s\Big]}\)
NB: \(V^{\pi}_{0}(s_0) = \mathbb{E}^{\pi}[W_H]\)
The Bellman operator \(\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) is defined by \[\mathcal{T}^{\pi}v (s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big(r(s, \pi(s),s') + v(s')\big)\]
Backward induction on the expected cumulated reward:
Policy Evaluation Algorithm
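A minimal sketch of this backward recursion, assuming (hypothetical conventions) the kernel and rewards are stored as arrays `P[s, a, s']`, `R[s, a, s']` and the policy as `pi[t][s]`:

```python
import numpy as np

def policy_evaluation(P, R, pi, H):
    """V[t, s] = expected cumulated reward from step t in state s under policy pi."""
    n_states = P.shape[0]
    V = np.zeros((H + 1, n_states))            # V[H] = 0
    for t in range(H - 1, -1, -1):             # backward in time
        for s in range(n_states):
            a = pi[t][s]
            # Bellman operator T^pi applied to V[t+1]
            V[t, s] = P[s, a] @ (R[s, a] + V[t + 1])
    return V
```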
since the only reward still to be collected at step \(H-1\) is \(r(S_{H-1}, A_{H-1}, S_H)\), \[V_{H-1}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) {\color{orange}{\big(}}r(s, a, s') {\color{orange}{+0\big)}}\] by choosing \(\displaystyle{\pi_{H-1}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) r(s, a, s') }\)
Thus, the best expected reward over the last two steps is \[V_{H-2}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) \] obtained by choosing \(\displaystyle{\pi_{H-2}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) }\)
The iterates may converge to a fixed point, but this is another story…
Value Iteration Algorithm
The iterates may converge to a fixed point, but this is another story…
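A minimal finite-horizon value iteration sketch under the same hypothetical array conventions, returning the optimal values and a greedy policy:

```python
import numpy as np

def value_iteration(P, R, H):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros((H + 1, n_states))            # V[H] = 0
    pi = np.zeros((H, n_states), dtype=int)
    for t in range(H - 1, -1, -1):
        for s in range(n_states):
            # expected one-step reward plus value-to-go, for every action
            q = np.array([P[s, a] @ (R[s, a] + V[t + 1]) for a in range(n_actions)])
            pi[t, s] = int(np.argmax(q))
            V[t, s] = q[pi[t, s]]
    return V, pi
```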
The expectation of \(W_H\) may not be a good decision criterion
Risk-aware RL: optimize other functionals of \(W_H\)
Def: If \(X\) is an \([a,b]\)-valued random variable, for all \(\beta\in\mathbb{R}\setminus\{0\}\) \[U_\beta(X) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta X}\right]\]
Properties:
\(U_\beta(X + c) = U_\beta(X) + c\) for every constant \(c\)
\(U_\beta(X) \to \mathbb{E}[X]\) as \(\beta\to 0\)
\(U_\beta(X) \to \operatorname{ess\,sup} X\) as \(\beta\to+\infty\) and \(U_\beta(X) \to \operatorname{ess\,inf} X\) as \(\beta\to-\infty\)
\(U_\beta\) is monotone with respect to stochastic dominance
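As a small numerical illustration (values chosen here for the example, not from the talk): for a reward worth 0 or 10 with equal probability, \(U_\beta\) interpolates between the worst case, the expectation and the best case as \(\beta\) varies:

```python
import numpy as np
from scipy.special import logsumexp

def entropic_utility(atoms, probs, beta):
    """U_beta(X) = (1/beta) log E[exp(beta X)] for a discrete X (beta != 0)."""
    return logsumexp(beta * np.asarray(atoms), b=np.asarray(probs)) / beta

atoms, probs = [0.0, 10.0], [0.5, 0.5]
print(entropic_utility(atoms, probs, -1.0))   # ~0.69 : risk-averse, close to the worst case
print(entropic_utility(atoms, probs, 1e-6))   # ~5.0  : recovers the expectation as beta -> 0
print(entropic_utility(atoms, probs, 1.0))    # ~9.31 : risk-seeking, close to the best case
```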
the \(\beta\)-value function \(\big(V^{\pi}_{\beta,t}\big)_{0\leq t\leq H}\) of policy \(\pi\) \[V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \mathbb{E}^\pi\left[\left.\exp\left(\beta \sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big)\right) \right| S_t = s\right]\] can also be computed recursively: for all \(s\in\mathcal{S}\),
\(V_{\beta,H}(s) = 0\)
\(\displaystyle{V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s)) \exp\Big(\beta\big(r(s, \pi_t(s), s') + V^{\pi}_{\beta,t+1}(s')\big)\Big)}\)
\(\tiny{\hbox{since, by the tower property, }\exp\big(\beta V^{\pi}_{\beta,t}(s)\big) = \mathbb{E}^\pi\Big[ \exp\big(\beta\, r(s, \pi_t(s), S_{t+1}) \big)\, \exp\big(\beta V^{\pi}_{\beta,t+1}(S_{t+1})\big) \Big| S_t = s\Big]}\)
[R. A. Howard and J. E. Matheson. Risk-Sensitive Markov Decision Processes. 1972]
\(\beta\)-entropic greedy action: \(\displaystyle{\mathcal{G}_\beta^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta}\log}} \sum_{s'\in\mathcal{S}} P(s'|s, a)\; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)
\(\beta\)-entropic Bellman optimal operator: \(\displaystyle{\mathcal{T}_\beta^*v (s) = \max_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta} \log}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)
\(\beta\)-entropic Value Iteration Algorithm
Standard VI uses the linearity of expectation \[ V_t^*(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s, a) \Big( r(s, a, s') + V_{t+1}^*(s') \Big)\]
entropic VI uses conditional independence (the Markov property) and \(e^{a+b} = e^a\,e^b\)
\[ V_{\beta,t}^*(s) = \max_{a\in\mathcal{A}} \frac{1}{\beta} \log \sum_{s'\in\mathcal{S}} P(s'|s, a) \exp\Big( \beta\big(r(s, a, s') + V_{\beta,t+1}^*(s')\big) \Big)\]
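A minimal sketch of \(\beta\)-entropic value iteration under the same hypothetical array conventions; compared with standard VI, only the backup changes (a weighted log-sum-exp instead of an expectation):

```python
import numpy as np
from scipy.special import logsumexp

def entropic_value_iteration(P, R, H, beta):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros((H + 1, n_states))            # V_beta[H] = 0
    pi = np.zeros((H, n_states), dtype=int)
    for t in range(H - 1, -1, -1):
        for s in range(n_states):
            # (1/beta) log sum_{s'} P(s'|s,a) exp(beta (r(s,a,s') + V[t+1, s']))
            q = np.array([logsumexp(beta * (R[s, a] + V[t + 1]), b=P[s, a]) / beta
                          for a in range(n_actions)])
            pi[t, s] = int(np.argmax(q))       # dividing by beta keeps the argmax valid for beta < 0
            V[t, s] = q[pi[t, s]]
    return V, pi
```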
Src: [Bellemare, Dabney and Rowland. Distributional Reinforcement Learning. 2023.]
Distributional value function \(\big(\mathcal{V}^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\): \[\mathcal{V}^{\pi}_t(s) = \mathbb{P}^\pi\left(\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \in \cdot\; \right| S_t = s\right) \in \mathcal{M}_1(\mathbb{R})\]
It can be computed recursively: for all \(s\in\mathcal{S}\),
\(\mathcal{V}_{H}(s) = \delta_0\)
\(\displaystyle{\mathcal{V}^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s, \pi_t(s)\big)\big(\delta_{r(s, \pi_t(s), s')} * \mathcal{V}^{\pi}_{t+1}(s')\big)}\)
where \(*\) is the convolution on \(\mathcal{M}_1(\mathbb{R})\)
\(\mathcal{V}^{\pi}_{t}(s)\) is hence a mixture of Dirac measures
Distributional Bellman operator \(\mathcal{T}^\pi: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) \[\mathcal{T}^{\pi}\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big( \delta_{r(s, \pi(s),s')} * \mathcal{V}(s')\big)\]
The law of the cumulated reward \(W_H\) under policy \(\pi\) can be computed recursively:
Distributional Value Iteration Algorithm
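A minimal sketch of this distributional recursion under a fixed policy, representing each \(\mathcal{V}_t(s)\) as a dictionary of atoms (a mixture of Diracs), with the same hypothetical `P`, `R`, `pi` conventions:

```python
from collections import defaultdict

def distributional_evaluation(P, R, pi, H):
    """dist[t][s] = law of the reward-to-go from step t in state s under pi,
    stored as a dictionary {atom: probability}."""
    n_states = P.shape[0]
    dist = [None] * (H + 1)
    dist[H] = [{0.0: 1.0} for _ in range(n_states)]      # V_H(s) = delta_0
    for t in range(H - 1, -1, -1):
        dist[t] = []
        for s in range(n_states):
            a = pi[t][s]
            mix = defaultdict(float)
            for s_next in range(n_states):
                p = P[s, a, s_next]
                if p == 0.0:
                    continue
                # P(s'|s,a) * (delta_{r(s,a,s')} convolved with V_{t+1}(s'))
                for atom, q in dist[t + 1][s_next].items():
                    mix[round(R[s, a, s_next] + atom, 10)] += p * q
            dist[t].append(dict(mix))
    return dist
```

The support of \(\mathcal{V}_t(s)\) can grow quickly with the horizon, one of the distribution-representation issues mentioned in the conclusion (categorical or quantile projections are the usual answer in the distributional RL literature).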
Distributional Value Iteration Algorithm for functional \(\Psi\)
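A sketch of the backup such an algorithm performs at each \((t,s)\): pick the action whose induced return distribution maximizes the functional `Psi` (hypothetical names; `Psi` maps a `{atom: probability}` dictionary to a real number):

```python
from collections import defaultdict

def psi_greedy_backup(P, R, dist_next, s, Psi):
    """Return the Psi-greedy action at state s and its return distribution,
    given the reward-to-go distributions dist_next[s'] of the next step."""
    best_a, best_mix, best_val = None, None, -float("inf")
    for a in range(P.shape[1]):
        mix = defaultdict(float)
        for s_next in range(P.shape[0]):
            p = P[s, a, s_next]
            if p == 0.0:
                continue
            for atom, q in dist_next[s_next].items():
                mix[R[s, a, s_next] + atom] += p * q
        val = Psi(dict(mix))     # e.g. expectation, beta-entropic utility, a quantile...
        if val > best_val:
            best_a, best_mix, best_val = a, dict(mix), val
    return best_a, best_mix
```

Whether iterating this greedy backup really optimizes \(\Psi\) of the cumulated reward is exactly the question below.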
Question
What does it optimize??
expectation and \(\beta\)-entropies are optimizable
of course, all monotone transforms of optimizable functionals are optimizable
some utilities like quantiles are not optimized this way
we assume that the decision criterion \(\Psi\) is such that if \(\nu\) is stochastically dominated by \(\nu'\), then \(\Psi(\nu)\leq \Psi(\nu')\).
[Figure: return distributions for action 0 and action 1; \(X\sim \nu_1\), \(Y\sim \nu_2\), \(Z\sim \nu'\)]
For all \(\lambda\in[0,1]\) and all \(\nu_1, \nu_2, \nu'\in\mathcal{M}_1(\mathbb{R})\), \[ \Psi(\nu_1) \leq \Psi(\nu_2) \implies \Psi\big(\lambda \nu_1 + (1-\lambda)\nu'\big) \leq \Psi\big(\lambda \nu_2 + (1-\lambda)\nu' \big)\]
Von Neumann - Morgenstern theorem: there exists \(f:\mathbb{R}\to\mathbb{R}\) such that \[\Psi(\nu_1)\leq \Psi(\nu_2) \iff \int f\,d\nu_1 \leq \int f\,d\nu_2\]
\(\implies\) wlog we focus on utilities \(\displaystyle{\Psi(\nu) = \int f\,d\nu}\)
by induction, if \(\forall i, \Psi(\nu_i)=\Psi(\nu'_i)\) then \(\Psi\big(\sum p_i \nu_i\big) = \Psi\big(\sum p_i \nu'_i\big)\)
assume that \(\operatorname{Supp}(\nu) \subset \big[\underline{x}, \overline{x}\big]\).
for every \(x\in\big[\underline{x}, \overline{x}\big]\), by the stochastic domination property (together with continuity of \(\lambda \mapsto \Psi\big((1-\lambda)\delta_{\underline{x}} + \lambda\delta_{\overline{x}}\big)\)) there exists \(f(x)\in[0,1]\) such that \[\Psi(\delta_x) = \Psi\Big(\big(1-f(x)\big)\delta_{\underline{x}} + f(x)\delta_{\overline{x}}\Big)\]
define \(u(\nu) = \int f\,d\nu\)
then \(\displaystyle{\Psi\big(\sum_i p_i\delta_{x_i}\big) = \Psi\Big(\sum_i p_i\Big(\big(1-f(x_i)\big)\delta_{\underline{x}} + f(x_i)\delta_{\overline{x}}\Big) \Big) = \Psi\Big(\big(1-u\big(\sum_i p_i\delta_{x_i}\big)\big)\delta_{\underline{x}} + u\big(\sum_i p_i\delta_{x_i}\big)\delta_{\overline{x}}\Big)}\)
hence \[\begin{align*} \Psi(\nu)\leq \Psi(\nu') &\iff \Psi\Big(\big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}}\Big)\leq \Psi\Big(\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\Big)\\ &\iff \big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}} \hbox{ is stochastically dominated by }\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\\ & \iff u(\nu)\leq u(\nu') \end{align*}\]
[Figure: return distributions for action 0 and action 1]
For all \(x\in\mathbb{R}\) and all random variables \(X,Y\) \[ \Psi(X) \leq \Psi(Y) \implies \Psi(x+X) \leq \Psi(x+Y)\]
Hence, it is necessary that \[ \Psi(X) = \Psi(Y) \implies \Psi(x+X) = \Psi(x+Y)\]
[Figure: return distributions for action 0 and action 1]
\(\Psi(\nu) = \int fd\nu\) with \(f\in\mathcal{C}^2(\mathbb{R})\) and \(f'(0)\neq 0\)
let, for \(h\) small enough, \(\eta(h) = f^{-1}\left(\frac{f(h)+f(-h)}{2}\right) = \frac{\beta}{2} h^2 + o(h^2)\quad~\) with \(\beta=\frac{f''(0)}{f'(0)}\)
for all \(x\in\mathbb{R}\), \(f\big(\eta(h)\big) = \frac{1}{2}f(-h) + \frac{1}{2}f(h)\) and the translation property imply \[\begin{align*} f(x + \eta(h)) &= \frac{1}{2} f(x+h) + \frac{1}{2} f(x-h)\\ f(x) + f'(x)\; \frac{\beta}{2} h^2 + o(h^2) &= f(x) + \frac{1}{2} f''(x)\; h^2 + o(h^2) \end{align*}\]
hence \(f'' = \beta f'\), i.e. \(f(x) = c_1 + c_2\, e^{\beta x}\) (affine if \(\beta = 0\)): \(\Psi\) is an increasing transform of the entropic utility \(U_\beta\) (of the expectation when \(\beta = 0\))
Theorem
The only “smooth” functionals of the cumulated reward that can be optimized are the entropic utilities
…but we have seen that direct DP works for them :-|
Limit cases
\(\beta \to\infty\): maximize the highest cumulated reward reachable with positive probability
\(\beta \to -\infty\): maximize the lowest cumulated reward reachable with positive probability
Distribution representation: many interesting problems
The possibilities offered by Distributional RL are not unlimited
Some gains in stability have been experimentally observed
\(\beta\)-entropies provide good surrogates for risk-sensitive RL
Q-learning for \(\beta\)-entropies works as well, see [V. S. Borkar. Q-learning for risk-sensitive control. 2002.]
Working on identifying the \(\beta\)-optimal policies for all \(\beta\) at the same time
Possible interest in inverse RL
Aurélien Garivier