Planning in MDPs: beyond expectations
Journées MAS - August 2024 - Poitiers
Beyond Average Return in Markov Decision Processes
Markov Decision Processes
Reinforcement Learning
- TD-Gammon [Tesauro '92-'95]: backgammon world champion
- KnightCap [Baxter et al. '98]: chess (2500 ELO)
- Computer poker [Alberta, '08…]
- Computer Go [MoGo '06], [AlphaGo '15, AlphaZero '18]
- Atari, Starcraft, etc. [Deepmind '10 sqq]
- Robotics: jugglers, acrobots, … [Schaal and Atkeson '94 sqq]
- Navigation: robot guide in the Smithsonian Museum [Thrun et al. '99]
- Elevator control [Crites and Barto '96]
- Maintenance [Mahadevan et al. '97]
- Internet Packet Routing [Boyan and Littman '93]
- Task Scheduling [Zhang and Dietterich '95]
- Social Networks [Acemoglu and Ozdaglar '10]
- Yield Management, pricing [Gosavi '10]
- Load forecasting [Meyn '10]
- Recommendation systems [Afsar et al. '21]
- Protein structure prediction [Jumper et al. '21]
- Reinforcement learning with human feedback for LLMs [Ouyang et al. '23]
- …
Example: the Cliff Environment
(Cliffs of Moher, Ireland)
Example: Retail Store Management
- during week \(t\), the (random) demand is \(U_t\) units.
- on Monday morning you may order \(A_t\) additional units: they are delivered immediately, before the shop opens.
- storage cost: \(h(s)\) for \(s\) units left in stock from the previous week
- ordering cost: \(C(a)\) for ordering \(a\) units
- sales profit: \(f(q)\) for q units sold
- constraints:
- warehouse has maximal capacity of \(M\) units
- cannot sell units that are not in stock
Semi-controlled Dynamical System
\(\mathcal{D} = (\mathcal{S}, \mathcal{A}, \mathcal{U}, \psi, r(\cdot, \cdot, \cdot))\)
\(\mathcal{S}\) = state space (discrete in this talk)
\(\mathcal{A}\) = action space (discrete in this talk)
\(\mathcal{U}\) = externality space (uncontrolled)
\(\psi(s, a, u)\) = transition function:
choosing action \(a\) in state \(s\), if the externality is \(u\) then the next state is \(\psi(s, a, u)\)
\(r(\cdot, \cdot, \cdot)\) = reward function
\(r(s, a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)
Session
- initial state \(S_0=s_0\in\mathcal{S}\)
- horizon H = total number of steps
- policy \(\pi = (\pi_t)_{0\leq t<H}\) where \(\pi_t: \mathcal{S}\to \mathcal{A}\)
- externalities \(U_0, \dots, U_{H-1}\) are \(\mathcal{U}\)-valued random variables
- Given a policy \(\pi\), for all \(t\in\{0, \dots, H-1\}\) the state evolves as \(S_{t+1} = \psi(S_t, \pi_t(S_t), U_t)\)
- cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)
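For concreteness, here is a minimal Python sketch (not from the slides) that simulates one session; the toy dynamics \(\psi\), reward \(r\), policy and externality sampler below are illustrative assumptions.

```python
import numpy as np

def simulate_session(psi, r, pi, sample_externality, s0, H, rng):
    """Simulate one session: S_{t+1} = psi(S_t, pi_t(S_t), U_t), W_H = sum of rewards."""
    s, W = s0, 0.0
    for t in range(H):
        a = pi[t](s)                      # action prescribed by the policy at time t
        u = sample_externality(rng)       # uncontrolled externality U_t
        s_next = psi(s, a, u)             # deterministic transition given U_t
        W += r(s, a, s_next)              # accumulate the reward
        s = s_next
    return W

# toy usage: walk on {0,...,4}, reward 1 per step spent in state 4
rng = np.random.default_rng(0)
psi = lambda s, a, u: min(4, max(0, s + a + u))
r = lambda s, a, s_next: float(s_next == 4)
pi = [(lambda s: 1)] * 10                 # stationary policy: always move right
print(simulate_session(psi, r, pi, lambda g: g.choice([-1, 0]), 0, 10, rng))
```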
Model: Markov Decision Process
\(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P(\cdot|\cdot, \cdot), r(\cdot, \cdot, \cdot))\)
\(\mathcal{S}\) = state space
\(\mathcal{A}\) = action space
\(P(\cdot|\cdot, \cdot)\) = transition kernel:
\(P(s'|s,a)\) = probability, when choosing action \(a\), of jumping from state \(s\) to state \(s'\)
\(r(\cdot, \cdot, \cdot)\) = reward function
\(r(s,a, s')\) = reward associated to a transition from state \(s\) to state \(s'\) choosing action \(a\)
Session
- initial state \(S_0=s_0\in\mathcal{S}\)
- horizon H = total number of steps
- policy \(\pi = (\pi_t)_{0\leq t<H}\)
At time \(t\), in state \(s\), the chosen action is \(\pi_t(s)\)
- given policy \(\pi\), the trajectory \((S_t)_{0\leq t\leq H}\) is a (non-homogeneous) Markov chain with kernels \(P^{\pi_t}(s'|s) = P(s'|s, \pi_t(s))\)
- (random) cumulated reward: \(\displaystyle{W_H = \sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)}\)
- We write \(\mathbb{P}^\pi\) (resp. \(\mathbb{E}^\pi\)) for the law of this process (resp. its expectation)
- stationary policy: \(\pi_t=\pi_0\) for all \(t\)
Markov Decision Process
Example: the Cliff Environment
- random move with probability \(\epsilon\geq 0\)
- reward: \(1\) per step spent in the goal state
Example: Retail Store Management
state = number of units in stock before the week starts \[\mathcal{S}= \{0, \dots, M\}\]
action = number of units ordered \[\mathcal{A}= \{0,\dots,M\}\]
dynamics: \(\displaystyle{S_{t+1} = \max\big(0, \min(M, S_t + A_t) - U_t \big)}\)
reward: \(\displaystyle{r(s, a, s') = - C(a) - h(s) + f\big(\min(M, s+a) - s'\big)}\)
possible policy: always order \(\mathbb{E}[U_t]\) units (?)
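To make the model concrete, the sketch below builds the arrays \(P(s'|s,a)\) and \(r(s,a,s')\) of this retail MDP; the Poisson demand and the linear cost/profit functions are assumptions of mine, not part of the model above.

```python
import numpy as np
from scipy.stats import poisson

M, lam = 20, 6                        # capacity and mean weekly demand (assumed)
C = lambda a: 2.0 * a                 # ordering cost (assumed linear)
h = lambda s: 0.5 * s                 # storage cost for leftover stock (assumed)
f = lambda q: 5.0 * q                 # profit for q units sold (assumed)

P = np.zeros((M + 1, M + 1, M + 1))   # P[s, a, s']
R = np.zeros((M + 1, M + 1, M + 1))   # r(s, a, s')
for s in range(M + 1):
    for a in range(M + 1):
        stock = min(M, s + a)                         # stock after delivery
        for u in range(stock + 1):                    # effective demand levels
            s_next = stock - u
            # every demand >= stock empties the shop (s' = 0)
            p = poisson.pmf(u, lam) if u < stock else poisson.sf(stock - 1, lam)
            P[s, a, s_next] += p
            R[s, a, s_next] = -C(a) - h(s) + f(stock - s_next)
```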
Goals
- Planning: knowing the model \(\mathcal{M}= (\mathcal{S}, \mathcal{A}, P, r)\), find a policy \(\pi\) maximizing the cumulative reward \(W_H\) in expectation: \[\max_{\pi} \mathbb{E}^\pi\left[\sum_{t=0}^{H-1} r\big(S_t, \pi_t(S_t), S_{t+1}\big)\right] \]
- Learning: Knowing only the structure of the model \(\mathcal{S}, \mathcal{A}\) but not the dynamics \(P\) and the reward function \(r\), and having access to a simulator, play several sessions so as to find the best policy
- regret minimization or best policy identification
- trajectory or transition simulator
- etc…
Classical Planning
Policy Evaluation
The value function \(\big(V^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\) is defined by \[V^{\pi}_t(s) = \mathbb{E}^\pi\left[\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \right| S_t = s\right]\] and can be computed recursively: for all \(s\in\mathcal{S}\),
\(V^\pi_{H}(s) = 0\)
\(\displaystyle{V^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s))\big(r(s, \pi_t(s), s') + V^{\pi}_{t+1}(s')\big)}\)
\(\tiny{\hbox{since } V^{\pi}_{t}(s) = \mathbb{E}^{\pi}\Big[ r\big(s, \pi_t(s), S_{t+1}\big) + V^{\pi}_{t+1}(S_{t+1}) \Big| S_t = s\Big] \hbox{ by the tower property}}\)
NB: \(V^{\pi}_{0}(s_0) = \mathbb{E}^{\pi}[W_H]\)
Operator formulation
The Bellman operator \(\mathcal{T}^\pi: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) is defined by \[\mathcal{T}^{\pi}v (s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big(r(s, \pi(s),s') + v(s')\big)\]
Backward induction on the expected cumulated reward:
- \(V \gets 0^{\mathcal{S}}\)
- for \(t=H-1\) down to \(0\):
- \(V \gets \mathcal{T}^{\pi_t} V\)
- return \(V(s_0)\)
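A minimal tabular implementation of this backward induction, assuming arrays `P[s, a, s']` and `R[s, a, s']` as in the retail sketch above and a non-stationary policy stored as `policy[t][s]` (all names are mine):

```python
import numpy as np

def evaluate_policy(P, R, policy, H, s0):
    """Return E^pi[W_H] from s0 by iterating V <- T^{pi_t} V for t = H-1, ..., 0."""
    nS = P.shape[0]
    V = np.zeros(nS)
    for t in range(H - 1, -1, -1):
        newV = np.empty(nS)
        for s in range(nS):
            a = policy[t][s]
            newV[s] = P[s, a] @ (R[s, a] + V)   # Bellman operator applied at (t, s)
        V = newV
    return V[s0]
```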
Planning: Dynamic Programming
Since the only reward collected on the last step is \(r\big(S_{H-1}, \pi_{H-1}(S_{H-1}), S_H\big)\), the best expected reward on that step is \[V_{H-1}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) {\color{orange}{\big(}}r(s, a, s') {\color{orange}{+0\big)}}\] obtained by choosing \(\displaystyle{\pi_{H-1}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) r(s, a, s') }\)
Likewise, the best expected reward over the last two steps is \[V_{H-2}(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) \] obtained by choosing \(\displaystyle{\pi_{H-2}(s) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a, s') + V_{H-1}(s') \big) }\)
Backward induction on the table
Example: Retail Store
Example: the Cliff Environment
The iterates may converge to a fixed point, but this is another story…
Bellman iterations:
- Greedy action: \(\displaystyle{\mathcal{G}^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)}\)
- Bellman optimal operator \(\mathcal{T}^*: \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}\) defined by \[\begin{align*}\mathcal{T}^*v (s) &= {\color{orange}{\max_{a\in\mathcal{A}}}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \big(r(s, a,s') + v(s')\big)\\ & \scriptsize= \sum_{s'\in\mathcal{S}}P\big(s'|s,\mathcal{G}^*(s, v)\big) \big(r(s, \mathcal{G}^*(s, v),s') + v(s')\big) \end{align*}\]
Planning by Value Iteration:
- \(V \gets 0^{\mathcal{S}}\)
- for \(t=H-1\) down to \(0\):
- for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, V)\)
- \(V \gets \mathcal{T}^* V\)
- return \(\pi\) and \(V(s_0)\)
The iterates may converge to a fixed point, but this is another story…
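For reference, a hedged sketch of this finite-horizon value iteration with the same assumed arrays:

```python
import numpy as np

def value_iteration(P, R, H, s0):
    """Backward induction on T^*: returns a greedy policy and max E^pi[W_H] from s0."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    policy = np.zeros((H, nS), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = np.einsum("sap,sap->sa", P, R + V[None, None, :])  # Q_t(s, a)
        policy[t] = Q.argmax(axis=1)        # greedy action G^*(s, V)
        V = Q.max(axis=1)                   # V <- T^* V
    return policy, V[s0]

# usage with the retail arrays: pi, v0 = value_iteration(P, R, H=52, s0=0)
```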
Distributional RL
Reward distribution
The expectation of \(W_H\) may not be a good decision criterion
Risk-aware RL: optimize other functionals of \(W_H\)
- quantiles
- conditional value-at-risk
- etc…
Backward Induction is pretty general
- probability of reaching a state
- time spent in a state
- higher moments of the total reward
- etc…
Example: entropic utility
Def: If \(X\) is a \([a,b]\)-valued random variable, for all \(\beta\in\mathbb{R}\) \[U_\beta(X) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta X}\right]\]
Properties:
- if \(\beta\leq \beta'\) then \(a\leq U_\beta(X) \leq U_{\beta'}(X) \leq b\)
- \(\beta \mapsto U_\beta(X)\) is continuous with \(U_0(X) = \mathbb{E}[X]\)
- Example: if \(X\sim\mathcal{N}(\mu, \sigma^2)\), then \(U_\beta(X) = \mu + \frac{\sigma^2}{2} \beta\)
- More generally, \(U_\beta(X) = \mathbb{E}[X] + \frac{\mathbb{V}(X)}{2} \beta + o(\beta)\)
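A quick Monte Carlo sanity check of the Gaussian formula and of the small-\(\beta\) expansion (my own illustration, using a numerically stabilized log-mean-exp):

```python
import numpy as np

def entropic_utility(x, beta):
    """Empirical U_beta(X) = (1/beta) log mean(exp(beta * x)); beta = 0 gives the mean."""
    if beta == 0:
        return x.mean()
    z = beta * x
    return (np.log(np.mean(np.exp(z - z.max()))) + z.max()) / beta  # stabilized

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # mu = 1, sigma^2 = 4
for beta in (-0.5, -0.1, 0.1, 0.5):
    print(beta, entropic_utility(x, beta), 1.0 + 0.5 * 4.0 * beta)  # vs mu + beta*sigma^2/2
```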
\(\beta\)-entropies as risk-aware utilities
Computing entropic utilities of \(W_H\)
the \(\beta\)-value function \(\big(V^{\pi}_{\beta,t}\big)_{0\leq t\leq H}\) of policy \(\pi\) \[V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \mathbb{E}^\pi\left[\left.\exp\left(\beta \sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big)\right) \right| S_t = s\right]\] can also be computed recursively: for all \(s\in\mathcal{S}\),
\(V^{\pi}_{\beta,H}(s) = 0\)
\(\displaystyle{V^{\pi}_{\beta,t}(s) = \frac{1}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'|s, \pi_t(s)) \exp\Big(\beta\big(r(s, \pi_t(s), s') + V^{\pi}_{\beta,t+1}(s')\big)\Big)}\)
\(\tiny{\hbox{since } \exp\big(\beta V^{\pi}_{\beta,t}(s)\big) = \mathbb{E}^\pi\Big[ \exp\big(\beta\, r(s, \pi_t(s), S_{t+1}) \big)\, \exp\big(\beta V^{\pi}_{\beta,t+1}(S_{t+1})\big) \Big| S_t = s\Big] \hbox{ by the tower property}}\)
\(\beta\)-entropic Value Iteration
[R. A. Howard and J. E. Matheson. Risk-Sensitive Markov Decision Processes. 1972]
\(\beta\)-entropic greedy action: \(\displaystyle{\mathcal{G}_\beta^*(s, v) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \frac{1}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'|s, a)\; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\)
\(\beta\)-entropic Bellman optimal operator: \(\displaystyle{\mathcal{T}_\beta^*v (s) = \max_{a\in\mathcal{A}} {\color{orange}{\frac{1}{\beta} \log}} \sum_{s'\in\mathcal{S}} P(s'|s,a) \; {\color{orange}{\exp\Big(\beta}}\big(r(s, a, s') + v(s')\big){\color{orange}{\Big)}}}\) (the \(\frac{1}{\beta}\log\) is kept inside the maximization so that the definition is also valid for \(\beta<0\))
- \(V \gets 0^{\mathcal{S}}\)
- for \(t=H-1\) down to \(0\):
- for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}_\beta^*(s, V)\)
- \(V \gets \mathcal{T}_\beta^* V\)
- return \(\pi\) and \(V(s_0)\)
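Compared with standard value iteration only the backup changes; a hedged sketch with the same assumed arrays, where the \(\frac{1}{\beta}\log\) is kept inside the maximization so that the code is also valid for \(\beta<0\):

```python
import numpy as np
from scipy.special import logsumexp

def entropic_value_iteration(P, R, H, s0, beta):
    """beta-entropic backward induction (beta != 0); cf. Howard and Matheson (1972)."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    policy = np.zeros((H, nS), dtype=int)
    for t in range(H - 1, -1, -1):
        # Q[s, a] = (1/beta) log sum_{s'} P(s'|s,a) exp(beta (r(s,a,s') + V(s')))
        Q = logsumexp(beta * (R + V[None, None, :]), b=P, axis=2) / beta
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V[s0]
```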
What else?
Standard VI uses the linearity of expectation \[ V_t^*(s) = \max_{a\in\mathcal{A}} \sum_{s'\in\mathcal{S}} P(s'|s, a) \Big( r(s, a, s') + V_{t+1}^*(s') \Big)\]
entropic VI uses conditional indep. and \(e^{a+b} = e^a\,e^b\)
\[ V_{\beta,t}^*(s) = \max_{a\in\mathcal{A}} \frac{1}{\beta} \log \sum_{s'\in\mathcal{S}} P(s'|s, a) \exp\Big( \beta\big(r(s, a, s') + V_{\beta,t+1}^*(s')\big) \Big)\]
- the proof does not work for other utilities (e.g. quantiles)
Propagating the distribution
Src: [Bellemare, Dabney and Rowland. Distributional Reinforcement Learning. 2023.]
Propagating the distribution
Distributional value function \(\big(\mathcal{V}^{\pi}_t\big)_{0\leq t\leq H}\) of a policy \(\pi\): \[\mathcal{V}^{\pi}_t(s) = \mathbb{P}^\pi\left(\left.\sum_{k=t}^{H-1}r\big(S_k, \pi_k(S_k), S_{k+1}\big) \in\cdot\, \right| S_t = s\right) \in \mathcal{M}_1(\mathbb{R})\]
It can be computed recursively: for all \(s\in\mathcal{S}\),
\(\mathcal{V}^{\pi}_{H}(s) = \delta_0\)
\(\displaystyle{\mathcal{V}^{\pi}_{t}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s, \pi_t(s)\big)\big(\delta_{r(s, \pi_t(s), s')} * \mathcal{V}^{\pi}_{t+1}(s')\big)}\)
where \(*\) is the convolution on \(\mathcal{M}_1(\mathbb{R})\)
\(\mathcal{V}^{\pi}_{t}(s)\) is hence a mixture of Dirac measures
Example: Retail Store
Reference on Distributional RL
Operator formulation
Distributional Bellman operator \(\mathcal{T}^\pi: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) \[\mathcal{T}^{\pi}\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\pi(s)\big) \big( \delta_{r(s, \pi(s),s')} * \mathcal{V}(s')\big)\]
The law of the cumulated reward \(W_H\) under policy \(\pi\) can be computed recursively:
- \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
- for \(t=H-1\) down to \(0\):
- \(\mathcal{V}\gets \mathcal{T}^\pi \mathcal{V}\)
- return \(\mathcal{V}(s_0)\)
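Since each \(\mathcal{V}^{\pi}_t(s)\) is a finite mixture of Dirac measures, it can be stored exactly as a dictionary mapping atoms to probabilities; below is a minimal sketch of this recursion (the representation and names are my choices):

```python
from collections import defaultdict

def distributional_evaluation(P, R, policy, H, s0):
    """Propagate the law of the return: V_t(s) is a dict {atom: probability}."""
    nS = P.shape[0]
    V = [{0.0: 1.0} for _ in range(nS)]                 # V_H(s) = delta_0
    for t in range(H - 1, -1, -1):
        newV = []
        for s in range(nS):
            a = policy[t][s]
            mix = defaultdict(float)
            for s_next in range(nS):
                p = P[s, a, s_next]
                if p == 0.0:
                    continue
                for atom, q in V[s_next].items():       # shift by the reward, then mix
                    mix[round(R[s, a, s_next] + atom, 9)] += p * q
            newV.append(dict(mix))
        V = newV
    return V[s0]                                         # law of W_H from s0 under pi
```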
Idea: optimize any functional
- The greedy action for the functional \(\Psi\) is defined by \[\mathcal{G}^*(s, \mathcal{V}) = \mathop{\mathrm{argmax}}_{a\in\mathcal{A}} \Psi\left(\sum_{s'\in\mathcal{S}} P(s'|s,a) \big(\delta_{r(s, a,s')} * \mathcal{V}(s')\big) \right)\]
- The distributional Bellman optimal operator \(\mathcal{T}^*: \mathcal{M}_1(\mathbb{R})^{\mathcal{S}} \to \mathcal{M}_1(\mathbb{R})^{\mathcal{S}}\) for the functional \(\Psi\) is defined by \[\mathcal{T}^*\mathcal{V}(s) = \sum_{s'\in\mathcal{S}} P\big(s'|s,\mathcal{G}^*(s, \mathcal{V})\big) \big(\delta_{r(s, \mathcal{G}^*(s, \mathcal{V}),s')} * \mathcal{V}(s')\big)\]
Distrib-VI
- \(\mathcal{V}\gets \delta_0 {}^{\mathcal{S}}\)
- for \(t=H-1\) down to \(0\):
- for each \(s\in\mathcal{S}\), \(\pi_t(s) \gets \mathcal{G}^*(s, \mathcal{V})\)
- \(\mathcal{V}\gets \mathcal{T}^* \mathcal{V}\)
- return \(\pi\) and \(\mathcal{V}(s_0)\)
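With the same atom representation, distrib-VI for an arbitrary functional \(\Psi\) reads as follows; the \(\alpha\)-quantile used as \(\Psi\) is only an illustrative choice, and, as discussed next, this greedy scheme does not in general optimize such a criterion.

```python
from collections import defaultdict
import numpy as np

def mix_next(P, R, V, s, a):
    """Mixture over s' of delta_{r(s,a,s')} * V(s'), as a dict {atom: probability}."""
    mix = defaultdict(float)
    for s_next, p in enumerate(P[s, a]):
        if p == 0.0:
            continue
        for atom, q in V[s_next].items():
            mix[round(R[s, a, s_next] + atom, 9)] += p * q
    return dict(mix)

def quantile(dist, alpha=0.1):
    """Psi = alpha-quantile of the return distribution (illustrative choice)."""
    cum = 0.0
    for x in sorted(dist):
        cum += dist[x]
        if cum >= alpha:
            return x
    return max(dist)

def distributional_VI(P, R, H, s0, Psi=quantile):
    """Greedy backward induction on the return distributions for the functional Psi."""
    nS, nA, _ = P.shape
    V = [{0.0: 1.0} for _ in range(nS)]
    policy = np.zeros((H, nS), dtype=int)
    for t in range(H - 1, -1, -1):
        newV = []
        for s in range(nS):
            candidates = [mix_next(P, R, V, s, a) for a in range(nA)]
            a_star = max(range(nA), key=lambda a: Psi(candidates[a]))
            policy[t, s] = a_star
            newV.append(candidates[a_star])
        V = newV
    return policy, V[s0]
```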
What does it optimize??
Understanding what functionals distrib-VI can optimize
expectation and \(\beta\)-entropies are optimizable
of course, all monotone transforms of optimizable functionals are optimizable
some utilities, like quantiles, cannot be optimized this way
we assume that the decision criterion \(\Psi\) is such that if \(\nu\) is stochastically dominated by \(\nu'\), then \(\Psi(\nu)\leq \Psi(\nu')\).
Ingredient 1 : reduction to utilities
(Figure: return distributions \(\nu_1\) and \(\nu_2\) of actions 0 and 1, mixed with \(\nu'\); \(X\sim \nu_1\), \(Y\sim \nu_2\), \(Z\sim \nu'\).)
For all \(\lambda\in[0,1]\) and all \(\nu_1, \nu_2, \nu'\in\mathcal{M}_1(\mathbb{R})\), \[ \Psi(\nu_1) \leq \Psi(\nu_2) \implies \Psi\big(\lambda \nu_1 + (1-\lambda)\nu'\big) \leq \Psi\big(\lambda \nu_2 + (1-\lambda)\nu' \big)\]
Von Neumann - Morgenstern theorem: there exists \(f:\mathbb{R}\to\mathbb{R}\) such that \[\Psi(\nu_1)\leq \Psi(\nu_2) \iff \int f\,d\nu_1 \leq \int f\,d\nu_2\]
\(\implies\) wlog we focus on utilities \(\displaystyle{\Psi(\nu) = \int f\,d\nu}\)
Proof of the VNM utility theorem
by induction, if \(\forall i, \Psi(\nu_i)=\Psi(\nu'_i)\) then \(\Psi\big(\sum p_i \nu_i\big) = \Psi\big(\sum p_i \nu'_i\big)\)
assume that \(\operatorname{Supp}(\nu) \subset \big[\underline{x}, \overline{x}\big]\).
for every \(x\in\big[\underline{x}, \overline{x}\big]\), by the stochastic domination property there exists \(f(x)\in[0,1]\) such that \[\Psi(\delta_x) = \big(1-f(x)\big)\Psi(\delta_{\underline{x}}) + f(x)\Psi(\delta_{\overline{x}})\]
define \(u(\nu) = \int f\,d\nu\)
then \(\displaystyle{\Psi\big(\sum_i p_i\delta_{x_i}\big) = \Psi\Big(\sum_i p_i\Big(\big(1-f(x_i)\big)\delta_{\underline{x}} + f(x_i)\delta_{\overline{x}}\Big) \Big) = \Psi\Big(\big(1-u\big(\sum_i p_i\delta_{x_i}\big)\big)\delta_{\underline{x}} + u\big(\sum_i p_i\delta_{x_i}\big)\delta_{\overline{x}}\Big)}\)
hence \[\begin{align*} \Psi(\nu)\leq \Psi(\nu') &\iff \Psi\Big(\big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}}\Big)\leq \Psi\Big(\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\Big)\\ &\iff \big(1-u(\nu)\big)\delta_{\underline{x}} + u(\nu)\delta_{\overline{x}} \hbox{ is stochastically dominated by }\big(1-u(\nu')\big)\delta_{\underline{x}} + u(\nu')\delta_{\overline{x}}\\ & \iff u(\nu)\leq u(\nu') \end{align*}\]
Ingredient 2: invariance by translation
For all \(x\in\mathbb{R}\) and all random variables \(X,Y\) \[ \Psi(X) \leq \Psi(Y) \implies \Psi(x+X) \leq \Psi(x+Y)\]
Hence, it is necessary that \[ \Psi(X) = \Psi(Y) \implies \Psi(x+X) = \Psi(x+Y)\]
Mixing the ingredients
\(\Psi(\nu) = \int fd\nu\) with \(f\in\mathcal{C}^2(\mathbb{R})\) and \(f'(0)\neq 0\)
let, for \(h\) small enough, \(\eta(h) = f^{-1}\left(\frac{f(h)+f(-h)}{2}\right) = \frac{\beta}{2} h^2 + o(h^2)\quad~\) where \(\beta=\frac{f''(0)}{f'(0)}\)
for all \(x\in\mathbb{R}\), \(f\big(\eta(h)\big) = \frac{1}{2}f(-h) + \frac{1}{2}f(h)\) together with translation invariance implies \[\begin{align*} f(x + \eta(h)) &= \frac{1}{2} f(x+h) + \frac{1}{2} f(x-h)\\ f(x) + f'(x)\; \frac{\beta}{2} h^2 + o(h^2) &= f(x) + \frac{1}{2} f''(x)\; h^2 + o(h^2) \end{align*}\]
hence \(f'' = \beta f'\)
- if \(\beta=0\), \(f(x) = \alpha x + c\) \(\quad \implies\) expectation
- otherwise, \(f(x) = \alpha e^{\beta x}+c\) \(\quad \implies\) \(\beta\)-entropy
What can DistribRL optimize?
The only “smooth” functionals of the cumulated reward that can be optimized this way are the entropic utilities
…but we have seen that direct DP works for them :-|
Concluding remarks
Limit cases
\(\beta \to\infty\): maximize the highest cumulated reward reachable with positive probability
\(\beta \to -\infty\): maximize the lowest cumulated reward reachable with positive probability
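A quick numerical illustration (mine): on a finite sample, the empirical \(U_\beta\) interpolates between the minimum and the maximum as \(\beta\) ranges over \(\mathbb{R}\).

```python
import numpy as np

x = np.array([0.0, 1.0, 4.0, 10.0])          # possible cumulated rewards
for beta in (-50, -1, -0.01, 0.01, 1, 50):
    z = beta * x
    u = (np.log(np.mean(np.exp(z - z.max()))) + z.max()) / beta   # stabilized U_beta
    print(f"beta={beta:>7}: U_beta = {u:.3f}")  # -> min(x) as beta -> -inf, max(x) as beta -> +inf
```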
Distribution representation: many interesting problems
The possibilities offered by Distributional RL are not unlimited
Some gains in stability have been experimentally observed
\(\beta\)-entropies provide good surrogates for risk-sensitive RL
Q-learning for \(\beta\)-entropies works as well, see [V. S. Borkar. Q-learning for risk-sensitive control. 2002.]
Ongoing work: identifying the \(\beta\)-optimal policies for all \(\beta\) simultaneously
Possible interest in inverse RL