Markov Decision Process
This post is mainly based on Markov Decision Processes.
Environment
- State space $\mathcal{S}$
- Action space $\mathcal{A}$
- Reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \Delta([0, 1])$
- Transition operator / dynamics $P: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$
- Initial state distribution $\mu \in \Delta(\mathcal{S})$
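To make these objects concrete, here is a minimal sketch of a small tabular MDP in NumPy. The array names (`R`, `P`, `mu`) and the simplification to a deterministic reward $r(s,a)$ are my own choices for illustration, not notation from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Deterministic reward r(s, a) in [0, 1], a common simplification
# of the reward distribution R(s, a) above.
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Dynamics P(s' | s, a): each row P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Initial state distribution mu.
mu = np.full(n_states, 1.0 / n_states)

assert np.allclose(P.sum(axis=-1), 1.0) and np.isclose(mu.sum(), 1.0)
```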
Agent
- Trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, …)$
- History / set of all partial trajectories $\mathcal{H}$
- Non-Markov policy $\pi: \mathcal{H} \rightarrow \Delta(\mathcal{A})$
Markov policy
Finite horizon length $H$ (policy is time dependent):
\[\pi: \mathcal{S} \times [H] \rightarrow \Delta(\mathcal{A})\]
Infinite horizon (policy is stationary):
\[\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})\]
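In the tabular setting, both kinds of Markov policy can be stored directly as arrays of action distributions. A minimal sketch (array names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, H = 4, 2, 5

# Time-dependent policy pi[h, s] -> distribution over actions (finite horizon).
pi_finite = rng.dirichlet(np.ones(n_actions), size=(H, n_states))

# Stationary policy pi[s] -> distribution over actions (infinite horizon).
pi_stat = rng.dirichlet(np.ones(n_actions), size=n_states)

# Sampling an action at state s and step h:
s, h = 0, 2
a = rng.choice(n_actions, p=pi_finite[h, s])
```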
Total reward
Finite horizon:
\[J(\pi) := \mathbb{E}[ \sum_{h=0}^{H-1} r_h | s_0 \sim \mu, a_{0: H-1} \sim \pi ]\]
Infinite horizon:
\[J(\pi) := \mathbb{E}[ \sum_{h=0}^\infty \gamma^h r_h | s_0 \sim \mu, \pi ]\]
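One concrete way to estimate $J(\pi)$ is Monte Carlo: roll out trajectories under $\pi$ and average the discounted returns, truncating the infinite sum at a large horizon. A sketch, with the same tabular setup as above:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))                  # r(s, a)
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))  # P(s' | s, a)
mu = np.full(n_states, 1.0 / n_states)                       # initial dist
pi = rng.dirichlet(np.ones(n_actions), n_states)             # stationary policy

def rollout_return(T=200):
    """Sample one trajectory, return its truncated discounted return."""
    s = rng.choice(n_states, p=mu)
    G = 0.0
    for h in range(T):
        a = rng.choice(n_actions, p=pi[s])
        G += gamma**h * R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return G

J_hat = np.mean([rollout_return() for _ in range(2000)])
print(f"Monte Carlo estimate of J(pi): {J_hat:.3f}")
```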
Value function
Finite horizon
\[V_h^\pi(s) := \mathbb{E}[ \sum_{h' = h}^{H-1} r_{h'} | s_h = s, a_{h:H-1} \sim \pi ]\]
\[Q_h^\pi(s,a) := \mathbb{E}[ \sum_{h' = h}^{H-1} r_{h'} | s_h = s, a_h = a, a_{h+1:H-1} \sim \pi ]\]
Infinite horizon:
\[V^\pi(s) := \mathbb{E}[ \sum_{h = 0}^\infty \gamma^h r_h | s_0 = s, \pi ]\]
\[Q^\pi(s,a) := \mathbb{E}[ \sum_{h = 0}^\infty \gamma^h r_h | s_0 = s, a_0 = a, \pi ]\]
Bellman Equations
Finite horizon
\[V^{\pi}_{h}(s) = \mathbb{E}[r(s, a) + V^\pi_{h+1}(s') | a \sim \pi(s,h), s' \sim P(s, a)]\]
\[Q^{\pi}_{h}(s,a) = \mathbb{E}[r(s, a) + Q^\pi_{h+1}(s', a') | s' \sim P(s, a), a' \sim \pi(s', h+1)]\]
Infinite horizon
\[V^{\pi}(s) = \mathbb{E}[r(s,a) + \gamma V^{\pi}(s') | a \sim \pi(s), s' \sim P(s,a)]\]
\[Q^{\pi}(s,a) = \mathbb{E}[r(s,a) + \gamma Q^{\pi}(s', a') | s' \sim P(s,a), a' \sim \pi(s')]\]
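In the tabular infinite-horizon case, the Bellman equation for $V^\pi$ is a linear system $V^\pi = r^\pi + \gamma P^\pi V^\pi$ and can be solved exactly. A sketch (the einsum-based construction of $r^\pi$ and $P^\pi$ is my own):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), n_states)

# Policy-averaged reward r_pi(s) and transition matrix P_pi(s, s').
r_pi = np.einsum("sa,sa->s", pi, R)
P_pi = np.einsum("sa,sat->st", pi, P)

# Bellman equation V = r_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi follows from one Bellman backup on V^pi.
Q = R + gamma * np.einsum("sat,t->sa", P, V)
print(V)
```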
Bellman Optimality Equations
Pointwise optimal: $V^*$ and $Q^*$ are optimal at every state-step pair $(s, h)$ (finite horizon) or at every state $s$ (infinite horizon), not just in expectation over the initial distribution $\mu$.
Finite horizon
\[V_H^* = 0, \; \; V_h^*(s) = \underset{a}{\operatorname{max}} \left[ r(s,a) + \mathbb{E}_{s' \sim P(s,a)}[V_{h+1}^*(s')] \right]\]
\[Q_h^*(s,a) = r(s,a) + \mathbb{E}_{s' \sim P(s,a)}[ \underset{a'}{\operatorname{max}} Q_{h+1}^*(s', a') ]\]
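The finite-horizon optimality equations yield a direct algorithm: backward induction from $h = H$ down to $h = 0$, one Bellman backup per step. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, H = 4, 2, 5
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))

# Backward induction: V[H] = 0, then one Bellman backup per step.
V = np.zeros((H + 1, n_states))
Q = np.zeros((H, n_states, n_actions))
for h in reversed(range(H)):
    Q[h] = R + np.einsum("sat,t->sa", P, V[h + 1])  # Q*_h(s, a)
    V[h] = Q[h].max(axis=1)                         # V*_h(s) = max_a Q*_h(s, a)

print(V[0])  # optimal value at the first step
```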
Infinite horizon
\[V^*(s) = \underset{a}{\operatorname{max}} \left[ r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)}[V^*(s')] \right]\]
\[Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)}[V^*(s')]\]
Bellman operator $\mathcal{T}$
$\mathcal{T}$ is an operator on functions: it maps one function to another function.
\[\mathcal{T} f : (s,a) \mapsto r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)} [\underset{a'}{\operatorname{max}} f(s', a')]\]
For example, given a function $Q^*(s,a)$:
\[(\mathcal{T} Q^*)(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)} [\underset{a'}{\operatorname{max}} Q^*(s', a')]\]
Bellman optimality equations can be written as $Q^* = \mathcal{T} Q^*$, i.e. $Q^*$ is a fixed point of $\mathcal{T}$.
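Since $Q^*$ is the fixed point of $\mathcal{T}$ (and $\mathcal{T}$ is a contraction, see the lemma below), iterating $\mathcal{T}$ from any initial $Q$ converges to $Q^*$. A sketch of the operator and this Q-value iteration, with tabular arrays as before:

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))

def bellman_op(f):
    """T f: (s, a) -> r(s, a) + gamma * E_{s'}[max_a' f(s', a')]."""
    return R + gamma * P @ f.max(axis=1)

# Q-value iteration: iterate T until an (approximate) fixed point Q* = T Q*.
Q = np.zeros((n_states, n_actions))
while True:
    Q_next = bellman_op(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-10:
        break
    Q = Q_next

print(Q)  # approximately Q*
```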
Contraction Lemma
For any two Q-functions $f, f'$, we have:
\[\| \mathcal{T} f - \mathcal{T} f' \|_{\infty} \leq \gamma \| f - f' \|_{\infty}\]
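A quick numerical sanity check of the contraction property on random Q-functions (same tabular setup; this verifies the inequality empirically, it is not a proof):

```python
import numpy as np

rng = np.random.default_rng(6)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))

def bellman_op(f):
    return R + gamma * P @ f.max(axis=1)

# ||T f - T f'||_inf should never exceed gamma * ||f - f'||_inf.
for _ in range(1000):
    f, f2 = rng.normal(size=(2, n_states, n_actions))
    lhs = np.max(np.abs(bellman_op(f) - bellman_op(f2)))
    rhs = gamma * np.max(np.abs(f - f2))
    assert lhs <= rhs + 1e-12
```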
Policy Error Lemma
For any Q-function $f$ and its greedy policy $\pi_f(s) = \underset{a}{\operatorname{argmax}} f(s,a)$, we have:
\[J(\pi^*) \leq J(\pi_f) + \frac{2}{1-\gamma} \| f - Q^* \|_{\infty}\]
The sup-norm is defined as:
\[\|f\|_\infty = \underset{(s,a) \in \mathcal{S} \times \mathcal{A}}{\operatorname{sup}} |f(s,a)|\]
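To see the policy error lemma in action, one can perturb $Q^*$ into an approximate $f$, extract the greedy policy $\pi_f$, and compare the true gap against the bound. A sketch combining the pieces above (exact evaluation via the linear solve):

```python
import numpy as np

rng = np.random.default_rng(7)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)

def bellman_op(f):
    return R + gamma * P @ f.max(axis=1)

def J(pi):
    """Exact J(pi) for a deterministic policy pi: S -> A via linear solve."""
    idx = np.arange(n_states)
    r_pi, P_pi = R[idx, pi], P[idx, pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return mu @ V

# Compute Q* by Q-value iteration, then perturb it into an approximation f.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = bellman_op(Q_star)
f = Q_star + rng.normal(scale=0.05, size=Q_star.shape)

pi_star, pi_f = Q_star.argmax(axis=1), f.argmax(axis=1)
gap = J(pi_star) - J(pi_f)
bound = 2.0 / (1.0 - gamma) * np.max(np.abs(f - Q_star))
print(f"J(pi*) - J(pi_f) = {gap:.4f} <= {bound:.4f}")
```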