This post is mainly based on

Markov Decision Processes

Environment

  • State space $\mathcal{S}$
  • Action space $\mathcal{A}$
  • Reward function $R: \mathcal{S} \times \mathcal{A} \rightarrow \Delta([0, 1])$
  • Transition operator / dynamics $P: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$
  • Initial state distribution $\mu \in \Delta(\mathcal{S})$ (a small tabular sketch of these components follows this list)
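
A minimal sketch of these components for a small tabular MDP, using NumPy arrays; the sizes, seed, and names (`P`, `R`, `mu`) are illustrative assumptions rather than anything fixed by the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2                      # |S| and |A|, chosen arbitrarily

# Dynamics: P[s, a, s'] = probability of reaching s' from (s, a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)              # each P[s, a, :] is a distribution

# Reward: R maps (s, a) to a distribution over [0, 1]; for planning only the
# mean matters, so we store the expected reward r(s, a).
R = rng.random((n_states, n_actions))

# Initial state distribution mu.
mu = np.ones(n_states) / n_states
```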

Agent

  • Trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, …)$
  • History / set of all partial trajectories $\mathcal{H}$
  • Non-Markov policy $\pi: \mathcal{H} \rightarrow \Delta(\mathcal{A})$

Markov policy

Finite horizon length $H$ (policy is time dependent):

\[\pi(s, h): \mathcal{S} \times [H] \rightarrow \Delta(\mathcal{A})\]

Infinite horizon:

\[\pi(s): \mathcal{S} \rightarrow \Delta(\mathcal{A})\]
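
In the tabular setting, either kind of Markov policy is just an array of action probabilities. A small sketch, with illustrative shapes and names:

```python
import numpy as np

n_states, n_actions, H = 4, 2, 5        # illustrative sizes and horizon

# Stationary Markov policy (infinite horizon): pi[s, a] = Pr(a | s).
pi_stationary = np.ones((n_states, n_actions)) / n_actions   # uniform random

# Time-dependent Markov policy (finite horizon): pi[h, s, a] = Pr(a | s, h).
pi_finite = np.ones((H, n_states, n_actions)) / n_actions

# Sampling an action in state s = 2 (and at step h = 3 in the finite case):
rng = np.random.default_rng(0)
a = rng.choice(n_actions, p=pi_stationary[2])
a_h = rng.choice(n_actions, p=pi_finite[3, 2])
```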

Total reward

Finite horizon:

\[J(\pi) := \mathbb{E}[ \sum_{h=0}^{H-1} r_h | s_0 \sim \mu, a_{0: H-1} \sim \pi ]\]

Infinite horizon:

\[J(\pi) := \mathbb{E}[ \sum_{h=0}^\infty \gamma^h r_h | s_0 \sim \mu, a_{0:\infty} \sim \pi ]\]
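
Either objective can be estimated by Monte Carlo rollouts. The sketch below handles the discounted case for the tabular arrays defined earlier, truncating at a horizon `T`; with rewards in $[0, 1]$ the truncation error is at most $\gamma^T / (1 - \gamma)$. The function name and arguments are illustrative:

```python
import numpy as np

def estimate_return(P, R, mu, pi, gamma=0.9, n_episodes=1000, T=200, seed=0):
    """Monte Carlo estimate of J(pi) = E[sum_h gamma^h r_h] for a tabular MDP."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choice(n_states, p=mu)            # s_0 ~ mu
        ret, disc = 0.0, 1.0
        for _ in range(T):
            a = rng.choice(n_actions, p=pi[s])    # a_h ~ pi(s_h)
            ret += disc * R[s, a]                 # expected reward as the sample
            disc *= gamma
            s = rng.choice(n_states, p=P[s, a])   # s_{h+1} ~ P(s_h, a_h)
        total += ret
    return total / n_episodes
```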

Value function

Finite horizon

\[V_h^\pi(s) := \mathbb{E}[ \sum_{h' = h}^{H-1} r_{h'} | s_h = s, a_{h:H-1} \sim \pi ]\] \[Q_h^\pi(s,a) := \mathbb{E}[ \sum_{h' = h}^{H-1} r_{h'} | s_h = s, a_h = a, a_{h+1:H-1} \sim \pi ]\]

Infinite horizon:

\[V^\pi(s) := \mathbb{E}[ \sum_{h = 0}^\infty \gamma^h r_h | s_0 = s, \pi ]\] \[Q^\pi(s,a) := \mathbb{E}[ \sum_{h = 0}^\infty \gamma^h r_h | s_0 = s, a_0 = a, \pi ]\]
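
In the tabular, infinite-horizon case these expectations can be computed exactly rather than sampled: $V^\pi$ solves a linear system. A sketch, assuming the array layout used above:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact V^pi and Q^pi for a tabular, discounted MDP.

    V^pi solves (I - gamma * P_pi) V = r_pi, where
    P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a) and r_pi[s] = sum_a pi(a|s) r(s, a).
    """
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state dynamics under pi
    r_pi = np.einsum("sa,sa->s", pi, R)            # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)   # Q^pi(s,a) = r(s,a) + gamma E[V^pi(s')]
    return V, Q
```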

Bellman Equations

Finite horizon

\[V^{\pi}_{h}(s) = \mathbb{E}[r(s, a) + V^\pi_{h+1}(s') | a \sim \pi(s,h), s' \sim P(s, a)]\] \[Q^{\pi}_{h}(s,a) = \mathbb{E}[r(s, a) + Q^\pi_{h+1}(s', a') | s' \sim P(s, a), a' \sim \pi(s', h+1)]\]
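
These recursions give a direct evaluation algorithm: sweep backward from $h = H-1$ to $0$ with the boundary condition $V_H^\pi = 0$. A sketch under the same tabular assumptions, with a time-dependent policy `pi[h, s, a]`:

```python
import numpy as np

def evaluate_finite_horizon(P, R, pi, H):
    """V_h^pi and Q_h^pi via backward induction; V[H] = 0 is the boundary."""
    n_states, n_actions = R.shape
    V = np.zeros((H + 1, n_states))
    Q = np.zeros((H, n_states, n_actions))
    for h in reversed(range(H)):
        Q[h] = R + np.einsum("sat,t->sa", P, V[h + 1])   # r(s,a) + E[V_{h+1}^pi(s')]
        V[h] = np.einsum("sa,sa->s", pi[h], Q[h])        # average over a ~ pi(s, h)
    return V, Q
```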

Infinite horizon

\[V^{\pi}(s) = \mathbb{E}[r(s,a) + \gamma V^{\pi}(s') | a \sim \pi(s), s' \sim P(s,a)]\] \[Q^{\pi}(s,a) = \mathbb{E}[r(s,a) + \gamma Q^{\pi}(s', a') | s' \sim P(s,a), a' \sim \pi(s')]\]

Bellman Optimality Equations

Pointwise optimality: $V^*$ and $Q^*$ dominate the value of every policy at every state-time pair $(s, h)$ (finite horizon) or at every state $s$ (infinite horizon), e.g. $V^*(s) = \underset{\pi}{\operatorname{max}} \, V^\pi(s)$ for all $s$, not merely in expectation over $\mu$.

Finite horizon

\[V_H^* = 0, \; \; V_h^*(s) = \underset{a}{\operatorname{max}} \left[ r(s,a) + \mathbb{E}_{s' \sim P(s,a)}[V_{h+1}^*(s')] \right]\] \[Q_h^*(s,a) = r(s,a) + \mathbb{E}_{s' \sim P(s,a)}[ \underset{a'}{\operatorname{max}} Q_{h+1}^*(s', a') ]\]
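
The same backward sweep, with a max over actions in place of an average over $\pi$, computes the optimal values and a greedy optimal Markov policy. A sketch under the tabular assumptions above:

```python
import numpy as np

def optimal_finite_horizon(P, R, H):
    """V_h^*, Q_h^*, and a greedy (deterministic) Markov policy by backward induction."""
    n_states, n_actions = R.shape
    V = np.zeros((H + 1, n_states))                      # boundary: V_H^* = 0
    Q = np.zeros((H, n_states, n_actions))
    greedy = np.zeros((H, n_states), dtype=int)
    for h in reversed(range(H)):
        Q[h] = R + np.einsum("sat,t->sa", P, V[h + 1])   # r(s,a) + E[V_{h+1}^*(s')]
        greedy[h] = Q[h].argmax(axis=1)                  # pi_h^*(s) = argmax_a Q_h^*(s,a)
        V[h] = Q[h].max(axis=1)                          # V_h^*(s)  = max_a Q_h^*(s,a)
    return V, Q, greedy
```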

Infinite horizon

\[V^*(s) = \underset{a}{\operatorname{max}} \left[ r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)}[V^*(s')] \right]\] \[Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)}[V^*(s')]\]

Bellman operator $\mathcal{T}$

$\mathcal{T}$ is an operator on Q-functions: it maps one function $f$ to another function $\mathcal{T} f$.

\[\mathcal{T} f : (s,a) \mapsto r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)} [\underset{a'}{\operatorname{max}} f(s', a')]\]

For example, applied to the function $Q^*$:

\[(\mathcal{T} Q^*)(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(s,a)} [\underset{a'}{\operatorname{max}} Q^*(s', a')]\]

The Bellman optimality equation can therefore be written as $Q^* = \mathcal{T} Q^*$, i.e. $Q^*$ is a fixed point of $\mathcal{T}$.
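
For tabular Q-functions, $\mathcal{T}$ is a few lines of code; a sketch using the array layout from the earlier examples:

```python
import numpy as np

def bellman_operator(f, P, R, gamma=0.9):
    """(T f)(s, a) = r(s, a) + gamma * E_{s' ~ P(s,a)}[ max_{a'} f(s', a') ].

    f is any array of shape (n_states, n_actions); T f has the same shape,
    and Q^* is the unique fixed point T Q^* = Q^*.
    """
    return R + gamma * np.einsum("sat,t->sa", P, f.max(axis=1))
```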

Contraction Lemma

For any two Q-functions $f, f'$, we have,

\[|| \mathcal{T} f - \mathcal{T} f' ||_{\infty} \leq \gamma || f - f' ||_{\infty}\]
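
This is also why value iteration works: by the Banach fixed-point theorem, iterating $\mathcal{T}$ converges to $Q^*$ geometrically at rate $\gamma$. A self-contained numerical check on a random tabular MDP (sizes, seed, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

def T(f):
    """Bellman optimality operator for this tabular MDP."""
    return R + gamma * np.einsum("sat,t->sa", P, f.max(axis=1))

# Contraction: T shrinks the sup-norm distance between any two Q-functions
# by at least a factor of gamma.
f1 = rng.random((n_states, n_actions))
f2 = rng.random((n_states, n_actions))
assert np.abs(T(f1) - T(f2)).max() <= gamma * np.abs(f1 - f2).max() + 1e-12

# Consequence: iterating T from any starting point converges to Q^*,
# with error shrinking like gamma^k.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = T(Q)
print(np.abs(Q - T(Q)).max())   # ~0: Q is numerically the fixed point Q^*
```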

Policy Error Lemma

For any Q-function $f$ and its greedy policy $\pi_f$, defined by $\pi_f(s) = \underset{a}{\operatorname{argmax}} f(s,a)$, we have,

\[J(\pi^*) \leq J(\pi_f) + \frac{2}{1-\gamma} || f - Q^* ||_{\infty}\]

Here the sup-norm of a Q-function is defined as:

\[\|f\|_\infty = \underset{s \in \mathcal{S}, \, a \in \mathcal{A}}{\operatorname{sup}} |f(s,a)|\]
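
Both lemmas can be exercised numerically: compute $Q^*$ by iterating $\mathcal{T}$, perturb it to obtain an approximate $f$, act greedily with respect to $f$, and compare the resulting performance gap to the bound. A self-contained sketch (random MDP, arbitrary perturbation scale):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
mu = np.ones(n_states) / n_states

def T(f):
    """Bellman optimality operator."""
    return R + gamma * np.einsum("sat,t->sa", P, f.max(axis=1))

def J(pi):
    """J for a deterministic policy pi: S -> A, via exact evaluation."""
    idx = np.arange(n_states)
    P_pi, r_pi = P[idx, pi], R[idx, pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return mu @ V

# Q^* by iterating T to numerical convergence.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = T(Q_star)

# A perturbed Q-function f and its greedy policy pi_f.
f = Q_star + rng.normal(scale=0.05, size=Q_star.shape)
pi_f, pi_star = f.argmax(axis=1), Q_star.argmax(axis=1)

gap = J(pi_star) - J(pi_f)                        # J(pi^*) - J(pi_f) >= 0
bound = 2.0 / (1.0 - gamma) * np.abs(f - Q_star).max()
assert gap <= bound + 1e-9                        # the policy error lemma
print(gap, bound)
```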