This post is mainly based on Dr. Akshay Krishnamurthy's notes on contextual bandits.

In the multi-armed bandit setting, UCB tries to learn each arm's reward distribution: $r \sim N(\mu_i, \sigma_i^2)$.

In the linear bandit setting, LinUCB generalizes the per-arm reward distributions into a single regression model: $r \sim N( \langle\phi(a_i), \theta^* \rangle, \sigma^2 ) $.

However, the linear bandit still assumes a single reward function that is fixed over time, which may not be a good model for a system that interacts with many different users or in many different scenarios.

In the contextual bandit setting, the reward function changes depending on the context $x$. The learner observes the context $x$ and the action $a$ under which the reward $r$ is generated. For example, the path (a sequence of webpages) that a user traverses to reach the current webpage carries information about the user's identity. A contextual bandit can use this context $x$ to decide which ad $a$ to display in order to achieve the highest click-through rate $r$.

Contextual bandits take another step toward the reinforcement learning setting: the optimal action $a^*$ is now a function of the context/environment/state $x$. We will see later that contextual bandits also incorporate the policy and value function concepts. The difference is that a contextual bandit is a single-step MDP: the current action $a$ does not affect the next context/state $x$. This makes contextual bandits easier to study, since no Bellman backup is involved. The value function is also easier to learn, since the samples $x$ are generated i.i.d. (no distribution shift problem as in MDPs).

Settings

Consider a context space $\mathcal{X}$, an action space $\mathcal{A}$, a distribution $\mathcal{D}$ supported on $\mathcal{X}$, and a reward function $R: \mathcal{X} \times \mathcal{A} \rightarrow \Delta([0, 1])$ (here $\Delta([0,1])$ denotes the set of distributions over $[0,1]$). Let us define $f^* := (x, a) \mapsto \mathbb{E}_{r \sim R(x,a)}[r]$ to be the mean reward function. In round $t$:

  • Nature samples a context $x_t \sim \mathcal{D}$
  • The learner observes the context $x_t$ and chooses an action $a_t \in \mathcal{A}$
  • The learner observes a reward $r_t \sim R(x_t, a_t)$

The optimization goal is to maximize the cumulative reward, measured as $\sum_t r_t$ or $\sum_t f^*(x_t, a_t)$.
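For concreteness, a minimal simulation of this interaction protocol could look like the sketch below (Python; the Bernoulli reward table and the uniformly random learner are toy choices of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions, T = 5, 3, 1000

# Toy mean-reward table f*(x, a); realized rewards are Bernoulli with these means.
f_star = rng.uniform(size=(n_contexts, n_actions))

total_reward = 0.0
for t in range(T):
    x_t = rng.integers(n_contexts)           # nature samples a context x_t ~ D
    a_t = rng.integers(n_actions)            # the learner chooses an action (random here)
    r_t = rng.binomial(1, f_star[x_t, a_t])  # the learner observes r_t ~ R(x_t, a_t)
    total_reward += r_t                      # goal: maximize the cumulative reward

print(total_reward / T)                      # average reward of the random learner
```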

Similar to policy iteration and value iteration in reinforcement learning, there are two different learning settings for contextual bandits:

Policy Class

Given a policy class $\Pi$ of mappings $\mathcal{X} \rightarrow \mathcal{A}$, the learner directly learns a mapping from contexts to optimal actions. The optimization goal is to find the best policy:

\[\pi^* := \underset{\pi \in \Pi}{\operatorname{argmax}} \mathbb{E}_{x \sim \mathcal{D}}[ f^*(x, \pi(x)) ]\]

The regret is measured as:

\[\operatorname{Regret}(T, \Pi) = \sum_t f^*(x_t, \pi^*(x_t)) - \sum_t f^*(x_t, a_t)\]

Note that we make no assumptions on $\Pi$, $\mathcal{D}$, or $R$. If our choice of $\Pi$ is poor, even the optimal policy in $\Pi$ may not collect much reward (analogous to agnostic learning).
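As a concrete illustration, the regret against the best policy in $\Pi$ could be estimated as in the following sketch (Python; the function names are my own, and the population-optimal $\pi^*$ is approximated by the empirically best policy on the observed contexts):

```python
def empirical_regret(f_star, policies, contexts, actions_taken):
    """Estimate Regret(T, Pi): compare the actions taken against the best policy in Pi,
    approximating the population-optimal pi* by the empirically best policy."""
    pi_best = max(policies, key=lambda pi: sum(f_star(x, pi(x)) for x in contexts))
    return sum(f_star(x, pi_best(x)) - f_star(x, a)
               for x, a in zip(contexts, actions_taken))

# Toy usage: two constant policies, mean reward f*(x, a) = 1 if a == x else 0.
f_star = lambda x, a: float(a == x)
policies = [lambda x: 0, lambda x: 1]
print(empirical_regret(f_star, policies, contexts=[0, 1, 1], actions_taken=[0, 0, 0]))  # 1.0
```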

Value Function Class

Given a value function class $\mathcal{F}$ of functions $\mathcal{X} \times \mathcal{A} \rightarrow [0,1]$, the learner tries to learn the function in $\mathcal{F}$ that is closest to $f^*$.

Under the realizability assumption, $f^* \in \mathcal{F}$. Any value function $f$ induces a greedy policy

\[\pi_f : x \mapsto \operatorname{argmax}_a f(x,a)\]

Therefore, we can derive a policy class $\Pi_{\mathcal{F}}$ from $\mathcal{F}$ and measure regret using $\Pi_{\mathcal{F}}$ as in Policy Class above.
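A tiny sketch of this induced greedy policy (Python; the toy value function here is purely illustrative):

```python
# Toy value function over contexts x in {0, 1} and actions a in {0, 1, 2}.
f = {(x, a): (x + 1) * a / 3.0 for x in (0, 1) for a in (0, 1, 2)}

def greedy_policy(f, actions):
    """pi_f(x) = argmax_a f(x, a): the greedy policy induced by value function f."""
    return lambda x: max(actions, key=lambda a: f[(x, a)])

pi_f = greedy_policy(f, actions=(0, 1, 2))
print(pi_f(0), pi_f(1))  # both contexts pick action 2, the argmax of f(x, .)
```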

Linear contextual bandits

If we featurize both the context $x$ and the action $a$ into a feature vector $\phi(x,a) \in \mathbb{R}^d$, then, assuming realizability, we can still use LinUCB to learn the optimal $f^*$ from a linear value function class. Following an argument similar to the one for LinUCB, we can show an $O(d\sqrt{T})$ regret bound.
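A rough sketch of LinUCB with joint features $\phi(x,a)$ might look like the following (the feature map, the bonus scale `alpha`, and the ridge initialization are placeholder choices, not from the original notes):

```python
import numpy as np

def linucb_choose(phi, x_t, actions, A, b, alpha=1.0):
    """Pick argmax_a <phi(x,a), theta_hat> + alpha * ||phi(x,a)||_{A^{-1}}."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge-regression estimate of theta*
    ucb = lambda a: phi(x_t, a) @ theta_hat + alpha * np.sqrt(phi(x_t, a) @ A_inv @ phi(x_t, a))
    return max(actions, key=ucb)

def linucb_update(phi, x_t, a_t, r_t, A, b):
    """Rank-one update of the design matrix A and the response vector b."""
    v = phi(x_t, a_t)
    return A + np.outer(v, v), b + r_t * v

# Toy usage with a hypothetical joint feature map phi(x, a).
d = 4
A, b = np.eye(d), np.zeros(d)                    # ridge initialization: A = I, b = 0
phi = lambda x, a: np.array([x, a, x * a, 1.0])
a_t = linucb_choose(phi, x_t=0.5, actions=(0, 1, 2), A=A, b=b)
A, b = linucb_update(phi, 0.5, a_t, r_t=1.0, A=A, b=b)
```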

Oracle efficiency and reductions

As discussed in the Linear Bandit post, LinUCB suffers from a two-variable optimization problem when finding the optimal action $a_t$. Oracle-efficient algorithms take another approach: sampling $a_t$. Dr. Akshay Krishnamurthy commented that the oracle-efficient approach is also the most successful one for designing practically useful algorithms.

If we treat each policy as an arm, a variant of Exp3 (Exp4) can achieve $O( \sqrt{AT \log| \Pi |})$ regret. However, this requires maintaining and updating a weight for each policy, so the running time is $O(|\Pi|T )$, which prevents large policy classes $\Pi$ from being used.
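To see where the $O(|\Pi|)$ per-round cost comes from, here is a simplified Exp4-style sketch (Python; it omits the uniform-exploration mixing term and uses toy policies, so treat it as an illustration of the bookkeeping, not a faithful implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_policies, n_actions, eta = 100, 5, 0.1
weights = np.ones(n_policies)                      # one weight per policy: O(|Pi|) memory

def exp4_round(x_t, policies, observe_reward):
    """One round of an Exp4-style update; it touches every policy, hence O(|Pi|) time."""
    global weights
    recs = np.array([pi(x_t) for pi in policies])  # each policy recommends an action
    w = weights / weights.sum()
    probs = np.array([w[recs == a].sum() for a in range(n_actions)])
    a_t = rng.choice(n_actions, p=probs)           # sample an action from the mixture
    r_t = observe_reward(a_t)
    r_hat = np.where(recs == a_t, r_t / probs[a_t], 0.0)  # importance-weighted reward
    weights = weights * np.exp(eta * r_hat)        # exponential-weights update
    return a_t, r_t

# Toy usage: "policies" that map any context to a fixed action.
policies = [lambda x, a=i % n_actions: a for i in range(n_policies)]
a_t, r_t = exp4_round(x_t=0, policies=policies, observe_reward=lambda a: float(a == 2))
```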

Oracle-efficient - Policy Class approach [TBD]

Oracle-efficient - Value Function Class approach

The idea is to solve regression problems over the function class $\mathcal{F}$. Assume we have access to an online regression solver $\operatorname{SqAlg}$ with square-loss regret bound $\operatorname{Reg}_{\operatorname{Sq}}(T)$ (typically of order $\log(T)$). In each round $t$, the $\operatorname{SquareCB}$ algorithm proceeds as follows:

  • For each action $a$, get the prediction $\hat{y}_{t,a} := \hat{y}_t (x_t , a)$ from $\operatorname{SqAlg}$
  • Let \(b_t := \operatorname{argmax}_a \hat{y}_{t,a}\) be the best predicted action
  • Compute sampling probability:

For $a \not= b_t$, \(p_t(a) := \frac{1}{A + \gamma(\hat{y}_{t,b_t} - \hat{y}_{t,a})}\)

For $a = b_t$, \(p_t(b_t) := 1 - \sum_{a \not= b_t} p_t(a)\)

  • Sample $a_t \sim p_t$ , observe $r_t \sim R(x_t , a_t)$
  • Pass $((x_t , a_t), r_t)$ to the online regression solver $\operatorname{SqAlg}$
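Putting the pieces together, a minimal sketch of one SquareCB round might look like the following (the regression oracle is abstracted as `predict`/`update` callables, and the toy usage values are made up):

```python
import numpy as np

def inverse_gap_weighting(y_hat, gamma):
    """Map predicted rewards y_hat (one entry per action) to sampling probabilities p_t."""
    A = len(y_hat)
    b = int(np.argmax(y_hat))                    # best predicted action b_t
    p = 1.0 / (A + gamma * (y_hat[b] - y_hat))   # p_t(a) for a != b_t
    p[b] = 0.0
    p[b] = 1.0 - p.sum()                         # remaining mass goes to b_t
    return p

def squarecb_round(x_t, actions, predict, update, observe_reward, gamma, rng):
    y_hat = np.array([predict(x_t, a) for a in actions])  # predictions y_hat_{t,a} from SqAlg
    p = inverse_gap_weighting(y_hat, gamma)
    a_t = actions[rng.choice(len(actions), p=p)]           # sample a_t ~ p_t
    r_t = observe_reward(a_t)
    update(x_t, a_t, r_t)                                  # pass ((x_t, a_t), r_t) to SqAlg
    return a_t, r_t

# Toy usage with a constant oracle, purely for illustration.
rng = np.random.default_rng(0)
a_t, r_t = squarecb_round(
    x_t=0, actions=[0, 1, 2],
    predict=lambda x, a: 0.5 + 0.1 * a, update=lambda x, a, r: None,
    observe_reward=lambda a: 1.0, gamma=10.0, rng=rng)
```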

Interpretation

  • The action selection is based on a technique called “inverse gap weighting”
  • $p_t(a)$ starts from the uniform probability $\frac{1}{A}$ ($A$ is the number of arms), and the sampling probability is reduced according to $a$'s predicted reward gap
  • The predicted reward gap is the difference between the best predicted reward \(\hat{y}_{t,b_t}\) and the predicted reward \(\hat{y}_{t,a}\)
  • If $a$ has a large gap, it should be played infrequently, since we expect it to be quite suboptimal

Theorem: $\operatorname{SquareCB}$ regret bound

Setting $\gamma = \sqrt{AT /(\operatorname{Reg}_{\operatorname{Sq}}(T) + \log(2/\delta))}$, with probability $\geq 1-\delta$, $\operatorname{SquareCB}$ has the regret bound:

\[\operatorname{Regret}(T) \leq 4\sqrt{AT \operatorname{Reg}_{\operatorname{Sq}}(T)} + 8 \sqrt{AT \log(2/\delta)}\]

We will skip the proof since it is quite involved; interested readers can refer to the original paper or Dr. Krishnamurthy's notes.