This post is mainly based on

Online Learning

Consider

  • An input space $\mathcal{X}$
  • An output space $\mathcal{Y}$
  • A loss function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$
  • A function class $\mathcal{F}$ of functions $f: \mathcal{X} \rightarrow \mathcal{Y}$

We play a $T$-round game following this procedure: at each round $t$,

  • The adversary chooses $(x_t , y_t)$ based on all past information and reveals $x_t$ to the learner
  • The learner makes a prediction $\hat{y}_t$
  • The adversary reveals $y_t$ and the learner incurs loss $\ell(\hat{y}_t , y_t)$

The adversary may have information about the learner, including the learner's algorithm. The optimization goal is to minimize the cumulative loss $\sum_t \ell(\hat{y}_t , y_t)$.
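To make the protocol concrete, here is a minimal sketch in Python; the `adversary` and `learner` objects and their methods are illustrative placeholders, not part of any particular library.

```python
def play(adversary, learner, loss, T):
    """Run the T-round online learning game and return the learner's cumulative loss."""
    total_loss = 0.0
    for t in range(T):
        x_t, y_t = adversary.choose()   # adversary picks (x_t, y_t) using all past information
        y_hat = learner.predict(x_t)    # learner only sees x_t when predicting
        total_loss += loss(y_hat, y_t)  # y_t is revealed and the loss is incurred
        learner.update(x_t, y_t)        # learner may use the revealed label to adapt
    return total_loss
```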

We often refer to each function $f \in \mathcal{F}$ as an expert, and to the optimization problem of minimizing the cumulative loss as choosing the best expert.

Mistake Bound Setting

Previously, we discussed the mistake bound setting in this post, including the perfect expert, the halving algorithm (HA), noisy experts, and the weighted majority algorithm (WMA). The takeaways are

  • In the mistake bound setting, the loss function is binary, $\ell(\hat{y}_t , y_t) = \mathbb{1}\{ \hat{y}_t \not= y_t \}$, and the optimization goal is to minimize the number of mistakes over the $T$ rounds of the game
  • WMA makes its prediction by a weighted majority vote of the experts $f_i$ (sketched in code below)
  • Upon each mistake, WMA updates the weight of every expert $i$ that erred by the factor $\beta \in (0,1)$: $w_{t+1}(i) \leftarrow \beta w_t(i)$
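A rough sketch of WMA, assuming binary labels in $\{0,1\}$ and a fixed pool of experts whose predictions are given up front (the function and variable names below are my own):

```python
import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority Algorithm for binary labels in {0, 1}.

    expert_preds: (T, N) array; expert_preds[t, i] is expert i's prediction at round t
    labels:       (T,) array of true labels
    beta:         down-weighting factor in (0, 1) for experts that err
    """
    T, N = expert_preds.shape
    w = np.ones(N)
    mistakes = 0
    for t in range(T):
        # Predict 1 if the experts voting 1 carry at least half of the total weight.
        y_hat = 1 if w[expert_preds[t] == 1].sum() >= w.sum() / 2 else 0
        mistakes += int(y_hat != labels[t])
        # Penalize the experts that were wrong this round.
        w[expert_preds[t] != labels[t]] *= beta
    return mistakes, w
```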

General Setting

Compared to the mistake bound setting, the general setting relaxes the assumption from binary loss to bounded loss: for each expert, the loss lies in $[0,1]$. Accordingly, the optimization goal changes to minimizing the regret:

\[\operatorname{Regret}_T := \sum_t \ell(\hat{y}_t, y_t) - \operatorname{min}_{f \in \mathcal{F}} \sum_t \ell(f(x_t), y_t)\]

Ideally, we want a learning algorithm whose regret is sublinear in $T$. As in WMA, we maintain a weight on each expert to track how well it performs. The difference is the update rule, which now applies an exponential of the loss: $w_{t+1}(i) \leftarrow e^{-\eta \ell_t(i)} w_t(i)$.

Exponential Weights

Note that in this setting we can observe the losses of all experts, even those we did not choose. This is an important contrast with the multi-armed bandit setting, where we only observe the return of the arm we choose. To simplify the analysis, we make a few modifications to the setting:

  • Number the experts by $1, …, N$
  • Instead of computing loss by $\ell(\hat{y}_t , y_t)$ for each expert, the adversary now simply chooses a loss vector $\ell_t \in [0,1]^N$
  • The learner chooses a distribution $P_t \in \Delta(N)$ over the $N$ experts, where $\Delta(N)$ is the set of distributions over $N$ outcomes and $P_t$ induces a probability mass function $p_t$
  • The learner samples an expert from $p_t$ and outputs its prediction
  • The learner incurs loss $\langle p_t, \ell_t \rangle$ (which can be viewed as the expected loss under this sampling)
  • The best expert incurs cumulative loss $\operatorname{min}_{i} \sum_t \ell_t(i)$

The regret is measured as:

\[\operatorname{Regret}_T := \sum_t \langle p_t, \ell_t \rangle - \operatorname{min}_{i} \sum_t \ell_t(i)\]
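Concretely, given the adversary's loss vectors and the learner's distributions over experts, this regret can be computed as in the following sketch (the array names are illustrative):

```python
import numpy as np

def regret(P, L):
    """Regret over T rounds.

    P: (T, N) array; P[t] is the learner's distribution p_t over the N experts
    L: (T, N) array; L[t] is the adversary's loss vector ell_t in [0, 1]^N
    """
    learner_loss = np.sum(P * L)            # sum_t <p_t, ell_t>
    best_expert_loss = L.sum(axis=0).min()  # min_i sum_t ell_t(i)
    return learner_loss - best_expert_loss
```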

The Exponential Weights algorithm is also referred to as the “Hedge” algorithm. The procedure is as follows (a code sketch appears after the list):

  • Initialize
    • Set weights $w_1 = (1, …, 1)$
    • Set learning rate $\eta$
  • At round $t$
    • Set $p_t(i) = w_t(i) / Z_t$, where $Z_t$ is the normalization constant: $Z_t = \sum_i w_t(i)$
    • Incur loss $\langle p_t, \ell_t \rangle$
    • Update $w_{t+1}(i) \leftarrow e^{-\eta \ell_t(i)} w_t(i)$
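A minimal implementation sketch of this procedure (the function signature and names are my own, not a standard API):

```python
import numpy as np

def hedge(L, eta):
    """Exponential Weights / Hedge on a (T, N) array of loss vectors in [0, 1]^N."""
    T, N = L.shape
    w = np.ones(N)                      # w_1 = (1, ..., 1)
    P = np.zeros((T, N))                # distributions p_t played at each round
    losses = np.zeros(T)                # per-round losses <p_t, ell_t>
    for t in range(T):
        p = w / w.sum()                 # p_t(i) = w_t(i) / Z_t
        P[t] = p
        losses[t] = p @ L[t]            # incur loss <p_t, ell_t>
        w = w * np.exp(-eta * L[t])     # w_{t+1}(i) = exp(-eta * ell_t(i)) * w_t(i)
    return P, losses
```

In practice one would renormalize $w$ (or keep it in log space) to avoid underflow over long horizons, but the unnormalized form above matches the statement of the algorithm.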

Theorem (Exponential Weights Regret Bound)

Assume that $\ell_t \in [0, 1]^N$ for all $t$. Then for any $\eta \in (0, 1]$ we have

\[\operatorname{Regret}_T \leq \frac{\eta}{2} \sum_t \langle p_t, \ell_t^2 \rangle + \frac{\log(N)}{\eta}\]

With $\eta = \sqrt{2\log(N)/T}$, we obtain $\operatorname{Regret}_T \leq \sqrt{2T \log(N)}$

Proof

First observe that $w$ is a function of $\ell$,

\[\begin{align} w_{T+1}(i) &= \exp(-\eta \ell_T(i)) w_T(i) \\ &= \exp(-\eta \ell_T(i)) ... \exp(-\eta \ell_1(i))w_1(i) \\ &= \exp(\sum_t -\eta \ell_t(i)) \cdot 1 \end{align}\]

The normalization constant $Z_{T+1}$ can be expressed in terms of $\ell$ and lower-bounded using any single expert $i$,

\[\begin{align} Z_{T+1} &= \sum_j w_{T+1}(j) \\ Z_{T+1} &= \sum_j \exp(-\eta \sum_t \ell_t(j)) \\ \log(Z_{T+1}) &= \log[ \sum_j \exp(-\eta \sum_t \ell_t(j)) ]\\ &\geq \log[ \exp(-\eta \sum_t \ell_t(i)) ]\\ &= -\eta \sum_t \ell_t(i) \end{align}\]

Next, consider the ratio between $Z_{t+1}$ and $Z_t$,

\[\begin{align} \log \frac{Z_{t+1}}{Z_t} &= \log\left( \frac{ \sum_i w_t(i) \exp(-\eta\ell_t(i)) }{Z_t} \right) \\ &= \log\left( \sum_i p_t(i) \exp(-\eta\ell_t(i)) \right) \\ &\leq \log\left( 1 - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle \right) \\ &\leq - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle \\ \end{align}\]

The third line uses $e^{-x} \leq 1 - x + x^2/2$ for $x \geq 0$. This follows from the alternating Taylor expansion of $e^{-x}$: since $\eta \ell_t(i) \in [0,1]$, the terms decrease in magnitude, so truncating after the quadratic term (the first dropped term, $-x^3/6$, is negative) only increases the value. Summing $p_t(i) \cdot (1 - \eta\ell_t(i) + \eta^2\ell_t(i)^2/2)$ over $i$ and writing the sums as inner products gives the third line. The last line follows from $\log(1+x) \leq x$ for $x > -1$.

Combining the previous results with a telescoping argument (and noting $Z_1 = \sum_i w_1(i) = N$),

\[\begin{align} -\eta \sum_t \ell_t(i) &\leq \log(Z_{T+1}) \\ -\eta \sum_t \ell_t(i) &\leq \log(\frac{Z_{T+1}}{Z_T} ... \frac{Z_2}{Z_1} Z_1) \\ -\eta \sum_t \ell_t(i) &\leq \sum_t \log(\frac{Z_{t+1}}{Z_t}) + \log(Z_1) \\ -\eta \sum_t \ell_t(i) &\leq \sum_t[ - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle ] + \log(N)\\ -\sum_t \ell_t(i) &\leq \sum_t[ - \langle p_t, \ell_t \rangle + \frac{\eta}{2} \langle p_t, \ell_t^2 \rangle ] + \frac{\log(N)}{\eta}\\ \sum \langle p_t, \ell_t \rangle -\sum_t \ell_t(i) &\leq \frac{\eta}{2} \sum \langle p_t, \ell_t^2 \rangle + \frac{\log(N)}{\eta} \\ \operatorname{Regret}_T &\leq \frac{\eta}{2} \sum \langle p_t, \ell_t^2 \rangle + \frac{\log(N)}{\eta} \end{align}\]

Observe that the RHS of the inequality has the form $a\eta + b/\eta$ with $a = \frac{1}{2} \sum_t \langle p_t, \ell_t^2 \rangle$ and $b = \log(N)$. Setting $\frac{\partial }{\partial \eta}(a\eta + b/\eta) = 0$ gives the minimizer $\eta = \sqrt{b/a}$. Since $\ell_t \in [0,1]^N$ and each $p_t$ is a distribution, $\sum_t \langle p_t, \ell_t^2 \rangle \leq T$, so we take $a = T/2$ and obtain:

\[\eta = \sqrt{2\log(N)/T}\]

Substituting this $\eta$ into the bound on $\operatorname{Regret}_T$, we have

\[\operatorname{Regret}_T \leq \sqrt{2T \log(N)}\]
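For completeness, the arithmetic behind this substitution:

\[\begin{align} \frac{\eta}{2} T + \frac{\log(N)}{\eta} &= \frac{T}{2}\sqrt{\frac{2\log(N)}{T}} + \log(N)\sqrt{\frac{T}{2\log(N)}} \\ &= \sqrt{\frac{T\log(N)}{2}} + \sqrt{\frac{T\log(N)}{2}} \\ &= \sqrt{2T\log(N)} \end{align}\]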

The main result is that the regret grows sublinearly in $T$, and only logarithmically in the number of experts $N$.
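As a quick numerical sanity check (not a proof), we can run the `hedge` and `regret` sketches from above on random losses in $[0,1]$ and compare the realized regret with the bound $\sqrt{2T\log(N)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 10_000, 20
L = rng.uniform(size=(T, N))            # random losses stand in for an adversary
eta = np.sqrt(2 * np.log(N) / T)        # the learning rate from the theorem

P, _ = hedge(L, eta)                    # reuses the hedge() sketch defined earlier
print(regret(P, L), np.sqrt(2 * T * np.log(N)))  # realized regret vs. the theoretical bound
```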