This post is mainly based on

Decision making with experts

Consider the setting where we are trying to predict the binary outcome of a game. The problem is that we have no knowledge of the game, and thus cannot make an informed guess. However, we can observe the predictions of $N$ “experts” (who could be arbitrarily correlated, and can have any distribution over the correct outcome) and use their predictions to form our own. The goal is to make as few mistakes as possible.

Online learning

We will play this prediction game in an online setting. This is different from the “regular” machine learning setting, where we have access to a training set. In the online setting, we must construct the training set ourselves by interacting with the environment and collecting experience, and we must make decisions from the very beginning. Only as more trials arrive may we discover that our previous decisions were suboptimal.

Adversary

We may face different Adversaries in this game:

  • Omniscient Adversary
    • Adversary has full knowledge of the Learner and the Experts
    • Adversary’s goal is to maximize the Learner’s loss
    • It can pick the outcome after the Learner makes a prediction, as long as the outcome is consistent with the history
  • Oblivious Adversary
    • Adversary does not have full knowledge of the Learner
    • Adversary has to pick the sequence of trials and outcomes in advance
    • Allows us to work with expectations and obtain better bounds

At first glance this may seem an impossible task, since we cannot know who the best expert is until all trials are finished. However, we will see that by carefully maintaining a set of weights over the experts, we can still achieve promising results.

Halving Algorithm (HA)

Problem Setup

  • In each trial, each expert predicts 0 or 1
  • Learner predicts either 0 or 1 based on the experts’ predictions
  • Learner observes the outcome and suffers a loss if its prediction was incorrect
  • The best expert never makes a mistake

Algorithm

  • Learner maintains a pool of “active experts”
  • On each trial, Learner predicts with the majority vote of the active expert pool
  • Once a mistake is observed, all experts on the (incorrect) majority side are removed from the active expert pool; see the sketch after this list
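
To make the procedure concrete, here is a minimal Python sketch of HA. The function name and the input format (one tuple of expert predictions per trial, plus the sequence of true outcomes) are illustrative assumptions, not part of the original description.

```python
def halving_algorithm(expert_predictions, outcomes):
    """Halving Algorithm: predict with the majority of active experts;
    on a mistake, remove every active expert that predicted incorrectly."""
    n = len(expert_predictions[0])   # number of experts N
    active = set(range(n))           # all experts start active
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        votes_for_1 = sum(preds[i] for i in active)
        guess = 1 if 2 * votes_for_1 > len(active) else 0  # majority vote
        if guess != outcome:
            mistakes += 1
            # The majority side was wrong: keep only consistent experts.
            # (Under the assumption that the best expert never errs,
            # the pool never becomes empty.)
            active = {i for i in active if preds[i] == outcome}
    return mistakes
```

For example, `halving_algorithm([(0, 1, 1), (1, 1, 0)], [1, 0])` returns 1: with three experts, the majority is correct on the first trial and wrong on the second.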

The Learner makes at most $\log_2 N$ mistakes

Analysis

  • On each mistake, at least half of the experts in the active expert pool are removed
  • Since the pool at least halves on each mistake and the best expert is never removed, at most $\log_2 N$ removals are possible

Discussion

  • Unrealistic assumption: “best expert never makes a mistake”
  • Not noise tolerant: if noise exists, the best expert could make a mistake and be removed from the pool
  • Resilient to Omniscient Adversary
  • Randomized HA
    • Improves the expected number of mistakes to $\ln N$ against an Oblivious Adversary (proof can be found here)
    • However, Randomized HA is not resilient to an Omniscient Adversary

Weighted Majority Algorithm (WMA)

Problem Setup

  • In each trial, each expert predicts 0 or 1
  • Learner predicts either 0 or 1 based on the experts’ predictions
  • Learner observes the outcome and suffers a loss if its prediction was incorrect
  • The best expert makes at most $m$ mistakes

Algorithm

  • Learner maintains weights on all experts
    • Initialize expert $i$’s weight to $w_i = 1$
  • On each trial, Learner predicts with the weighted majority vote; see the sketch after this list
    • Each expert predicts $z_i \in \{0,1\}$
    • $q_0 = \sum_{i: z_i = 0} w_i$
    • $q_1 = \sum_{i: z_i = 1} w_i$
    • Predict 0 if $q_0 > q_1$, else predict 1
  • Once a mistake is observed, all experts on the majority side are downweighted
    • For every expert who predicted incorrectly, $w_i \leftarrow \beta w_i$
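
A minimal Python sketch of WMA, using the same illustrative input format as the HA sketch above. Following the description in this post, weights are updated only on trials where the Learner makes a mistake.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Weighted Majority Algorithm: predict with a weighted vote;
    on a mistake, multiply each wrong expert's weight by beta in [0, 1)."""
    n = len(expert_predictions[0])
    w = [1.0] * n                        # initialize all weights to 1
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        q0 = sum(w[i] for i in range(n) if preds[i] == 0)
        q1 = sum(w[i] for i in range(n) if preds[i] == 1)
        guess = 0 if q0 > q1 else 1      # weighted majority vote
        if guess != outcome:
            mistakes += 1
            # downweight every expert on the incorrect (majority) side
            w = [beta * w[i] if preds[i] != outcome else w[i]
                 for i in range(n)]
    return mistakes
```

Note that with $\beta = 0$, wrong experts are zeroed out entirely and this reduces to HA.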

Analysis

For any sequence of trials, given $N$ experts, suppose the best expert makes at most $m$ mistakes. Then WMA makes at most $M$ mistakes, where

\[M = \frac{\log N + m \log \frac{1}{\beta}}{\log \frac{2}{1 + \beta}}\]
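
For example, setting $\beta = \frac{1}{2}$ gives $M \leq \frac{m + \log_2 N}{\log_2 \frac{4}{3}} \approx 2.41\,(m + \log_2 N)$.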

Let $W = q_0 + q_1$ be the total weight; initially $W = N$. Since the decision is made by weighted majority vote, on each mistake at least $\frac{W}{2}$ of the weight is multiplied by $\beta$. Therefore the new total weight is at most $\beta\frac{W}{2} + \frac{W}{2} = \frac{1+\beta}{2}W$. After $M$ mistakes, the total weight satisfies $W \leq \left(\frac{1+\beta}{2}\right)^M N$.

The best expert makes at most $m$ mistakes, so $w_{\text{best expert}} \geq \beta^m$. Therefore $W \geq \beta^m$.

Combining the two inequalities above, we have $\beta^m \leq W \leq \left(\frac{1+\beta}{2}\right)^M N$; taking logarithms and solving for $M$ yields the bound stated above.
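
Making the last step explicit:

\[m \log \beta \leq M \log \frac{1+\beta}{2} + \log N \;\Longrightarrow\; M \log \frac{2}{1+\beta} \leq \log N + m \log \frac{1}{\beta} \;\Longrightarrow\; M \leq \frac{\log N + m \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}\]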

Discussion

  • Resilient to Omniscient Adversary
  • Randomized WMA (see the sketch after this list)
    • Improves the expected number of mistakes to $\frac{m\log \frac{1}{\beta} + \log N}{1-\beta}$ against an Oblivious Adversary (proof can be found here)
    • However, Randomized WMA is not resilient to an Omniscient Adversary
  • HA can be viewed as a special case of WMA where $\beta = 0$
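
A minimal Python sketch of Randomized WMA. This follows one common formulation (an assumption here, as the post does not spell it out): the Learner follows a single expert drawn with probability proportional to its weight on each trial, and every incorrect expert is downweighted after the outcome is revealed. With $\beta = 0$ this recovers Randomized HA.

```python
import random

def randomized_weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Randomized WMA: follow one expert sampled in proportion to its
    weight, then multiply every wrong expert's weight by beta."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # sample expert i with probability w[i] / sum(w)
        i = random.choices(range(n), weights=w)[0]
        if preds[i] != outcome:
            mistakes += 1
        # downweight every expert that predicted incorrectly this trial
        w = [beta * w[j] if preds[j] != outcome else w[j]
             for j in range(n)]
    return mistakes
```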

There are many variants of this game. (1) The prediction may not be binary: for example, we may be trying to select the stock with the highest return (assuming a stationary distribution). This turns the $N$-expert problem into a Bandit Problem. (2) In the bandit setting, when we pull one arm, we may not see the returns on the other arms (e.g., slot machines). (3) In the machine learning setting, the “experts” become weak learners, and we can use boosting to re-weight distributions and construct a strong learner from weak ones.

Some works classify all of the above algorithms as Multiplicative Weights Algorithms, as they all iteratively update weights on a set of experts / arms / distributions to construct an optimal learner. Intuitively, Multiplicative Weights Algorithms work by assigning higher weights to experts with higher payoffs. It is a simple idea that has been repeatedly rediscovered in diverse fields, including Machine Learning, Optimization, and Game Theory.

Readings