This post is mainly based on

Boosting converts a set of weak learners into a strong learner. Under the PAC learning framework, the strength of a learner is quantified by two parameters: the accuracy parameter $\epsilon$ and the confidence parameter $\delta$.

Boosting confidence

Suppose we have a weak learner $A$ which can output a hypothesis $h$ with accuracy $1-\epsilon$ and confidence $1-\delta_0$. We want to improve the confidence to some higher value $1-\delta$ (with $\delta < \delta_0$). To improve confidence:

  • Step 1: Draw $k$ independent samples and run the weak learner $A$ on each to get $k$ hypotheses $h_1, \dots, h_k$, each achieving accuracy $1-\epsilon$ with probability $1-\delta_0$
  • Step 2: Output the hypothesis with the highest accuracy on a testing set $S$.

Intuitively, although a single run of $A$ may fail to produce a “good” hypothesis with probability $\delta_0$, it is unlikely that all $k$ hypotheses $h_1, \dots, h_k$ are “bad”. Standard concentration inequalities are enough to make this precise.

Sketch of proof

Using the weak learner $A$, each output hypothesis fails to achieve accuracy $1-\epsilon$ with probability at most $\delta_0$. Since the $k$ runs are independent, all $k$ hypotheses fail to achieve accuracy $1-\epsilon$ with probability at most $\delta_0^k$. Requiring $\delta_0^k \leq \delta/2$ gives $k \geq \frac{\log(2/\delta)}{\log(1/\delta_0)}$.
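Taking logarithms on both sides makes the requirement on $k$ explicit:

$$\delta_0^k \leq \frac{\delta}{2} \iff k \log\frac{1}{\delta_0} \geq \log\frac{2}{\delta} \iff k \geq \frac{\log(2/\delta)}{\log(1/\delta_0)}.$$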

Run the weak learner $k$ times to get hypotheses $h_1, …, h_k$. Now we need to determine how many samples to draw for the testing set $S$. Let $|S| = m$. Denote the empirical error of $h_i$ on the testing set $S$ as $e(h_i)$ and its true error as $e^*(h_i)$.

Note that $e(h_i)$ is only an estimate of $e^*(h_i)$. We can bound the estimation error $\gamma = | e^*(h_i) - e(h_i) |$ using Chernoff bounds. Since we are working with $k$ hypotheses, we bound each of them with confidence $1-\delta/(2k)$. By the union bound, all $k$ estimates are then within $\gamma$ of the true errors with probability at least $1-\delta/2$.

By Chernoff bounds, $P[\text{number of errors on } S > (\epsilon+\gamma)m] \leq \exp(-2m\gamma^2)$.

Solving $\exp(-2m\gamma^2) = \delta/(2k)$ for $m$, we get $m = \frac{\log(2k/\delta)}{2\gamma^2}$.

Step 1 and Step 2 each fail with probability at most $\delta/2$. Therefore, by the union bound, the total failure probability is at most $\delta$.
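A minimal sketch of this procedure in Python, assuming hypothetical interfaces `weak_learner()` (one run of $A$, returning a hypothesis as a callable) and `draw_test_samples(m)` (returning $m$ labeled pairs $(x, c(x))$):

```python
import math

def boost_confidence(weak_learner, draw_test_samples, delta, delta0, gamma):
    """Boost confidence from 1 - delta0 to 1 - delta (sketch).

    weak_learner()        -> hypothesis h, a callable x -> label  (assumed interface)
    draw_test_samples(m)  -> list of (x, c(x)) pairs               (assumed interface)
    """
    # Step 1: run A enough times that the probability that *all* k
    # hypotheses are "bad" is at most delta / 2.
    k = math.ceil(math.log(2 / delta) / math.log(1 / delta0))
    hypotheses = [weak_learner() for _ in range(k)]

    # Step 2: estimate each hypothesis's error on a fresh test set large
    # enough that every estimate is within gamma with probability 1 - delta/2.
    m = math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))
    test_set = draw_test_samples(m)

    def empirical_error(h):
        return sum(h(x) != y for x, y in test_set) / len(test_set)

    # Output the hypothesis with the lowest empirical error.
    return min(hypotheses, key=empirical_error)
```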

Takeaway

The number of samples required for PAC learning depends on $\delta$ only through an $O(\log(1/\delta))$ factor. Therefore, we often ignore $\delta$ in the analysis for simplicity.

Boosting accuracy

Boosting for accuracy is less straightforward. The idea of boosting was first proposed by Robert Schapire in his 1990 paper The Strength of Weak Learnability. We will call his construction “3-Stage Boosting”.

Boosting is based on the assumption that a weak learner $A_0$ is always able to extract some information from the sample (which is NOT true in general, as we will discuss later). We can run $A_0$ on different distributions $\mathcal{D}_i$ to construct hypotheses $h_i$ that are accurate on different parts of the population $X$. We can then combine these $h_i$ using some rule. The idea is that, by assigning each $h_i$ to the part of $X$ it is good at, we can construct a stronger learner $A_1$ with higher accuracy. Given $A_i$, we can iteratively construct $A_{i+1}$ that achieves even higher accuracy.

3-Stage Boosting

Consider a weak learner $A$ with advantage $\gamma$, i.e., it outputs $h$ with $P_{x \sim \mathcal{D}}[h(x) \not= c(x)] \leq 0.5 - \gamma$. We also assume an example oracle $\operatorname{EX}(c, \mathcal{D})$ from which we can simulate any other distribution on $X$. This is possible given only the one oracle, since we can filter its samples to simulate another distribution (shown later).

We will use an example to illustrate how 3-Stage Boosting works. Consider $\gamma = 0.1$ (so each weak hypothesis has error $0.4$), and measure subsets of $X$ by their probability mass, so $|X| = 100\%$.

Training

  • Step 1
    • Run $A$ on $\operatorname{EX}(c, \mathcal{D})$ and output $h_1$
    • $P_{x \sim \mathcal{D}}[h_1(x) \not= c(x)] = 0.4$
    • Define the set of examples $x$ where $h_1(x) = c(x)$ as $P_1$, $|P_1| = 60\%$
    • Define the set of examples $x$ where $h_1(x) \not= c(x)$ as $Q_1$, $|Q_1| = 40\%$
  • Step 2
    • Let $\mathcal{D}_2$ be the distribution
      • Probability of drawing from $P_1$ is $0.5$ (downscaled by $5/6$)
      • Probability of drawing from $Q_1$ is $0.5$ (upscaled by $5/4$)
    • To construct $\operatorname{EX}(c, \mathcal{D}_2)$ (see the rejection-sampling sketch after this list)
      • Flip a fair coin
      • If “H”, draw $x$ from $\operatorname{EX}(c, \mathcal{D})$ until $h_1(x) = c(x)$
      • If “T”, draw $x$ from $\operatorname{EX}(c, \mathcal{D})$ until $h_1(x) \not= c(x)$
    • Run $A$ on this new oracle and output $h_2$
      • $P_{x \sim \mathcal{D}_2}[h_2(x) \not= c(x)] = 0.4$
      • $P_{x \sim \mathcal{D}_2}[h_1(x) \not= c(x)] = 0.5$
  • Step 3
    • Let $\mathcal{D}_3$ be the distribution
      • Define the set of examples $x$ where $h_1(x) \not= h_2(x)$ as $R$
      • Probability of drawing from $R$ is $1$ (i.e., $\mathcal{D}_3$ is $\mathcal{D}$ restricted to $R$)
    • To construct $\operatorname{EX}(c, \mathcal{D}_3)$
      • Draw $x$ from $\operatorname{EX}(c, \mathcal{D})$ until $h_1(x) \not= h_2(x)$
    • Run $A$ on this new oracle and output $h_3$
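A minimal sketch of the filtered oracles via rejection sampling, assuming a hypothetical `draw_example()` that simulates $\operatorname{EX}(c, \mathcal{D})$ and returns a pair $(x, c(x))$:

```python
import random

def ex_d2(draw_example, h1):
    """Simulate EX(c, D_2): flip a fair coin, then keep drawing from
    EX(c, D) until the example lands on the chosen side of h1."""
    want_correct = random.random() < 0.5  # "H": h1 correct, "T": h1 incorrect
    while True:
        x, y = draw_example()
        if (h1(x) == y) == want_correct:
            return x, y

def ex_d3(draw_example, h1, h2):
    """Simulate EX(c, D_3): keep drawing from EX(c, D) until h1 and h2 disagree."""
    while True:
        x, y = draw_example()
        if h1(x) != h2(x):
            return x, y
```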

Inference

  • Given a sample $x$
  • If $h_1(x) = h_2(x)$, output the agreed value
  • Otherwise, output $h_3(x)$
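As code, the combined hypothesis is just the following (a sketch; `h1`, `h2`, `h3` are the hypotheses produced during training):

```python
def boosted_hypothesis(h1, h2, h3):
    """Combine h1, h2, h3: trust h1 and h2 when they agree,
    otherwise fall back to h3 (a majority vote of the three)."""
    def h(x):
        return h1(x) if h1(x) == h2(x) else h3(x)
    return h
```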

Claim: the boosted hypothesis $h$ satisfies $P_{x \sim \mathcal{D}}[h(x) \not= c(x)] = 0.352 < 0.4$

Analysis

Partition $X$ into 4 regions based on $h_1, h_2$

  • $R_1$: $h_1$ correct, $h_2$ correct
  • $R_2$: $h_1$ correct, $h_2$ incorrect
  • $R_3$: $h_1$ incorrect, $h_2$ incorrect
  • $R_4$: $h_1$ incorrect, $h_2$ correct

Let $\mathcal{D}_2(R_2) = \rho$

  • $\mathcal{D}_2(R_3) = 0.4 - \rho$
    • Reason: $\mathcal{D}_2(R_2 \cup R_3) = 0.4$ since $h_2$ is incorrect with probability $0.4$ on $\mathcal{D}_2$
  • $\mathcal{D}_2(R_4) = 0.1 + \rho$
    • Reason: $\mathcal{D}_2(R_3 \cup R_4) = 0.5$ since $h_1$ is incorrect with probability $0.5$ on $\mathcal{D}_2$
  • $\mathcal{D}(R_2) = \frac{6}{5}\mathcal{D}_2(R_2)$
    • Reason: $P_1 = R_1 \cup R_2$ is downscaled by $5/6$ from $\mathcal{D}$ to $\mathcal{D}_2$
  • $\mathcal{D}(R_i) = \frac{4}{5}\mathcal{D}_2(R_i)$ for $i = 3, 4$
    • Reason: $Q_1 = R_3 \cup R_4$ is upscaled by $5/4$ from $\mathcal{D}$ to $\mathcal{D}_2$

Error of the boosted hypothesis $h$ under $\mathcal{D}$: $\mathcal{D}(R_3) + 0.4\,(\mathcal{D}(R_2) + \mathcal{D}(R_4)) = 0.352$, because

  • on $R_3$, both $h_1$ and $h_2$ are wrong, so $h$ is wrong on 100% of this region
  • on $R_2 \cup R_4$, $h_1$ and $h_2$ disagree, so $h$ outputs $h_3$, which is wrong on 40% of this region (measured under $\mathcal{D}_3$)
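Substituting the region masses from above shows the result does not depend on $\rho$:

$$
\begin{aligned}
P_{x \sim \mathcal{D}}[h(x) \not= c(x)]
&= \mathcal{D}(R_3) + 0.4\,\bigl(\mathcal{D}(R_2) + \mathcal{D}(R_4)\bigr) \\
&= \tfrac{4}{5}(0.4 - \rho) + 0.4\left(\tfrac{6}{5}\rho + \tfrac{4}{5}(0.1 + \rho)\right) \\
&= 0.32 - 0.8\rho + 0.4\,(2\rho + 0.08) \\
&= 0.352.
\end{aligned}
$$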

Discussion

Boosting has limitations. Obviously, if boosting were as “magical” as described above, machine learning would basically be a solved problem: all you would need is to iteratively construct stronger learners from weak ones. The problem with this argument is the assumption that a weak learner $A_0$ is always able to extract some information from the sample. In other words, whatever data you feed to a weak learner, it can always gain some advantage $\gamma$. This is NOT true in practice, as there are “hard” examples on which a learner cannot achieve good accuracy. The advantage $\gamma$ erodes as the constructed learner becomes stronger. (A pessimistic view is that boosting is good at identifying hard examples, rather than at improving accuracy.)