Empirical Risk Minimization
This post is mainly based on
Supervised Learning
Consider
- An input space $\mathcal{X}$
- An output space $\mathcal{Y}$
- A loss function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$.
Consider a distribution $P$ over $(X,Y) \in \mathcal{X} \times \mathcal{Y}$, which generates the samples $\{ (X_i, Y_i) \}$. Define the risk as the expected loss between the predicted and ground-truth labels over $P$:
\[R(f) := \mathbb{E}_{(X,Y) \sim P} \; \ell (f(X), Y)\]Given some function class $\mathcal{F}$ of mappings $\mathcal{X} \rightarrow \mathcal{Y}$, the supervised learning task is to find a mapping $f \in \mathcal{F}$ that minimizes the risk. If we had access to $P$, we could find the optimal mapping $f^*$:
\[f^* = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim P} \; \ell (f(X), Y)\]The challenge is that we do not have access to $P$ (otherwise we would effectively have infinitely many samples), so directly learning $f^*$ is impossible. Suppose we learn a function $f$; then $f$ is a good mapping if:
\[|R(f^*) - R(f)| \leq \varepsilon\]PAC Learning
Under the PAC learning framework, the model class $\mathcal{F}$ is PAC-learnable, if:
- For all tolerance levels $\varepsilon$ and all confidence levels $\delta$,
- There exists some model selection algorithm that selects $\hat{f}$ from $n$ samples, with the property: with probability at least $1-\delta$, $|R(f^*) - R(\hat{f})| \leq \varepsilon$
If the sample size $n$ is polynomial in $1/\varepsilon$ and $1/\delta$, then $\mathcal{F}$ is efficiently PAC-learnable.
Empirical Risk Minimizer (ERM)
Previously we mentioned that we do not have access to $P$. In practice, we are often given a dataset $S = \{ (X_i, Y_i) \}_{i=1}^n$ sampled from $P$. Given $S$, we can find a function $f$ that minimizes the empirical risk $\hat{R}_n(f)$:
\[\hat{f}_{ERM} = \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}_n(f) = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i)\]The key question in statistical learning is: how well does $\hat{f}_{ERM}$ generalize to the ground truth distribution $P$?
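As a concrete sketch, here is ERM over a small finite function class of threshold classifiers with 0-1 loss. The data distribution and the function class are illustrative choices, not from the post:

```python
import random

random.seed(0)

# Toy distribution P: X uniform on [0, 1], Y = 1[X > 0.3] (noiseless).
def sample(n):
    data = []
    for _ in range(n):
        x = random.random()
        data.append((x, 1 if x > 0.3 else 0))
    return data

# Finite function class F: threshold classifiers f_t(x) = 1[x > t].
thresholds = [i / 20 for i in range(21)]  # |F| = 21

def zero_one_loss(y_pred, y):
    return 0 if y_pred == y else 1

def empirical_risk(t, data):
    # R_hat_n(f_t) = (1/n) * sum of losses on the dataset S.
    return sum(zero_one_loss(1 if x > t else 0, y) for x, y in data) / len(data)

def erm(data):
    # ERM: pick the f in F that minimizes the empirical risk on S.
    return min(thresholds, key=lambda t: empirical_risk(t, data))

S = sample(500)
t_hat = erm(S)
print(f"ERM threshold: {t_hat}, empirical risk: {empirical_risk(t_hat, S)}")
```

Since the toy labels are noiseless and the true threshold 0.3 lies in the grid, some threshold attains zero empirical risk here.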
Uniform Convergence Theorem, Finite $| \mathcal{F} |$
Suppose $\forall y, y' \in \mathcal{Y}$ we have $\ell(y, y') \in [0,B]$ and $| \mathcal{F} | < \infty$. Given $n$ samples, for all $\delta \in (0,1)$, with probability at least $1-\delta$,
\[\forall f \in \mathcal{F}: | \hat{R}_n(f) - R(f) | \leq \sqrt{\frac{B^2 \log(2| \mathcal{F} |/\delta)}{2n}}\]Consequently,
\[R(\hat{f}_{ERM}) \leq R(f^*) + \sqrt{\frac{2B^2 \log(2| \mathcal{F} |/\delta)}{n}}\]Proof
Fix a single function $f \in \mathcal{F}$. By definition of the empirical risk, $\hat{R}_n(f) = \frac{1}{n} \sum_i \ell(f(X_i), Y_i)$. Since $(X_i, Y_i) \sim P$, $\mathbb{E}[\ell(f(X_i), Y_i)] = R(f)$. Since the loss is bounded in $[0,B]$, Hoeffding's inequality gives:
\[P( | \hat{R}_n(f) - R(f) | \geq \varepsilon ) \leq 2\exp(\frac{-2n\varepsilon^2}{B^2})\]Note that the above bound describes the failure probability for a single $f$. We obtain the failure probability over all $f$ by a union bound:
\[P( \exists f \in \mathcal{F}: | \hat{R}_n(f) - R(f) | \geq \varepsilon ) \leq \sum_{f \in \mathcal{F}} P( | \hat{R}_n(f) - R(f) | \geq \varepsilon ) = 2|\mathcal{F}|\exp(\frac{-2n\varepsilon^2}{B^2})\]Setting $\delta = 2 |\mathcal{F} |\exp(\frac{-2n\varepsilon^2}{B^2})$ and solving for $\varepsilon$ yields the first equation.
Let $\Delta = \sqrt{\frac{B^2 \log(2 | \mathcal{F} |/\delta)}{2n}}$, the second equation is obtained by:
\[\begin{align} R(\hat{f}_{ERM}) &\leq \hat{R}_n(\hat{f}_{ERM}) + \Delta \\ &\leq \hat{R}_n(f^*) + \Delta \\ &\leq R(f^*) + 2\Delta \\ \end{align}\]The first and last lines follow from applying the first equation. The second line holds because $\hat{f}_{ERM}$ is the minimizer of $\hat{R}_n$ on the dataset $S$.
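The Hoeffding step in the proof can be sanity-checked numerically. A minimal Monte Carlo sketch, where the losses, $B$, $n$, and $\varepsilon$ are all illustrative choices:

```python
import math
import random

random.seed(0)

B, n, eps = 1.0, 100, 0.2
true_mean = 0.5   # R(f) when each loss is Bernoulli(0.5), valued in [0, B]
trials = 10_000

# Hoeffding bound on P(|R_hat_n(f) - R(f)| >= eps) for one fixed f.
bound = 2 * math.exp(-2 * n * eps**2 / B**2)

failures = 0
for _ in range(trials):
    losses = [random.random() < true_mean for _ in range(n)]  # 0/1 losses
    r_hat = sum(losses) / n          # empirical risk of this sample
    if abs(r_hat - true_mean) >= eps:
        failures += 1

freq = failures / trials
print(f"empirical failure rate {freq:.5f} <= Hoeffding bound {bound:.5f}")
```

The observed deviation frequency stays below the bound, as the inequality guarantees.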
There is another version of the Uniform Convergence Theorem based on the VC-dimension of $ \mathcal{F} $; interested readers can refer to this slide.
Interpretation
- The first equation and the second equation describe different things
- Let $\Delta = \sqrt{\frac{B^2 \log(2 | \mathcal{F} |/\delta)}{2n}}$
- The first equation: for any $f$, our estimate of the risk (the empirical risk) deviates from the true risk by at most $\Delta$
- The second equation: the true risk of $\hat{f}_{ERM}$ deviates from that of the best $f^*$ by at most $2\Delta$
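Both deviations are easy to evaluate numerically. A minimal helper, where the values of $B$, $|\mathcal{F}|$, $\delta$, and $n$ are illustrative choices:

```python
import math

def uniform_convergence_delta(B, size_F, delta, n):
    """Delta = sqrt(B^2 log(2|F|/delta) / (2n)): the uniform deviation of
    empirical risk from true risk over a finite class F, w.p. >= 1 - delta."""
    return math.sqrt(B**2 * math.log(2 * size_F / delta) / (2 * n))

# Illustrative numbers: B = 1, |F| = 1000, 95% confidence, n = 10,000 samples.
d = uniform_convergence_delta(B=1.0, size_F=1000, delta=0.05, n=10_000)
print(f"risk estimates within {d:.4f}; excess risk of ERM within {2*d:.4f}")
```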
Define
- Approximation error as $R(f^*)$
- Approximation error is the loss incurred with the optimal function $f^*$
- Approximation error is bias, what we pay for choosing a simple function class
- Estimation error as $R(\hat{f}_{ERM}) - R(f^*)$
- Estimation error is the loss incurred because we cannot find the best function $f^*$ in the function class $\mathcal{F}$
- Estimation error is variance, what we pay for choosing a complex function class
Bias-variance tradeoff
- If we choose a simple function class
- Approximation error \(R(f^*)\) is high, Estimation error $R(\hat{f}_{ERM}) - R(f^*)$ is low
- If we choose a complex function class
- Approximation error \(R(f^*)\) is low, Estimation error $R(\hat{f}_{ERM}) - R(f^*)$ is high
- Estimation error $R(\hat{f}_{ERM}) - R(f^*) \leq \sqrt{\frac{2B^2 \log(2 | \mathcal{F} |/\delta)}{n}}$ (with probability at least $1-\delta$), which grows with the size of the function class $| \mathcal{F} |$
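Note that the estimation-error bound grows only logarithmically in $|\mathcal{F}|$, so enlarging the function class is relatively cheap in terms of the bound. A quick sketch with illustrative parameter values:

```python
import math

def estimation_bound(size_F, B=1.0, delta=0.05, n=10_000):
    # 2*Delta = sqrt(2 B^2 log(2|F|/delta) / n), the excess-risk bound of ERM.
    return math.sqrt(2 * B**2 * math.log(2 * size_F / delta) / n)

# Multiplying |F| by 100 each time only nudges the bound up.
for size_F in (10, 1_000, 100_000, 10_000_000):
    print(f"|F| = {size_F:>10,}  estimation bound = {estimation_bound(size_F):.4f}")
```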
Efficient PAC
ERM is an efficient PAC-learning algorithm for finite $| \mathcal{F} |$. From the second equation of the Uniform Convergence Theorem, we can solve for the sample size $n$ needed to reach tolerance $\varepsilon$:
\[n = 2B^2\frac{1}{\varepsilon^2} \log \frac{2|\mathcal{F}|}{\delta}\]Since $n$ is polynomial in $1/\varepsilon$ and $1/\delta$, ERM is an efficient PAC algorithm.
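The sample-complexity formula is easy to evaluate. A small helper, where the specific $\varepsilon$, $\delta$, and $|\mathcal{F}|$ values are illustrative:

```python
import math

def sample_complexity(eps, delta, size_F, B=1.0):
    """Samples needed so that ERM is eps-optimal with prob. >= 1 - delta:
    n = (2 B^2 / eps^2) * log(2|F|/delta)."""
    return math.ceil(2 * B**2 / eps**2 * math.log(2 * size_F / delta))

# E.g. eps = 0.05, delta = 0.01, |F| = 10,000 classifiers.
n_needed = sample_complexity(eps=0.05, delta=0.01, size_F=10_000)
print(f"samples needed: {n_needed}")
```

Note that $n$ depends on $1/\delta$ only through a logarithm, so very high confidence is cheap; the $1/\varepsilon^2$ factor dominates.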
Discussion
The Uniform Convergence Theorem offers a surprising result: if the training set is generated by random sampling, we are guaranteed to learn a model that generalizes well using ERM. The sample complexity and convergence of the model are independent of the complexity of the input space $\mathcal{X}$; rather, they depend on the complexity of the function class $\mathcal{F}$.