This post is mainly based on

Previously we analyzed the Weighted Majority Algorithm using the Mistake Bound (MB) model. MB is a simple concept, but it is not always useful in practice. Criticisms of the MB model include:

  • Requires an exact solution
    • MB requires the algorithm to return the exact target concept
    • In practice, an approximately correct answer is adequate
  • No “confidence estimation”
    • MB does not tell you when to stop
    • In practice, a confidence estimate is preferred
  • Sampling procedure
    • MB requires the learner to output predictions from the beginning of training
    • In practice, a batch of data is offered for training
  • Too adversarial
    • MB assumes an omniscient adversary (one with full knowledge of the learner, able to adapt to the learner’s strategy)
    • In practice, many machine learning algorithms only assume sampling from a distribution

Probably Approximately Correct (PAC) learning was proposed by Leslie Valiant in the 1984 paper A Theory of the Learnable. PAC learning provides a framework for the analysis of machine learning models.

“Learning”

The main contribution of the PAC learning framework is that it formalizes the definition of “learning”. Consider a binary classification problem where data $x \in X$ are labeled by some function $ f^* $. A set of functions $f: X \rightarrow \{0,1\}$ mapping the domain $X$ to labels is called a concept class $\mathcal{C}$. We call the ground truth $ f^* $ the target concept $c$.

Ideally, we want to find the “Correct” model $c$ from $\mathcal{C}$, which has 0 error on $X$. However, $c$ could be computationally hard to find. Therefore, we relax the goal from finding the “Correct” model $c$ to finding an “Approximately Correct” model $h$, which has a low error on $X$. Since the model is learned from a sample $ \{ x_1, x_2, \dots \} $ where each $x_i \sim \mathcal{D}$ for some distribution $\mathcal{D}$ over $X$, it is possible that the sample $\{x_i\}$ is not representative of the population $X$. We cannot prevent this from happening, but we can limit it to a low probability, i.e., the model is “Probably Approximately Correct”.

For example, consider the threshold functions $f = x > a$ where $a \in \mathbb{R}$. The set of all threshold functions forms a concept class $\mathcal{C}$. Each function in $\mathcal{C}$ is called a concept. We are interested in finding an approximate solution to the target concept (e.g., $f = x > 1$) that labels our data.
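
To make this concrete, here is a minimal Python sketch of the setup. The choices below are illustrative assumptions, not part of the framework: $\mathcal{D}$ is taken to be uniform on $[-3, 3]$ and the target concept is $f = x > 1$.

```python
import random

def target_concept(x, a=1.0):
    """Target concept c: label x as 1 iff x > a (here a = 1)."""
    return int(x > a)

def draw_sample(m, seed=0):
    """Draw m points i.i.d. from a fixed distribution D (uniform on [-3, 3])."""
    rng = random.Random(seed)
    return [(x, target_concept(x)) for x in (rng.uniform(-3.0, 3.0) for _ in range(m))]

def learn_threshold(sample):
    """Return h(x) = 1[x > a_hat], with a_hat the largest negatively-labeled
    point seen; by construction h is consistent with every training example."""
    negatives = [x for x, y in sample if y == 0]
    a_hat = max(negatives) if negatives else float("-inf")
    return lambda x: int(x > a_hat)

h = learn_threshold(draw_sample(m=200))
# h agrees with the target on all 200 training points; PAC learning asks
# how close h is to c on fresh draws from D.
```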

The concept class $\mathcal{C}$ needs to be differentiated from the hypothesis class $\mathcal{H}$. The concept class $\mathcal{C}$ is the set of functions that contains the target concept $ f^* $ which labels the samples. The hypothesis class $\mathcal{H}$ is the class of functions used by the learning algorithm. The hypothesis class may not contain the target concept $ f^* $.

Settings

PAC learning assumptions

  • Sampling procedure
    • Each sample $x$ is drawn independently from some fixed distribution $\mathcal{D}$ over domain $X$
    • Not adversarial
  • “Batch model”
    • Learner gets a “training set” of labeled data $ (x, c(x)) $
    • An example oracle $\operatorname{EX}(c, \mathcal{D})$ (see the sketch after this list)
    • Each time you press the button, the example oracle yields one labeled example; drawing $m$ times gives the training set
  • Performance measure
    • Output hypothesis should have low generalization error
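
A minimal sketch of such an oracle follows; the class name and the choices of $c$ and $\mathcal{D}$ below are illustrative assumptions, not part of the formal definition.

```python
import random

class ExampleOracle:
    """EX(c, D): each call to draw() returns one labeled example (x, c(x)),
    with x sampled i.i.d. from a fixed distribution D."""

    def __init__(self, concept, dist, seed=0):
        self.concept = concept            # target concept c
        self.dist = dist                  # callable that samples x ~ D
        self.rng = random.Random(seed)

    def draw(self):
        x = self.dist(self.rng)
        return x, self.concept(x)

# Illustrative instantiation: D = uniform on [-3, 3], c = 1[x > 1].
oracle = ExampleOracle(concept=lambda x: int(x > 1.0),
                       dist=lambda rng: rng.uniform(-3.0, 3.0))
training_set = [oracle.draw() for _ in range(100)]   # a "batch" of m = 100 examples
```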

The generalization error is defined as,

\[\operatorname{error}_{\mathcal{D}}(h,c) = P_{x \sim \mathcal{D}}[h(x) \not= c(x)]\]
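
When $\mathcal{D}$ can be sampled, this probability can be approximated by Monte Carlo on fresh draws. A hedged sketch, reusing the illustrative threshold setup from above:

```python
import random

def estimate_error(h, c, dist, n=100_000, seed=1):
    """Monte Carlo estimate of error_D(h, c) = P_{x~D}[h(x) != c(x)]."""
    rng = random.Random(seed)
    mistakes = sum(h(x) != c(x) for x in (dist(rng) for _ in range(n)))
    return mistakes / n

# Example: c = 1[x > 1], h = 1[x > 0.9], D = uniform on [-3, 3].
# h and c disagree exactly on (0.9, 1], so the true error is 0.1 / 6.
err = estimate_error(h=lambda x: int(x > 0.9),
                     c=lambda x: int(x > 1.0),
                     dist=lambda rng: rng.uniform(-3.0, 3.0))
print(f"estimated error: {err:.4f} (true value is about {0.1 / 6:.4f})")
```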

PAC learning goal

  • Cannot expect the algorithm to achieve 0 error
    • $\mathcal{H}$ may not contain $c$
  • Cannot expect a guarantee (100% confidence) of achieving low error
    • Instead, the probability of outputting an $h$ with high generalization error should be low

Definition

An algorithm $A$ PAC learns $\mathcal{C}$ using hypothesis class $\mathcal{H}$ if:

  • $\forall c \in \mathcal{C}$
  • $\forall$ distribution $\mathcal{D}$ over $X$
  • $\forall \epsilon > 0, \forall \delta > 0$

$A$ is given $ \epsilon, \delta $ and access to the example oracle $\operatorname{EX}(c, \mathcal{D})$, and $A$ outputs a hypothesis $h \in \mathcal{H}$ s.t., with probability $\geq 1 - \delta$, $\operatorname{error}_{\mathcal{D}}(h,c) \leq \epsilon$.

Sample complexity $m$ = the number of examples drawn from the oracle; $m$ depends on $\epsilon$ and $\delta$.
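
As a sketch of the whole definition in action, the following hypothetical learner PAC learns the threshold class from the earlier example. The sample size $m = \lceil \ln(2/\delta)/\epsilon \rceil$ is a classical (slightly conservative) bound for this particular class, not part of the general definition.

```python
import math
import random

def pac_learn_threshold(draw, epsilon, delta):
    """PAC learner for H = {1[x > a] : a in R}: draw
    m = ceil(ln(2/delta) / epsilon) examples (a classical bound for this
    class) and return the threshold at the largest negative point."""
    m = math.ceil(math.log(2.0 / delta) / epsilon)
    sample = [draw() for _ in range(m)]
    negatives = [x for x, y in sample if y == 0]
    a_hat = max(negatives) if negatives else float("-inf")
    return (lambda x: int(x > a_hat)), m

rng = random.Random(0)

def draw():
    """One call to EX(c, D): c = 1[x > 1], D = uniform on [-3, 3]."""
    x = rng.uniform(-3.0, 3.0)
    return x, int(x > 1.0)

h, m = pac_learn_threshold(draw, epsilon=0.05, delta=0.05)
print(f"drew m = {m} examples")   # with prob >= 0.95, error_D(h, c) <= 0.05
```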

A PAC learning algorithm is efficient if

  • Sample complexity $\leq \operatorname{poly}(1/\epsilon, 1/\delta)$
  • Run time $\leq \operatorname{poly}(1/\epsilon, 1/\delta)$

Discussion

As previously mentioned, the main contribution of the PAC learning framework is that it offers a mathematically rigorous definition of what machine learning is. Equipped with this definition, we can ask questions about machine learning models. For example, (1) how many samples ($m$) are required to learn a model up to some accuracy with some confidence? (2) if we have a consistent hypothesis finder (CHF) which can output an $h$ that achieves 0 prediction error on $m$ samples, can we use it as a PAC learner?

The sample complexity of a PAC learning algorithm depends on many factors. One of them is the “complexity” of the hypothesis class $\mathcal{H}$. For a finite $\mathcal{H}$, we can bound the sample complexity in terms of $\log |\mathcal{H}|$. For an infinite $\mathcal{H}$, we need a new tool called the VC-dimension.
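
For the finite case, the standard union-bound argument makes this concrete, and it also answers question (2) above in the realizable case (where $\mathcal{H}$ contains $c$). A hypothesis with error greater than $\epsilon$ remains consistent with $m$ i.i.d. samples with probability at most $(1-\epsilon)^m$, so

\[P\big[\exists\, \text{consistent } h \text{ with } \operatorname{error}_{\mathcal{D}}(h,c) > \epsilon\big] \leq |\mathcal{H}|(1-\epsilon)^m \leq |\mathcal{H}| e^{-\epsilon m} \leq \delta \quad \text{whenever} \quad m \geq \frac{1}{\epsilon}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right)\]

Hence a CHF run on that many samples is a PAC learner.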

Other interesting topics include

  • Lower bound on PAC learning sample complexity
    • Given a concept class $\mathcal{C}$ with VC-dimension $d$
    • Any algorithm requires $\Omega(d/\epsilon)$ samples to PAC learn $\mathcal{C}$
  • Upper bound on PAC learning sample complexity
    • Sauer’s Lemma (a small sketch follows this list)
    • Given a concept class $\mathcal{C}$ with VC-dimension $d$
    • Given $m$ samples, the growth function is polynomial in $m$ when $m > d$
    • The sample complexity for a CHF is polynomial in $d$
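
To see the flavor of Sauer’s Lemma, its bound on the growth function, $\sum_{i=0}^{d} \binom{m}{i}$, can be computed directly; a small sketch:

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's Lemma: a class with VC-dimension d realizes at most
    sum_{i=0}^{d} C(m, i) distinct labelings of m points -- O(m^d),
    i.e., polynomial in m once m > d."""
    return sum(comb(m, i) for i in range(d + 1))

# For fixed d = 3 the bound grows polynomially in m,
# while the number of all possible labelings, 2^m, grows exponentially:
for m in [3, 10, 100, 1000]:
    print(m, sauer_bound(m, d=3))
```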
