“A Gentle Start to Learning Theory”

A Formal Model – The Statistical Learning Framework

  1. The learner’s input:
    1. Domain set: an arbitrary set, \(\mathcal{X}\). This is the set of objects that we may wish to label
    2. Label set: the set of possible labels, \(\mathcal{Y}\); e.g., in the binary case, \(\mathcal{Y}=\{0,1\}\)
    3. Training data: \(S=\left(\left(x_{1}, y_{1}\right) \ldots\left(x_{m}, y_{m}\right)\right)\) is a finite sequence of pairs in \(\mathcal{X} \times \mathcal{Y}\)
  2. The learner’s output: The learner is requested to output a prediction rule, \(h : \mathcal{X} \rightarrow \mathcal{Y}\). This function is also called a predictor, a hypothesis, or a classifier
  3. How the data is generated: we assume the instances are distributed according to some probability distribution \(\mathcal{D}\) over \(\mathcal{X}\) and labeled by a target function \(f : \mathcal{X} \rightarrow \mathcal{Y}\); both are unknown to the learner. Each pair in the training data \(S\) is generated by first sampling a point \(x_{i}\) according to \(\mathcal{D}\) and then labeling it by \(f\), i.e., \(y_{i}=f(x_{i})\) (simulated in the sketch after this list).
  4. Measures of success: We define the error of a prediction rule, \(h : \mathcal{X} \rightarrow \mathcal{Y}\), to be: \(L_{(\mathcal{D}, f)}(h) \stackrel{\mathrm{def}}{=} \underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq f(x)] \stackrel{\mathrm{def}}{=} \mathcal{D}(\{x : h(x) \neq f(x)\})\). That is, the error of \(h\) is the probability of randomly choosing an example \(x\) for which \(h(x) \neq f(x)\).
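To make the framework concrete, here is a minimal sketch in Python. The domain \(\mathcal{X}=[0,1]\), the uniform distribution for \(\mathcal{D}\), and the threshold labeling function \(f\) are all illustrative assumptions, not part of the framework itself.

```python
import random

# Hypothetical setup: X = [0, 1], D = uniform on X, f a fixed threshold rule.
def f(x):
    """Assumed true labeling function: label 1 iff x >= 0.5."""
    return 1 if x >= 0.5 else 0

def sample_training_set(m, seed=0):
    """S = ((x_1, y_1), ..., (x_m, y_m)): x_i ~ D i.i.d., then y_i = f(x_i)."""
    rng = random.Random(seed)
    return [(x, f(x)) for x in (rng.random() for _ in range(m))]

S = sample_training_set(m=5)
print(S)  # five (instance, label) pairs
```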

Empirical Risk Minimization

A learning algorithm receives as input a training set \(S\), sampled from an unknown distribution \(\mathcal{D}\) and labeled by some target function \(f\), and should output a predictor \(h_{S} : \mathcal{X} \rightarrow \mathcal{Y}\) (the subscript \(S\) emphasizes that the output predictor depends on \(S\)). The goal of the algorithm is to find an \(h_{S}\) that minimizes the error with respect to the unknown \(\mathcal{D}\) and \(f\).

A useful notion of error that can be calculated by the learner is the training error – the error the classifier incurs over the training sample:

\[L_{S}(h) \stackrel{\mathrm{def}}{=} \frac{\left|\left\{i \in[m] : h\left(x_{i}\right) \neq y_{i}\right\}\right|}{m}\]
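Computing \(L_{S}(h)\) is a one-liner. The toy sample and constant predictor below are illustrative assumptions:

```python
def empirical_risk(h, S):
    """Training error L_S(h): the fraction of pairs in S that h mislabels."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Toy sample and a constant predictor h(x) = 0 (both assumed for illustration):
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
print(empirical_risk(lambda x: 0, S))  # 0.5
```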
  1. Overfitting problem: Example: Consider the following setup.

Assume that the probability distribution \(\mathcal{D}\) is such that instances are distributed uniformly within the gray square, and the labeling function, \(f\), determines the label to be 1 if the instance is within the inner blue square, and 0 otherwise. The area of the gray square is 2 and the area of the blue square is 1. Consider the following predictor:

\[h_{S}(x)=\begin{cases}y_{i} & \text{if } \exists i \in[m] \text{ s.t. } x_{i}=x \\ 0 & \text{otherwise}\end{cases}\]

That is, a point that appears in the training set is assigned its training label, and every other point is assigned 0. This predictor achieves zero training error, \(L_{S}(h_{S})=0\), yet its true error is large: since \(h_{S}\) predicts 0 on all unseen points while the blue square (label 1) occupies half of the gray square, \(L_{(\mathcal{D}, f)}(h_{S})=1/2\). A predictor whose training error is excellent but whose true error is poor is said to overfit. A quick simulation is sketched below.
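In this sketch the concrete coordinates are assumptions: the gray square is \([0,\sqrt{2}]^2\) (area 2) and the blue square is a unit square centered inside it (area 1):

```python
import math
import random

s = math.sqrt(2)                      # gray square [0, s]^2 has area 2
lo, hi = s / 2 - 0.5, s / 2 + 0.5     # blue unit square centered inside

def f(p):
    """Label 1 iff the point lies in the inner blue square."""
    x, y = p
    return 1 if lo <= x <= hi and lo <= y <= hi else 0

rng = random.Random(0)
draw = lambda: (rng.uniform(0, s), rng.uniform(0, s))  # a sample from D
S = [(p, f(p)) for p in (draw() for _ in range(50))]
memory = dict(S)

def h_S(p):
    """Memorizing predictor: the training label if p was seen, else 0."""
    return memory.get(p, 0)

train_err = sum(1 for p, y in S if h_S(p) != y) / len(S)
test = [(p, f(p)) for p in (draw() for _ in range(100_000))]
true_err = sum(1 for p, y in test if h_S(p) != y) / len(test)
print(train_err, true_err)  # 0.0 on the sample, but roughly 0.5 in truth
```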

Empirical Risk Minimization with Inductive Bias

The ERM rule might lead to overfitting. Rather than giving up on the ERM paradigm, we look for ways to rectify it: we will search for conditions under which there is a guarantee that ERM does not overfit.

A common solution is to apply the ERM learning rule over a restricted search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors.

This set is called a hypothesis class and is denoted by \(\mathcal{H}\). Each \(h \in \mathcal{H}\) is a function mapping from \(\mathcal{X}\) to \(\mathcal{Y}\). Formally,

\[\operatorname{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{argmin}}\, L_{S}(h)\]

We write \(\in\) instead of \(=\) because multiple hypotheses may achieve the minimum; the rule may return any one of them. A sketch of this rule appears below.
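A minimal sketch of the \(ERM_{\mathcal{H}}\) rule for a finite class; the class of threshold predictors on a grid is an illustrative assumption:

```python
def empirical_risk(h, S):
    """Training error L_S(h) over the sample S."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return some h in argmin_{h in H} L_S(h); ties are broken arbitrarily."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Assumed finite class: 11 threshold predictors h_t(x) = 1[x >= t], t on a grid.
H = [(lambda t: (lambda x: 1 if x >= t else 0))(i / 10) for i in range(11)]
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1)]
h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0 -- realizable, so ERM reaches zero error
```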

By restricting the learner to choosing a predictor from \(\mathcal{H}\), we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias. Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.

A fundamental question in learning theory is: over which hypothesis classes does \(ERM_{\mathcal{H}}\) learning not result in overfitting?

  1. Finite Hypothesis Classes: In this section, we show that if \(\mathcal{H}\) is a finite class, then \(ERM_{\mathcal{H}}\) will not overfit, provided it is based on a sufficiently large training sample (the size requirement depends on the size of \(\mathcal{H}\)).

We make the following simplifying assumption:

(The Realizability Assumption) There exists \(h^{\star} \in \mathcal{H}\) such that \(L_{(\mathcal{D}, f)}\left(h^{\star}\right)=0\). Note that this assumption implies that with probability 1 over random samples \(S\), where the instances of \(S\) are sampled according to \(\mathcal{D}\) and labeled by \(f\), we have \(L_{S}\left(h^{\star}\right)=0\).

Clearly, any guarantee on the error with respect to the underlying distribution, \(\mathcal{D}\), for an algorithm that has access only to a sample \(S\) should depend on the relationship between \(\mathcal{D}\) and \(S\). The common assumption in statistical machine learning is that the training sample \(S\) is generated by sampling points from the distribution \(\mathcal{D}\) independently of each other. Formally,

The i.i.d. assumption: The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution \(\mathcal{D}\). That is, every \(x_{i}\) in \(S\) is freshly sampled according to \(\mathcal{D}\) and then labeled according to the labeling function, \(f\). Intuitively, the training set \(S\) is a window through which the learner gets partial information about the distribution \(\mathcal{D}\) over the world and the labeling function, \(f\). The larger the sample gets, the more likely it is to reflect accurately the distribution and labeling used to generate it.

However, there is always some probability that the sampled training data happens to be very nonrepresentative of the underlying \(\mathcal{D}\). We will therefore want to get a sense of the probability of sampling a training set for which \(L_{(\mathcal{D}, f)}\left(h_{S}\right)\) is not too large. Usually, we denote the probability of getting a nonrepresentative sample by \(\delta\), and call \((1-\delta)\) the confidence parameter of our prediction.

On top of that, since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction: the accuracy parameter, commonly denoted by \(\epsilon\). We interpret the event \(L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\) as a failure of the learner, while if \(L_{(\mathcal{D}, f)}\left(h_{S}\right) \leq \epsilon\) we view the output of the algorithm as an approximately correct predictor.

Formally, let \(\left.S\right|_{x}=\left(x_{1}, \ldots, x_{m}\right)\) be the instances of the training set. We would like to upper bound:

\[\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\right\}\right)\]

Let \(\mathcal{H}_{B}\) be the set of “bad” hypotheses, that is,

\[\mathcal{H}_{B}=\left\{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h)>\epsilon\right\}\]

In addition, let

\[M=\left\{\left.S\right|_{x} : \exists h \in \mathcal{H}_{B}, L_{S}(h)=0\right\}\]

be the set of misleading samples.

Now, recall that we would like to bound the probability of the event \(L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\). Since the realizability assumption implies that \(L_{S}\left(h_{S}\right)=0\), the event \(L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\) can only happen if for some \(h \in \mathcal{H}_{B}\) we have \(L_{S}(h)=0\). In other words, this event will only happen if our sample is in the set of misleading samples, \(M\). Formally, we have shown that

\[\left\{\left.S\right|_{x} : L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\right\} \subseteq M\]

Note that we can rewrite \(M\) as

\[M=\bigcup_{h \in \mathcal{H}_{B}}\left\{\left.S\right|_{x} : L_{S}(h)=0\right\}\]

Hence, \(\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\right\}\right) \leq \mathcal{D}^{m}(M)=\mathcal{D}^{m}\left(\cup_{h \in \mathcal{H}_{B}}\left\{\left.S\right|_{x} : L_{S}(h)=0\right\}\right)\)

Next, we upper bound the right-hand side of the preceding equation using the union bound, a basic property of probabilities:

(Union Bound) For any two sets \(A\),\(B\) and a distribution \(\mathcal{D}\) we have \(\mathcal{D}(A \cup B) \leq \mathcal{D}(A)+\mathcal{D}(B)\)

Applying the union bound to the right-hand side of the previous equation yields \(\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\right\}\right) \leq \sum_{h \in \mathcal{H}_{B}} \mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{S}(h)=0\right\}\right)\)

The event \(L_{S}(h)=0\) is equivalent to the event \(\forall i, h\left(x_{i}\right)=f\left(x_{i}\right)\). Since the examples in the training set are sampled i.i.d. we get that

\[\begin{aligned} \mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{S}(h)=0\right\}\right) &=\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : \forall i, h\left(x_{i}\right)=f\left(x_{i}\right)\right\}\right) \\ &=\prod_{i=1}^{m} \mathcal{D}\left(\left\{x_{i} : h\left(x_{i}\right)=f\left(x_{i}\right)\right\}\right) \end{aligned}\]

For each individual sampling of an element of the training set we have \(\mathcal{D}\left(\left\{x_{i} : h\left(x_{i}\right)=y_{i}\right\}\right)=1-L_{(\mathcal{D}, f)}(h) \leq 1-\epsilon\)

Combining the preceding equations, and using the inequality \(1-\epsilon \leq e^{-\epsilon}\), we obtain that for every \(h \in \mathcal{H}_{B}\), \(\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{S}(h)=0\right\}\right) \leq(1-\epsilon)^{m} \leq e^{-\epsilon m}\)

We conclude that

\[\mathcal{D}^{m}\left(\left\{\left.S\right|_{x} : L_{(\mathcal{D}, f)}\left(h_{S}\right)>\epsilon\right\}\right) \leq\left|\mathcal{H}_{B}\right| e^{-\epsilon m} \leq|\mathcal{H}| e^{-\epsilon m}\]
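This bound can be probed numerically. The following Monte Carlo sketch uses assumed specifics not in the text: \(\mathcal{X}=[0,1]\), \(\mathcal{D}\) uniform, \(f\) the threshold at \(0.5\), and \(\mathcal{H}\) the 11 threshold predictors from before, so realizability holds. The observed failure frequency should stay below \(|\mathcal{H}| e^{-\epsilon m}\):

```python
import math
import random

thresholds = [i / 10 for i in range(11)]   # finite H: h_t(x) = 1[x >= t]

def true_err(t):
    """L_{(D,f)}(h_t) under D = uniform[0,1] and f = h_{0.5} is |t - 0.5|."""
    return abs(t - 0.5)

def erm_threshold(S):
    """Return a threshold minimizing the training error on S."""
    return min(thresholds,
               key=lambda t: sum(1 for x, y in S if (x >= t) != (y == 1)))

m, eps, trials = 50, 0.2, 2000
rng = random.Random(0)
failures = 0
for _ in range(trials):
    S = [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(m))]
    if true_err(erm_threshold(S)) > eps:
        failures += 1
print(failures / trials, "<=", len(thresholds) * math.exp(-eps * m))
```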

Putting it all together

Let \(\mathcal{H}\) be a finite hypothesis class. Let \(\delta \in(0,1)\) and \(\epsilon>0\), and let \(m\) be an integer that satisfies \(m \geq \frac{\log (|\mathcal{H}| / \delta)}{\epsilon}\)

Then, for any labeling function, \(f\), and for any distribution, \(\mathcal{D}\), for which the realizability assumption holds, with probability of at least \(1-\delta\) over the choice of an i.i.d. sample \(S\) of size \(m\), we have that for every ERM hypothesis, \(h_{S}\), it holds that \(L_{(\mathcal{D}, f)}\left(h_{S}\right) \leq \epsilon\)

This tells us that for a sufficiently large \(m\), the \(ERM_{\mathcal{H}}\) rule over a finite hypothesis class will be probably (with confidence \(1-\delta\)) approximately (up to an error of \(\epsilon\)) correct.
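For a concrete feel for the sample complexity, the required \(m\) can be computed directly; the numbers \(|\mathcal{H}|=1000\), \(\delta=0.01\), \(\epsilon=0.05\) are illustrative assumptions:

```python
import math

def sample_complexity(h_size, delta, eps):
    """Smallest integer m with m >= log(|H| / delta) / eps (natural log)."""
    return math.ceil(math.log(h_size / delta) / eps)

print(sample_complexity(1000, 0.01, 0.05))  # 231 examples suffice
```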
