## $\gamma$-Weak-Learnability

1. A learning algorithm, A, is a $\gamma$-weak-learner for a class $H$ if there exists a function $m_{\mathcal{H}} :(0,1) \rightarrow \mathbb{N}$ such that for every $\delta \in(0,1)$, for every distribution $D$ over $X$ , and for every labeling function $f : \mathcal{X} \rightarrow\{ \pm 1\}$, if the realizable assumption holds with respect to $H$,$D$,$f$, then when running the learning algorithm on $m \geq m_{\mathcal{H}}(\delta) \text { i.i.d. }$ examples generated by $D$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta, L_{(\mathcal{D}, f)}(h) \leq 1 / 2-\gamma$
2. A hypothesis class $H$ is $\gamma$-weak-learnable if there exists a $\gamma$-weak-learner for that class.

This definition is almost identical to the definition of PAC learning, which here we will call strong learning, with one crucial difference: Strong learnability implies the ability to find an arbitrarily good classifier (with error rate at most $\epsilon$ for an arbitrarily small $\epsilon>0$). In weak learnability, however, we only need to output a hypothesis whose error rate is at most $1 / 2-\gamma$, namely, whose error rate is slightly better than what a random labeling would give us. The hope is that it may be easier to come up with efficient weak learners than with efficient (full) PAC learners.

AdaBoost (short for Adaptive Boosting) is an algorithm that has access to a weak learner and finds a hypothesis with a low empirical risk. The AdaBoost algorithm receives as input a training set of examples $S=\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{m}, y_{m}\right)$ where for each $i, y_{i}=f\left(\mathbf{x}_{i}\right)$ for some labeling function $f$. The boosting process proceeds in a sequence of consecutive rounds. At round $t$, the booster first defines a distribution over the examples in $S$, denoted $\mathbf{D}^{(t)}$. That is, $\mathbf{D}^{(t)} \in \mathbb{R}_{+}^{m}$ and $\sum_{i=1}^{m} D_{i}^{(t)}=1$. Then, the booster passes the distribution $\mathbf{D}^{(t)}$ and the sample $S$ to the weak learner. The weak learner is assumed to return a “weak” hypothesis, $h_{t}$, whose error,

[\epsilon_{t} \stackrel{\mathrm{def}}{=} L_{\mathbf{D}^{(t)}}\left(h_{t}\right) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{m} D_{i}^{(t)} \mathbb{1}{\left[h{t}\left(\mathbf{x}{i}\right) \neq y{i}\right]}]

is at most $\frac{1}{2}-\gamma$

The pseudocode of AdaBoost is presented in the following.

Here is a very detailed example: LINK

In next section, the theorem shows that the training error of the output hypothesis decreases exponentially fast with the number of boosting rounds.

## upper bound of adaboost error

Theorem: Let $S$ be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which $\epsilon_{t} \leq 1 / 2-\gamma$. Then, the training error of the output hypothesis of AdaBoost is at most

[L_{S}\left(h_{s}\right)=\frac{1}{m} \sum_{i=1}^{m} \mathbb{l}{\left[h{s}\left(\mathbf{x}{i}\right) \neq y{i}\right]} \leq \exp \left(-2 \gamma^{2} T\right)]

Proof:

For each $t$, denote $f_{t}=\sum_{p \leq t} w_{p} h_{p}$. Therefore, the output of AdaBoost is $f_{T}$ . In addition, denote

[Z_{t}=\frac{1}{m} \sum_{i=1}^{m} e^{-y_{i} f_{t}\left(x_{i}\right)}]

Note that for any hypothesis we have that $\mathbb{1}_{[h(x) \neq y]} \leq e^{-y h(x)}$. Therefore,$L_{S}\left(f_{T}\right) \leq Z_{T}$ .

To upper bound $Z_{T}$ we rewrite it as

[Z_{T}=\frac{Z_{T}}{Z_{0}}=\frac{Z_{T}}{Z_{T-1}} \cdot \frac{Z_{T-1}}{Z_{T-2}} \cdots \frac{Z_{2}}{Z_{1}} \cdot \frac{Z_{1}}{Z_{0}}]

we first note that using a simple inductive argument, for all $t$ and $i$,

[D_{i}^{(t+1)}=\frac{e^{-y_{i} f_{t}\left(x_{i}\right)}}{\sum_{j=1}^{m} e^{-y_{j} f_{t}\left(x_{j}\right)}}]

Hence,

[\begin{aligned} \frac{Z_{t+1}}{Z_{t}} &=\frac{\sum_{i=1}^{m} e^{-y_{i} f_{t+1}\left(x_{i}\right)}}{\sum_{j=1}^{m} e^{-y_{j} f_{t}\left(x_{j}\right)}} \ &=\frac{\sum_{i=1}^{m} e^{-y_{i} f_{t}\left(x_{i}\right)} e^{-y_{i} w_{t+1} h_{t+1}\left(x_{i}\right)}}{\sum_{j=1}^{m} e^{-y_{j} f_{t}\left(x_{j}\right)}} \ &=\sum_{i=1}^{m} D_{i}^{(t+1)} e^{-y_{i} w_{t+1} h_{t+1}\left(x_{i}\right)} \ &=e^{-w_{t+1}} \sum_{i : y_{i} h_{t+1}\left(x_{i}\right)=1} D_{i}^{(t+1)}+e^{w_{t+1}} \sum_{i : y_{i} h_{t+1}\left(x_{i}\right)=-1 } D_{i}^{(t+1)} \ &=e^{-w_{t+1}}\left(1-\epsilon_{t+1}\right)+e^{w_{t+1}} \epsilon_{t+1} \ &=\frac{1}{\sqrt{1 / \epsilon_{t+1}-1}}\left(1-\epsilon_{t+1}\right)+\sqrt{1 / \epsilon_{t+1}-1} \epsilon_{t+1} \ &=\sqrt{\frac{\epsilon_{t+1}}{1-\epsilon_{t+1}}}\left(1-\epsilon_{t+1}\right)+\sqrt{\frac{1-\epsilon_{t+1}}{\epsilon_{t+1}}} \epsilon_{t+1} \ &=2 \sqrt{\epsilon_{t+1}\left(1-\epsilon_{t+1}\right)} \end{aligned}]

By our assumption, $\epsilon_{t+1} \leq \frac{1}{2}-\gamma$. Since the function $g(a)=a(1-a)$ is monotonically increasing in $[0,1 / 2]$, we obtain that

[2 \sqrt{\epsilon_{t+1}\left(1-\epsilon_{t+1}\right)} \leq 2 \sqrt{\left(\frac{1}{2}-\gamma\right)\left(\frac{1}{2}+\gamma\right)}=\sqrt{1-4 \gamma^{2}}]

Finally, using the inequality $1-a \leq e^{-a}$ we have that $\sqrt{1-4 \gamma^{2}} \leq e^{-4 \gamma^{2} / 2}= e^{-2 \gamma^{2}}$

Hence we have:

[\frac{Z_{t+1}}{Z_{t}} \leq e^{-2 \gamma^{2}}]

we used the fact that $Z_{0}=1$ because $f_{0} \equiv 0$. Therefore, it suffices to show that

[Z_{T} \leq e^{-2 \gamma^{2} T}]

## Linear Combinations of Base Hypotheses

As mentioned previously, a popular approach for constructing a weak learner is to apply the $ERM$ rule with respect to a base hypothesis class. Given a base hypothesis class $B$ (e.g., decision stumps), the output of AdaBoost will be a member of the following class:

[L(B, T)=\left{x \mapsto \operatorname{sign}\left(\sum_{t=1}^{T} w_{t} h_{t}(x)\right) : \mathbf{w} \in \mathbb{R}^{T}, \forall t, \quad h_{t} \in B\right}]

That is, each $h \in L(B, T)$ is parameterized by $T$ base hypotheses from $B$ and by a vector $w \in \mathbb{R}^{T}$

In the next section we analyze the estimation error of $L(B, T)$ by bounding the VC-dimension of $L(B, T)$ in terms of the VC-dimension of $B$ and $T$. We will show that, up to logarithmic factors, the VC-dimension of $L(B, T)$ is bounded by $T$ times the VC-dimension of $B$. It follows that the estimation error of AdaBoost grows linearly with $T$ . On the other hand, the empirical risk of AdaBoost decreases with $T$. In fact, as we demonstrate later, $T$ can be used to decrease the approximation error of $L(B, T)$. Therefore, the parameter $T$ of AdaBoost enables us to control the bias-complexity tradeoff.

## The VC-Dimension of $L(B, T)$

The following lemma tells us that the VC-dimension of $L(B, T)$ is upper bounded by $\tilde{O}(\operatorname{VCdim}(B) T)$ (the $\tilde{O}$ notation ignores constants and logarithmic factors).

lemma Let $B$ be a base class and let $L(B, T)$ be as defined in Equation in the previous section. Assume that both $T$ and $VCdim(B)$ are at least 3. Then,

[\operatorname{VCdim}(L(B, T)) \leq T(\operatorname{VCdim}(B)+1)(3 \log (T(\operatorname{VCdim}(B)+1))+2)]

Denote $d = VCdim(B)$. Let $C=\left\{x_{1}, \ldots, x_{m}\right\}$ be a set that is shattered by $L(B, T)$. Each labeling of $C$ by $h_{1}, \ldots, h_{T} \in B$ is obtained by first choosing $h_{1}, \ldots, h_{T} \in B$ and then applying a halfspace hypothesis over the vector $\left(h_{1}(x), \ldots, h_{T}(x)\right)$. By Sauer’s lemma, there are at most $(e m / d)^{d}$ different dichotomies (i.e., labelings) induced by $B$ over $C$. Therefore, we need to choose $T$ hypotheses, out of at most $(e m / d)^{d}$ different hypotheses. There are at most $(e m / d)^{dT}$ ways to do it. Next, for each such choice, we apply a linear predictor, which yields at most $(e m / T)^{T}$ dichotomies. Therefore, the overall number of dichotomies we can construct is upper bounded by

[(e m / d)^{d T}(e m / T)^{T} \leq m^{(d+1) T}]

where we used the assumption that both $d$ and $T$ are at least 3. Since we assume that $C$ is shattered, we must have that the preceding is at least $2^{m}$, which yields

[2^{m} \leq m^{(d+1) T}]

Therefore,

[m \leq \log (m) \frac{(d+1) T}{\log (2)}]

lemma: Let $a>0$. Then: $x \geq 2 a \log (a) \Rightarrow x \geq a \log (x)$. It follows that a necessary condition for the inequality $x<a \log (x)$ to hold is that $x<2 a \log (a)$

using this lemma, we can conclude that

[m \leq 2 \frac{(d+1) T}{\log (2)} \log \frac{(d+1) T}{\log (2)} \leq(d+1) T(3 \log ((d+1) T)+2)]

## YONG HUANG

Hi, I'm Yong Huang. I've recently graduated from Cornell Tech and obtained my master's degree, I shall start my Ph.D. in Computer Science this fall at UC Irvine. Thank you for visiting my site.