1 Linear Support Vector Machines

In this tutorial, we focus on the original SVM for the binary classification problem. Let the training data be \(\{( x_{1},y_{1}) ,\ldots,( x_{n},y_{n})\}\), where \(x_{i}\in \mathbb{R}^{p}\) is the explanatory variable for the \(i\)-th object and \(y_{i} \in\{1,-1\}\) is the class label.

1.1 Hard-margin for linearly separable problem

Define a hyperplane by \(H=\{ x:f( x) =b+x^{\prime}w=0\}\). The signed distance of a point \(x_{0}\) to the hyperplane \(H\) is \(\frac{f( x_{0}) }{\Vert w\Vert }\) (see http://mathworld.wolfram.com/Point-PlaneDistance.html). The two classes are separable if there exists an \(f(x)\) with \(y_{i}f( x_{i}) >0\) for all \(i\). Such an \(f\), if it exists, is not unique. Our objective is to find the maximal-margin hyperplane, i.e., the one that creates the biggest margin between the training points of class \(1\) and class \(-1\). This corresponds to the following optimization problem:

\[\underset{w,b}{\min}\Vert w\Vert\] subject to \(y_{i}( x_{i}^{\prime}w+b) \geq1,i=1,2,\ldots,n.\) Here, minimizing \(\Vert w \Vert\) is equivalent to maximizing the margin \(2/\Vert w\Vert\), the smallest distance between the two classes (i.e., the distance between the supporting hyperplanes that pass through the support vectors).

(From https://wizardforcel.gitbooks.io/dm-algo-top10/content/img/20140502163842531.jpg)
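As a concrete illustration, the hard-margin problem can be solved directly as a small quadratic program. The sketch below is a minimal example assuming the cvxpy package is available; the toy data set and variable names are our own and not part of the original tutorial.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: two Gaussian blobs, labels in {+1, -1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# Hard-margin primal: minimize (1/2)||w||^2 (equivalent to minimizing ||w||)
# subject to y_i (x_i' w + b) >= 1 for all i.
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print("w =", w.value, " b =", b.value)
# The gap between the two supporting hyperplanes is 2 / ||w||.
print("margin width =", 2 / np.linalg.norm(w.value))
```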

1.2 Soft-margin for non-separable problem

When the classes are not separable, we introduce slack variables \(\xi=( \xi_{1},\xi_{2},\ldots,\xi_{n})\), \(\xi_{i}\geq0\), to allow misclassification:

\[\underset{w,b}{\min}\frac{1}{2}\Vert w\Vert ^{2}+C\sum_{i=1}^{n}\xi_{i}\] subject to \(\xi_{i}\geq0,y_{i}(x_{i}^{^{\prime}}w+b) \geq1-\xi_{i},i=1,2,\ldots,n.\)

where the parameter \(C\) controls the trade-off between the two terms: a large margin (small \(\Vert w\Vert\)) versus a small total slack \(\sum_{i}\xi_{i}\).
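For intuition about the role of \(C\), the short sketch below fits a linear soft-margin SVM with scikit-learn's SVC at several values of \(C\); the toy data set and the chosen values of \(C\) are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with class overlap, labels in {+1, -1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Small C tolerates more slack (wider margin, more support vectors);
# large C penalizes slack heavily (narrower margin, fewer support vectors).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>7}: support vectors = {clf.n_support_.sum()}, "
          f"||w|| = {np.linalg.norm(clf.coef_):.3f}")
```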

1.3 Solution via Lagrange multipliers

The Lagrangian of the primal problem can be written as:

\[L_{P}=\frac{1}{2}\Vert w\Vert ^{2}+C\sum_{i=1}^{n}\xi_{i}-\sum_{i=1}^{n}\alpha_{i}[ y_{i}( x_{i}^{^{\prime}}w+b) - ( 1-\xi_{i}) ] - \sum_{i=1}^{n}\mu_{i}\xi_{i},\]

which we minimize w.r.t. \(w\), \(b\) and \(\xi\).

  1. \(\frac{\partial L_{P}}{\partial w}=0\Longrightarrow w=\sum_{i=1}^{n}\alpha_{i}y_{i}x_{i},\)

  2. \(\frac{\partial L_{P}}{\partial b}=0\Longrightarrow \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\)

  3. \(\frac{\partial L_{P}}{\partial\xi_{i}}=0\Longrightarrow\alpha_{i}=C-\mu_{i},i=1,2,\ldots,n.\)

The dual problem is

\[\underset{\alpha,\mu}{\max}\,L_{P}\] subject to \(w=\sum_{i=1}^{n}\alpha_{i}y_{i}x_{i},\sum_{i=1}^{n}\alpha_{i}y_{i}=0,\alpha_{i}=C-\mu_{i},\alpha_{i}\geq0,\mu_{i}\geq0,\forall i\).

Substituting (1), (2) and (3) into \(L_{P}\), we obtain the dual objective

\[L_{D}=\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{\prime}x_{j}.\]

Notice that the constraints are in inequality form, so the solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, which characterize the unique solution of the following optimization:

\[\underset{\alpha}{\max}\,L_{D}\] subject to \(0\leq\alpha_{i}\leq C,\ \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\) together with the KKT conditions \(\alpha_{i}[ y_{i}( x_{i}^{\prime}w+b) - ( 1-\xi_{i})] =0,\ \mu_{i}\xi_{i}=0,\) and \(y_{i}( x_{i}^{\prime}w+b) - ( 1-\xi_{i}) \geq0\) for \(i=1,2,\ldots,n\).
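As an illustration, this box-constrained problem can be handed to a generic convex solver. The sketch below is a minimal example assuming cvxpy; it maximizes \(L_{D}\) subject to \(0\leq\alpha_{i}\leq C\) and \(\sum_{i}\alpha_{i}y_{i}=0\) on an illustrative toy data set.

```python
import numpy as np
import cvxpy as cp

# Toy data, labels in {+1, -1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-1.0, size=(30, 2)),
               rng.normal(loc=+1.0, size=(30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])
n, C = len(y), 1.0

# L_D = sum_i alpha_i - (1/2) alpha' Q alpha with Q_ij = y_i y_j x_i' x_j.
# Writing Q = G G' with rows G_i = y_i x_i lets us use sum_squares(G' alpha).
G = y[:, None] * X
alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
print("number of support vectors (alpha_i > 0):", int(np.sum(a > 1e-6)))
```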

Note that

  1. Since \(w=\sum_{i=1}^{n}\alpha_{i}y_{i}x_{i}\), \(w\) depends only on cases where \(\alpha_{i}>0\).

  2. Since \(\alpha_{i}[ y_{i}( x_{i}^{\prime}w+b) - ( 1-\xi_{i}) ] =0\), a case with \(\alpha_{i}>0\) must satisfy \(y_{i}( x_{i}^{\prime}w+b) =1-\xi_{i}\leq1\). We call these observations the support vectors.

  3. Among the support vectors, some lie on the edge of the margin with \(\xi_{i}=0\). Since \(\mu_{i}\xi_{i}=0\) and \(\alpha_{i}=C-\mu_{i}\), these correspond to \(0<\alpha_{i}<C\). The remainder \(( \xi_{i}>0\), thus \(\mu_{i}=0)\) have \(\alpha_{i}=C\).

  4. The margin points \(( \alpha_{i}>0,\xi_{i}=0)\) can be used to solve for \(b\) because \(y_{i}( x_{i}^{^{\prime}}w+b) =1\).

  5. Given a solution \(\widehat{w}\) and \(\widehat{b}\), the decision function is \(\widehat{G}( x) =\operatorname{sign}[ x^{\prime}\widehat{w}+\widehat{b}]\) (see the sketch after this list).
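Putting points 1-5 together, a minimal sketch of recovering \(\widehat{w}\) and \(\widehat{b}\) from a dual solution is given below; the helper names recover_primal and decision are hypothetical, and alpha is assumed to be the output of a dual solver such as the one sketched earlier.

```python
import numpy as np

def recover_primal(X, y, alpha, C, tol=1e-6):
    """Recover (w, b) of the linear soft-margin SVM from a dual solution alpha.

    w = sum_i alpha_i y_i x_i depends only on the support vectors (alpha_i > 0).
    Margin points (0 < alpha_i < C, hence xi_i = 0) satisfy y_i (x_i' w + b) = 1
    and are used to solve for b (averaged over them for numerical stability).
    """
    w = (alpha * y) @ X
    on_margin = (alpha > tol) & (alpha < C - tol)
    b = np.mean(y[on_margin] - X[on_margin] @ w)  # 1/y_i = y_i for y_i = +/-1
    return w, b

def decision(X_new, w, b):
    # G_hat(x) = sign(x' w + b)
    return np.sign(X_new @ w + b)
```

With the dual solution from the previous sketch, recover_primal(X, y, a, C) gives \((\widehat{w},\widehat{b})\), and decision then implements \(\widehat{G}\).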

2 Kernel Tricks & Extensions

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick: the inputs are implicitly mapped into a high-dimensional feature space \(h(x) =( h_{1}( x) ,\ldots,h_{m}( x))\), where a linear boundary is fitted while only the inner products \(\langle h(x_{i}),h(x_{j})\rangle=K(x_{i},x_{j})\) are ever computed. Since \(m\) can be very large (even infinite), this extension is what is usually called the support vector machine.
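A brief illustration with scikit-learn's SVC: on a data set that no linear boundary can separate, an RBF kernel works in an implicit feature space and fits the classes easily. The data set and parameter values below are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no linear boundary can separate the two classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)
y = 2 * y - 1   # relabel {0, 1} as {-1, +1}

# The kernel trick only needs inner products K(x_i, x_j) = <h(x_i), h(x_j)>,
# so the feature map h is never computed explicitly.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:>6} kernel, training accuracy: {clf.score(X, y):.2f}")
```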