Sparsity of Support vector machines over an RKHS

In summary: if for a data point $x_{i}$ both the left-hand and the right-hand derivative of the loss $L(y_{i},f(x_{i}))$ with respect to $f(x_{i})$ vanish, then the corresponding coefficient $\alpha_{i}$ in the kernel expansion can be taken to be zero. The solution therefore depends only on the "support vectors," i.e. the points $x_{i}$ whose coefficients $\alpha_{i}$ do not vanish. This is a useful result, as it can then be applied to other optimization problems such as the SVM optimization problem.
  • #1
eipiplusone
I'm trying to solve the following problem from the book 'Learning with Kernels', and would really appreciate a little help.

Background information

- Let $\{(x_{1},y_{1}),...,(x_{N},y_{N})\}$ be a dataset, $L$ a loss function, and $H(k)$ a reproducing kernel Hilbert space with kernel $k$. The representer theorem states that the minimizer $f \in H(k)$ of the following optimization problem
\begin{equation}
\operatorname*{argmin}_{f \in H(k)} \sum_{i=1}^{N}L(y_{i},f(x_{i})) + \Vert f \Vert_{H(k)}
\end{equation}
can be represented as
\begin{equation}
f(\cdot) = \sum_{i=1}^{N}\alpha_{i}k(x_{i},\cdot)
\end{equation}
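For intuition, a function of this form can be evaluated directly. The following snippet is illustrative only (it is not from the book): it uses a Gaussian (RBF) kernel and made-up coefficients alpha to show that a point whose coefficient is zero contributes nothing to $f$.

Python:
import numpy as np

def rbf(x, z, gamma=1.0):
    # Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def f(x, X, alpha, gamma=1.0):
    # Kernel expansion f(x) = sum_i alpha_i * k(x_i, x)
    return sum(a * rbf(xi, x, gamma) for a, xi in zip(alpha, X))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # training points x_i
alpha = np.array([0.5, -0.2, 0.0])                   # alpha_3 = 0: x_3 drops out of f

print(f(np.array([0.5, 0.5]), X, alpha))

The third point never influences $f$, which is exactly the kind of sparsity the problem below is about.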

Problem statement

- Show that a sufficient condition for a coefficient $\alpha_{i}$ of the kernel expansion to vanish is that both the left-hand and the right-hand derivative of the corresponding loss term $L(y_{i},f(x_{i}))$ with respect to $f(x_{i})$ vanish. Hint: use the proof strategy of the Representer theorem (see https://en.wikipedia.org/wiki/Representer_theorem).
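For a concrete example (not part of the book's problem statement), take the hinge loss used by soft-margin SVMs,
\begin{equation}
L(y_{i}, f(x_{i})) = \max\{0,\, 1 - y_{i}f(x_{i})\}.
\end{equation}
Whenever $y_{i}f(x_{i}) > 1$, the loss is identically zero in a neighbourhood of $f(x_{i})$, so both one-sided derivatives vanish and the claimed result gives $\alpha_{i} = 0$; only points with $y_{i}f(x_{i}) \le 1$ can end up as support vectors.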
My sparse thoughts

- I have been pondering this for half a day, but don't seem to get anywhere. I don't exactly know how to make sense of the derivative of $L$. Since we want to minimize over $H(k)$ (a set of functions), I don't see how it is relevant to take the derivative (and thus minimize) with respect to $f(x_{i})$ (which is a real number). Furthermore, this derivative approach seemingly has nothing to do with the proof strategy of the Representer theorem; in that proof, $H(k)$ is decomposed into the span of the $k(x_{i},\cdot)$ and its orthogonal complement. It is then shown that the component of $f \in H(k)$ that lies in the orthogonal complement vanishes when evaluated at the $x_{i}$'s, and thus does not contribute to the loss term of the minimization problem.

PS: I am hoping to be able to use this result to show that the solutions to the SVM optimization problem depend only on the 'support vectors'.

Any help would be greatly appreciated.
 

  • #2

Thank you for reaching out for help with this problem from the book "Learning with Kernels." I am happy to assist you in understanding the proof and how it relates to the Representer theorem.

To begin, let's break down the problem statement. We are given a dataset $\{(x_{1},y_{1}),...,(x_{N},y_{N})\}$ and we want to solve the following optimization problem:

\begin{equation}
\operatorname*{argmin}_{f \in H(k)} \sum_{i=1}^{N}L(y_{i},f(x_{i})) + \Vert f \Vert_{H(k)}
\end{equation}

The Representer theorem tells us that the minimizer $f \in H(k)$ can be represented as a linear combination of the kernel $k(x_{i},\cdot)$ with coefficients $\alpha_{i}$:

\begin{equation}
f(\cdot) = \sum_{i=1}^{N}\alpha_{i}k(x_{i},\cdot)
\end{equation}

Now, the problem statement asks us to show that if both the left-hand and the right-hand derivative of the loss term $L(y_{i},f(x_{i}))$ with respect to $f(x_{i})$ vanish, then the corresponding coefficient $\alpha_{i}$ in the kernel expansion vanishes. In other words, a data point at which the loss is locally flat in $f(x_{i})$ does not need to appear in the expansion at all; only the remaining points, the support vectors, contribute to $f$.

To understand why this is a sufficient requirement, we need to go back to the proof strategy of the Representer theorem. The key idea is that we can decompose $H(k)$ into two orthogonal subspaces: the span of $k(x_{i},\cdot)$ and its orthogonal complement. This means that any function $f \in H(k)$ can be written as a sum of two components: one that lies in the span of $k(x_{i},\cdot)$ and one that lies in the orthogonal complement.

Now, when we evaluate the function $f$ at one of the points $x_{i}$, the component that lies in the orthogonal complement contributes nothing, because evaluation at $x_{i}$ is, by the reproducing property, an inner product with $k(x_{i},\cdot)$, and that component is orthogonal to $k(x_{i},\cdot)$. The orthogonal component therefore cannot lower any of the loss terms, while it can only increase the norm term, so the minimizer must lie in the span of the $k(x_{i},\cdot)$. This is also where your question about differentiating with respect to $f(x_{i})$ fits in: once $f$ is restricted to the span, the objective depends on $f$ only through the finitely many real numbers $f(x_{1}),\dots,f(x_{N})$ and through the coefficients $\alpha_{i}$, and the vanishing of the one-sided derivatives of $L(y_{i},\cdot)$ at $f(x_{i})$ is exactly what lets you conclude that $\alpha_{i}=0$.
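Here is a sketch of how the one-sided derivatives enter, assuming for concreteness a convex loss and the regularizer $\tfrac{\lambda}{2}\Vert f\Vert_{H(k)}^{2}$ (for a general strictly increasing regularizer the argument is analogous). By the reproducing property, evaluation is an inner product,
\begin{equation}
f(x_{i}) = \langle f, k(x_{i},\cdot)\rangle_{H(k)},
\end{equation}
so the optimality (subgradient) condition for the regularized risk reads
\begin{equation}
\lambda f + \sum_{i=1}^{N}\gamma_{i}\,k(x_{i},\cdot) = 0, \qquad \gamma_{i} \in \big[\partial_{-}L(y_{i},f(x_{i})),\ \partial_{+}L(y_{i},f(x_{i}))\big],
\end{equation}
where $\partial_{-}$ and $\partial_{+}$ denote the left-hand and right-hand derivatives with respect to $f(x_{i})$. This gives $f = -\tfrac{1}{\lambda}\sum_{i}\gamma_{i}k(x_{i},\cdot)$, i.e. $\alpha_{i} = -\gamma_{i}/\lambda$ (uniquely so when the $k(x_{i},\cdot)$ are linearly independent). If both one-sided derivatives of $L(y_{i},\cdot)$ vanish at $f(x_{i})$, the interval collapses to $\{0\}$, so $\gamma_{i} = 0$ and hence $\alpha_{i} = 0$.

I hope this helps clarify the problem for you. Best of luck with your studies!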
 

FAQ: Sparsity of Support vector machines over an RKHS

1. What is sparsity in the context of support vector machines (SVMs)?

Sparsity refers to the property of having a small number of non-zero coefficients in the SVM model. In other words, only a few data points are used as support vectors to define the decision boundary, while the rest of the data points have no impact on the model.
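To see this sparsity in practice, here is a minimal sketch assuming scikit-learn is available (toy data, illustrative only):

Python:
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only the support vectors carry nonzero coefficients in the kernel expansion;
# every other training point could be removed without changing the fitted decision function.
print("training points:", len(X))
print("support vectors:", len(clf.support_))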

2. How does sparsity affect the performance of SVMs?

Sparsity can improve the performance of SVMs in terms of computational efficiency and generalization. By using only a small number of support vectors, the model becomes less complex and less prone to overfitting, which can lead to better generalization on unseen data. Additionally, sparsity allows for faster training and prediction times.

3. What is the relationship between sparsity and the Reproducing Kernel Hilbert Space (RKHS)?

SVMs find their decision function in an RKHS, and by the representer theorem that function is a kernel expansion over the training points. Sparsity means that only a small subset of these points, namely those on or inside the margin, carry nonzero coefficients and therefore determine the decision boundary.
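Concretely, the SVM decision function in the RKHS takes the familiar kernel-expansion form (standard SVM notation, with $b$ the bias term):
\begin{equation}
f(x) = \sum_{i \in \mathrm{SV}} \alpha_{i} y_{i} k(x_{i}, x) + b,
\end{equation}
where the sum runs only over the support vectors, since $\alpha_{i} = 0$ for every other training point.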

4. How is sparsity achieved in SVMs?

Sparsity comes from the shape of the loss function rather than from the kernel trick. The hinge loss used for classification (and the $\varepsilon$-insensitive loss used in support vector regression) is flat for points that are fitted with a sufficient margin, so both one-sided derivatives of the loss vanish there and the corresponding expansion coefficients are zero. The kernel trick maps the data into a higher-dimensional feature space in which a linear decision function can be found, but it is the loss function that determines which points end up as support vectors.
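As a quick numerical illustration of this 'flat loss' mechanism (a toy check, not a formal argument):

Python:
import numpy as np

def hinge(t):
    # Hinge loss as a function of the margin t = y * f(x)
    return np.maximum(0.0, 1.0 - t)

t, h = 2.0, 1e-6                      # a point classified with margin well beyond 1
left = (hinge(t) - hinge(t - h)) / h
right = (hinge(t + h) - hinge(t)) / h
print(left, right)                    # both 0.0: the loss is flat, so this point needs no coefficient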

5. Are there any drawbacks to sparsity in SVMs?

One potential drawback of sparsity in SVMs is the sensitivity to outliers. Since only a few data points are used as support vectors, outliers can have a significant impact on the decision boundary and potentially lead to poor performance. Additionally, the selection of the kernel function can also affect the sparsity of an SVM model.
