Support Vector Machines
Introduction
In this tutorial, we will cover support vector machines. We will start with linear splits using SVM and then proceed to nonlinear splits, including radial kernels. Overall, we will:
- train support vector machines,
- apply SVM to linearly non-separable cases,
- apply SVM to strongly nonlinear cases,
- discuss how SVMs differ from other linear models.
Support Vector Machines
SVM is a very flexible approach to classification. Unlike logistic regression, we do not assume a latent probability model and decide based on that assumption. Instead, we take a more intuitive, visual approach: split the feature space wherever the split seems best. The regions may be separated linearly or nonlinearly (e.g. with a polynomial or radial kernel), depending on our choice. With a radial kernel, for example, the centre of the field can be assigned to one class and everything around it to another.
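As a preview, here is a minimal sketch of how the choice of split is expressed in code with e1071::svm(), the function used throughout this tutorial; the toy data and the fit.* object names below are purely illustrative.
library(e1071)
# purely illustrative toy data: two overlapping clouds
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- factor(ifelse(toy$x1 + toy$x2 > 0, 1, -1))
# the kernel argument controls the shape of the decision boundary
fit.linear <- svm(y ~ x1 + x2, data = toy, kernel = "linear")
fit.poly   <- svm(y ~ x1 + x2, data = toy, kernel = "polynomial", degree = 2)
fit.radial <- svm(y ~ x1 + x2, data = toy, kernel = "radial")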
There are direct advantages of SVM over decision trees and logistic regression. Unlike logistic regression, we are not restricted to a linear split, which would fail in cases like the one below. We can use our imagination and choose other geometric objects that seem more appropriate.
The example below presents the problem. You can see that logistic regression fails while the decision tree works; that is the advantage of the logical, rule-based approach. However, the tree works by splitting the region into small rectangles and may overfit. What we need is a more visually natural approach: pick a geometric object, such as a circle, and declare it the target region. This is pretty much how SVM works.
library(tree)      # tree(), printed further below
library(rpart)     # rpart() decision trees
library(e1071)     # svm() and tune()
library(ggplot2)
library(gridExtra)
library(dplyr)     # for the %>% pipe
set.seed(157)
x <- matrix(rnorm(200*2), ncol = 2)
x[1:100, ]   <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
dat <- data.frame(x = x, y = as.factor(y))
colnames(dat) <- c("x1", "x2", "y")
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point()
logreg <- glm(y~x1+x2, data=dat, family = "binomial")
preds.logreg <- predict(logreg, type = "response") > 0.5
dectree <- rpart(y~x1+x2, data=dat)
preds.dectree <- predict(dectree, type = "class")
g1 <- ggplot(dat, aes(x1,x2,colour=y)) +
geom_point() +
ggtitle("True Classes") +
theme(legend.position = "none")
g2 <- ggplot(dat, aes(x1,x2,colour=preds.logreg)) +
geom_point() +
ggtitle("Logreg Preds") +
theme(legend.position = "none")
g3 <- ggplot(dat, aes(x1,x2,colour=preds.dectree)) +
geom_point() +
ggtitle("Decision Tree Preds") +
theme(legend.position = "none")
grid.arrange(g1,g2,g3,ncol=3)
> Logistic regression predicted everything as one class, whereas the decision tree did a better job. Still, some points in the middle of the plot that look easy are misclassified: the vertical and horizontal splits of a decision tree do not give us enough flexibility.
If you wonder how the above decision tree works, we can visualize the decision boundaries as below:
library("parttree")
## Plot the data and model partitions
dat %>%
ggplot(aes(x=x1, y=x2)) +
geom_point(aes(col=y)) +
geom_parttree(data = dectree, aes(fill=y), alpha = 0.1)
> As can be seen above, the decision tree is very precise, but its rectangular boundaries do not reflect the nature of the problem and are not an intuitive solution.
Let’s see what happens on an easy dataset:
set.seed(156)
dat <- data.frame(x = matrix(rnorm(200*2), ncol = 2),
                  y = c(rep(-1, 100), rep(1, 100)))
# rebuild the data around two distinct centres: (0,0) for class -1, (5,5) for class 1
dat <- data.frame(x = matrix(rnorm(100*2), ncol = 2) + c(0,0), y = -1)
dat <- rbind(dat, data.frame(x = matrix(rnorm(100*2), ncol = 2) + c(5,5), y = 1))
colnames(dat) <- c("x1", "x2", "y")
dat$y <- as.factor(dat$y)
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point() #+
  # geom_abline(aes(intercept = 5, slope = -1.0)) +
  # geom_abline(aes(intercept = 5+.25, slope = -1.1), alpha=.6) +
  # geom_abline(aes(intercept = 5+.5, slope = -1.2), alpha=.6) +
  # geom_abline(aes(intercept = 5-.25, slope = -0.9), alpha=.6) +
  # geom_abline(aes(intercept = 5-.5, slope = -0.8), alpha=.6)
fit.tree <- tree(y ~ x1 + x2, data = dat)
fit.tree
fit.logreg <- glm(y ~ x1 + x2, data = dat, family = "binomial")
fit.logreg
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 200 277.3 -1 ( 0.5 0.5 )
## 2) x1 < 2.40365 100 0.0 -1 ( 1.0 0.0 ) *
## 3) x1 > 2.40365 100 0.0 1 ( 0.0 1.0 ) *
##
## Call: glm(formula = y ~ x1 + x2, family = "binomial", data = dat)
##
## Coefficients:
## (Intercept) x1 x2
## -54.06 10.37 10.99
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 277.3
## Residual Deviance: 5.122e-09 AIC: 6
coeffs <- coef(fit.logreg)
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point() +
  geom_vline(aes(xintercept = 2.40365), linetype = "dashed") +
  geom_abline(aes(intercept = -coeffs[1]/coeffs[3], slope = -coeffs[2]/coeffs[3]), colour = "steelblue") +
  ggtitle("Decision Tree vs. Logistic Regression")
This is an easy example and any of the classifiers we consider here would handle it perfectly. Let's compare their styles:
The decision tree split the region at x1 ≈ 2.40 and stopped there (the dashed line).
Logistic regression estimated coefficients that define a line maximising the likelihood of each point falling into its true class. To visualise that line (the blue one), we extract it as follows:
- If the predicted probability is greater than 0.5, the point is assigned to one class, otherwise to the other.
- Therefore the decision boundary is \(P = 0.5 = 1/(1+e^{-Z})\), i.e. \(0 = Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2\).
- Thus, in the plane with \(x_1\) and \(x_2\) as axes, the estimated boundary is the line \(x_2 = -\beta_0/\beta_2 - (\beta_1/\beta_2)\,x_1\), which is the blue line above; we check this numerically right below.
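As a quick check with the coefficients printed above (\(\beta_0 = -54.06\), \(\beta_1 = 10.37\), \(\beta_2 = 10.99\)):
\[
x_2 = -\frac{-54.06}{10.99} - \frac{10.37}{10.99}\,x_1 \approx 4.92 - 0.94\,x_1,
\]
which passes between the two clusters and crosses the tree's vertical split at roughly \((2.4,\,2.7)\).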
This built-in probability model may be unnecessary and can hurt efficiency. SVM takes a more direct route: it looks for the line with the widest margin between the classes, penalising points that fall on the wrong side of their margin. When the two classes are perfectly separable, this is equivalent to the maximal margin approach (see lectures). When they are inseparable because of noise, points on the wrong side are allowed but penalised through a cost parameter; as the cost increases, points that fall further onto the wrong side gain more influence on the fit, which may be inappropriate for the situation, so we may need to find the optimal penalty. Here, however, the split is clean and easy.
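For reference, one standard way to write this soft-margin problem (the form used by the libsvm backend of e1071, where \(C\) corresponds to the cost argument and the \(\epsilon_i\) are slack variables measuring how far a point sits on the wrong side of its margin) is
\[
\min_{\beta_0,\,\beta,\,\epsilon}\ \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_{i=1}^{n}\epsilon_i
\quad \text{subject to} \quad y_i\,(\beta_0 + \beta^\top x_i) \ge 1 - \epsilon_i,\ \epsilon_i \ge 0.
\]
A large \(C\) makes margin violations expensive (a harder margin); a small \(C\) tolerates them.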
Now, let’s see what we can do with SVMs.
Linear Splits with SVM
Let’s generate some data. But this time the points won’t be linearly separable:
set.seed(156)
dat <- data.frame(x = matrix(rnorm(60*2), ncol = 2),
                  y = c(rep(-1, 30), rep(1, 30)))
dat[dat$y == 1, 1:2] <- dat[dat$y == 1, 1:2] + 2   # shift class 1, leaving some overlap
colnames(dat) <- c("x1", "x2", "y")
dat$y <- as.factor(dat$y)
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point()
logreg.fit <- glm(y ~ x1 + x2, data = dat, family = "binomial")
coeffs <- coef(logreg.fit)
preds <- predict(logreg.fit, type = "response") > 0.5
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point() +
  geom_abline(aes(intercept = -coeffs[1]/coeffs[3], slope = -coeffs[2]/coeffs[3]), colour = "steelblue") +
  ggtitle("Logistic Regression fit")
table(predict = preds, actual = (dat$y == 1))
## actual
## predict FALSE TRUE
## FALSE 27 3
## TRUE 3 27
The resulting split looks fine, with errors evenly balanced: three misclassified points on each side.
tree.fit <- rpart(y~x1+x2, data = dat)
preds <- predict(tree.fit, type="class")
table(predict=preds,actual=(dat$y==1))
## actual
## predict FALSE TRUE
## -1 30 6
## 1 0 24
dat %>%
ggplot(aes(x=x1, y=x2)) +
geom_point(aes(col=y)) +
geom_parttree(data = tree.fit, aes(fill=y), alpha = 0.1)
The behaviour of the decision tree is too inflexible for this case: its boundaries are insensitive to the structure of the data and unintuitive. The method needs more data to sharpen its decisions and performs poorly on small datasets.
Let’s see what SVM does:
svmfit <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 0.05, scale = FALSE)
preds <- predict(svmfit, dat, type = "class")
table(predict = preds, dat$y == 1)
##
## predict FALSE TRUE
## -1 28 4
## 1 2 26
preds <- predict(svmfit,dat[,1:2], type="class")
ggplot(dat, aes(x1,x2,colour=as.factor(preds))) +
geom_point()
The split is less biased towards one class than the decision tree's, and very close to the logistic regression fit. False positives and false negatives are nearly even on the two sides, and the total number of misclassifications is comparable. SVM achieves this by accepting that there will be errors and compensating for them.
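To put numbers on this, here is a small sketch comparing the training error rates of the three fits. It assumes the dat object from the chunks above; the err.rate helper and the *.pred names are introduced here only for illustration.
# compare training error rates of the three classifiers on the same data
err.rate <- function(pred) mean(pred != dat$y)
logreg.prob <- predict(glm(y ~ x1 + x2, data = dat, family = "binomial"), type = "response")
logreg.pred <- factor(ifelse(logreg.prob > 0.5, 1, -1), levels = levels(dat$y))
tree.pred   <- predict(rpart(y ~ x1 + x2, data = dat), type = "class")
svm.pred    <- predict(svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 0.05, scale = FALSE), dat)
c(logreg = err.rate(logreg.pred), tree = err.rate(tree.pred), svm = err.rate(svm.pred))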
Fine-tuning SVM
The previous example used a fixed cost parameter of 0.05. We don't know whether this is the best value. So what are we going to do? … Yes, cross-validation.
set.seed(156)
tune.out <- tune(svm, y ~ ., data = dat, kernel = "linear", scale = FALSE,
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# summary(tune.out)
bestmod <- tune.out$best.model
preds <- predict(bestmod, dat[, 1:2], type = "class")
plot(bestmod, dat)
table(predict = preds, actual = (dat$y == 1))
## actual
## predict FALSE TRUE
## -1 27 2
## 1 3 28
We compared different cost parameters using 10-fold cross-validation (uncomment summary(tune.out) to see the full table). The value that produced the minimum error was 0.1.
The best model reduced the number of misclassified points, from six down to five.
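If you want to inspect the tuning results directly, the object returned by tune() stores them; the components below are part of e1071's tune class.
tune.out$best.parameters    # the cost value selected by 10-fold CV
tune.out$best.performance   # the corresponding cross-validation error
head(tune.out$performances) # CV error for every candidate cost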
Radial Kernels
Now, let's go back to our first example. What are we going to do about it? We saw that decision trees work reasonably well on such a dataset. Let's look closer:
set.seed(157)
x <- matrix(rnorm(200*2), ncol = 2)
x[1:100, ]   <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
dat <- data.frame(x = x, y = as.factor(y))
colnames(dat) <- c("x1", "x2", "y")
ggplot(dat, aes(x1, x2, colour = y)) +
  geom_point()
tree.fit <- rpart(y~x1+x2, data = dat)
preds <- predict(tree.fit, type="class")
table(predict=preds,actual=(dat$y==1))
## actual
## predict FALSE TRUE
## 1 6 141
## 2 44 9
dat %>%
ggplot(aes(x=x1, y=x2)) +
geom_point(aes(col=y)) +
geom_parttree(data = tree.fit, aes(fill=y), alpha = 0.1)
The way the decision tree solves the problem is very unintuitive, and it is easy to imagine how it would overfit. We need a more natural approach.
We are looking for a single boundary that splits the region more effectively. A polynomial split sounds good, but we need something more radical: something that can draw a circle around the points. A radial kernel does exactly that:
svmfit <- svm(y ~ x1 + x2, data = dat, kernel = "radial", cost = 1, gamma = 1)
preds.svm <- predict(svmfit, dat)
table(predict = preds.svm, actual = (dat$y == 2))  # TRUE corresponds to class 2
## actual
## predict FALSE TRUE
## 1 145 11
## 2 5 39
g4 <- ggplot(dat, aes(x1,x2,colour=as.factor(preds.svm))) +
geom_point() +
ggtitle("SVM, cost=1, gamma=1") +
theme(legend.position = "none")
You can see that SVMs are remarkably capable here: everything inside the region carved out by the radial kernel falls into one class, and everything outside it into the other. We can also draw the fitted boundary directly, as shown right below.
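e1071's plot method for svm objects (already used for the linear fit above) draws the classification regions and marks the support vectors; this assumes the svmfit object from the chunk above.
plot(svmfit, dat)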
With the radial kernel we now have a second tuning parameter, gamma (the kernel formula is given right below). Let's see what cross-validation can offer.
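For reference, the radial (RBF) kernel is
\[
K(x, x') = \exp\!\left(-\gamma\,\lVert x - x'\rVert^2\right),
\]
so larger gamma values make the kernel more local and the decision boundary wigglier, while cost plays the same role as in the linear case.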
#### Tune for the best parameter selection ####
set.seed(156)
tune.out <- tune(svm, y ~ ., data = dat, kernel = "radial",
                 ranges = list(cost = c(0.01, 0.05, 0.1, 1, 10, 100, 1000),
                               gamma = c(0.5, 1, 2, 3, 4)))
preds.svmCV <- predict(tune.out$best.model, dat)
g5 <- ggplot(dat, aes(x1, x2, colour = preds.svmCV)) +
  geom_point() +
  ggtitle("SVM, CV'd, cost=10, gamma=1") +
  theme(legend.position = "none")
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 1
##
## - best performance: 0.085
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-02 0.5 0.250 0.11055416
## 2 5e-02 0.5 0.250 0.11055416
## 3 1e-01 0.5 0.150 0.13944334
## 4 1e+00 0.5 0.095 0.04972145
## 5 1e+01 0.5 0.090 0.05163978
## 6 1e+02 0.5 0.090 0.06582806
## 7 1e+03 0.5 0.090 0.06582806
## 8 1e-02 1.0 0.250 0.11055416
## 9 5e-02 1.0 0.245 0.11654756
## 10 1e-01 1.0 0.100 0.07817360
## 11 1e+00 1.0 0.095 0.05502525
## 12 1e+01 1.0 0.085 0.06258328
## 13 1e+02 1.0 0.095 0.06851602
## 14 1e+03 1.0 0.100 0.06666667
## 15 1e-02 2.0 0.250 0.11055416
## 16 5e-02 2.0 0.250 0.11547005
## 17 1e-01 2.0 0.125 0.10341395
## 18 1e+00 2.0 0.100 0.06236096
## 19 1e+01 2.0 0.105 0.06433420
## 20 1e+02 2.0 0.100 0.06666667
## 21 1e+03 2.0 0.125 0.07168604
## 22 1e-02 3.0 0.250 0.11055416
## 23 5e-02 3.0 0.250 0.11055416
## 24 1e-01 3.0 0.145 0.10394977
## 25 1e+00 3.0 0.110 0.06146363
## 26 1e+01 3.0 0.095 0.04972145
## 27 1e+02 3.0 0.105 0.07245688
## 28 1e+03 3.0 0.145 0.05986095
## 29 1e-02 4.0 0.250 0.11055416
## 30 5e-02 4.0 0.250 0.11055416
## 31 1e-01 4.0 0.145 0.10394977
## 32 1e+00 4.0 0.105 0.05986095
## 33 1e+01 4.0 0.090 0.06146363
## 34 1e+02 4.0 0.110 0.06582806
## 35 1e+03 4.0 0.150 0.06236096
Let's compare the models' performance: