Gradient Descent and Validation
Introduction
In this tutorial, we will first present the difference between Batch and Stochastic Gradient Descent algorithms. Then we will cover overfitting and validation. Overall, we will:
- discuss how the gradient descent algorithm (GDA) works,
- compare batch gradient descent (BGD) and stochastic gradient descent (SGD),
- revisit the problem of overfitting,
- apply different validation techniques to overcome the problem.
Getting Started
To follow along, you will need:
- tidyverse
- caret
- gridExtra
- ISLR
- plotly
- the house dataset (house_data.csv).
Now, load tidyverse and the other packages:
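The setup chunk is not shown here; loading the packages listed above would look like this:
library(tidyverse)   # data wrangling and ggplot2
library(caret)       # train(), trainControl(), RMSE()
library(gridExtra)   # grid.arrange() for panel layouts
library(ISLR)        # the Auto dataset used below
library(plotly)      # interactive 3D surface plots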
Best Linear Fit
When fitting regressions, we try to find the parameters that fit the data best. For the sake of simplicity, let’s assume a simple linear regression model
\[\hat Y^i = \theta_0 + \theta_1 X^i\] whereas the actual values contain an additional error term:
\[Y^i = \theta_0 + \theta_1 X^i + \epsilon^i\] Notice here that \(Y^i\) is the actual observation and \(\hat Y^i\) is the estimated one. We also refer to the estimated value as \(\hat Y^i = h_\theta (x^i)\).
It seems all of them miss the point: the first cannot capture the slope (it is fixed at 0), the second misses both the slope and the intercept, and the last two have the right slope but the wrong intercept.
The distances from each line to the points are large. The sum of the signed distances may cancel out (some are negative, some are positive), but the sum of squared distances will be very large.
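The four candidate lines are not reproduced here, but lines of that kind could be drawn as below; the coefficients in guesses are arbitrary illustrations matching the description above (a flat line, one with the wrong slope and intercept, and two with the right slope but the wrong intercept).
guesses <- data.frame(intercept = c(3.2, 1.0, 5.5, 8.5),
                      slope     = c(0.0, 0.5, -0.84, -0.84))
ggplot(Auto, aes(x = log(horsepower), y = log(mpg))) +
  geom_point() +
  geom_abline(data = guesses, aes(intercept = intercept, slope = slope),
              colour = 'firebrick')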
What is so special about the line below?
preds <- lm(log(mpg)~log(horsepower), data=Auto)$fitted.values
ggplot(Auto, aes(x=log(horsepower), y=log(mpg))) +
geom_point() +
geom_line(aes(y=preds), colour = 'steelblue') +
ylim(1,5) +
ggtitle('theta0 = 6.9606 , theta1 = -0.8418')
The line passes through the cloud of points so that the distance from each point to the line is as small as possible.
How to determine the best line mathematically?
We can define the best line mathematically as the one that minimizes the sum of squared errors. For example, in the first four plots the distances from the points to the line are large, so the sum of squared errors (SSE) is very high. For the line above, the sum of squared errors is minimal.
\[MSE = J(\theta) = \frac{2}{M} \sum_{i=1}^M\left(h_\theta(x^i) - Y^i\right)^2 \\ = \frac{2}{M} \left(h_\theta(X) - Y\right)^T \left(h_\theta(X) - Y\right)\]
So we can define the problem as finding the optimal \(\theta\) that minimizes this loss. Once the problem is framed this way, we can use the gradient descent algorithm to find the optimal parameters.
To use gradient descent, we need the gradient of the above function:
\[\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{M} \sum_{i=1}^M\left(Y^i - h_\theta(x^i)\right)x_j^i = \frac{1}{M} \sum_{i=1}^M\left(h_\theta(x^i) - Y^i\right)x_j^i \quad \forall j \\ \nabla J(\theta) = -\frac{1}{M}X^T\left(Y - h_\theta(X)\right) = \frac{1}{M}X^T\left(h_\theta(X) - Y\right)\]
You may find implementations that drop the \(\frac{1}{M}\) term; this is because we can account for it by adjusting the learning rate, \(\alpha\).
Gradient Descent Algorithm
In this section, we will estimate the optimal parameters that best fit the data using the gradient descent algorithm. Let’s start by defining the loss function and its gradient:
MSE.loss <- function(x, y, theta){
  y_pred <- x %*% theta                      # predictions h_theta(X)
  2*mean( (y_pred - y)^2 )                   # mean squared error (scaled by 2, matching J(theta) above)
}
MSE.grad <- function(x, y, theta){
  y_pred <- x %*% theta                      # predictions h_theta(X)
  res <- t(x) %*% (y_pred - y)/length(y)     # (1/M) * X^T (h_theta(X) - Y)
  c(res)                                     # return as a plain numeric vector
}
For presentation purposes, we will generate a dataset that exhibits a linear relationship, i.e. \(Y^i = 1-2X^i+\epsilon^i\):
set.seed(156)
gen_data <- function(intercept,slope,x){
intercept + slope*x + rnorm(length(x), sd=0.7)
}
x <- seq(-3,3,,100)
y <- gen_data(1,-2, x)
ggplot() +
geom_point(aes(x,y))
Now, let’s guess a few \(\theta\) vectors and calculate their MSE loss, e.g. \(\theta = (0,1)\), \(\theta = (1,1.5)\) and \(\theta = (1,-2)\):
x <- cbind(1,x)
y <- cbind(y)
l1 <- MSE.loss(x,y, c(0,1))
l2 <- MSE.loss(x,y, c(1,1.5))
l3 <- MSE.loss(x,y, c(1,-2))
cbind(l1,l2,l3)
## l1 l2 l3
## [1,] 59.05332 77.26366 0.8994777
The first model misses both the intercept and the slope, and the second has the wrong (positive) slope. The third yields the minimum loss.
Now, let’s plot the loss function that we will try to minimize using GDA:
library('plotly')
library('ggplot2')
theta0 <- seq(-3,5,,100)
theta1 <- seq(-6,2,,100)
f <- function(theta) MSE.loss(x,y,theta)
dat <- expand.grid(x=theta0,y=theta1)
z <- matrix(apply(dat,1,f), 100)
plot_ly(x=~theta1,y=~theta0,z=~z) %>%
add_surface()
The function has a single minimum. We can locate it by eye, but finding it algorithmically is not as straightforward, so we use gradient descent.
The contour plot is a bird’s-eye view of the 3D surface above and will be handy when we visualize the optimization:
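The contour plot itself is not reproduced here; a minimal version, reusing the dat grid and the loss function f defined above, would be:
ggplot(dat, aes(x = x, y = y, z = apply(dat, 1, f))) +
  geom_contour_filled(alpha = .7) +
  labs(x = 'theta0', y = 'theta1', title = 'J(theta)')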
Gradient Descent algorithm aims to find the parameters, here \(\theta_0\) and \(\theta_1\), that minimize the function. To this end, we start with an arbitrary \(\theta\) vector and calculate the gradient.
The gradient descent algorithm is as follows:
- Determine a step size (learning rate), \(\alpha\), and pick an initial point, \(\theta^0 = (\theta_0^0,\theta_1^0)\). Also derive the gradient of the function, \(\nabla J(\theta)\).
- Calculate the gradient at the current point, \(\nabla J(\theta^i)\).
- Multiply the gradient by the step size, \(\alpha\cdot\nabla J(\theta^i)\).
- Move against the gradient by that amount. The point you arrive at is your new point, \(\theta^{i+1} = \theta^i - \alpha\cdot\nabla J(\theta^i)\).
- Repeat steps 2-4 until the loss stops changing much, i.e. \(|J(\theta^{i+1}) - J(\theta^{i})| < \epsilon\) (a sketch of this loop as an R function is given below).
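The helper BGD() used later in this tutorial is not shown in this text; the sketch below is one minimal implementation of the steps above, consistent with how it is called later (it returns the per-epoch parameter history as the first list element). The convergence tolerance eps and the component names are assumptions added here.
BGD <- function(x, y, theta, alpha = 0.1, max_iter = 20, eps = 1e-6){
  names(theta) <- c('V1', 'x')       # match the column names used when plotting later
  theta.hist <- list()
  loss.old <- MSE.loss(x, y, theta)
  for(i in 1:max_iter){
    theta <- theta - alpha * MSE.grad(x, y, theta)   # one batch update per epoch
    theta.hist[[i]] <- theta
    loss.new <- MSE.loss(x, y, theta)
    if(abs(loss.new - loss.old) < eps) break          # stop once the loss barely changes
    loss.old <- loss.new
  }
  print(paste('Finished in', i, 'epochs'))
  list(theta.hist, theta)
}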
Step-by-Step Gradient Descent
Let’s pick a starting point (initial parameters), \(\theta^0 = (4,1.5)\), and a step size \(\alpha=0.1\).
Now, we calculate the gradient at that point. Assume you did it correctly and it resulted in the vector below:
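The snippet below reproduces this using MSE.grad() defined earlier; theta.i0 is the name used for \(\theta^0\) in the rest of this section:
theta.i0 <- c(4, 1.5)        # initial parameters theta^0
MSE.grad(x, y, theta.i0)     # gradient of J at theta^0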
## [1] 3.02984 10.81065
Now, the algorithm tells us to update the parameters using the rule below. We will repeat it iteratively, so that the output of one step becomes the input of the next:
\[\theta^1 = \theta^0 - \alpha \nabla J(\theta^0)\] \[\theta^2 = \theta^1 - \alpha \nabla J(\theta^1)\]
Therefore, the new (updated) theta will be
theta.i1 <- theta.i0 - MSE.grad(x,y,theta.i0) * 0.1
theta.i2 <- theta.i1 - MSE.grad(x,y,theta.i1) * 0.1
theta.hist <- rbind(theta.i0,theta.i1,theta.i2)
colnames(theta.hist) <- c('th1', 'th2')
theta.hist
## th1 th2
## theta.i0 4.000000 1.5000000
## theta.i1 3.697016 0.4189355
## theta.i2 3.424330 -0.3312578
The parameter updates can be visualized on the contour plot:
ggplot(dat, aes(x=x,y=y,z=apply(dat,1,f))) +
geom_contour_filled(alpha=.7) +
geom_point(aes(x=theta.i0[1], y=theta.i0[2]), colour='red') +
geom_point(aes(x=theta.i1[1], y=theta.i1[2]), colour='red') +
geom_segment(aes(x=theta.i0[1],y=theta.i0[2],xend=theta.i1[1], yend=theta.i1[2])) +
geom_point(aes(x=theta.i2[1], y=theta.i2[2]), colour='red') +
geom_segment(aes(x=theta.i1[1],y=theta.i1[2],xend=theta.i2[1], yend=theta.i2[2])) +
labs(x='theta0', y='theta1', title = 'J(theta)')
If we repeat the above 20 times (epochs):
Therefore the optimal \(\theta\) vector calculated after 20 iterations is:
## [1] 1.379446 -2.028778
where the correct answer is \((1,-2)\).
The above is a visual of the optimization process. Now, let’s see how each parameter pair (intercept and slope) performs on the data. To save space, we will plot the parameter results of the 5th, 10th, 15th and 20th steps:
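The run that produces the output and parameter history used below is not shown; assuming the BGD helper sketched earlier, it would be something along these lines:
thetas <- BGD(x, y, c(4, 1.5), alpha = 0.1, max_iter = 20)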
## [1] "Finished in 20 epochs"
thetas <- do.call(rbind,thetas[[1]])
plot_line <- function(theta){
ggplot(data.frame(x=x[,2],y=y)) +
geom_point(aes(x,y)) +
geom_abline(aes(intercept=theta[2],slope=theta[3]), colour='steelblue') +
labs(x='x',y='y', title=paste0('Epoch: ', theta[1]))
}
thetas <- cbind(1:nrow(thetas), thetas)
gs <- apply(thetas[seq(5,20,5),], 1,plot_line)
grid.arrange(grobs=gs, ncol=2)
BGD vs SGD
As we discussed before, SGD is a modification of BGD that reduces the computational cost; it was later shown to be effective at finding good parameters as well. Below we compare the two algorithms’ behaviour over 20 epochs. Since SGD updates the parameters at every data point and our data has 100 observations, we will not plot the entire history of parameter updates; for presentation purposes, we will plot 5 parameter vectors per epoch:
set.seed(156)
x <- seq(-3,3,,100)
y <- gen_data(1,-2, x)
x <- cbind(1,x)
res.BGD <- BGD(x,y,c(4,1.5), alpha = 0.2, max_iter = 20)
## [1] "Finished in 20 epochs"
## [1] "maximum epoch 20 was reached"
thetas.BGD <- do.call(rbind,res.BGD[[1]]) %>% data.frame() %>% rename(theta0 = V1, theta1=x)
thetas.SGD <- do.call(rbind,res.SGD[[1]]) %>% data.frame() %>% rename(theta0 = V1, theta1=x)
g1 <- ggplot() +
geom_contour_filled(data=dat, aes(x=x,y=y,z=apply(dat,1,f)), alpha=.7) +
geom_point(data=thetas.BGD, aes(x=theta0,y=theta1), colour='firebrick', shape=4, size=1) +
labs(x='theta0', y='theta1', title = 'BGD') +
coord_fixed() +
theme(legend.position = 'none')
g2 <- ggplot() +
geom_contour_filled(data=dat, aes(x=x,y=y,z=apply(dat,1,f)), alpha=.7) +
geom_point(data=thetas.SGD, aes(x=theta0,y=theta1), colour='firebrick', shape=4, size=1) +
labs(x='theta0', y='theta1', title = 'SGD') +
coord_fixed() +
theme(legend.position = 'none')
grid.arrange(g1,g2,ncol=2)
As we can see in the plots, BGD approaches the optimal point smoothly, whereas SGD behaves more erratically. Still, the trend of SGD’s parameter updates is clear: it also approaches the optimum.
Overfitting
In previous weeks we discussed the problem of overfitting; let’s revisit it here. Below we will fit several models of increasing complexity: linear, quadratic, cubic, quartic, and higher-degree polynomials. As model complexity increases, the fit becomes better at capturing movements in the data. However, higher-degree models can learn the training data too well, which causes poor performance on the test set.
To demonstrate this problem, let’s split the data into train and test sets with an 80/20 split:
house <- read.csv('house_data.csv') %>%
subset(zipcode %in% c('98055', '98108', '98119', '98070', '98039', '98112', '98040', '98116', '98105'))
set.seed(15)
train_inds <- sample(1:nrow(house), size = floor(0.8*nrow(house)))
train <- house[train_inds,]
test <- house[-train_inds,]
Now, let’s define a function that fits a model, makes predictions, and adds the fitted line to the scatter plot. The function below also takes testData as an input. By default, testData is the train set, so the function trains and evaluates on the training data; if we pass another dataset, it trains on train and evaluates on that dataset:
plot_poly <- function(p,testData=train){
fit <- lm(price ~ poly(sqft_living,p), data = train)
preds <- predict(fit,newdata=testData)
rmse <- RMSE(preds,testData$price)
ggplot(testData, aes(y=price, x=sqft_living)) +
geom_point(alpha=.3) +
geom_line(aes(y=preds), colour = 'blue') +
ggtitle(paste0('RMSE: ',round(rmse)))
}
Now, we apply degrees from 1 to 4 and plot:
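The code producing those four panels is not shown; it would presumably be along these lines, reusing plot_poly():
gs <- lapply(1:4, plot_poly)          # fit and plot degrees 1 through 4 on the training set
grid.arrange(grobs = gs, ncol = 2)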
It seems that increasing the degree helps the fitted curve follow the output values more closely.
What about this:
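The plot referred to here is a much higher-degree fit; presumably something like degree 13, the highest degree considered later:
plot_poly(13)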
It looks like we are overfitting. Such behaviour is specific to this dataset, and we don’t expect it to happen in general.
For example, let’s take this model and test it on another set:
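Again, the exact call is not shown; presumably it is the same high-degree model evaluated on the test set:
plot_poly(13, testData = test)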
If we increase the complexity of the model, it will learn the training data ridiculously well, and it will expect the same behaviour in a new dataset.
Model Complexity vs Bias
There is a tradeoff between model complexity and bias: the best model is not the one that learns the training data best, but the one that performs best on another set. Here we will measure the performance using RMSE:
calc_rmse <- function(p){
fit <- lm(price ~ poly(sqft_living,p), data = train)
preds <- predict(fit,newdata=test)
rmse.train <- RMSE(fit$fitted.values, train$price)
rmse.test <- RMSE(preds, test$price)
return(c(TrainLoss=rmse.train,TestLoss=rmse.test))
}
losses <- t(sapply(1:13,calc_rmse))
losses <- data.frame(p=1:13, losses) %>% gather(key = 'Type', 'Loss', -p)
ggplot(losses, aes(x=p, y=Loss, colour=Type)) +
geom_line() +
geom_point()
As the model complexity increases, the training loss decreases gradually and can, in principle, approach zero. On another set, however, the loss decreases at first but starts increasing after some level. That minimum is the sweet spot; here it is the polynomial regression with degree 5.
Validation
A way to deal with this problem is to validate the performance of the models on a dataset separate from the training set. The common practice is to roughly split your data into train/validation/test with a 70/10/20 or 60/20/20 split, depending on the dataset size.
Since we have already split the dataset into train (80%) and test (20%), we will carve the validation set out of the train set, giving roughly a 60/20/20 split overall.
set.seed(15)
val_inds <- sample(1:nrow(train), size = floor(2/8 *nrow(train)))
val <- train[val_inds,]
train <- train[-val_inds,]
c(nrow(train),nrow(val),nrow(test))
## [1] 1149 383 384
calc_rmse <- function(p, valData=train){
fit <- lm(price ~ poly(sqft_living,p), data = train)
preds.train <- predict(fit,newdata=train)
preds.val <- predict(fit,newdata=valData)
rmse.train <- RMSE(preds.train, train$price)
rmse.val <- RMSE(preds.val, valData$price)
return(c(TrainLoss=rmse.train,ValLoss=rmse.val))
}
losses <- t(sapply(1:13, function(p) calc_rmse(p, val)))
losses <- data.frame(p=1:13, losses) %>% gather(key = 'Type', 'Loss', -p)
ggplot(losses, aes(x=p, y=Loss, colour=Type)) +
geom_line() +
geom_point()
The best model seems to be polynomial regression with \(p=9\).
Let’s see how it performs on the test set:
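The plot is not shown here; it would presumably be produced with the plot_poly() helper defined earlier:
plot_poly(9, testData = test)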
As we can see, the behaviour is slightly better compared to \(p=13\), but the spurious movement in the right tail did not vanish.
Parameter Tuning with Cross-Validation
Leave-One-Out
The above example shows that a single validation split is not the ultimate solution. The validation set in the previous example contained the same outlier in the right tail, so the model with \(p=9\) performed best; but that outlier may not always be there.
A better approach is to repeat the split: divide the data into \(k\) pieces, and each time leave one piece out for validation while training on the rest. The procedure results in \(k\) models (e.g. 10 for \(k=10\)), each producing a different loss, and we pick the one producing the minimum loss. The figure below displays this approach:
(Figure: cross-validation folds. Source: Wikipedia)
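Before moving to caret, a manual \(k\)-fold loop for a single candidate degree might look like the sketch below; the seed, the fold assignment, and the degree are arbitrary choices for illustration:
k <- 10
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(train)))    # assign each row to one of k folds
cv_rmse <- sapply(1:k, function(f){
  fit <- lm(price ~ poly(sqft_living, 5), data = train[folds != f, ])   # train on k-1 folds
  RMSE(predict(fit, newdata = train[folds == f, ]), train$price[folds == f])
})
mean(cv_rmse)    # average validation RMSE across the k folds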
CV in Caret
To apply different cross-validation techniques, we will use the caret package. It has built-in functions for validating on the training set, but we still need to split the data into train and test after reading it:
house <- read.csv('house_data.csv') %>%
subset(zipcode %in% c('98055', '98108', '98119', '98070', '98039', '98112', '98040', '98116', '98105'))
set.seed(15)
train_inds <- sample(1:nrow(house), size = floor(0.8*nrow(house)))
train <- house[train_inds,]
test <- house[-train_inds,]
The syntax is the same among different methods:
LOOCV
We will use the train function, which takes lm as the method and form = price ~ poly(sqft_living,10) as the formula:
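The trControl object passed below is not defined in this text; for LOOCV it would presumably be created as follows:
ctrl <- trainControl(method = "LOOCV")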
cv_fit <- train(
form = price ~ poly(sqft_living,10),
data = train,
method = "lm",
trControl = ctrl
)
summary(cv_fit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1508281 -185928 -1532 146390 1556338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 794798 8211 96.796 < 2e-16 ***
## `poly(sqft_living, 10)1` 19681431 321389 61.239 < 2e-16 ***
## `poly(sqft_living, 10)2` 4386368 321389 13.648 < 2e-16 ***
## `poly(sqft_living, 10)3` -1056697 321389 -3.288 0.00103 **
## `poly(sqft_living, 10)4` 981317 321389 3.053 0.00230 **
## `poly(sqft_living, 10)5` 400651 321389 1.247 0.21273
## `poly(sqft_living, 10)6` -636126 321389 -1.979 0.04796 *
## `poly(sqft_living, 10)7` 563077 321389 1.752 0.07997 .
## `poly(sqft_living, 10)8` 1319828 321389 4.107 4.23e-05 ***
## `poly(sqft_living, 10)9` 1610866 321389 5.012 6.01e-07 ***
## `poly(sqft_living, 10)10` 860661 321389 2.678 0.00749 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 321400 on 1521 degrees of freedom
## Multiple R-squared: 0.7252, Adjusted R-squared: 0.7234
## F-statistic: 401.4 on 10 and 1521 DF, p-value: < 2.2e-16
After leave-one-out validation, caret refits the final model on the full training set; that refit model is the one summarized above.
k-fold CV
ctrl <- trainControl(method = "cv", number = 10)
cv_fit <- train(
form = price ~ poly(sqft_living,10),
data = train,
method = "lm",
trControl = ctrl
)
summary(cv_fit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1508281 -185928 -1532 146390 1556338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 794798 8211 96.796 < 2e-16 ***
## `poly(sqft_living, 10)1` 19681431 321389 61.239 < 2e-16 ***
## `poly(sqft_living, 10)2` 4386368 321389 13.648 < 2e-16 ***
## `poly(sqft_living, 10)3` -1056697 321389 -3.288 0.00103 **
## `poly(sqft_living, 10)4` 981317 321389 3.053 0.00230 **
## `poly(sqft_living, 10)5` 400651 321389 1.247 0.21273
## `poly(sqft_living, 10)6` -636126 321389 -1.979 0.04796 *
## `poly(sqft_living, 10)7` 563077 321389 1.752 0.07997 .
## `poly(sqft_living, 10)8` 1319828 321389 4.107 4.23e-05 ***
## `poly(sqft_living, 10)9` 1610866 321389 5.012 6.01e-07 ***
## `poly(sqft_living, 10)10` 860661 321389 2.678 0.00749 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 321400 on 1521 degrees of freedom
## Multiple R-squared: 0.7252, Adjusted R-squared: 0.7234
## F-statistic: 401.4 on 10 and 1521 DF, p-value: < 2.2e-16
Repeated CV
Again, we use the train function with lm as the method and form = price ~ poly(sqft_living,10) as the formula; only the resampling control changes:
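The control object is not shown at this point; for repeated CV it is presumably the same one used in the tuning section below:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)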
cv_fit <- train(
form = price ~ poly(sqft_living,10),
data = train,
method = "lm",
trControl = ctrl
)
summary(cv_fit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1508281 -185928 -1532 146390 1556338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 794798 8211 96.796 < 2e-16 ***
## `poly(sqft_living, 10)1` 19681431 321389 61.239 < 2e-16 ***
## `poly(sqft_living, 10)2` 4386368 321389 13.648 < 2e-16 ***
## `poly(sqft_living, 10)3` -1056697 321389 -3.288 0.00103 **
## `poly(sqft_living, 10)4` 981317 321389 3.053 0.00230 **
## `poly(sqft_living, 10)5` 400651 321389 1.247 0.21273
## `poly(sqft_living, 10)6` -636126 321389 -1.979 0.04796 *
## `poly(sqft_living, 10)7` 563077 321389 1.752 0.07997 .
## `poly(sqft_living, 10)8` 1319828 321389 4.107 4.23e-05 ***
## `poly(sqft_living, 10)9` 1610866 321389 5.012 6.01e-07 ***
## `poly(sqft_living, 10)10` 860661 321389 2.678 0.00749 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 321400 on 1521 degrees of freedom
## Multiple R-squared: 0.7252, Adjusted R-squared: 0.7234
## F-statistic: 401.4 on 10 and 1521 DF, p-value: < 2.2e-16
Finetuning with CV
Now, let’s fit polynomial regressions of different degrees and compare them using repeated cross-validation.
set.seed(12)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit.p1 <- train(form = price ~ poly(sqft_living,1), data = train, method = "lm", trControl = ctrl)
fit.p2 <- train(form = price ~ poly(sqft_living,2), data = train, method = "lm", trControl = ctrl)
fit.p3 <- train(form = price ~ poly(sqft_living,3), data = train, method = "lm", trControl = ctrl)
fit.p4 <- train(form = price ~ poly(sqft_living,4), data = train, method = "lm", trControl = ctrl)
fit.p5 <- train(form = price ~ poly(sqft_living,5), data = train, method = "lm", trControl = ctrl)
fit.p6 <- train(form = price ~ poly(sqft_living,6), data = train, method = "lm", trControl = ctrl)
fit.p7 <- train(form = price ~ poly(sqft_living,7), data = train, method = "lm", trControl = ctrl)
fit.p8 <- train(form = price ~ poly(sqft_living,8), data = train, method = "lm", trControl = ctrl)
fit.p9 <- train(form = price ~ poly(sqft_living,9), data = train, method = "lm", trControl = ctrl)
fit.p10 <- train(form = price ~ poly(sqft_living,10), data = train, method = "lm", trControl = ctrl)
summry <- summary(resamples(list(fit.p1,fit.p2,fit.p3,fit.p4,fit.p5,fit.p6,fit.p7,fit.p8,fit.p9,fit.p10)))
summry$statistics$RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Model01 284612.7 321858.5 342107.3 346685.8 368302.6 422991.1 0
## Model02 269984.3 302626.6 324639.4 329238.2 353629.5 392286.2 0
## Model03 269124.3 303032.2 332235.5 330894.6 353798.8 397854.9 0
## Model04 271373.2 315258.8 330550.3 332041.2 347136.2 397091.3 0
## Model05 280040.9 315571.1 328601.5 331596.8 340613.5 412780.4 0
## Model06 248981.5 314619.6 335604.0 339024.4 366154.7 449801.3 0
## Model07 250551.2 311934.9 335346.4 368024.1 366977.2 1446515.8 0
## Model08 278307.1 320561.3 332965.0 342464.2 362189.6 496181.1 0
## Model09 286268.1 309831.3 332083.0 970463.3 343815.6 15151456.8 0
## Model10 280660.7 317794.4 336441.7 1535824.7 361751.4 31376473.0 0
Judging by the Min. column, the 6th model attains the lowest RMSE in its best resample, which is why we pick it here as the final fit; note, however, that by mean and median RMSE the quadratic model (Model02) is the strongest candidate.