Week 4: Nonlinear Regressions
Regressions vs. the World
Introduction
In this tutorial, we will cover how batch and stochastic gradient descent behave and how to fit nonlinear regressions, namely locally weighted regression and logistic regression. We will use the movies dataset for both tasks, which is available at data.world. Overall, we will
- summarize the data and plot basic visuals,
- compare linear regressions with robust and locally weighted alternatives,
- introduce logistic regression and the classification task,
- learn about cross-entropy loss.
Getting Started
To follow along, you will need
tidyverse
caret
gridExtra
ggExtra
MLmetrics
ISLR
- movies dataset.
Now, load the tidyverse
library and read the data:
Getting Familiar with the Data
movies <- read.csv("https://query.data.world/s/7ej7gdpyitpf6d3wr3ge557ioltdwx", header=TRUE, stringsAsFactors=FALSE);
rmarkdown::paged_table(movies)
## [1] 399 20
## 'data.frame': 399 obs. of 20 variables:
## $ audience_freshness: int 92 89 93 86 71 66 76 73 52 92 ...
## $ poster_url : chr "http://resizing.flixster.com/gxRJwetP1eNIrPR6xlWBfD_VaFc=/180x267/dkpu1ddg7pbsk.cloudfront.net/movie/11/17/72/11177246_ori.jpg" "http://resizing.flixster.com/gDtbA1iPxTYEjBZeSsIYktCtUiw=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/18/91/11189194_ori.jpg" "http://resizing.flixster.com/YrF_OeTQx3bXNsMLIy_NocjA4wI=/180x267/dkpu1ddg7pbsk.cloudfront.net/movie/11/17/80/11178082_ori.jpg" "http://resizing.flixster.com/l9yjA-72sZMYECeOjx-r14mgXtU=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/19/08/11190860_ori.jpg" ...
## $ rt_audience_score : num 4.3 4.2 4.4 4.2 3.8 3.7 3.9 3.8 3.3 4.3 ...
## $ rt_freshness : int 89 90 91 72 49 53 61 65 18 91 ...
## $ rt_score : num 7.5 7.9 7.7 7 5.7 5.9 6.3 6.3 3.9 7.6 ...
## $ X2015_inflation : chr "-0.26%" "-0.26%" "-0.26%" "-0.26%" ...
## $ adjusted : chr "$712,903,691.09 " "$706,988,165.89 " "$772,158,880.00 " "$671,220,455.10 " ...
## $ genres : chr "Sci-Fi\nAdventure\nAction" "Sci-Fi\nDrama\nAction" "Sci-Fi\nAdventure\nAction" "Sci-Fi\nAdventure" ...
## $ Genre_1 : chr "Sci-Fi" "Sci-Fi" "Sci-Fi" "Sci-Fi" ...
## $ Genre_2 : chr "Adventure" "Drama" "Adventure" "Adventure" ...
## $ Genre_3 : chr "Action" "Action" "Action" "" ...
## $ imdb_rating : num 7.8 7.7 8.1 8.7 7.1 6.9 7.5 6.8 5.8 8.1 ...
## $ length : int 136 130 121 169 97 142 144 123 165 131 ...
## $ rank_in_year : int 7 9 3 10 4 8 2 5 1 6 ...
## $ rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ release_date : chr "4-Apr-14" "11-Jul-14" "1-Aug-14" "7-Nov-14" ...
## $ studio : chr "Marvel Studios" "20th Century Fox" "Marvel Studios" "Paramount Pictures / Warner Bros." ...
## $ title : chr "Captain America: The Winter Soldier" "Dawn of the Planet of the Apes" "Guardians of the Galaxy" "Interstellar" ...
## $ worldwide_gross : chr "$714,766,572.00 " "$708,835,589.00 " "$774,176,600.00 " "$672,974,414.00 " ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
Lovely dataset, isn't it? It contains 20 features about 399 movies. The rt_ columns hold the Rotten Tomatoes scores, and adjusted is how much each movie grossed after inflation is adjusted. One problem with the data is that some variables are stored as characters even though they are numeric.
We can convert the gross figures into numeric variables by removing the $ signs, commas, and dots. We will skip worldwide_gross and X2015_inflation since adjusted already carries that information. In the line below, '[^0-9]' is a regular expression that matches any character except a digit, and gsub replaces every match with nothing (''). Note that because the decimal point is removed as well, the resulting values end up in cents rather than dollars. Even though the results look like numbers, they are still characters, so we have to convert them with as.numeric. We will also use the lubridate package to process dates; dmy is a generic function that parses dates in day-month-year form.
library('lubridate')
movies$adjusted <- as.numeric(gsub('[^0-9]','',movies$adjusted))
movies$release_date <- dmy(movies$release_date)
## [1] 398 20
## [1] "Sci-Fi" "Family" "Fantasy" "Thriller" "Comedy" "Adventure"
## [7] "Mystery" "Animation" "Crime" "Romance" "Drama" "War"
## [13] "History" "Music" "Action" "Western" "Horror" "Sport"
## [19] "Musical"
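As a quick sanity check, the same cleaning steps can be tried on a couple of toy values (a sketch; note that the decimal point is stripped too, so the result is in cents):

```r
library(lubridate)

# strip every non-digit character: "$714,766,572.00 " becomes 71476657200 (cents)
as.numeric(gsub('[^0-9]', '', c("$714,766,572.00 ", "$1,084,439,099.00 ")))

# dmy() parses day-month-year strings into Date objects
dmy(c("4-Apr-14", "11-Jul-14"))
```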
Nolan <- c('Interstellar', 'Inception', 'Tenet', 'Prestige', 'The Dark Knight', 'The Dark Knight Rises', 'Batman Begins', 'Dunkirk', 'Memento', 'Insomnia')
subset(movies, title %in% Nolan)
## audience_freshness
## 4 86
## 27 90
## 45 91
## 69 94
## 91 94
## poster_url
## 4 http://resizing.flixster.com/l9yjA-72sZMYECeOjx-r14mgXtU=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/19/08/11190860_ori.jpg
## 27 http://resizing.flixster.com/_7NLUQGoKyQuOq7uhhARpzeku04=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/78/11167835_ori.jpg
## 45 http://resizing.flixster.com/THwVJ6BnMSZz1ygWPdIEok_wVhA=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/67/11166725_ori.jpg
## 69 http://resizing.flixster.com/csINXh6bBY_YMax7pD2fFGtGDjQ=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/51/11165160_ori.jpg
## 91 http://resizing.flixster.com/fg1Y49hq1wUyVH-VdqeAksXVM4A=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/52/11165262_ori.jpg
## rt_audience_score rt_freshness rt_score X2015_inflation adjusted
## 4 4.2 72 7.0 -0.26% 67122045510
## 27 4.3 87 8.0 2.84% 111525856781
## 45 4.2 86 8.0 8.28% 89391514690
## 69 4.4 94 8.6 9.67% 109339230294
## 91 3.9 85 7.7 20.90% 45060888905
## genres Genre_1 Genre_2 Genre_3 imdb_rating length
## 4 Sci-Fi\nAdventure Sci-Fi Adventure 8.7 169
## 27 Thriller\nAction Thriller Action 8.5 165
## 45 Sci-Fi\nMystery\nAction Sci-Fi Mystery Action 8.8 148
## 69 Drama\nCrime\nAction Drama Crime Action 9.0 152
## 91 Adventure\nAction Adventure Action 8.3 140
## rank_in_year rating release_date studio
## 4 10 PG-13 2014-11-07 Paramount Pictures / Warner Bros.
## 27 3 PG-13 2012-07-20 Warner Bros. / Legendary
## 45 4 PG-13 2010-07-16 Warner Bros. / Legendary
## 69 1 PG-13 2008-07-18 Warner Bros. / Legendary
## 91 9 PG-13 2005-06-15 Warner Bros / Legendary
## title worldwide_gross year
## 4 Interstellar $672,974,414.00 2014
## 27 The Dark Knight Rises $1,084,439,099.00 2012
## 45 Inception $825,531,030.00 2010
## 69 The Dark Knight $997,000,000.00 2008
## 91 Batman Begins $372,710,015.00 2005
library('gridExtra')
g1 <- ggplot(movies, aes(x=adjusted)) +
geom_histogram(colour='white')
g2 <- ggplot(movies, aes(x=rt_freshness)) +
geom_histogram(colour='white')
g3 <- ggplot(movies, aes(x=rt_score)) +
geom_histogram(colour='white')
g4 <- ggplot(movies, aes(x=length)) +
geom_histogram(colour='white')
grid.arrange(g1,g2,g3,g4, ncol=2)
adjusted is right-skewed, and rt_score looks symmetric if we ignore the outlier. length is also right-skewed, with a bunch of Christopher Nolan movies on the right tail.
Pairs Plot
numeric.cols <- summarize_all(movies, is.numeric) %>% unlist()
pairs(movies[,numeric.cols], col = adjustcolor('firebrick', .3), pch=16)
Some pairs reveal relationships. The Rotten Tomatoes scores are clearly positively related to each other, and the IMDB ratings also suggest a positive relation with them. year and adjusted are positively correlated as well, which seems to say that money buys happiness. I hope this tells you something.
library('ggExtra')
g <- ggplot(movies, aes(y=rt_score, x=rt_freshness)) +
geom_point(alpha=.3, position = position_jitter())
ggMarginal(g, type='histogram')
g1 <- ggplot(movies, aes(y=rt_freshness, x=audience_freshness)) +
geom_point(alpha=.3)
g2 <- ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.3)
g3 <- ggplot(movies, aes(y=rt_audience_score, x=year)) +
geom_point(alpha=.3, position = position_jitter())
g4 <- ggplot(movies, aes(y=adjusted, x=release_date)) +
geom_point(alpha=.3)
grid.arrange(g1,g2,g3,g4, ncol=2)
Some Notes about the Train / Test Split
Last week we covered the train / test split, but it may not have been entirely clear why we need it. Training the model on the train set does not, by itself, make it perform better on the test set; rather, evaluating on held-out data gives a fairer comparison between models.
We use the train / test split to compare different models, not to fine-tune a single one. This is how ML contests are conducted: participants are given a train set and asked to build the best model they can, and the performances are later evaluated on a separate test set.
For example, if someone says she invented a robust linear regression model that doesn't get fooled by outliers and therefore performs better, you must test whether the model she came up with really is the better one. Let's first try it without a split:
library('MASS')
lm.fit <- lm(audience_freshness ~ rt_audience_score, data=movies)
rlm.fit <- rlm(audience_freshness ~ rt_audience_score, data = movies)
preds.lm <- predict(lm.fit)
preds.rlm <- predict(rlm.fit)
ggplot(movies, aes(y=audience_freshness, x=rt_audience_score)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue') +
geom_line(aes(y=preds.rlm), colour = 'firebrick')
Linear regression is pulled toward the outliers at the top left, whereas the robust linear model adjusts the importance of the observations and is less affected by them.
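The RMSE values below come from a hidden chunk; they can be reproduced with a small helper like this (a sketch, assuming the fits above):

```r
# root mean squared error between truth y and predictions yhat
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# e.g. rmse(movies$audience_freshness, preds.lm)
#      rmse(movies$audience_freshness, preds.rlm)
```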
## [1] 10.04971
## [1] 10.07345
The robust linear model underperformed here because it down-weighted the outliers, which were part of the evaluation data.
Train / Test Split
This behaviour makes RLM more powerful than LM, but since we trained and tested on the same data, it underperformed. Now let's try a train / test split:
set.seed(156)
train_inds <- sample(1:nrow(movies), floor(nrow(movies)*0.8))
train <- movies[ train_inds, ]
test <- movies[-train_inds, ]
cat('train: ', nrow(train), ', test: ', nrow(test))
## train: 318 , test: 80
lm.fit <- lm(audience_freshness ~ rt_audience_score, data=train)
rlm.fit <- rlm(audience_freshness ~ rt_audience_score, data = train)
preds.lm <- predict(lm.fit, test)
preds.rlm <- predict(rlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=rt_audience_score)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.rlm), colour = 'firebrick', alpha=.7)
This time, since the outliers didn't end up in the test set, the linear model overestimated the movies with lower audience scores. Weighting the observations to control their influence helped RLM avoid being fooled by data points seen only in the training set.
## [1] 9.509842
## [1] 9.477721
Therefore RLM performed better on the test set.
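Under the hood, MASS::rlm fits by M-estimation with a Huber-type loss: quadratic for small residuals, linear for large ones, which is exactly what tames the outliers. A sketch (k = 1.345 is the usual tuning constant):

```r
# Huber loss: behaves like squared error near 0, like absolute error in the tails
huber <- function(r, k = 1.345) {
  ifelse(abs(r) <= k, r^2 / 2, k * (abs(r) - k / 2))
}
huber(c(0, 1, 10))
```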
Locally Weighted Regression
Linear regressions are not always favourable; they are too picky about what they eat: you cannot feed them inputs and outputs that don't have a linear relationship. Consider the plot below:
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter())
Obviously the relation is not linear.
Locally Weighted Regression (LWR) is a nonlinear method that makes a very nice alternative to linear regression. It has strong advantages:
- it can track your data points very closely,
- unlike polynomial regression, you don't have to pick a degree to control the output shape,
- it is entirely data driven.
However, there are notable disadvantages:
- it is not interpretable,
- it requires a lot of data; in sparse regions it either fails to produce predictions or overfits.
In R, it is available as loess:
lm.fit <- lm(audience_freshness ~ imdb_rating, data=movies)
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = movies)
preds.lm <- predict(lm.fit)
preds.wlm <- predict(wlm.fit)
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.6, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.8) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.8)
LWR can trace the patterns in the data very well. However, the points on the left also show that it has the potential to overfit. When the sample size is large and the data is not too sparse, LWR can perform very well.
## [1] 8.72954
## [1] 7.680297
Span Parameter
The span parameter controls the window of points considered for each local fit. Points inside the window are weighted by their distance from the target point; loess uses a tri-cube weight function, so points near the edge of the window get weights close to 0. By changing the span parameter we can widen or narrow this window:
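For reference, the tri-cube kernel that loess uses for its local weights can be sketched as (d is the distance to the target point, scaled so the edge of the window sits at 1):

```r
# tri-cube kernel: weight 1 at the target point, falling smoothly to 0 at the window edge
tricube <- function(d) ifelse(abs(d) < 1, (1 - abs(d)^3)^3, 0)
tricube(c(0, 0.5, 1))
```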
library('gridExtra')
plot_loess <- function(span){
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = movies, span = span )
preds.wlm <- predict(wlm.fit)
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.4, position = position_jitter()) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=1, size=1)
}
gs <- lapply(c(0.05,0.1,0.2,0.3), plot_loess)
grid.arrange(grobs=gs, ncol=2)
Train / Test Split
Let’s see how it performs on a test set:
set.seed(12)
train_inds <- sample(1:nrow(movies), floor(nrow(movies)*0.8))
train <- movies[ train_inds, ]
test <- movies[-train_inds, ]
cat('train: ', nrow(train), ', test: ', nrow(test))
## train: 318 , test: 80
lm.fit <- lm(audience_freshness ~ imdb_rating, data=train)
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = train)
preds.lm <- predict(lm.fit, test)
preds.wlm <- predict(wlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.7)
It captures the S-shape very well. The left tail is also a better fit than LM's. But it is not hard to see that it could get worse.
## [1] 11.12495
## [1] 9.560652
Let's add some more variables. Since LWR tries to follow the points, unnecessary variables can hurt its performance too:
lm.fit <- lm(audience_freshness ~ imdb_rating + length + year + adjusted, data=train)
wlm.fit <- loess(audience_freshness ~ imdb_rating + length + year + adjusted, data = train)
preds.lm <- predict(lm.fit, test)
preds.wlm <- predict(wlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.7)
## [1] 11.07527
## [1] 11.36595
While linear regression reduced its RMSE on the test set, LWR performed worse than what it achieved before.
Classification with Logistic Regression
In supervised learning there are two major tasks: regression and classification. The former predicts a numeric output value, and the error is calculated from how far the estimate is from the true output. In classification we aim to predict the output's class, e.g. negative review vs. positive review, rock vs. rap, defaulted loan vs. non-defaulted.
Logistic regression is essentially a linear model, except the output is squeezed between 0 and 1 with a logistic function. This particular sigmoid shape pairs very naturally with probabilities.
\[z_i = \beta_0 + \beta_1 X_i\]
\[P(Y_i \in c) = \sigma(z_i) = \frac{1}{1+e^{-z_i}}\]
where \(c\) is a particular class, e.g. the loan belongs to the defaulted ones, and \(\sigma\) is the logistic (sigmoid) function.
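The sigmoid itself is one line of R (a quick sketch):

```r
# logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-5, 0, 5))
```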
In R, a logistic regression model can be fit using glm, for generalized linear model. We will specify family=binomial(), which implies the outputs are binary.
library('ISLR')
logreg.fit <- glm(default ~ balance,data = Default, family = binomial())
temp <- mutate(Default, default = ifelse(default == 'Yes', 1 , 0))
probs <- predict(logreg.fit,type = 'response')
ggplot(temp, aes(y=default, x=balance, colour = probs>=0.5)) +
geom_point(alpha=.8) +
geom_line(aes(y=probs), colour='black')
The curve doesn't fit the points perfectly, but that is not what we are looking for anyway. It is strictly increasing: it rises slowly at first, climbs fastest around the point where the probability equals 0.5, and then flattens out again. If the predicted probability is less than 0.5 we classify the observation as "does not belong to c", and if it is greater than 0.5 we classify it as "belongs to c".
##
## Call:
## glm(formula = default ~ balance, family = binomial(), data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
Balance seems to be significant in explaining whether the loan will default.
library('ISLR')
logreg.fit <- glm(default ~ balance + income + student,data = Default, family = binomial())
temp <- mutate(Default, default = ifelse(default == 'Yes', 1 , 0))
probs <- predict(logreg.fit,type = 'response')
ggplot(temp, aes(y=default, x=balance, colour = probs>=0.5)) +
geom_point(alpha=.8) +
geom_line(aes(y=probs), colour='black')
Evaluating Performance
MSE or RMSE are not suitable for classification; they cannot tell us much there. The most intuitive measure is accuracy, the proportion of correct predictions among all:
probs <- predict(logreg.fit, type = 'response')
preds <- ifelse(probs >= 0.5, "Yes", "No")
target <- Default$default
acc <- mean(preds == target)
acc
## [1] 0.9732
It looks like our predictions are great: the accuracy is 97.32%.
A confusion matrix can tell us more about classification performance. We will use the MLmetrics package for it:
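The table below can also be produced with base R's table(); here is a self-contained toy sketch (the labels are made up, in the tutorial the inputs would be preds and Default$default):

```r
# confusion matrix via table(): rows are the true classes, columns the predictions
y_true <- c("No", "No", "No", "Yes", "Yes")
y_pred <- c("No", "Yes", "No", "Yes", "No")
table(y_true = y_true, y_pred = y_pred)
```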
## y_pred
## y_true No Yes
## No 9627 40
## Yes 228 105
Accuracy is a good measure, but it is not smooth enough for us to optimize. In ML it is very common to use the cross-entropy loss, defined as:
\[CEloss = -\frac{1}{N}\sum_{i=1}^{N}\left[\,t_i \log y_i + (1-t_i) \log (1-y_i)\,\right]\]
where \(t_i\) is 1 if observation \(i\) belongs to the class and 0 otherwise, and \(y_i\) is the predicted probability. We will see this formula more often later on. We can calculate the CE loss as below:
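Written out directly, the formula above is only a few lines (a sketch; MLmetrics::LogLoss does the same with clamping to avoid log(0)):

```r
# binary cross-entropy: t are 0/1 targets, y are predicted probabilities
ce_loss <- function(t, y, eps = 1e-15) {
  y <- pmin(pmax(y, eps), 1 - eps)  # clamp away from 0 and 1
  -mean(t * log(y) + (1 - t) * log(1 - y))
}
ce_loss(c(1, 0), c(0.9, 0.1))
```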
library('MLmetrics')
probs <- predict(logreg.fit, type = 'response')
preds <- ifelse(probs >= 0.5, 1, 0)
target <- ifelse(Default$default == 'Yes', 1, 0)
# note: LogLoss expects predicted probabilities; passing hard 0/1 predictions
# (as here) clamps them internally and inflates the loss
CEloss <- LogLoss(preds, target)
cat('Accuracy: ', acc, "\nCE Loss: ", CEloss)
## Accuracy: 0.9732
## CE Loss: 0.9256424