Week 4: Nonlinear Regressions
Regressions vs. the World
Introduction
In this tutorial, we will cover how batch and stochastic gradient descent behave and how to fit nonlinear regressions, namely locally weighted regression and logistic regression. We will use the movies dataset for both tasks, which is available at data.world. Overall, we will
- summarize the data and plot basic visuals,
- compare linear regressions with robust and locally weighted alternatives,
- introduce logistic regression and the classification task,
- learn about cross-entropy loss.
Getting Started
To follow along, you will need
tidyverse
caret
gridExtra
ggExtra
MLmetrics
ISLR
- movies dataset.
Now, load the tidyverse
library and read the data:
Getting Familiar with the Data
movies <- read.csv("https://query.data.world/s/7ej7gdpyitpf6d3wr3ge557ioltdwx", header=TRUE, stringsAsFactors=FALSE);
rmarkdown::paged_table(movies)
## [1] 399 20
## 'data.frame': 399 obs. of 20 variables:
## $ audience_freshness: int 92 89 93 86 71 66 76 73 52 92 ...
## $ poster_url : chr "http://resizing.flixster.com/gxRJwetP1eNIrPR6xlWBfD_VaFc=/180x267/dkpu1ddg7pbsk.cloudfront.net/movie/11/17/72/11177246_ori.jpg" "http://resizing.flixster.com/gDtbA1iPxTYEjBZeSsIYktCtUiw=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/18/91/11189194_ori.jpg" "http://resizing.flixster.com/YrF_OeTQx3bXNsMLIy_NocjA4wI=/180x267/dkpu1ddg7pbsk.cloudfront.net/movie/11/17/80/11178082_ori.jpg" "http://resizing.flixster.com/l9yjA-72sZMYECeOjx-r14mgXtU=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/19/08/11190860_ori.jpg" ...
## $ rt_audience_score : num 4.3 4.2 4.4 4.2 3.8 3.7 3.9 3.8 3.3 4.3 ...
## $ rt_freshness : int 89 90 91 72 49 53 61 65 18 91 ...
## $ rt_score : num 7.5 7.9 7.7 7 5.7 5.9 6.3 6.3 3.9 7.6 ...
## $ X2015_inflation : chr "-0.26%" "-0.26%" "-0.26%" "-0.26%" ...
## $ adjusted : chr "$712,903,691.09 " "$706,988,165.89 " "$772,158,880.00 " "$671,220,455.10 " ...
## $ genres : chr "Sci-Fi\nAdventure\nAction" "Sci-Fi\nDrama\nAction" "Sci-Fi\nAdventure\nAction" "Sci-Fi\nAdventure" ...
## $ Genre_1 : chr "Sci-Fi" "Sci-Fi" "Sci-Fi" "Sci-Fi" ...
## $ Genre_2 : chr "Adventure" "Drama" "Adventure" "Adventure" ...
## $ Genre_3 : chr "Action" "Action" "Action" "" ...
## $ imdb_rating : num 7.8 7.7 8.1 8.7 7.1 6.9 7.5 6.8 5.8 8.1 ...
## $ length : int 136 130 121 169 97 142 144 123 165 131 ...
## $ rank_in_year : int 7 9 3 10 4 8 2 5 1 6 ...
## $ rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ release_date : chr "4-Apr-14" "11-Jul-14" "1-Aug-14" "7-Nov-14" ...
## $ studio : chr "Marvel Studios" "20th Century Fox" "Marvel Studios" "Paramount Pictures / Warner Bros." ...
## $ title : chr "Captain America: The Winter Soldier" "Dawn of the Planet of the Apes" "Guardians of the Galaxy" "Interstellar" ...
## $ worldwide_gross : chr "$714,766,572.00 " "$708,835,589.00 " "$774,176,600.00 " "$672,974,414.00 " ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
Lovely dataset, isn't it? It contains 20 features about 399 movies. The rt_ columns hold the Rotten Tomatoes scores, and adjusted is how much each movie grossed after inflation is adjusted. One problem with the data is that some variables are stored as characters even though they are numeric.
We can convert the gross figures into numeric variables by removing the $ signs, commas, and dots. We will skip worldwide_gross and X2015_inflation since adjusted already carries that information. In the line below, '[^0-9]' is a regular expression that matches any character except a digit, and gsub replaces every match with nothing (''). Note that because the decimal point is removed as well, the resulting values end up in cents rather than dollars. Even though the results look like numbers, they are still characters, so we have to convert them with as.numeric. We will also use the lubridate package to process dates; dmy is a generic function that parses dates in day-month-year form.
library('lubridate')
movies$adjusted <- as.numeric(gsub('[^0-9]','',movies$adjusted))
movies$release_date <- dmy(movies$release_date)
## [1] 398 20
## [1] "Sci-Fi" "Family" "Fantasy" "Thriller" "Comedy" "Adventure"
## [7] "Mystery" "Animation" "Crime" "Romance" "Drama" "War"
## [13] "History" "Music" "Action" "Western" "Horror" "Sport"
## [19] "Musical"
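As a quick sanity check, the same cleaning steps can be tried on a couple of toy values (a sketch; note that the decimal point is stripped too, so the result is in cents):

```r
library(lubridate)

# strip every non-digit character: "$714,766,572.00 " becomes 71476657200 (cents)
as.numeric(gsub('[^0-9]', '', c("$714,766,572.00 ", "$1,084,439,099.00 ")))

# dmy() parses day-month-year strings into Date objects
dmy(c("4-Apr-14", "11-Jul-14"))
```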
Nolan <- c('Interstellar', 'Inception', 'Tenet', 'Prestige', 'The Dark Knight', 'The Dark Knight Rises', 'Batman Begins', 'Dunkirk', 'Memento', 'Insomnia')
subset(movies, title %in% Nolan)
## audience_freshness
## 4 86
## 27 90
## 45 91
## 69 94
## 91 94
## poster_url
## 4 http://resizing.flixster.com/l9yjA-72sZMYECeOjx-r14mgXtU=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/19/08/11190860_ori.jpg
## 27 http://resizing.flixster.com/_7NLUQGoKyQuOq7uhhARpzeku04=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/78/11167835_ori.jpg
## 45 http://resizing.flixster.com/THwVJ6BnMSZz1ygWPdIEok_wVhA=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/67/11166725_ori.jpg
## 69 http://resizing.flixster.com/csINXh6bBY_YMax7pD2fFGtGDjQ=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/51/11165160_ori.jpg
## 91 http://resizing.flixster.com/fg1Y49hq1wUyVH-VdqeAksXVM4A=/180x270/dkpu1ddg7pbsk.cloudfront.net/movie/11/16/52/11165262_ori.jpg
## rt_audience_score rt_freshness rt_score X2015_inflation adjusted
## 4 4.2 72 7.0 -0.26% 67122045510
## 27 4.3 87 8.0 2.84% 111525856781
## 45 4.2 86 8.0 8.28% 89391514690
## 69 4.4 94 8.6 9.67% 109339230294
## 91 3.9 85 7.7 20.90% 45060888905
## genres Genre_1 Genre_2 Genre_3 imdb_rating length
## 4 Sci-Fi\nAdventure Sci-Fi Adventure 8.7 169
## 27 Thriller\nAction Thriller Action 8.5 165
## 45 Sci-Fi\nMystery\nAction Sci-Fi Mystery Action 8.8 148
## 69 Drama\nCrime\nAction Drama Crime Action 9.0 152
## 91 Adventure\nAction Adventure Action 8.3 140
## rank_in_year rating release_date studio
## 4 10 PG-13 2014-11-07 Paramount Pictures / Warner Bros.
## 27 3 PG-13 2012-07-20 Warner Bros. / Legendary
## 45 4 PG-13 2010-07-16 Warner Bros. / Legendary
## 69 1 PG-13 2008-07-18 Warner Bros. / Legendary
## 91 9 PG-13 2005-06-15 Warner Bros / Legendary
## title worldwide_gross year
## 4 Interstellar $672,974,414.00 2014
## 27 The Dark Knight Rises $1,084,439,099.00 2012
## 45 Inception $825,531,030.00 2010
## 69 The Dark Knight $997,000,000.00 2008
## 91 Batman Begins $372,710,015.00 2005
library('gridExtra')
g1 <- ggplot(movies, aes(x=adjusted)) +
geom_histogram(colour='white')
g2 <- ggplot(movies, aes(x=rt_freshness)) +
geom_histogram(colour='white')
g3 <- ggplot(movies, aes(x=rt_score)) +
geom_histogram(colour='white')
g4 <- ggplot(movies, aes(x=length)) +
geom_histogram(colour='white')
grid.arrange(g1,g2,g3,g4, ncol=2)
adjusted is right-skewed, and rt_score looks symmetric if we ignore the outlier. length is also right-skewed, with a bunch of Christopher Nolan movies on the right tail.
Pairs Plot
numeric.cols <- summarize_all(movies, is.numeric) %>% unlist()
pairs(movies[,numeric.cols], col = adjustcolor('firebrick', .3), pch=16)
Some pairs reveal relationships. The Rotten Tomatoes scores are clearly positively related to each other, and the IMDB ratings also suggest a positive relation with them. year and adjusted are positively correlated as well, which seems to say that money buys happiness. I hope this tells you something.
library('ggExtra')
g <- ggplot(movies, aes(y=rt_score, x=rt_freshness)) +
geom_point(alpha=.3, position = position_jitter())
ggMarginal(g, type='histogram')
g1 <- ggplot(movies, aes(y=rt_freshness, x=audience_freshness)) +
geom_point(alpha=.3)
g2 <- ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.3)
g3 <- ggplot(movies, aes(y=rt_audience_score, x=year)) +
geom_point(alpha=.3, position = position_jitter())
g4 <- ggplot(movies, aes(y=adjusted, x=release_date)) +
geom_point(alpha=.3)
grid.arrange(g1,g2,g3,g4, ncol=2)
Some Notes about the Train / Test Split
Last week we covered the train / test split, but it may not have been entirely clear why we need it. Training the model on the train set does not, by itself, make it perform better on the test set; rather, evaluating on held-out data gives a fairer comparison between models.
We use the train / test split to compare different models, not to fine-tune a single one. This is how ML contests are conducted: participants are given a train set and asked to build the best model they can, and the performances are later evaluated on a separate test set.
For example, if someone says she invented a robust linear regression model that doesn't get fooled by outliers and therefore performs better, you must test whether the model she came up with really is the better one. Let's first try it without a split:
library('MASS')
lm.fit <- lm(audience_freshness ~ rt_audience_score, data=movies)
rlm.fit <- rlm(audience_freshness ~ rt_audience_score, data = movies)
preds.lm <- predict(lm.fit)
preds.rlm <- predict(rlm.fit)
ggplot(movies, aes(y=audience_freshness, x=rt_audience_score)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue') +
geom_line(aes(y=preds.rlm), colour = 'firebrick')
Linear regression is pulled toward the outliers at the top left, whereas the robust linear model adjusts the importance of the observations and is less affected by them.
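The RMSE values below come from a hidden chunk; they can be reproduced with a small helper like this (a sketch, assuming the fits above):

```r
# root mean squared error between truth y and predictions yhat
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# e.g. rmse(movies$audience_freshness, preds.lm)
#      rmse(movies$audience_freshness, preds.rlm)
```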
## [1] 10.04971
## [1] 10.07345
The robust linear model underperformed here because it down-weighted the outliers, which were part of the evaluation data.
Train / Test Split
This behaviour makes RLM more powerful than LM, but since we trained and tested on the same data, it underperformed. Now let's try a train / test split:
set.seed(156)
train_inds <- sample(1:nrow(movies), floor(nrow(movies)*0.8))
train <- movies[ train_inds, ]
test <- movies[-train_inds, ]
cat('train: ', nrow(train), ', test: ', nrow(test))
## train: 318 , test: 80
lm.fit <- lm(audience_freshness ~ rt_audience_score, data=train)
rlm.fit <- rlm(audience_freshness ~ rt_audience_score, data = train)
preds.lm <- predict(lm.fit, test)
preds.rlm <- predict(rlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=rt_audience_score)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.rlm), colour = 'firebrick', alpha=.7)
This time, since the outliers didn't end up in the test set, the linear model overestimated the movies with lower audience scores. Weighting the observations to control their influence helped RLM avoid being fooled by data points seen only in the training set.
## [1] 9.509842
## [1] 9.477721
Therefore RLM performed better on the test set.
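Under the hood, MASS::rlm fits by M-estimation with a Huber-type loss: quadratic for small residuals, linear for large ones, which is exactly what tames the outliers. A sketch (k = 1.345 is the usual tuning constant):

```r
# Huber loss: behaves like squared error near 0, like absolute error in the tails
huber <- function(r, k = 1.345) {
  ifelse(abs(r) <= k, r^2 / 2, k * (abs(r) - k / 2))
}
huber(c(0, 1, 10))
```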
Locally Weighted Regression
Linear regressions are not always favourable; they are too picky about what they eat: you cannot feed them inputs and outputs that don't have a linear relationship. Consider the plot below:
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter())
Obviously the relation is not linear.
Locally Weighted Regression (LWR) is a nonlinear method that makes a very nice alternative to linear regression. It has strong advantages:
- it can track your data points very closely,
- unlike polynomial regression, you don't have to pick a degree to control the output shape,
- it is entirely data driven.
However, there are notable disadvantages:
- it is not interpretable,
- it requires a lot of data; in sparse regions it either fails to produce predictions or overfits.
In R, it is available as loess:
lm.fit <- lm(audience_freshness ~ imdb_rating, data=movies)
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = movies)
preds.lm <- predict(lm.fit)
preds.wlm <- predict(wlm.fit)
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.6, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.8) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.8)
LWR can trace the patterns in the data very well. However, the points on the left also show that it has the potential to overfit. When the sample size is large and the data is not too sparse, LWR can perform very well.
## [1] 8.72954
## [1] 7.680297
Span Parameter
The span parameter controls the window of points considered for each local fit. Points inside the window are weighted by their distance from the target point; loess uses a tri-cube weight function, so points near the edge of the window get weights close to 0. By changing the span parameter we can widen or narrow this window:
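For reference, the tri-cube kernel that loess uses for its local weights can be sketched as (d is the distance to the target point, scaled so the edge of the window sits at 1):

```r
# tri-cube kernel: weight 1 at the target point, falling smoothly to 0 at the window edge
tricube <- function(d) ifelse(abs(d) < 1, (1 - abs(d)^3)^3, 0)
tricube(c(0, 0.5, 1))
```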
library('gridExtra')
plot_loess <- function(span){
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = movies, span = span )
preds.wlm <- predict(wlm.fit)
ggplot(movies, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.4, position = position_jitter()) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=1, size=1)
}
gs <- lapply(c(0.05,0.1,0.2,0.3), plot_loess)
grid.arrange(grobs=gs, ncol=2)
Train / Test Split
Let’s see how it performs on a test set:
set.seed(12)
train_inds <- sample(1:nrow(movies), floor(nrow(movies)*0.8))
train <- movies[ train_inds, ]
test <- movies[-train_inds, ]
cat('train: ', nrow(train), ', test: ', nrow(test))
## train: 318 , test: 80
lm.fit <- lm(audience_freshness ~ imdb_rating, data=train)
wlm.fit <- loess(audience_freshness ~ imdb_rating, data = train)
preds.lm <- predict(lm.fit, test)
preds.wlm <- predict(wlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.7)
It captures the S-shape very well. The left tail is also a better fit than LM's. But it is not hard to see that it could get worse.
## [1] 11.12495
## [1] 9.560652
Let's add some more variables. Since LWR tries to follow the points, unnecessary variables can hurt its performance too:
lm.fit <- lm(audience_freshness ~ imdb_rating + length + year + adjusted, data=train)
wlm.fit <- loess(audience_freshness ~ imdb_rating + length + year + adjusted, data = train)
preds.lm <- predict(lm.fit, test)
preds.wlm <- predict(wlm.fit, test)
ggplot(test, aes(y=audience_freshness, x=imdb_rating)) +
geom_point(alpha=.8, position = position_jitter()) +
geom_line(aes(y=preds.lm), colour = 'steelblue', alpha=.7) +
geom_line(aes(y=preds.wlm), colour = 'firebrick', alpha=.7)
## [1] 11.07527
## [1] 11.36595
While linear regression reduced its RMSE on the test set, LWR performed worse than what it achieved before.
Classification with Logistic Regression
In supervised learning there are two major tasks: regression and classification. The former predicts a numeric output value, and the error is calculated from how far the estimate is from the true output. In classification we aim to predict the output's class, e.g. negative review vs. positive review, rock vs. rap, defaulted loan vs. non-defaulted.
Logistic regression is essentially a linear model, except the output is squeezed between 0 and 1 with a logistic function. This particular sigmoid shape pairs very naturally with probabilities.
\[z_i = \beta_0 + \beta_1 X_i\]
\[P(Y_i \in c) = \sigma(z_i) = \frac{1}{1+e^{-z_i}}\]
where \(c\) is a particular class, e.g. the loan belongs to the defaulted ones, and \(\sigma\) is the logistic (sigmoid) function.
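The sigmoid itself is one line of R (a quick sketch):

```r
# logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-5, 0, 5))
```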
In R, a logistic regression model can be fit using glm, for generalized linear model. We will specify family=binomial(), which implies the outputs are binary.
library('ISLR')
logreg.fit <- glm(default ~ balance,data = Default, family = binomial())
temp <- mutate(Default, default = ifelse(default == 'Yes', 1 , 0))
probs <- predict(logreg.fit,type = 'response')
ggplot(temp, aes(y=default, x=balance, colour = probs>=0.5)) +
geom_point(alpha=.8) +
geom_line(aes(y=probs), colour='black')
The curve doesn't fit the points perfectly, but that is not what we are looking for anyway. It is strictly increasing: it rises slowly at first, climbs fastest around the point where the probability equals 0.5, and then flattens out again. If the predicted probability is less than 0.5 we classify the observation as "does not belong to c", and if it is greater than 0.5 we classify it as "belongs to c".
##
## Call:
## glm(formula = default ~ balance, family = binomial(), data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
Balance seems to be significant in explaining whether the loan will default.
library('ISLR')
logreg.fit <- glm(default ~ balance + income + student,data = Default, family = binomial())
temp <- mutate(Default, default = ifelse(default == 'Yes', 1 , 0))
probs <- predict(logreg.fit,type = 'response')
ggplot(temp, aes(y=default, x=balance, colour = probs>=0.5)) +
geom_point(alpha=.8) +
geom_line(aes(y=probs), colour='black')
Evaluating Performance
MSE or RMSE are not suitable for classification; they cannot tell us much there. The most intuitive measure is accuracy, the proportion of correct predictions among all:
probs <- predict(logreg.fit, type = 'response')
preds <- ifelse(probs >= 0.5, "Yes", "No")
target <- Default$default
acc <- mean(preds == target)
acc
## [1] 0.9732
It looks like our predictions are great: the accuracy is 97.32%.
A confusion matrix can tell us more about classification performance. We will use the MLmetrics package for it:
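The table below can also be produced with base R's table(); here is a self-contained toy sketch (the labels are made up, in the tutorial the inputs would be preds and Default$default):

```r
# confusion matrix via table(): rows are the true classes, columns the predictions
y_true <- c("No", "No", "No", "Yes", "Yes")
y_pred <- c("No", "Yes", "No", "Yes", "No")
table(y_true = y_true, y_pred = y_pred)
```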
## y_pred
## y_true No Yes
## No 9627 40
## Yes 228 105
Accuracy is a good measure, but it is not smooth enough for us to optimize. In ML it is very common to use the cross-entropy loss, defined as:
\[CEloss = -\frac{1}{N}\sum_{i=1}^{N}\left[\,t_i \log y_i + (1-t_i) \log (1-y_i)\,\right]\]
where \(t_i\) is 1 if observation \(i\) belongs to the class and 0 otherwise, and \(y_i\) is the predicted probability. We will see this formula more often later on. We can calculate the CE loss as below:
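Written out directly, the formula above is only a few lines (a sketch; MLmetrics::LogLoss does the same with clamping to avoid log(0)):

```r
# binary cross-entropy: t are 0/1 targets, y are predicted probabilities
ce_loss <- function(t, y, eps = 1e-15) {
  y <- pmin(pmax(y, eps), 1 - eps)  # clamp away from 0 and 1
  -mean(t * log(y) + (1 - t) * log(1 - y))
}
ce_loss(c(1, 0), c(0.9, 0.1))
```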
library('MLmetrics')
probs <- predict(logreg.fit, type = 'response')
preds <- ifelse(probs >= 0.5, 1, 0)
target <- ifelse(Default$default == 'Yes', 1, 0)
# note: LogLoss expects predicted probabilities; passing hard 0/1 predictions
# (as here) clamps them internally and inflates the loss
CEloss <- LogLoss(preds, target)
cat('Accuracy: ', acc, "\nCE Loss: ", CEloss)
## Accuracy: 0.9732
## CE Loss: 0.9256424