Week 3: Regressions
and where to find them
Introduction
In this tutorial, you will learn how to explore basic relationships between variables. To this end, we will work with a house prices dataset and try to predict house prices from the other variables, in case you ever want to buy a house there. Overall, we will
- summarize the data and plot basic visuals,
- carry out simple and multiple linear regressions,
- use a train/test split to deal with overfitting.
Getting Started
To follow along, you will need:
tidyverse
caret
gridExtra
ggExtra
leaflet (optional)
house_data.csv
Now, load the tidyverse library and read the data. Since the dataset is large, and so as not to strain your computer, we will restrict our analysis to a subset of neighbourhoods (zipcodes):
library('tidyverse')
house <- read.csv('house_data.csv')
house <- subset(house,
zipcode %in% c('98055', '98108', '98119', '98070', '98039', '98112', '98040', '98116', '98105'))
Also, some people, such as the writer of this tutorial, don't like the default grey background theme of ggplot. Instead of adding theme_minimal() at the end of every plotting chunk, we can set it as the default once.
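The chunk that does this is not shown in the text; presumably it was a single call to ggplot2's theme_set():
theme_set(theme_minimal())  # make theme_minimal the default theme for every subsequent plot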
Getting Familiar with the Data
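The chunks for this section are not echoed, but the output below is what dim() and str() produce, so presumably:
dim(house)   # number of rows and columns
str(house)   # structure: type and first few values of each column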
## [1] 1916 21
## 'data.frame': 1916 obs. of 21 variables:
## $ id : num 2.52e+09 1.79e+09 3.30e+09 9.27e+09 8.22e+08 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 117 56 213 180 305 297 11 345 197 130 ...
## $ price : num 2000000 937000 667000 685000 1350000 920000 885000 480000 940000 905000 ...
## $ bedrooms : int 3 3 3 3 3 5 4 3 3 4 ...
## $ bathrooms : num 2.75 1.75 1 1 2.5 2.25 2.5 1 1.5 2.5 ...
## $ sqft_living : int 3050 2450 1400 1570 2753 2730 2830 1040 2140 3300 ...
## $ sqft_lot : int 44867 2691 1581 2280 65005 6000 5000 5060 3600 10250 ...
## $ floors : num 1 2 1.5 2 1 1.5 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 1 0 0 0 0 0 ...
## $ view : int 4 0 0 0 2 0 0 0 0 0 ...
## $ condition : int 3 3 5 3 5 3 3 3 3 3 ...
## $ grade : int 9 8 8 7 9 8 9 7 9 7 ...
## $ sqft_above : int 2330 1750 1400 1570 2165 2130 2830 1040 1900 2390 ...
## $ sqft_basement: int 720 700 0 0 588 600 0 0 240 910 ...
## $ yr_built : int 1968 1915 1909 1922 1953 1927 1995 1941 1925 1946 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 1991 ...
## $ zipcode : int 98040 98119 98112 98119 98070 98105 98105 98116 98119 98040 ...
## $ lat : num 47.5 47.6 47.6 47.6 47.4 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 4110 1760 1860 1580 2680 2730 1950 890 2020 1950 ...
## $ sqft_lot15 : int 20336 3573 3861 2640 72513 6000 5000 5060 4800 6045 ...
The data contains 1916 houses and 21 variables: an id plus 20 columns of information about each house.
library('gridExtra')
g1 <- ggplot(house, aes(x=price)) +
geom_histogram(colour='white')
g2 <- ggplot(house, aes(x=sqft_living)) +
geom_histogram(colour='white')
g3 <- ggplot(house, aes(x=bedrooms)) +
geom_histogram(colour='white')
g4 <- ggplot(house, aes(x=bathrooms)) +
geom_histogram(colour='white')
grid.arrange(g1,g2,g3,g4, ncol=2)
Price, bedrooms and sqft_living are right-skewed. The histograms are squeezed to the left with a lot of empty space on the right, which indicates some outlying observations that we might consider removing.
If we remove the houses that are more than $4m, the price distribution looks like:
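The plotting chunk is hidden; a minimal sketch of it, assuming a simple price filter, would be:
ggplot(subset(house, price < 4e6), aes(x=price)) +
  geom_histogram(colour='white')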
Now we can see the details better: most houses are concentrated between $0 and $1m.
library('ggExtra')
g <- ggplot(house, aes(y=price, x=sqft_living)) +
geom_point(alpha=.3)
ggMarginal(g, type='histogram')
An important note about Categorical Variables
Categorical variables seem easy to spot. For example, a city name can be entered as a string, e.g. “Waterloo”, “Toronto”, etc. But what if not? In our data, zipcode is such a variable:
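The hidden chunk presumably just checks the column's class:
class(house$zipcode)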
## [1] "integer"
In the US, unlike Canada or England, zip codes are written entirely in numbers. If you are not aware of this, your analysis may be ruined. For example, if you forget to convert, the plot will be junk:
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
Zipcode is treated as a number, so the plot is nonsense.
Now let's do it again after converting zipcode to a factor using as.factor:
house$zipcode <- as.factor(house$zipcode)
ggplot(house, aes(x=price, y=zipcode, colour=zipcode)) +
geom_boxplot()
So, before your analysis, it is best to convert such variables into factors. We will see how useful this is when fitting regressions.
Distribution in different regions
Basic Stats
house %>%
select(price:sqft_living, zipcode) %>%
group_by(zipcode) %>%
summarise(Price_median = median(price),
LivingSp_median = median(sqft_living),
Bedrooms_median = median(bedrooms),
Bathrooms_median = median(bathrooms))
## # A tibble: 9 x 5
## zipcode Price_median LivingSp_median Bedrooms_median Bathrooms_median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 98039 1892500 3560 4 3
## 2 98040 993750 3020 4 2.5
## 3 98055 294950 1705 3 2
## 4 98070 463750 1875 3 1.75
## 5 98105 675000 1910 3 2
## 6 98108 342500 1645 3 1.75
## 7 98112 915000 2330 3 2.25
## 8 98116 562750 1735 3 2
## 9 98119 744975 1800 3 2.25
Something Fancy?
Well, there are many fancy things in this course that I will skip over. But not today:
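The "fancy" part is not included in this text, but leaflet was listed among the (optional) packages and the data has lat/long columns, so it was most likely an interactive map. A minimal sketch, not necessarily the author's exact code:
library('leaflet')
leaflet(house) %>%
  addTiles() %>%                              # base map tiles
  addCircleMarkers(lng = ~long, lat = ~lat,   # one marker per house
                   radius = 3, stroke = FALSE, fillOpacity = 0.4)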
Linear Regression
In R, running a linear regression is like having a knife in the kitchen: it is a central topic in statistics, and R is a statistical programming language. The basic syntax is lm(Y ~ X1 + X2 + X3, data=data_name) for
\[Y \sim \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\]
For example, to estimate the model below,
\[HP_i = \beta_0 + \beta_1 LivSpc_i+\epsilon_i\]
we run
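(This chunk is hidden in the original; the code is reconstructed from the Call line in the output below.)
fit1 <- lm(price ~ sqft_living, data = house)
fit1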
##
## Call:
## lm(formula = price ~ sqft_living, data = house)
##
## Coefficients:
## (Intercept) sqft_living
## -211827.5 451.3
The resulting fit1 object contains much more than what was just printed; if you are curious, you can run str(fit1). We can extract the main results using summary:
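(The chunk is hidden; summary() on the fitted object produces the output below.)
summary(fit1)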
##
## Call:
## lm(formula = price ~ sqft_living, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1327013 -198067 7035 165947 2632993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.118e+05 1.776e+04 -11.93 <2e-16 ***
## sqft_living 4.513e+02 7.241e+00 62.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 339100 on 1914 degrees of freedom
## Multiple R-squared: 0.67, Adjusted R-squared: 0.6698
## F-statistic: 3885 on 1 and 1914 DF, p-value: < 2.2e-16
The results look significant. The p-value for the sqft_living coefficient is essentially zero, so the hypothesis \(H_0:\beta_1=0\) can be rejected confidently. The F-statistic also shows that the model as a whole is significant (note that with a single predictor, the F-test is equivalent to the t-test on that predictor, since \(F = t^2\)). Lastly, the \(R^2\) estimate is 0.67. Not bad, but it could use some improvement.
The output above takes up a lot of space. Instead, we can focus only on the coefficients:
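The hidden chunk presumably extracts the coefficient table from the summary object:
summary(fit1)$coefficients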
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -211827.4523 17760.331516 -11.92700 1.101625e-31
## sqft_living 451.3483 7.241161 62.33093 0.000000e+00
or sometimes you simply need:
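That is, just the fitted coefficients, which coef() returns (the chunk is hidden):
coef(fit1)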
## (Intercept) sqft_living
## -211827.4523 451.3483
The model predicts the target values as \[\hat Y = \widehat{HP} = -2.118 \cdot 10^5 + 4.513 \cdot 10^2 \, LivSpc\]
In ML, people are less interested in old-school statistical traditions such as \(R^2\). What matters more is how close your predictions \(\hat y\) are to the target \(y\); in other words, what is the (Root) Mean Squared Error:
\[MSE = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2\]
where \(p\) is the number of estimated coefficients in the regression. Here we have one variable, sqft_living, plus a constant, so \(p = 2\):
preds <- predict(fit1)
target <- house$price
N <- nrow(house)
MSE <- 1/(N-2) * sum((preds - target)^2)
RMSE <- sqrt(MSE)
cat('MSE: ', MSE, ", RMSE: ", RMSE)
## MSE: 114999214257 , RMSE: 339115.3
Or, you can use the built-in function:
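Given that sigma() is used later for exactly this purpose, the hidden chunk was presumably:
sigma(fit1)  # residual standard error, i.e. RMSE with the n - p denominator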
## [1] 339115.3
There are other statistics that may be of interest, for example confidence intervals for the coefficients:
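The output below is the standard 95% confidence interval table, presumably produced with confint():
confint(fit1)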
## 2.5 % 97.5 %
## (Intercept) -246659.0888 -176995.8158
## sqft_living 437.1469 465.5497
A coefficient is reported as a single number, but that estimate comes from one sample; for another sample it could be higher or lower. Excluding the extreme cases, the sqft_living coefficient is at least 437.1 and at most 465.5 (a 95% confidence interval).
Plot
Fitting a Line
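The plot for this subsection is not shown in the text; one way to draw the fitted line from fit1 over the scatterplot is with geom_abline() (a sketch, not necessarily the author's exact code):
ggplot(house, aes(x=sqft_living, y=price)) +
  geom_point(alpha=.3, size=2) +
  geom_abline(intercept = coef(fit1)[1], slope = coef(fit1)[2], colour = 'blue')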
Using Smoothers
There is a class of methods called smoothers; loess is one example, gam is another. A smoother fits a curve that filters out the noise and summarizes the relationship:
ggplot(house, aes(x=sqft_living, y=price)) +
geom_point(alpha=.3, size=2) +
geom_smooth(colour="blue")
If we change the method to lm and remove the error band using se=FALSE, we get the familiar straight regression line instead:
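The chunk is hidden; it presumably looked like:
ggplot(house, aes(x=sqft_living, y=price)) +
  geom_point(alpha=.3, size=2) +
  geom_smooth(method='lm', se=FALSE, colour='blue')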
Multiple Linear Regression
We already have an idea of how a linear model works. Although the underlying mathematics gets more involved, multiple regression in R is just as straightforward. Assume the following model:
\[HP_i \sim \beta_0 + \beta_1 LivSpc_i + \beta_2 Bedr_i + \beta_3 Bathr_i +\epsilon_i\]
We can estimate it as
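(Again, the chunk is hidden, but fit2 with exactly these regressors is reused later in the tutorial, so the call was:)
fit2 <- lm(price ~ sqft_living + bedrooms + bathrooms, data = house)
summary(fit2)$coefficients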
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68638.4700 26640.80212 -2.5764416 1.005672e-02
## sqft_living 494.3907 12.10527 40.8409477 1.044581e-262
## bedrooms -69246.0167 9090.89758 -7.6170715 4.054587e-14
## bathrooms -1481.7560 13694.63084 -0.1081998 9.138486e-01
The number of bedrooms and the living space seem to be significant factors in explaining the house price, but the number of bathrooms doesn't appear to tell us much.
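Comparing the in-sample RMSEs of the two fits with sigma() (the chunk is hidden; the two outputs follow):
sigma(fit1)  # first value below
sigma(fit2)  # second value below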
## [1] 339115.3
## [1] 334135.7
By adding the new variables, we improved the RMSE score.
Interacting Variables
Sometimes it is enough to assume the variables act independently of each other: the year a house was built affects its value (older is better here), and waterfront houses fetch higher prices. But what if being a new waterfront house has an extra positive impact? Treating the variables as independent hides this, so we add an interaction term:
fit3 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront, data = house)
summary(fit3)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3816588.1195 4.293585e+05 8.889047 1.392548e-18
## sqft_living 498.7035 8.947064e+00 55.739339 0.000000e+00
## bedrooms -67841.4362 8.780888e+03 -7.726034 1.778571e-14
## yr_built -1999.1940 2.194745e+02 -9.109002 2.037944e-19
## waterfront 432027.3467 5.142860e+04 8.400527 8.547379e-17
fit4 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront + yr_built:waterfront, data = house)
summary(fit4)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988551e+06 4.292896e+05 9.291049 4.025750e-20
## sqft_living 4.955824e+02 8.936217e+00 55.457737 0.000000e+00
## bedrooms -6.701360e+04 8.743275e+03 -7.664588 2.835379e-14
## yr_built -2.085013e+03 2.193946e+02 -9.503487 5.847845e-21
## waterfront -1.582732e+07 3.789181e+06 -4.176978 3.087142e-05
## yr_built:waterfront 8.320961e+03 1.938992e+03 4.291386 1.864443e-05
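As before, the in-sample RMSEs of the two models (presumably computed with sigma(); the two outputs follow):
sigma(fit3)  # first value below
sigma(fit4)  # second value below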
## [1] 321015
## [1] 319562.2
library('plotly')
preds <- predict(fit4)
g1 <- ggplot(house, aes(x=sqft_living, y=price)) +
geom_point(alpha=.3, size=2) +
geom_line(aes(y=preds), colour="blue")
ggplotly(g1)
This is a linear regression, so the fit must be linear. With multiple linear regression, however, we no longer have a line but a plane (or hyperplane). The blue line above simply connects the predictions, so it doesn't show exactly how the model behaves, but it is still informative.
Categorical vs. Numerical Variables
Categorical variables can also be included in regressions. They enter as dummy variables that take the value 1 if the observation belongs to a particular category and 0 otherwise.
As discussed earlier, some variables are read as numeric even though they are categorical. This is a serious issue that you must be aware of. Here is why: we convert zipcode back to numeric and carry out the regression:
house$zipcode <- as.numeric(as.character(house$zipcode))
fit5 <- lm(price ~ sqft_living + bedrooms + zipcode, data = house)
summary(fit5)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.626411e+08 2.480767e+07 -6.556083 7.085855e-11
## sqft_living 5.054627e+02 9.074986e+00 55.698448 0.000000e+00
## bedrooms -6.881119e+04 8.899177e+03 -7.732309 1.695156e-14
## zipcode 1.657099e+03 2.528654e+02 6.553285 7.216985e-11
Seems nice: as the zipcode increases by one, the house price increases by 1.657e+03, and the effect is significant. What does that mean? Nothing sensible, since a zipcode is a label rather than a quantity; it means you lose marks today and money in the future.
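Its in-sample RMSE (presumably sigma() again):
sigma(fit5)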
## [1] 330446.2
Worse performance than the previous model.
If we convert it back to a factor, here is how it is treated:
house$zipcode <- as.factor(house$zipcode)
fit5 <- lm(price ~ sqft_living + bedrooms + zipcode, data = house)
summary(fit5)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 828134.6378 47123.635137 17.573658 3.448743e-64
## sqft_living 418.7334 7.941101 52.729887 0.000000e+00
## bedrooms -63815.7027 7341.107123 -8.692926 7.473798e-18
## zipcode98040 -677766.2630 41407.194109 -16.368321 1.906721e-56
## zipcode98055 -1065439.9178 43387.416190 -24.556427 6.430236e-116
## zipcode98070 -952223.7374 46889.910920 -20.307647 3.733822e-83
## zipcode98105 -637317.1691 43320.379851 -14.711717 1.741128e-46
## zipcode98108 -981754.2269 44896.558837 -21.867026 9.336389e-95
## zipcode98112 -554994.5701 42120.602934 -13.176321 5.231895e-38
## zipcode98116 -789866.7428 42594.893077 -18.543696 1.053513e-70
## zipcode98119 -610409.5284 44322.553465 -13.771985 3.304207e-41
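The in-sample RMSE of this model (presumably sigma() again):
sigma(fit5)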
## [1] 267554.8
Being in a particular zipcode affects the house price significantly, and the RMSE improves considerably compared to what we had before.
We can also let it interact with other variables:
house$zipcode <- as.factor(house$zipcode)
fit6 <- lm(price ~ sqft_living + bedrooms + zipcode + zipcode:sqft_living, data = house)
summary(fit6)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65673.2641 83921.61731 -0.7825548 4.339863e-01
## sqft_living 634.1404 19.52976 32.4704584 2.635553e-184
## bedrooms -45326.1908 6790.77597 -6.6746703 3.242837e-11
## zipcode98040 129174.5264 90768.56655 1.4231196 1.548659e-01
## zipcode98055 243277.3593 92621.94675 2.6265628 8.694621e-03
## zipcode98070 312633.4600 101153.64370 3.0906792 2.026013e-03
## zipcode98105 101495.8254 90628.63447 1.1199090 2.628943e-01
## zipcode98108 289619.3898 97035.28271 2.9846813 2.875027e-03
## zipcode98112 113891.5210 89219.95288 1.2765252 2.019262e-01
## zipcode98116 271366.7670 88699.33779 3.0594001 2.248851e-03
## zipcode98119 174512.7297 94177.71753 1.8530151 6.403540e-02
## sqft_living:zipcode98040 -211.4208 22.92419 -9.2226074 7.480072e-20
## sqft_living:zipcode98055 -481.2920 30.26186 -15.9042412 1.463507e-53
## sqft_living:zipcode98070 -439.8288 35.28370 -12.4654954 2.449992e-34
## sqft_living:zipcode98105 -174.1214 25.84959 -6.7359424 2.152335e-11
## sqft_living:zipcode98108 -471.6581 34.93481 -13.5010930 9.898978e-40
## sqft_living:zipcode98112 -151.3601 23.58108 -6.4187116 1.731707e-10
## sqft_living:zipcode98116 -336.3363 26.09386 -12.8894842 1.674694e-36
## sqft_living:zipcode98119 -191.1781 29.16052 -6.5560600 7.100691e-11
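And its in-sample RMSE (presumably sigma() again):
sigma(fit6)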
## [1] 240174
# sapply takes a list (or vector) of objects (here, fitted models)
# and applies a function to each; the results are returned as a vector.
sapply(list(fit1,fit2,fit3,fit4,fit5,fit6), sigma)
## [1] 339115.3 334135.7 321015.0 319562.2 267554.8 240174.0
Regression Assumptions and Transformation
For a linear regression model to be valid, it must satisfy some criteria. These include:
- Linear relationship between input and output
- Homoskedasticity (constant variance of the residuals)
- No or little multicollinearity
- Normal and independent residuals
In ML, we often overlook some of these assumptions because, as the data gets bigger, multicollinearity and independence problems tend to resolve themselves. Nevertheless, the relationship must be roughly linear, and it is best to have approximately normally distributed inputs and outputs.
This can be achieved by transforming the variables, either via log or sqrt transformations, or via something of the form X^n:
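The transformation chunks are not shown; a minimal sketch of a log-log specification (fit_log is a hypothetical name used only for illustration):
fit_log <- lm(log(price) ~ log(sqft_living), data = house)  # regress log price on log living space
summary(fit_log)$coefficients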
Polynomial Regression
Well, this would be a topic of its own in a statistics course, but for our purposes it is not a big deal. Sometimes the input variable X doesn't influence Y linearly but polynomially. For example, the model could be:
\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^2 + \epsilon_i\]
## [1] 328956.3
\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^1 + \beta_2 LivSpc_i^2 + \epsilon_i\]
## [1] 335170
\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^1 + \beta_2 LivSpc_i^2 + \beta_3 LivSpc_i^3 + \epsilon_i\]
## [1] 321515.5
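In R, such terms can be added with I() or poly(). For instance, the cubic specification above could be fitted as follows (fit_cubic is a hypothetical name; poly() uses an orthogonal parameterization that gives the same fitted values as raw powers):
fit_cubic <- lm(price ~ poly(sqft_living, 3), data = house)
sigma(fit_cubic)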
Overfitting
To avoid running the same code and naming the results a dozen times, we will use the lapply and sapply functions. You don't have to know these.
First, we'll define a function that takes the degree of the polynomial regression and draws the fitted curve on the scatterplot:
plot_poly <- function(p){
fit <- lm(price ~ poly(sqft_living,p), data = house)
preds <- predict(fit)
RMSE <- sigma(fit)
ggplot(house, aes(y=price, x=sqft_living)) +
geom_point(alpha=.3) +
geom_line(aes(y=preds), colour = 'blue') +
ggtitle(paste0('RMSE: ',round(RMSE)))
}
Now, we apply degrees from 1 to 4 and plot:
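The chunk is hidden; one way to do it with lapply() and grid.arrange() (a sketch):
plots <- lapply(1:4, plot_poly)          # one ggplot per polynomial degree
grid.arrange(grobs = plots, ncol = 2)    # arrange the four panels in a 2x2 grid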
It seems that increasing the degree helps the curve follow the output values more closely.
What about this:
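Presumably a much higher-degree polynomial, for example:
plot_poly(13)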
If we increase the complexity of the model, it follows this particular data ridiculously well, and as a result it will not adapt to any other data we feed it.
If we apply the model to other neighbourhoods, this is what will happen:
poly13 <- lm(price ~ poly(sqft_living,13), data = house)
temp <- read.csv('house_data.csv') %>%
subset(zipcode %in% c('98053', '98103', '98115')) %>%
subset(price < 2e6)
preds <- predict(poly13, newdata=temp)
ggplot(temp, aes(y=price, x=sqft_living)) +
geom_point(alpha=.3) +
geom_line(aes(y=preds), colour = 'blue')
Like the first time you learned to draw.
Train / Test Split
The solution to this problem is to evaluate the model on a set that is completely separate from the set on which you train it. This is called a train/test split.
Let’s split the data into train and test with 80/20 split ratio:
set.seed(156) # Set seed is needed if we want
# to get the same random numbers always
train_size <- floor(0.8 * nrow(house))
train_inds <- sample(1:nrow(house), size = train_size)
test_inds <- setdiff(1:nrow(house), train_inds)
train <- house[ train_inds , ]
test <- house[ test_inds , ]
cat('train size:', nrow(train), '\ntest size:', nrow(test))
## train size: 1532 
## test size: 384
Choosing the best model
Now, let's train on the train set and calculate the RMSE loss on the test set. The sigma function computes the RMSE only on the data the model was fitted to; to measure the error between predictions and actual values on the test set, we need another function, caret's RMSE.
library('caret')
fit1 <- lm(price ~ sqft_living, data = train)
fit2 <- lm(price ~ sqft_living + bedrooms + bathrooms, data = train)
fit3 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront, data = train)
fit4 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront + yr_built:waterfront, data = train)
fit5 <- lm(price ~ sqft_living + bedrooms + zipcode, data = train)
fit6 <- lm(price ~ sqft_living + bedrooms + zipcode + zipcode:sqft_living, data = train)
pred1 <- predict(fit1, newdata=test)
pred2 <- predict(fit2, newdata=test)
pred3 <- predict(fit3, newdata=test)
pred4 <- predict(fit4, newdata=test)
pred5 <- predict(fit5, newdata=test)
pred6 <- predict(fit6, newdata=test)
rmse1 <- RMSE(pred1, test$price)
rmse2 <- RMSE(pred2, test$price)
rmse3 <- RMSE(pred3, test$price)
rmse4 <- RMSE(pred4, test$price)
rmse5 <- RMSE(pred5, test$price)
rmse6 <- RMSE(pred6, test$price)
rmses <- c(rmse1,rmse2,rmse3,rmse4,rmse5,rmse6)
rmses
## [1] 263061.9 262055.0 245087.3 242402.1 194223.9 172347.6
# OR
# preds <- lapply(list(fit1,fit2,fit3,fit4,fit5,fit6), predict, newdata = test)
# rmses <- sapply(preds, function(p) RMSE(p, test$price))
The last model seems the best.