Introduction

In this tutorial, you will learn how to explore basic relationships between variables. To this end, we will work with a house prices dataset and try to predict house prices from the other variables, in case you ever want to buy a house there. Overall, we will

  • summarize the data and plot basic visuals,
  • carry out simple and multiple linear regressions,
  • use a train/test split to deal with overfitting.

Getting Started

To follow along, you will need

  • tidyverse
  • caret
  • gridExtra
  • ggExtra
  • leaflet (optional)
  • house_data.csv

Now, load the tidyverse library and read the data. Since the full dataset is large, and to avoid straining your computer, we will restrict our analysis to a handful of neighbourhoods:
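The chunk itself is not rendered here; a minimal sketch, assuming the file sits in the working directory and using the nine zipcodes that appear in the median table later in this tutorial:

```r
library(tidyverse)

# Read the full data and keep only the neighbourhoods used in this tutorial
house <- read.csv("house_data.csv") %>%
  filter(zipcode %in% c(98039, 98040, 98055, 98070, 98105,
                        98108, 98112, 98116, 98119))
```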

Also, some people, such as the writer of this tutorial, don’t like the default grey background theme of ggplot. Instead of adding theme_minimal at the end of every plotting chunk, we can set it as the default:
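```r
theme_set(theme_minimal())  # applies to all subsequent ggplots
```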

Getting Familiar with the Data
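To get a first look at the data, check its dimensions and structure; the output below presumably comes from dim() and str():

```r
dim(house)  # number of rows and columns
str(house)  # type and a preview of each variable
```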

## [1] 1916   21
## 'data.frame':    1916 obs. of  21 variables:
##  $ id           : num  2.52e+09 1.79e+09 3.30e+09 9.27e+09 8.22e+08 ...
##  $ date         : Factor w/ 372 levels "20140502T000000",..: 117 56 213 180 305 297 11 345 197 130 ...
##  $ price        : num  2000000 937000 667000 685000 1350000 920000 885000 480000 940000 905000 ...
##  $ bedrooms     : int  3 3 3 3 3 5 4 3 3 4 ...
##  $ bathrooms    : num  2.75 1.75 1 1 2.5 2.25 2.5 1 1.5 2.5 ...
##  $ sqft_living  : int  3050 2450 1400 1570 2753 2730 2830 1040 2140 3300 ...
##  $ sqft_lot     : int  44867 2691 1581 2280 65005 6000 5000 5060 3600 10250 ...
##  $ floors       : num  1 2 1.5 2 1 1.5 2 1 2 1 ...
##  $ waterfront   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ view         : int  4 0 0 0 2 0 0 0 0 0 ...
##  $ condition    : int  3 3 5 3 5 3 3 3 3 3 ...
##  $ grade        : int  9 8 8 7 9 8 9 7 9 7 ...
##  $ sqft_above   : int  2330 1750 1400 1570 2165 2130 2830 1040 1900 2390 ...
##  $ sqft_basement: int  720 700 0 0 588 600 0 0 240 910 ...
##  $ yr_built     : int  1968 1915 1909 1922 1953 1927 1995 1941 1925 1946 ...
##  $ yr_renovated : int  0 0 0 0 0 0 0 0 0 1991 ...
##  $ zipcode      : int  98040 98119 98112 98119 98070 98105 98105 98116 98119 98040 ...
##  $ lat          : num  47.5 47.6 47.6 47.6 47.4 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  4110 1760 1860 1580 2680 2730 1950 890 2020 1950 ...
##  $ sqft_lot15   : int  20336 3573 3861 2640 72513 6000 5000 5060 4800 6045 ...

The data we have contains 21 columns of information on 1,916 houses.
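The histograms discussed below can be drawn side by side with gridExtra; a sketch, with the bin count my own choice:

```r
library(gridExtra)

h_price  <- ggplot(house, aes(price))       + geom_histogram(bins = 30)
h_bedr   <- ggplot(house, aes(bedrooms))    + geom_histogram(bins = 30)
h_living <- ggplot(house, aes(sqft_living)) + geom_histogram(bins = 30)

grid.arrange(h_price, h_bedr, h_living, nrow = 1)
```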

The variables price, bedrooms and sqft_living are right-skewed: the mass of each histogram is piled up on the left, with much empty space to the right. This indicates some outlying observations that we might consider removing.

If we remove the houses that cost more than $4M, the price distribution looks like this:
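One way, filtering before plotting:

```r
house %>%
  filter(price <= 4e6) %>%
  ggplot(aes(price)) +
  geom_histogram(bins = 30)
```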

Now we can see the details better: most houses are concentrated between $0 and $1M.

An Important Note About Categorical Variables

Categorical variables seem easy to spot. For example, a city name is entered as a string, e.g. “Waterloo”, “Toronto”, etc. But what if it isn’t? In our data, zipcode is such a variable:
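Its type can be checked with class():

```r
class(house$zipcode)
```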

## [1] "integer"

In the US, unlike Canada or England, zip codes are written as numbers. If you are not aware of this fact, your analysis may go badly wrong. For example, if you forget to convert, your plot will be junk:

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Zipcode is treated as a number, so the plot is nonsense.

Now let’s do it after converting to factors using as.factor:
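A sketch, assuming the plot is a boxplot of price by zipcode (the warning above about a continuous x aesthetic is typical of geom_boxplot):

```r
ggplot(house, aes(x = as.factor(zipcode), y = price)) +
  geom_boxplot()
```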

So, before your analysis, it is best to convert such variables into factors. We will see how useful this is when fitting regressions.
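For example, overwriting the column in place:

```r
house$zipcode <- as.factor(house$zipcode)
```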

Distribution in Different Regions

Basic Stats
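The medians below were presumably computed with group_by() and summarise(); a sketch matching the column names in the output:

```r
house %>%
  group_by(zipcode) %>%
  summarise(Price_median     = median(price),
            LivingSp_median  = median(sqft_living),
            Bedrooms_median  = median(bedrooms),
            Bathrooms_median = median(bathrooms))
```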

## # A tibble: 9 x 5
##   zipcode Price_median LivingSp_median Bedrooms_median Bathrooms_median
##   <fct>          <dbl>           <dbl>           <dbl>            <dbl>
## 1 98039        1892500            3560               4             3   
## 2 98040         993750            3020               4             2.5 
## 3 98055         294950            1705               3             2   
## 4 98070         463750            1875               3             1.75
## 5 98105         675000            1910               3             2   
## 6 98108         342500            1645               3             1.75
## 7 98112         915000            2330               3             2.25
## 8 98116         562750            1735               3             2   
## 9 98119         744975            1800               3             2.25

Something Fancy?

Well, there are many fancy things in this course that I will skip over. But not today:
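The fancy part is presumably an interactive map: leaflet is in the requirements, and the data has lat/long columns. A minimal sketch, with the marker styling my own choice:

```r
library(leaflet)

leaflet(house) %>%
  addTiles() %>%                      # base map tiles
  addCircleMarkers(lng = ~long, lat = ~lat, radius = 2,
                   label = ~paste0("$", price))
```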

Linear Regression

In R, linear regression is like having a knife in the kitchen: basic and indispensable. It is a central topic in statistics, and R is a statistical programming language. The basic syntax is lm(Y ~ X1 + X2 + X3, data = data_name) for

\[Y \sim \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\]

For example, to estimate the model below,

\[HP_i = \beta_0 + \beta_1 LivSpc_i+\epsilon_i\]

we run
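Judging from the Call line in the output, the chunk was:

```r
fit <- lm(price ~ sqft_living, data = house)
fit
```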

## 
## Call:
## lm(formula = price ~ sqft_living, data = house)
## 
## Coefficients:
## (Intercept)  sqft_living  
##   -211827.5        451.3

The resulting fit is an object that contains much more than what was just printed; if you are curious, run str(fit). We can extract the essential results using summary:
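```r
summary(fit)
```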

## 
## Call:
## lm(formula = price ~ sqft_living, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1327013  -198067     7035   165947  2632993 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.118e+05  1.776e+04  -11.93   <2e-16 ***
## sqft_living  4.513e+02  7.241e+00   62.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339100 on 1914 degrees of freedom
## Multiple R-squared:   0.67,  Adjusted R-squared:  0.6698 
## F-statistic:  3885 on 1 and 1914 DF,  p-value: < 2.2e-16

The results look significant. The p-value for the sqft_living coefficient is tiny, so the hypothesis \(H_0:\beta_1=0\) can be rejected confidently. The F-statistic also shows that the model as a whole is significant (note that with one variable, the F-statistic is the square of that variable’s t-statistic: \(62.33^2 \approx 3885\)). Lastly, the \(R^2\) estimate is 0.67. Decent, but there is room for improvement.

The output above takes up a lot of space. Instead, we can focus on the coefficients only:
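That is the coefficients element of the summary object:

```r
summary(fit)$coefficients
```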

##                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) -211827.4523 17760.331516 -11.92700 1.101625e-31
## sqft_living     451.3483     7.241161  62.33093 0.000000e+00

or sometimes you simply need the coefficient vector:
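```r
coef(fit)  # same as fit$coefficients
```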

##  (Intercept)  sqft_living 
## -211827.4523     451.3483

The model predicts the target values as \[\hat Y = \widehat{HP} = -2.118 \cdot 10^5 + 4.513 \cdot 10^2 \, LivSpc\]

In ML, people are less interested in old-school statistical traditions such as \(R^2\). What matters more is how close your predictions \(\hat y\) are to the target \(y\); in other words, the (Root) Mean Squared Error:

\[MSE = \frac{1}{n-p}\sum_{i=1}^n e_i^2, \qquad e_i = y_i - \hat y_i\]

where p is the number of estimated coefficients in your regression. Here we have one variable, sqft_living, plus a constant term, so p is 2:
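A sketch of the computation (the exact chunk is not shown; the variable names are my own):

```r
e   <- residuals(fit)      # y - y_hat
n   <- length(e)           # 1916 observations
p   <- 2                   # intercept + sqft_living
mse <- sum(e^2) / (n - p)

cat("MSE: ", mse, ", RMSE: ", sqrt(mse))
```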

## MSE:  114999214257 , RMSE:  339115.3

Or, you can use a built-in function; for example sigma(), which returns the residual standard error, exactly this RMSE:
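```r
sigma(fit)
```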

## [1] 339115.3

There are other statistics that may be of interest, such as confidence intervals for the coefficients:
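```r
confint(fit)  # 95% intervals by default
```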

##                    2.5 %       97.5 %
## (Intercept) -246659.0888 -176995.8158
## sqft_living     437.1469     465.5497

A coefficient is estimated as a single number, but that number is derived from one sample; for another sample it could be higher or lower. Excluding the extreme cases, the 95% confidence interval says the sqft_living coefficient is at least 437.1 and at most 465.5.

Plot

Using Smoothers

There is a family of methods called smoothers; loess is one example, gam is another. A smoother fits a curve that removes the noise and summarizes the relationship:

If we change the method to lm and remove the error band using se = F:
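The two plots were presumably built along these lines (point transparency is my own choice):

```r
library(gridExtra)

p_loess <- ggplot(house, aes(sqft_living, price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess")           # loess smoother with error band

p_lm <- ggplot(house, aes(sqft_living, price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)  # straight line, no band

grid.arrange(p_loess, p_lm, ncol = 2)
```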

Multiple Linear Regression

We already have an idea of how the linear model works. Although the mathematics gets more involved, multiple regression in R is straightforward. Assume the following model:

\[HP_i \sim \beta_0 + \beta_1 LivSpc_i + \beta_2 Bedr_i + \beta_3 Bathr_i +\epsilon_i\]

We can estimate it as
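A sketch, with the object name my own choice:

```r
fit2 <- lm(price ~ sqft_living + bedrooms + bathrooms, data = house)
summary(fit2)$coefficients
```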

##                Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept) -68638.4700 26640.80212 -2.5764416  1.005672e-02
## sqft_living    494.3907    12.10527 40.8409477 1.044581e-262
## bedrooms    -69246.0167  9090.89758 -7.6170715  4.054587e-14
## bathrooms    -1481.7560 13694.63084 -0.1081998  9.138486e-01

Living space and the number of bedrooms seem to be significant factors in explaining the house price, but the number of bathrooms doesn’t seem to tell us much.
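The two RMSE values below presumably compare the simple and the multiple fits:

```r
sigma(fit)   # simple regression
sigma(fit2)  # multiple regression
```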

## [1] 339115.3
## [1] 334135.7

By adding the new variables, we improved the RMSE score.

Interacting Variables

Sometimes assuming the variables act independently of each other is enough: the year the house was built affects its value (here, older turns out to be better), and a waterfront location raises the price. But what if being both new and waterfront has an extra positive impact? Treating the variables as independent will hide this:
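A sketch of the two fits whose coefficients are printed below (the names are my own; the second adds the interaction with ':'):

```r
# Main effects only
fit3 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront,
           data = house)
summary(fit3)$coefficients

# Same model plus the yr_built-waterfront interaction
fit4 <- lm(price ~ sqft_living + bedrooms + yr_built + waterfront +
             yr_built:waterfront, data = house)
summary(fit4)$coefficients

sigma(fit3)
sigma(fit4)
```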

##                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 3816588.1195 4.293585e+05  8.889047 1.392548e-18
## sqft_living     498.7035 8.947064e+00 55.739339 0.000000e+00
## bedrooms     -67841.4362 8.780888e+03 -7.726034 1.778571e-14
## yr_built      -1999.1940 2.194745e+02 -9.109002 2.037944e-19
## waterfront   432027.3467 5.142860e+04  8.400527 8.547379e-17
##                          Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)          3.988551e+06 4.292896e+05  9.291049 4.025750e-20
## sqft_living          4.955824e+02 8.936217e+00 55.457737 0.000000e+00
## bedrooms            -6.701360e+04 8.743275e+03 -7.664588 2.835379e-14
## yr_built            -2.085013e+03 2.193946e+02 -9.503487 5.847845e-21
## waterfront          -1.582732e+07 3.789181e+06 -4.176978 3.087142e-05
## yr_built:waterfront  8.320961e+03 1.938992e+03  4.291386 1.864443e-05
## [1] 321015
## [1] 319562.2

This is a linear regression, so the fit must be linear. But with multiple linear regression we no longer have a line; we have a plane or hyperplane. The blue line above merely connects the predictions, so it doesn’t show exactly how the model behaves, but it still tells us something.

Categorical vs. Numerical Variables

Categorical variables can also be included in regressions. They enter as dummy variables, which take the value 1 if the observation belongs to that particular category and 0 otherwise.

As discussed earlier, some variables are read as numeric even though they are categorical. This is a serious issue that you must be aware of. Here is why. We convert zipcode back to numeric and carry out the regression:
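A sketch, assuming zipcode is coerced back to numbers (note the as.character() step, which avoids getting the factor level indices instead of the actual zip codes):

```r
house_num <- house %>%
  mutate(zipcode = as.numeric(as.character(zipcode)))

fit_zip_num <- lm(price ~ sqft_living + bedrooms + zipcode, data = house_num)
summary(fit_zip_num)$coefficients
sigma(fit_zip_num)
```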

##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) -1.626411e+08 2.480767e+07 -6.556083 7.085855e-11
## sqft_living  5.054627e+02 9.074986e+00 55.698448 0.000000e+00
## bedrooms    -6.881119e+04 8.899177e+03 -7.732309 1.695156e-14
## zipcode      1.657099e+03 2.528654e+02  6.553285 7.216985e-11

Seems nice: as the zipcode increases by one, the house price increases by 1.657e+03, and the effect is significant. What does that even mean? Nothing: zipcodes are labels, not quantities. Trust this model and you lose marks today and money in the future.

## [1] 330446.2

Worse performance compared to the previous model.

If we convert it back to a factor, here is how it is treated:
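A sketch, with zipcode as a factor again:

```r
fit_zip_fct <- lm(price ~ sqft_living + bedrooms + zipcode, data = house)
summary(fit_zip_fct)$coefficients
sigma(fit_zip_fct)
```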

##                   Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)    828134.6378 47123.635137  17.573658  3.448743e-64
## sqft_living       418.7334     7.941101  52.729887  0.000000e+00
## bedrooms       -63815.7027  7341.107123  -8.692926  7.473798e-18
## zipcode98040  -677766.2630 41407.194109 -16.368321  1.906721e-56
## zipcode98055 -1065439.9178 43387.416190 -24.556427 6.430236e-116
## zipcode98070  -952223.7374 46889.910920 -20.307647  3.733822e-83
## zipcode98105  -637317.1691 43320.379851 -14.711717  1.741128e-46
## zipcode98108  -981754.2269 44896.558837 -21.867026  9.336389e-95
## zipcode98112  -554994.5701 42120.602934 -13.176321  5.231895e-38
## zipcode98116  -789866.7428 42594.893077 -18.543696  1.053513e-70
## zipcode98119  -610409.5284 44322.553465 -13.771985  3.304207e-41
## [1] 267554.8

Being in any particular zipcode has a significant effect on the house price (each coefficient is relative to the baseline zipcode 98039, which is absorbed into the intercept). And the RMSE improves considerably compared to what we had before.

Dummy variables can also interact with other variables:
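A sketch, adding the sqft_living:zipcode interaction:

```r
fit_zip_int <- lm(price ~ sqft_living + bedrooms + zipcode +
                    sqft_living:zipcode, data = house)
summary(fit_zip_int)$coefficients
sigma(fit_zip_int)
```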

##                             Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)              -65673.2641  83921.61731  -0.7825548  4.339863e-01
## sqft_living                 634.1404     19.52976  32.4704584 2.635553e-184
## bedrooms                 -45326.1908   6790.77597  -6.6746703  3.242837e-11
## zipcode98040             129174.5264  90768.56655   1.4231196  1.548659e-01
## zipcode98055             243277.3593  92621.94675   2.6265628  8.694621e-03
## zipcode98070             312633.4600 101153.64370   3.0906792  2.026013e-03
## zipcode98105             101495.8254  90628.63447   1.1199090  2.628943e-01
## zipcode98108             289619.3898  97035.28271   2.9846813  2.875027e-03
## zipcode98112             113891.5210  89219.95288   1.2765252  2.019262e-01
## zipcode98116             271366.7670  88699.33779   3.0594001  2.248851e-03
## zipcode98119             174512.7297  94177.71753   1.8530151  6.403540e-02
## sqft_living:zipcode98040   -211.4208     22.92419  -9.2226074  7.480072e-20
## sqft_living:zipcode98055   -481.2920     30.26186 -15.9042412  1.463507e-53
## sqft_living:zipcode98070   -439.8288     35.28370 -12.4654954  2.449992e-34
## sqft_living:zipcode98105   -174.1214     25.84959  -6.7359424  2.152335e-11
## sqft_living:zipcode98108   -471.6581     34.93481 -13.5010930  9.898978e-40
## sqft_living:zipcode98112   -151.3601     23.58108  -6.4187116  1.731707e-10
## sqft_living:zipcode98116   -336.3363     26.09386 -12.8894842  1.674694e-36
## sqft_living:zipcode98119   -191.1781     29.16052  -6.5560600  7.100691e-11
## [1] 240174
## [1] 339115.3 334135.7 321015.0 319562.2 267554.8 240174.0

Regression Assumptions and Transformation

For a linear regression model to be valid, it must satisfy some criteria. These include:

  • Linear relationship between input and output
  • Homoskedasticity (constant variance of the residuals)
  • No or little multicollinearity
  • Normal and independent residuals

In ML, we often overlook some of these assumptions because, as the data gets bigger, multicollinearity and independence problems tend to resolve themselves. Nevertheless, the relationship must be roughly linear, and it is best to have approximately normal inputs and outputs:

This can be achieved by transforming the variables, either via log or sqrt transformations or something of the form X^n:
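For example, compare price to its log and square-root transforms (a sketch; the bin count is my own choice):

```r
library(gridExtra)

p_raw  <- ggplot(house, aes(price))       + geom_histogram(bins = 30)
p_log  <- ggplot(house, aes(log(price)))  + geom_histogram(bins = 30)
p_sqrt <- ggplot(house, aes(sqrt(price))) + geom_histogram(bins = 30)

grid.arrange(p_raw, p_log, p_sqrt, nrow = 1)
```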

Polynomial Regression

In a statistics course this would be a topic of its own, but for our purposes it is not a big deal. Sometimes the input variable X influences Y not linearly but polynomially. For example, it could be:

\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^2 + \epsilon_i\]

## [1] 328956.3

\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^1 + \beta_2 LivSpc_i^2 + \epsilon_i\]

## [1] 335170

\[HP_i \sim \beta_0 + \beta_1 LivSpc_i^1 + \beta_2 LivSpc_i^2 + \beta_3 LivSpc_i^3 + \epsilon_i\]

## [1] 321515.5
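The three RMSEs above were presumably produced along these lines (I() protects arithmetic inside a formula; the names are my own):

```r
fit_sq  <- lm(price ~ I(sqft_living^2), data = house)                 # degree 2 only
fit_12  <- lm(price ~ sqft_living + I(sqft_living^2), data = house)   # degrees 1-2
fit_123 <- lm(price ~ poly(sqft_living, 3, raw = TRUE), data = house) # degrees 1-3

sapply(list(fit_sq, fit_12, fit_123), sigma)
```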

Overfitting

To avoid running the same code and assigning names 12 times, we will use the lapply and sapply functions. You don’t have to know these.

First, we’ll define a function that takes the degree of the polynomial regression and adds the fitted regression line to the scatterplot:
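A minimal sketch of such a helper (the original is not shown):

```r
plot_poly <- function(degree) {
  ggplot(house, aes(sqft_living, price)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm", formula = y ~ poly(x, degree), se = FALSE) +
    ggtitle(paste("Degree:", degree))
}
```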

Now, we apply degrees from 1 to 4 and plot:
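With the helper above (grid.arrange() accepts a list of plots through its grobs argument):

```r
plots <- lapply(1:4, plot_poly)
gridExtra::grid.arrange(grobs = plots, ncol = 2)
```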

It seems that increasing the degree helps the fit follow the output values more closely.

What about this:

If we keep increasing the complexity of the model, it will follow this data ridiculously well; as a result, it will not adjust to any other data you feed it.

If we apply the model to other neighbourhoods, this is what will happen:

It looks like your first attempts when you learned to draw.

Train / Test Split

The solution to this problem is to evaluate the model on a set that is completely separate from the set it was trained on. This is called a train/test split.

Let’s split the data into train and test sets with an 80/20 ratio:
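A sketch that reproduces the sizes below; note that they sum to 21,613, so the split is presumably done on the full, unfiltered dataset (the seed is my own):

```r
set.seed(1)  # hypothetical seed; the original is not shown

full_house <- read.csv("house_data.csv")  # full data: 21,613 houses

train_idx <- sample(nrow(full_house), size = floor(0.8 * nrow(full_house)))
train <- full_house[train_idx, ]
test  <- full_house[-train_idx, ]

cat("train size:", nrow(train), "\n")
cat("test size:", nrow(test), "\n")
```

caret’s createDataPartition() is a common alternative that also stratifies the split on the outcome.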

## train size: 17290 
## test size: 4323