Introduction

In the last lab, we visited hypothesis testing and tested population mean and variance. We also compared two population means and two population variances. We will continue comparing means of populations, not of two subsamples, but of many. More clearly, we will hypothesize that the means of all populations of interest are equal. This can be done by analyzing variances.

In this lab we will:

carry out one way ANOVA,
carry out two way ANOVA.

Getting Started

Load house_subset.csv data.
Load tidyverse

library('tidyverse')
theme_set(theme_minimal())

houses <- read.csv('../Lab 3/house_subset.csv') # choose your directory!
houses$zipcode <- as.factor(houses$zipcode)

Analysis of Variance (One way)

The main idea behind ANOVA is simple. Assume we are interested in the house sizes in four regions and trying to test whether the population means are equal. The main idea is as follows; if the ratio between total variability within groups is not statistically different from variability among groups, then we cannot reject the hypothesis that the means are equal. We can see how this works:

temp <- subset(houses, zipcode %in% c(98040,98075,98053,98109))
medians <- temp %>% group_by(zipcode) %>% summarise(medians=median(sqft_living))
temp$zipcode <- factor(temp$zipcode, levels=medians$zipcode[order(medians$medians)])

ggplot(temp, aes(x=sqft_living,y=zipcode, colour=zipcode)) + 
  geom_boxplot(outlier.alpha = 0, fill=adjustcolor('grey50',.3), colour='black') +
  geom_jitter(alpha=.3) + 
  theme(legend.position = 'none')

The method is invented by Ronald Fisher who also invented a distribution that is used to compare variances. No, doesn’t start with F:

\[\frac{SSE/(n-m)}{SST/(m-1)} \Rightarrow F_{df1,df2}\sim \frac{MSE}{MST}\]

where \(df1\) and \(df2\) are \((m-1)\) and \((n-m)\), and \(n\) is number of observations, \(m\) is number of grous and 1 is one. The above statistics is what we already used last week; MSE and MST are analogical to variances. Both are mean squared deviances. MSE is calculated by averaging the deviations of each point from its group mean, and MST is calculated by averaging each group means from the overall mean. As you can guess, there is degrees of freedom. You will learn it in the lectures. R calculates all for you.

temp <- subset(houses, zipcode %in% c(98040,98075,98053))
anova(lm(sqft_living ~ zipcode,data = temp))

## Analysis of Variance Table
## 
## Response: sqft_living
##             Df     Sum Sq  Mean Sq F value    Pr(>F)    
## zipcode      2   48318450 24159225  20.616 1.656e-09 ***
## Residuals 1043 1222249998  1171860                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is too small, so we reject the hypothesis that the four regions have equal population means.

Let’s try another one:

temp <- subset(houses, zipcode %in% c(98027,98059,98072, 98011))
medians <- temp %>% group_by(zipcode) %>% summarise(medians=median(sqft_living))
temp$zipcode <- factor(temp$zipcode, levels=medians$zipcode[order(medians$medians)])

ggplot(temp, aes(x=sqft_living,y=zipcode, colour=zipcode)) + 
  geom_boxplot(outlier.alpha = 0, fill=adjustcolor('grey50',.3), colour='black') +
  geom_jitter(alpha=.3) + 
  theme(legend.position = 'none')

anova(lm(sqft_living ~ zipcode,data = temp))

## Analysis of Variance Table
## 
## Response: sqft_living
##             Df     Sum Sq Mean Sq F value   Pr(>F)   
## zipcode      3    9728355 3242785  3.8747 0.008976 **
## Residuals 1344 1124804977  836908                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the hypothesis that the four are equal

What if we drop 98011:

temp <- subset(houses, zipcode %in% c(98027,98059,98072))
anova(lm(sqft_living ~ zipcode,data = temp))

## Analysis of Variance Table
## 
## Response: sqft_living
##             Df     Sum Sq Mean Sq F value  Pr(>F)  
## zipcode      2    4165118 2082559  2.3336 0.09741 .
## Residuals 1150 1026288382  892425                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The three are closer. We cannot reject the hypothesis at 0.01 significance level but can reject at 0.05.

What if I compare only two regions. Is it different than t-test?

temp <- subset(houses, zipcode %in% c(98027,98059))
anova(lm(sqft_living ~ zipcode,data = temp))

## Analysis of Variance Table
## 
## Response: sqft_living
##            Df    Sum Sq Mean Sq F value  Pr(>F)  
## zipcode     1   2697473 2697473   2.909 0.08844 .
## Residuals 878 814169150  927300                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

sample1 <- subset(houses, zipcode ==98027)$sqft_living
sample2 <- subset(houses, zipcode ==98059)$sqft_living
t.test(sample1,sample2, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  sample1 and sample2
## t = 1.7056, df = 878, p-value = 0.08844
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.72585 238.63660
## sample estimates:
## mean of x mean of y 
##  2514.609  2403.654

The two test yields the same p-value. Comparing two populations using ANOVA (thus F-test) is the same as comparing two populations using t-test with var.equal=T

Two way ANOVA

Above we calculated one way ANOVA, that is the standard ANOVA. We compared population means using a sample continuous variable, sqft_living, and one categorical variable, zipcode. But as you can see above, that magical F-distribution is capable of testing many things at the same time. So what about comparing populaton means of a continuous variable based on the variances changing among two categorical variables, e.g. zipcode and whether it is multistorey (more than one floors):

temp <- subset(houses, zipcode %in% c(98122, 98027, 98116, 98072))
temp$multistorey <- temp$floors > 1 

ggplot(temp, aes(x=price, y=zipcode, colour=multistorey)) + 
  geom_boxplot()

It is obvious that whether or not it is multistorey effects the house price (compare the blue and red). It is also seen that zipcode effects the house prices, e.g. when we comparey only the reds we see that they are varying, but when we compare the blue boxplots they don’t seem very different. Notice also that the dealbreaker is the zipcode == 98072 & multistorey == F.

To carry out ANOVA in R, we don’t need to know more. We will use the same function:

temp <- subset(houses, zipcode %in% c(98122, 98027, 98116, 98072))
temp$multistorey <- temp$floors > 1 
anova(lm(price ~ zipcode + multistorey,data = temp))

## Analysis of Variance Table
## 
## Response: price
##               Df     Sum Sq    Mean Sq  F value  Pr(>F)    
## zipcode        3 6.5188e+11 2.1729e+11   3.3584 0.01823 *  
## multistorey    1 7.7201e+12 7.7201e+12 119.3173 < 2e-16 ***
## Residuals   1300 8.4112e+13 6.4702e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that both zipcode and multistorey are significantly changing the variability. Thus we reject the hypothesis that the population means among different zipcodes and house types are equal. In other words, where your house is and whether or not it is multistorey are significant factors to effect house prices.

As we discussed above, the dealbreaker (combo-killer, killjoy, blacksheep) is one region, 98072 (see the above boxplot). If we drop that region and compare the rest the test results might be different:

temp <- subset(houses, zipcode %in% c(98122, 98027, 98116))
temp$multistorey <- temp$floors > 1 
anova(lm(price ~ zipcode + multistorey,data = temp))

## Analysis of Variance Table
## 
## Response: price
##               Df     Sum Sq    Mean Sq F value Pr(>F)    
## zipcode        2 5.8219e+10 2.9110e+10   0.406 0.6664    
## multistorey    1 5.3697e+12 5.3697e+12  74.897 <2e-16 ***
## Residuals   1028 7.3702e+13 7.1695e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, we cannot reject the hypothesis that average houseprices among different regions are equal, but we can still reject that average house prices are equal when the house is one floor or more than one floor.

Deliverables

3. Two Way ANOVA

Choose one continuous variable and two categorical variables

if there are many categories in the categorical variables, you may find it useful to filter out some categories from both variables.
You may create a binary variable and use it as the second categorical variable (see the above examples where we created multistorey).

Plot the boxplot or jitter plot or density ringes or violin plot of the continuous variable.

Map the continuous variable to x axis and one categorical variable to y axis. Don’t make your plots simply fancier, but map the other categorical variable to colour (see the last boxplot above). Make sure that your categorical variables are treated as categorical.
Interpret whether the population means can be equal. You can simply compare the medians and tell what you see.

Carry out two way ANOVA

Carry out two way ANOVA to test whether population means are equal.
Interpret the results. According to the p-values, does any of the categorical variables significantly changing the population mean?

One Way ANOVA (EXERCISE, NOT GRADED, NOT TO BE SUBMITTED)

Choose one continuous variable and one categorical variable

if there are many categories in the categorical variable, you may find it useful to filter out some categories using subset(DATA, CATGVAR %in% c(V1,V2,...,VN)). See the above examples where we subseted to the data to compare only four regions.

Plot the boxplot or jitter plot or density ringes or violin plot of the continuous variable.

Map the continuous variable to x axis and categorical variable to y axis. Map categorical variable also to the colour or fill to make your plot fancier. Make sure that the categorical variable is treated as categorical (if it is string, you don’t need to to anything but if it is numeric then you may convert it. See the above examples where I converted zipcodes using as.factor).
Interpret whether the population means can be equal. You can simply compare the medians and tell what you see.

Carry out one way ANOVA:

If there are too many categories in the variable, you can compare only, e.g. four categories. (see the first bulletpoint)
Carry out the ANOVA to test whether population means are equal
Interpret the results. According to the p-values, can they be equal?

Lab 10: Analysis of Variance