Introduction

In the last lab, we visited a very central theorem in statistics, the Central Limit Theorem (CLT). Roughly speaking, it states that the sum (or mean) of a random sample follows a normal distribution. The underlying random variable can be non-normal; it doesn’t matter. If you repeatedly select students randomly on campus, average their ages, and plot the distribution of the averages you got, you will see a normal distribution. If you ask them how much money they have in their bank accounts and calculate the average, the averages will again follow a normal distribution.

\[Z \sim \frac{\bar X_i-\mu}{\sigma/\sqrt{n}}\]

Here \(\bar X_i\) is the averages you got from any sample of students, \(\mu\) is the population average (mean age of all students), \(\sigma\) is the population standard deviation (s.d. age of all students) and \(n\) is the number of observations in the sample (sample size).

We did it for house sizes. However, we assumed that the data we have covers all houses (the population). In fact, the data itself is a sample of 7633 houses in Seattle, WA, and the population is all houses in Seattle, WA. So if we calculate the average house size in the dataset, it will be \(\bar X\), but we don’t know \(\mu\) and \(\sigma\). In this lab we will show that we can hack this fact, and test whether the average house size can be, e.g., 4000 sqft in Seattle, WA.

In this lab we will go further and use this fact to:

  • carry out a t-test to test whether the population mean can be any number we wonder about,
  • carry out a t-test for comparing two samples, e.g. whether or not houses on Mercer Island are more expensive than those in West Seattle,
  • carry out a \(\chi^2\)-test to check whether the standard deviation can be equal to any number we wonder about,
  • carry out an F-test to check whether the standard deviations of two samples are equal.

Hypothesis Testing

As we discussed above, in last week’s lab we assumed that the data we have is the population, calculated the averages of random samples, and showed that the distribution of those averages is normal. This is a hack of life: it holds, and it is everywhere. As we also mentioned above, in reality the data we have is a sample of all houses.

If so, our data is just one sample that could have been drawn from all houses in Seattle, WA. There are many more possible samples than the one we have. Still, the average house size in our sample can help us understand the average house size in all of Seattle, WA, because we know the distribution of sample averages: it is normal.

Let’s clarify our point:

A friend of yours who started working in Seattle, WA, told you that the average house size in all of Seattle is \(\mu = 4000\) sqft.

You can test this hypothesis and reconsider your friendship accordingly. You have a sample from the population of all houses and you know the CLT. You need \(\bar X\), the mean of your data (see the formula above in the introduction):
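In R, the sample mean is one call away (the data frame and column names, houses$sqft_living, appear in the test outputs later in this lab):

```r
mean(houses$sqft_living)
```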

## [1] 2129.256

The only thing you don’t have is \(\sigma\), the standard deviation of all houses in Seattle. Luckily a humble brewer and self-trained statistician, who published academic articles under the pseudonym Student because he was working at Guinness, solved the problem and invented Student’s t distribution:

\[t_{df}\sim \frac{\bar X_i - \mu}{s/\sqrt n}\]

where \(s\) is the sample standard deviation and df is the degrees of freedom, equal to \(n-1\). The good news is that when the sample is large, \(t_{df} \approx z\). Here, large roughly means more than 30 observations.
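Plugging the numbers in directly (a sketch, reusing houses$sqft_living):

```r
x <- houses$sqft_living
(mean(x) - 4000) / (sd(x) / sqrt(length(x)))  # t-value for the hypothesis mu = 4000
```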

## [1] -175.6294

The number you got is far too small. The standard normal (and the t-distribution with large \(n\)) essentially ranges between -3 and 3. This number lies somewhere off the left edge of the plot below:

The t-value we calculated is -175.63. If you wonder where that value is, look left, too far away. So the probability of observing such a value if the hypothesis holds is p < 0.0001. Consider your friendship.

Now let’s assume you have other friends. One said the average house size is 2080, the other said 2100, another said 2120. Let’s check who can be correct:
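The same calculation, vectorized over the three hypothesized means (a sketch):

```r
x <- houses$sqft_living
mus <- c(2080, 2100, 2120)
(mean(x) - mus) / (sd(x) / sqrt(length(x)))  # one t-value per hypothesized mean
```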

## [1] 4.6242688 2.7466267 0.8689845

or you can use t-test function to arrive at the same numbers:
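For example (a sketch; the statistic component of the t.test result holds the t-value):

```r
sapply(c(2080, 2100, 2120), function(m) t.test(houses$sqft_living, mu = m)$statistic)
```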

The lines stand for population means of 2120, 2100 and 2080, and the corresponding t-values are 0.87, 2.75 and 4.62. The first one seems probable, but the second doesn’t seem likely, and the last one is far-fetched.

Hypothesis testing is roughly what we did above, but it comes with a certain vocabulary. We check whether the t-value falls in the tails or somewhere in between. The area under the curve is 1; we usually test whether the number falls in the tails, each of which covers 2.5% of the whole region:

There is also another notion, the p-value: the cumulative probability (shaded area) beyond the t-value we calculated. If the p-value is less than the threshold, the hypothesis is too unlikely.

Testing Population Mean

There are very handy functions in R to test these hypotheses. For the t-test we use …, guess what? If we want to test the hypothesis that the population mean is 2080, we use the mu parameter as below:
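A sketch of the call producing the output below:

```r
t.test(houses$sqft_living, mu = 2080)
```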

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = 4.6243, df = 7632, p-value = 3.821e-06
## alternative hypothesis: true mean is not equal to 2080
## 95 percent confidence interval:
##  2108.376 2150.136
## sample estimates:
## mean of x 
##  2129.256

You can see above that the t-value is calculated as 4.6243 and the corresponding p-value is very small, close to 0. So it is a very unlikely hypothesis. If the p-value were larger than, say, 5%, we could say: well, I am not sure I can reject the hypothesis.

What about \(\mu=2100\):
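The corresponding call would be:

```r
t.test(houses$sqft_living, mu = 2100)
```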

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = 2.7466, df = 7632, p-value = 0.006035
## alternative hypothesis: true mean is not equal to 2100
## 95 percent confidence interval:
##  2108.376 2150.136
## sample estimates:
## mean of x 
##  2129.256

Still, the t-value is very large, larger than acceptable levels (e.g. 1.96). The easier measure is the p-value, and it is again very small. Therefore it is very hard to accept the hypothesis.

Lastly, for the hypothesis that \(\mu=2120\) we can write:
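That is:

```r
t.test(houses$sqft_living, mu = 2120)
```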

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = 0.86898, df = 7632, p-value = 0.3849
## alternative hypothesis: true mean is not equal to 2120
## 95 percent confidence interval:
##  2108.376 2150.136
## sample estimates:
## mean of x 
##  2129.256

The t-value is 0.8689. It is well inside the body of the distribution, not in the tails. The p-value is 0.3849, so the t-value does not fall in the 5% critical region in the tails.

t.test returns an object that holds lots of information. We can extract its components using the $ sign:
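For example (a sketch; statistic, p.value and conf.int are components of the object t.test returns):

```r
res <- t.test(houses$sqft_living, mu = 2120)
res$statistic  # the t-value
res$p.value    # the p-value
res$conf.int   # the 95% confidence interval
```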

##         t 
## 0.8689845
## [1] 0.3848829
## [1] 2108.376 2150.136
## attr(,"conf.level")
## [1] 0.95

If we want to test multiple hypotheses at once, then we may find it useful to write a function:
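One possible sketch of such a function (the row labels mu, tval.t and pval in the output below are consistent with this construction):

```r
ttests <- function(x, mus) {
  sapply(mus, function(m) {
    res <- t.test(x, mu = m)
    # res$statistic carries the name "t", so its row gets labeled tval.t
    c(mu = m, tval = res$statistic, pval = res$p.value)
  })
}
ttests(houses$sqft_living, c(2080, 2100, 2120))
```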

##                [,1]         [,2]         [,3]
## mu     2.080000e+03 2.100000e+03 2120.0000000
## tval.t 4.624269e+00 2.746627e+00    0.8689845
## pval   3.821097e-06 6.035281e-03    0.3848829

Hypotheses

You do not necessarily test \(\mu = \mu_0\) (e.g. \(\mu = 4000\)); you can test \(\geq\) or \(\leq\) too. So there are three combinations that you can test here. Assume you want to carry out a test to compare the population mean with \(\mu_0\). Then you can test:

  1. \(H_0: \mu = \mu_0,\ H_A: \mu \neq \mu_0\)
  2. \(H_0: \mu \leq \mu_0,\ H_A: \mu > \mu_0\)
  3. \(H_0: \mu \geq \mu_0,\ H_A: \mu < \mu_0\)

and to reject these hypotheses, the t-value you calculated must fall into the following regions:

Take, for example, the second one. They told you the mean house size in Seattle is at most 4000. Then your \(\bar X\) should be somewhere below 4000. If the sample average is 0, it doesn’t bother you; it doesn’t contradict the hypothesis. If it is greater than 4000 you can tolerate it, but only up to a point. Mathematically, the difference can be positive, but the difference relative to the standard error, \((\bar X - \mu_0)/(s/\sqrt n)\), cannot be a large positive number.

These correspond to the following:

If you are testing \(H_0: \mu = 4000\) and reject if it is in the 5% critical region you write:
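A sketch of the call:

```r
t.test(houses$sqft_living, mu = 4000)  # two-sided test by default
```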

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = -175.63, df = 7632, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4000
## 95 percent confidence interval:
##  2108.376 2150.136
## sample estimates:
## mean of x 
##  2129.256

Reject the hypothesis

If you are testing \(H_0: \mu \leq 4000\) (and thus \(H_A: \mu > 4000\)) and reject \(H_0\) if it is in the 5% critical region you write:
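Using the alternative parameter of t.test:

```r
t.test(houses$sqft_living, mu = 4000, alternative = "greater")
```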

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = -175.63, df = 7632, p-value = 1
## alternative hypothesis: true mean is greater than 4000
## 95 percent confidence interval:
##  2111.734      Inf
## sample estimates:
## mean of x 
##  2129.256

Don’t reject the hypothesis

If you are testing \(H_0: \mu \geq 4000\) (and thus \(H_A: \mu < 4000\)) and reject \(H_0\) if it falls into the 5% critical region you write:
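Similarly:

```r
t.test(houses$sqft_living, mu = 4000, alternative = "less")
```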

## 
##  One Sample t-test
## 
## data:  houses$sqft_living
## t = -175.63, df = 7632, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 4000
## 95 percent confidence interval:
##      -Inf 2146.779
## sample estimates:
## mean of x 
##  2129.256

Reject the hypothesis

Testing Population Variance

Your first friend, who suggested that the mean population house size is 4000, now tries to explain away your proof by saying:

The variance of house sizes is very high. The population standard deviation is 2000.

To test this hypothesis, we will use a distribution for the variance, not the mean. Statisticians noticed that many things can be explained using the magical normal distribution and invented the \(\chi^2\)-distribution, which is obtained by squaring (and summing) standard normal variables. You can see that it starts at 0 and goes to infinity:

The test statistic here is not a t-value but a chi-squared value: \[\chi^2_{df} \sim \frac{(n-1) s^2}{\sigma_0^2}\]

where \(s\) is the sample standard deviation and \(\sigma_0\) is the hypothesized population standard deviation. You have everything you need to test this hypothesis, since \(s = 930.6\).
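The statistic itself is easy to compute by hand (a sketch):

```r
x <- houses$sqft_living
(length(x) - 1) * sd(x)^2 / 2000^2  # chi-squared value for sigma = 2000
```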

Similar to the t-test, we can compare the sample standard deviation with the hypothesized population value in 3 different ways:

  1. \(H_0: \sigma = \sigma_0\)
  2. \(H_0: \sigma \leq \sigma_0\)
  3. \(H_0: \sigma \geq \sigma_0\)

which correspond to testing whether the statistic falls into the critical (shaded) regions below:

If you are testing \(H_0: \sigma = 2000\) and reject if it is in the 5% critical region:
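The output header below suggests the varTest() function from the EnvStats package (an assumption; base R has no one-sample variance test). A sketch:

```r
library(EnvStats)  # assumed source of varTest()
varTest(houses$sqft_living, sigma.squared = 2000^2)
```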

## 
##  Chi-Squared Test on Variance
## 
## data:  houses$sqft_living
## Chi-Squared = 1652.4, df = 7632, p-value < 2.2e-16
## alternative hypothesis: true variance is not equal to 4e+06
## 95 percent confidence interval:
##  839189.8 894171.1
## sample estimates:
## variance 
## 866023.3

Reject the hypothesis

If you are testing \(H_0: \sigma \leq 2000\) and reject if it falls into the 5% critical region:
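Assuming varTest() from the EnvStats package, the one-sided version would be:

```r
varTest(houses$sqft_living, sigma.squared = 2000^2, alternative = "greater")
```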

## 
##  Chi-Squared Test on Variance
## 
## data:  houses$sqft_living
## Chi-Squared = 1652.4, df = 7632, p-value = 1
## alternative hypothesis: true variance is greater than 4e+06
## 95 percent confidence interval:
##  843440.1      Inf
## sample estimates:
## variance 
## 866023.3

Don’t reject the hypothesis

If you are testing \(H_0: \sigma \geq 2000\) and reject if it falls into the 5% critical region:
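And the other direction:

```r
varTest(houses$sqft_living, sigma.squared = 2000^2, alternative = "less")
```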

## 
##  Chi-Squared Test on Variance
## 
## data:  houses$sqft_living
## Chi-Squared = 1652.4, df = 7632, p-value < 2.2e-16
## alternative hypothesis: true variance is less than 4e+06
## 95 percent confidence interval:
##       0.0 889576.9
## sample estimates:
## variance 
## 866023.3

Reject the hypothesis

Comparing Two Populations

Comparing Means

According to the Central Limit Theorem, the distribution of sample means is normal. Besides, the sum (or difference) of two normals is also normal. We can use these facts to compare two populations: to check whether their population means are equal, or one is greater than the other, and so on.

Your once beloved friend now comes with a new argument. He lives in West Seattle and says the mean house prices in West Seattle and on Mercer Island are the same. Another sets the bar even higher and says West Seattle is more expensive than Mercer Island. The last one says the others are talking nonsense and West Seattle is cheaper than Mercer Island on average.

The above plot shows what we need (Mercer Island is 98040 and West Seattle is 98116). The samples differ greatly in their medians and interquartile ranges. But still, these are samples, not populations. So we will test the three hypotheses using the t-test:

Let’s write down the three hypotheses:

  1. \(H_0: \mu_1 = \mu_2 \Rightarrow \mu_1 - \mu_2 = 0\)
  2. \(H_0: \mu_1 - \mu_2 \geq 0\)
  3. \(H_0: \mu_1 - \mu_2 \leq 0\)

The test statistic for the above hypotheses is given below. Note that there are two test statistics for the comparison: one assumes the two samples have equal variances, the other assumes they have unequal variances. If you have enough evidence that the sample variances are unequal, you can use the formula below. Otherwise there is another test statistic, but we don’t need to know the formulae; R has them built in:

\[t_{df} \sim \frac{\bar X_1 - \bar X_2 - (\mu_1-\mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} =\frac{\bar X_1 - \bar X_2 }{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \text{(when population means are equal)} \]

and its degrees of freedom are calculated with a separate formula, which we won’t write down here.

We have enough reason (from the above boxplot) to believe that the sample variances for the two zipcodes are unequal. So we can test the first hypothesis using t.test as below:
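A sketch of the subsetting and the call (the column names price and zipcode are assumptions; sample1 and sample2 match the data names in the output below):

```r
sample1 <- houses$price[houses$zipcode == 98116]  # West Seattle (assumed column names)
sample2 <- houses$price[houses$zipcode == 98040]  # Mercer Island
t.test(sample1, sample2)  # Welch test: var.equal defaults to FALSE
```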

## 
##  Welch Two Sample t-test
## 
## data:  sample1 and sample2
## t = -14.858, df = 362.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -651777.3 -499414.4
## sample estimates:
## mean of x mean of y 
##  618634.2 1194230.0

Reject the hypothesis that the mean house price in two regions are equal.

The second hypothesis (\(\mu_1 \geq \mu_2\)) can be tested as below:
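With sample1 and sample2 as before, the alternative hypothesis is \(\mu_1 < \mu_2\):

```r
t.test(sample1, sample2, alternative = "less")
```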

## 
##  Welch Two Sample t-test
## 
## data:  sample1 and sample2
## t = -14.858, df = 362.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -511712.5
## sample estimates:
## mean of x mean of y 
##  618634.2 1194230.0

Reject the hypothesis that mean house price in West Seattle (x) is greater than or equal to mean price in Mercer Island (y).

The last hypothesis (\(\mu_1 \leq \mu_2\)) can be tested as below:
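Here the alternative is \(\mu_1 > \mu_2\):

```r
t.test(sample1, sample2, alternative = "greater")
```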

## 
##  Welch Two Sample t-test
## 
## data:  sample1 and sample2
## t = -14.858, df = 362.85, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -639479.2       Inf
## sample estimates:
## mean of x mean of y 
##  618634.2 1194230.0

We cannot reject that mean price in West Seattle is less than mean price in Mercer Island.

Let’s compare more comparable regions, e.g. University District with Queen Anne

## 
##  Welch Two Sample t-test
## 
## data:  sample1 and sample2
## t = 0.29785, df = 404.78, p-value = 0.766
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -74913.82 101668.25
## sample estimates:
## mean of x mean of y 
##  862825.2  849448.0

We cannot reject that mean house prices in these regions are equal.

Comparing Variance

In some cases, you may question whether two populations’ variances are equal, or one is greater than the other and so on.

Your friend is back and plays his last card: the variance of house prices in West Seattle is greater than the variance on Mercer Island.

To compare variances, you will use another statistic, obtained by dividing two \(\chi^2\) statistics (each scaled by its degrees of freedom); it follows an F distribution:

\[ F_{df_1,\,df_2} \sim \frac{\left[(n_1-1)s_1^2/\sigma_1^2\right]/(n_1-1)}{\left[(n_2-1)s_2^2/\sigma_2^2\right]/(n_2-1)} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{s_1^2}{s_2^2} \text{ (when } \sigma_1 = \sigma_2\text{)} \]

where \(df_1\) and \(df_2\) are \(n_1-1\) and \(n_2-1\). The F distribution looks pretty much like the \(\chi^2\) distribution, but it has two degrees-of-freedom parameters. The good news is you don’t need to think much about the underlying distribution when you are writing R code.

The first hypothesis, that the two variances are equal, can be tested as below:
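In base R the F test is var.test(); a sketch, with sample1 and sample2 as before:

```r
var.test(sample1, sample2)
```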

## 
##  F test to compare two variances
## 
## data:  sample1 and sample2
## F = 0.17172, num df = 329, denom df = 281, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1369032 0.2149555
## sample estimates:
## ratio of variances 
##          0.1717203

Reject the hypothesis that the variances of house prices in the two regions are equal.

The second hypothesis (\(\sigma_1 \geq \sigma_2\)) can be tested as below:
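The alternative here is \(\sigma_1 < \sigma_2\):

```r
var.test(sample1, sample2, alternative = "less")
```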

## 
##  F test to compare two variances
## 
## data:  sample1 and sample2
## F = 0.17172, num df = 329, denom df = 281, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is less than 1
## 95 percent confidence interval:
##  0.0000000 0.2073281
## sample estimates:
## ratio of variances 
##          0.1717203

Reject the hypothesis that the variance of house prices in West Seattle (x) is greater than or equal to the variance on Mercer Island (y).

The last hypothesis (\(\sigma_1 \leq \sigma_2\)) can be tested as below:
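And the other direction:

```r
var.test(sample1, sample2, alternative = "greater")
```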

## 
##  F test to compare two variances
## 
## data:  sample1 and sample2
## F = 0.17172, num df = 329, denom df = 281, p-value = 1
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
##  0.1419959       Inf
## sample estimates:
## ratio of variances 
##          0.1717203

We cannot reject that the variance of house prices in West Seattle is less than the variance on Mercer Island.

Deliverables

1. Comparing Two Variances

Choose a continuous or discrete numeric variable, and a categorical variable (e.g. zipcode here)

  1. Plot a boxplot, jitter plot, density ridges or violin plot of the numeric variable.
  • If there are a lot of 0’s, remove them (for the nuclear data only).
  • Map the numeric variable to the x axis and the categorical variable to the y axis. Map the categorical variable also to the colour or fill to make your plot fancier.
  2. Choose two values of your categorical variable (e.g. 98040 and 98105 here):
  • Subset the numeric variable based on the two categories; assign the first sample to sample1 and the second to sample2.
  • Write down the hypothesis that the two population variances are equal, and test the hypothesis.
  • Write down the hypothesis that the variance of the first population is greater than or equal to the second’s, and test it.
  • Write down the hypothesis that the variance of the first population is less than or equal to the second’s, and test it.
  • Interpret the test statistic and p-value for each. Can you reject the three hypotheses?

2. Comparing two population means

Depending on the result of the first part, whether or not you could reject the hypothesis that the sample variances are equal, compare the two population means. Recall that there are two tests for comparing the population means depending on whether the variances are equal or unequal.

  • Write down the three hypotheses: the two population means are equal; the first is greater than or equal to the second; the first is less than or equal to the second.
  • Use t-test to test the hypotheses.
  • Interpret the results. Can you reject any of these? Is there any one you cannot reject?