Introduction

In the last two labs, we visited hypothesis testing and tested population mean and variance. This week we will test proportions. More clearly, we will hypothesize that the proportion is equal to an ideal value, and check whether two proportions are equal.

In this lab we will:

carry out proportion test,
carry out hypothesis testing for comparing two proportions.

Getting Started

Load house_subset.csv data and game_goals.csv.
Load tidyverse

library('tidyverse')
theme_set(theme_minimal())

goals <- read.csv('../Lab 4/data/game_goals.csv') # choose your directory!
head(goals)

##          player season rank       date game_num    age team   at opp location
## 1 Alex Ovechkin   2006    1 2005-10-05        1 20-018  WSH <NA> CBJ     Home
## 2 Alex Ovechkin   2006    2 2005-10-07        2 20-020  WSH <NA> ATL     Home
## 3 Alex Ovechkin   2006    3 2005-10-08        3 20-021  WSH    @ ATL     Away
## 4 Alex Ovechkin   2006    4 2005-10-10        4 20-023  WSH <NA> NYR     Home
## 5 Alex Ovechkin   2006    5 2005-10-12        5 20-025  WSH    @ CAR     Away
## 6 Alex Ovechkin   2006    6 2005-10-13        6 20-026  WSH <NA> NYI     Home
##   outcome goals assists points plus_minus penalty_min goals_even
## 1       W     2       0      2          1           2          1
## 2       L     0       1      1         -2           0          0
## 3       L     0       1      1          0           4          0
## 4       W     1       0      1          1           2          0
## 5       L     1       0      1          0           0          1
## 6       L     0       1      1         -1           0          0
##   goals_powerplay goals_short goals_gamewinner assists_even assists_powerplay
## 1               1           0                0           NA                NA
## 2               0           0                0           NA                NA
## 3               0           0                0           NA                NA
## 4               1           0                1           NA                NA
## 5               0           0                0           NA                NA
## 6               0           0                0           NA                NA
##   assists_short shots shot_percent
## 1            NA     5         40.0
## 2            NA     1          0.0
## 3            NA     3          0.0
## 4            NA     6         16.7
## 5            NA     6         16.7
## 6            NA     5          0.0

Proportion test

Proportion test is not different from t-test in principle. Here the test statistics is even easier to calculate as it requres less parameter. To be more clear, we don’t need explicit \(\mu\) and \(\sigma\) values, we will make use of binomial distribution to calculate the statistics.

Recall that for binomial distribution. Assume \(X\) is number of success and \(n\) is the number of trials, thus proportion is \(p=X/n\). If \(X\) follows binomial distribution, then \(E[X/n] = pn/n=p\) and \(\sigma^2_p=np(1-p)/n^2 = p(1-p)/n\). Then from CLT we can use z-distribution, \((\bar X - \mu)/(\sigma/\sqrt n)\). For this to work, \(n\) must be a large number and t-test won’t work here. For sample proportion \(\hat P\) and population proportion \(p\):

\[z\sim \frac{\hat P - p}{\sqrt{pq/n}}\]

To compare the two populations, we will use the same by replace the mean and variance to proper places in t-statistics for comparison:

\[z\sim \frac{\hat P_1 - \hat P_2 - (p_1-p_2)}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} = \frac{\hat P_1 - \hat P_2}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} \text{ for } p_1=p_2 \]

Goals Data

We can test proportion of any teams success rate, proportion of wins, overall the last 10 seasons. The data may not be perfect, but we will assume it is.

Knowing that each team can play one game a night, if we aggregate by outcome season and team we can calculate the proportions as below:

goals_agg <- goals %>% subset(season>2010) %>% group_by(team,date, season, outcome) %>%  summarize(ngoals=sum(goals))
goals_agg$outcome <- ifelse(goals_agg$outcome == 'W',1,0)
goals_agg <- group_by(goals_agg, team) %>% summarise(nwins = sum(outcome), ngames=n())
goals_agg

## # A tibble: 26 x 3
##    team  nwins ngames
##    <fct> <dbl>  <int>
##  1 ANA     407    758
##  2 ARI      31     66
##  3 BOS     442    766
##  4 CAR     274    608
##  5 CBJ      14     39
##  6 CGY     138    280
##  7 CHI     408    759
##  8 COL      95    225
##  9 DAL     388    755
## 10 EDM      26     50
## # … with 16 more rows

props <- mutate(goals_agg, phat = nwins/ngames) %>% select(team, phat)
props$team <- factor(props$team, levels=props$team[order(props$phat)])

ggplot(props, aes(y=team)) +
  geom_point(aes(x=phat), colour='firebrick') + 
  geom_segment(aes(x=0, yend = team, xend=phat), colour = 'grey', alpha=.5) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(legend.position = 'none') + 
  ggtitle('Win rates of teams in last 10 seasons')

If we want to test whether win rate of Washington Capitals is equal to, e.g., 0.65 we can test it using prop.test. Assume the following hypotheses:

\(H_0: p=0.65, H_A: p\neq 0.65\)
\(H_0: p\geq 0.65, H_A: p< 0.65\)
\(H_0: p\leq0.65, H_A: p> 0.65\)

We can test these as below:

nwins  <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65)

## 
##  1-sample proportions test with continuity correction
## 
## data:  nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 0.0001841
## alternative hypothesis: true p is not equal to 0.65
## 95 percent confidence interval:
##  0.5481186 0.6196680
## sample estimates:
##         p 
## 0.5843293

Reject the hypothesis that it is equal to 0.65

We can test \(H_0: p\geq 0.65\) as below:

nwins  <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65, alternative = 'less')

## 
##  1-sample proportions test with continuity correction
## 
## data:  nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 9.203e-05
## alternative hypothesis: true p is less than 0.65
## 95 percent confidence interval:
##  0.0000000 0.6141724
## sample estimates:
##         p 
## 0.5843293

Reject the hypothesis that it is greater than or equal to 0.65

We can test \(H_0: p\leq 0.65\) as below:

nwins  <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65, alternative = 'greater')

## 
##  1-sample proportions test with continuity correction
## 
## data:  nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 0.9999
## alternative hypothesis: true p is greater than 0.65
## 95 percent confidence interval:
##  0.5538689 1.0000000
## sample estimates:
##         p 
## 0.5843293

Don’t reject the hypothesis that it is less than or equal to 0.65

Similarly we can compare two proportions, e.g. compare WSH with PIT’s win rates:

x1 <- goals_agg$nwins[goals_agg$team == 'WSH']
x2 <- goals_agg$nwins[goals_agg$team == 'PIT']
n1 <- goals_agg$ngames[goals_agg$team == 'WSH']
n2 <- goals_agg$ngames[goals_agg$team == 'PIT']

prop.test(x = c(x1,x2), n = c(n1,n2))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.6994
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06220782  0.03994529
## sample estimates:
##    prop 1    prop 2 
## 0.5843293 0.5954606

Don’t reject the hypothesis that they are equal.

prop.test(x = c(x1,x2), n = c(n1,n2), alternative = 'less')

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.3497
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.00000000  0.03194761
## sample estimates:
##    prop 1    prop 2 
## 0.5843293 0.5954606

Don’t rejecct the hypohtesis that win rate WSH is greater than or equal to rate PIT

prop.test(x = c(x1,x2), n = c(n1,n2), alternative = 'greater')

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.6503
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.05421014  1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.5843293 0.5954606

Don’t rejecct the hypohtesis that win rate WSH is less than or equal to rate PIT.

House Data

Here I processed the data very to calculate the proportions. But it is not that hard always. For example we can calculate the proportion of multistorey houses in Mercer Island and test some hypotheses, or compare it with University District:

houses <- read.csv('../Lab 3/house_subset.csv')
houses$zipcode <- as.factor(houses$zipcode)
houses$multistorey <- ifelse(houses$floors > 1, 1 , 0)

props <- houses %>% group_by(zipcode) %>% summarize(phat = mean(multistorey))
props$zipcode <- factor(props$zipcode, levels=props$zipcode[order(props$phat)])

ggplot(props, aes(y=zipcode)) +
  geom_point(aes(x=phat), colour='firebrick') + 
  geom_segment(aes(x=0, yend = zipcode, xend=phat), colour = 'grey', alpha=.5) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(legend.position = 'none') + 
  ggtitle('Proportions of multistorey houses in different neighbourhoods')

We can test the hypothesis that proportion of multistorey houses is equal to / greater than or equal to / less than or equal to 0.5 for Mercer Island as below:

houses_agg <- houses %>% group_by(zipcode) %>% summarize(nmulti = sum(multistorey), allhouses = n())

x_mercer  <- houses_agg$nmulti[houses_agg$zipcode == 98040]
n_mercer  <- houses_agg$allhouses[houses_agg$zipcode == 98040]

prop.test(x = x_mercer, n = n_mercer, p=0.5)

## 
##  1-sample proportions test with continuity correction
## 
## data:  x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.7659
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4507915 0.5701905
## sample estimates:
##         p 
## 0.5106383

prop.test(x = x_mercer, n = n_mercer, p=0.5, alternative = 'less')

## 
##  1-sample proportions test with continuity correction
## 
## data:  x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.6171
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.5610201
## sample estimates:
##         p 
## 0.5106383

prop.test(x = x_mercer, n = n_mercer, p=0.5, alternative = 'greater')

## 
##  1-sample proportions test with continuity correction
## 
## data:  x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.3829
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.460047 1.000000
## sample estimates:
##         p 
## 0.5106383

houses_agg <- houses %>% group_by(zipcode) %>% summarize(nmulti = sum(multistorey), allhouses = n())

x_mercer  <- houses_agg$nmulti[houses_agg$zipcode == 98040]
n_mercer  <- houses_agg$allhouses[houses_agg$zipcode == 98040]
x_univ    <- houses_agg$nmulti[houses_agg$zipcode == 98109]
n_univ    <- houses_agg$allhouses[houses_agg$zipcode == 98109]


prop.test(x = c(x_mercer,x_univ), n = c(n_mercer,n_univ))

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 0.0001981
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.3226403 -0.1056244
## sample estimates:
##    prop 1    prop 2 
## 0.5106383 0.7247706

prop.test(x = c(x_mercer,x_univ), n = c(n_mercer,n_univ), alternative = 'less')

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 9.906e-05
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.000000 -0.122047
## sample estimates:
##    prop 1    prop 2 
## 0.5106383 0.7247706

prop.test(x = c(x_mercer,x_univ), n = c(n_mercer,n_univ), alternative = 'greater')

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 0.9999
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.3062176  1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.5106383 0.7247706

Deliverables

4. Testing proportions

Choose (1) a variable (numeric or categorical, doesn’t matter) and a (2) categorical variable.

Mutate the data by creating a new binary variable using the variable (1) and a criterion. Here, for example, we created a binary variable multistorey using floors > 1 denoting whether a house is multistorey. With NHL data we used a categorical variable and created win/not win binary variable using outcome=='W'.
Group your mutated data using by the categorical (2) variable (here we used zipcode) and calculate \(X\) (the number of success) and \(n\). Here, for example, \(X\) was nmulti (number of multistorey houses) and \(n\) was the total number of houses
Choose two categories from your categorical variable (2). Here, for example, we used zipcodes 98040 and 98109.
Compare the two categories’ proportions using prop.test.

Lab 11: Testing Proportions