Lab 11: Testing Proportions
Introduction
In the last two labs, we visited hypothesis testing and tested population mean and variance. This week we will test proportions. More clearly, we will hypothesize that the proportion is equal to an ideal value, and check whether two proportions are equal.
In this lab we will:
- carry out proportion test,
- carry out hypothesis testing for comparing two proportions.
Getting Started
- Load house_subset.csv data and game_goals.csv.
- Load
tidyverse
## player season rank date game_num age team at opp location
## 1 Alex Ovechkin 2006 1 2005-10-05 1 20-018 WSH <NA> CBJ Home
## 2 Alex Ovechkin 2006 2 2005-10-07 2 20-020 WSH <NA> ATL Home
## 3 Alex Ovechkin 2006 3 2005-10-08 3 20-021 WSH @ ATL Away
## 4 Alex Ovechkin 2006 4 2005-10-10 4 20-023 WSH <NA> NYR Home
## 5 Alex Ovechkin 2006 5 2005-10-12 5 20-025 WSH @ CAR Away
## 6 Alex Ovechkin 2006 6 2005-10-13 6 20-026 WSH <NA> NYI Home
## outcome goals assists points plus_minus penalty_min goals_even
## 1 W 2 0 2 1 2 1
## 2 L 0 1 1 -2 0 0
## 3 L 0 1 1 0 4 0
## 4 W 1 0 1 1 2 0
## 5 L 1 0 1 0 0 1
## 6 L 0 1 1 -1 0 0
## goals_powerplay goals_short goals_gamewinner assists_even assists_powerplay
## 1 1 0 0 NA NA
## 2 0 0 0 NA NA
## 3 0 0 0 NA NA
## 4 1 0 1 NA NA
## 5 0 0 0 NA NA
## 6 0 0 0 NA NA
## assists_short shots shot_percent
## 1 NA 5 40.0
## 2 NA 1 0.0
## 3 NA 3 0.0
## 4 NA 6 16.7
## 5 NA 6 16.7
## 6 NA 5 0.0
Proportion test
Proportion test is not different from t-test in principle. Here the test statistics is even easier to calculate as it requres less parameter. To be more clear, we don’t need explicit \(\mu\) and \(\sigma\) values, we will make use of binomial distribution to calculate the statistics.
Recall that for binomial distribution. Assume \(X\) is number of success and \(n\) is the number of trials, thus proportion is \(p=X/n\). If \(X\) follows binomial distribution, then \(E[X/n] = pn/n=p\) and \(\sigma^2_p=np(1-p)/n^2 = p(1-p)/n\). Then from CLT we can use z-distribution, \((\bar X - \mu)/(\sigma/\sqrt n)\). For this to work, \(n\) must be a large number and t-test won’t work here. For sample proportion \(\hat P\) and population proportion \(p\):
\[z\sim \frac{\hat P - p}{\sqrt{pq/n}}\]
To compare the two populations, we will use the same by replace the mean and variance to proper places in t-statistics for comparison:
\[z\sim \frac{\hat P_1 - \hat P_2 - (p_1-p_2)}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} = \frac{\hat P_1 - \hat P_2}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} \text{ for } p_1=p_2 \]
Goals Data
We can test proportion of any teams success rate, proportion of wins, overall the last 10 seasons. The data may not be perfect, but we will assume it is.
Knowing that each team can play one game a night, if we aggregate by outcome season and team we can calculate the proportions as below:
goals_agg <- goals %>% subset(season>2010) %>% group_by(team,date, season, outcome) %>% summarize(ngoals=sum(goals))
goals_agg$outcome <- ifelse(goals_agg$outcome == 'W',1,0)
goals_agg <- group_by(goals_agg, team) %>% summarise(nwins = sum(outcome), ngames=n())
goals_agg
## # A tibble: 26 x 3
## team nwins ngames
## <fct> <dbl> <int>
## 1 ANA 407 758
## 2 ARI 31 66
## 3 BOS 442 766
## 4 CAR 274 608
## 5 CBJ 14 39
## 6 CGY 138 280
## 7 CHI 408 759
## 8 COL 95 225
## 9 DAL 388 755
## 10 EDM 26 50
## # … with 16 more rows
props <- mutate(goals_agg, phat = nwins/ngames) %>% select(team, phat)
props$team <- factor(props$team, levels=props$team[order(props$phat)])
ggplot(props, aes(y=team)) +
geom_point(aes(x=phat), colour='firebrick') +
geom_segment(aes(x=0, yend = team, xend=phat), colour = 'grey', alpha=.5) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(legend.position = 'none') +
ggtitle('Win rates of teams in last 10 seasons')
If we want to test whether win rate of Washington Capitals is equal to, e.g., 0.65 we can test it using prop.test
. Assume the following hypotheses:
- \(H_0: p=0.65, H_A: p\neq 0.65\)
- \(H_0: p\geq 0.65, H_A: p< 0.65\)
- \(H_0: p\leq0.65, H_A: p> 0.65\)
We can test these as below:
nwins <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65)
##
## 1-sample proportions test with continuity correction
##
## data: nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 0.0001841
## alternative hypothesis: true p is not equal to 0.65
## 95 percent confidence interval:
## 0.5481186 0.6196680
## sample estimates:
## p
## 0.5843293
Reject the hypothesis that it is equal to 0.65
We can test \(H_0: p\geq 0.65\) as below:
nwins <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65, alternative = 'less')
##
## 1-sample proportions test with continuity correction
##
## data: nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 9.203e-05
## alternative hypothesis: true p is less than 0.65
## 95 percent confidence interval:
## 0.0000000 0.6141724
## sample estimates:
## p
## 0.5843293
Reject the hypothesis that it is greater than or equal to 0.65
We can test \(H_0: p\leq 0.65\) as below:
nwins <- goals_agg$nwins[goals_agg$team == 'WSH']
ngames <- goals_agg$ngames[goals_agg$team == 'WSH']
prop.test(x = nwins, n = ngames, p=0.65, alternative = 'greater')
##
## 1-sample proportions test with continuity correction
##
## data: nwins out of ngames, null probability 0.65
## X-squared = 13.987, df = 1, p-value = 0.9999
## alternative hypothesis: true p is greater than 0.65
## 95 percent confidence interval:
## 0.5538689 1.0000000
## sample estimates:
## p
## 0.5843293
Don’t reject the hypothesis that it is less than or equal to 0.65
Similarly we can compare two proportions, e.g. compare WSH with PIT’s win rates:
x1 <- goals_agg$nwins[goals_agg$team == 'WSH']
x2 <- goals_agg$nwins[goals_agg$team == 'PIT']
n1 <- goals_agg$ngames[goals_agg$team == 'WSH']
n2 <- goals_agg$ngames[goals_agg$team == 'PIT']
prop.test(x = c(x1,x2), n = c(n1,n2))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.6994
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.06220782 0.03994529
## sample estimates:
## prop 1 prop 2
## 0.5843293 0.5954606
Don’t reject the hypothesis that they are equal.
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.3497
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.00000000 0.03194761
## sample estimates:
## prop 1 prop 2
## 0.5843293 0.5954606
Don’t rejecct the hypohtesis that win rate WSH is greater than or equal to rate PIT
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x1, x2) out of c(n1, n2)
## X-squared = 0.14906, df = 1, p-value = 0.6503
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.05421014 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.5843293 0.5954606
Don’t rejecct the hypohtesis that win rate WSH is less than or equal to rate PIT.
House Data
Here I processed the data very to calculate the proportions. But it is not that hard always. For example we can calculate the proportion of multistorey houses in Mercer Island and test some hypotheses, or compare it with University District:
houses <- read.csv('../Lab 3/house_subset.csv')
houses$zipcode <- as.factor(houses$zipcode)
houses$multistorey <- ifelse(houses$floors > 1, 1 , 0)
props <- houses %>% group_by(zipcode) %>% summarize(phat = mean(multistorey))
props$zipcode <- factor(props$zipcode, levels=props$zipcode[order(props$phat)])
ggplot(props, aes(y=zipcode)) +
geom_point(aes(x=phat), colour='firebrick') +
geom_segment(aes(x=0, yend = zipcode, xend=phat), colour = 'grey', alpha=.5) +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(legend.position = 'none') +
ggtitle('Proportions of multistorey houses in different neighbourhoods')
We can test the hypothesis that proportion of multistorey houses is equal to / greater than or equal to / less than or equal to 0.5 for Mercer Island as below:
houses_agg <- houses %>% group_by(zipcode) %>% summarize(nmulti = sum(multistorey), allhouses = n())
x_mercer <- houses_agg$nmulti[houses_agg$zipcode == 98040]
n_mercer <- houses_agg$allhouses[houses_agg$zipcode == 98040]
prop.test(x = x_mercer, n = n_mercer, p=0.5)
##
## 1-sample proportions test with continuity correction
##
## data: x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.7659
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4507915 0.5701905
## sample estimates:
## p
## 0.5106383
##
## 1-sample proportions test with continuity correction
##
## data: x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.6171
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.5610201
## sample estimates:
## p
## 0.5106383
##
## 1-sample proportions test with continuity correction
##
## data: x_mercer out of n_mercer, null probability 0.5
## X-squared = 0.088652, df = 1, p-value = 0.3829
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.460047 1.000000
## sample estimates:
## p
## 0.5106383
houses_agg <- houses %>% group_by(zipcode) %>% summarize(nmulti = sum(multistorey), allhouses = n())
x_mercer <- houses_agg$nmulti[houses_agg$zipcode == 98040]
n_mercer <- houses_agg$allhouses[houses_agg$zipcode == 98040]
x_univ <- houses_agg$nmulti[houses_agg$zipcode == 98109]
n_univ <- houses_agg$allhouses[houses_agg$zipcode == 98109]
prop.test(x = c(x_mercer,x_univ), n = c(n_mercer,n_univ))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 0.0001981
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.3226403 -0.1056244
## sample estimates:
## prop 1 prop 2
## 0.5106383 0.7247706
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 9.906e-05
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.000000 -0.122047
## sample estimates:
## prop 1 prop 2
## 0.5106383 0.7247706
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(x_mercer, x_univ) out of c(n_mercer, n_univ)
## X-squared = 13.849, df = 1, p-value = 0.9999
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.3062176 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.5106383 0.7247706
Deliverables
4. Testing proportions
Choose (1) a variable (numeric or categorical, doesn’t matter) and a (2) categorical variable.
Mutate the data by creating a new binary variable using the variable (1) and a criterion. Here, for example, we created a binary variable
multistorey
usingfloors > 1
denoting whether a house is multistorey. With NHL data we used a categorical variable and created win/not win binary variable usingoutcome=='W'
.Group your mutated data using by the categorical (2) variable (here we used
zipcode
) and calculate \(X\) (the number of success) and \(n\). Here, for example, \(X\) was nmulti (number of multistorey houses) and \(n\) was the total number of housesChoose two categories from your categorical variable (2). Here, for example, we used zipcodes 98040 and 98109.
Compare the two categories’ proportions using
prop.test
.