Lab 4: Discrete and Continuous Distributions
Introduction
In this tutorial, you will learn discrete and continuous distributions, how they look like and how to obtain them. We will be visualizing random generated numbers and compare them with the real distributions from the data. By the end of the tutorial you will be able to:
- visually identify certain distributions,
- identify the distribution of a variable in a function.
Getting Started
We will work with three datasets. To be able to follow you need:
- to download earthquake.csv, game_goals.csv and returns.csv data uploaded on LEARN.
- some packages including
tidyverse
lubridate
Now we can read the datasets and have a closer look
library('tidyverse')
goals.p <- read.csv('data/game_goals.csv')
eq <- read.csv('data/earthquake.csv')
returns <- read.csv('data/returns.csv')
## [1] 49384 25
## player season rank date game_num age team at opp location
## 1 Alex Ovechkin 2006 1 2005-10-05 1 20-018 WSH <NA> CBJ Home
## 2 Alex Ovechkin 2006 2 2005-10-07 2 20-020 WSH <NA> ATL Home
## 3 Alex Ovechkin 2006 3 2005-10-08 3 20-021 WSH @ ATL Away
## 4 Alex Ovechkin 2006 4 2005-10-10 4 20-023 WSH <NA> NYR Home
## 5 Alex Ovechkin 2006 5 2005-10-12 5 20-025 WSH @ CAR Away
## 6 Alex Ovechkin 2006 6 2005-10-13 6 20-026 WSH <NA> NYI Home
## outcome goals assists points plus_minus penalty_min goals_even
## 1 W 2 0 2 1 2 1
## 2 L 0 1 1 -2 0 0
## 3 L 0 1 1 0 4 0
## 4 W 1 0 1 1 2 0
## 5 L 1 0 1 0 0 1
## 6 L 0 1 1 -1 0 0
## goals_powerplay goals_short goals_gamewinner assists_even assists_powerplay
## 1 1 0 0 NA NA
## 2 0 0 0 NA NA
## 3 0 0 0 NA NA
## 4 1 0 1 NA NA
## 5 0 0 0 NA NA
## 6 0 0 0 NA NA
## assists_short shots shot_percent
## 1 NA 5 40.0
## 2 NA 1 0.0
## 3 NA 3 0.0
## 4 NA 6 16.7
## 5 NA 6 16.7
## 6 NA 5 0.0
## [1] 42
Discrete Distributions
There are a couple of discrete distributions each are for different random variables:
Binomial Distribution
Probability of \(k\) success in \(n\) trials with success rate, \(p\):
\[X \sim Binom(n,k,p) \Rightarrow P(X=k) = \frac{n!}{(n-k)!k!} p^k(1-p)^{n-k}\]
Examples:
- Number of wins in a pre-season of 8 games
- Number of people attending to the flight (k) where there are 50 seats (n) and 95% attendance rate (p)
Geometric Distribution
Probability of success in \(k\)th trial with success rate of \(p\): \[X \sim Geom(k,p) \Rightarrow P(X=k) = p(1-p)^{k-1}\]
Examples:
- Number of times you try the episode (k) in your favourite video game with 40% success rate.
- Number of times you install and remove (k) Linux/Ubuntu to your computer before you get addicted to it with 20% success rate.
- Number of times you propose marriag…
Poisson Distribution
Probability of a given number of events occurring in a given time interval or space.
\[X \sim Pois(k,\lambda) \Rightarrow P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
Examples:`
- Number of goals in a hockey match with goal rate (avg. goal) \(\lambda = 3.22\)
- Number of typo-errors in the lab files that I share with error rate (average error), \(\lambda = 7.41\)
- Number of people (k) to withdraw money from the ATM in University Plaza in a day with average \(\lambda = 220\)
Continuous Distributions
There are a couple of discrete distributions each are for different random variables:
Normal
\[X \sim N(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12(\frac{x-\mu}{\sigma})^2}\]
Exponential
\[X \sim exp(\lambda) \Rightarrow f(x) = \lambda e^{-\lambda x}\]
Uniform
Log-Normal
Beta
Gamma
Weibull
Identifying Discrete Distributions
Aggregating Data (Group By and Sum)
Last week we covered some ways to aggregate the data but just by counting. We will add one more function today.
The data contains scores per match per player, so if we sum the goals column by grouping the players we can obtain the players’ scorecard:
agg_goals <- group_by(goals.p, player) %>%
summarise(ngoals = sum(goals)) %>%
arrange(desc(ngoals))
head(agg_goals)
## # A tibble: 6 x 2
## player ngoals
## <chr> <int>
## 1 Wayne Gretzky 894
## 2 Jaromir Jagr 766
## 3 Brett Hull 741
## 4 Mike Gartner 708
## 5 Alex Ovechkin 701
## 6 Mark Messier 694
What about scoring rate per player?
agg_goals <- group_by(goals.p, player) %>%
summarise(ngoals = mean(goals)) %>%
arrange(desc(ngoals))
head(agg_goals)
## # A tibble: 6 x 2
## player ngoals
## <chr> <dbl>
## 1 Mario Lemieux 0.754
## 2 Alex Ovechkin 0.612
## 3 Wayne Gretzky 0.601
## 4 Brett Hull 0.584
## 5 Steven Stamkos 0.526
## 6 Mike Gartner 0.494
Number of goals per the date
and team
:
## # A tibble: 6 x 3
## # Groups: date [2]
## date team ngoals
## <chr> <chr> <int>
## 1 2020-02-25 TBL 0
## 2 2020-02-25 TOR 2
## 3 2020-02-25 WSH 1
## 4 2020-02-26 LAK 0
## 5 2020-02-26 PIT 0
## 6 2020-02-26 VEG 1
Note that our dataset contains only 42 players, the above doesn’t represent the real scores, but filters only the goals by the 42 players.
Both player’s total
number of goals and average
per game :
agg_goals <- group_by(goals.p, player) %>%
summarise(ngoals= sum(goals), rate = mean(goals)) %>%
arrange(desc(ngoals))
head(agg_goals)
## # A tibble: 6 x 3
## player ngoals rate
## <chr> <int> <dbl>
## 1 Wayne Gretzky 894 0.601
## 2 Jaromir Jagr 766 0.442
## 3 Brett Hull 741 0.584
## 4 Mike Gartner 708 0.494
## 5 Alex Ovechkin 701 0.612
## 6 Mark Messier 694 0.395
Sometimes we need to group by just counting rows. For example in covid
dataset each row corresponds to a person who is diagnosed to have the disease. Let’s group by City
and Gender
:
Discrete Numeric Variables
As we discussed above, probability of a given number of events occurring in a given time interval or space follows a Poisson distribution. Therefore, number of goals per player in a given match must be Poisson too with \(\lambda\) being the average goals per match:
## [1] 0.413636
In the data the average goal by a team is \(\lambda=\) 0.413636. On the other hand, the distribution of goals per game looks like as below. By default number of bins is 30, but we want one bin per each possible goal, i.e. max(goals) - min(goals) + 1
.
goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
ggplot(goals.p, aes(x=goals)) +
geom_histogram(bins = goalRange, colour = 'white', alpha=.8) +
theme_minimal()
The above is counts. We will plot the pmf using y=..density..
:
goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
ggplot(goals.p, aes(x=goals, y=..density..)) +
geom_histogram(bins = goalRange, colour = 'white', alpha=.8) +
scale_x_continuous(n.breaks = goalRange) +
theme_minimal()
Do you think it is Poisson? Well, there is a way to verify this. We will add the ideal probabilities \(P(X = k)\) following a poisson distribution with average goal, \(\lambda=0.413\)on the graph:
\[X \sim Pois(k,\lambda) \Rightarrow P(X=k) = \frac{0.41^k e^{-0.41}}{k!}\]
goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
lambda <- mean(goals.p$goals)
pois <- data.frame(x = 0:max(goals.p$goals))
pois$p <- dpois(pois$x, lambda) # identical probabilities
t(pois)
## [,1] [,2] [,3] [,4] [,5] [,6]
## x 0.0000000 1.0000000 2.00000000 3.000000000 4.0000000000 5.000000e+00
## p 0.6612416 0.2735133 0.05656748 0.007799448 0.0008065332 6.672223e-05
Now we can overlap the graphs:
ggplot(goals.p, aes(x=goals, y=..density..)) +
geom_histogram(bins = goalRange, colour = 'white', alpha=.8) +
geom_point(data = pois, mapping = aes(x = x, y = p), size = 3, colour='firebrick') + # points
geom_line(data = pois, mapping = aes(x = x, y = p), colour='firebrick') + # lines
scale_x_continuous(n.breaks = goalRange) +
theme_minimal()
Earthquakes
## Date Time Latitude Longitude Type Depth Depth.Error
## 1 01/02/1965 13:44:18 19.246 145.616 Earthquake 131.6 NA
## 2 01/04/1965 11:29:49 1.863 127.352 Earthquake 80.0 NA
## 3 01/05/1965 18:05:58 -20.579 -173.972 Earthquake 20.0 NA
## 4 01/08/1965 18:49:43 -59.076 -23.557 Earthquake 15.0 NA
## 5 01/09/1965 13:32:50 11.938 126.427 Earthquake 15.0 NA
## 6 01/10/1965 13:36:32 -13.405 166.629 Earthquake 35.0 NA
## Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1 NA 6.0 MW NA
## 2 NA 5.8 MW NA
## 3 NA 6.2 MW NA
## 4 NA 5.8 MW NA
## 5 NA 5.8 MW NA
## 6 NA 6.7 MW NA
## Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Root.Mean.Square ID Source Location.Source Magnitude.Source
## 1 NA ISCGEM860706 ISCGEM ISCGEM ISCGEM
## 2 NA ISCGEM860737 ISCGEM ISCGEM ISCGEM
## 3 NA ISCGEM860762 ISCGEM ISCGEM ISCGEM
## 4 NA ISCGEM860856 ISCGEM ISCGEM ISCGEM
## 5 NA ISCGEM860890 ISCGEM ISCGEM ISCGEM
## 6 NA ISCGEM860922 ISCGEM ISCGEM ISCGEM
## Status
## 1 Automatic
## 2 Automatic
## 3 Automatic
## 4 Automatic
## 5 Automatic
## 6 Automatic
## Warning: 3 failed to parse.
# eqs <- eqs %>% mutate(Date = mdy(Date))
# year(eqs$Date)
# month(eqs$Date)
# paste0(year(eqs$Date), '-', month(eqs$Date))
eqs <- eqs %>% mutate(YearMonth = paste0(year(Date), '-', month(Date)),
DateTime = as_datetime(paste0(Date,' ',Time)))
head(eqs)
## Date Time Latitude Longitude Type Depth Depth.Error
## 1 1965-01-02 13:44:18 19.246 145.616 Earthquake 131.6 NA
## 2 1965-01-04 11:29:49 1.863 127.352 Earthquake 80.0 NA
## 3 1965-01-05 18:05:58 -20.579 -173.972 Earthquake 20.0 NA
## 4 1965-01-08 18:49:43 -59.076 -23.557 Earthquake 15.0 NA
## 5 1965-01-09 13:32:50 11.938 126.427 Earthquake 15.0 NA
## 6 1965-01-10 13:36:32 -13.405 166.629 Earthquake 35.0 NA
## Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1 NA 6.0 MW NA
## 2 NA 5.8 MW NA
## 3 NA 6.2 MW NA
## 4 NA 5.8 MW NA
## 5 NA 5.8 MW NA
## 6 NA 6.7 MW NA
## Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Root.Mean.Square ID Source Location.Source Magnitude.Source
## 1 NA ISCGEM860706 ISCGEM ISCGEM ISCGEM
## 2 NA ISCGEM860737 ISCGEM ISCGEM ISCGEM
## 3 NA ISCGEM860762 ISCGEM ISCGEM ISCGEM
## 4 NA ISCGEM860856 ISCGEM ISCGEM ISCGEM
## 5 NA ISCGEM860890 ISCGEM ISCGEM ISCGEM
## 6 NA ISCGEM860922 ISCGEM ISCGEM ISCGEM
## Status YearMonth DateTime
## 1 Automatic 1965-1 1965-01-02 13:44:18
## 2 Automatic 1965-1 1965-01-04 11:29:49
## 3 Automatic 1965-1 1965-01-05 18:05:58
## 4 Automatic 1965-1 1965-01-08 18:49:43
## 5 Automatic 1965-1 1965-01-09 13:32:50
## 6 Automatic 1965-1 1965-01-10 13:36:32
world <- map_data("world")
ggplot(data = world, aes(x=long, y=lat, group=group)) +
geom_polygon(fill = "white", colour = "black") +
geom_point(data=subset(eqs,Date>'2010-01-01'), aes(x=Longitude, y=Latitude, group=1, size = Magnitude), colour='firebrick', alpha=.1)+
coord_quickmap() +
theme_void()
Continuous Distributions
References
- https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/
- Data Visualization with R
- https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/
- Treemapify Documentation2
- Visualizing a Categorical Variable
- Waffle Chart
- GGPlot Colors Best Tricks You Will Love