Lab 4: Discrete and Continuous Distributions

Introduction

In this tutorial, you will learn discrete and continuous distributions, how they look like and how to obtain them. We will be visualizing random generated numbers and compare them with the real distributions from the data. By the end of the tutorial you will be able to:

visually identify certain distributions,
identify the distribution of a variable in a function.

Getting Started

We will work with three datasets. To be able to follow you need:

to download earthquake.csv, game_goals.csv and returns.csv data uploaded on LEARN.
some packages including

tidyverse
lubridate

Now we can read the datasets and have a closer look

library('tidyverse')
goals.p  <- read.csv('data/game_goals.csv')
eq       <- read.csv('data/earthquake.csv')
returns <- read.csv('data/returns.csv')

dim(goals.p)

## [1] 49384    25

head(goals.p)

##          player season rank       date game_num    age team   at opp location
## 1 Alex Ovechkin   2006    1 2005-10-05        1 20-018  WSH <NA> CBJ     Home
## 2 Alex Ovechkin   2006    2 2005-10-07        2 20-020  WSH <NA> ATL     Home
## 3 Alex Ovechkin   2006    3 2005-10-08        3 20-021  WSH    @ ATL     Away
## 4 Alex Ovechkin   2006    4 2005-10-10        4 20-023  WSH <NA> NYR     Home
## 5 Alex Ovechkin   2006    5 2005-10-12        5 20-025  WSH    @ CAR     Away
## 6 Alex Ovechkin   2006    6 2005-10-13        6 20-026  WSH <NA> NYI     Home
##   outcome goals assists points plus_minus penalty_min goals_even
## 1       W     2       0      2          1           2          1
## 2       L     0       1      1         -2           0          0
## 3       L     0       1      1          0           4          0
## 4       W     1       0      1          1           2          0
## 5       L     1       0      1          0           0          1
## 6       L     0       1      1         -1           0          0
##   goals_powerplay goals_short goals_gamewinner assists_even assists_powerplay
## 1               1           0                0           NA                NA
## 2               0           0                0           NA                NA
## 3               0           0                0           NA                NA
## 4               1           0                1           NA                NA
## 5               0           0                0           NA                NA
## 6               0           0                0           NA                NA
##   assists_short shots shot_percent
## 1            NA     5         40.0
## 2            NA     1          0.0
## 3            NA     3          0.0
## 4            NA     6         16.7
## 5            NA     6         16.7
## 6            NA     5          0.0

hist(goals.p$season)

length(unique(goals.p$player))

## [1] 42

Discrete Distributions

There are a couple of discrete distributions each are for different random variables:

Binomial Distribution

Probability of \(k\) success in \(n\) trials with success rate, \(p\):

\[X \sim Binom(n,k,p) \Rightarrow P(X=k) = \frac{n!}{(n-k)!k!} p^k(1-p)^{n-k}\]

Examples:

Number of wins in a pre-season of 8 games
Number of people attending to the flight (k) where there are 50 seats (n) and 95% attendance rate (p)

Geometric Distribution

Probability of success in \(k\)th trial with success rate of \(p\): \[X \sim Geom(k,p) \Rightarrow P(X=k) = p(1-p)^{k-1}\]

Examples:

Number of times you try the episode (k) in your favourite video game with 40% success rate.
Number of times you install and remove (k) Linux/Ubuntu to your computer before you get addicted to it with 20% success rate.
Number of times you propose marriag…

Poisson Distribution

Probability of a given number of events occurring in a given time interval or space.

\[X \sim Pois(k,\lambda) \Rightarrow P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]

Examples:`

Number of goals in a hockey match with goal rate (avg. goal) \(\lambda = 3.22\)
Number of typo-errors in the lab files that I share with error rate (average error), \(\lambda = 7.41\)
Number of people (k) to withdraw money from the ATM in University Plaza in a day with average \(\lambda = 220\)

Continuous Distributions

There are a couple of discrete distributions each are for different random variables:

Normal

\[X \sim N(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12(\frac{x-\mu}{\sigma})^2}\]

Exponential

\[X \sim exp(\lambda) \Rightarrow f(x) = \lambda e^{-\lambda x}\]

Uniform

Log-Normal

Beta

Gamma

Weibull

Identifying Discrete Distributions

Aggregating Data (Group By and Sum)

Last week we covered some ways to aggregate the data but just by counting. We will add one more function today.

The data contains scores per match per player, so if we sum the goals column by grouping the players we can obtain the players’ scorecard:

agg_goals <- group_by(goals.p, player) %>% 
  summarise(ngoals = sum(goals)) %>% 
  arrange(desc(ngoals))
head(agg_goals)

## # A tibble: 6 x 2
##   player        ngoals
##   <chr>          <int>
## 1 Wayne Gretzky    894
## 2 Jaromir Jagr     766
## 3 Brett Hull       741
## 4 Mike Gartner     708
## 5 Alex Ovechkin    701
## 6 Mark Messier     694

What about scoring rate per player?

agg_goals <- group_by(goals.p, player) %>% 
  summarise(ngoals = mean(goals)) %>% 
  arrange(desc(ngoals))
head(agg_goals)

## # A tibble: 6 x 2
##   player         ngoals
##   <chr>           <dbl>
## 1 Mario Lemieux   0.754
## 2 Alex Ovechkin   0.612
## 3 Wayne Gretzky   0.601
## 4 Brett Hull      0.584
## 5 Steven Stamkos  0.526
## 6 Mike Gartner    0.494

Number of goals per the date and team:

agg_goals <- group_by(goals.p, date, team) %>% summarise(ngoals = sum(goals))
tail(agg_goals)

## # A tibble: 6 x 3
## # Groups:   date [2]
##   date       team  ngoals
##   <chr>      <chr>  <int>
## 1 2020-02-25 TBL        0
## 2 2020-02-25 TOR        2
## 3 2020-02-25 WSH        1
## 4 2020-02-26 LAK        0
## 5 2020-02-26 PIT        0
## 6 2020-02-26 VEG        1

Note that our dataset contains only 42 players, the above doesn’t represent the real scores, but filters only the goals by the 42 players.

Both player’s total number of goals and average per game :

agg_goals <- group_by(goals.p, player) %>% 
  summarise(ngoals= sum(goals), rate = mean(goals)) %>% 
  arrange(desc(ngoals))
head(agg_goals)

## # A tibble: 6 x 3
##   player        ngoals  rate
##   <chr>          <int> <dbl>
## 1 Wayne Gretzky    894 0.601
## 2 Jaromir Jagr     766 0.442
## 3 Brett Hull       741 0.584
## 4 Mike Gartner     708 0.494
## 5 Alex Ovechkin    701 0.612
## 6 Mark Messier     694 0.395

Sometimes we need to group by just counting rows. For example in covid dataset each row corresponds to a person who is diagnosed to have the disease. Let’s group by City and Gender:

Discrete Numeric Variables

As we discussed above, probability of a given number of events occurring in a given time interval or space follows a Poisson distribution. Therefore, number of goals per player in a given match must be Poisson too with \(\lambda\) being the average goals per match:

mean(goals.p$goals)

## [1] 0.413636

In the data the average goal by a team is \(\lambda=\) 0.413636. On the other hand, the distribution of goals per game looks like as below. By default number of bins is 30, but we want one bin per each possible goal, i.e. max(goals) - min(goals) + 1.

goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
ggplot(goals.p, aes(x=goals)) + 
  geom_histogram(bins = goalRange, colour = 'white', alpha=.8) + 
  theme_minimal()

The above is counts. We will plot the pmf using y=..density..:

goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
ggplot(goals.p, aes(x=goals, y=..density..)) + 
  geom_histogram(bins = goalRange, colour = 'white', alpha=.8) + 
  scale_x_continuous(n.breaks = goalRange) +
  theme_minimal()

Do you think it is Poisson? Well, there is a way to verify this. We will add the ideal probabilities \(P(X = k)\) following a poisson distribution with average goal, \(\lambda=0.413\)on the graph:

\[X \sim Pois(k,\lambda) \Rightarrow P(X=k) = \frac{0.41^k e^{-0.41}}{k!}\]

goalRange <- max(goals.p$goals) - min(goals.p$goals) + 1
lambda    <- mean(goals.p$goals)

pois   <- data.frame(x = 0:max(goals.p$goals)) 
pois$p <- dpois(pois$x, lambda) # identical probabilities
t(pois)

##        [,1]      [,2]       [,3]        [,4]         [,5]         [,6]
## x 0.0000000 1.0000000 2.00000000 3.000000000 4.0000000000 5.000000e+00
## p 0.6612416 0.2735133 0.05656748 0.007799448 0.0008065332 6.672223e-05

Now we can overlap the graphs:

ggplot(goals.p, aes(x=goals, y=..density..)) + 
  geom_histogram(bins = goalRange, colour = 'white', alpha=.8) + 
  geom_point(data = pois, mapping = aes(x = x, y = p), size = 3, colour='firebrick') + # points
  geom_line(data = pois, mapping = aes(x = x, y = p), colour='firebrick') + # lines
  scale_x_continuous(n.breaks = goalRange) +
  theme_minimal()

Earthquakes

eqs <- read.csv('data/earthquake.csv')
head(eqs)

##         Date     Time Latitude Longitude       Type Depth Depth.Error
## 1 01/02/1965 13:44:18   19.246   145.616 Earthquake 131.6          NA
## 2 01/04/1965 11:29:49    1.863   127.352 Earthquake  80.0          NA
## 3 01/05/1965 18:05:58  -20.579  -173.972 Earthquake  20.0          NA
## 4 01/08/1965 18:49:43  -59.076   -23.557 Earthquake  15.0          NA
## 5 01/09/1965 13:32:50   11.938   126.427 Earthquake  15.0          NA
## 6 01/10/1965 13:36:32  -13.405   166.629 Earthquake  35.0          NA
##   Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1                     NA       6.0             MW              NA
## 2                     NA       5.8             MW              NA
## 3                     NA       6.2             MW              NA
## 4                     NA       5.8             MW              NA
## 5                     NA       5.8             MW              NA
## 6                     NA       6.7             MW              NA
##   Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1                         NA            NA                  NA               NA
## 2                         NA            NA                  NA               NA
## 3                         NA            NA                  NA               NA
## 4                         NA            NA                  NA               NA
## 5                         NA            NA                  NA               NA
## 6                         NA            NA                  NA               NA
##   Root.Mean.Square           ID Source Location.Source Magnitude.Source
## 1               NA ISCGEM860706 ISCGEM          ISCGEM           ISCGEM
## 2               NA ISCGEM860737 ISCGEM          ISCGEM           ISCGEM
## 3               NA ISCGEM860762 ISCGEM          ISCGEM           ISCGEM
## 4               NA ISCGEM860856 ISCGEM          ISCGEM           ISCGEM
## 5               NA ISCGEM860890 ISCGEM          ISCGEM           ISCGEM
## 6               NA ISCGEM860922 ISCGEM          ISCGEM           ISCGEM
##      Status
## 1 Automatic
## 2 Automatic
## 3 Automatic
## 4 Automatic
## 5 Automatic
## 6 Automatic

library('lubridate')

eqs$Date <- mdy(eqs$Date)

## Warning: 3 failed to parse.

# eqs <- eqs %>% mutate(Date = mdy(Date))

# year(eqs$Date)
# month(eqs$Date)
# paste0(year(eqs$Date), '-', month(eqs$Date))

eqs <- eqs %>% mutate(YearMonth = paste0(year(Date), '-', month(Date)),
                      DateTime  = as_datetime(paste0(Date,' ',Time)))
head(eqs)

##         Date     Time Latitude Longitude       Type Depth Depth.Error
## 1 1965-01-02 13:44:18   19.246   145.616 Earthquake 131.6          NA
## 2 1965-01-04 11:29:49    1.863   127.352 Earthquake  80.0          NA
## 3 1965-01-05 18:05:58  -20.579  -173.972 Earthquake  20.0          NA
## 4 1965-01-08 18:49:43  -59.076   -23.557 Earthquake  15.0          NA
## 5 1965-01-09 13:32:50   11.938   126.427 Earthquake  15.0          NA
## 6 1965-01-10 13:36:32  -13.405   166.629 Earthquake  35.0          NA
##   Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1                     NA       6.0             MW              NA
## 2                     NA       5.8             MW              NA
## 3                     NA       6.2             MW              NA
## 4                     NA       5.8             MW              NA
## 5                     NA       5.8             MW              NA
## 6                     NA       6.7             MW              NA
##   Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1                         NA            NA                  NA               NA
## 2                         NA            NA                  NA               NA
## 3                         NA            NA                  NA               NA
## 4                         NA            NA                  NA               NA
## 5                         NA            NA                  NA               NA
## 6                         NA            NA                  NA               NA
##   Root.Mean.Square           ID Source Location.Source Magnitude.Source
## 1               NA ISCGEM860706 ISCGEM          ISCGEM           ISCGEM
## 2               NA ISCGEM860737 ISCGEM          ISCGEM           ISCGEM
## 3               NA ISCGEM860762 ISCGEM          ISCGEM           ISCGEM
## 4               NA ISCGEM860856 ISCGEM          ISCGEM           ISCGEM
## 5               NA ISCGEM860890 ISCGEM          ISCGEM           ISCGEM
## 6               NA ISCGEM860922 ISCGEM          ISCGEM           ISCGEM
##      Status YearMonth            DateTime
## 1 Automatic    1965-1 1965-01-02 13:44:18
## 2 Automatic    1965-1 1965-01-04 11:29:49
## 3 Automatic    1965-1 1965-01-05 18:05:58
## 4 Automatic    1965-1 1965-01-08 18:49:43
## 5 Automatic    1965-1 1965-01-09 13:32:50
## 6 Automatic    1965-1 1965-01-10 13:36:32

world <- map_data("world")
ggplot(data = world, aes(x=long, y=lat, group=group)) + 
  geom_polygon(fill = "white", colour = "black") +
  geom_point(data=subset(eqs,Date>'2010-01-01'), aes(x=Longitude, y=Latitude, group=1, size = Magnitude), colour='firebrick', alpha=.1)+
  coord_quickmap() + 
  theme_void()

library('leaflet')
leaflet(eqs) %>%
  addTiles() %>%  # use the default base map which is OpenStreetMap tiles
  addCircleMarkers(lng=~Longitude, lat=~Latitude,  radius = ~Magnitude/5)

totalEqs <- eqs %>% group_by(YearMonth) %>% summarize(nEqs = n()) 

ggplot(subset(totalEqs, nEqs < 100), aes(x=nEqs)) + 
  geom_histogram(colour = 'white')

Continuous Distributions

ggplot(eqs, aes(x=Magnitude)) + 
  geom_histogram(colour='white')

timeElapsed <- diff(eqs$DateTime) / 3600
lambda <- mean(timeElapsed)
expDist <- dexp(0:250, rate = 1/as.numeric(lambda))


ggplot() + 
  geom_histogram(aes(x=timeElapsed,y=..density..), colour = 'white') +
  geom_line(aes(x=0:250,y=expDist), colour = 'firebrick')

Finance Data

returns <- read.csv('data/returns.csv')

ggplot(subset(returns, MSFT<.1& MSFT>-.1), aes(x=MSFT)) +
  geom_histogram(colour='white')

Lab 4: Discrete and Continuous Distributions