Lab 5: Fitting Distributions



Introduction

In this tutorial you will learn how to determine the distribution of your data and compare distributions. By the end of the tutorial you will be able to:

  • identify different distributions,
  • calculate the four moments,
  • compare distributions across samples or against ideal ones,
  • draw quantile plots and qqplots.

Getting Started

We will visualize two financial series and an earthquakes dataset. To be able to follow along you need:

  1. to download the earthquakes.csv and returns.csv files uploaded on LEARN
  2. some packages, including:
  • tidyverse
  • moments
  • qqtest
  • plotly (takes time to install)
library('tidyverse')
theme_set(theme_minimal())

Reading and Preprocessing Data

Let’s read the returns and earthquakes datasets as below:

returns <- read.csv('data/returns.csv')
tail(returns)
##         X       Date          MSFT          AAPL           DJI          S.P
## 8618 8618 2020-05-21 -0.0120112623 -0.0074554551 -0.0041415065 -0.007773596
## 8619 8619 2020-05-22  0.0004361446  0.0064384061 -0.0003660595  0.002353711
## 8620 8620 2020-05-26 -0.0105715659 -0.0067735078  0.0216613836  0.012289184
## 8621 8621 2020-05-27  0.0013217546  0.0043569411  0.0221307356  0.014827297
## 8622 8622 2020-05-28 -0.0022551235  0.0004401465 -0.0057784307 -0.002107915
## 8623 8623 2020-05-29  0.0101984899 -0.0009740707 -0.0006901893  0.004812336
##              Coca        Pepsi
## 8618 -0.015689715 -0.007999299
## 8619 -0.003099380  0.002073489
## 8620  0.023539885 -0.005594697
## 8621  0.013885875  0.008169541
## 8622  0.007703831  0.011314082
## 8623 -0.008706732 -0.005593696

The problem with financial series is that outliers reduce the quality of the visuals. For example, market crashes occur with low probability, but when they happen the daily return can be, say, -35%. In the below plot there are two extreme events that squeeze the plot:

To remove the extreme events on the edges, we will discard everything left of the 1% quantile of the data and right of the 99% quantile, keeping ser < quantile(ser,0.99) & ser > quantile(ser,0.01):

# quantile(returns$DJI, 0.9999)

returns <- returns[, -c(1,2)] # drop the row index and Date columns
remove_outliers <- function(ser){
  # keep only values strictly between the 1% and 99% quantiles
  ser[ ser < quantile(ser,0.99) & ser > quantile(ser,0.01) ]
}

returns <- apply(returns, 2, remove_outliers)
returns <- data.frame(returns)

dim(returns)
## [1] 8449    6
head(returns)
##          MSFT         AAPL          DJI           S.P         Coca        Pepsi
## 1  0.03571208  0.055556561  0.022255692  0.0144088553 -0.020689655  0.039634249
## 2  0.01725028 -0.004784381 -0.008857952 -0.0079476008  0.004694986 -0.019061631
## 3 -0.02543175  0.033653825  0.007321911  0.0047300507  0.007009345  0.005979122
## 4 -0.01739026 -0.013954671 -0.001072728 -0.0007633938 -0.002320482 -0.002971792
## 5 -0.02654705  0.066037776  0.009111015  0.0039897580  0.014510277 -0.007451517
## 6 -0.02727101 -0.022122490 -0.019775601 -0.0135283550 -0.008343189 -0.030030054

The second dataset we will use is the earthquakes data:

eqs <- read.csv('data/earthquake.csv')
str(eqs)
## 'data.frame':    23412 obs. of  21 variables:
##  $ Date                      : chr  "01/02/1965" "01/04/1965" "01/05/1965" "01/08/1965" ...
##  $ Time                      : chr  "13:44:18" "11:29:49" "18:05:58" "18:49:43" ...
##  $ Latitude                  : num  19.25 1.86 -20.58 -59.08 11.94 ...
##  $ Longitude                 : num  145.6 127.4 -174 -23.6 126.4 ...
##  $ Type                      : chr  "Earthquake" "Earthquake" "Earthquake" "Earthquake" ...
##  $ Depth                     : num  132 80 20 15 15 ...
##  $ Depth.Error               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Depth.Seismic.Stations    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude                 : num  6 5.8 6.2 5.8 5.8 6.7 5.9 6 6 5.8 ...
##  $ Magnitude.Type            : chr  "MW" "MW" "MW" "MW" ...
##  $ Magnitude.Error           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude.Seismic.Stations: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Azimuthal.Gap             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Distance       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Error          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Root.Mean.Square          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ID                        : chr  "ISCGEM860706" "ISCGEM860737" "ISCGEM860762" "ISCGEM860856" ...
##  $ Source                    : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Location.Source           : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Magnitude.Source          : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Status                    : chr  "Automatic" "Automatic" "Automatic" "Automatic" ...

The dates are read as character strings; we need them as proper dates. Besides, we will need a Datetime column to calculate the time difference between two earthquakes, and a YearMonth column to be able to count the number of events in each month:

library('lubridate')

eqs$Date      <- as.Date(eqs$Date, format = '%m/%d/%Y')

eqs <- mutate(eqs, 
              Datetime  = as_datetime(paste0(Date, ' ', Time)),
              YearMonth = paste0(year(Date), '-', month(Date)))

dim(eqs)
## [1] 23412    23
head(eqs)
##         Date     Time Latitude Longitude       Type Depth Depth.Error
## 1 1965-01-02 13:44:18   19.246   145.616 Earthquake 131.6          NA
## 2 1965-01-04 11:29:49    1.863   127.352 Earthquake  80.0          NA
## 3 1965-01-05 18:05:58  -20.579  -173.972 Earthquake  20.0          NA
## 4 1965-01-08 18:49:43  -59.076   -23.557 Earthquake  15.0          NA
## 5 1965-01-09 13:32:50   11.938   126.427 Earthquake  15.0          NA
## 6 1965-01-10 13:36:32  -13.405   166.629 Earthquake  35.0          NA
##   Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1                     NA       6.0             MW              NA
## 2                     NA       5.8             MW              NA
## 3                     NA       6.2             MW              NA
## 4                     NA       5.8             MW              NA
## 5                     NA       5.8             MW              NA
## 6                     NA       6.7             MW              NA
##   Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1                         NA            NA                  NA               NA
## 2                         NA            NA                  NA               NA
## 3                         NA            NA                  NA               NA
## 4                         NA            NA                  NA               NA
## 5                         NA            NA                  NA               NA
## 6                         NA            NA                  NA               NA
##   Root.Mean.Square           ID Source Location.Source Magnitude.Source
## 1               NA ISCGEM860706 ISCGEM          ISCGEM           ISCGEM
## 2               NA ISCGEM860737 ISCGEM          ISCGEM           ISCGEM
## 3               NA ISCGEM860762 ISCGEM          ISCGEM           ISCGEM
## 4               NA ISCGEM860856 ISCGEM          ISCGEM           ISCGEM
## 5               NA ISCGEM860890 ISCGEM          ISCGEM           ISCGEM
## 6               NA ISCGEM860922 ISCGEM          ISCGEM           ISCGEM
##      Status            Datetime YearMonth
## 1 Automatic 1965-01-02 13:44:18    1965-1
## 2 Automatic 1965-01-04 11:29:49    1965-1
## 3 Automatic 1965-01-05 18:05:58    1965-1
## 4 Automatic 1965-01-08 18:49:43    1965-1
## 5 Automatic 1965-01-09 13:32:50    1965-1
## 6 Automatic 1965-01-10 13:36:32    1965-1
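
As a quick sketch (not part of the original lab code), the YearMonth column lets us count the number of recorded events per month:

# Hypothetical quick check: events per YearMonth (output not shown)
eqs %>% count(YearMonth) %>% head()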

Continuous Distributions

There are several continuous distributions, each suited to a different kind of random variable:

Normal

\[X \sim N(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12(\frac{x-\mu}{\sigma})^2}\]

Exponential

\[X \sim exp(\lambda) \Rightarrow f(x) = \lambda e^{-\lambda x}\]

Uniform
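
\[X \sim U(a,b) \Rightarrow f(x) = \frac{1}{b-a}, \quad a \le x \le b\]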

Log-Normal
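
\[X \sim LogN(\mu,\sigma^2) \Rightarrow f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac12(\frac{\ln x-\mu}{\sigma})^2}\]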

Beta
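
\[X \sim Beta(\alpha,\beta) \Rightarrow f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 \le x \le 1\]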

Gamma
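
\[X \sim Gamma(\alpha,\lambda) \Rightarrow f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}\]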

Weibull
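
\[X \sim Weibull(k,\lambda) \Rightarrow f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}\]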

Quantile Plots - Chef’s Knife

One visual can give all the information we need, though it is a bit harder to read. A quantile plot is drawn by sorting the data and then assigning order IDs (probability points), numbers ranging between 0 and 1. On the x axis we plot the probability points, and on the y axis we plot the values of these points:

set.seed(156)
normal  <- data.frame(vals = rnorm(100, 0, 1))

ggplot(normal) + 
  geom_histogram(aes(y=vals, x = -..density..), alpha = .7, colour = 'white', bins=20) +
  geom_point(aes(y=sort(vals), x=ppoints(vals)), colour='firebrick', alpha = .2, size =3 ) +
  geom_boxplot(aes(y=vals,x = 1.25), width=.25)  + 
  labs(x='Probability Points', y='Sample Quantile')

The regions where the quantile plot is flat are the regions with more data points, where the distribution piles up. Where it is steep (e.g. the bottom and top parts), there are few data points.

Unlike with a histogram, in a quantile plot we can locate exactly where the median, the min/max values, and the 1st and 3rd quartiles are. The median is the point whose x-axis value equals 0.5; here it is the point whose y-axis value is slightly below 0. We can also locate the first and third quartiles, which are where x is 0.25 and 0.75. The distribution exposes the signature normal curve, which means it is a normal-like distribution.

Let’s verify these values:

summary(normal$vals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.88463 -0.66262 -0.01979 -0.04263  0.69443  1.95145
set.seed(156)
logNormal <- data.frame(vals = rlnorm(200)) 
ggplot(logNormal) + 
  geom_histogram(aes(y=vals,x = -..density..), alpha = .7, colour = 'white', bins=20) +
  geom_point(aes(y=sort(vals), x=ppoints(vals)), colour='steelblue', alpha = .2, size =3 ) +
  geom_boxplot(aes(y=vals,x = 1.25), width=.25) + 
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

Here the distribution is right-skewed because the points form a flat line at the bottom of the plot and are steep (and sparse) at the top. The median is around 1, the first quartile is around 0.5, and the third quartile is around 2. The maximum value is ~14.
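
As a quick check (not in the original lab code), quantile() confirms these eyeballed values:

# Numeric check of the quartiles and the maximum read off the plot
quantile(logNormal$vals, c(0.25, 0.5, 0.75))
max(logNormal$vals)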

set.seed(156)
uniform <- data.frame(vals = runif(200)) 
ggplot(uniform) + 
  geom_histogram(aes(y=vals,x = -..density..), alpha = .7, colour = 'white', bins=20) +
  geom_point(aes(y=sort(vals), x=ppoints(vals)), colour='darkolivegreen3', alpha = .1, size =3 ) +
  geom_boxplot(aes(y=vals,x = 1.25), width=.25) + 
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

The plot exposes a line rather than a curve. This is the signature of the uniform distribution: it is not flatter or steeper anywhere.

set.seed(157)
betaDist <- data.frame(vals = rbeta(200, 6, 1))
ggplot(betaDist) + 
  geom_point(aes(y=sort(vals), x=seq(0,1, length.out = length(vals))), colour='orange', alpha = .2, size =3 ) +
  geom_boxplot(aes(y=vals,x = 1.25), width=.25) + 
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

The distribution is left-skewed because the points are sparse at the bottom and flat at the top.

ggplot(iris, aes(y=Petal.Length)) + 
  geom_point(aes(y=sort(Petal.Length), x=seq(0,1, length.out = length(Petal.Length))), colour='darkolivegreen3',
             alpha = .3, size =3 ) +
  geom_histogram(aes(x = -..density..), alpha = .7, colour = 'white', bins=15) +
  geom_boxplot(aes(x = 1.25), width=.25) + 
  labs(x = 'Probability Points', y= 'Sample Quantiles', title='Iris data petal lengths')  

One advantage of quantile plots is that they show the granularity of the data and do not hide multimodal distributions. The boxplot fails here, as it always assumes the distribution is unimodal. In a quantile plot we can see the two modes and the median of each mode.
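
For instance, here is a minimal sketch (simulated data, not from the lab) of a two-component normal mixture; its quantile plot shows two flat regions, one per mode, with a steep jump in between:

set.seed(1)
# Mixture of two normals centred at -2 and 2: a bimodal sample
bimodal <- data.frame(vals = c(rnorm(100, -2, .5), rnorm(100, 2, .5)))
ggplot(bimodal) +
  geom_point(aes(y = sort(vals), x = ppoints(vals)), colour = 'grey40', alpha = .3, size = 3) +
  labs(x = 'Probability Points', y = 'Sample Quantiles')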

set.seed(156)

ind <- seq(0,1,length.out = 1000)
binom     <- data.frame(ind=ind, X = sort(rbinom(1000, 50,.95)))
geom      <- data.frame(ind=ind, X = sort(rgeom(1000, .4)))
pois      <- data.frame(ind=ind, X = sort(rpois(1000, 3.22)))
normal    <- data.frame(ind=ind, X = sort(rnorm(1000)))
logNormal <- data.frame(ind=ind, X = sort(exp(normal$X)))
betaDist  <- data.frame(ind=ind, X = sort(rbeta(1000, 3, 3)))


dists <- data.frame(rbind(normal,logNormal,betaDist, binom,geom,pois))
dists$Distribution <- unlist(lapply(c('N(0,1)','LogNormal(0,1)','Beta(3,3)','Binom(50,0.95)','Geom(0.4)','Poisson(3.22)'), function(n) rep(n,1000)))
dists$Distribution <- factor(dists$Distribution, levels = c('N(0,1)','LogNormal(0,1)','Beta(3,3)','Binom(50,0.95)','Geom(0.4)','Poisson(3.22)'))

ggplot(dists,aes(y=X, x=ind, colour=Distribution)) + 
  geom_point(alpha = .3, size = 3) +
  facet_wrap( ~ Distribution, ncol = 3, scales = 'free') +
  theme(legend.position = 'none') + 
  ggtitle('Quantile Plots of Different Distributions')

Identifying Distributions

Moments - Measures for Distribution Characteristics

There are two moments that are important in life. The moment you are born and the moment you find out why… Never mind.

There are four moments that are important for summarizing the characteristics of distributions. These are the first, second, third, and fourth moments, which correspond to:

  1. Mean: The average
  2. Variance: Square of standard deviation
  3. Skewness: The asymmetry
  4. Kurtosis: Thickness of tails
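
More precisely, the skewness and kurtosis reported below are the standardized third and fourth moments, where \(\mu\) and \(\sigma\) are the mean and standard deviation:

\[\text{Skewness} = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right], \qquad \text{Kurtosis} = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]\]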

You will learn about moments, moment generating functions, etc. in detail later. Base R doesn’t compute skewness and kurtosis for you, so we will use the moments package.

The moments take values specific to each distribution. For example, the skewness and kurtosis of the Normal are 0 and 3 respectively. For other distributions you will sometimes see excess kurtosis, which is kurtosis benchmarked against that of the normal distribution (i.e. kurtosis minus 3):

  • skewness and kurtosis of the Normal are 0 and 3 respectively.
  • skewness and ex. kurtosis of the Exponential are 2 and 6 respectively.
  • skewness and ex. kurtosis of the Uniform are 0 and -6/5 (no kidding) respectively.
  • skewness and ex. kurtosis of the Poisson are \(\lambda^{-1/2}\) and \(\lambda^{-1}\) respectively.

You can check these on Wikipedia by searching for the distribution; the table on the right-hand side summarizes these properties.

We will now generate random variables and calculate their moments. We also define a fourmoments function to summarize the four moments:

library('moments')
set.seed(156)
dists <- data.frame('Normal' = rnorm(1000), 
                    'Exponential' = rexp(1000,3), 
                    'Uniform' = runif(1000), 
                    'Beta' = rbeta(1000, 6, 4))

fourmoments <- function(rv){
  c('Mean' = mean(rv), 
    'Variance' = var(rv),
    'Skewness' = skewness(rv),
    'Kurtosis' = kurtosis(rv))
}

fourmoments(dists$Normal)
##        Mean    Variance    Skewness    Kurtosis 
## -0.01634422  1.00645313 -0.04879710  2.95845771

We will use apply to apply the function to all columns:

apply(dists, 2, fourmoments)
##               Normal Exponential    Uniform        Beta
## Mean     -0.01634422   0.3266587 0.50267007  0.60474814
## Variance  1.00645313   0.1046882 0.08356373  0.02025603
## Skewness -0.04879710   2.1312615 0.02188583 -0.26241843
## Kurtosis  2.95845771  10.1662499 1.79866096  2.78964164

The four moments are close to the ideal ones, but there are small deviations. For example, the kurtosis of the Normal is 2.95 as opposed to 3, whereas the Uniform’s is 1.8, exactly the ideal value: 3 - 6/5 = 1.8. The deviation is much larger for the Exponential: the kurtosis of the ideal distribution is 9 (or excess kurtosis 6) whereas it was calculated as 10.17. This is not a big difference, because a small number of unlikely events can make your data leptokurtic (thick-tailed).

If we generate larger samples, the moment estimates will get closer to the ideal ones:

set.seed(157)
dists <- data.frame('Normal' = rnorm(10000),
                    'Exponential' = rexp(10000,3),
                    'Uniform' = runif(10000),
                    'Beta' = rbeta(10000, 6, 4))

apply(dists,2,fourmoments)
##                Normal Exponential     Uniform        Beta
## Mean     -0.009463987   0.3343949  0.50011903  0.59707382
## Variance  1.005879410   0.1141077  0.08367487  0.02166831
## Skewness -0.004849413   2.0193060 -0.01112597 -0.17851546
## Kurtosis  2.933946209   8.9445674  1.80793466  2.55468255
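
As a quick sketch of this convergence (not part of the lab), the sample kurtosis of Exp(3) data approaches the ideal value of 9 as the sample size grows:

# Sample kurtosis of Exp(3) draws for increasing sample sizes; ideal value is 9
set.seed(158)
sapply(c(100, 1000, 10000, 100000), function(n) kurtosis(rexp(n, 3)))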

Quantile Comparisons

Now, let’s use this wisdom to see which stock is more likely to yield higher or lower returns.

library('gridExtra') # needed for grid.arrange below

g1 <- ggplot(returns) + 
  geom_histogram(aes(y = AAPL, x=..count..), bins = 20, colour = 'white', fill = 'firebrick') +
  geom_histogram(aes(y = MSFT, x=-..count..), bins = 20, colour = 'white', fill = 'steelblue') +
  labs(y = "Returns") + 
  annotate('text',x = -1000, y = 0.05, label='MSFT') +
  annotate('text',x = 1000, y = 0.05, label='AAPL')

g2 <- ggplot(returns) +
  geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
  geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)),  colour = 'steelblue', alpha=.2) +
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

grid.arrange(g1,g2,ncol=2)

On the left we can see the two distributions back to back. Microsoft returns are more compressed towards the center, whereas Apple returns have thicker tails. The same can be read from the quantile plot: between returns of -0.02 and 0, Microsoft’s quantiles are flatter than Apple’s, while between -0.04 and -0.02 Apple’s are flatter. More simply, Apple’s curve lies farther from the center than Microsoft’s on both tails, and the same holds for returns greater than 0. Apple is riskier than Microsoft.

ggplot(returns) +
  geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
  geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)),  colour = 'steelblue', alpha=.2) +
  geom_point(aes(y = sort(Coca), x=ppoints(Coca)),  colour = 'orange', alpha=.2) +
  geom_point(aes(y = sort(Pepsi), x=ppoints(Pepsi)),  colour = 'purple', alpha=.2) +
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

Pepsi and the Coca-Cola group are less risky than the tech companies, and their distributions are quite close to each other. What do the moments say?

round(apply(returns, 2, fourmoments) , 4)
##            MSFT   AAPL     DJI     S.P   Coca  Pepsi
## Mean     0.0011 0.0011  0.0004  0.0004 0.0005 0.0005
## Variance 0.0003 0.0005  0.0001  0.0001 0.0002 0.0002
## Skewness 0.1583 0.1245 -0.1670 -0.1900 0.1177 0.0708
## Kurtosis 3.7222 3.6565  3.9586  4.0470 3.6895 3.7874

On average, the tech companies have higher returns. Their variance (risk squared) is also higher, with Apple’s risk being the highest. All the individual stocks are right-skewed (the bulk of the returns sits to the left of a longer right tail), while the indices are slightly left-skewed. All the series are leptokurtic; the tails are thick.

The sample size is large. In practice we don’t always have this much data. Let’s check the last 300 days (roughly the last year of trading):

g1 <- ggplot(tail(returns,300)) + 
  geom_histogram(aes(y = AAPL, x=..count..), bins = 20, colour = 'white', fill = 'firebrick')+
  geom_histogram(aes(y = MSFT, x=-..count..), bins = 20, colour = 'white', fill = 'steelblue') +
  labs(y = "Returns") + 
  annotate('text',x = -50, y = 0.05, label='MSFT') +
  annotate('text',x = 50, y = 0.05, label='AAPL')

g2 <- ggplot(tail(returns,300)) +
  geom_point(aes(y = sort(AAPL), x=ppoints(AAPL)), colour = 'firebrick', alpha=.2) +
  geom_point(aes(y = sort(MSFT), x=ppoints(MSFT)),  colour = 'steelblue', alpha=.2) 

grid.arrange(g1,g2,ncol=2)

Apparently Apple continues to be risky. On the left tail (negative values), Apple’s and Microsoft’s plots mostly overlap, but Apple has more negative outliers. On the right tail (positive values), Apple’s tail is above Microsoft’s. This means Apple was more likely than Microsoft to yield higher returns at the end of 2019 and the beginning of 2020.

Now, let’s focus on Microsoft. Is it Normal? How can we verify this? To check, we will compare its quantiles against those of a Normal distribution with the same mean and standard deviation:

series   <- tail(returns$MSFT, 300) 
avgRet   <- mean(series) # mean of MSFT
sdRet    <- sd(series) # sd of MSFT
probs    <- ppoints(series) # the same x axis for both
q.normal <- qnorm(probs, avgRet, sdRet) # normal quantiles
q.sample <- sort(series) # MSFT's quantiles


ggplot(tail(returns,300)) + 
  geom_point(aes(y = q.sample, x=probs) ,colour = 'steelblue', alpha = .2) +
  geom_line (aes(y = q.normal, x=probs)) +
  labs(x = 'Probability Points', y= 'Sample Quantiles') 

The black line is the ideal distribution. We can see that for positive values the blue points closely follow the black line but always stay below it; this means the stock returns less than the normal on the right tail. For negative values it deviates more: in the bottom part of the plot (between -0.050 and -0.025) the blue points are higher than the black line, which means that extreme negative returns are more likely than under the normal.

qqplots

The above plot shows whether the sample distribution looks like the ideal one. Another useful approach is to plot one set of quantiles on the x axis and the other on the y axis. Why? Because if the two follow pretty much the same distribution, the result will be the x = y line, or at least a straight line.

This plot is known as a quantile-quantile (qq) plot:

series   <- tail(returns$MSFT,300)
probs    <- ppoints(series)
q.normal <- qnorm(probs)
q.sample <- sort(series)

ggplot() +
  geom_point(aes(y = q.sample, x= q.normal) ,colour = 'steelblue', alpha = .2) +
  geom_abline(aes(slope=sd(series), intercept = mean(series)))

The good news is that we don’t have to write much code for a qqplot if we want to check normality. We can use qqnorm and qqline as below:

qqnorm(tail(returns$MSFT, 300), col=adjustcolor('steelblue', .3), pch=16)
qqline(tail(returns$MSFT, 300))

The points deviate from the black line on the tails, which means the series doesn’t behave like a normal on either tail. In the middle of the plot, though, the points are pretty much normal.

qqtest Package

Our Statistics faculty is working hard to help you engineers visualize things. Here is a package written by a University of Waterloo professor.

In the previous plot we saw points that deviated: some fell far from the line, some close; some were just a few, some quite a few. We cannot expect the points to overlap the line perfectly, but what is a tolerable deviation? I wish I had an envelope for the quantiles of my plot… This can be done using the qqtest package.

library('qqtest')
qqtest(tail(returns$MSFT,300),dist = 'normal')

The left-tail points deviate beyond the tolerable range. This means we can reject that the Microsoft daily return series is normal, especially given its left-tail behaviour. The right tail is still inside the envelope.

In theory, if \(X\) is standard normal, then \(X^2\) must follow a chi-squared distribution. To check whether this applies to our data, we can square the deviations from the mean (here the mean is close to zero, so we simply square the series) and check whether the result is chi-squared:

series <- tail(returns$MSFT,300)
sq     <- series^2

qqtest(sq, dist='chi-squared')

It is hard to say it is chi-squared.

qqtest(sq, dist='exponential')

The data is not exponential either.

qqtest(sq, dist='log-normal')

Got it! It is more like a log-normal than the others. The reason is those tails: the returns are not normal, so the squares are not chi-squared.

Earthquake Data

Days Passed

Now, let’s examine the earthquake dataset. In theory, the time passing between two such events should be exponential. Does this apply to our data? First, we will remove the earthquakes whose magnitude is less than some level, say 6.5, and then calculate the time difference between consecutive events:

temp <- eqs %>% subset(Magnitude > 6.5 )

dayPassed <- diff(temp$Datetime) / (3600*24) # seconds passed divided by number of seconds in a day
dayPassed <- as.numeric(dayPassed) # converted a date-time object to numeric

ggplot() +
  geom_histogram(aes(x = dayPassed), colour = 'black', alpha=.2) 

The shape looks like the signature of an exponential.

library('qqtest')
qqtest(dayPassed, dist='exponential')

Yeah, that’s what it looks like.

fourmoments(dayPassed)
##       Mean   Variance   Skewness   Kurtosis 
##  10.324827 134.940344   2.160754  10.042498

The qqplot with envelope also indicates strong evidence that the distribution is exponential. Besides, the skewness and kurtosis are 2.16 and 10.04, close to the ideal values of 2 and 9.

library('plotly')
rate <- 1/mean(dayPassed)

probs    <- ppoints(dayPassed)
q.exp    <- qexp(probs, rate)
q.wbl    <- qweibull(probs, .9, 1/rate)
q.gmm    <- qgamma(probs, 1.1, rate)
q.sample <- sort(dayPassed)

g <- ggplot() + 
  geom_point(aes(y = q.sample, x=probs), colour = 'steelblue', alpha = .2) +
  geom_line(aes(y = q.exp,   x=probs)) +
  geom_line(aes(y = q.wbl,   x=probs), colour = 'brown') +
  geom_line(aes(y = q.gmm,   x=probs), colour = 'orange') +
  labs(x = 'Probability Points', y= 'Sample Quantiles') 


ggplotly(g)

The Weibull follows the data closely except at the tails, where the gamma fits better.
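
Instead of hand-picking the shape parameters as above, we could estimate them by maximum likelihood. Here is a minimal sketch, assuming the MASS package (not used elsewhere in this lab); zero gaps, if any, must be dropped first since both densities require strictly positive values:

library('MASS') # note: MASS masks dplyr::select
dp <- dayPassed[dayPassed > 0] # both fits need strictly positive values
fitdistr(dp, 'weibull')
fitdistr(dp, 'gamma')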

Magnitude

What about the magnitudes of each earthquake?

subdata <- subset(eqs, Type == 'Earthquake' & Magnitude >=6.5)

ggplot(subdata) +
  geom_histogram(aes(x = Magnitude), colour='black', alpha=.2) 

qqtest(subdata$Magnitude, dist='exponential')

The magnitudes of earthquakes over 6.5 seem to be exponentially distributed.

Deliverables (For Second lab report)

This week’s deliverables are to be submitted with the second lab report.

1. Distribution Characteristics

  1. Choose a continuous variable (not discrete) and plot the histogram:
  • change fill and/or colour to make your plot fancier
  • interpret the graph: is it left- or right-skewed?
  • are the tails irregularly thick in certain ranges?
  • can you locate where the first and third quartiles are? What about the median?
  2. Plot the quantile plot of the same data.
  • Is it left- or right-skewed? How would you tell?
  • Can you locate where the first and third quartiles are? What about the median?
  3. Calculate the four moments and interpret them.
  • Is it left- or right-skewed? Why?
  • Is it platykurtic (thinner tails than normal) or leptokurtic (thicker than normal)?

2. Fitting Distributions

Find a right-skewed or symmetric variable. You can plot the histogram for insight. If your data is left-skewed, either choose another variable or transform your data using transformation functions (which we will see later, in week 6 or 7). Fit a distribution to the same continuous variable:

  • Use qqtest to check whether your data is normal, lognormal, chi-squared or exponential.
  • Plot the sample quantiles and the ideal quantiles against probability points (probs). For the ideal quantiles, generate quantiles using qnorm, qlnorm, qchisq or qexp depending on the distribution you found in the previous step, and add them as a second line in the quantile plot.