Lab 7: Data Transformation and Time Series

Introduction

The data comes in various shapes but never quite as you like when you want to plot the time series. The dates are read as characters, the columns have information but you need in rows etc. Another problem is, when you want to fit a model to your data, often you need it in certain forms, usually normal. In Statistics, you will often see the normality assumption.

In this lab we will cover both of these separately. We will first transform our data to satisfy normality, which will solve many of our issues that we will see later. Also, we will show how to plot time series, but also cross two hurdles between you and a time series plot, namely processing dates and changing the data structure.

In this tutorial you will learn how to:

transform the data using a parametric transformation method
preprocess the data (dates and shape)
play with date objects
plot time series

Getting Started

We will visualize house prices data, Covid Confirmed Cases dataset, and Maddison Project GDP Per Capita dataset. To be able to follow you need:

to download covid_confirmed.csv and maddison.csv data uploaded on LEARN
some packages including

tidyverse
lubridate
gganimate (takes effort to install)
plotly (takes time to install)

library('tidyverse')
theme_set(theme_minimal())

houses <- read.csv('data/house_subset.csv') # choose your directory!
covid  <- read.csv('data/covid19-download.csv')
head(covid)

##   pruid           prname             prnameFR       date update numconf numprob
## 1    35          Ontario              Ontario 2020-01-31     NA       3       0
## 2    59 British Columbia Colombie-Britannique 2020-01-31     NA       1       0
## 3     1           Canada               Canada 2020-01-31     NA       4       0
## 4    35          Ontario              Ontario 2020-02-08     NA       3       0
## 5    59 British Columbia Colombie-Britannique 2020-02-08     NA       4       0
## 6     1           Canada               Canada 2020-02-08     NA       7       0
##   numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1         0        3        NA        0         NA                        NA
## 2         0        1        NA        0         NA                        NA
## 3         0        4        NA        0         NA                        NA
## 4         0        3        NA        0         NA                        NA
## 5         0        4        NA       63         NA                        NA
## 6         0        7        NA       63         NA                        NA
##   ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1        NA        3         300      0.02          0              0
## 2        NA        1         100      0.02          0              0
## 3        NA        4         400      0.01          0              0
## 4        NA        0           0      0.02          0              0
## 5        12        3         300      0.08          0              0
## 6         2        3          75      0.02          0              0
##   percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1            0             NA            NA                NA           100
## 2            0             NA            NA                NA           100
## 3            0             NA            NA                NA           100
## 4            0             NA            NA                NA           100
## 5            0             NA            NA                NA           100
## 6            0             NA            NA                NA           100
##   numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1         3       0.02              NA               NA               NA
## 2         1       0.02              NA               NA               NA
## 3         4       0.01              NA               NA               NA
## 4         3       0.02              NA               NA               NA
## 5         4       0.08              NA               NA               NA
## 6         7       0.02              NA               NA               NA
##   ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1                NA             NA              NA              NA
## 2                NA             NA              NA              NA
## 3                NA             NA              NA              NA
## 4                NA             NA              NA              NA
## 5                NA             NA              NA              NA
## 6                NA             NA              NA              NA
##   ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1               NA             NA                 NA              NA
## 2               NA             NA                 NA              NA
## 3               NA             NA                 NA              NA
## 4               NA             NA                 NA              NA
## 5               NA             NA                 NA              NA
## 6               NA             NA                 NA              NA
##   avgratedeaths_last7 raterecovered
## 1                  NA             0
## 2                  NA             0
## 3                  NA             0
## 4                  NA             0
## 5                  NA             0
## 6                  NA             0

tail(covid)

##      pruid                 prname                  prnameFR       date update
## 7102    11   Prince Edward Island     Île-du-Prince-Édouard 2021-06-23      0
## 7103    60                  Yukon                     Yukon 2021-06-23      1
## 7104    61  Northwest Territories Territoires du Nord-Ouest 2021-06-23      1
## 7105    62                Nunavut                   Nunavut 2021-06-23      1
## 7106    99 Repatriated travellers       Voyageurs rapatriés 2021-06-23     NA
## 7107     1                 Canada                    Canada 2021-06-23     NA
##      numconf numprob numdeaths numtotal numtested numtests numrecover
## 7102     206      NA         0      206        NA   172743        206
## 7103     220      NA         3      220        NA     9129        125
## 7104     128      NA         0      128        NA    24548        128
## 7105     657      NA         4      657        NA    17497        653
## 7106      13      NA         0       13        NA       76         13
## 7107 1410927      NA     26175  1410927        NA 36344745    1375107
##      percentrecover ratetested ratetests numtoday percentoday ratetotal
## 7102         100.00         NA   1082180        0        0.00    129.05
## 7103          56.82         NA    217088        8        3.77    523.16
## 7104         100.00         NA    543566        0        0.00    283.43
## 7105          99.39         NA    444617        0        0.00   1669.50
## 7106         100.00         NA        NA        0        0.00        NA
## 7107          97.46         NA    956309      727        0.05   3712.45
##      ratedeaths numdeathstoday percentdeath numtestedtoday numteststoday
## 7102       0.00              0         0.00             NA           366
## 7103       7.13              0         1.36             NA             0
## 7104       0.00              0         0.00             NA             7
## 7105      10.16              0         0.61             NA            84
## 7106         NA              0         0.00             NA             0
## 7107      68.87             21         1.86             NA         57255
##      numrecoveredtoday percentactive numactive rateactive numtotal_last14
## 7102                 0          0.00         0       0.00               0
## 7103                11         41.82        92     218.78             128
## 7104                 0          0.00         0       0.00               0
## 7105                 0          0.00         0       0.00               8
## 7106                 0          0.00         0         NA               0
## 7107              1218          0.68      9645      25.38           14114
##      ratetotal_last14 numdeaths_last14 ratedeaths_last14 numtotal_last7
## 7102             0.00                0              0.00              0
## 7103           304.39                1              2.38            100
## 7104             0.00                0              0.00              0
## 7105            20.33                0              0.00              0
## 7106               NA                0                NA              0
## 7107            37.14              332              0.87           5788
##      ratetotal_last7 numdeaths_last7 ratedeaths_last7 avgtotal_last7
## 7102            0.00               0             0.00              0
## 7103          237.80               0             0.00             14
## 7104            0.00               0             0.00              0
## 7105            0.00               0             0.00              0
## 7106              NA               0               NA              0
## 7107           15.23             176             0.46            827
##      avgincidence_last7 avgdeaths_last7 avgratedeaths_last7 raterecovered
## 7102               0.00               0                0.00          0.00
## 7103              33.97               0                0.00         26.16
## 7104               0.00               0                0.00          0.00
## 7105               0.00               0                0.00          0.00
## 7106                 NA               0                  NA            NA
## 7107               2.18              25                0.07          3.20

Data Transformation

Let’s see our pain. The below weird looking graph is unwelcomed in statistics. The reason is we want to explain a relationship but the data is not spread evenly:

library(ggExtra)
p <- ggplot(houses, aes(x=sqft_living, y=price)) + 
  geom_point(alpha=.1) + 
  geom_smooth(colour='firebrick')

ggMarginal(p, type = 'histogram')

Both variables are right skewed. When we draw the scatterplot, the data is grouped into certain intervals, and the rest is sprinkled all around. In other words, for the low values the data is condensed around the line (variance is low) but for the higher values the data is covering a wider range around the line (variance is high). This is known as heteroskedasticity and as you will see later when covering regressions, it is trouble.

We will show you how to obtain a normal looking dataset using transformations:

library(gridExtra)
g1 <- ggplot(houses, aes(x=price)) + 
  geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=sqrt(price))) + 
  geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=log(price))) + 
  geom_histogram(colour='white')

g4 <- ggplot(houses, aes(x=sqft_living)) + 
  geom_histogram(colour='white')
g5 <- ggplot(houses, aes(x=sqrt(sqft_living))) + 
  geom_histogram(colour='white')
g6 <- ggplot(houses, aes(x=log(sqft_living))) + 
  geom_histogram(colour='white')

grid.arrange(g1,g4,g2,g5,g3,g6)

price and sqft_living are originally right skewed. The sqrt transformation doesn’t make it symmetric but the most symmetric one seems to be log-transformed one.

library('qqtest')
qqtest(log(houses$price))

qqtest(log(houses$sqft_living))

The log transformation for sqft_living worked fine. qqtest shows it is in the envelope, so we cannot reject it is normal. But for price data we need more work. The edges are still out of the band.

Parametric Transformations

A transformation function can generalize the above transformations using sqrt and log functions:

\[T(x,\alpha) = \frac{x^\alpha-1}{\alpha}\]

The above formula is a general form of the transformations you can use. For example, when $\alpha = 1$, the above will be equal to the original series. When $\alpha = 0.5$, it will be $2\sqrt{x}-2$, which is equivalent to $\sqrt{x}$. When $x\rightarrow 0$, the equation goes to $\log x$.

It can be defined in R as:

tform <- function(x, alpha){
  if(alpha == 0) {log(x)} else {(x^alpha-1)/alpha}
}

g1 <- ggplot(houses, aes(x=tform(price,1))) + 
  geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=tform(price,0.5))) + 
  geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=tform(price,0))) + 
  geom_histogram(colour='white')

g4 <- ggplot(houses, aes(x=tform(sqft_living,1))) + 
  geom_histogram(colour='white')
g5 <- ggplot(houses, aes(x=tform(sqft_living,.5))) + 
  geom_histogram(colour='white')
g6 <- ggplot(houses, aes(x=tform(sqft_living,0))) + 
  geom_histogram(colour='white')

grid.arrange(g1,g4,g2,g5,g3,g6)

The advantage of the above is to allow for further values. For example for data needs transformation between log and sqrt, we can use values between 0 to 0.5. Or for left-skewed data we can use negative values:

g1 <- ggplot(houses, aes(x=tform(price,0))) + 
  geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=tform(price,-0.25))) + 
  geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=tform(price,-.5))) + 
  geom_histogram(colour='white')

grid.arrange(g1,g2,g3)

We reached the sweet spot when alpha = -0.25.

Let’s verify if it is normal enough:

qqtest(tform(houses$price,-0.25))

The line is not perfectly linear, but it is in the band. So it is hard to reject it is not normal. # Preprocessing Data

Moreover, if your data is leftskewed, then you can set $\alpha$ a larger value than 1 so that it will shift the bump to the centre:

g1 <- ggplot(airquality, aes(x=tform(Temp,1))) + 
  geom_histogram(colour='white', bins=15)
g2 <- ggplot(airquality, aes(x=tform(Temp,1.5))) + 
  geom_histogram(colour='white', bins=15)
g3 <- ggplot(airquality, aes(x=tform(Temp,2.2))) + 
  geom_histogram(colour='white', bins=15)
grid.arrange(g1,g2,g3)

qqtest(tform(airquality$Temp,2.2))

Time Series

Datetime in R

Date and Datetime are special data types that are processed and used in a different fashion. Mostly data has them in a particular format, but that format is not universal. For example, we can write the birthday of Guthrie Govan in many ways, 12/27/1971, 12/27/71, 27.12.1971, 27/12/1971 and so on. Luckily there is a ISO format for the dates in form of 1971-12-27 and this is R’s default format.

If you read a csv file, by default, date will be read as factor. If in the source it is in ISO format, you can convert it easily as below:

bday <- '1971-12-27'
class(bday)

## [1] "character"

bday <- as.Date(bday)
class(bday)

## [1] "Date"

bday

## [1] "1971-12-27"

If the date is not in YYYY-MM-DD format, you need to specify the input format separately:

bday <- '27.12.1971'

bday <- as.Date(bday, format='%d.%m.%Y')
class(bday)

## [1] "Date"

bday

## [1] "1971-12-27"

You can notice that there is a different language for the format:

Code	Value	Example
%d	day	27
%m	month	12
%b	month (string)	Dec
%B	month (long str.)	December
%a	weekday (string)	Mon
%A	weekday (long str.)	Monday
%y	year (2 digits)	71
%Y	year (4 digits)	1971
%H	hour	21
%M	minute	30
%S	second	0

So if you have in other formats you can convert it easily:

bday <- 'December 27, 1971'
bday <- as.Date(bday, format='%B %d, %Y')
bday

## [1] "1971-12-27"

If you receive NA, this most probably means it couldn’t process because the format was not given properly:

bday <- 'December 27, 1971'
bday <- as.Date(bday, format='%B %d %Y')
bday

## [1] NA

After a good conversion, you can easily carry out mathematical operations:

library('lubridate')
guthrie <- as.Date('12/27/1971', format='%m/%d/%Y')
marco   <- as.Date('12/24/1970', format='%m/%d/%Y')

as.integer( (today() - guthrie)/365 )

## [1] 49

as.integer( (today() - marco)/365 )

## [1] 50

guthrie - marco

## Time difference of 368 days

If you have datetime input, then you can convert it similar to dates but only using strptime:

btime <- '12/27/1971 21:30:00'
guthrie <- strptime(btime, format='%m/%d/%Y %H:%M:%S', tz='GMT')
guthrie

## [1] "1971-12-27 21:30:00 GMT"

class(guthrie)

## [1] "POSIXlt" "POSIXt"

You can also print it back in any format you like:

format(guthrie,'%A, %B %d, %Y %H:%M:%S')

## [1] "Monday, December 27, 1971 21:30:00"

Lubridate Package

It is not easy to extract some good information by using the above methods. That’s why humanity invented lubridate. The only thing you need is what information you have an in what order:

## month day year hour minute second:
mdy_hms('12/27/1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

mdy_hms('12 27 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

mdy_hms('Dec 27 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

mdy_hms('December 27 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

mdy('12/27/1971')

## [1] "1971-12-27"

## day month year hour minute second:
dmy_hms('27/12/1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

dmy_hms('27 12 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

dmy_hms('27 Dec 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

dmy_hms('27 December 1971 21:30:00')

## [1] "1971-12-27 21:30:00 UTC"

dmy('27/12/1971')

## [1] "1971-12-27"

After a good conversion, we can extract all date related information we need:

library('lubridate')
bday <- mdy('12/27/1971')
c(day(bday), month(bday), year(bday))

## [1]   27   12 1971

c(year(bday), quarter(bday))

## [1] 1971    4

c(week(bday), wday(bday), weekdays(bday))

## [1] "52"     "2"      "Monday"

It also has a separate function for datetime, as_datetime, which is doing the same thing as strptime:

as_datetime('12/27/1971 21:30:00', format='%m/%d/%Y %H:%M:%S', tz='GMT')

## [1] "1971-12-27 21:30:00 GMT"

Basic operations

In R, there are some operations that you must now by heart. These are selecting row and columns, slicing the data and so on. The “by heart” here means you must know how they work but you may not use it in daily life.

Assume you have the following data:

musicians <- data.frame(Name=c('Guthrie Govan', 'Marco Minnemann', 'Tony Levin', 'Steven Wilson', 'Jordan Rudess'),
           Instrument=c('guitar','drums','bass','guitar/keyboard', 'keyboard'),
           Birthday = c('27.12.1971', '24.12.1970','6.6.1946','3.11.1967', '4.11.1956'))

musicians$Birthday <- dmy(musicians$Birthday)
musicians$Age      <- as.numeric(today() - musicians$Birthday)/365
musicians

##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 49.52603
## 2 Marco Minnemann           drums 1970-12-24 50.53425
## 3      Tony Levin            bass 1946-06-06 75.10137
## 4   Steven Wilson guitar/keyboard 1967-11-03 53.67671
## 5   Jordan Rudess        keyboard 1956-11-04 64.67945

Let’s recall the elementary operations.

Selecting Columns

Since this is a dataframe, you can use musicians$birthday to choose the third column. But for matrices it wont’ work. There is another notation that you need to know: You can select certain columns and rows using [ , ] notation:

musicians[ ,1]

## [1] "Guthrie Govan"   "Marco Minnemann" "Tony Levin"      "Steven Wilson"  
## [5] "Jordan Rudess"

musicians[ ,1:3]

##              Name      Instrument   Birthday
## 1   Guthrie Govan          guitar 1971-12-27
## 2 Marco Minnemann           drums 1970-12-24
## 3      Tony Levin            bass 1946-06-06
## 4   Steven Wilson guitar/keyboard 1967-11-03
## 5   Jordan Rudess        keyboard 1956-11-04

musicians[ ,c(1,4)]

##              Name      Age
## 1   Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3      Tony Levin 75.10137
## 4   Steven Wilson 53.67671
## 5   Jordan Rudess 64.67945

or we can drop the rows 2 and 3 by using ‘-’ sign:

musicians[ ,-c(2,3)]

##              Name      Age
## 1   Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3      Tony Levin 75.10137
## 4   Steven Wilson 53.67671
## 5   Jordan Rudess 64.67945

musicians[ ,c('Name','Age')]

##              Name      Age
## 1   Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3      Tony Levin 75.10137
## 4   Steven Wilson 53.67671
## 5   Jordan Rudess 64.67945

Tidyverse way

There is a special function defined in tidyverse, select for selecting columns:

select(musicians, c('Name','Birthday'))

##              Name   Birthday
## 1   Guthrie Govan 1971-12-27
## 2 Marco Minnemann 1970-12-24
## 3      Tony Levin 1946-06-06
## 4   Steven Wilson 1967-11-03
## 5   Jordan Rudess 1956-11-04

It also allowing choosing a range of columns using their names:

select(musicians, Name:Birthday)

##              Name      Instrument   Birthday
## 1   Guthrie Govan          guitar 1971-12-27
## 2 Marco Minnemann           drums 1970-12-24
## 3      Tony Levin            bass 1946-06-06
## 4   Steven Wilson guitar/keyboard 1967-11-03
## 5   Jordan Rudess        keyboard 1956-11-04

Selecting Rows

musicians[1:2, ]

##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 49.52603
## 2 Marco Minnemann      drums 1970-12-24 50.53425

handsAlbum  <- c(1,2,4)
musicians[handsAlbum, ]

##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 49.52603
## 2 Marco Minnemann           drums 1970-12-24 50.53425
## 4   Steven Wilson guitar/keyboard 1967-11-03 53.67671

or we can drop the rows 3 and 4 by using ‘-’ sign:

musicians[-c(3,4),]

##              Name Instrument   Birthday      Age
## 1   Guthrie Govan     guitar 1971-12-27 49.52603
## 2 Marco Minnemann      drums 1970-12-24 50.53425
## 5   Jordan Rudess   keyboard 1956-11-04 64.67945

We can also use logical arrays as filters:

handsAlbum <- c(T,T,F,T,F)
musicians[handsAlbum,]

##              Name      Instrument   Birthday      Age
## 1   Guthrie Govan          guitar 1971-12-27 49.52603
## 2 Marco Minnemann           drums 1970-12-24 50.53425
## 4   Steven Wilson guitar/keyboard 1967-11-03 53.67671

The musicians younger than age 50:

musicians$Age < 50

## [1]  TRUE FALSE FALSE FALSE FALSE

musicians[musicians$Age<50,]

##            Name Instrument   Birthday      Age
## 1 Guthrie Govan     guitar 1971-12-27 49.52603

# alternatively
# musicians[musicians[,4]<50,]
# subset(musicians, Age < 50)

Tidyverse way

In tidyverse package there is a special function for it, which does the same thing as subset does. So you may ask why. Don’t ask me.

subset(musicians, Age<50)

##            Name Instrument   Birthday      Age
## 1 Guthrie Govan     guitar 1971-12-27 49.52603

filter(musicians, Age<50)

##            Name Instrument   Birthday      Age
## 1 Guthrie Govan     guitar 1971-12-27 49.52603

Gather and Spread

There is a life saver function, gather and its duality, spread in tidyverse package. May tidyverse be with you.

Let’s see our problem first. Assume you want to plot time series graph of the economies and compare them. To be able to compare them you must add a line, because each of the countries is a new variable:

econ <- read.csv('data/madison.csv')

ggplot(econ, aes(x=Year)) + 
  geom_line(aes(y=Canada), colour='firebrick') + 
  geom_line(aes(y=France), colour = 'steelblue') +
  geom_line(aes(y=Italy), colour = 'darkolivegreen3') + 
  geom_text(aes(y=tail(Canada,1), x=2013,label='Canada')) + 
  geom_text(aes(y=tail(France,1), x=2013,label='France')) + 
  geom_text(aes(y=tail(Italy,1), x=2013, label='Italy')) +
  labs(y='GDPPC')

I don’t have time to do it for all.

Instead, you can transform your data using gather as below:

# econ.long <- gather(econ, key='Country', value='GDPPC', Canada:United.States)
econ.long <- gather(econ, key='Country', value='GDPPC', -Year)
head(econ.long)

##   Year Country GDPPC
## 1 1950  Canada  7291
## 2 1951  Canada  7533
## 3 1952  Canada  7833
## 4 1953  Canada  7984
## 5 1954  Canada  7699
## 6 1955  Canada  8201

tail(econ.long)

##     Year       Country GDPPC
## 422 2005 United.States 30842
## 423 2006 United.States 31358
## 424 2007 United.States 31655
## 425 2008 United.States 31251
## 426 2009 United.States 29899
## 427 2010 United.States 30491

Now we have a more handy data. Each row contains GDP per capita values of each year and country combination. It is way easier to plot it:

col <- c('firebrick','steelblue','darkolivegreen','purple','black','brown','orange')
ggplot(econ.long, aes(x=Year, y=GDPPC)) + 
  geom_line(aes(colour=Country)) +
  geom_text(data=subset(econ.long, Year==2010), aes(y=GDPPC, x=2010, label=Country),hjust=-.1) + 
  xlim(c(1950,2017)) + 
  theme(legend.position = 'none') + 
  scale_color_manual(values = col)

In case you need a data that is wider, you can use spread:

econ.wide <- spread(econ.long,key = 'Country',value='GDPPC')
head( econ.wide )

##   Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950   7291   5186    3881  3172  1921           6939          9561
## 2 1951   7533   5461    4206  3451  2126           7123         10116
## 3 1952   7833   5564    4553  3591  2336           7091         10316
## 4 1953   7984   5684    4905  3830  2474           7346         10613
## 5 1954   7699   5915    5247  3947  2582           7619         10359
## 6 1955   8201   6199    5797  4190  2771           7868         10897

Visualizing Covid Confirmed Data

covid <- read.csv('data/covid19-download.csv')
head(covid)

##   pruid           prname             prnameFR       date update numconf numprob
## 1    35          Ontario              Ontario 2020-01-31     NA       3       0
## 2    59 British Columbia Colombie-Britannique 2020-01-31     NA       1       0
## 3     1           Canada               Canada 2020-01-31     NA       4       0
## 4    35          Ontario              Ontario 2020-02-08     NA       3       0
## 5    59 British Columbia Colombie-Britannique 2020-02-08     NA       4       0
## 6     1           Canada               Canada 2020-02-08     NA       7       0
##   numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1         0        3        NA        0         NA                        NA
## 2         0        1        NA        0         NA                        NA
## 3         0        4        NA        0         NA                        NA
## 4         0        3        NA        0         NA                        NA
## 5         0        4        NA       63         NA                        NA
## 6         0        7        NA       63         NA                        NA
##   ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1        NA        3         300      0.02          0              0
## 2        NA        1         100      0.02          0              0
## 3        NA        4         400      0.01          0              0
## 4        NA        0           0      0.02          0              0
## 5        12        3         300      0.08          0              0
## 6         2        3          75      0.02          0              0
##   percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1            0             NA            NA                NA           100
## 2            0             NA            NA                NA           100
## 3            0             NA            NA                NA           100
## 4            0             NA            NA                NA           100
## 5            0             NA            NA                NA           100
## 6            0             NA            NA                NA           100
##   numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1         3       0.02              NA               NA               NA
## 2         1       0.02              NA               NA               NA
## 3         4       0.01              NA               NA               NA
## 4         3       0.02              NA               NA               NA
## 5         4       0.08              NA               NA               NA
## 6         7       0.02              NA               NA               NA
##   ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1                NA             NA              NA              NA
## 2                NA             NA              NA              NA
## 3                NA             NA              NA              NA
## 4                NA             NA              NA              NA
## 5                NA             NA              NA              NA
## 6                NA             NA              NA              NA
##   ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1               NA             NA                 NA              NA
## 2               NA             NA                 NA              NA
## 3               NA             NA                 NA              NA
## 4               NA             NA                 NA              NA
## 5               NA             NA                 NA              NA
## 6               NA             NA                 NA              NA
##   avgratedeaths_last7 raterecovered
## 1                  NA             0
## 2                  NA             0
## 3                  NA             0
## 4                  NA             0
## 5                  NA             0
## 6                  NA             0

tail(covid)

##      pruid                 prname                  prnameFR       date update
## 7102    11   Prince Edward Island     Île-du-Prince-Édouard 2021-06-23      0
## 7103    60                  Yukon                     Yukon 2021-06-23      1
## 7104    61  Northwest Territories Territoires du Nord-Ouest 2021-06-23      1
## 7105    62                Nunavut                   Nunavut 2021-06-23      1
## 7106    99 Repatriated travellers       Voyageurs rapatriés 2021-06-23     NA
## 7107     1                 Canada                    Canada 2021-06-23     NA
##      numconf numprob numdeaths numtotal numtested numtests numrecover
## 7102     206      NA         0      206        NA   172743        206
## 7103     220      NA         3      220        NA     9129        125
## 7104     128      NA         0      128        NA    24548        128
## 7105     657      NA         4      657        NA    17497        653
## 7106      13      NA         0       13        NA       76         13
## 7107 1410927      NA     26175  1410927        NA 36344745    1375107
##      percentrecover ratetested ratetests numtoday percentoday ratetotal
## 7102         100.00         NA   1082180        0        0.00    129.05
## 7103          56.82         NA    217088        8        3.77    523.16
## 7104         100.00         NA    543566        0        0.00    283.43
## 7105          99.39         NA    444617        0        0.00   1669.50
## 7106         100.00         NA        NA        0        0.00        NA
## 7107          97.46         NA    956309      727        0.05   3712.45
##      ratedeaths numdeathstoday percentdeath numtestedtoday numteststoday
## 7102       0.00              0         0.00             NA           366
## 7103       7.13              0         1.36             NA             0
## 7104       0.00              0         0.00             NA             7
## 7105      10.16              0         0.61             NA            84
## 7106         NA              0         0.00             NA             0
## 7107      68.87             21         1.86             NA         57255
##      numrecoveredtoday percentactive numactive rateactive numtotal_last14
## 7102                 0          0.00         0       0.00               0
## 7103                11         41.82        92     218.78             128
## 7104                 0          0.00         0       0.00               0
## 7105                 0          0.00         0       0.00               8
## 7106                 0          0.00         0         NA               0
## 7107              1218          0.68      9645      25.38           14114
##      ratetotal_last14 numdeaths_last14 ratedeaths_last14 numtotal_last7
## 7102             0.00                0              0.00              0
## 7103           304.39                1              2.38            100
## 7104             0.00                0              0.00              0
## 7105            20.33                0              0.00              0
## 7106               NA                0                NA              0
## 7107            37.14              332              0.87           5788
##      ratetotal_last7 numdeaths_last7 ratedeaths_last7 avgtotal_last7
## 7102            0.00               0             0.00              0
## 7103          237.80               0             0.00             14
## 7104            0.00               0             0.00              0
## 7105            0.00               0             0.00              0
## 7106              NA               0               NA              0
## 7107           15.23             176             0.46            827
##      avgincidence_last7 avgdeaths_last7 avgratedeaths_last7 raterecovered
## 7102               0.00               0                0.00          0.00
## 7103              33.97               0                0.00         26.16
## 7104               0.00               0                0.00          0.00
## 7105               0.00               0                0.00          0.00
## 7106                 NA               0                  NA            NA
## 7107               2.18              25                0.07          3.20

We have to write some line of codes before working with this pain:

covid$date <- ymd(covid$date)
temp <- covid %>% 
  mutate(date = ymd(date)) %>% 
  select(prname,date,numtotal,numactive) %>% 
  rename(Province = prname, Date=date,Active=numactive,Total=numtotal)
  
head(temp)

##           Province       Date Total Active
## 1          Ontario 2020-01-31     3      3
## 2 British Columbia 2020-01-31     1      1
## 3           Canada 2020-01-31     4      4
## 4          Ontario 2020-02-08     3      3
## 5 British Columbia 2020-02-08     4      4
## 6           Canada 2020-02-08     7      7

tail(temp)

##                    Province       Date   Total Active
## 7102   Prince Edward Island 2021-06-23     206      0
## 7103                  Yukon 2021-06-23     220     92
## 7104  Northwest Territories 2021-06-23     128      0
## 7105                Nunavut 2021-06-23     657      0
## 7106 Repatriated travellers 2021-06-23      13      0
## 7107                 Canada 2021-06-23 1410927   9645

labelData <- subset(temp, Date=='2021-06-23' & Province !='Canada')
subset(temp, Province!='Canada') %>% 
  ggplot(aes(x=Date, y=Total, colour = Province)) + 
  geom_text(data=labelData, aes(y=Total, x=Date, label=Province),hjust=-.1) + 
  geom_line() + 
  scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15')))+
  theme(legend.position = 'none')

labelData <- subset(temp, Date=='2021-04-23' & Province !='Canada')
subset(temp, Province!='Canada') %>% 
  ggplot(aes(x=Date, y=Active, colour = Province)) + 
  geom_text(data=labelData, aes(y=Active, x=Date, label=Province),hjust=-.1,colour='black') +
  geom_line() + 
  scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15')))+
  theme(legend.position = 'none')

Kool Stuff

Let’s do something cool. The below code generates an animated version of the above plot.

CAUTION: You need gganimate, gifski and av installed. After installing, you need to restart R.

# install.packages('gganimate')
# install.packages('gifski')
# install.packages('av')

library('gganimate')

subset(temp, Date > '2021-01-01' & 
         Province %in% c('Ontario','Alberta','British Columbia', 'Quebec', 'Saskatchewan')) %>% 
  ggplot(aes(x=Date, y=Active, colour = Province)) + 
  geom_line() + 
  geom_point() + 
  transition_reveal(Date) +
  scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15'))) + 
  shadow_wake(wake_length = 0.1, alpha = FALSE)

subset(temp, Date > '2021-01-01' & 
         Province %in% c('Ontario','Alberta','British Columbia', 'Quebec', 'Saskatchewan')) %>% 
  ggplot(aes(x=Date, y=Active, colour = Province)) + 
  geom_point() + 
  transition_reveal(Date) +
  scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15'))) +
  theme(plot.margin = margin(5.5, 40, 5.5, 5.5)) + 
  shadow_wake(wake_length = 0.1, alpha = FALSE)

For more examples see https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/.

Lab 7: Data Transformation and Time Series