Lab 7: Data Transformation and Time Series
Introduction
The data comes in various shapes but never quite as you like when you want to plot the time series. The dates are read as characters, the columns have information but you need in rows etc. Another problem is, when you want to fit a model to your data, often you need it in certain forms, usually normal. In Statistics, you will often see the normality assumption.
In this lab we will cover both of these separately. We will first transform our data to satisfy normality, which will solve many of our issues that we will see later. Also, we will show how to plot time series, but also cross two hurdles between you and a time series plot, namely processing dates and changing the data structure.
In this tutorial you will learn how to:
- transform the data using a parametric transformation method
- preprocess the data (dates and shape)
- play with date objects
- plot time series
Getting Started
We will visualize house prices data, Covid Confirmed Cases dataset, and Maddison Project GDP Per Capita dataset. To be able to follow you need:
- to download covid_confirmed.csv and maddison.csv data uploaded on LEARN
- some packages including
tidyverse
lubridate
gganimate
(takes effort to install)plotly
(takes time to install)
houses <- read.csv('data/house_subset.csv') # choose your directory!
covid <- read.csv('data/covid19-download.csv')
head(covid)
## pruid prname prnameFR date update numconf numprob
## 1 35 Ontario Ontario 2020-01-31 NA 3 0
## 2 59 British Columbia Colombie-Britannique 2020-01-31 NA 1 0
## 3 1 Canada Canada 2020-01-31 NA 4 0
## 4 35 Ontario Ontario 2020-02-08 NA 3 0
## 5 59 British Columbia Colombie-Britannique 2020-02-08 NA 4 0
## 6 1 Canada Canada 2020-02-08 NA 7 0
## numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1 0 3 NA 0 NA NA
## 2 0 1 NA 0 NA NA
## 3 0 4 NA 0 NA NA
## 4 0 3 NA 0 NA NA
## 5 0 4 NA 63 NA NA
## 6 0 7 NA 63 NA NA
## ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1 NA 3 300 0.02 0 0
## 2 NA 1 100 0.02 0 0
## 3 NA 4 400 0.01 0 0
## 4 NA 0 0 0.02 0 0
## 5 12 3 300 0.08 0 0
## 6 2 3 75 0.02 0 0
## percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1 0 NA NA NA 100
## 2 0 NA NA NA 100
## 3 0 NA NA NA 100
## 4 0 NA NA NA 100
## 5 0 NA NA NA 100
## 6 0 NA NA NA 100
## numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1 3 0.02 NA NA NA
## 2 1 0.02 NA NA NA
## 3 4 0.01 NA NA NA
## 4 3 0.02 NA NA NA
## 5 4 0.08 NA NA NA
## 6 7 0.02 NA NA NA
## ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## avgratedeaths_last7 raterecovered
## 1 NA 0
## 2 NA 0
## 3 NA 0
## 4 NA 0
## 5 NA 0
## 6 NA 0
## pruid prname prnameFR date update
## 7102 11 Prince Edward Island Île-du-Prince-Édouard 2021-06-23 0
## 7103 60 Yukon Yukon 2021-06-23 1
## 7104 61 Northwest Territories Territoires du Nord-Ouest 2021-06-23 1
## 7105 62 Nunavut Nunavut 2021-06-23 1
## 7106 99 Repatriated travellers Voyageurs rapatriés 2021-06-23 NA
## 7107 1 Canada Canada 2021-06-23 NA
## numconf numprob numdeaths numtotal numtested numtests numrecover
## 7102 206 NA 0 206 NA 172743 206
## 7103 220 NA 3 220 NA 9129 125
## 7104 128 NA 0 128 NA 24548 128
## 7105 657 NA 4 657 NA 17497 653
## 7106 13 NA 0 13 NA 76 13
## 7107 1410927 NA 26175 1410927 NA 36344745 1375107
## percentrecover ratetested ratetests numtoday percentoday ratetotal
## 7102 100.00 NA 1082180 0 0.00 129.05
## 7103 56.82 NA 217088 8 3.77 523.16
## 7104 100.00 NA 543566 0 0.00 283.43
## 7105 99.39 NA 444617 0 0.00 1669.50
## 7106 100.00 NA NA 0 0.00 NA
## 7107 97.46 NA 956309 727 0.05 3712.45
## ratedeaths numdeathstoday percentdeath numtestedtoday numteststoday
## 7102 0.00 0 0.00 NA 366
## 7103 7.13 0 1.36 NA 0
## 7104 0.00 0 0.00 NA 7
## 7105 10.16 0 0.61 NA 84
## 7106 NA 0 0.00 NA 0
## 7107 68.87 21 1.86 NA 57255
## numrecoveredtoday percentactive numactive rateactive numtotal_last14
## 7102 0 0.00 0 0.00 0
## 7103 11 41.82 92 218.78 128
## 7104 0 0.00 0 0.00 0
## 7105 0 0.00 0 0.00 8
## 7106 0 0.00 0 NA 0
## 7107 1218 0.68 9645 25.38 14114
## ratetotal_last14 numdeaths_last14 ratedeaths_last14 numtotal_last7
## 7102 0.00 0 0.00 0
## 7103 304.39 1 2.38 100
## 7104 0.00 0 0.00 0
## 7105 20.33 0 0.00 0
## 7106 NA 0 NA 0
## 7107 37.14 332 0.87 5788
## ratetotal_last7 numdeaths_last7 ratedeaths_last7 avgtotal_last7
## 7102 0.00 0 0.00 0
## 7103 237.80 0 0.00 14
## 7104 0.00 0 0.00 0
## 7105 0.00 0 0.00 0
## 7106 NA 0 NA 0
## 7107 15.23 176 0.46 827
## avgincidence_last7 avgdeaths_last7 avgratedeaths_last7 raterecovered
## 7102 0.00 0 0.00 0.00
## 7103 33.97 0 0.00 26.16
## 7104 0.00 0 0.00 0.00
## 7105 0.00 0 0.00 0.00
## 7106 NA 0 NA NA
## 7107 2.18 25 0.07 3.20
Data Transformation
Let’s see our pain. The below weird looking graph is unwelcomed in statistics. The reason is we want to explain a relationship but the data is not spread evenly:
library(ggExtra)
p <- ggplot(houses, aes(x=sqft_living, y=price)) +
geom_point(alpha=.1) +
geom_smooth(colour='firebrick')
ggMarginal(p, type = 'histogram')
Both variables are right skewed. When we draw the scatterplot, the data is grouped into certain intervals, and the rest is sprinkled all around. In other words, for the low values the data is condensed around the line (variance is low) but for the higher values the data is covering a wider range around the line (variance is high). This is known as heteroskedasticity and as you will see later when covering regressions, it is trouble.
We will show you how to obtain a normal looking dataset using transformations:
library(gridExtra)
g1 <- ggplot(houses, aes(x=price)) +
geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=sqrt(price))) +
geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=log(price))) +
geom_histogram(colour='white')
g4 <- ggplot(houses, aes(x=sqft_living)) +
geom_histogram(colour='white')
g5 <- ggplot(houses, aes(x=sqrt(sqft_living))) +
geom_histogram(colour='white')
g6 <- ggplot(houses, aes(x=log(sqft_living))) +
geom_histogram(colour='white')
grid.arrange(g1,g4,g2,g5,g3,g6)
price and sqft_living are originally right skewed. The sqrt transformation doesn’t make it symmetric but the most symmetric one seems to be log-transformed one.
The log transformation for sqft_living worked fine. qqtest shows it is in the envelope, so we cannot reject it is normal. But for price data we need more work. The edges are still out of the band.
Parametric Transformations
A transformation function can generalize the above transformations using sqrt and log functions:
\[T(x,\alpha) = \frac{x^\alpha-1}{\alpha}\]
The above formula is a general form of the transformations you can use. For example, when \(\alpha = 1\), the above will be equal to the original series. When \(\alpha = 0.5\), it will be \(2\sqrt{x}-2\), which is equivalent to \(\sqrt{x}\). When \(x\rightarrow 0\), the equation goes to \(\log x\).
It can be defined in R as:
g1 <- ggplot(houses, aes(x=tform(price,1))) +
geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=tform(price,0.5))) +
geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=tform(price,0))) +
geom_histogram(colour='white')
g4 <- ggplot(houses, aes(x=tform(sqft_living,1))) +
geom_histogram(colour='white')
g5 <- ggplot(houses, aes(x=tform(sqft_living,.5))) +
geom_histogram(colour='white')
g6 <- ggplot(houses, aes(x=tform(sqft_living,0))) +
geom_histogram(colour='white')
grid.arrange(g1,g4,g2,g5,g3,g6)
The advantage of the above is to allow for further values. For example for data needs transformation between log and sqrt, we can use values between 0 to 0.5. Or for left-skewed data we can use negative values:
g1 <- ggplot(houses, aes(x=tform(price,0))) +
geom_histogram(colour='white')
g2 <- ggplot(houses, aes(x=tform(price,-0.25))) +
geom_histogram(colour='white')
g3 <- ggplot(houses, aes(x=tform(price,-.5))) +
geom_histogram(colour='white')
grid.arrange(g1,g2,g3)
We reached the sweet spot when alpha = -0.25.
Let’s verify if it is normal enough:
The line is not perfectly linear, but it is in the band. So it is hard to reject it is not normal. # Preprocessing Data
Moreover, if your data is leftskewed, then you can set \(\alpha\) a larger value than 1 so that it will shift the bump to the centre:
Time Series
Datetime in R
Date and Datetime are special data types that are processed and used in a different fashion. Mostly data has them in a particular format, but that format is not universal. For example, we can write the birthday of Guthrie Govan in many ways, 12/27/1971
, 12/27/71
, 27.12.1971
, 27/12/1971
and so on. Luckily there is a ISO format for the dates in form of 1971-12-27
and this is R’s default format.
If you read a csv file, by default, date will be read as factor. If in the source it is in ISO format, you can convert it easily as below:
## [1] "character"
## [1] "Date"
## [1] "1971-12-27"
If the date is not in YYYY-MM-DD format, you need to specify the input format separately:
## [1] "Date"
## [1] "1971-12-27"
You can notice that there is a different language for the format:
Code | Value | Example |
---|---|---|
%d | day | 27 |
%m | month | 12 |
%b | month (string) | Dec |
%B | month (long str.) | December |
%a | weekday (string) | Mon |
%A | weekday (long str.) | Monday |
%y | year (2 digits) | 71 |
%Y | year (4 digits) | 1971 |
%H | hour | 21 |
%M | minute | 30 |
%S | second | 0 |
So if you have in other formats you can convert it easily:
## [1] "1971-12-27"
If you receive NA, this most probably means it couldn’t process because the format was not given properly:
## [1] NA
After a good conversion, you can easily carry out mathematical operations:
library('lubridate')
guthrie <- as.Date('12/27/1971', format='%m/%d/%Y')
marco <- as.Date('12/24/1970', format='%m/%d/%Y')
as.integer( (today() - guthrie)/365 )
## [1] 49
## [1] 50
## Time difference of 368 days
If you have datetime
input, then you can convert it similar to dates but only using strptime
:
btime <- '12/27/1971 21:30:00'
guthrie <- strptime(btime, format='%m/%d/%Y %H:%M:%S', tz='GMT')
guthrie
## [1] "1971-12-27 21:30:00 GMT"
## [1] "POSIXlt" "POSIXt"
You can also print it back in any format you like:
## [1] "Monday, December 27, 1971 21:30:00"
Lubridate Package
It is not easy to extract some good information by using the above methods. That’s why humanity invented lubridate
. The only thing you need is what information you have an in what order:
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27"
After a good conversion, we can extract all date related information we need:
## [1] 27 12 1971
## [1] 1971 4
## [1] "52" "2" "Monday"
It also has a separate function for datetime, as_datetime
, which is doing the same thing as strptime
:
## [1] "1971-12-27 21:30:00 GMT"
Basic operations
In R, there are some operations that you must now by heart. These are selecting row and columns, slicing the data and so on. The “by heart” here means you must know how they work but you may not use it in daily life.
Assume you have the following data:
musicians <- data.frame(Name=c('Guthrie Govan', 'Marco Minnemann', 'Tony Levin', 'Steven Wilson', 'Jordan Rudess'),
Instrument=c('guitar','drums','bass','guitar/keyboard', 'keyboard'),
Birthday = c('27.12.1971', '24.12.1970','6.6.1946','3.11.1967', '4.11.1956'))
musicians$Birthday <- dmy(musicians$Birthday)
musicians$Age <- as.numeric(today() - musicians$Birthday)/365
musicians
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## 2 Marco Minnemann drums 1970-12-24 50.53425
## 3 Tony Levin bass 1946-06-06 75.10137
## 4 Steven Wilson guitar/keyboard 1967-11-03 53.67671
## 5 Jordan Rudess keyboard 1956-11-04 64.67945
Let’s recall the elementary operations.
Selecting Columns
Since this is a dataframe, you can use musicians$birthday
to choose the third column. But for matrices it wont’ work. There is another notation that you need to know: You can select certain columns and rows using [ , ]
notation:
## [1] "Guthrie Govan" "Marco Minnemann" "Tony Levin" "Steven Wilson"
## [5] "Jordan Rudess"
## Name Instrument Birthday
## 1 Guthrie Govan guitar 1971-12-27
## 2 Marco Minnemann drums 1970-12-24
## 3 Tony Levin bass 1946-06-06
## 4 Steven Wilson guitar/keyboard 1967-11-03
## 5 Jordan Rudess keyboard 1956-11-04
## Name Age
## 1 Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3 Tony Levin 75.10137
## 4 Steven Wilson 53.67671
## 5 Jordan Rudess 64.67945
or we can drop the rows 2 and 3 by using ‘-’ sign:
## Name Age
## 1 Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3 Tony Levin 75.10137
## 4 Steven Wilson 53.67671
## 5 Jordan Rudess 64.67945
## Name Age
## 1 Guthrie Govan 49.52603
## 2 Marco Minnemann 50.53425
## 3 Tony Levin 75.10137
## 4 Steven Wilson 53.67671
## 5 Jordan Rudess 64.67945
Tidyverse way
There is a special function defined in tidyverse, select
for selecting columns:
## Name Birthday
## 1 Guthrie Govan 1971-12-27
## 2 Marco Minnemann 1970-12-24
## 3 Tony Levin 1946-06-06
## 4 Steven Wilson 1967-11-03
## 5 Jordan Rudess 1956-11-04
It also allowing choosing a range of columns using their names:
## Name Instrument Birthday
## 1 Guthrie Govan guitar 1971-12-27
## 2 Marco Minnemann drums 1970-12-24
## 3 Tony Levin bass 1946-06-06
## 4 Steven Wilson guitar/keyboard 1967-11-03
## 5 Jordan Rudess keyboard 1956-11-04
Selecting Rows
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## 2 Marco Minnemann drums 1970-12-24 50.53425
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## 2 Marco Minnemann drums 1970-12-24 50.53425
## 4 Steven Wilson guitar/keyboard 1967-11-03 53.67671
or we can drop the rows 3 and 4 by using ‘-’ sign:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## 2 Marco Minnemann drums 1970-12-24 50.53425
## 5 Jordan Rudess keyboard 1956-11-04 64.67945
We can also use logical arrays as filters:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## 2 Marco Minnemann drums 1970-12-24 50.53425
## 4 Steven Wilson guitar/keyboard 1967-11-03 53.67671
The musicians younger than age 50:
## [1] TRUE FALSE FALSE FALSE FALSE
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
Tidyverse way
In tidyverse package there is a special function for it, which does the same thing as subset
does. So you may ask why. Don’t ask me.
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 49.52603
Gather and Spread
There is a life saver function, gather
and its duality, spread
in tidyverse package. May tidyverse be with you.
Let’s see our problem first. Assume you want to plot time series graph of the economies and compare them. To be able to compare them you must add a line, because each of the countries is a new variable:
econ <- read.csv('data/madison.csv')
ggplot(econ, aes(x=Year)) +
geom_line(aes(y=Canada), colour='firebrick') +
geom_line(aes(y=France), colour = 'steelblue') +
geom_line(aes(y=Italy), colour = 'darkolivegreen3') +
geom_text(aes(y=tail(Canada,1), x=2013,label='Canada')) +
geom_text(aes(y=tail(France,1), x=2013,label='France')) +
geom_text(aes(y=tail(Italy,1), x=2013, label='Italy')) +
labs(y='GDPPC')
I don’t have time to do it for all.
Instead, you can transform your data using gather
as below:
# econ.long <- gather(econ, key='Country', value='GDPPC', Canada:United.States)
econ.long <- gather(econ, key='Country', value='GDPPC', -Year)
head(econ.long)
## Year Country GDPPC
## 1 1950 Canada 7291
## 2 1951 Canada 7533
## 3 1952 Canada 7833
## 4 1953 Canada 7984
## 5 1954 Canada 7699
## 6 1955 Canada 8201
## Year Country GDPPC
## 422 2005 United.States 30842
## 423 2006 United.States 31358
## 424 2007 United.States 31655
## 425 2008 United.States 31251
## 426 2009 United.States 29899
## 427 2010 United.States 30491
Now we have a more handy data. Each row contains GDP per capita values of each year and country combination. It is way easier to plot it:
col <- c('firebrick','steelblue','darkolivegreen','purple','black','brown','orange')
ggplot(econ.long, aes(x=Year, y=GDPPC)) +
geom_line(aes(colour=Country)) +
geom_text(data=subset(econ.long, Year==2010), aes(y=GDPPC, x=2010, label=Country),hjust=-.1) +
xlim(c(1950,2017)) +
theme(legend.position = 'none') +
scale_color_manual(values = col)
In case you need a data that is wider, you can use spread
:
## Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950 7291 5186 3881 3172 1921 6939 9561
## 2 1951 7533 5461 4206 3451 2126 7123 10116
## 3 1952 7833 5564 4553 3591 2336 7091 10316
## 4 1953 7984 5684 4905 3830 2474 7346 10613
## 5 1954 7699 5915 5247 3947 2582 7619 10359
## 6 1955 8201 6199 5797 4190 2771 7868 10897
Visualizing Covid Confirmed Data
## pruid prname prnameFR date update numconf numprob
## 1 35 Ontario Ontario 2020-01-31 NA 3 0
## 2 59 British Columbia Colombie-Britannique 2020-01-31 NA 1 0
## 3 1 Canada Canada 2020-01-31 NA 4 0
## 4 35 Ontario Ontario 2020-02-08 NA 3 0
## 5 59 British Columbia Colombie-Britannique 2020-02-08 NA 4 0
## 6 1 Canada Canada 2020-02-08 NA 7 0
## numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1 0 3 NA 0 NA NA
## 2 0 1 NA 0 NA NA
## 3 0 4 NA 0 NA NA
## 4 0 3 NA 0 NA NA
## 5 0 4 NA 63 NA NA
## 6 0 7 NA 63 NA NA
## ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1 NA 3 300 0.02 0 0
## 2 NA 1 100 0.02 0 0
## 3 NA 4 400 0.01 0 0
## 4 NA 0 0 0.02 0 0
## 5 12 3 300 0.08 0 0
## 6 2 3 75 0.02 0 0
## percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1 0 NA NA NA 100
## 2 0 NA NA NA 100
## 3 0 NA NA NA 100
## 4 0 NA NA NA 100
## 5 0 NA NA NA 100
## 6 0 NA NA NA 100
## numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1 3 0.02 NA NA NA
## 2 1 0.02 NA NA NA
## 3 4 0.01 NA NA NA
## 4 3 0.02 NA NA NA
## 5 4 0.08 NA NA NA
## 6 7 0.02 NA NA NA
## ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## avgratedeaths_last7 raterecovered
## 1 NA 0
## 2 NA 0
## 3 NA 0
## 4 NA 0
## 5 NA 0
## 6 NA 0
## pruid prname prnameFR date update
## 7102 11 Prince Edward Island Île-du-Prince-Édouard 2021-06-23 0
## 7103 60 Yukon Yukon 2021-06-23 1
## 7104 61 Northwest Territories Territoires du Nord-Ouest 2021-06-23 1
## 7105 62 Nunavut Nunavut 2021-06-23 1
## 7106 99 Repatriated travellers Voyageurs rapatriés 2021-06-23 NA
## 7107 1 Canada Canada 2021-06-23 NA
## numconf numprob numdeaths numtotal numtested numtests numrecover
## 7102 206 NA 0 206 NA 172743 206
## 7103 220 NA 3 220 NA 9129 125
## 7104 128 NA 0 128 NA 24548 128
## 7105 657 NA 4 657 NA 17497 653
## 7106 13 NA 0 13 NA 76 13
## 7107 1410927 NA 26175 1410927 NA 36344745 1375107
## percentrecover ratetested ratetests numtoday percentoday ratetotal
## 7102 100.00 NA 1082180 0 0.00 129.05
## 7103 56.82 NA 217088 8 3.77 523.16
## 7104 100.00 NA 543566 0 0.00 283.43
## 7105 99.39 NA 444617 0 0.00 1669.50
## 7106 100.00 NA NA 0 0.00 NA
## 7107 97.46 NA 956309 727 0.05 3712.45
## ratedeaths numdeathstoday percentdeath numtestedtoday numteststoday
## 7102 0.00 0 0.00 NA 366
## 7103 7.13 0 1.36 NA 0
## 7104 0.00 0 0.00 NA 7
## 7105 10.16 0 0.61 NA 84
## 7106 NA 0 0.00 NA 0
## 7107 68.87 21 1.86 NA 57255
## numrecoveredtoday percentactive numactive rateactive numtotal_last14
## 7102 0 0.00 0 0.00 0
## 7103 11 41.82 92 218.78 128
## 7104 0 0.00 0 0.00 0
## 7105 0 0.00 0 0.00 8
## 7106 0 0.00 0 NA 0
## 7107 1218 0.68 9645 25.38 14114
## ratetotal_last14 numdeaths_last14 ratedeaths_last14 numtotal_last7
## 7102 0.00 0 0.00 0
## 7103 304.39 1 2.38 100
## 7104 0.00 0 0.00 0
## 7105 20.33 0 0.00 0
## 7106 NA 0 NA 0
## 7107 37.14 332 0.87 5788
## ratetotal_last7 numdeaths_last7 ratedeaths_last7 avgtotal_last7
## 7102 0.00 0 0.00 0
## 7103 237.80 0 0.00 14
## 7104 0.00 0 0.00 0
## 7105 0.00 0 0.00 0
## 7106 NA 0 NA 0
## 7107 15.23 176 0.46 827
## avgincidence_last7 avgdeaths_last7 avgratedeaths_last7 raterecovered
## 7102 0.00 0 0.00 0.00
## 7103 33.97 0 0.00 26.16
## 7104 0.00 0 0.00 0.00
## 7105 0.00 0 0.00 0.00
## 7106 NA 0 NA NA
## 7107 2.18 25 0.07 3.20
We have to write some line of codes before working with this pain:
covid$date <- ymd(covid$date)
temp <- covid %>%
mutate(date = ymd(date)) %>%
select(prname,date,numtotal,numactive) %>%
rename(Province = prname, Date=date,Active=numactive,Total=numtotal)
head(temp)
## Province Date Total Active
## 1 Ontario 2020-01-31 3 3
## 2 British Columbia 2020-01-31 1 1
## 3 Canada 2020-01-31 4 4
## 4 Ontario 2020-02-08 3 3
## 5 British Columbia 2020-02-08 4 4
## 6 Canada 2020-02-08 7 7
## Province Date Total Active
## 7102 Prince Edward Island 2021-06-23 206 0
## 7103 Yukon 2021-06-23 220 92
## 7104 Northwest Territories 2021-06-23 128 0
## 7105 Nunavut 2021-06-23 657 0
## 7106 Repatriated travellers 2021-06-23 13 0
## 7107 Canada 2021-06-23 1410927 9645
labelData <- subset(temp, Date=='2021-06-23' & Province !='Canada')
subset(temp, Province!='Canada') %>%
ggplot(aes(x=Date, y=Total, colour = Province)) +
geom_text(data=labelData, aes(y=Total, x=Date, label=Province),hjust=-.1) +
geom_line() +
scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15')))+
theme(legend.position = 'none')
labelData <- subset(temp, Date=='2021-04-23' & Province !='Canada')
subset(temp, Province!='Canada') %>%
ggplot(aes(x=Date, y=Active, colour = Province)) +
geom_text(data=labelData, aes(y=Active, x=Date, label=Province),hjust=-.1,colour='black') +
geom_line() +
scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15')))+
theme(legend.position = 'none')
Kool Stuff
Let’s do something cool. The below code generates an animated version of the above plot.
CAUTION: You need gganimate
, gifski
and av
installed. After installing, you need to restart R.
# install.packages('gganimate')
# install.packages('gifski')
# install.packages('av')
library('gganimate')
subset(temp, Date > '2021-01-01' &
Province %in% c('Ontario','Alberta','British Columbia', 'Quebec', 'Saskatchewan')) %>%
ggplot(aes(x=Date, y=Active, colour = Province)) +
geom_line() +
geom_point() +
transition_reveal(Date) +
scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15'))) +
shadow_wake(wake_length = 0.1, alpha = FALSE)
subset(temp, Date > '2021-01-01' &
Province %in% c('Ontario','Alberta','British Columbia', 'Quebec', 'Saskatchewan')) %>%
ggplot(aes(x=Date, y=Active, colour = Province)) +
geom_point() +
transition_reveal(Date) +
scale_x_date(limits = as.Date(c('2021-01-01','2021-07-15'))) +
theme(plot.margin = margin(5.5, 40, 5.5, 5.5)) +
shadow_wake(wake_length = 0.1, alpha = FALSE)
For more examples see https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/.