Lab 1: Data Visualization

Introduction

In this tutorial you will learn how to import data from external sources, preprocess it and plot a simple graph. By the end of the tutorial you will be able to:

read data from xlsx, csv / txt / tsv and rda files and from an SQL database,
identify dimensions and overwrite column and row names,
draw different plots with ggplot2,
visualize continuous (and other numeric) data and categorical data.

Getting Started

We will visualize two datasets:

House Prices data, once used in a Kaggle challenge, which contains the prices and properties of houses in Washington
Ontario Covid Confirmed Cases data

You will need:

tidyverse (install it if not yet installed),
the house_subset, provinces and covid datasets.

You will also need some additional packages:

# install.packages('tidyverse')
# install.packages('RSQLite')
# install.packages('readxl')
# install.packages('leaflet')

Now load the tidyverse library:

library('tidyverse')

Reading Data

R resolves relative file paths against the current working directory, so your data must sit where those paths point. In this lab the files are read from a data folder inside the working directory (for example 'data/provinces.csv').

In RStudio you can see the working directory on the right-hand side, under the Files tab. To change it, browse to the folder you want, press More and click the Set As Working Directory option:

You can also check it by typing getwd() in the console and change it using setwd('path/to/folder'), for example setwd('Documents/MSCI253/Lab/') if that is where the lab files are.
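
Before reading a file it can help to confirm where R is looking. This is a minimal sketch, assuming the lab's layout with a data folder inside the working directory; wd and csv_path are illustrative names only:

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# Build the relative path used in this lab; file.path() joins the parts with '/'
csv_path <- file.path('data', 'provinces.csv')

# Check that the file is reachable before trying to read it
if (file.exists(csv_path)) {
  message('Found ', csv_path)
} else {
  message('Cannot find ', csv_path, ' from ', wd, '; check setwd()')
}
```

If the second message appears, the working directory is probably not the folder that contains data.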

Read from CSV / TXT / TSV

These files all share the same plain-text format, and the extension itself doesn’t matter. Whatever the file is called, the columns are separated by

a comma (,),
a tab (\t),
or some other character.

By default read.csv expects a comma; if the file uses another separator, each line is read as a single column. In addition:

the file may not have a header,
the data may start on the nth row, and so on.

Comma Separated

If your file is comma separated then it must look like this:

Then you can read it easily as:

provinces <- read.csv('data/provinces.csv') # by default it expects , to separate columns
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

The head function returns only the first 6 rows to give you an idea of what kind of data you have (similarly, tail returns the last 6). You can see that the data is read properly. The dimensions of the data are:

dim(provinces)

## [1] 10  3

There is information for 10 Canadian provinces, in 3 columns.

Tab or Anything Else Separated

tab vs semicolon separated

However, when the data is separated by a tab or by some other character (e.g. a semicolon), as above, the default option reads the data incorrectly:

provinces <- read.csv('data/provinces.tsv')
head(provinces)

##                Province.Latitude.Longitude
## 1                   Saskatchewan\t55\t-106
## 2         Prince Edward Island\t46.25\t-63
## 3                         Ontario\t50\t-85
## 4                     Nova Scotia\t45\t-63
## 5                        Alberta\t55\t-115
## 6 British Columbia\t53.726669\t-127.647621

Notice the \t between the values: everything has ended up in a single column:

dim(provinces)

## [1] 10  1

which means the data is not properly read. To read it properly we have to specify the separator:

provinces <- read.csv('data/provinces.tsv', sep='\t')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

If it is separated by semicolons (';'), then:

provinces <- read.csv('data/provinces_semicol.txt', sep=';')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476
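
The effect of sep can be reproduced without any file by passing the contents through the text argument of read.csv. The sketch below uses a made-up two-row sample (not the lab's provinces file) and also shows read.delim, a base R variant whose default separator is the tab:

```r
# A made-up, tab-separated sample: one header line and two rows
tsv_text <- "Province\tLatitude\tLongitude\nOntario\t50\t-85\nAlberta\t55\t-115"

wrong <- read.csv(text = tsv_text)              # default sep = ',' -> a single column
right <- read.csv(text = tsv_text, sep = '\t')  # tell R that the separator is a tab
alt   <- read.delim(text = tsv_text)            # read.delim defaults to sep = '\t'

dim(wrong)  # 2 rows, 1 column
dim(right)  # 2 rows, 3 columns
```

Comparing dim() before and after is a quick way to confirm that the separator was guessed correctly.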

Without Header

Data files do not always have a header. Unless you say otherwise, the first row is read as the header:

provinces <- read.csv('data/provinces_headless.csv')
head(provinces)

##           Saskatchewan      X55      X.106
## 1 Prince Edward Island 46.25000  -63.00000
## 2              Ontario 50.00000  -85.00000
## 3          Nova Scotia 45.00000  -63.00000
## 4              Alberta 55.00000 -115.00000
## 5     British Columbia 53.72667 -127.64762
## 6             Manitoba 53.76086  -98.81387

Instead, you must add header = FALSE:

provinces <- read.csv('data/provinces_headless.csv', header = FALSE)
head(provinces)

##                     V1       V2        V3
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

Now you can write the headers manually:

colnames(provinces) <- c('Province','Latitude','Longitude')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476
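
Reading and naming can also be done in one step: read.csv passes extra arguments on to read.table, which accepts col.names. This is a small sketch on made-up text rather than the lab file, with provinces_small as an illustrative name:

```r
# Two made-up lines with no header row
headless_text <- "Ontario,50,-85\nAlberta,55,-115"

provinces_small <- read.csv(text = headless_text, header = FALSE,
                            col.names = c('Province', 'Latitude', 'Longitude'))

colnames(provinces_small)
nrow(provinces_small)  # both lines are kept as data
```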

Starts from Nth row

If there are lines that you must skip to reach the data, you can say how many with skip:

provinces <- read.csv('data/provinces_nth_row.csv', skip = 2)
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

From Excel

R is open source and cannot read Excel files out of the box. You need an additional package, readxl, to read xlsx files. Once readxl is installed you can attach it with library('readxl') and then use the function read_excel(...). However, we need only one function from it, so instead of attaching the whole package with library('readxl') we can call that function directly with the :: operator:

# install.packages('readxl')

house <- readxl::read_excel('data/house_subset.xlsx')
head(house)

## # A tibble: 6 x 21
##        id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##     <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1  6.41e9 2014… 5.38e5        3      2.25        2570     7242      2          0
## 2  7.24e9 2014… 1.23e6        4      4.5         5420   101930      1          0
## 3  1.32e9 2014… 2.58e5        3      2.25        1715     6819      2          0
## 4  9.30e9 2015… 6.50e5        4      3           2950     5000      2          0
## 5  1.88e9 2014… 3.95e5        3      2           1890    14040      2          0
## 6  7.98e9 2015… 2.30e5        3      1           1250     9774      1          0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>

By default it reads the first sheet. If you want the second sheet, you can type readxl::read_excel('data/house_subset.xlsx', 2) instead. Or you can give the sheet name:

# install.packages('readxl')
house <- readxl::read_excel('data/house_subset.xlsx', 'prices')

From rda (R native format)

R has its own native format. It is sometimes convenient to write to and read from it because it can store any R object, whatever its type:

house <- get(load('data/house_small_subset.rda'))
head(house)

## # A tibble: 6 x 21
##        id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##     <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1  2.52e9 2014… 2.00e6        3      2.75        3050    44867    1            0
## 2  4.22e9 2015… 9.20e5        5      2.25        2730     6000    1.5          0
## 3  9.82e9 2014… 8.85e5        4      2.5         2830     5000    2            0
## 4  2.39e9 2015… 4.80e5        3      1           1040     5060    1            0
## 5  1.48e9 2014… 9.05e5        4      2.5         3300    10250    1            0
## 6  2.29e9 2014… 7.99e5        3      2.5         2140     9897    1            0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
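
To see what get(load(...)) does, here is a round trip through a temporary file with a made-up data frame (scores and rda_path are illustrative names). load() restores the saved objects under their original names and returns those names; get() then fetches the object by name, which only works as a one-liner when the file holds a single object:

```r
scores <- data.frame(id = 1:3, score = c(90, 85, 77))  # made-up data

rda_path <- tempfile(fileext = '.rda')
save(scores, file = rda_path)   # write the object in R's native format
rm(scores)                      # remove it to show that load() brings it back

loaded_names <- load(rda_path)  # recreates 'scores' and returns its name
restored <- get(loaded_names)   # fetch the object by that name

loaded_names
dim(restored)
```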

From SQL

SQL is the standard language for querying relational databases, and most data scientists rely on it. You will learn how to write SQL in the Database course.

We downloaded the Ontario Covid Confirmed Cases data and imported it into SQLite. The database includes only one table, confirmed, which contains details about people with a confirmed Covid diagnosis in Ontario.

You can run SQL code before importing and assign the result to a data.frame:

## install.packages('RSQLite')
library('RSQLite')

con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con, 
          "SELECT Date, City, COUNT(*) as Confirmed
          FROM confirmed
          GROUP BY Date, City
          ORDER BY Date")
dbDisconnect(con)

tail(covid)

##            Date        City Confirmed
## 1645 2020-05-12      London         2
## 1646 2020-05-12 Mississauga         1
## 1647 2020-05-12   Newmarket         1
## 1648 2020-05-12     Toronto         1
## 1649 2020-05-12      Whitby         3
## 1650 2020-05-12     Windsor         5
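
The same kind of query can be tried without the lab database. The sketch below builds a throwaway in-memory SQLite database from a made-up four-row table (the rows are invented; only the table and column names follow the lab) and runs the grouped count on it. It assumes the RSQLite package is installed, which also brings in DBI:

```r
# Made-up rows standing in for the real 'confirmed' table
toy <- data.frame(
  Date = c('2020-05-11', '2020-05-12', '2020-05-12', '2020-05-12'),
  City = c('Toronto', 'Toronto', 'Toronto', 'Ottawa')
)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ':memory:')  # nothing is written to disk
DBI::dbWriteTable(con, 'confirmed', toy)

counts <- DBI::dbGetQuery(con,
  "SELECT Date, City, COUNT(*) AS Confirmed
   FROM confirmed
   GROUP BY Date, City
   ORDER BY Date")
DBI::dbDisconnect(con)

counts  # one row per (Date, City) pair
```

Doing the aggregation in SQL means only the summary rows are brought into R, which matters when the table is large.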

In this lab, we need the data in full detail so that we can play with it:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Directly From Online Source

You can also read data into R directly from its source. The data below comes from the Government of Canada website:

covid.canada <- read.csv(url('https://health-infobase.canada.ca/src/data/covidLive/covid19.csv')) 
head(covid.canada)

##   pruid           prname             prnameFR       date update numconf numprob
## 1    35          Ontario              Ontario 31-01-2020     NA       3       0
## 2    59 British Columbia Colombie-Britannique 31-01-2020     NA       1       0
## 3     1           Canada               Canada 31-01-2020     NA       4       0
## 4    35          Ontario              Ontario 08-02-2020     NA       3       0
## 5    59 British Columbia Colombie-Britannique 08-02-2020     NA       4       0
## 6     1           Canada               Canada 08-02-2020     NA       7       0
##   numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1         0        3        NA        0         NA                        NA
## 2         0        1        NA        0         NA                        NA
## 3         0        4        NA        0         NA                        NA
## 4         0        3        NA        0         NA                        NA
## 5         0        4        NA       63         NA                        NA
## 6         0        7        NA       63         NA                        NA
##   ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1        NA        3         300      0.02          0              0
## 2        NA        1         100      0.02          0              0
## 3        NA        4         400      0.01          0              0
## 4        NA        0           0      0.02          0              0
## 5        12        3         300      0.08          0              0
## 6         2        3          75      0.02          0              0
##   percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1            0             NA            NA                NA           100
## 2            0             NA            NA                NA           100
## 3            0             NA            NA                NA           100
## 4            0             NA            NA                NA           100
## 5            0             NA            NA                NA           100
## 6            0             NA            NA                NA           100
##   numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1         3       0.02              NA               NA               NA
## 2         1       0.02              NA               NA               NA
## 3         4       0.01              NA               NA               NA
## 4         3       0.02              NA               NA               NA
## 5         4       0.08              NA               NA               NA
## 6         7       0.02              NA               NA               NA
##   ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1                NA             NA              NA              NA
## 2                NA             NA              NA              NA
## 3                NA             NA              NA              NA
## 4                NA             NA              NA              NA
## 5                NA             NA              NA              NA
## 6                NA             NA              NA              NA
##   ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1               NA             NA                 NA              NA
## 2               NA             NA                 NA              NA
## 3               NA             NA                 NA              NA
## 4               NA             NA                 NA              NA
## 5               NA             NA                 NA              NA
## 6               NA             NA                 NA              NA
##   avgratedeaths_last7
## 1                  NA
## 2                  NA
## 3                  NA
## 4                  NA
## 5                  NA
## 6                  NA

Visualization

Knowing your data

Before playing with the data it is best to know its properties:

head(data) and tail(data) return the top 6 and bottom 6 rows
str(data) returns the structure of the data
dim(data) returns the dimensions: # of rows and # of columns
- nrow(data) returns only the # of rows
- ncol(data) returns only the # of columns
colnames(data) and rownames(data) return the column and row names
unique(series) returns the distinct values in the series

house <- get(load('data/house_small_subset.rda'))
dim(house)

## [1] 841  21

There are 841 rows (houses) in the dataset and 21 columns of information.

tail(house)

## # A tibble: 6 x 21
##        id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##     <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1  1.49e9 2014… 4.20e5        3      2.5         1470     1571      2          0
## 2  7.90e9 2014… 3.90e5        3      3.25        1370      913      2          0
## 3  8.89e8 2014… 5.99e5        3      1.75        1650     1180      3          0
## 4  9.52e8 2014… 3.80e5        3      2.5         1260      900      2          0
## 5  1.91e8 2015… 1.58e6        4      3.25        3410    10125      2          0
## 6  3.00e9 2015… 4.75e5        3      2.5         1310     1294      2          0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>

Apart from the price of the house, there is information about the number of bathrooms, the square footage of the living space, the number of floors, whether or not the house is on the waterfront, its location and so on. One problem in our data is that the zipcode, although a categorical variable, is imported as a number (num) because it consists of digits.

Zipcode is a column of the dataset. Below we show the unique zipcodes, which tell us how many neighbourhoods are included in the dataset:

unique(house$zipcode)

## [1] 98040 98105 98116
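
These inspection functions work on any data frame. As a quick sketch, here they are on mtcars, a dataset that ships with R (it is not part of this lab's data); cyl_values is an illustrative name:

```r
dim(mtcars)      # 32 rows, 11 columns
nrow(mtcars)     # rows only
ncol(mtcars)     # columns only
str(mtcars$cyl)  # stored as numbers, much like zipcode

cyl_values <- sort(unique(mtcars$cyl))
cyl_values       # only three distinct values: 4, 6 and 8
```

A numeric column with only a handful of distinct values, like cyl here, is often a categorical variable in disguise.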

Visualizing Numeric Data

In this subsection we will try to understand the relation between two continuous variables: sqft_living, the living space of the house, and its price.

Before starting, let’s give an overview of plotting. Plots in ggplot2 are created by stacking layers. The layers here refer to the background, the points in the plot, the title and many more. To plot a graph:

You first create the space and give its data,
Put the points, lines, etc.
Add titles
Tweak label and title sizes (if necessary)
Change the layout (theme)
And so on

# 1. The space
ggplot(data = house)

As you can see it is an empty plot. We must add the points layer:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price))
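
Putting the later steps together, the sketch below adds titles, a theme and a size tweak to a scatter plot. It uses mtcars, which ships with R, so that it runs without the lab files; the title and axis labels are made up. Note the order: a complete theme such as theme_minimal() resets earlier theme() tweaks, so it comes first.

```r
library(ggplot2)

p <- ggplot(data = mtcars) +                        # 1. the space and its data
  geom_point(aes(x = wt, y = mpg)) +                # 2. the points
  labs(title = 'Fuel economy vs. weight',           # 3. titles
       x = 'Weight (1000 lbs)', y = 'Miles per gallon') +
  theme_minimal() +                                 # 5. the layout (theme)
  theme(plot.title = element_text(size = 16))       # 4. tweak the title size

p  # printing the object draws the plot
```

Because the plot is stored in p, further layers can be added later with p + ... without retyping the earlier ones.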

Other Visuals

Boxplot

Similarly, if you want a boxplot of prices you can write:

ggplot(data = house) + 
  geom_boxplot(aes(x = price))

Notice that for the scatter plot we mapped two variables, x and y, whereas for the boxplot we used only x.

Histogram

ggplot(data = house) + 
  geom_histogram(aes(x = price))

Density

ggplot(data = house) + 
  geom_density(aes(x = price))

Histogram & Density

ggplot(data = house, aes(x = price)) + 
  geom_histogram(aes(y=..density..), fill='steelblue', alpha=.6, color='grey75') + 
  geom_density()

There are many kinds of graphs you can plot with ggplot2, including but not limited to:

line: geom_line()
point: geom_point()
barplot: geom_bar()
boxplot: geom_boxplot()
density: geom_density()
histogram: guess what (geom_histogram())
and more

Aesthetic Mapping vs. Assignment: Colours, sizes, shapes and else

There is a very important distinction for you to keep in mind:

Aesthetic mapping: the aes(...) function is a mapping function. It maps variables in the data to shapes, sizes and colours, so the appearance changes with the data.
Assignment: instead of mapping (dynamic), you assign a single fixed colour, size etc. to all the points.

Let’s begin with aesthetic mapping. The colour can be given with the colour parameter inside the aes() function:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour = zipcode, size=floors))

On the other hand you can assign a specific colour, say firebrick, to the points by writing outside of the aes() function:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price) , colour = 'firebrick', size = 2)
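
A common slip is to put a fixed colour inside aes(). ggplot2 then treats the string as data with a single category and picks a colour from its default palette. The sketch below, on the built-in mtcars data rather than the lab data, compares the two; layer_data() returns the values ggplot2 actually draws with, and p_assigned and p_mapped are illustrative names:

```r
library(ggplot2)

# Assigned: colour is outside aes(), so every point really is firebrick
p_assigned <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg), colour = 'firebrick')

# Mapped by mistake: 'firebrick' is treated as a one-level category
p_mapped <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = 'firebrick'))

unique(layer_data(p_assigned)$colour)  # the colour we asked for
unique(layer_data(p_mapped)$colour)    # a default palette colour instead
```

The mapped version also gains a legend with a single entry, which is usually the first visible sign of this mistake.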

Or a combination of both

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode) , size = 3)

alpha, the little touch that makes things beautiful

alpha is the opacity of the object: 0 is fully transparent and 1 is fully opaque. For crowded data like the below it makes a lot of difference:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode), size=2, alpha=0.6)

You can also combine it with a few other options so that the plot tells you more about the data:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price), alpha=0.2, size =4, colour='firebrick') + 
  theme_classic()

Converting to categorical

As we discussed above, the zipcode variable is categorical, not continuous, but since it consists of digits it is read as numeric and therefore treated as a continuous variable. We must know this in advance and convert it during preprocessing:

house$zipcode  <- as.factor(house$zipcode)

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode, size = floors), alpha=0.5)
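
What as.factor() changes can be seen on a small made-up vector. One thing to watch for when going back: as.numeric() on a factor returns the internal level codes, not the original numbers, so convert through as.character() first.

```r
zip <- c(98040, 98105, 98116, 98105)  # made-up sample of zipcodes
is.numeric(zip)                        # TRUE: R sees plain numbers

zip_f <- as.factor(zip)
levels(zip_f)                          # the distinct categories, stored as text
is.numeric(zip_f)                      # FALSE: now categorical

codes    <- as.numeric(zip_f)                # level codes: 1, 2, 3, 2
original <- as.numeric(as.character(zip_f))  # the zipcodes again
```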

Shape

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode, shape = zipcode), alpha=0.5, size = 3)

The shapes are coded with numbers. If you want to assign a shape you must know its code:

knitr::include_graphics('pics/shapes-1.png')

Source: R4DS

Try out yourself

Produce the following:

Assign colour = 'black', shape = 23, stroke=2, alpha=.3, size=3
Use aesthetic mapping to change the fill only for the waterfront houses with more than 2 bathrooms: fill = waterfront == 1 & bathrooms >2
Add classical theme: ggplot(...) + geom_point(...) + theme_classic()

Smoothers

To see the trend in the data we may choose to add smoothers on the plot:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price), size=3, alpha=.5) +
  geom_smooth(aes(x = sqft_living, y = price), color='firebrick')

If we make the colour change with the zipcode by adding it inside aes(), then:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price), size=3, alpha=.5) +
  geom_smooth(aes(x = sqft_living, y = price, color=zipcode))

Adding aes into top level

You can see that it is becoming verbose. Both layers take almost the same aes() arguments, yet we have to write them twice. Instead we can move aes() to the top level and give it the parameters that are common to all layers. Let’s also remove the confidence bands:

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3, alpha=.5) +
  geom_smooth(aes(color=zipcode), se = FALSE)

As you can see above, aes(color=zipcode) inside geom_smooth() is added to the top-level aes(x=sqft_living, y=price). We can also override what a layer inherits if necessary. For example, we may think a smoother is unnecessary for zipcode == 98040, so we can override the data the smoother is applied to:

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3, alpha=.5) +
  geom_smooth(data = subset(house, zipcode != 98040), aes(color=zipcode))

ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) + 
  geom_point(size=3,alpha=.5) +
  geom_smooth(color='black')

Facets

You may have noticed that the data is too dense and the points overlap. We can get some insights, such as that houses are cheaper in the neighbourhood where zipcode == 98116, but it is still hard to see the trends.

Facets can help us to see the details:

ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) + 
  geom_point(size=3,alpha=.5) +
  facet_wrap( ~ zipcode)

We can also change the orientation of the facets and add the smoothers back:

ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) + 
  geom_point(size=3,alpha=.5) +
  geom_smooth(color='black') +
  facet_wrap( ~ zipcode, dir='v')

Overwriting the facet captions:

zipcode.labs <- c('Mercer Island','University District', 'West Seattle')
names(zipcode.labs) <- c('98040','98105','98116')

ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) + 
  geom_point(size=3,alpha=.5) +
  geom_smooth(color='black') +
  facet_wrap( ~ zipcode, dir='v', labeller = labeller(zipcode = zipcode.labs))

Facet grids

Facet grids are also a good option when you want the scatter plot broken down by two categorical variables:

ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) + 
  geom_point(size=3,alpha=.5) +
  facet_grid(waterfront ~ zipcode)

Themes

There are some built-in themes available to make your plot look as beautiful as it deserves. Here are some of them:

theme_bw

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_bw()

theme_classic

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_classic()

theme_dark

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_dark()

theme_light

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_light()

theme_linedraw

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_linedraw()

theme_minimal

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_minimal()

theme_void

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_void()

Visualizing Categorical Data

We cannot use the methods we used for numeric data on categorical data. For example, the covid data doesn’t have a continuous variable (except longitude and latitude). Questions such as which gender COVID-19 affects the most, or how severe the pandemic is in different cities, cannot be answered with scatter plots or boxplots. We need to visualize the counts of people falling into each category:

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

ggplot(data = covid) + 
  stat_count(mapping = aes(y=Age_Group))

This is a bar plot, but stat_count() alone doesn’t give us the flexibility we want. Instead we can use geom_bar() together with the stat() function:

ggplot(data = covid) + 
  geom_bar(aes(y = Age_Group, x = stat(count), color=Age_Group), alpha=.6)

stat(count) gives us what we want: count is a variable computed by the layer’s statistic, the number of rows falling into each category. We can also do arithmetic on it. Instead of counts on the x axis, we can display proportions by dividing by the number of rows in the covid dataset (recent versions of ggplot2 deprecate stat() in favour of after_stat(), which is used the same way):

ggplot(data = covid) + 
  geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), color=Age_Group), alpha=.6)
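
The numbers behind such a bar chart can be checked in base R: table() gives what count holds for each category, and dividing by the number of rows gives the proportions. This is a sketch on a made-up vector of age groups, not the lab data:

```r
age_group <- c('20s', '20s', '30s', '50s', '50s', '50s')  # made-up sample

counts <- table(age_group)            # what count holds for each bar
props  <- counts / length(age_group)  # same idea as count / nrow(covid)

counts
props
```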

Now let’s add more information to the plot. What is the proportion of men in each group?

ggplot(data = covid) + 
  geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Gender), alpha=.8)

Bar Chart Types (Position adjustments)

Let’s begin by asking: what is the recovery rate?

ggplot(data = covid) + 
  geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Outcome1), alpha=.7)

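Although we can see that older age groups have more fatal outcomes, the bars have different lengths, which makes the shares hard to compare. A 100% stacked bar chart, where every bar is rescaled to the same length, helps here. This can be done by adding position="fill". Also let’s filter out the unknown groups (using subset()):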

ggplot(data = subset(covid, Age_Group != 'Unknown')) + 
  geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Outcome1), alpha=.7, position="fill")

Another way of comparing exact numbers among groups is plotting bars one next to another (position = "dodge"):

ggplot(data = subset(covid, Age_Group != 'Unknown')) + 
  geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Outcome1), alpha=.7, position="dodge")
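
What position = "fill" does to the bars can be reproduced with a cross-tabulation: each row (age group) is rescaled so that its outcomes sum to 1, whereas "stack" and "dodge" keep the raw counts. This is a sketch on made-up rows that only borrow the lab's column names:

```r
toy <- data.frame(
  Age_Group = c('20s', '20s', '20s', '20s', '80s', '80s'),
  Outcome1  = c('Resolved', 'Resolved', 'Resolved', 'Not Resolved',
                'Resolved', 'Fatal')
)  # made-up rows, not the lab data

counts <- table(toy$Age_Group, toy$Outcome1)  # heights used by "stack" and "dodge"
shares <- prop.table(counts, margin = 1)      # heights used by "fill": rows sum to 1

counts
shares
```

This is why the filled chart is good for comparing shares between groups but hides how many cases each group has.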

Playing with Coordinates

Let’s go back to the bar chart, but swap the axes (x with y):

ggplot(data = subset(covid, Age_Group != 'Unknown')) + 
  geom_bar(aes(x = Age_Group, y = stat(count/nrow(covid)), fill=Outcome1), alpha=.8)

It is very easy to do magic with the above one:

ggplot(data = subset(covid, Age_Group != 'Unknown')) + 
  geom_bar(aes(x = Age_Group, y = stat(count/nrow(covid)), fill=Outcome1), alpha=.8) +
  coord_polar() +
  theme_bw()

Let’s do more magic:

cnd <- subset(map_data("world"), region=='Canada')
ggplot(data = cnd, aes(x=long, y=lat, group=group)) + 
   geom_polygon(fill = "white", colour = "black") +
  coord_quickmap() + 
  annotate('point',x=-80.5393899, y=43.4704571, size=5, alpha=.5, color='firebrick') +
  annotate('text',x=-80.5393899, y=43.4704571, size=5, alpha=.5, label='You are here', vjust=1,hjust=-0.1) +
  theme_void()

at least I am there.

You will be proficient in magic later in this course

library(lubridate)
cnd               <- subset(map_data("world"), region=='Canada')
covid.canada$date <- dmy(covid.canada$date)

last_info <- subset(covid.canada, date == max(date, na.rm = TRUE)) # most recent date in the data (Sys.Date() - 1 finds nothing if yesterday is missing)
last_info <- last_info[,c(2,6)]
 
locs <- read.csv('data/provinces.csv', header = TRUE)
colnames(locs) <- c('prname','lat','long') # the file's column order is Province, Latitude, Longitude

last_info <- merge(last_info, locs, by='prname')

ggplot(data = cnd, aes(x=long, y=lat, group=group)) + 
   geom_polygon(fill = "white", colour = "black") +
   coord_quickmap() + 
  geom_point(data = last_info, mapping = aes(x = long, y = lat, size=numconf, group=1, color=prname),  alpha=.7) +
  theme_void()

And some more:

# install.packages('leaflet')
library('leaflet')
leaflet(last_info) %>%
  addTiles() %>%  # use the default base map which is OpenStreetMap tiles
  addCircleMarkers(lng = ~long, lat = ~lat,  radius = ~numconf/10000,
             popup=paste0(last_info$prname, '<br>' , last_info$numconf))
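
The merge() call used above to attach coordinates to the case counts is an inner join: only keys present in both data frames survive. This is a sketch with made-up counts, where cases, locs and joined are illustrative names:

```r
cases <- data.frame(prname  = c('Ontario', 'Alberta', 'Yukon'),
                    numconf = c(300, 120, 5))  # made-up counts
locs  <- data.frame(prname = c('Alberta', 'Ontario'),
                    lat    = c(55, 50),
                    long   = c(-115, -85))

joined <- merge(cases, locs, by = 'prname')  # Yukon has no coordinates, so it is dropped
joined
```

Checking nrow() before and after a merge is a cheap way to notice rows that were silently dropped because of a missing or misspelled key.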

Exercises (not deliverables)

This part is just an exercise. You will not be marked.

Produce the following:

Assign colour = 'black', shape = 23, stroke=2, alpha=.3, size=3
Use aesthetic mapping to change the fill only for the waterfront houses with more than 2 bathrooms: fill = waterfront == 1 & bathrooms >2
Add classical theme: ggplot(...) + geom_point(...) + theme_classic()

Using house data, plot the histogram of different locations:

Plot the histogram with changing in colours w.r.t. zipcode (aes) and assign alpha = 0.7
add a facet, facet_wrap(...)

Draw scatter plot with sqft_basement on x and sqft_above on y:

Assign colour = 'firebrick' and alpha = 0.3
Add a smoother and assign the smoother’s colour = 'black'
Do you think there is a relation between the two variables? If so, is the relation linear?

Draw a barplot of people who were affected by COVID-19 by Age_Group and Outcome1 in Waterloo

Instead of using data=covid, use data=subset(covid, City == 'Waterloo')
Your plot should be horizontal and the bars should be grouped by Age_Group
The colours of the bars will show Fatal, Not Resolved and Resolved

Compare Male and Female affected by COVID-19 in Ontario by Age_Group using radial coordinates:

First draw a column chart and filter the age group ‘Unknown’ out.
Put Age_Group on x axis and stat(count) on y.
Use aesthetic mapping to fill = Outcome1
Add coord_polar() and theme_bw()