Lab 3: Visualizing Categorical Variables

Introduction

In this tutorial you will learn how to derive basic insights from categorical data and visualize joint probabilities. In particular, we will visualize discrete distributions, both univariate and multivariate. By the end of the tutorial you will be able to:

draw bar charts, column charts and treemaps,
visualize the distribution of more than one categorical variable,
draw eikosograms to visualize independence and Bayes' rule,
interpret these graphs.

Getting Started

We will visualize two COVID-19 confirmed cases datasets. To follow along, you need:

to download the covid.db data uploaded on LEARN
some packages, including

tidyverse
eikosograms
RSQLite
treemapify
waffle
ggthemes
hrbrthemes
leaflet

Now, let’s load the tidyverse package:

library(tidyverse)

and read the covid dataset:

library('RSQLite')

con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Aggregating Data

Before diving into visualizing discrete distributions, we need to learn how to pre-process the data so that it can be plotted. In particular, we will show:

Counting the number of people who were infected by COVID-19
Grouping based on the outcome, age, gender and/or location

There are various ways to pre-process data in R. The most common is to load all the data into RAM and process it with tidyverse. In industry, however, datasets are often too big to fit in RAM, so it is wise to use R together with SQL.

SQL

We briefly visited SQL to show how to import the data. Now, we will use it to preprocess the data into the shape that we need.

Basic SQL Functions

SQL is a very intuitive language; it is very much like talking to the computer. Let’s visit the vocabulary.

SELECT

The SELECT command chooses which columns (variables) of the dataset to return. Using * selects all of them:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Instead, we can specify the columns as below:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group, Gender, Outcome1, City 
                          FROM confirmed")
dbDisconnect(con)

head(covid)

##         Date Age_Group Gender     Outcome1        City
## 1 2020-04-29       50s FEMALE Not Resolved Mississauga
## 2 2020-03-04       40s FEMALE     Resolved      Ottawa
## 3 2020-05-04       20s FEMALE Not Resolved     Windsor
## 4 2020-05-02       50s FEMALE Not Resolved    Waterloo
## 5 2020-04-28       50s   MALE Not Resolved   Newmarket
## 6 2020-04-10       30s   MALE     Resolved      Ottawa
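Every query in this lab repeats the same connect, query and disconnect steps. One way to reduce the repetition is a small wrapper function. The sketch below is only an illustration: run_query is a hypothetical helper name (it is not part of RSQLite), and the toy database written to a temporary file stands in for covid.db so that the example runs on its own.

```r
library(RSQLite)

# Hypothetical helper (not part of the lab): open, query, and always close.
run_query <- function(sql, dbname = "covid.db") {
  con <- dbConnect(RSQLite::SQLite(), dbname = dbname)
  on.exit(dbDisconnect(con))  # runs even if the query fails
  dbGetQuery(con, sql)
}

# Toy database (made-up rows) in a temporary file, standing in for covid.db
toy_db <- tempfile(fileext = ".db")
con <- dbConnect(RSQLite::SQLite(), dbname = toy_db)
dbWriteTable(con, "confirmed",
             data.frame(Age_Group = c("50s", "40s"),
                        City      = c("Mississauga", "Ottawa")))
dbDisconnect(con)

toy <- run_query("SELECT Age_Group, City FROM confirmed", dbname = toy_db)
toy
```

With the real data, the call would simply be run_query("SELECT Date, Age_Group FROM confirmed"), assuming covid.db is in the working directory.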

AS

Sometimes the original column name is not very intuitive or just too detailed. We may want to rename it:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 AS Outcome, City 
                          FROM confirmed")
dbDisconnect(con)

head(covid)

##         Date Age Gender      Outcome        City
## 1 2020-04-29 50s FEMALE Not Resolved Mississauga
## 2 2020-03-04 40s FEMALE     Resolved      Ottawa
## 3 2020-05-04 20s FEMALE Not Resolved     Windsor
## 4 2020-05-02 50s FEMALE Not Resolved    Waterloo
## 5 2020-04-28 50s   MALE Not Resolved   Newmarket
## 6 2020-04-10 30s   MALE     Resolved      Ottawa

ORDER BY

To order the data, we use the command ORDER BY:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 as Outcome, City 
                          FROM confirmed
                          ORDER BY Date")
dbDisconnect(con)

head(covid)

##         Date Age Gender  Outcome        City
## 1 2020-01-01 80s   MALE Resolved      Simcoe
## 2 2020-01-10 40s FEMALE Resolved     Toronto
## 3 2020-01-15 60s   MALE Resolved     Toronto
## 4 2020-01-19 20s   MALE Resolved Mississauga
## 5 2020-01-21 50s   MALE Resolved     Toronto
## 6 2020-01-22 50s FEMALE Resolved     Toronto

or, to sort in descending order, add DESC:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 as Outcome, City 
                          FROM confirmed
                          ORDER BY Date DESC")
dbDisconnect(con)

head(covid)

##         Date Age Gender      Outcome        City
## 1 2020-05-12 <20   MALE Not Resolved      London
## 2 2020-05-12 60s   MALE Not Resolved      Barrie
## 3 2020-05-12 20s   MALE Not Resolved     Windsor
## 4 2020-05-12 60s   MALE Not Resolved Mississauga
## 5 2020-05-12 <20 FEMALE Not Resolved      London
## 6 2020-05-12 40s FEMALE Not Resolved     Windsor

COUNT and GROUP BY

Instead of just retrieving the data, we can extract the information we need on the fly. For example, we can COUNT the number of rows, or compute the SUM or average (AVG) of a column.

The code below counts the rows without any constraint, so it returns the number of rows in the dataset:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT COUNT(*) AS Cases
                          FROM confirmed")
dbDisconnect(con)

head(covid)

##   Cases
## 1 21236

We may ask how many people tested positive in each city within each age group. The code below looks like it should do this, but one clause is missing, so it returns a single row with the overall count:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT Age_Group, City, COUNT(*) AS Cases 
                          FROM confirmed")
dbDisconnect(con)

head(covid)

##   Age_Group        City Cases
## 1       50s Mississauga 21236

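If we want to count the rows in each category of a categorical variable, we use GROUP BY. Note that City below is not part of the grouping, so the value shown for it is just one city taken from each age group: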

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT Age_Group AS Age, City, COUNT(*) AS Cases 
                          FROM confirmed
                          GROUP BY Age")
dbDisconnect(con)

head(covid)

##   Age        City Cases
## 1 20s     Windsor  2454
## 2 30s      Ottawa  2591
## 3 40s      Ottawa  2936
## 4 50s Mississauga  3545
## 5 60s      Ottawa  2644
## 6 70s    Waterloo  1900
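The effect of GROUP BY is easiest to check on a table small enough to count by hand. The sketch below uses an in-memory SQLite database with five made-up rows (not the lab data) and compares the ungrouped count with a count grouped by every non-aggregated column.

```r
library(RSQLite)

# Toy table (made-up rows) in an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "confirmed",
             data.frame(Age_Group = c("20s", "20s", "30s", "30s", "30s"),
                        City      = c("Ottawa", "Windsor", "Ottawa", "Ottawa", "Windsor")))

# No GROUP BY: a single row, COUNT(*) covers the whole table
total <- dbGetQuery(con, "SELECT COUNT(*) AS Cases FROM confirmed")

# GROUP BY both columns: one row per (Age, City) combination
by_both <- dbGetQuery(con, "SELECT Age_Group AS Age, City, COUNT(*) AS Cases
                            FROM confirmed
                            GROUP BY Age_Group, City")
dbDisconnect(con)

total
by_both
```

Listing every selected non-aggregated column in GROUP BY avoids the situation above, where a column outside the grouping only shows one value per group.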

Combining All

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")

covid <- dbGetQuery(con, "SELECT Age_Group as Age, City, COUNT(*) AS Cases 
                          FROM confirmed
                          GROUP BY Age, City
                          ORDER BY City DESC")
dbDisconnect(con)

head(covid)

##   Age    City Cases
## 1 20s Windsor   121
## 2 30s Windsor   105
## 3 40s Windsor    99
## 4 50s Windsor   112
## 5 60s Windsor    81
## 6 70s Windsor    54

Tidyverse

We don’t always need SQL to aggregate things. Typically, SQL is used to pull the data out of the database already aggregated to a useful level; we then continue processing in R before plotting.

Tidyverse contains powerful functions to process the data in a similar fashion to SQL. Let’s see how they work.

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Select

To select the columns that we are interested in, we can use select as below. Note that I use %>% head() just to display the first 6 rows to save space. It has nothing to do with select functionality:

select(covid, Age_Group,Outcome1,City) %>% head() # head is just for returning the first 6 lines

##   Age_Group     Outcome1        City
## 1       50s Not Resolved Mississauga
## 2       40s     Resolved      Ottawa
## 3       20s Not Resolved     Windsor
## 4       50s Not Resolved    Waterloo
## 5       50s Not Resolved   Newmarket
## 6       30s     Resolved      Ottawa

You may have noticed the power of %>%. It is called a pipe, and it passes the output of the function on its left as the first argument of the function on its right. We can do the same thing as above using only pipes:

covid %>% select(Age_Group,Outcome1,City) %>% head() # head is just for returning the first 6 lines

##   Age_Group     Outcome1        City
## 1       50s Not Resolved Mississauga
## 2       40s     Resolved      Ottawa
## 3       20s Not Resolved     Windsor
## 4       50s Not Resolved    Waterloo
## 5       50s Not Resolved   Newmarket
## 6       30s     Resolved      Ottawa
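To see that the pipe only changes how the code reads and not what it computes, the nested and piped forms can be compared directly. This is a minimal sketch on a made-up three-row data frame (not the lab data), loading dplyr on its own rather than the whole tidyverse.

```r
library(dplyr)

# Made-up rows with the same column names as the lab data
toy <- data.frame(Age_Group = c("50s", "40s", "20s"),
                  Outcome1  = c("Not Resolved", "Resolved", "Not Resolved"),
                  City      = c("Mississauga", "Ottawa", "Windsor"))

nested <- head(select(toy, Age_Group, City), 2)        # read inside-out
piped  <- toy %>% select(Age_Group, City) %>% head(2)  # read left-to-right

identical(nested, piped)
```

The piped version lists the steps in the order they are applied, which becomes more valuable as the chain gets longer.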

Rename

Similar to the AS keyword in SQL, we can rename columns, using the order NewName = OldName:

rename(covid, Age=Age_Group) %>% head() # head is just for returning the first 6 lines

##   Row_ID       Date Age Gender                 Acquisition     Outcome1
## 1      1 2020-04-29 50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04 40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04 20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02 50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28 50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10 30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

or you can use pipes:

covid %>% rename(Age=Age_Group) %>% head()

##   Row_ID       Date Age Gender                 Acquisition     Outcome1
## 1      1 2020-04-29 50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04 40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04 20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02 50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28 50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10 30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Arrange

arrange is the function that does the same job as ORDER BY, which sorts the rows:

arrange(covid,Date) %>% head()

##   Row_ID       Date Age_Group Gender                 Acquisition Outcome1
## 1   4842 2020-01-01       80s   MALE                     Neither Resolved
## 2   3121 2020-01-10       40s FEMALE Contact of a confirmed case Resolved
## 3   3468 2020-01-15       60s   MALE         Information pending Resolved
## 4  15828 2020-01-19       20s   MALE                     Neither Resolved
## 5  11282 2020-01-21       50s   MALE              Travel-Related Resolved
## 6  10437 2020-01-22       50s FEMALE              Travel-Related Resolved
##                   Reporting_PHU                        Address        City
## 1 Haldimand-Norfolk Health Unit            12 Gilbertson Drive      Simcoe
## 2         Toronto Public Health 277 Victoria Street, 5th Floor     Toronto
## 3         Toronto Public Health 277 Victoria Street, 5th Floor     Toronto
## 4            Peel Public Health         7120 Hurontario Street Mississauga
## 5         Toronto Public Health 277 Victoria Street, 5th Floor     Toronto
## 6         Toronto Public Health 277 Victoria Street, 5th Floor     Toronto
##   Postal_Code                                 Reporting_PHU_Website    Latitude
## 1     N3Y 4N5                                          www.hnhu.org 42.84782526
## 2     M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 3     M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 4     L5W 1N4                             www.peelregion.ca/health/  43.6474713
## 5     M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 6     M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
##      Longitude
## 1 -80.30381491
## 2 -79.37935801
## 3 -79.37935801
## 4  -79.7088933
## 5 -79.37935801
## 6 -79.37935801

We can also sort in descending order:

arrange(covid, desc(Date)) %>% head()

##   Row_ID       Date Age_Group Gender         Acquisition     Outcome1
## 1     50 2020-05-12       <20   MALE Information pending Not Resolved
## 2    314 2020-05-12       60s   MALE Information pending Not Resolved
## 3    320 2020-05-12       20s   MALE Information pending Not Resolved
## 4    530 2020-05-12       60s   MALE Information pending Not Resolved
## 5    641 2020-05-12       <20 FEMALE Information pending Not Resolved
## 6    645 2020-05-12       40s FEMALE Information pending Not Resolved
##                         Reporting_PHU                Address        City
## 1        Middlesex-London Health Unit         50 King Street      London
## 2 Simcoe Muskoka District Health Unit      15 Sperling Drive      Barrie
## 3    Windsor-Essex County Health Unit  1005 Ouellette Avenue     Windsor
## 4                  Peel Public Health 7120 Hurontario Street Mississauga
## 5        Middlesex-London Health Unit         50 King Street      London
## 6    Windsor-Essex County Health Unit  1005 Ouellette Avenue     Windsor
##   Postal_Code       Reporting_PHU_Website    Latitude    Longitude
## 1     N6A 5L7          www.healthunit.com 42.98146842 -81.25401572
## 2     L4M 6K9 www.simcoemuskokahealth.org 44.41071258 -79.68630597
## 3     N9A 4J8               www.wechu.org  42.3087965  -83.0336705
## 4     L5W 1N4   www.peelregion.ca/health/  43.6474713  -79.7088933
## 5     N6A 5L7          www.healthunit.com 42.98146842 -81.25401572
## 6     N9A 4J8               www.wechu.org  42.3087965  -83.0336705

group_by and summarize

We can use group_by together with summarize to aggregate the data. Many summary functions are available; today we will only cover n(), which counts the number of rows:

group_by(covid, Age_Group) %>% summarize(Cases = n())

## # A tibble: 10 x 2
##    Age_Group Cases
##    <chr>     <int>
##  1 <20         571
##  2 20s        2454
##  3 30s        2591
##  4 40s        2936
##  5 50s        3545
##  6 60s        2644
##  7 70s        1900
##  8 80s        2661
##  9 90s        1920
## 10 Unknown      14

Similar to SQL, group_by allows grouping with many variables:

covid %>% group_by(Age_Group, Gender) %>% summarize(Cases = n())

## # A tibble: 36 x 3
## # Groups:   Age_Group [10]
##    Age_Group Gender      Cases
##    <chr>     <chr>       <int>
##  1 <20       FEMALE        284
##  2 <20       MALE          285
##  3 <20       TRANSGENDER     1
##  4 <20       UNKNOWN         1
##  5 20s       FEMALE       1371
##  6 20s       MALE         1079
##  7 20s       TRANSGENDER     1
##  8 20s       UNKNOWN         3
##  9 30s       FEMALE       1426
## 10 30s       MALE         1156
## # … with 26 more rows
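Counting rows per group is common enough that dplyr also offers count() as a shorthand for group_by followed by summarize with n(). The sketch below uses made-up rows (not the lab data) to compare the two forms; the name argument sets the name of the count column.

```r
library(dplyr)

# Made-up rows with the same column names as the lab data
toy <- data.frame(Age_Group = c("20s", "20s", "30s", "30s", "30s"),
                  Gender    = c("FEMALE", "MALE", "FEMALE", "FEMALE", "MALE"))

long_form <- toy %>%
  group_by(Age_Group, Gender) %>%
  summarize(Cases = n()) %>%
  ungroup()

short_form <- toy %>% count(Age_Group, Gender, name = "Cases")

long_form
short_form
```

Both tables contain one row per Age_Group and Gender combination, with the same counts.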

Using Pipes

Now we can combine everything we have learned. The most straightforward way is to store the output of select under a new name, process that new data and store the result under another name, and so on. This quickly becomes tedious and wastes memory. Instead, we can use pipes to feed the output of each step into the next:

covid %>% 
  rename(Age=Age_Group, Outcome=Outcome1) %>% 
  group_by(Age, Outcome) %>% 
  summarize(Cases = n()) %>% 
  arrange(Outcome)

## # A tibble: 28 x 3
## # Groups:   Age [10]
##    Age   Outcome      Cases
##    <chr> <chr>        <int>
##  1 20s   Fatal            2
##  2 30s   Fatal            5
##  3 40s   Fatal           16
##  4 50s   Fatal           58
##  5 60s   Fatal          139
##  6 70s   Fatal          295
##  7 80s   Fatal          651
##  8 90s   Fatal          599
##  9 <20   Not Resolved   140
## 10 20s   Not Resolved   439
## # … with 18 more rows

Visualizing Categorical Variables

Bar plots (and column charts) are very similar to histograms. In a histogram, we group a continuous variable into bins that are chosen externally (e.g. bins=20 returns 20 bins). In a bar plot, the bins arise naturally: there is one per category.

If you want to use bar charts or treemaps, you have to aggregate the data somehow, either before plotting or during plotting. Below we cover both approaches.

Aggregated

Assume you want to plot the distribution of confirmed cases w.r.t. Age Group. You may aggregate your data before plotting as:

totalCases <- covid %>% 
  rename(Age=Age_Group) %>% 
  group_by(Age) %>% 
  summarise(ncases = n())
totalCases

## # A tibble: 10 x 2
##    Age     ncases
##    <chr>    <int>
##  1 <20        571
##  2 20s       2454
##  3 30s       2591
##  4 40s       2936
##  5 50s       3545
##  6 60s       2644
##  7 70s       1900
##  8 80s       2661
##  9 90s       1920
## 10 Unknown     14

Now you can easily plot it. You need to tell ggplot to use the numbers as they are in the data, which you can do with stat='identity':

ggplot(totalCases, aes(y = Age, x = ncases, fill = Age)) + 
  geom_bar(stat = 'identity')
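For pre-aggregated data, ggplot2 also provides geom_col(), which plots the values as they are and so plays the same role as geom_bar(stat = 'identity'). The sketch below uses made-up counts (not the lab data) to show the two spellings side by side.

```r
library(ggplot2)

# Made-up, already aggregated counts
toy_cases <- data.frame(Age = c("20s", "30s", "40s"), ncases = c(12, 7, 9))

p_bar <- ggplot(toy_cases, aes(y = Age, x = ncases, fill = Age)) +
  geom_bar(stat = "identity")   # tell geom_bar not to count rows

p_col <- ggplot(toy_cases, aes(y = Age, x = ncases, fill = Age)) +
  geom_col()                    # uses the values as given

p_col
```

Which one to use is a matter of taste; geom_col() just saves typing the stat argument.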

Or you can add another dimension, Outcome1, to your aggregated data and plot it:

totalCases <- covid %>% 
  rename(Age=Age_Group, Outcome=Outcome1) %>% 
  group_by(Age,Outcome) %>% 
  summarise(ncases = n())

ggplot(totalCases, aes(y = Age, x = ncases,fill=Outcome)) + 
  geom_bar(stat = 'identity')

Or you can plot each outcome as a separate bar:

totalCases <- covid %>% 
  rename(Age=Age_Group, Outcome=Outcome1) %>% 
  group_by(Age,Outcome) %>% 
  summarise(ncases = n())

ggplot(totalCases, aes(y = Age, x = ncases, fill=Outcome)) + 
  geom_bar(stat = 'identity', position='dodge')

You can also plot proportions:

totalCases <- covid %>% 
  rename(Age=Age_Group, Outcome=Outcome1) %>% 
  group_by(Age,Outcome) %>% 
  summarise(ncases = n())

ggplot(totalCases, aes(y = Age, x = ncases, fill=Outcome)) + 
  geom_bar(stat = 'identity', position='fill')
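With position='fill', ggplot rescales each bar so that its segments sum to 1. The same proportions can be computed explicitly, which is handy when you want to print or label them. This is a minimal sketch on made-up counts (not the lab data); prop is a column name introduced here for illustration.

```r
library(dplyr)

# Made-up aggregated counts
toy_cases <- data.frame(Age     = c("20s", "20s", "80s", "80s"),
                        Outcome = c("Fatal", "Resolved", "Fatal", "Resolved"),
                        ncases  = c(1, 99, 30, 70))

props <- toy_cases %>%
  group_by(Age) %>%
  mutate(prop = ncases / sum(ncases)) %>%  # share of each outcome within its age group
  ungroup()

props
```

Within every age group the prop values add up to 1, which is what the filled bars display.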

Non-aggregated

Sometimes your data is not that big and you can let ggplot do everything. If your data is raw, ggplot can aggregate it on the fly using ..count.. or stat(count) (recent ggplot2 releases spell this after_stat(count)):

ggplot(covid, aes(y = Age_Group, x = stat(count))) + 
  geom_bar()

It is easy to fill with another variable:

ggplot(covid, aes(y = Age_Group, x = stat(count), fill=Outcome1)) + 
  geom_bar()

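If you want proportions rather than counts, divide the count by the total number of rows. Each bar then shows the share of all cases that falls in that Age_Group (adding position = 'fill' here would instead rescale every bar to length 1 and hide those shares):

ggplot(covid, aes(y = Age_Group, x = stat(count)/nrow(covid), fill=Outcome1)) + 
  geom_bar()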
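There are two different proportions one might want here: the share of all cases in each cell, and the share of each outcome within an age group. Base R's table and prop.table make the difference explicit. The sketch below uses five made-up rows (not the lab data).

```r
# Made-up rows with the same column names as the lab data
toy <- data.frame(
  Age_Group = c("20s", "20s", "30s", "30s", "30s"),
  Outcome1  = c("Resolved", "Resolved", "Fatal", "Resolved", "Resolved")
)

tab <- table(toy$Age_Group, toy$Outcome1)

overall <- prop.table(tab)              # each cell divided by the total number of rows
within  <- prop.table(tab, margin = 1)  # each row (age group) sums to 1, like position = 'fill'

overall
within
```

Dividing by nrow(covid) corresponds to the first table, while position = 'fill' corresponds to the second.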

Back To Back Comparison

As in the previous example, we will build the back-to-back plot manually. But first let’s remove the Unknown age group; the plot itself then compares female and male only:

covid <- subset(covid, Age_Group != 'Unknown')


ggplot() + 
  geom_bar(data = subset(covid, Gender == 'MALE'),   aes(x= stat(count),y = Age_Group), fill = 'firebrick', alpha=.75) + 
  geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group), fill = 'steelblue', alpha=.75) +
  theme_minimal() + 
  ggtitle('Comparison of Confirmed Cases by Gender') + 
  annotate('text', x = -2000, y = '90s', label='Female') +
  annotate('text', x = 1500, y = '90s', label='Male')

It looks like there are more confirmed cases among females than among males, at least in Ontario.

What about the outcomes?

covid <- subset(covid, Age_Group != 'Unknown')

ggplot() + 
  geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.75) + 
  geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.75) +
  theme_minimal() + 
  ggtitle('Comparison of Confirmed Cases by Gender') +
  annotate('text', x = -2000, y = '90s', label='Female') +
  annotate('text', x = 1500, y = '90s', label='Male')

There are more fatal cases among females than among males. But is that simply because there are more confirmed female cases?

The above plot is good for visualizing counts but not proportions, so we may falsely conclude that the female death rate is higher than the male one. If we add position = 'fill', we can see that the death rates among males are higher than among females:

covid <- subset(covid, Age_Group != 'Unknown')

ggplot() + 
  geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.6, position = 'fill') + 
  geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.6, position = 'fill') +
  theme_minimal() + 
  ggtitle('Comparison of Female (left) and Male (right) Confirmed Cases')

It looks like the effect is reversed: males face a higher fatality risk than females.
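The gap between counts and rates can be reproduced numerically. The sketch below uses made-up rows (not the lab data) in which one group has more fatal cases but a lower fatality rate, simply because it has more cases overall; the column names cases, fatal and fatal_rate are introduced here for illustration.

```r
library(dplyr)

# Made-up rows: 10 female cases (3 fatal) and 4 male cases (2 fatal)
toy <- data.frame(
  Gender   = c(rep("FEMALE", 10), rep("MALE", 4)),
  Outcome1 = c(rep("Fatal", 3), rep("Resolved", 7),
               rep("Fatal", 2), rep("Resolved", 2))
)

rates <- toy %>%
  group_by(Gender) %>%
  summarize(cases      = n(),
            fatal      = sum(Outcome1 == "Fatal"),    # count of fatal cases
            fatal_rate = mean(Outcome1 == "Fatal"))   # share of fatal cases

rates
```

Comparing the fatal and fatal_rate columns shows why the stacked-count plot and the filled plot can point in opposite directions.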

Making BBC Quality Plots

Let’s try to plot something similar to this:

sums <- subset(covid, Age_Group != 'Unknown' & Gender %in% c('MALE','FEMALE')) %>% group_by(Age_Group,Gender) %>% summarise(ncases=n(), pos=n() + 150)
sums$pos[sums$Gender=='FEMALE'] <- -sums$pos[sums$Gender=='FEMALE']

ggplot() +
  geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count), y=Age_Group,fill = Outcome1), colour = 'white', alpha=.8) +
  geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count), y=Age_Group,fill = Outcome1), colour = 'white', alpha=.8)+
  theme_minimal() +
  scale_fill_manual(name="", values = c("#ffa100", "#ffe25a","#007ea1")) +
  geom_text(data= sums, mapping = aes(x=pos, y=Age_Group, label=abs(ncases)),size=5) + 
  labs(title = 'Comparison of Confirmed Cases',subtitle = 'Ontario') +
  annotate('text', x = -1000, y = 'Gender', label='Female',size = 6) +
  annotate('text', x = 750, y = 'Gender', label='Male',size = 6) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_text(size=16),
        title = element_text(size=16))

Using Maps

totalcases <- covid %>% 
  group_by(City,Latitude,Longitude) %>% 
  summarize(ncases=n()) %>% 
  ungroup() %>%   # drop the grouping so the coordinate columns can be modified
  mutate(Latitude = as.numeric(Latitude),
         Longitude = as.numeric(Longitude))

library('leaflet')
leaflet(totalcases) %>%
  addTiles() %>%  # use the default base map which is OpenStreetMap tiles
  addCircleMarkers(lng=~Longitude, lat=~Latitude,  radius = ~ncases/500,
               popup=paste0(totalcases$City, '<br>' , totalcases$ncases))

Conditional probabilities and Independence

Recall the conditional probabilities: \[P(A | B) = \frac{P(A,B)}{P(B)} \quad \Rightarrow \quad P(A | B) P(B) = P(A,B) \quad \Rightarrow P(B | A) = \frac{P(A,B)}{P(A)} \]

You can see that there is a very direct connection between the two conditional probabilities: once the marginals P(A) and P(B) are known, A given B can be obtained from B given A.
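The relations above can be verified on a small contingency table. The sketch below uses made-up counts for two binary events A and B (they are not taken from the COVID data) and computes both conditionals from the joint distribution, then recovers one from the other.

```r
# Made-up joint counts for two binary events A (rows) and B (columns)
counts <- matrix(c(10, 30,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(A = c("yes", "no"), B = c("yes", "no")))

joint <- counts / sum(counts)   # P(A, B)
p_A   <- rowSums(joint)         # P(A)
p_B   <- colSums(joint)         # P(B)

p_A_given_B <- joint["yes", "yes"] / p_B["yes"]   # P(A | B) = P(A, B) / P(B)
p_B_given_A <- joint["yes", "yes"] / p_A["yes"]   # P(B | A) = P(A, B) / P(A)

# Recover P(B | A) from P(A | B) and the two marginals only
p_B_given_A_via_AB <- p_A_given_B * p_B["yes"] / p_A["yes"]

c(direct = unname(p_B_given_A), via_A_given_B = unname(p_B_given_A_via_AB))
```

Both routes give the same value, which is the identity that an eikosogram displays graphically.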

A package, eikosograms, written by a University of Waterloo professor is motivated by this fact and can effectively show the dependencies very nicely.

To plot \(P(Outcome = y \ | \ Gender)\):

library('eikosograms')
eikos(y= 'Outcome1', x='Gender', data = subset(covid, Gender %in% c('MALE', 'FEMALE')))

In the above plot, the area of each rectangle corresponds to a joint probability. Using this, we can read off several probabilities written on the plot:

\(P( G=Female ) = 0.58\) (on the top edge of the plot)
\(P( O=Fatal | G = Female) = 0.08\)
\(P( O= Resolved | G = Male) = 1- 0.27 = 0.73\)
\(P( O=Fatal | G = Male) \approx P( O=Fatal | G = Female) \approx P( O=Fatal ) \approx 0.08\)

Since these probabilities are very close to each other, we can say that the probability of death is approximately independent of gender.
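For reference, the standard definition behind this reading can be written with the same symbols, where \(G\) is Gender and \(O\) is Outcome. The two variables are independent when conditioning on gender does not change the outcome probabilities, or equivalently when the joint probability factorizes:

\[P(O = o \ | \ G = g) = P(O = o) \quad \text{for all } o, g \quad \Longleftrightarrow \quad P(O = o, G = g) = P(O = o) \, P(G = g)\]

In an eikosogram this would show up as bars of equal height across the categories of the conditioning variable. With real data the heights are rarely exactly equal, so "independent" here should be read as "approximately independent".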

Bayes’ Theorem

Recall the theorem:

\[P(A | B) = \frac{P(A,B)}{P(B)} \Rightarrow P(B | A) = \frac{P(A|B)P(B)}{P(A)}\]

In eikosograms, this transformation corresponds to swapping the variables on the x and y axes.

To plot \(P(Gender = y \ | \ Outcome)\):

eikos(x= 'Outcome1', y='Gender', data = subset(covid, Gender %in% c('MALE','FEMALE')))

The above is the same plot, just transposed. Now we can see the other probabilities such as

\(P( O=Fatal ) = 0.08\)
\(P( G=Female | O=Fatal ) = 0.54\)

The conditional probabilities of Gender given Fatal and given Resolved are very close to each other, so Gender and Outcome1 may be independent. A bigger dataset would make this clearer.

library('eikosograms')
eikos(y= 'Outcome1', x='Age_Group', data = covid)

The outcome looks roughly independent of age for age groups up to the 50s, but it clearly depends on age for the groups above the 50s.

Visualizing Proportions

Pie Charts

Pie charts are widely used, but not all data visualization scholars approve of them. In fact, the statistician and data visualization pioneer Edward Tufte famously said:

The only worse design than a pie chart is several of them.

Apparently the R and ggplot developers agree, though not everyone does. If you still want to draw one, here is how. First draw the stacked bar chart below, whose segments correspond to the slices of the pie chart:

aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))

Then map the y axis to the angle using polar coordinates:

aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
      coord_polar(theta = "y") +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
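
Because position = "fill" rescales the counts to proportions and coord_polar(theta = "y") maps those proportions to the angle, each slice's angle is simply its share of 360 degrees. A minimal sketch with hypothetical counts (illustrative values, not taken from covid.db) makes the arithmetic explicit:

```r
# Hypothetical case counts per outcome; illustrative values only
ncases <- c(Fatal = 80, `Not Resolved` = 190, Resolved = 730)

share <- ncases / sum(ncases)   # the proportions that position = "fill" plots
angle <- 360 * share            # slice angle in degrees once y is mapped to the angle

data.frame(outcome = names(ncases), share = share, angle = angle,
           row.names = NULL)
```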

You can also add facets:

aggDat <- group_by(covid, Gender, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
      facet_wrap( ~ Gender) + 
      coord_polar(theta = "y") +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))

Or put the labels inside the plot to get a donut plot, so called because it looks like your breakfast:

aggDat <- group_by(covid, Gender, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
      facet_wrap( ~ Gender) + 
      # the label at x = 0 extends the x scale inward, leaving a hole in the centre
      geom_text(aes(x = 0, y = 0, label = Gender)) + 
      coord_polar(theta = "y") +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1")) + 
      theme_void() + 
      theme(strip.background=element_blank(),
          strip.text=element_blank())

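Polar Coordinates

With the default coord_polar(), the categories on the x axis are mapped to the angle and the counts on the y axis to the radius, which turns a bar chart into a polar bar chart: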

aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_bar(aes(x = Outcome1, y=ncases, fill = Outcome1), stat = 'identity') +
      coord_polar() +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))

aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_bar(aes(x = Outcome1, y=ncases, fill = Outcome1), stat = 'identity') +
      coord_polar() +
      # theme(aspect.ratio = 1) +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))

aggDat <- group_by(covid, Outcome1, Age_Group) %>% summarise(ncases = n())
ggplot(subset(aggDat, Age_Group !='Unknown')) +
      geom_bar(aes(x = Age_Group, y=ncases, fill = Outcome1), stat = 'identity') +
      coord_polar() +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))

aggDat <- group_by(covid, Outcome1, Age_Group) %>% summarise(ncases = n())
ggplot(aggDat) +
      geom_bar(aes(x = Outcome1, y=ncases, fill = Outcome1), stat = 'identity') +
      coord_polar() +
      scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))+
      facet_wrap( ~ Age_Group) + 
      theme_void()

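Treemaps

A treemap draws each category as a rectangle whose area is proportional to its value. Here each rectangle is a city, sized and coloured by its number of confirmed cases: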

library(treemapify)

covid <- covid[ !is.na(covid$City), ]
totalCases <- group_by(covid, City) %>% summarise(ncases = n())


ggplot(totalCases, aes(area = ncases, fill = ncases, label=City)) +
  geom_treemap() + 
  geom_treemap_text(fontface = "italic", colour = "white", place = "topleft",
                    grow = TRUE)

The same kind of treemap restricted to fatal cases, with a custom colour gradient:

fatal <- subset(covid, Outcome1 == 'Fatal') %>% group_by(City) %>% summarise(ncases = n())

ggplot(fatal, aes(area = ncases, fill = ncases, label=City)) +
  geom_treemap() + 
  geom_treemap_text(fontface = "italic", colour = "white", place = "topleft",
                    grow = TRUE) + 
  scale_fill_gradient(low='steelblue', high='orange')

Waffle Charts

Another powerful way to plot a categorical variable is a waffle chart. Because every observation gets its own tile, it gives the sense that you can see each individual case.

But first you need to install the package from a special repository:

# install.packages('extrafont')
# install.packages("waffle", repos = "https://cinc.rud.is")
library('waffle')

library('ggthemes')
library('hrbrthemes')
# Count cases by gender in Waterloo and draw one tile per case
subset(covid, City == 'Waterloo') %>% group_by(Gender) %>% summarise(ncases = n()) %>% 
ggplot(aes(fill = Gender, values = ncases)) +
  geom_waffle(n_rows = 20, size = .5, colour = "white", flip = FALSE) + 
  coord_equal() + 
  theme_void() +
  theme_enhance_waffle() +
  labs(
    title = "Total Number of Cases by Gender",
    subtitle = "Waterloo"
  )
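
Since a waffle chart of this kind draws one tile per case, the size of the grid follows directly from the counts. The sketch below uses hypothetical counts (illustrative values, not taken from covid.db) and assumes one tile per case. It works out how many columns a grid with a fixed number of rows needs, and how many cells of that grid are left empty.

```r
# Hypothetical counts per category; illustrative values only
ncases <- c(FEMALE = 310, MALE = 245)
n_rows <- 20                       # plays the same role as n_rows in geom_waffle()

total  <- sum(ncases)
n_cols <- ceiling(total / n_rows)  # columns needed to hold every tile
empty  <- n_rows * n_cols - total  # grid cells left without a tile

c(total = total, n_cols = n_cols, empty = empty)
```

For very large counts the grid becomes hard to read, which is one reason to filter to a single city or a few cities before plotting.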

subset(covid, City %in% c('Newmarket','Ottawa','Waterloo')) %>% 
  group_by(City, Outcome1) %>% summarise(ncases = n()) %>% 
ggplot(aes(fill = Outcome1, values = ncases)) +
  geom_waffle(n_rows = 30, size = 0.2, colour = "white", flip = TRUE) + 
  facet_wrap(~City, nrow=1) +
  coord_equal() + 
  scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1")) +
  theme_void() +
  theme(panel.grid = element_blank(), axis.ticks.y = element_line(), 
        text = element_text(size=16))
