Lab 3: Visualizing Categorical Variables
Introduction
In this tutorial you will learn how to derive basic insights from categorical data and visualize joint probabilities. In particular we will be visualizing discrete distributions, both univariate and multivariate. By the end of the tutorial you will be able to:
- draw bar, column charts and treemaps,
- visualize distribution of more than one categorical variable,
- draw eikosograms to visualize independence and Bayes' rule,
- interpret these graphs.
Getting Started
We will visualize two COVID-19 confirmed cases datasets. To follow along you need:
- to download the covid.db data uploaded on LEARN
- some packages, including
  - tidyverse
  - eikosograms
  - RSQLite
  - treemapify
  - waffle
  - ggthemes
  - hrbrthemes
  - leaflet
Now, let’s load the tidyverse package with library('tidyverse') and read the covid dataset:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)
head(covid)
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
Aggregating Data
Before diving into visualizing discrete distributions, we need to learn how to pre-process the data so that it can be plotted. In particular we will show:
- Counting the number of people who were infected by COVID-19
- Grouping based on the outcome, age, gender and/or location
There are various ways to pre-process data in R. The most common is to load all the data into RAM and process it with tidyverse. But in industry, where datasets are often too big to fit in RAM, it is wise to use R together with SQL.
SQL
We briefly visited SQL to show how to import the data. Now, we will use it to preprocess the data into the shape that we need.
Basic SQL Functions
SQL is a very intuitive language; it is very much like talking to the computer. Let’s visit the vocabulary.
SELECT
The SELECT command chooses which columns of the dataset to return. If we use *, we select all of them:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)
head(covid)
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
Instead, we can specify the columns as below:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group, Gender, Outcome1, City
FROM confirmed")
dbDisconnect(con)
head(covid)
## Date Age_Group Gender Outcome1 City
## 1 2020-04-29 50s FEMALE Not Resolved Mississauga
## 2 2020-03-04 40s FEMALE Resolved Ottawa
## 3 2020-05-04 20s FEMALE Not Resolved Windsor
## 4 2020-05-02 50s FEMALE Not Resolved Waterloo
## 5 2020-04-28 50s MALE Not Resolved Newmarket
## 6 2020-04-10 30s MALE Resolved Ottawa
AS
Sometimes the original column name is not very intuitive or just too detailed. We may want to rename it:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 AS Outcome, City
FROM confirmed")
dbDisconnect(con)
head(covid)
## Date Age Gender Outcome City
## 1 2020-04-29 50s FEMALE Not Resolved Mississauga
## 2 2020-03-04 40s FEMALE Resolved Ottawa
## 3 2020-05-04 20s FEMALE Not Resolved Windsor
## 4 2020-05-02 50s FEMALE Not Resolved Waterloo
## 5 2020-04-28 50s MALE Not Resolved Newmarket
## 6 2020-04-10 30s MALE Resolved Ottawa
ORDER BY
To order the data, we use the command ORDER BY:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 as Outcome, City
FROM confirmed
ORDER BY Date")
dbDisconnect(con)
head(covid)
## Date Age Gender Outcome City
## 1 2020-01-01 80s MALE Resolved Simcoe
## 2 2020-01-10 40s FEMALE Resolved Toronto
## 3 2020-01-15 60s MALE Resolved Toronto
## 4 2020-01-19 20s MALE Resolved Mississauga
## 5 2020-01-21 50s MALE Resolved Toronto
## 6 2020-01-22 50s FEMALE Resolved Toronto
or, to sort in descending order, add DESC:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Date, Age_Group AS Age, Gender, Outcome1 as Outcome, City
FROM confirmed
ORDER BY Date DESC")
dbDisconnect(con)
head(covid)
## Date Age Gender Outcome City
## 1 2020-05-12 <20 MALE Not Resolved London
## 2 2020-05-12 60s MALE Not Resolved Barrie
## 3 2020-05-12 20s MALE Not Resolved Windsor
## 4 2020-05-12 60s MALE Not Resolved Mississauga
## 5 2020-05-12 <20 FEMALE Not Resolved London
## 6 2020-05-12 40s FEMALE Not Resolved Windsor
COUNT and GROUP BY
Instead of just fetching the data, we can extract the needed information on the fly. For example, we can COUNT the number of rows, or compute the SUM or average (AVG) of the numbers in a column.
The code below counts the rows without any constraint, so it returns the number of rows in the dataset:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT COUNT(*) AS Cases
FROM confirmed")
dbDisconnect(con)
head(covid)
## Cases
## 1 21236
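We may ask how many people tested positive in each city and each age group. The code below looks like it should do this, but it is missing one clause. Without that clause, COUNT(*) collapses the whole table into a single row, and the Age_Group and City shown come from just one arbitrary row of the table: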
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Age_Group, City, COUNT(*) AS Cases
FROM confirmed")
dbDisconnect(con)
head(covid)
## Age_Group City Cases
## 1 50s Mississauga 21236
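If we want to count the rows for each category of a categorical variable, we use GROUP BY. Note that the query below groups by Age only, so the City shown in each row comes from an arbitrary row of that age group rather than being a grouping level: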
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Age_Group AS Age, City, COUNT(*) AS Cases
FROM confirmed
GROUP BY Age")
dbDisconnect(con)
head(covid)
## Age City Cases
## 1 20s Windsor 2454
## 2 30s Ottawa 2591
## 3 40s Ottawa 2936
## 4 50s Mississauga 3545
## 5 60s Ottawa 2644
## 6 70s Waterloo 1900
Combining All
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT Age_Group as Age, City, COUNT(*) AS Cases
FROM confirmed
GROUP BY Age, City
ORDER BY City DESC")
dbDisconnect(con)
head(covid)
## Age City Cases
## 1 20s Windsor 121
## 2 30s Windsor 105
## 3 40s Windsor 99
## 4 50s Windsor 112
## 5 60s Windsor 81
## 6 70s Windsor 54
Tidyverse
We don’t always need to use SQL to aggregate things. Usually SQL helps us pull the most useful data, already aggregated to a certain level. Then we continue processing in R before plotting.
Tidyverse contains powerful functions to process the data in a similar fashion to SQL. Let’s see how they work.
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)
head(covid)
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
Select
To select the columns that we are interested in, we can use select as below. Note that I use %>% head() just to display the first 6 rows to save space. It has nothing to do with the select functionality:
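The chunk that produces the output for this step is not shown. A minimal sketch of what such a call could look like, run here on `toy`, a small made-up stand-in for `covid` (same column names, illustrative rows only, not real records):

```r
library(dplyr)  # select() and %>% come from dplyr, which is part of the tidyverse

# Hypothetical stand-in for covid: same column names, made-up rows
toy <- data.frame(
  Age_Group = c("50s", "40s", "20s"),
  Gender    = c("FEMALE", "FEMALE", "MALE"),
  Outcome1  = c("Not Resolved", "Resolved", "Resolved"),
  City      = c("Mississauga", "Ottawa", "Windsor")
)

# Keep three columns; head() only limits how many rows are printed
picked <- select(toy, Age_Group, Outcome1, City) %>% head()
picked
```

Replacing `toy` with `covid` should give a table of the same shape as the one printed in this section.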
## Age_Group Outcome1 City
## 1 50s Not Resolved Mississauga
## 2 40s Resolved Ottawa
## 3 20s Not Resolved Windsor
## 4 50s Not Resolved Waterloo
## 5 50s Not Resolved Newmarket
## 6 30s Resolved Ottawa
You may have noticed the power of %>%. It is called a pipe, and it passes the output of the expression on its left as the first argument of the function on its right. We can do the same thing as above using pipes:
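As a sketch of this equivalence, assuming the same kind of made-up `toy` data frame (hypothetical rows, not taken from covid.db), the nested call and the piped call return identical results:

```r
library(dplyr)

# Hypothetical stand-in for covid: same column names, made-up rows
toy <- data.frame(
  Age_Group = c("50s", "40s", "20s"),
  Gender    = c("FEMALE", "FEMALE", "MALE"),
  Outcome1  = c("Not Resolved", "Resolved", "Resolved"),
  City      = c("Mississauga", "Ottawa", "Windsor")
)

# Nested form: read from the inside out
nested <- head(select(toy, Age_Group, Outcome1, City))

# Piped form: read from left to right, one step at a time
piped <- toy %>%
  select(Age_Group, Outcome1, City) %>%
  head()

# Both forms produce the same data frame
identical(nested, piped)
```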
## Age_Group Outcome1 City
## 1 50s Not Resolved Mississauga
## 2 40s Resolved Ottawa
## 3 20s Not Resolved Windsor
## 4 50s Not Resolved Waterloo
## 5 50s Not Resolved Newmarket
## 6 30s Resolved Ottawa
Rename
Similar to the AS keyword in SQL, we can rename columns with rename, using the NewName=OldName order, as below:
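The chunk for this step is not shown either. A possible version, sketched on a made-up `toy` data frame (hypothetical rows) and written both as a plain call and with a pipe:

```r
library(dplyr)

# Hypothetical stand-in for covid: same column names, made-up rows
toy <- data.frame(
  Age_Group = c("50s", "40s", "20s"),
  Gender    = c("FEMALE", "FEMALE", "MALE"),
  Outcome1  = c("Not Resolved", "Resolved", "Resolved"),
  City      = c("Mississauga", "Ottawa", "Windsor")
)

# NewName = OldName: Age_Group becomes Age, all other columns are untouched
renamed <- rename(toy, Age = Age_Group)

# The same operation written with a pipe
renamed_piped <- toy %>% rename(Age = Age_Group)

names(renamed)
```

Unlike select, rename keeps every column, which is why the full table is printed in this section.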
## Row_ID Date Age Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
or you can use pipes:
## Row_ID Date Age Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
Arrange
arrange is the function that does the same job as ORDER BY, that is, it sorts the rows:
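A sketch of an ascending sort, assuming a made-up `toy` data frame with a Date column stored as ISO-formatted text, as in the output shown in this tutorial (the rows themselves are hypothetical):

```r
library(dplyr)

# Hypothetical rows; dates are ISO strings, so text order equals date order
toy <- data.frame(
  Date      = c("2020-04-29", "2020-03-04", "2020-05-04", "2020-01-15"),
  Age_Group = c("50s", "40s", "20s", "60s"),
  City      = c("Mississauga", "Ottawa", "Windsor", "Toronto")
)

# Sort rows by Date, earliest first (the counterpart of ORDER BY Date)
by_date <- arrange(toy, Date)
by_date
```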
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 4842 2020-01-01 80s MALE Neither Resolved
## 2 3121 2020-01-10 40s FEMALE Contact of a confirmed case Resolved
## 3 3468 2020-01-15 60s MALE Information pending Resolved
## 4 15828 2020-01-19 20s MALE Neither Resolved
## 5 11282 2020-01-21 50s MALE Travel-Related Resolved
## 6 10437 2020-01-22 50s FEMALE Travel-Related Resolved
## Reporting_PHU Address City
## 1 Haldimand-Norfolk Health Unit 12 Gilbertson Drive Simcoe
## 2 Toronto Public Health 277 Victoria Street, 5th Floor Toronto
## 3 Toronto Public Health 277 Victoria Street, 5th Floor Toronto
## 4 Peel Public Health 7120 Hurontario Street Mississauga
## 5 Toronto Public Health 277 Victoria Street, 5th Floor Toronto
## 6 Toronto Public Health 277 Victoria Street, 5th Floor Toronto
## Postal_Code Reporting_PHU_Website Latitude
## 1 N3Y 4N5 www.hnhu.org 42.84782526
## 2 M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 3 M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 4 L5W 1N4 www.peelregion.ca/health/ 43.6474713
## 5 M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## 6 M5B 1W2 www.toronto.ca/community-people/health-wellness-care/ 43.65659125
## Longitude
## 1 -80.30381491
## 2 -79.37935801
## 3 -79.37935801
## 4 -79.7088933
## 5 -79.37935801
## 6 -79.37935801
We can also sort in descending order:
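In dplyr, descending order is requested by wrapping the column in desc(). A sketch on a made-up `toy` data frame (hypothetical rows):

```r
library(dplyr)

# Hypothetical rows; dates are ISO strings, so text order equals date order
toy <- data.frame(
  Date      = c("2020-04-29", "2020-03-04", "2020-05-04", "2020-01-15"),
  Age_Group = c("50s", "40s", "20s", "60s"),
  City      = c("Mississauga", "Ottawa", "Windsor", "Toronto")
)

# Sort rows by Date, latest first (the counterpart of ORDER BY Date DESC)
by_date_desc <- arrange(toy, desc(Date))
by_date_desc
```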
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 50 2020-05-12 <20 MALE Information pending Not Resolved
## 2 314 2020-05-12 60s MALE Information pending Not Resolved
## 3 320 2020-05-12 20s MALE Information pending Not Resolved
## 4 530 2020-05-12 60s MALE Information pending Not Resolved
## 5 641 2020-05-12 <20 FEMALE Information pending Not Resolved
## 6 645 2020-05-12 40s FEMALE Information pending Not Resolved
## Reporting_PHU Address City
## 1 Middlesex-London Health Unit 50 King Street London
## 2 Simcoe Muskoka District Health Unit 15 Sperling Drive Barrie
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Peel Public Health 7120 Hurontario Street Mississauga
## 5 Middlesex-London Health Unit 50 King Street London
## 6 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 N6A 5L7 www.healthunit.com 42.98146842 -81.25401572
## 2 L4M 6K9 www.simcoemuskokahealth.org 44.41071258 -79.68630597
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 5 N6A 5L7 www.healthunit.com 42.98146842 -81.25401572
## 6 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
group_by and summarize
We can use group_by together with summarize to aggregate the data. There are many functions that we could use; today we will only cover n(), which counts the number of rows:
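A sketch of this pattern on a made-up `toy` data frame (hypothetical rows); the column name Cases matches the output shown in this section:

```r
library(dplyr)

# Hypothetical rows: two cases in their 50s, one each in their 40s and 20s
toy <- data.frame(
  Age_Group = c("50s", "40s", "50s", "20s"),
  Gender    = c("FEMALE", "FEMALE", "MALE", "MALE")
)

# One output row per age group; n() counts the rows in each group
cases_by_age <- toy %>%
  group_by(Age_Group) %>%
  summarize(Cases = n())
cases_by_age
```

This mirrors SELECT Age_Group, COUNT(*) AS Cases ... GROUP BY Age_Group from the SQL section.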
## # A tibble: 10 x 2
## Age_Group Cases
## <chr> <int>
## 1 <20 571
## 2 20s 2454
## 3 30s 2591
## 4 40s 2936
## 5 50s 3545
## 6 60s 2644
## 7 70s 1900
## 8 80s 2661
## 9 90s 1920
## 10 Unknown 14
Similar to SQL, group_by allows grouping by several variables:
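A sketch with two grouping variables, again on a made-up `toy` data frame (hypothetical rows). With the default settings the result stays grouped by the first variable, which is what the "# Groups:" line in the printed tibble indicates:

```r
library(dplyr)

# Hypothetical rows
toy <- data.frame(
  Age_Group = c("20s", "20s", "20s", "30s"),
  Gender    = c("FEMALE", "MALE", "FEMALE", "MALE")
)

# One output row per observed (Age_Group, Gender) combination
cases_by_age_gender <- toy %>%
  group_by(Age_Group, Gender) %>%
  summarize(Cases = n())
cases_by_age_gender
```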
## # A tibble: 36 x 3
## # Groups: Age_Group [10]
## Age_Group Gender Cases
## <chr> <chr> <int>
## 1 <20 FEMALE 284
## 2 <20 MALE 285
## 3 <20 TRANSGENDER 1
## 4 <20 UNKNOWN 1
## 5 20s FEMALE 1371
## 6 20s MALE 1079
## 7 20s TRANSGENDER 1
## 8 20s UNKNOWN 3
## 9 30s FEMALE 1426
## 10 30s MALE 1156
## # … with 26 more rows
Using Pipes
Now, we can combine all we have learned. The most straightforward way is to store the result of select under a new name, continue processing that new data, store the output under another name, and so on. This wastes both time and memory. Instead, we can use pipes to pass the output of each step to the next:
covid %>%
rename(Age=Age_Group, Outcome=Outcome1) %>%
group_by(Age, Outcome) %>%
summarize(Cases = n()) %>%
arrange(Outcome)
## # A tibble: 28 x 3
## # Groups: Age [10]
## Age Outcome Cases
## <chr> <chr> <int>
## 1 20s Fatal 2
## 2 30s Fatal 5
## 3 40s Fatal 16
## 4 50s Fatal 58
## 5 60s Fatal 139
## 6 70s Fatal 295
## 7 80s Fatal 651
## 8 90s Fatal 599
## 9 <20 Not Resolved 140
## 10 20s Not Resolved 439
## # … with 18 more rows
Visualizing Categorical Variables
Bar plots (and column charts) are very similar to histograms. In a histogram we group a continuous variable into bins that we choose ourselves (e.g. bins=20 gives 20 bins). In a bar plot the bins arise naturally: there is one per category.
To draw bar charts or treemaps you have to aggregate the data, either before plotting or during plotting. Below we cover both approaches.
Aggregated
Assume you want to plot the distribution of confirmed cases w.r.t. Age Group. You may aggregate your data before plotting as:
totalCases <- covid %>%
rename(Age=Age_Group) %>%
group_by(Age) %>%
summarise(ncases = n())
totalCases
## # A tibble: 10 x 2
## Age ncases
## <chr> <int>
## 1 <20 571
## 2 20s 2454
## 3 30s 2591
## 4 40s 2936
## 5 50s 3545
## 6 60s 2644
## 7 70s 1900
## 8 80s 2661
## 9 90s 1920
## 10 Unknown 14
Now you can easily plot it. You need to tell ggplot to use the numbers as they are in the data. You can do it with stat='identity':
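The plotting chunk is not shown here. A sketch that follows the style of the later examples in this section, using the age-group counts printed above and typed in by hand so that the block runs without covid.db:

```r
library(ggplot2)

# Counts copied from the aggregated table printed in this section
totalCases <- data.frame(
  Age    = c("<20", "20s", "30s", "40s", "50s", "60s", "70s", "80s", "90s", "Unknown"),
  ncases = c(571, 2454, 2591, 2936, 3545, 2644, 1900, 2661, 1920, 14)
)

# stat = 'identity' draws bar lengths straight from ncases instead of counting rows
p <- ggplot(totalCases, aes(y = Age, x = ncases)) +
  geom_bar(stat = 'identity')
p
```

geom_col() is a shorthand for geom_bar(stat = 'identity') and would draw the same chart.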
Or you can add another dimension, Outcome1, to your aggregated data and plot it:
totalCases <- covid %>%
rename(Age=Age_Group, Outcome=Outcome1) %>%
group_by(Age,Outcome) %>%
summarise(ncases = n())
ggplot(totalCases, aes(y = Age, x = ncases,fill=Outcome)) +
geom_bar(stat = 'identity')
Or you can plot each outcome as a separate bar:
totalCases <- covid %>%
rename(Age=Age_Group, Outcome=Outcome1) %>%
group_by(Age,Outcome) %>%
summarise(ncases = n())
ggplot(totalCases, aes(y = Age, x = ncases, fill=Outcome)) +
geom_bar(stat = 'identity', position='dodge')
You can also plot proportions:
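One way to do this, offered as a sketch since the original chunk is not shown, is position = 'fill', which rescales every bar to total length 1. The counts below are made up for illustration and are not taken from covid.db:

```r
library(ggplot2)

# Hypothetical aggregated counts by age group and outcome
toyCases <- data.frame(
  Age     = rep(c("20s", "50s", "80s"), each = 3),
  Outcome = rep(c("Fatal", "Not Resolved", "Resolved"), times = 3),
  ncases  = c(1, 4, 15,  2, 6, 12,  8, 4, 8)
)

# position = 'fill' stacks the outcomes and normalises each bar to 1,
# so the segment lengths are proportions within each age group
p <- ggplot(toyCases, aes(y = Age, x = ncases, fill = Outcome)) +
  geom_bar(stat = 'identity', position = 'fill')
p
```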
Non-aggregated
Sometimes your data is not that big and you can let ggplot do all the work. If your data is raw, ggplot can aggregate it on the fly using ..count.. or stat(count) (recent versions of ggplot2 prefer the spelling after_stat(count)):
It is easy to fill with another variable:
If you want proportions rather than the count of cases falling into each Age_Group, you can divide by the total number of rows:
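The three variants described above (counting on the fly, filling by a second variable, and proportions) are sketched below on a made-up raw data frame `toy` (hypothetical rows, one row per case). The sketch uses after_stat(), which assumes ggplot2 3.3.0 or later; stat(count) plays the same role in the other examples of this tutorial:

```r
library(ggplot2)

# Hypothetical raw data: one row per confirmed case
toy <- data.frame(
  Age_Group = c("20s", "20s", "30s", "30s", "30s", "40s"),
  Outcome1  = c("Resolved", "Not Resolved", "Resolved", "Resolved", "Fatal", "Resolved")
)

# 1. Let ggplot count the rows in each age group
p_count <- ggplot(toy, aes(y = Age_Group)) +
  geom_bar(aes(x = after_stat(count)))

# 2. Same counts, with each bar split by outcome
p_fill <- ggplot(toy, aes(y = Age_Group, fill = Outcome1)) +
  geom_bar()

# 3. Proportions: divide each count by the total number of rows
p_prop <- ggplot(toy, aes(y = Age_Group)) +
  geom_bar(aes(x = after_stat(count / sum(count))))

p_count
p_fill
p_prop
```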
Back To Back Comparison
As in the previous example, we will build the back-to-back plot manually. But first let’s remove the Unknown age group and compare females and males only:
covid <- subset(covid, Age_Group != 'Unknown')
ggplot() +
geom_bar(data = subset(covid, Gender == 'MALE'), aes(x= stat(count),y = Age_Group), fill = 'firebrick', alpha=.75) +
geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group), fill = 'steelblue', alpha=.75) +
theme_minimal() +
ggtitle('Comparison of Confirmed Cases by Gender') +
annotate('text', x = -2000, y = '90s', label='Female') +
annotate('text', x = 1500, y = '90s', label='Male')
It looks like there are more confirmed cases among females than among males, at least in Ontario.
What about the outcomes?
covid <- subset(covid, Age_Group != 'Unknown')
ggplot() +
geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.75) +
geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.75) +
theme_minimal() +
ggtitle('Comparison of Confirmed Cases by Gender') +
annotate('text', x = -2000, y = '90s', label='Female') +
annotate('text', x = 1500, y = '90s', label='Male')
There are more fatal cases among females than among males. But is that simply because there are more confirmed female cases than male ones?
The plot above is good for visualizing counts but not proportions, so we might wrongly conclude that the female death rate is higher than the male one. If we add position = 'fill' we can see that the death rates among males are higher than among females:
covid <- subset(covid, Age_Group != 'Unknown')
ggplot() +
geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.6, position = 'fill') +
geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count),y = Age_Group, fill = Outcome1), colour = 'white', alpha=.6, position = 'fill') +
theme_minimal() +
ggtitle('Comparison of Female (left) and Male (right) Confirmed Cases')
The effect is reversed: males face a higher fatality risk than females.
Making BBC Quality Plots
Let’s try to plot something similar to this:
sums <- subset(covid, Age_Group != 'Unknown' & Gender %in% c('MALE','FEMALE')) %>% group_by(Age_Group,Gender) %>% summarise(ncases=n(), pos=n() + 150)
sums$pos[sums$Gender=='FEMALE'] <- -sums$pos[sums$Gender=='FEMALE']
ggplot() +
geom_bar(data = subset(covid, Gender == 'MALE'), aes(x=stat(count), y=Age_Group,fill = Outcome1), colour = 'white', alpha=.8) +
geom_bar(data = subset(covid, Gender == 'FEMALE'), aes(x=-stat(count), y=Age_Group,fill = Outcome1), colour = 'white', alpha=.8)+
theme_minimal() +
scale_fill_manual(name="", values = c("#ffa100", "#ffe25a","#007ea1")) +
geom_text(data= sums, mapping = aes(x=pos, y=Age_Group, label=abs(ncases)),size=5) +
labs(title = 'Comparison of Confirmed Cases',subtitle = 'Ontario') +
annotate('text', x = -1000, y = 'Gender', label='Female',size = 6) +
annotate('text', x = 750, y = 'Gender', label='Male',size = 6) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(size=16),
title = element_text(size=16))
Conditional probabilities and Independence
Recall the conditional probabilities: \[P(A | B) = \frac{P(A,B)}{P(B)} \quad \Rightarrow \quad P(A | B) P(B) = P(A,B) \quad \Rightarrow P(B | A) = \frac{P(A,B)}{P(A)} \]
These identities link the two conditional probabilities directly: \(P(A | B)\) can be obtained from \(P(B | A)\), and vice versa, through the joint probability \(P(A,B)\).
The eikosograms package, written by a University of Waterloo professor, is motivated by this fact and shows such dependencies very effectively.
To plot \(P(Outcome = y \ | \ Gender)\):
library('eikosograms')
eikos(y= 'Outcome1', x='Gender', data = subset(covid, Gender %in% c('MALE', 'FEMALE')))
In the above plot the area of each rectangle corresponds to a joint probability. From it we can read off several probabilities:
- \(P( G=Female ) = 0.58\) (on the top edge of the plot)
- \(P( O=Fatal | G = Female) = 0.08\)
- \(P( O= Resolved | G = Male) = 1- 0.27 = 0.73\)
- \(P( O=Fatal | G = Male) \approx P( O=Fatal | G = Female) \approx P( O=Fatal ) \approx 0.08\)
Since these probabilities are very close to each other, the probability of death appears to be approximately independent of gender.
Bayes’ Theorem
Recall the theorem:
\[P(A | B) = \frac{P(A,B)}{P(B)} \Rightarrow P(B | A) = \frac{P(A|B)P(B)}{P(A)}\]
In eikosograms this transformation corresponds to swapping the x and y variables, which transposes the plot.
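The theorem can be checked numerically with base R. The sketch below uses a made-up 2×2 table of counts (hypothetical numbers, not taken from covid.db), computes both sets of conditional probabilities with prop.table, and confirms that Bayes' theorem recovers one from the other:

```r
# Hypothetical counts of Outcome by Gender (filled column by column)
counts <- matrix(c(40, 460, 35, 365), nrow = 2,
                 dimnames = list(Outcome = c("Fatal", "Other"),
                                 Gender  = c("FEMALE", "MALE")))

joint     <- prop.table(counts)              # P(O, G): all cells sum to 1
p_gender  <- colSums(joint)                  # P(G)
p_outcome <- rowSums(joint)                  # P(O)
o_given_g <- prop.table(counts, margin = 2)  # P(O | G): each column sums to 1
g_given_o <- prop.table(counts, margin = 1)  # P(G | O): each row sums to 1

# Bayes: P(G | O) = P(O | G) P(G) / P(O)
# sweep() multiplies each column by P(G); the division recycles P(O) down the rows
bayes <- sweep(o_given_g, 2, p_gender, `*`) / p_outcome

all.equal(as.vector(bayes), as.vector(g_given_o))
```

Here o_given_g plays the role of the eikosogram with Gender on the x axis, and g_given_o the role of the transposed one.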
To plot \(P(Gender = y \ | \ Outcome)\), swap the x and y arguments of the eikos call:
The above is the same plot, just transposed. Now we can read off other probabilities, such as:
- \(P( O=Fatal ) = 0.08\)
- \(P( G=Female | O=Fatal ) = 0.54\)
The conditional probabilities given Fatal and given Resolved are very close to each other, so Gender and Outcome1 may be independent. A bigger dataset would make this clearer.
The outcome is independent among age groups \(\leq 50\)s but dependent when the age group is greater than 50s.
Visualizing Proportions
Pie Charts
Pie charts are widely used, but they are not endorsed by all data visualization scholars. In fact, the statistician and data visualization pioneer Edward Tufte famously said:
The only worse design than a pie chart is several of them.
Apparently the R and ggplot developers agree, since ggplot has no dedicated pie geom, but not everyone does. If you want to plot one, here is how. First draw the stacked bar chart below, whose segments are proportional to the slices of the pie chart:
aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
then wrap it around a circle by mapping the y axis to the angle:
aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
coord_polar(theta = "y") +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
You can also add facets:
aggDat <- group_by(covid, Gender, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
facet_wrap( ~ Gender) +
coord_polar(theta = "y") +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
or put the names inside the plot so that it looks like your breakfast; this is also known as a donut plot:
aggDat <- group_by(covid, Gender, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_col(aes(x = 1, y = ncases, fill = Outcome1), position = "fill") +
facet_wrap( ~ Gender) +
geom_text(aes(x = 0, y = 0, label = Gender)) +
coord_polar(theta = "y") +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1")) +
theme_void() +
theme(strip.background=element_blank(),
strip.text=element_blank())
Polar Coordinates
aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_bar(aes(x = Outcome1, y=ncases, fill = Outcome1), stat = 'identity') +
coord_polar() +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
aggDat <- group_by(covid, Outcome1) %>% summarise(ncases = n())
ggplot(aggDat) +
geom_bar(aes(x = Outcome1, y=ncases, fill = Outcome1), stat = 'identity') +
coord_polar() +
# theme(aspect.ratio = 1) +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1"))
Treemaps
library(treemapify)
covid <- covid[ !is.na(covid$City), ]
totalCases <- group_by(covid, City) %>% summarise(ncases = n())
ggplot(totalCases, aes(area = ncases, fill = ncases, label=City)) +
geom_treemap() +
geom_treemap_text(fontface = "italic", colour = "white", place = "topleft",
grow = T)
library('treemapify')
fatal <- subset(covid, Outcome1 == 'Fatal') %>% group_by(City) %>% summarise(ncases = n())
ggplot(fatal, aes(area = ncases, fill = ncases, label=City)) +
geom_treemap() +
geom_treemap_text(fontface = "italic", colour = "white", place = "topleft",
grow = T) +
scale_fill_gradient(low='steelblue', high='orange')
Waffle Charts
Another powerful way to plot a categorical variable is a waffle chart. Because every unit is drawn as its own square, it gives you the sense that you can observe each individual.
But first you need to install the package from a special repository:
# install.packages('extrafont')
# install.packages("waffle", repos = "https://cinc.rud.is")
library('waffle')
library('ggthemes')
library('hrbrthemes')
subset(covid, City == 'Waterloo') %>% group_by(Gender) %>% summarise(ncases = n()) %>%
ggplot(aes(fill = Gender, values = ncases)) +
geom_waffle(n_rows = 20, size = .5, colour = "white", flip = F) +
coord_equal() +
theme_void() +
theme_enhance_waffle() +
labs(
title = "Total Number of Cases by Gender",
subtitle = "Waterloo",
x = "Year",
y = "Count"
)
subset(covid, City %in% c('Newmarket','Ottawa','Waterloo')) %>%
group_by(City, Outcome1) %>% summarise(ncases = n()) %>%
ggplot(aes(fill = Outcome1, values = ncases)) +
geom_waffle(n_rows = 30, size = 0.2, colour = "white", flip = T) +
facet_wrap(~City, nrow=1) +
coord_equal() +
scale_fill_manual(name="", values = c("firebrick", "#ffe25a","#007ea1")) +
theme_void() +
theme(panel.grid = element_blank(), axis.ticks.y = element_line(),
text = element_text(size=16))