Lab 1: Data Visualization
Introduction
In this tutorial you will learn how to import data from external sources, preprocess it and plot a simple graph. By the end of the tutorial you will be able to:
- read data from xlsx, csv / txt / tsv and rda files and from an SQL database,
- identify dimensions, overwrite column and row names,
- draw different plots with ggplot2,
- visualize continuous data (and other non-categorical numeric data) and categorical data.
Getting Started
We will visualize:
- House Prices data, once used in a Kaggle challenge, which contains prices and properties of houses in Washington
- Ontario Covid Confirmed Cases data
You will need:
- tidyverse (install if not yet installed)
- house_subset, provinces and covid datasets
You will also need some additional packages:
# install.packages('tidyverse')
# install.packages('RSQLite')
# install.packages('readxl')
# install.packages('leaflet')
Now load the tidyverse library:
library('tidyverse')
Reading Data
When you read a file by a relative path, R looks for it starting from your current working directory, so make sure the working directory is where you expect it to be.
- You can check it in RStudio on the right-hand side, under the Files tab. You can browse to the folder where your data is, press More and click the Set as working directory option.
- You can check it by typing getwd() in the console and change it using setwd('path/to/folder'), for example setwd('Documents/MSCI253/Lab/') if the data is in that folder.
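A minimal sketch of checking the working directory from the console before reading a file. The relative path data/provinces.csv is the one used later in this lab; whether it exists on your machine depends on your own folder layout, so the check below is only illustrative.

```r
# Show the folder that relative paths are resolved against.
wd <- getwd()
print(wd)

# Relative path used later in this lab; adjust it to where your copy lives.
data_path <- file.path("data", "provinces.csv")

# file.exists() is TRUE only if the path resolves from the working directory.
if (file.exists(data_path)) {
  message("Found ", data_path)
} else {
  message("Cannot find ", data_path, " from ", wd,
          "; use setwd() or fix the path.")
}
```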
Read from CSV / TXT / TSV
These files all share the same format, and the extension itself doesn’t matter. Whatever the file is called, the columns are separated either
- by a comma (,)
- a tab (\t)
- or anything else
By default read.csv expects a comma; if the file uses a different separator, nothing is split. Besides:
- The file may not have a header,
- It may start in the nth row and so on.
Comma Separated
If your file is comma separated, each line holds one row of data with the values separated by commas.
Then you can read it easily as:
provinces <- read.csv('data/provinces.csv') # by default it expects , to separate columns
head(provinces)
## Province Latitude Longitude
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
The head function returns only the first 6 rows to give you an idea of what kind of data you have (similarly, tail returns the last 6). You can see that the data is read properly. The dimensions of the data are:
## [1] 10 3
There is information for 10 Canadian provinces, and the data has 3 columns.
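The dimensions printed above come from dim(). A small sketch follows, using the first three rows of the output above as inline text instead of the lab's provinces.csv; the name provinces_demo and its three-row size are only illustrative.

```r
# First three rows copied from the head() output above.
csv_text <- "Province,Latitude,Longitude
Saskatchewan,55,-106
Prince Edward Island,46.25,-63
Ontario,50,-85"

# text = lets read.csv parse a string instead of a file on disk.
provinces_demo <- read.csv(text = csv_text)

dim(provinces_demo)    # number of rows, then number of columns
nrow(provinces_demo)   # rows only
ncol(provinces_demo)   # columns only
```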
Tab or Anything Else Separated
However, when the data is separated by tabs or something else (e.g. semicolons), the default options will read it incorrectly:
## Province.Latitude.Longitude
## 1 Saskatchewan\t55\t-106
## 2 Prince Edward Island\t46.25\t-63
## 3 Ontario\t50\t-85
## 4 Nova Scotia\t45\t-63
## 5 Alberta\t55\t-115
## 6 British Columbia\t53.726669\t-127.647621
Notice that the values are separated by a \t sign and everything is in 1 column:
## [1] 10 1
which means the data is not properly read. To read it properly we have to specify the separator:
## Province Latitude Longitude
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
If it is separated by semicolons (‘;’) then
## Province Latitude Longitude
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
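The calls behind the outputs above most likely pass the separator through the sep argument of read.csv. A self-contained sketch, using two inline strings (made up to mirror the provinces data) rather than the lab files:

```r
# Two-row stand-ins for a tab-separated and a semicolon-separated file.
tab_text  <- "Province\tLatitude\tLongitude\nSaskatchewan\t55\t-106\nOntario\t50\t-85"
semi_text <- "Province;Latitude;Longitude\nSaskatchewan;55;-106\nOntario;50;-85"

# With the default sep = ",", everything lands in a single column.
wrong <- read.csv(text = tab_text)
ncol(wrong)

# Naming the separator splits the columns as intended.
tab_ok  <- read.csv(text = tab_text,  sep = "\t")
semi_ok <- read.csv(text = semi_text, sep = ";")
head(tab_ok)
head(semi_ok)
```

Base R also provides read.delim(), whose default separator is a tab.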
Without Header
Data files do not always have a header, and if you don’t say so, the first row will be read as the header:
## Saskatchewan X55 X.106
## 1 Prince Edward Island 46.25000 -63.00000
## 2 Ontario 50.00000 -85.00000
## 3 Nova Scotia 45.00000 -63.00000
## 4 Alberta 55.00000 -115.00000
## 5 British Columbia 53.72667 -127.64762
## 6 Manitoba 53.76086 -98.81387
Instead, you must add header = F (short for FALSE):
## V1 V2 V3
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
Now you can write the headers manually:
## Province Latitude Longitude
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
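The two steps above (reading without a header, then naming the columns) can be sketched as follows. The inline text stands in for a header-less copy of the provinces file, so provinces_demo is an illustrative name only.

```r
# Three records without a header line.
no_header_text <- "Saskatchewan,55,-106\nPrince Edward Island,46.25,-63\nOntario,50,-85"

# header = FALSE keeps the first record as data; R names the columns V1, V2, V3.
provinces_demo <- read.csv(text = no_header_text, header = FALSE)
colnames(provinces_demo)

# Overwrite the automatic names with meaningful ones.
colnames(provinces_demo) <- c("Province", "Latitude", "Longitude")
head(provinces_demo)
```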
Starts from Nth row
If there are lines that you must skip to reach the data, you can specify how many with the skip argument:
## Province Latitude Longitude
## 1 Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000 -63.0000
## 3 Ontario 50.00000 -85.0000
## 4 Nova Scotia 45.00000 -63.0000
## 5 Alberta 55.00000 -115.0000
## 6 British Columbia 53.72667 -127.6476
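A short sketch of skipping leading lines with the skip argument of read.csv. The note line at the top of the inline text is made up for illustration.

```r
# One made-up note line precedes the real header.
skip_text <- "Source: made-up note line\nProvince,Latitude,Longitude\nSaskatchewan,55,-106\nOntario,50,-85"

# skip = 1 drops the first line, so the second line becomes the header.
provinces_demo <- read.csv(text = skip_text, skip = 1)
head(provinces_demo)
```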
From Excel
Base R cannot read Excel files by default. You need an additional package, readxl, to read xlsx files. Once readxl is installed you can attach it with library('readxl') and then use the function read_excel(...). However, we only need a single function from it. So instead of attaching the whole package with library('readxl'), we can call that function directly through the package namespace as readxl::read_excel(...):
## # A tibble: 6 x 21
## id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.41e9 2014… 5.38e5 3 2.25 2570 7242 2 0
## 2 7.24e9 2014… 1.23e6 4 4.5 5420 101930 1 0
## 3 1.32e9 2014… 2.58e5 3 2.25 1715 6819 2 0
## 4 9.30e9 2015… 6.50e5 4 3 2950 5000 2 0
## 5 1.88e9 2014… 3.95e5 3 2 1890 14040 2 0
## 6 7.98e9 2015… 2.30e5 3 1 1250 9774 1 0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
By default it reads the first sheet. If you want the second sheet you can type readxl::read_excel('house_subset.xlsx',2) instead. Or you can give the sheet name too:
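A sketch of selecting a sheet by position and by name. Because the sheet names of house_subset.xlsx are not shown in this lab, the example uses the datasets.xlsx workbook that ships with readxl as a stand-in; apply the same sheet argument to your own file.

```r
# Example workbook bundled with readxl (a stand-in for house_subset.xlsx).
path <- readxl::readxl_example("datasets.xlsx")

# List the sheet names so we know what can be requested.
sheets <- readxl::excel_sheets(path)
sheets

# Read the second sheet by position, then the same sheet by its name.
by_position <- readxl::read_excel(path, sheet = 2)
by_name     <- readxl::read_excel(path, sheet = sheets[2])
head(by_name)
```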
From rda (R native format)
R has its own native format. It is sometimes convenient to write to and read from it because it can store any R object:
## # A tibble: 6 x 21
## id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.52e9 2014… 2.00e6 3 2.75 3050 44867 1 0
## 2 4.22e9 2015… 9.20e5 5 2.25 2730 6000 1.5 0
## 3 9.82e9 2014… 8.85e5 4 2.5 2830 5000 2 0
## 4 2.39e9 2015… 4.80e5 3 1 1040 5060 1 0
## 5 1.48e9 2014… 9.05e5 4 2.5 3300 10250 1 0
## 6 2.29e9 2014… 7.99e5 3 2.5 2140 9897 1 0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
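A minimal round trip through the rda format with save() and load(). The two-row house_demo data frame and the temporary file are made up for illustration; the lab's own rda file is read the same way with load().

```r
# Small made-up data frame standing in for the house data.
house_demo <- data.frame(price = c(538000, 1230000), bedrooms = c(3, 4))

# save() writes the object together with its name.
rda_path <- tempfile(fileext = ".rda")
save(house_demo, file = rda_path)

# Remove the object to show that load() brings it back.
rm(house_demo)

# load() restores the object under its original name and
# returns the names of everything it restored.
loaded_names <- load(rda_path)
loaded_names
head(house_demo)
```

Note that load() does not return the data itself, so there is nothing to assign with <- other than the vector of names.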
From SQL
SQL is an essential language for data scientists. You will learn how to write SQL in the Database course.
We downloaded the Ontario Covid Confirmed Cases data and imported it into SQLite. The database includes only one table, confirmed, which contains details about people who were confirmed as having Covid in Ontario.
You can run SQL queries while importing and assign the result to a data.frame:
## install.packages('RSQLite')
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con,
"SELECT Date, City, COUNT(*) as Confirmed
FROM confirmed
GROUP BY Date, City
ORDER BY Date")
dbDisconnect(con)
tail(covid)
## Date City Confirmed
## 1645 2020-05-12 London 2
## 1646 2020-05-12 Mississauga 1
## 1647 2020-05-12 Newmarket 1
## 1648 2020-05-12 Toronto 1
## 1649 2020-05-12 Whitby 3
## 1650 2020-05-12 Windsor 5
In this lab, we need the data in full detail so that we can play with it:
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)
head(covid)
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
Directly From Online Source
You can also read data into R directly from its source. The data below can be downloaded from the Government of Canada website:
covid.canada <- read.csv(url('https://health-infobase.canada.ca/src/data/covidLive/covid19.csv'))
head(covid.canada)
## pruid prname prnameFR date update numconf numprob
## 1 35 Ontario Ontario 31-01-2020 NA 3 0
## 2 59 British Columbia Colombie-Britannique 31-01-2020 NA 1 0
## 3 1 Canada Canada 31-01-2020 NA 4 0
## 4 35 Ontario Ontario 08-02-2020 NA 3 0
## 5 59 British Columbia Colombie-Britannique 08-02-2020 NA 4 0
## 6 1 Canada Canada 08-02-2020 NA 7 0
## numdeaths numtotal numtested numtests numrecover percentrecover ratetested
## 1 0 3 NA 0 NA NA
## 2 0 1 NA 0 NA NA
## 3 0 4 NA 0 NA NA
## 4 0 3 NA 0 NA NA
## 5 0 4 NA 63 NA NA
## 6 0 7 NA 63 NA NA
## ratetests numtoday percentoday ratetotal ratedeaths numdeathstoday
## 1 NA 3 300 0.02 0 0
## 2 NA 1 100 0.02 0 0
## 3 NA 4 400 0.01 0 0
## 4 NA 0 0 0.02 0 0
## 5 12 3 300 0.08 0 0
## 6 2 3 75 0.02 0 0
## percentdeath numtestedtoday numteststoday numrecoveredtoday percentactive
## 1 0 NA NA NA 100
## 2 0 NA NA NA 100
## 3 0 NA NA NA 100
## 4 0 NA NA NA 100
## 5 0 NA NA NA 100
## 6 0 NA NA NA 100
## numactive rateactive numtotal_last14 ratetotal_last14 numdeaths_last14
## 1 3 0.02 NA NA NA
## 2 1 0.02 NA NA NA
## 3 4 0.01 NA NA NA
## 4 3 0.02 NA NA NA
## 5 4 0.08 NA NA NA
## 6 7 0.02 NA NA NA
## ratedeaths_last14 numtotal_last7 ratetotal_last7 numdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## ratedeaths_last7 avgtotal_last7 avgincidence_last7 avgdeaths_last7
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## avgratedeaths_last7
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Visualization
Knowing your data
Before playing with the data it is best to know its properties:
- head(data) and tail(data) return the top 6 and bottom 6 rows
- str(data) returns the structure of the data
- dim(data) returns the dimensions, # of rows and # of columns
  - nrow(data) returns only # of rows
  - ncol(data) returns only # of columns
- colnames(data) and rownames(data) return the column and row names
- unique(series) returns the distinct values in the series
## [1] 841 21
There are 841 rows (houses) in the dataset and 21 columns of information.
## # A tibble: 6 x 21
## id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.49e9 2014… 4.20e5 3 2.5 1470 1571 2 0
## 2 7.90e9 2014… 3.90e5 3 3.25 1370 913 2 0
## 3 8.89e8 2014… 5.99e5 3 1.75 1650 1180 3 0
## 4 9.52e8 2014… 3.80e5 3 2.5 1260 900 2 0
## 5 1.91e8 2015… 1.58e6 4 3.25 3410 10125 2 0
## 6 3.00e9 2015… 4.75e5 3 2.5 1310 1294 2 0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
Apart from the price of the house, there is information about the number of bathrooms, square feet of living space, number of floors, whether or not it is on the waterfront, location and so on. One problem in our data is that zipcode, although a categorical variable, consists of digits and is therefore imported as a number (num in str, <dbl> in the output above).
Zipcode is a column of the dataset. Below we show the unique zipcodes, which tells us how many neighbourhoods are included in the dataset:
## [1] 98040 98105 98116
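The inspection functions listed above can be tried on any data frame. A sketch on a small made-up data frame (house_demo is a hypothetical name; the lab's house data has 841 rows and 21 columns):

```r
# Four made-up rows with three of the house columns.
house_demo <- data.frame(
  price    = c(420000, 390000, 599000, 380000),
  bedrooms = c(3, 3, 3, 3),
  zipcode  = c(98040, 98105, 98116, 98105)
)

dim(house_demo)        # rows, then columns
str(house_demo)        # column types and a preview of the values
colnames(house_demo)   # column names

# Distinct zipcodes, in order of first appearance.
zips <- unique(house_demo$zipcode)
zips
```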
Visualizing Numeric Data
In this subsection we will try to understand the relation between two continuous variables: sqft_living, the living space of the house, and its price.
Before starting, here is a short overview of plotting. Plots in ggplot2 are created by overlapping layers. The layers here refer to the background, the points in the plot, the title and many more. To plot a graph:
- You first create the space and give its data,
- Put the points, lines, etc.
- Add titles
- Tweak label and title sizes (if necessary)
- Change the layout (theme)
- And so on
As you can see it is an empty plot. We must add the points layer:
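The layering steps can be sketched as follows. The six rows are the sqft_living and price values printed in the head output above (prices as rounded there), stored in a hypothetical house_demo data frame; the title and axis labels are made up.

```r
library(ggplot2)

# Six rows taken from the head() output above.
house_demo <- data.frame(
  sqft_living = c(1470, 1370, 1650, 1260, 3410, 1310),
  price       = c(420000, 390000, 599000, 380000, 1580000, 475000)
)

# Step 1: create the space and give its data; with no layers, only axes appear.
p_empty <- ggplot(data = house_demo, aes(x = sqft_living, y = price))

# Step 2: add the points layer.
p_points <- p_empty + geom_point()

# Step 3: add titles and change the theme.
p_final <- p_points +
  labs(title = "Price vs. living space",
       x = "Living space (sq ft)", y = "Price") +
  theme_classic()
p_final
```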
Other Visuals
Boxplot
Similarly if you want a boxplot of prices you can:
Notice that for the scatter plot we used two variables, x and y, whereas for the histogram we used only x.
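A sketch of a boxplot of prices with geom_boxplot(). The prices are the rounded values printed in the outputs above; the zipcode assigned to each row is made up so that a grouped version can be shown as well.

```r
library(ggplot2)

# Prices from the printed outputs; the zipcode grouping is hypothetical.
house_demo <- data.frame(
  price   = c(420000, 390000, 599000, 380000, 1580000, 475000),
  zipcode = factor(c(98040, 98105, 98116, 98105, 98040, 98116))
)

# One box for all prices: only a single variable is needed.
p_box <- ggplot(data = house_demo) +
  geom_boxplot(aes(y = price))

# One box per zipcode: the categorical variable goes on x.
p_box_by_zip <- ggplot(data = house_demo) +
  geom_boxplot(aes(x = zipcode, y = price))
p_box_by_zip
```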
Histogram & Density
There are many kinds of graphs you can plot with ggplot, including but not limited to:
- line: geom_line()
- point: geom_point()
- barplot: geom_bar()
- boxplot: geom_boxplot()
- density: geom_density()
- histogram: geom_histogram()
- and more
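Two of the geoms in the list, histogram and density, need only an x variable. A sketch on the rounded prices printed in the outputs above; the number of bins is an arbitrary choice for such a small sample.

```r
library(ggplot2)

# Twelve rounded prices taken from the printed outputs above.
house_demo <- data.frame(
  price = c(420000, 390000, 599000, 380000, 1580000, 475000,
            538000, 1230000, 258000, 650000, 395000, 230000)
)

# Histogram: counts of houses per price bin.
p_hist <- ggplot(data = house_demo) +
  geom_histogram(aes(x = price), bins = 5)

# Density: a smoothed version of the same distribution.
p_dens <- ggplot(data = house_demo) +
  geom_density(aes(x = price))
p_hist
p_dens
```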
Aesthetic Mapping vs. Assignment: Colours, sizes, shapes and else
There is a very important distinction for you to keep in mind:
- Aesthetic mapping: the aes(...) function is a mapping function. It maps variables to shapes, sizes and colours.
- Assigning: instead of mapping (dynamic), you assign a single colour, size etc. to all the points.
Let’s begin with aesthetic mapping. The colour can be given with the colour parameter inside the aes() function:
On the other hand, you can assign a specific colour, say firebrick, to the points by writing it outside of the aes() function:
Or a combination of both
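The three cases can be sketched side by side. The rows come from the head output above (sqft_living, price and floors), kept in a hypothetical house_demo data frame; mapping colour to floors is just one possible choice of variable.

```r
library(ggplot2)

# Six rows taken from the head() output above.
house_demo <- data.frame(
  sqft_living = c(1470, 1370, 1650, 1260, 3410, 1310),
  price       = c(420000, 390000, 599000, 380000, 1580000, 475000),
  floors      = c(2, 2, 3, 2, 2, 2)
)

# Mapping: colour inside aes() follows a column, so a legend is drawn.
p_mapped <- ggplot(data = house_demo) +
  geom_point(aes(x = sqft_living, y = price, colour = floors))

# Assignment: colour outside aes() is one fixed value for every point.
p_assigned <- ggplot(data = house_demo) +
  geom_point(aes(x = sqft_living, y = price), colour = "firebrick")

# Combination: colour is mapped, size is assigned.
p_both <- ggplot(data = house_demo) +
  geom_point(aes(x = sqft_living, y = price, colour = floors), size = 3)
p_both
```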
alpha, the little touch that makes things beautiful
alpha is the transparency of the object. For crowded data like the plot below it makes a lot of difference:
ggplot(data = house) +
geom_point(aes(x = sqft_living, y = price, colour=zipcode), size=2, alpha=0.6)
Or you can combine it with a couple of other options so that the plot gives more information about the data:
ggplot(data = house) +
geom_point(aes(x = sqft_living, y = price), alpha=0.2, size =4, colour='firebrick') +
theme_classic()
Converting to categorical
As we discussed above, the zipcode variable is not continuous but categorical; since it consists of digits, it is read as numeric and therefore treated as a continuous variable. We must know this in advance and convert it to a factor during preprocessing:
ggplot(data = house) +
geom_point(aes(x = sqft_living, y = price, colour=zipcode, size = floors), alpha=0.5)
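The conversion step itself is not shown above. A minimal sketch, assuming as.factor() is used and applied to a made-up house_demo data frame; on the lab data the same line would be applied to house$zipcode before plotting.

```r
# Four made-up rows; zipcode starts out as a plain number.
house_demo <- data.frame(zipcode = c(98040, 98105, 98116, 98105))
class_before <- class(house_demo$zipcode)
class_before

# Convert to a factor so that plots treat it as categories.
house_demo$zipcode <- as.factor(house_demo$zipcode)
class_after <- class(house_demo$zipcode)
class_after

# The distinct zipcodes become the factor levels.
levels(house_demo$zipcode)
```

With a factor, colour = zipcode produces one discrete colour per neighbourhood instead of a continuous colour gradient.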
Try out yourself
Produce the following:
- Assign colour = 'black', shape = 23, stroke=2, alpha=.3, size=3
- Use aesthetic mapping to change the fill only for the waterfront houses with more than 2 bathrooms: fill = waterfront == 1 & bathrooms >2
- Add classical theme: ggplot(...) + geom_point(...) + theme_classic()
Smoothers
To see the trend in the data we may choose to add smoothers on the plot:
ggplot(data = house) +
geom_point(aes(x = sqft_living, y = price), size=3, alpha=.5) +
geom_smooth(aes(x = sqft_living, y = price), color='firebrick')
If we make the colour change with the zipcode by adding it inside aes(), then:
ggplot(data = house) +
geom_point(aes(x = sqft_living, y = price), size=3, alpha=.5) +
geom_smooth(aes(x = sqft_living, y = price, color=zipcode))
Adding aes into top level
You can see that the code is becoming verbose. The aes function takes almost the same parameters in every layer, but we have to write it twice. Instead we can move aes() to the top level and give it the parameters that are common to all layers. Let’s also remove the error bands (the shaded confidence intervals):
ggplot(data = house, aes(x = sqft_living, y = price)) +
geom_point(size=3, alpha=.5) +
geom_smooth(aes(color=zipcode), se=F)
As you can see above, aes(color=zipcode) inside geom_smooth() is combined with the top-level aes(x=sqft_living, y=price). We can also override the top-level settings if necessary. For example, we may think it is unnecessary to add a smoother for zipcode == 98040. So we can override the data that the smoother is applied to:
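One way to do this, sketched on simulated data, is to pass a data argument to geom_smooth() so that only that layer sees the filtered rows. The house_demo data frame below is randomly generated for illustration and is not the lab's house data.

```r
library(ggplot2)

# Simulated stand-in: 20 houses in each of the three zipcodes.
set.seed(1)
house_demo <- data.frame(
  sqft_living = rep(seq(1000, 3000, length.out = 20), times = 3),
  zipcode     = factor(rep(c(98040, 98105, 98116), each = 20))
)
house_demo$price <- 200 * house_demo$sqft_living + rnorm(60, sd = 50000)

# Points use all rows from the top level; the smoother gets its own data,
# with zipcode 98040 filtered out.
p <- ggplot(data = house_demo, aes(x = sqft_living, y = price)) +
  geom_point(size = 3, alpha = .5) +
  geom_smooth(data = subset(house_demo, zipcode != "98040"),
              aes(color = zipcode), se = FALSE)
p
```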
Facets
You may have noticed that the data is too dense and overlapped. We could have some insights such as in the neighbourhood where zipcode = 98116 the houses are cheaper. But still it is hard to see the trends.
Facets can help us to see the details:
ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) +
geom_point(size=3,alpha=.5) +
facet_wrap( ~ zipcode)
We can also change the orientation of facets, also add the smoothers back:
ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) +
geom_point(size=3,alpha=.5) +
geom_smooth(color='black') +
facet_wrap( ~ zipcode, dir='v')
Overwriting the facet captions:
zipcode.labs <- c('Mercer Island','University District', 'West Seattle')
names(zipcode.labs) <- c('98040','98105','98116')
ggplot(data = house, aes(x = sqft_living, y = price, color=zipcode)) +
geom_point(size=3,alpha=.5) +
geom_smooth(color='black') +
facet_wrap( ~ zipcode, dir='v', labeller = labeller(zipcode = zipcode.labs))
Facet grids
Facet grids are also a good option when you want to see the scatter plot broken down by two categorical variables.
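A sketch of facet_grid() with one categorical variable for the rows and one for the columns. The data is made up, and the choice of waterfront and zipcode as the two variables is an assumption for illustration.

```r
library(ggplot2)

# Made-up grid of houses: every combination of size, zipcode and waterfront.
house_demo <- expand.grid(
  sqft_living = c(1000, 2000, 3000),
  zipcode     = factor(c(98040, 98105, 98116)),
  waterfront  = c(0, 1)
)
house_demo$price <- 200 * house_demo$sqft_living + 300000 * house_demo$waterfront

# Rows of panels follow waterfront, columns of panels follow zipcode.
p_grid <- ggplot(data = house_demo, aes(x = sqft_living, y = price)) +
  geom_point(size = 3, alpha = .5) +
  facet_grid(waterfront ~ zipcode)
p_grid
```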
Themes
There are some default themes available to make your plot look beautiful as it deserves. Here are some of them:
theme_bw
theme_classic
theme_dark
theme_light
theme_linedraw
theme_minimal
Visualizing Categorical Data
We cannot use the methods we used for numeric data on categorical data. For example, the covid data doesn’t have a continuous variable (except longitude and latitude). Questions such as which gender COVID-19 affects the most, or how severe the pandemic is in different cities, cannot be answered with scatter plots or boxplots. We need to visualize the counts of people falling into the categories:
## Row_ID Date Age_Group Gender Acquisition Outcome1
## 1 1 2020-04-29 50s FEMALE Information pending Not Resolved
## 2 2 2020-03-04 40s FEMALE Contact of a confirmed case Resolved
## 3 3 2020-05-04 20s FEMALE Information pending Not Resolved
## 4 4 2020-05-02 50s FEMALE Information pending Not Resolved
## 5 5 2020-04-28 50s MALE Neither Not Resolved
## 6 6 2020-04-10 30s MALE Neither Resolved
## Reporting_PHU Address City
## 1 Peel Public Health 7120 Hurontario Street Mississauga
## 2 Ottawa Public Health 100 Constellation Drive Ottawa
## 3 Windsor-Essex County Health Unit 1005 Ouellette Avenue Windsor
## 4 Region of Waterloo, Public Health 99 Regina Street South Waterloo
## 5 York Region Public Health Services 17250 Yonge Street Newmarket
## 6 Ottawa Public Health 100 Constellation Drive Ottawa
## Postal_Code Reporting_PHU_Website Latitude Longitude
## 1 L5W 1N4 www.peelregion.ca/health/ 43.6474713 -79.7088933
## 2 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
## 3 N9A 4J8 www.wechu.org 42.3087965 -83.0336705
## 4 N2J 4V3 www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5 L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/ 44.048023 -79.480239
## 6 K2G 6J8 www.ottawapublichealth.ca 45.3456651 -75.7639122
This is a bar plot, but stat_count() doesn’t give the flexibility we want. Instead we can use a bar plot with the stat() function:
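A sketch of the default bar plot and of the same plot with the count written out explicitly. It uses after_stat(), which newer ggplot2 releases (3.3.0 and later) offer in place of stat(); the six Age_Group values are copied from the head output above into a hypothetical covid_demo data frame.

```r
library(ggplot2)

# Age groups of the six rows shown in the head() output above.
covid_demo <- data.frame(
  Age_Group = c("50s", "40s", "20s", "50s", "50s", "30s")
)

# Default: geom_bar() counts the rows in each category by itself.
p_default <- ggplot(data = covid_demo) +
  geom_bar(aes(y = Age_Group))

# Explicit: the computed count is mapped to x, which lets us transform it.
p_count <- ggplot(data = covid_demo) +
  geom_bar(aes(y = Age_Group, x = after_stat(count)))
p_count
```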
stat(count) gives us what we want. It transforms the data falling into each category using the count function. We can also use it to do the math we need. Instead of counts on the x axis, we can display proportions of the dataset by dividing by the number of rows in the covid dataset:
ggplot(data = covid) +
geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), color=Age_Group), alpha=.6)
Now let’s add more information to the plot. What is the proportion of men in each group?
ggplot(data = covid) +
geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Gender), alpha=.8)
Bar Chart Types (Position adjustments)
Let’s begin by asking: what is the recovery rate?
ggplot(data = covid) +
geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Outcome1), alpha=.7)
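Although we can see that older age groups have more fatal outcomes, the bars have different lengths, which makes the shares hard to compare. A 100% stacked bar chart, in which every bar is stretched to the same length, helps here. This can be done by adding position="fill". Also let’s filter out the unknown groups (using subset()):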
ggplot(data = subset(covid, Age_Group != 'Unknown')) +
geom_bar(aes(y = Age_Group, x = stat(count/nrow(covid)), fill=Outcome1), alpha=.7, position="fill")
Another way of comparing exact numbers among groups is plotting the bars next to one another (position = "dodge"):
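A sketch of the dodged version. The six Age_Group and Outcome1 values are copied from the head output above into a hypothetical covid_demo data frame; on the lab data you would pass the covid data frame instead.

```r
library(ggplot2)

# Six rows copied from the head() output above.
covid_demo <- data.frame(
  Age_Group = c("50s", "40s", "20s", "50s", "50s", "30s"),
  Outcome1  = c("Not Resolved", "Resolved", "Not Resolved",
                "Not Resolved", "Not Resolved", "Resolved")
)

# position = "dodge" places the outcome bars side by side within each age group.
p_dodge <- ggplot(data = covid_demo) +
  geom_bar(aes(y = Age_Group, fill = Outcome1), alpha = .7, position = "dodge")
p_dodge
```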
Playing with Coordinates
Let’s go back to the bar chart, but swap the axes (x with y):
ggplot(data = subset(covid, Age_Group != 'Unknown')) +
geom_bar(aes(x = Age_Group, y = stat(count/nrow(covid)), fill=Outcome1), alpha=.8)
It is very easy to do magic with the above one:
ggplot(data = subset(covid, Age_Group != 'Unknown')) +
geom_bar(aes(x = Age_Group, y = stat(count/nrow(covid)), fill=Outcome1), alpha=.8) +
coord_polar() +
theme_bw()
Let’s do more magic:
cnd <- subset(map_data("world"), region=='Canada')
ggplot(data = cnd, aes(x=long, y=lat, group=group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap() +
annotate('point',x=-80.5393899, y=43.4704571, size=5, alpha=.5, color='firebrick') +
annotate('text',x=-80.5393899, y=43.4704571, size=5, alpha=.5, label='You are here', vjust=1,hjust=-0.1) +
theme_void()
at least I am there.
You will be proficient in magic later in this course
library(lubridate)
cnd <- subset(map_data("world"), region=='Canada')
covid.canada$date <- dmy(covid.canada$date)
last_info <- subset(covid.canada, date == (Sys.Date() -1)) # Yesterday
last_info <- last_info[,c('prname','numconf')] # province name and confirmed cases
locs <- read.csv('data/provinces.csv', header=TRUE)
# the file's columns are Province, Latitude, Longitude, in that order
colnames(locs) <- c('prname','lat','long')
last_info <- merge(last_info, locs, by='prname')
ggplot(data = cnd, aes(x=long, y=lat, group=group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap() +
geom_point(data = last_info, mapping = aes(x = long, y=lat, size=numconf, group=1, color=prname), alpha=.7) +
theme_void()
and some more
Exercises (not deliverables)
This part is just an exercise. You will not be marked.
- Produce the following:
  - Assign colour = 'black', shape = 23, stroke=2, alpha=.3, size=3
  - Use aesthetic mapping to change the fill only for the waterfront houses with more than 2 bathrooms: fill = waterfront == 1 & bathrooms >2
  - Add classical theme: ggplot(...) + geom_point(...) + theme_classic()
- Using house data, plot the histogram of different locations:
  - Plot the histogram with colours changing w.r.t. zipcode (aes) and assign alpha = 0.7
  - Add a facet, facet_wrap(...)
- Draw a scatter plot with sqft_basement on x and sqft_above on y:
  - Assign colour = 'firebrick' and alpha = 0.3
  - Add a smoother and assign the smoother’s colour = 'black'
  - Do you think there is a relation between the two variables? If so, is the relation linear?
- Draw a barplot of people who were affected by COVID-19 by Age_Group and Outcome1 in Waterloo:
  - Instead of using data=covid, use data=subset(covid, City == 'Waterloo')
  - Your plot should be horizontal and the bars grouped by Age_Group
  - The colours of the bars will show Fatal, Not Resolved and Resolved
- Compare Male and Female affected by COVID-19 in Canada by Age_Group using radial coordinates:
  - First draw a column chart and filter the age group 'Unknown' out.
  - Put Age_Group on the x axis and stat(count) on y.
  - Use aesthetic mapping to fill = Outcome1
  - Add coord_polar() and theme_bw()