Introduction

In this tutorial, you will learn how to import data from external sources, preprocess it and plot a simple graph. By the end of the tutorial you will be able to:

read from xlsx, csv / txt / tsv, rda and SQL database,
get familiar with your data by summarizing meta information,
draw different plots with ggplot2,
visualize continuous data.

Getting Started

We will play with

House Prices data used once in Kaggle Challenge, which involves house prices and properties for various different houses in Washington
Ontario Covid Confirmed Cases data

You will need

tidyverse (install if not yet installed)
house_subset, provinces and covid datasets

You will also need some additional packages

# install.packages('tidyverse')
# install.packages('RSQLite')
# install.packages('readxl')

Now call tidyverse library:

library('tidyverse')

Reading Data

Read from CSV / TXT / TSV

These files are all in same format, and even their extension doesn’t matter. Whatever the name is the columns are separated either

by a comma (,)
a tab (\t)
or anything else

By default read.csv expects comma, and if something else appears it doesn’t split anything. Besides

The file may not have a header,
It may start in the nth row and so on.

Comma Separated

If your file is comma separated then it must look like this:

Then you can read it easily as:

provinces <- read.csv('data/provinces.csv') # by default it expects , to separate columns
provinces

##                     Province Latitude  Longitude
## 1               Saskatchewan 55.00000 -106.00000
## 2       Prince Edward Island 46.25000  -63.00000
## 3                    Ontario 50.00000  -85.00000
## 4                Nova Scotia 45.00000  -63.00000
## 5                    Alberta 55.00000 -115.00000
## 6           British Columbia 53.72667 -127.64762
## 7                   Manitoba 53.76086  -98.81387
## 8  Newfoundland and Labrador 53.13551  -57.66043
## 9              New Brunswick 46.49839  -66.15967
## 10                    Quebec 53.00000  -70.00000

dim(provinces)

## [1] 10  3

There is information for 10 Canadian provinces and the data is 3 columns.

Tab or Anything Else Separated

tab vs semicolon separated

However when the data is tab or something else separated (e.g. semicolon) as above, then the default option will read the data incorrectly as:

provinces <- read.csv('data/provinces.tsv')
head(provinces)

##                Province.Latitude.Longitude
## 1                   Saskatchewan\t55\t-106
## 2         Prince Edward Island\t46.25\t-63
## 3                         Ontario\t50\t-85
## 4                     Nova Scotia\t45\t-63
## 5                        Alberta\t55\t-115
## 6 British Columbia\t53.726669\t-127.647621

Notice that between each column there is a \t sign and all is in 1 column:

dim(provinces)

## [1] 10  1

which means the data is not properly read. To read it properly we have to specify the separater:

provinces <- read.csv('data/provinces.tsv', sep='\t')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

If it is separated by semicolons (‘;’) then

provinces <- read.csv('data/provinces_semicol.txt', sep=';')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

Without Header

Data do not always have header, and if you don’t specify it, the first row will be read as header:

provinces <- read.csv('data/provinces_headless.csv')
head(provinces)

##           Saskatchewan      X55      X.106
## 1 Prince Edward Island 46.25000  -63.00000
## 2              Ontario 50.00000  -85.00000
## 3          Nova Scotia 45.00000  -63.00000
## 4              Alberta 55.00000 -115.00000
## 5     British Columbia 53.72667 -127.64762
## 6             Manitoba 53.76086  -98.81387

Instead, you must add header = F:

provinces <- read.csv('data/provinces_headless.csv', header = F)
head(provinces)

##                     V1       V2        V3
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

Now you can write the headers manually:

colnames(provinces) <- c('Province','Latitude','Longitude')
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

Starts from Nth row

If there is a line that you must skip to reach the data you can specify it as

provinces <- read.csv('data/provinces_nth_row.csv', skip = 2)
head(provinces)

##               Province Latitude Longitude
## 1         Saskatchewan 55.00000 -106.0000
## 2 Prince Edward Island 46.25000  -63.0000
## 3              Ontario 50.00000  -85.0000
## 4          Nova Scotia 45.00000  -63.0000
## 5              Alberta 55.00000 -115.0000
## 6     British Columbia 53.72667 -127.6476

From Excel

R is an open source program and is not by default compatible with Excel. You need an additional package, readxl, to read xlsx files. When you have readxl installed you can call the library with library('readxl') then use the function read_excel(...). However we don’t want unnecessary functions to be loaded to the RAM. So instead of using library('readxl') to call all library, we can

# install.packages('readxl')
house <- readxl::read_excel('data/house_subset.xlsx')
head(house)

## # A tibble: 6 x 21
##       id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##    <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1 6.41e9 2014… 5.38e5        3      2.25        2570     7242      2          0
## 2 7.24e9 2014… 1.23e6        4      4.5         5420   101930      1          0
## 3 1.32e9 2014… 2.58e5        3      2.25        1715     6819      2          0
## 4 9.30e9 2015… 6.50e5        4      3           2950     5000      2          0
## 5 1.88e9 2014… 3.95e5        3      2           1890    14040      2          0
## 6 7.98e9 2015… 2.30e5        3      1           1250     9774      1          0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>

By default it reads the first sheet. If you want the second page you can type readxl::read_excel('house_subset.xlsx',2) instead. Or you can write the sheet name too:

# install.packages('readxl')
house <- readxl::read_excel('data/house_subset.xlsx', 'prices')

From rda (R native format)

R has its own format, sometimes it is convenient to write and read from it because it is capable of importing ny data type:

house <- get(load('data/house_small_subset.rda'))
head(house)

## # A tibble: 6 x 21
##       id date   price bedrooms bathrooms sqft_living sqft_lot floors waterfront
##    <dbl> <chr>  <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>      <dbl>
## 1 2.52e9 2014… 2.00e6        3      2.75        3050    44867    1            0
## 2 4.22e9 2015… 9.20e5        5      2.25        2730     6000    1.5          0
## 3 9.82e9 2014… 8.85e5        4      2.5         2830     5000    2            0
## 4 2.39e9 2015… 4.80e5        3      1           1040     5060    1            0
## 5 1.48e9 2014… 9.05e5        4      2.5         3300    10250    1            0
## 6 2.29e9 2014… 7.99e5        3      2.5         2140     9897    1            0
## # … with 12 more variables: view <dbl>, condition <dbl>, grade <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   zipcode <dbl>, lat <dbl>, long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>

From SQL

SQL is the language for any data scientist. You will learn how to write SQL in Database course.

We downloaded Ontario Covid Confirmed Cases data and imported to SQLite. The dataset includes only one table, confirmed, which contains details about people who were confirmed having Covid in Ontario.

You can run SQL codes prior to importing, and assign the result into a data.frame:

## install.packages('RSQLite')
library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con, 
          "SELECT Date, City, COUNT(*) as Confirmed
          FROM confirmed
          GROUP BY Date, City
          ORDER BY Date")
dbDisconnect(con)

tail(covid)

##            Date        City Confirmed
## 1645 2020-05-12      London         2
## 1646 2020-05-12 Mississauga         1
## 1647 2020-05-12   Newmarket         1
## 1648 2020-05-12     Toronto         1
## 1649 2020-05-12      Whitby         3
## 1650 2020-05-12     Windsor         5

In this lab, we need the data in full detail so that we can play with it:

library('RSQLite')
con <- dbConnect(RSQLite::SQLite(), dbname = "data/covid.db")
covid <- dbGetQuery(con, "SELECT * FROM confirmed")
dbDisconnect(con)

head(covid)

##   Row_ID       Date Age_Group Gender                 Acquisition     Outcome1
## 1      1 2020-04-29       50s FEMALE         Information pending Not Resolved
## 2      2 2020-03-04       40s FEMALE Contact of a confirmed case     Resolved
## 3      3 2020-05-04       20s FEMALE         Information pending Not Resolved
## 4      4 2020-05-02       50s FEMALE         Information pending Not Resolved
## 5      5 2020-04-28       50s   MALE                     Neither Not Resolved
## 6      6 2020-04-10       30s   MALE                     Neither     Resolved
##                        Reporting_PHU                 Address        City
## 1                 Peel Public Health  7120 Hurontario Street Mississauga
## 2               Ottawa Public Health 100 Constellation Drive      Ottawa
## 3   Windsor-Essex County Health Unit   1005 Ouellette Avenue     Windsor
## 4  Region of Waterloo, Public Health  99 Regina Street South    Waterloo
## 5 York Region Public Health Services      17250 Yonge Street   Newmarket
## 6               Ottawa Public Health 100 Constellation Drive      Ottawa
##   Postal_Code                   Reporting_PHU_Website    Latitude    Longitude
## 1     L5W 1N4               www.peelregion.ca/health/  43.6474713  -79.7088933
## 2     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122
## 3     N9A 4J8                           www.wechu.org  42.3087965  -83.0336705
## 4     N2J 4V3           www.chd.region.waterloo.on.ca 43.46287573 -80.52091315
## 5     L3Y 6Z1 www.york.ca/wps/portal/yorkhome/health/   44.048023   -79.480239
## 6     K2G 6J8               www.ottawapublichealth.ca  45.3456651  -75.7639122

Directly From Online Source

You can also read data into R directly from its source. The below data can be downloaded from Government of Canada Website:

covid.canada <- read.csv(url('https://health-infobase.canada.ca/src/data/covidLive/covid19.csv')) 
head(covid.canada)

##   pruid           prname             prnameFR       date update numconf numprob
## 1    35          Ontario              Ontario 31-01-2020     NA       3       0
## 2    59 British Columbia Colombie-Britannique 31-01-2020     NA       1       0
## 3     1           Canada               Canada 31-01-2020     NA       4       0
## 4    35          Ontario              Ontario 08-02-2020     NA       3       0
## 5    59 British Columbia Colombie-Britannique 08-02-2020     NA       4       0
## 6     1           Canada               Canada 08-02-2020     NA       7       0
##   numdeaths numtotal numtested numrecover percentrecover ratetested numtoday
## 1         0        3        NA         NA                        NA        3
## 2         0        1        NA         NA                        NA        1
## 3         0        4        NA         NA                        NA        4
## 4         0        3        NA         NA                        NA        0
## 5         0        4        NA         NA                        NA        3
## 6         0        7        NA         NA                        NA        3
##   percentoday ratetotal ratedeaths numdeathstoday percentdeath numtestedtoday
## 1         300      0.02          0              0            0             NA
## 2         100      0.02          0              0            0             NA
## 3         400      0.01          0              0            0             NA
## 4           0      0.02          0              0            0             NA
## 5         300      0.08          0              0            0             NA
## 6          75      0.02          0              0            0             NA
##   numrecoveredtoday percentactive numactive rateactive numtotal_last14
## 1                NA           100         3       0.02              NA
## 2                NA           100         1       0.02              NA
## 3                NA           100         4       0.01              NA
## 4                NA           100         3       0.02              NA
## 5                NA           100         4       0.08              NA
## 6                NA           100         7       0.02              NA
##   ratetotal_last14 numdeaths_last14 ratedeaths_last14 numtotal_last7
## 1               NA               NA                NA             NA
## 2               NA               NA                NA             NA
## 3               NA               NA                NA             NA
## 4               NA               NA                NA             NA
## 5               NA               NA                NA             NA
## 6               NA               NA                NA             NA
##   ratetotal_last7 numdeaths_last7 ratedeaths_last7 avgtotal_last7
## 1              NA              NA               NA             NA
## 2              NA              NA               NA             NA
## 3              NA              NA               NA             NA
## 4              NA              NA               NA             NA
## 5              NA              NA               NA             NA
## 6              NA              NA               NA             NA
##   avgincidence_last7 avgdeaths_last7 avgratedeaths_last7
## 1                 NA              NA                  NA
## 2                 NA              NA                  NA
## 3                 NA              NA                  NA
## 4                 NA              NA                  NA
## 5                 NA              NA                  NA
## 6                 NA              NA                  NA

Preprocessing Data

Getting familiar with your data

Before playing with the data it is best to know its properties:

head(data) and tail(data) returns top 6 and bottom 6 rows
str(data) returns the structure of the data
dim(data) returns the dimensions, # or rows and # of columns
- nrow(data) returns only # of rows
- ncol(data) returns only # of columns
colnames(data) and rownames(data) returns the column and row names
unique(series) returns nonduplicating data points in the series

covid <- read.csv('data/covid_confirmed.csv')
dim(covid)

## [1] 237786     10

There are 237786 confirmed cases in Ontario since the beginning of the disease. The dataset contains 10 columns of information.

str(covid)

## 'data.frame':    237786 obs. of  10 variables:
##  $ Row_ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Case_Reported_Date  : Factor w/ 329 levels "2020-01-23","2020-01-24",..: 42 42 42 42 39 42 42 39 40 39 ...
##  $ Age_Group           : Factor w/ 11 levels "","<20","20s",..: 5 7 5 8 9 6 4 7 9 9 ...
##  $ Gender              : Factor w/ 4 levels "FEMALE","GENDER DIVERSE",..: 3 3 3 3 3 1 1 3 3 1 ...
##  $ Case_AcquisitionInfo: Factor w/ 6 levels "CC","MISSING INFORMATION",..: 1 2 3 1 4 4 3 1 1 4 ...
##  $ Outcome1            : Factor w/ 3 levels "Fatal","Not Resolved",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ City                : Factor w/ 34 levels "Barrie","Belleville",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ Postal_Code         : Factor w/ 34 levels "K2G 6J8","K6J 5T1",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ Latitude            : num  43.7 43.7 43.7 43.7 43.7 ...
##  $ Longitude           : num  -79.4 -79.4 -79.4 -79.4 -79.4 ...

The dataset contains detailed information about the date of report, age and gender of the person who was tested positive, the outcome, City etc.

City is the city that the person resides. Below we show the unique cities included in the data:

unique(covid$City)

##  [1] Toronto          Barrie           Chatham          Thorold         
##  [5] Hamilton         Windsor          New Liskeard     Peterborough    
##  [9] Sault Ste. Marie Brantford        Sudbury          St. Thomas      
## [13] Oakville         Mississauga      Whitby           Newmarket       
## [17] Ottawa           Waterloo         Point Edward     London          
## [21] Stratford        Cornwall         Thunder Bay      Guelph          
## [25] Kingston         Brockville       Simcoe           Port Hope       
## [29] Owen Sound       Belleville       Kenora           Timmins         
## [33] North Bay        Pembroke        
## 34 Levels: Barrie Belleville Brantford Brockville Chatham Cornwall ... Windsor

Basic operations

In R, there are some operations that you must now by heart. These are selecting row and columns, slicing the data and so on. The “by heart” here means you must know the way they work even though you may not need them.

Take the COVID data:

head(covid,10)

##    Row_ID Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 1       1         2020-04-04       40s   MALE                   CC Resolved
## 2       2         2020-04-04       60s   MALE  MISSING INFORMATION Resolved
## 3       3         2020-04-04       40s   MALE    NO KNOWN EPI LINK Resolved
## 4       4         2020-04-04       70s   MALE                   CC Resolved
## 5       5         2020-04-01       80s   MALE                   OB Resolved
## 6       6         2020-04-04       50s FEMALE                   OB Resolved
## 7       7         2020-04-04       30s FEMALE    NO KNOWN EPI LINK Resolved
## 8       8         2020-04-01       60s   MALE                   CC Resolved
## 9       9         2020-04-02       80s   MALE                   CC Resolved
## 10     10         2020-04-01       80s FEMALE                   OB    Fatal
##       City Postal_Code Latitude Longitude
## 1  Toronto     M5B 1W2 43.65659 -79.37936
## 2  Toronto     M5B 1W2 43.65659 -79.37936
## 3  Toronto     M5B 1W2 43.65659 -79.37936
## 4  Toronto     M5B 1W2 43.65659 -79.37936
## 5  Toronto     M5B 1W2 43.65659 -79.37936
## 6  Toronto     M5B 1W2 43.65659 -79.37936
## 7  Toronto     M5B 1W2 43.65659 -79.37936
## 8  Toronto     M5B 1W2 43.65659 -79.37936
## 9  Toronto     M5B 1W2 43.65659 -79.37936
## 10 Toronto     M5B 1W2 43.65659 -79.37936

As seen above, the dataset is not as much detailed as it promisses. It masks the zipcode by reporting the same for the City.

The dataset is very big, so for this section we will work with a small proportion of it:

covid.small <- tail(covid, 10)

Selecting Columns

Since this is a dataframe, you can use musicians$birthday to choose the third column. But for matrices it wont’ work. There is another notation that you need to know: You can select certain columns and rows using [ , ] notation:

covid.small[ ,2]

##  [1] 2020-04-09 2020-04-03 2020-04-03 2020-04-01 2020-04-03 2020-04-12
##  [7] 2020-04-12 2020-04-12 2020-04-08 2020-04-12
## 329 Levels: 2020-01-23 2020-01-24 2020-02-21 2020-02-25 ... 2021-01-16

covid.small[ ,2:4]

##        Case_Reported_Date Age_Group Gender
## 237777         2020-04-09       80s   MALE
## 237778         2020-04-03       80s FEMALE
## 237779         2020-04-03       90s FEMALE
## 237780         2020-04-01       90s FEMALE
## 237781         2020-04-03       80s   MALE
## 237782         2020-04-12       20s   MALE
## 237783         2020-04-12       50s FEMALE
## 237784         2020-04-12       40s   MALE
## 237785         2020-04-08       90s FEMALE
## 237786         2020-04-12       30s FEMALE

covid.small[ , c(2,3,4,6,7)]

##        Case_Reported_Date Age_Group Gender Outcome1    City
## 237777         2020-04-09       80s   MALE Resolved Toronto
## 237778         2020-04-03       80s FEMALE Resolved Toronto
## 237779         2020-04-03       90s FEMALE    Fatal Toronto
## 237780         2020-04-01       90s FEMALE    Fatal Toronto
## 237781         2020-04-03       80s   MALE Resolved Toronto
## 237782         2020-04-12       20s   MALE Resolved Toronto
## 237783         2020-04-12       50s FEMALE Resolved Toronto
## 237784         2020-04-12       40s   MALE Resolved Toronto
## 237785         2020-04-08       90s FEMALE Resolved Toronto
## 237786         2020-04-12       30s FEMALE Resolved Toronto

or we can drop unwanted columns by using ‘-’ sign:

covid.small[ , -c(1,5,8:10)]

##        Case_Reported_Date Age_Group Gender Outcome1    City
## 237777         2020-04-09       80s   MALE Resolved Toronto
## 237778         2020-04-03       80s FEMALE Resolved Toronto
## 237779         2020-04-03       90s FEMALE    Fatal Toronto
## 237780         2020-04-01       90s FEMALE    Fatal Toronto
## 237781         2020-04-03       80s   MALE Resolved Toronto
## 237782         2020-04-12       20s   MALE Resolved Toronto
## 237783         2020-04-12       50s FEMALE Resolved Toronto
## 237784         2020-04-12       40s   MALE Resolved Toronto
## 237785         2020-04-08       90s FEMALE Resolved Toronto
## 237786         2020-04-12       30s FEMALE Resolved Toronto

or call them by name:

covid.small[ ,c('Age_Group','Gender','Outcome1')]

##        Age_Group Gender Outcome1
## 237777       80s   MALE Resolved
## 237778       80s FEMALE Resolved
## 237779       90s FEMALE    Fatal
## 237780       90s FEMALE    Fatal
## 237781       80s   MALE Resolved
## 237782       20s   MALE Resolved
## 237783       50s FEMALE Resolved
## 237784       40s   MALE Resolved
## 237785       90s FEMALE Resolved
## 237786       30s FEMALE Resolved

Selecting Rows

covid.small[1:2, ]

##        Row_ID Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 237777 237777         2020-04-09       80s   MALE                   OB Resolved
## 237778 237778         2020-04-03       80s FEMALE                   OB Resolved
##           City Postal_Code Latitude Longitude
## 237777 Toronto     M5B 1W2 43.65659 -79.37936
## 237778 Toronto     M5B 1W2 43.65659 -79.37936

rowIDs  <- c(1,2,4) 
covid.small[rowIDs, ]

##        Row_ID Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 237777 237777         2020-04-09       80s   MALE                   OB Resolved
## 237778 237778         2020-04-03       80s FEMALE                   OB Resolved
## 237780 237780         2020-04-01       90s FEMALE                   OB    Fatal
##           City Postal_Code Latitude Longitude
## 237777 Toronto     M5B 1W2 43.65659 -79.37936
## 237778 Toronto     M5B 1W2 43.65659 -79.37936
## 237780 Toronto     M5B 1W2 43.65659 -79.37936

or we can drop the rows 3 and 4 by using ‘-’ sign:

covid.small[-c(3:4),]

##        Row_ID Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 237777 237777         2020-04-09       80s   MALE                   OB Resolved
## 237778 237778         2020-04-03       80s FEMALE                   OB Resolved
## 237781 237781         2020-04-03       80s   MALE                   OB Resolved
## 237782 237782         2020-04-12       20s   MALE    NO KNOWN EPI LINK Resolved
## 237783 237783         2020-04-12       50s FEMALE    NO KNOWN EPI LINK Resolved
## 237784 237784         2020-04-12       40s   MALE    NO KNOWN EPI LINK Resolved
## 237785 237785         2020-04-08       90s FEMALE                   OB Resolved
## 237786 237786         2020-04-12       30s FEMALE  MISSING INFORMATION Resolved
##           City Postal_Code Latitude Longitude
## 237777 Toronto     M5B 1W2 43.65659 -79.37936
## 237778 Toronto     M5B 1W2 43.65659 -79.37936
## 237781 Toronto     M5B 1W2 43.65659 -79.37936
## 237782 Toronto     M5B 1W2 43.65659 -79.37936
## 237783 Toronto     M5B 1W2 43.65659 -79.37936
## 237784 Toronto     M5B 1W2 43.65659 -79.37936
## 237785 Toronto     M5B 1W2 43.65659 -79.37936
## 237786 Toronto     M5B 1W2 43.65659 -79.37936

We can also use logical arrays as filters:

inds <- c(T,T,F,T,T,T,F,F,T,T)
covid.small[ inds ,]

##        Row_ID Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 237777 237777         2020-04-09       80s   MALE                   OB Resolved
## 237778 237778         2020-04-03       80s FEMALE                   OB Resolved
## 237780 237780         2020-04-01       90s FEMALE                   OB    Fatal
## 237781 237781         2020-04-03       80s   MALE                   OB Resolved
## 237782 237782         2020-04-12       20s   MALE    NO KNOWN EPI LINK Resolved
## 237785 237785         2020-04-08       90s FEMALE                   OB Resolved
## 237786 237786         2020-04-12       30s FEMALE  MISSING INFORMATION Resolved
##           City Postal_Code Latitude Longitude
## 237777 Toronto     M5B 1W2 43.65659 -79.37936
## 237778 Toronto     M5B 1W2 43.65659 -79.37936
## 237780 Toronto     M5B 1W2 43.65659 -79.37936
## 237781 Toronto     M5B 1W2 43.65659 -79.37936
## 237782 Toronto     M5B 1W2 43.65659 -79.37936
## 237785 Toronto     M5B 1W2 43.65659 -79.37936
## 237786 Toronto     M5B 1W2 43.65659 -79.37936

Cases in Waterloo:

# covid$City == 'Waterloo'

temp <- covid[covid$City == 'Waterloo' &  covid$Outcome1 == 'Fatal',  ]
# head(temp)
nrow(temp)

## [1] 182

Cases in Toronto and Mississauga

temp <- covid[covid$City %in% c('Toronto','Mississauga'),]
# head(temp)
nrow(temp)

## [1] 122909

Cities whose name contain the phrase New:

is.New <- grepl("New",covid$City)
# head(is.New)
temp <- covid[is.New, c("Age_Group", "Gender", "Outcome1", "City")]
rmarkdown::paged_table(temp)

Tidyverse way

Tidyverse can help to process your data very fast in a few line of codes. It has lot’s of features that we will cover gradually, but for now we can start with two of them, select and filter

Let’s start with select for selecting columns:

select(covid.small, c('Age_Group','Gender','Outcome1'))

##        Age_Group Gender Outcome1
## 237777       80s   MALE Resolved
## 237778       80s FEMALE Resolved
## 237779       90s FEMALE    Fatal
## 237780       90s FEMALE    Fatal
## 237781       80s   MALE Resolved
## 237782       20s   MALE Resolved
## 237783       50s FEMALE Resolved
## 237784       40s   MALE Resolved
## 237785       90s FEMALE Resolved
## 237786       30s FEMALE Resolved

It also allowing choosing a range of columns using their names:

select(covid.small, Case_Reported_Date:Outcome1)

##        Case_Reported_Date Age_Group Gender Case_AcquisitionInfo Outcome1
## 237777         2020-04-09       80s   MALE                   OB Resolved
## 237778         2020-04-03       80s FEMALE                   OB Resolved
## 237779         2020-04-03       90s FEMALE                   OB    Fatal
## 237780         2020-04-01       90s FEMALE                   OB    Fatal
## 237781         2020-04-03       80s   MALE                   OB Resolved
## 237782         2020-04-12       20s   MALE    NO KNOWN EPI LINK Resolved
## 237783         2020-04-12       50s FEMALE    NO KNOWN EPI LINK Resolved
## 237784         2020-04-12       40s   MALE    NO KNOWN EPI LINK Resolved
## 237785         2020-04-08       90s FEMALE                   OB Resolved
## 237786         2020-04-12       30s FEMALE  MISSING INFORMATION Resolved

Using Pipes

A very handy operator comes with tidyverse package is a pipe. We use %>% sign to inject output of one operation to another. For example, we can select a couple of columns and then using the resulting table, filter for Waterloo’s cases. So, instead of

wat.cases <- select(covid, c('Age_Group','Gender','Outcome1','City'))
wat.cases <- subset(wat.cases, City == 'Waterloo')
head(wat.cases)

we can use pipes as:

wat.cases <- select(covid, c('Age_Group','Gender','Outcome1','City')) %>% 
  subset(City == 'Waterloo')
head(wat.cases)

##      Age_Group Gender Outcome1     City
## 4186       60s FEMALE Resolved Waterloo
## 4188       20s   MALE Resolved Waterloo
## 4190       40s   MALE Resolved Waterloo
## 4191       30s FEMALE Resolved Waterloo
## 4196       20s FEMALE Resolved Waterloo
## 4197       40s FEMALE Resolved Waterloo

Notice that in the second line (subset), we didn’t write the data name as the first parameter, it was injected by the pipe operation.

Creating / Manipulating Variables

To create a new variable or manipulate the existing one, you simply assign them. But there are other ways that can be more handy. We will work with the House dataset to show this:

house.smaller <- select(house,price:sqft_living,floors,zipcode:long)
head(house.smaller)

## # A tibble: 6 x 8
##     price bedrooms bathrooms sqft_living floors zipcode   lat  long
##     <dbl>    <dbl>     <dbl>       <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 2000000        3      2.75        3050    1     98040  47.5 -122.
## 2  920000        5      2.25        2730    1.5   98105  47.7 -122.
## 3  885000        4      2.5         2830    2     98105  47.7 -122.
## 4  480000        3      1           1040    1     98116  47.6 -122.
## 5  905000        4      2.5         3300    1     98040  47.6 -122.
## 6  799000        3      2.5         2140    1     98040  47.6 -122.

Let’s assume we will need a new binary variable, Multistorey, which is 1 if the house is two or more floors and 0 otherwise:

house.smaller$multistorey <- ifelse(house.smaller$floors >=2, 1, 0)
head(house.smaller)

## # A tibble: 6 x 9
##     price bedrooms bathrooms sqft_living floors zipcode   lat  long multistorey
##     <dbl>    <dbl>     <dbl>       <dbl>  <dbl>   <dbl> <dbl> <dbl>       <dbl>
## 1 2000000        3      2.75        3050    1     98040  47.5 -122.           0
## 2  920000        5      2.25        2730    1.5   98105  47.7 -122.           0
## 3  885000        4      2.5         2830    2     98105  47.7 -122.           1
## 4  480000        3      1           1040    1     98116  47.6 -122.           0
## 5  905000        4      2.5         3300    1     98040  47.6 -122.           0
## 6  799000        3      2.5         2140    1     98040  47.6 -122.           0

We could can use mutate from the tidyverse package too:

house.smaller <- mutate(house.smaller, 
               multistorey = ifelse(floors >=2, 1, 0))
head(house.smaller)

## # A tibble: 6 x 9
##     price bedrooms bathrooms sqft_living floors zipcode   lat  long multistorey
##     <dbl>    <dbl>     <dbl>       <dbl>  <dbl>   <dbl> <dbl> <dbl>       <dbl>
## 1 2000000        3      2.75        3050    1     98040  47.5 -122.           0
## 2  920000        5      2.25        2730    1.5   98105  47.7 -122.           0
## 3  885000        4      2.5         2830    2     98105  47.7 -122.           1
## 4  480000        3      1           1040    1     98116  47.6 -122.           0
## 5  905000        4      2.5         3300    1     98040  47.6 -122.           0
## 6  799000        3      2.5         2140    1     98040  47.6 -122.           0

Grouping and Summarizing

Let’s assume that we are interested in

average of house prices in different neighbourhoods (zipcode) and
number of houses in different neighbourhoods (zipcode).

We can calculate the average house prices as:

group_by(house.smaller, zipcode) %>% 
  summarise(avgPrice=mean(price))

## # A tibble: 3 x 2
##   zipcode avgPrice
##     <dbl>    <dbl>
## 1   98040 1194230.
## 2   98105  862825.
## 3   98116  618634.

and number of houses as:

group_by(house.smaller, zipcode) %>% 
  summarise(nhouses=n())

## # A tibble: 3 x 2
##   zipcode nhouses
##     <dbl>   <int>
## 1   98040     282
## 2   98105     229
## 3   98116     330

Also we can group by two variables as below:

group_by(house.smaller, zipcode, multistorey) %>% 
  summarise(nhouses=n())

## # A tibble: 6 x 3
## # Groups:   zipcode [3]
##   zipcode multistorey nhouses
##     <dbl>       <dbl>   <int>
## 1   98040           0     152
## 2   98040           1     130
## 3   98105           0     141
## 4   98105           1      88
## 5   98116           0     210
## 6   98116           1     120

Visualization

Visualizing Continuous and Non-Categorical Numeric Data

In this subsection we will try to understand the relation mainly between two continuous variables, sqft_living space of the house and its price.

Before starting let’s give some overview about plotting. Plots in ggplot2 are created by overlapping layers. The layers here refer to background, points in the plot, title and many more. To plot a graph:

You first create the space and give its data,
Put the points, lines, etc.
Add titles
Tweak label and title sizes (if necessary)
Change the layout (theme)
And so on

# 1. The space
ggplot(data = house)

As you can see it is an empty plot. We must add the points layer:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price))

Other Visuals

Boxplot

Similarly if you want a boxplot of prices you can:

ggplot(data = house) + 
  geom_boxplot(aes(x = price))

Notice for the scatter plot we used two data, x and y, and for the histogram we only used only x.

Histogram

ggplot(data = house) + 
  geom_histogram(aes(x = price))

Density

ggplot(data = house) + 
  geom_density(aes(x = price))

Histogram & Density

ggplot(data = house, aes(x = price)) + 
  geom_histogram(aes(y=..density..), fill='steelblue', alpha=.6, color='grey75') + 
  geom_density()

There are countably finite types you can plot with ggplot, including but not limited to:

line: geom_line()
point: geom_point()
barplot: geom_bar()
boxplot: geom_boxplot()
density: geom_density()
historgram: guess what
and more

Aesthetic Mapping vs. Assignment: Colours, sizes, shapes and else

There is a very important distinction for you to keep in mind:

Aesthetic mapping: aes(...) function is a mapping function. It maps the variables into shapes, sizes and colours
Assigning: Instead of mapping (dynamic), you assign single colour, size etc. to the points.

Let’s begin with Aesthetic mapping. The colour can be given with colour parameter inside the aes() function:

ggplot(data = house) + 
  geom_point( aes(x = sqft_living, y = price, colour=zipcode))

On the other hand you can assign a specific colour, say firebrick, to the points by writing outside of the aes() function:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price) , colour = 'firebrick', size = 2)

Or a combination of both

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode) , size = 3)

alpha, the little touch that makes things beautiful

alpha is the transitivity of the object. For crowded data like the below it makes a lot of difference:

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode, size = floors), alpha=0.6)

Or you can also use with a couple of different options so that it gives more information about the data

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price), alpha=0.2, size =4, colour='firebrick') + 
  theme_classic()

Converting to categorical

As we talked above, the zipcode variable is not continuous but categorical but since it is a number it is read as numeric and therefore treated as a continuous variable. We must know this in advance and convert during preprocessing:

house$zipcode  <- as.factor(house$zipcode)

ggplot(data = house) + 
  geom_point(aes(x = sqft_living, y = price, colour=zipcode, size = floors), alpha=0.5)

Themes

There are some default themes available to make your plot look beautiful as it deserves. Here are some of them:

theme_bw

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_bw()

theme_classic

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_classic()

theme_dark

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_dark()

theme_light

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_light()

theme_linedraw

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_linedraw()

theme_minimal

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_minimal()

theme_void

ggplot(data = house, aes(x = sqft_living, y = price)) + 
  geom_point(size=3,alpha=.5, colour='orange') +
  theme_void()

References

Data Visualization with R

Week 2: Data Processing and Visualizing

A not very gentle introduction