Lab 2: Distributional Properties

and how to compare them



Introduction

In this tutorial you will learn how to derive basic insights from data and visualize joint probabilities. In particular, we will visualize distributions, both discrete and continuous, both univariate and bivariate. By the end of the tutorial you will be able to:

  • draw and compare two or more distributions,
  • draw histograms and densities,
  • draw box, violin and jitter plots,
  • interpret these visuals.

Getting Started

In this lab we will again visualize the House Prices dataset, once used in a Kaggle challenge, which consists of prices and some properties of houses in Washington. To follow along you need:

  • the house_subset.csv data uploaded on LEARN, and
  • the tidyverse, GGally, ggridges and gridExtra packages (install them if not yet installed).

library('tidyverse') 
house <- read.csv('house_subset.csv')

If the tidyverse installation was unsuccessful, you can load ggplot2 on its own for most of this lab (the group_by() and summarize() steps further down also need dplyr):

library('ggplot2') 
house <- read.csv('house_subset.csv')

Recall - House Data

dim(house)
## [1] 7633   22
str(house)
## 'data.frame':    7633 obs. of  22 variables:
##  $ X            : int  2 6 7 16 17 20 22 23 27 31 ...
##  $ id           : num  6.41e+09 7.24e+09 1.32e+09 9.30e+09 1.88e+09 ...
##  $ date         : chr  "20141209T000000" "20140512T000000" "20140627T000000" "20150124T000000" ...
##  $ price        : num  538000 1225000 257500 650000 395000 ...
##  $ bedrooms     : int  3 4 3 4 3 3 3 5 3 3 ...
##  $ bathrooms    : num  2.25 4.5 2.25 3 2 1 2.75 2.5 1.75 2.5 ...
##  $ sqft_living  : int  2570 5420 1715 2950 1890 1250 3050 2270 2450 2320 ...
##  $ sqft_lot     : int  7242 101930 6819 5000 14040 9774 44867 6300 2691 3980 ...
##  $ floors       : num  2 1 2 2 2 1 1 2 2 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 3 0 0 4 0 0 0 ...
##  $ condition    : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ grade        : int  7 11 7 9 7 7 9 8 8 8 ...
##  $ sqft_above   : int  2170 3890 1715 1980 1890 1250 2330 2270 1750 2320 ...
##  $ sqft_basement: int  400 1530 0 970 0 0 720 0 700 0 ...
##  $ yr_built     : int  1951 2001 1995 1979 1994 1969 1968 1995 1915 2003 ...
##  $ yr_renovated : int  1991 0 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98125 98053 98003 98126 98019 98003 98040 98092 98119 98027 ...
##  $ lat          : num  47.7 47.7 47.3 47.6 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1690 4760 2238 2140 1890 1280 4110 2240 1760 2580 ...
##  $ sqft_lot15   : int  7639 101930 6819 4000 14018 8850 20336 7005 3573 3980 ...

The output above summarizes the data. There are 7633 houses in the dataset and 22 columns of information. Apart from the price of the house, there is information about the number of bathrooms, the square footage of the living space, the number of floors, whether or not the house is on the waterfront, its location and so on.

Below we show the unique zipcodes, which tell us how many neighbourhoods are included in the dataset:

unique(house$zipcode)
##  [1] 98125 98053 98003 98126 98019 98040 98092 98119 98027 98001 98070 98105
## [13] 98042 98059 98122 98075 98116 98032 98065 98109 98155 98011 98106 98072
## [25] 98055

We know that some variables are not continuous but are stored as if they were. So we must tell R to treat them as categorical instead:

house$waterfront <- as.factor(house$waterfront)
house$view       <- as.factor(house$view)
house$zipcode    <- as.factor(house$zipcode)
house$floors     <- as.factor(house$floors)
house$bedrooms   <- as.factor(house$bedrooms)
house$bathrooms  <- as.factor(house$bathrooms)
class(house$zipcode)
## [1] "factor"
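
A minimal sketch of why the conversion matters, using a small made-up vector rather than the house data (the name view_codes is illustrative only). As a number, R summarizes the codes with a mean and quartiles; as a factor, it counts how many observations fall in each category:

```r
# Hypothetical toy codes standing in for a column such as `view`
view_codes <- c(0, 0, 3, 0, 4, 0, 3)

summary(view_codes)          # numeric: min, quartiles, mean, max

view_factor <- as.factor(view_codes)
levels(view_factor)          # the distinct categories, stored as text labels
table(view_factor)           # counts per category
nlevels(view_factor)         # number of categories
```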

Univariate Distributions

There are two types of univariate distributions: numeric (continuous) and discrete. We will start with the univariate numeric case.

There are a number of visuals that you can play with:

  • Histograms and densities
  • Boxplots, violinplots and jitter plots
  • A combination of these

Histograms, Densities and Boxplots

Histograms can tell us about the distribution of the data at hand. You can see below that the normal distribution has a signature bell shape, so if a plot of our data shows a similar shape, we can say it seems normally distributed. Let’s recall the main statistics of the normal distribution and how it looks through the lens of a histogram and a boxplot:

set.seed(156)
normal <- data.frame(x = rnorm(10000))
summary(normal)
##        x           
##  Min.   :-3.87037  
##  1st Qu.:-0.66184  
##  Median : 0.02372  
##  Mean   : 0.01571  
##  3rd Qu.: 0.69615  
##  Max.   : 3.80205
# sort(normal$x)

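The standard normal has its mean and median at 0. It is a symmetric distribution, so the median is at the centre. Besides, the first quartile (the 25th percentile) and third quartile (the 75th percentile) are equally far from the median, at about -0.67 and 0.67 in theory (-0.66 and 0.70 in the sample above).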
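
As a quick check, the theoretical quartiles of the standard normal can be obtained with base R's qnorm() and set against a simulated sample. This is only a sketch; the seed and sample size mirror the chunk above, and the object names are illustrative:

```r
# Theoretical quartiles of the standard normal distribution
theory <- qnorm(c(0.25, 0.50, 0.75))
theory

# Sample quartiles from a simulated standard normal sample
set.seed(156)
z <- rnorm(10000)
sample_q <- quantile(z, probs = c(0.25, 0.50, 0.75))
sample_q

# Symmetry: the two theoretical quartiles are equally far from 0
all.equal(-theory[1], theory[3])
```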

library('gridExtra') ## Needed for side by side plots

# coord_cartesian() zooms the x axis without dropping data. Calling xlim()
# and then scale_x_continuous() would let the second scale replace the first.
p1 <- ggplot(normal, aes(x = x)) + 
  geom_histogram(fill = "firebrick", color = 'white', alpha = .7, bins = 30) + 
  theme_classic() + 
  theme(axis.title = element_blank(), axis.text.y = element_blank()) + 
  scale_x_continuous(n.breaks = 13) +
  coord_cartesian(xlim = c(-4, 4)) +
  ggtitle('Normal Distribution')

p2 <- ggplot(normal, aes(x = x)) + 
  geom_boxplot(outlier.colour = 'firebrick') + 
  theme_classic() + 
  theme(axis.title = element_blank(), axis.text.y = element_blank()) +
  scale_x_continuous(n.breaks = 13) +
  coord_cartesian(xlim = c(-4, 4))

p3 <- ggplot(house, aes(x = price)) + 
  geom_histogram(fill = adjustcolor('firebrick', .7), color = 'white', bins = 30) + 
  theme_classic() + 
  theme(axis.title = element_blank(), axis.text.y = element_blank()) +
  scale_x_continuous(n.breaks = 12) +
  coord_cartesian(xlim = c(0, 6000000)) +
  ggtitle('Distribution of Price')

p4 <- ggplot(house, aes(x = price)) + 
  geom_boxplot(outlier.colour = 'firebrick') + 
  theme_classic() + 
  theme(axis.title = element_blank(), axis.text.y = element_blank()) +
  scale_x_continuous(n.breaks = 12) +
  coord_cartesian(xlim = c(0, 6000000))

grid.arrange(p1, p2, p3, p4, ncol = 1)

Price is clearly not normally distributed: it is right-skewed. The boxplot also flags a large number of points as outliers, although its whisker rule (1.5 times the interquartile range beyond the box) tends to over-flag points when the data is skewed.

The price data has its peak around 450k, and the boxplot puts the median close to the same value. The boxplot also shows the 1st quartile a little above 300k and the 3rd quartile a little above 600k.
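
The boxplot's outlier rule can be reproduced by hand. By default geom_boxplot() extends the whiskers at most 1.5 times the interquartile range from the box and marks anything beyond as an outlier. The sketch below applies that rule to a simulated right-skewed sample (log-normal draws standing in for prices, not the house data), so the names and numbers are illustrative only:

```r
set.seed(1)
x <- rlnorm(10000)                    # simulated right-skewed data

q   <- quantile(x, c(0.25, 0.75))     # edges of the box
iqr <- q[2] - q[1]                    # interquartile range
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

n_low  <- sum(x < lower_fence)        # points flagged on the left
n_high <- sum(x > upper_fence)        # points flagged on the right
c(low = n_low, high = n_high)
```

With skewed data the flagged points fall on the long-tail side, which is why a skewed variable can show many "outliers" that are simply part of its tail.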

Transforming Data

In many cases your data will not be normally distributed even though the method you want to use needs it to be. This is a very common problem and we will revisit it later. One way to make your data look more normal is to transform it with the log or sqrt functions.

p1 <- ggplot(house, aes(x=price)) + 
  geom_histogram(fill = adjustcolor('darkolivegreen2',0.7), color='white', bins=30) + 
  ggtitle('Histogram of Price') + theme_light()

p2 <- ggplot(house, aes(x=sqrt(price))) + 
  geom_histogram(fill = adjustcolor('steelblue',0.7), color='white', bins=30) + 
  ggtitle('Histogram of SQRT Price') + theme_light()

p3 <- ggplot(house, aes(x=log(price))) + 
  geom_histogram(fill = adjustcolor('firebrick',0.7), color='white', bins=30) + 
  ggtitle('Histogram of LOG Price') + theme_light()

grid.arrange(p1, p2, p3, ncol=3)

Of the three, the log transformation gives the shape closest to the bell curve we want to see.
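
One rough way to put a number on "looks more normal" is sample skewness, which is near 0 for symmetric data and positive for right-skewed data. The helper skewness() below is written here for illustration (it is not from a package loaded in this lab), and the data are simulated log-normal draws rather than the house prices:

```r
# Simple moment-based skewness; illustrative helper, not a library function
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(2)
x <- rlnorm(5000)          # simulated right-skewed stand-in for price

skew_raw  <- skewness(x)
skew_sqrt <- skewness(sqrt(x))
skew_log  <- skewness(log(x))
round(c(raw = skew_raw, sqrt = skew_sqrt, log = skew_log), 2)
```

On data like this the square root reduces the skew and the log removes most of it, which matches the visual impression from the three histograms.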

Comparing Univariate Distributions

Price is a continuous variable, and from the plots above we can say that it is not normally distributed. But this may not hold across the whole dataset: there may be a factor that changes the picture.

Let’s explore price further using histograms, but this time show the density (proportions) rather than the counts on the y axis:

ggplot(house, aes(x=price, y=after_stat(density))) + 
  geom_histogram(fill = adjustcolor('firebrick',0.7), color='white', bins=30) + 
  # geom_density(fill="firebrick", alpha=.2) + 
  ggtitle('Histogram of Price') + theme_light()
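
On the density scale the bar heights are rescaled so that the total bar area is 1, which makes groups of different sizes comparable. A small base R sketch of that rescaling on simulated data (the object names are illustrative):

```r
set.seed(3)
x <- rlnorm(2000)

h <- hist(x, breaks = 30, plot = FALSE)   # bin the data without drawing
bin_width <- diff(h$breaks)

# density = count / (n * bin width), so the bar areas add up to 1
manual_density <- h$counts / (sum(h$counts) * bin_width)
total_area <- sum(h$density * bin_width)

all.equal(manual_density, h$density)
total_area
```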

Overlaying Histograms

Houses are not equally expensive everywhere. For example, to compare particular regions such as 98040, 98105 and 98116, we can subset the data to these regions and overlay their histograms:

ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) + 
  geom_histogram(alpha=.5, colour='white', position='identity', bins = 20) + 
  theme_bw()

Roughly 3.5 bins cover the range from 0 to 1,000,000. The histograms of 98116 and 98105 pile up between 300K and 600K, and 98116 has more houses in this range. The red bins rise above the others at higher prices, which means that 98040 is more expensive overall.

We may wonder whether or not waterfront matters:

ggplot(house, aes(x=price, y=after_stat(density), fill=waterfront)) + 
  geom_histogram(alpha=.5, colour='white', bins=20, position = 'identity')

Waterfront houses are likewise far more expensive than the others.

Overlaying Densities

Another way to visualize the data is to use density plots. Here the density is a kernel estimate of the probability density function, computed from the data:

ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) + 
  geom_density(alpha=.6) + 
  theme_bw()

Similarly we can ask if waterfront houses are more expensive:

ggplot(house, aes(x=price, fill=waterfront)) + 
  geom_density(alpha=.6)
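
Kernel density estimates of this kind are available in base R through density(). As a sketch on simulated data (a stand-in for log prices, not the house data), the estimated curve should enclose an area close to 1, like any density; the area is approximated here with a simple Riemann sum:

```r
set.seed(4)
x <- rnorm(1000, mean = 13, sd = 0.5)   # simulated stand-in for log(price)

d <- density(x)                 # Gaussian kernel, default bandwidth
step <- d$x[2] - d$x[1]         # the grid in d$x is equally spaced
area <- sum(d$y) * step         # approximate area under the curve
area

d$bw                            # the bandwidth that was chosen
```

A larger bandwidth gives a smoother curve and a smaller one a bumpier curve, much as the number of bins does for a histogram.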

Facet Wrap

Facet wraps are powerful for comparing a few distributions. Let’s compare 98040, 98105 and 98116 using facet wraps:

ggplot(subset(house, zipcode %in% c('98040', '98105', '98116')), aes(x=price, fill=zipcode)) + 
  geom_histogram(alpha=.5, colour='white') + 
  facet_wrap( ~ zipcode, ncol = 1) + 
  theme_bw()

98040 spans a wider range of prices, whereas 98116 is relatively cheap. Both 98116 and 98105 peak in the 400K to 600K range, so most of their houses fall there. The distribution of 98040, in contrast, piles up around 800K to 1 million.

Back to Back Plot

Overlaying works well when comparing two distributions, but a back-to-back plot can be even better. It is admittedly not easy to draw. We will do it by

  • plotting the two groups separately by subsetting the data
  • flipping the direction of one group by multiplying by -1, as in -after_stat(count) or -after_stat(density) (written -..count.. or -..density.. in older ggplot2 code)

Let’s compare houses in zipcode == 98040 and zipcode == 98105:

ggplot() + 
  geom_histogram(data= subset(house, zipcode == '98040'), aes(x = log(price), y = after_stat(count)), fill="firebrick", colour='white') + 
  geom_histogram(data= subset(house, zipcode == '98105'), aes(x = log(price), y = -after_stat(count)), fill= "steelblue", colour='white')

We can see that both are expensive regions, but houses in 98105 are cheaper overall: the red data (98040) piles up around 13.60, so its median is likely close to that value, whereas the blue data (98105) piles up around 13.20.
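
The mirroring idea can be seen without any plotting: count both groups on the same breaks and flip the sign of one set of counts. The sketch below uses base R and simulated data; the groups a and b are placeholders, not the two zipcodes:

```r
set.seed(5)
a <- rnorm(300, mean = 13.6, sd = 0.4)   # placeholder for one group's log prices
b <- rnorm(200, mean = 13.2, sd = 0.4)   # placeholder for the other group

# Shared breaks covering both groups
breaks <- seq(floor(min(a, b)), ceiling(max(a, b)), by = 0.25)

count_a <- hist(a, breaks = breaks, plot = FALSE)$counts
count_b <- hist(b, breaks = breaks, plot = FALSE)$counts

mirrored <- data.frame(
  mid  = head(breaks, -1) + 0.125,       # bin midpoints
  up   = count_a,                        # drawn above zero
  down = -count_b                        # drawn below zero
)
head(mirrored)
```

A table like this could be drawn with two bar layers, one per column, which gives the same picture as negating the computed count inside aes().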

The human eye is very good at catching asymmetry. Guess why? So a better practice is to flip the coordinates:

ggplot() + 
  geom_histogram(data= subset(house, zipcode == '98040'), aes(y = log(price), x = after_stat(count)), fill="firebrick", colour='white') + 
  geom_histogram(data= subset(house, zipcode == '98105'), aes(y = log(price), x = -after_stat(count)), fill= "steelblue", colour='white') +
  # 98040 (red) is drawn with positive counts, so its label goes on the right
  annotate('text', x = 20, y = 15.25, label='Mercer Island') +
  annotate('text', x = -20, y = 15.25, label='University District') +
  ggtitle('Comparison of LOG prices') + 
  theme_minimal()

Comparing many distributions

Density

ggplot(house, aes(x=log(price), fill=bedrooms)) + 
  geom_density(aes(y=after_stat(density)),alpha=.3) + 
  theme_minimal()
## Warning: Groups with fewer than two data points have been dropped.

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf


Histogram

ggplot(house, aes(x=log(price), fill=bedrooms)) + 
  geom_histogram(aes(y=after_stat(density)),alpha=.6) + 
  theme_minimal()

Reminds me of a song.

It is interesting to see how many houses have 9 bedrooms! The density plot is a better choice than the histogram when comparing more than three series.

Faceting

As you can see above, it becomes harder to compare the distributions precisely as their number grows. One way to deal with this problem is to use facet_wrap. Let’s see whether the distribution differs between neighbourhoods:

ggplot(house, aes(x=log(price), colour = zipcode)) + 
  geom_histogram() + 
  ggtitle('Histograms of LOG Price by Zipcode') +
  facet_wrap( ~zipcode, ncol=5) + # try adding scales = 'free' 
  theme_minimal() + 
  theme(legend.position = 'none')

Taking logs can pull the data toward a more normal shape. In many neighbourhoods the log prices look approximately normally distributed.

Boxplots

The boxplot is one solution to the problem above. It summarizes the important features of a distribution in a condensed form, so that we can compare many series at once. Take, for example, the boxplot of LOG prices:

ggplot(house, aes(x=log(price))) + 
  geom_boxplot(outlier.colour="firebrick",outlier.size = 3,outlier.alpha = .5) + 
  ggtitle('Boxplot of LOG Price') + 
  theme_light()

Can you see the five summary values and the outliers above?

summary(log(house$price))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.35   12.67   12.99   13.03   13.35   15.48

The leftmost red point is the minimum (11.35) and the rightmost red point is the maximum (15.48). The median is slightly to the left of 13, and the 1st quartile (12.67) and 3rd quartile (13.35) are the left and right edges of the box.
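
Because log() is increasing, the quantiles of log(price) map back to quantiles of price with exp() (up to interpolation between neighbouring data points), which is a handy way to read the box in the original units. The sketch below back-transforms the rounded values printed above, so the results are approximate. The mean is left out on purpose: exp() of the mean log is the geometric mean, not the mean price.

```r
# Rounded summary of log(price), copied from the output above (mean omitted)
log_summary <- c(min = 11.35, q1 = 12.67, median = 12.99, q3 = 13.35, max = 15.48)

# Back to the price scale (approximate, since the inputs are rounded)
price_summary <- exp(log_summary)
round(price_summary, -3)

# Monotone transforms preserve the median of an odd-sized sample
set.seed(6)
x <- rlnorm(101)
all.equal(log(median(x)), median(log(x)))
```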

The boxplot is very powerful for summarizing important information about many groups in a single frame:

ggplot(house, aes(y = zipcode, x=price, color = zipcode)) + 
  geom_boxplot() + 
  ggtitle('Boxplot of Price') +
  theme_light() +
  theme(legend.position = 'none')

Which neighbourhood seems the most expensive? In the neighbourhood with zipcode 98040, the median house price is higher than in any other neighbourhood, and the most expensive house in our dataset is also located there.

Do you want to see where 98040 is?

A better practice is to sort the boxplots by their median values. This can be done by giving an order to the zipcode levels. R read the zipcodes as numbers and we converted them to a factor earlier, but its levels are still in zipcode order. So we overwrite zipcode with a factor whose levels are ordered by median price:

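# median price per zipcode
medians  <- group_by(house, zipcode) %>% summarize(Medians = median(price))
ord <- order(medians$Medians)  # row order from cheapest to most expensive
# rebuild the factor with its levels sorted by median price
house$zipcode <- factor(house$zipcode, levels = as.character(medians$zipcode[ord]))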

ggplot(house, aes(y = zipcode, x=price, color = zipcode)) + 
  geom_boxplot() + 
  ggtitle('Boxplot of Price') +
  theme_light() + 
  theme(legend.position = 'none')
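
Base R's reorder() offers a more compact way to get the same kind of ordering. This is an alternative sketch on a toy data frame (the names toy and zip are made up), not the approach used above:

```r
toy <- data.frame(
  zip   = factor(c("A", "A", "B", "B", "C", "C")),
  price = c(5, 7, 1, 3, 9, 11)
)

levels(toy$zip)                                   # alphabetical by default

# Reorder the levels by each group's median price (ascending)
toy$zip <- reorder(toy$zip, toy$price, FUN = median)
levels(toy$zip)
```

In a ggplot call the same expression can go straight into the aesthetic, for example aes(y = reorder(zipcode, price, FUN = median)), which leaves the original column untouched.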

Jitter Plots, Violin Plots and more

Boxplots are used a lot in practice, but as discussed above they lose some information. Another problem is that a boxplot tends to flag more points as outliers than there really are. Below are several options to overcome these problems:

Jitter

ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
  geom_jitter(alpha=.6) +
  theme_light() +
  theme(legend.position = 'none')

Jitter & Boxplot

ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
  geom_jitter(alpha=.6) +
  geom_boxplot(outlier.alpha=0, colour="black",fill="grey", alpha=.3) + 
  theme_light() +
  theme(legend.position = 'none')

Density Ridges

library('ggridges')
ggplot(house, aes(y=zipcode, x=log(price), fill=zipcode)) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = 'none')

Violin

ggplot(house, aes(y=zipcode, x=log(price), color=zipcode)) +
  geom_violin(fill=adjustcolor('grey75',0.7)) +
  theme_light() +
  theme(legend.position = 'none')

Cleveland Dot

No worries, we will learn about this plot later.

medians <- group_by(house, zipcode) %>% summarise(price=median(price))
ggplot(medians, aes(y=zipcode)) +
  geom_point(aes(x=price), colour='firebrick') + 
  geom_segment(aes(x=200000, yend = zipcode, xend=price), colour = 'grey', alpha=.5) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme(legend.position = 'none') + 
  ggtitle('Median prices in different neighbourhoods')

The jitter plot adds information about how dense the data is in each location. For example, the zipcodes 98032, 98070 and 98109 have far fewer houses than 98059. This information is missing from the boxplot. But the jitter plot lacks important summaries such as the median and quartiles; overlaying a boxplot on the jitter adds them back.

Violin plots can also give us this: they replace the boxes with symmetric density plots.
