Unsupervised Learning

K-means & Hierarchical Clustering


Introduction

Last week we introduced you to an unsupervised method, PCA, that uses only input variables (predictors) to come up with useful insights. But as we mentioned, it is not a learning algorithm, as it doesn't conclude anything on its own. This week's topics, K-means and hierarchical clustering, on the other hand, are unsupervised learning algorithms, also known as clustering algorithms, that return classes associated with the data points. This week we will:

  • give insight into how they work,
  • show how they work in R,
  • interpret the results using different examples.

Getting Started

To follow along, you will need

  • tidyverse
  • dendextend
  • circlize
  • gridExtra (used once, to arrange plots side by side)
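
If any of these packages are missing on your machine, you can install them from CRAN first:

install.packages(c('tidyverse', 'dendextend', 'circlize', 'gridExtra'))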

Now, call tidyverse and other packages:

library('tidyverse')
library('dendextend')
library('circlize')
theme_set(theme_minimal())

K-means Clustering

If we think about a method that doesn't require labels, one of the most intuitive approaches is to assume that points belonging to the same cluster should not fall too far from each other; this is the motivation behind k-means. The only input we need is the number of clusters, k, which in some cases is known in advance.

The algorithm

k-means is an iterative algorithm that runs for a fixed number of iterations or until it converges. We first assign the points to one of k clusters at random, then calculate each cluster's mean, then re-assign every point to the cluster whose mean it is closest to, and repeat. So, we can outline it as follows (a minimal from-scratch sketch in R appears after the steps):

  1. Choose the number of clusters, k,

  2. Assign each observation to one of the k clusters at random,

  3. Compute each cluster's centroid, that is, the mean of the points assigned to that cluster,

    • There will be k centroids (means),
  4. Re-assign each observation to one of the k clusters based on its proximity (Euclidean distance) to the means calculated in the previous step,

  5. Repeat 3 and 4 until

    • the calculated means change by less than some small tolerance (epsilon), or
    • the maximum number of iterations has been reached.
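
To make these steps concrete, here is a minimal from-scratch sketch in R. It is for illustration only: the function name my_kmeans and its arguments are our own, it stops when the assignments stop changing (a simpler convergence check than the epsilon rule above), it does not handle empty clusters, and in practice you would use the built-in kmeans() shown later.

my_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 2: assign each observation to one of the k clusters at random
  assignment <- sample(1:k, nrow(x), replace = TRUE)
  for (i in 1:max_iter) {
    # step 3: compute the k centroids (cluster means)
    centroids <- t(sapply(1:k, function(j) colMeans(x[assignment == j, , drop = FALSE])))
    # step 4: re-assign each point to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # step 5: stop once the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
  }
  list(cluster = assignment, centers = centroids)
}

my_kmeans(iris[,1:4], 3)$cluster   # should broadly match the kmeans() result below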

The algorithm in action looks like the figure below:

Iris Example

Let's demonstrate it on the iris data.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Here, we know that the dataset contains 3 types of iris flowers: setosa, versicolor and virginica. We can plot the data using the actual labels:

g1 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Width, colour=Species)) + 
  geom_point(alpha=.5, size =2)

g1

Now, let’s see what we can do with k-means clustering. We know that there are 3 _k_lusters. The rest is straightforward:

fit <- kmeans(iris[,1:4],3)
fit
## K-means clustering with 3 clusters of sizes 38, 62, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.901613    2.748387     4.393548    1.433871
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 23.87947 39.82097 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We can see that each point has been assigned to one group (the clustering vector). The report also shows the means of the clusters (the centroids) and a measure of how well the clusters separate the data (between_SS / total_SS = 88.4 %).
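
One caveat worth noting: kmeans() starts from a random assignment, so the cluster numbering (and occasionally the clustering itself) can change between runs. A common remedy, sketched below with the illustrative object name fit2, is to fix the seed and let kmeans() try several random starts, keeping the best one; we stick with the original fit for the rest of the post.

set.seed(123)                               # fix the random start for reproducibility
fit2 <- kmeans(iris[,1:4], 3, nstart = 25)  # keep the best of 25 random starts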

Now, let’s compare the results with the original classes:

clus <- as.factor(fit$cluster)
g2 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Width, colour=clus)) + 
  geom_point(alpha=.5, size =2)

gridExtra::grid.arrange(g1,g2, ncol = 2)

Not bad, not bad.

table(iris$Species, clus)
##             clus
##               1  2  3
##   setosa      0  0 50
##   versicolor  2 48  0
##   virginica  36 14  0

Setosa is recovered perfectly by the model (as cluster 3), versicolor is clustered well (48 of 50 in cluster 2), but 14 of the virginica observations end up mixed in with versicolor.
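
Note that the cluster numbers themselves are arbitrary (a re-run could label setosa as cluster 1 instead of 3). A quick, rough way to score the fit is to match each cluster to its majority species and count how many observations land in their majority-species cluster:

tab <- table(iris$Species, clus)
sum(apply(tab, 2, max)) / sum(tab)   # about 0.89 for this fit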

Hierarchical Clustering

Hierarchical clustering, by itself, is not a learning algorithm. Similar to PCA, it helps us gain insight into the dataset. However, with a small add-on function, we can use it to develop a model as well.

The motivation is similar to k-means, in that we group points based on their proximity to each other. Here, however, the hierarchical model is a pairwise method that clusters bottom-up: it (roughly) finds the pairs of points that are closest to each other, then the pairs of pairs that are closest to each other, and so on. Proximity is assessed using a notion of dissimilarity, or distance.

There are a few options here, although you will rarely need to worry about the backend of the clustering. Once we have the distances between points and have merged them into subgroups, we need one more step: comparing groups with each other. But here a question arises: which distance do we use to compare one group with another? We can compare the group averages, the closest members, the farthest members, and so on. Based on this, we have four common linkage choices (a small numeric illustration follows the list):

  • Complete (the largest pairwise distance between the two groups)
  • Single (the smallest pairwise distance)
  • Average (the mean of all pairwise distances)
  • Centroid (the distance between the group centroids, i.e. the group means)
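
To see what these rules mean in numbers, here is a tiny hand-rolled illustration with two small 1-D groups (purely illustrative; hclust() does all of this internally):

a <- c(1, 2);  b <- c(8, 9, 10)
d <- outer(a, b, function(x, y) abs(x - y))  # all pairwise distances between the groups

max(d)                   # complete linkage: farthest pair        -> 9
min(d)                   # single linkage:   closest pair         -> 6
mean(d)                  # average linkage:  mean of all pairs    -> 7.5
abs(mean(a) - mean(b))   # centroid linkage: distance of the means -> 7.5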

The algorithm doesn't work on raw data; we need to calculate the distances between the points first. We will switch datasets for the sake of presentation:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dists <- dist(mtcars[,1:7],method = "euclidean")
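
As an optional sanity check, the dist object we just created can be converted to an ordinary matrix to peek at a few of the values:

round(as.matrix(dists)[1:4, 1:4], 1)   # pairwise distances between the first four cars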

This dist object contains all the pairwise distances between the cars. Now we can build a hierarchical tree using these distances:

hc.complete <- hclust(dists, method="complete")
hc.single   <- hclust(dists, method="single")
hc.average  <- hclust(dists, method="average")
hc.centroid <- hclust(dists, method="centroid")

plot(hc.complete)

The height gives the distance between clusters. So, reading from the top, we can say there are two main clusters that are far from each other. We can then see how they split into subgroups, and so on.
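
If you want to see those two main groups marked directly on the dendrogram, base R's rect.hclust() can draw boxes around them:

plot(hc.complete)
rect.hclust(hc.complete, k = 2, border = c('red', 'blue'))   # box the 2 top-level clusters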

plot(hc.single)

plot(hc.average)

plot(hc.centroid)

Now, what if we want the machine to cut the tree into 2 groups, say fancy and less fancy cars? We can use cutree:

clus <- cutree(hc.average,  k=2)
clus
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   2 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   2                   2                   2                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   2                   2 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   2                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   2                   1                   2                   1
plot(hc.average, labels = clus)

dend  <- as.dendrogram(hc.average) %>% 
  color_branches(k=2)
plot(dend) 

dend  <- as.dendrogram(hc.average) %>% 
  color_branches(k=3)
plot(dend) 

library("circlize")

dend <- as.dendrogram(hc.average) %>% 
  color_branches(k=2) %>% 
  color_labels(k=2)

par(mar = rep(0,4))
circlize_dendrogram(dend, labels_track_height = NA, dend_track_height = .2) 

Virus Classification

Last week we visualized virus data that was extracted using a neural network technique called word embeddings. The data contains one meaning vector per virus term, each with 300 dimensions.

This week, we will use the two unsupervised learning methods we have just covered to cluster the virus terms into sensible groups.

Let's read the data first:

dat <- read.csv('plotvect.tsv', sep='\t', header = F)
rownames(dat) <- data.frame(read.csv('plotmeta.tsv', sep='\t', header=F))[,1]
rmarkdown::paged_table(head(dat))
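
As a quick check of what we just loaded (one row per virus term, each with 300 embedding dimensions):

dim(dat)   # rows = virus terms, columns = 300 embedding dimensions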

k-means

fit <- kmeans(dat, centers = 5)
cbind(sort(fit$cluster))
##                            [,1]
## variola                       1
## smallpox                      1
## monkeypox                     1
## smallpox-like                 1
## mers                          2
## rhinovirus-14                 2
## mers-cov-encoded              2
## corona-virus                  2
## mers-6hb                      2
## mers-and                      2
## sars                          2
## rhinoviruses/enteroviruses    2
## sars-cov2                     2
## corona                        2
## rhinovirus                    2
## sars-                         2
## sarscoronavirus               2
## coronavirus                   2
## mers-coronavirus              2
## 'ebola                        3
## ebola                         3
## mev                           4
## herpete                       4
## nonpoliovirus                 4
## herpes simplex                4
## herpes                        4
## zosteriform                   4
## zoster                        4
## chickenpox                    4
## tlr9-expressing               4
## measles                       4
## herpeticum                    4
## measles-induced               4
## measles-associated            4
## polio                         4
## common cold                   4
## mumps                         4
## rash                          4
## anti-varicella                4
## varicella                     4
## nonpolio                      4
## h5n6                          5
## h5n1                          5
## flu                           5
## h1n1                          5
## h1n1influenza                 5
## influenza                     5
## h5n1/04                       5

hierarchical clustering

dend <- as.dendrogram(hclust(dist(dat), method = 'ward.D'))
dend <- dend %>% 
  color_branches(k=5) %>% 
  color_labels(k=5)

par(mar = rep(0,4))
circlize_dendrogram(dend, labels_track_height = NA, dend_track_height = .2)
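
As a final sanity check, we can cross-tabulate the k-means clusters against five groups cut from the hierarchical tree (hc.groups below is just an illustrative name). The group numbers won't match, since both labelings are arbitrary, but if the two methods broadly agree most counts should concentrate in five cells:

hc.groups <- cutree(hclust(dist(dat), method = 'ward.D'), k = 5)
table(kmeans = fit$cluster, hclust = hc.groups)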