---
title: "Tutorial 12: Unsupervised Clustering Analysis"
author: "[Simon & Zebin]"
date: "4/24/2018 & 4/27/2018"
header-includes:
- \usepackage{amsmath}
output:
html_document:
fig_width: 10
fig_height: 6
highlight: tango
number_sections: yes
theme: paper
toc: yes
toc_depth: 3
html_notebook: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r echo=FALSE, result='hide'}
setwd("E:\\HKU data\\Courses\\Stat_3612\\Tutorial2018\\T12")
```
# Clustering Methods
## Distance-based Methods: K-Means Clustering
The goal of the K-means algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k\ (\leq n)$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance). Formally, the objective is to find:
$$
\underset{S}{\arg\min}\sum_{i=1}^{k}\sum_{x \in S_{i}} \Vert {x - \mu_{i}}\Vert ^{2}
$$
where $\mu_{i}$ is the mean of the points in $S_{i}$.
The standard algorithm for K-Means:
* Initialization: randomly chooses k observations from the data set and uses these as the initial means;
* Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; this is intuitively the "nearest" mean;
* Update step: Calculate the new means to be the centroids of the observations in the new clusters.
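The three steps can be sketched directly in base R. The snippet below is a minimal illustration on synthetic 2-D data; all variable names are our own, not the tutorial's:

```r
set.seed(42)
# synthetic data: two well-separated 2-D blobs
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
k <- 2
# Initialization: randomly choose k observations as the initial means
centers <- x[sample(nrow(x), k), ]
for (iter in 1:10) {
  # Assignment step: attach each point to the nearest mean
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # Update step: recompute each mean as the centroid of its cluster
  centers <- t(sapply(1:k, function(j)
    colMeans(x[cluster == j, , drop = FALSE])))
}
table(cluster)
```

The built-in `kmeans` function (used later in this tutorial) implements the same idea with better initialization and convergence checks.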
```{r echo=FALSE, fig.align="center", out.width = 300}
knitr::include_graphics(c('K-means_convergence.gif'))
```
The results of the K-means clustering algorithm are:
1. The centroids of the K clusters.
2. Labels for the training data (each data point is assigned to a single cluster).
(Reference:
1. https://images0.cnblogs.com/blog/573996/201311/15181807-bdd3782c84f249c5bef85378d372d3cd.png
2. https://en.wikipedia.org/wiki/K-means_clustering)
## Hierarchy-based Methods: Hierarchical Clustering
Hierarchical clustering (either agglomerative or divisive) is an iterative algorithm for finding groups in the data. Agglomerative hierarchical clustering works as follows:
1. Start with each data point as its own cluster.
2. Identify the two closest clusters and merge them into one.
3. Repeat the previous step until all data points are in a single cluster.
The result is usually represented by a dendrogram-like structure.
There are a few ways to determine how close two clusters are:
1. Complete linkage clustering: Find the maximum possible distance between points belonging to two different clusters.
2. Single linkage clustering: Find the minimum possible distance between points belonging to two different clusters.
3. Mean linkage clustering: Find all possible pairwise distances for points belonging to two different clusters and then calculate the average.
4. Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between centroids of two clusters.
The default linkage method for the `hclust` function is complete linkage.
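The four linkage rules correspond directly to the `method` argument of `hclust`. A minimal comparison on toy data (our own example, not the tutorial's data set):

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 random points in 2-D
d <- dist(x)                       # pairwise Euclidean distances
# same dissimilarities, four different linkage criteria
hc_complete <- hclust(d, method = "complete")  # the default
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")   # mean linkage
hc_centroid <- hclust(d, method = "centroid")  # conventionally used with squared distances
par(mfrow = c(2, 2))
plot(hc_complete); plot(hc_single); plot(hc_average); plot(hc_centroid)
```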
```{r echo=FALSE, fig.align="center", out.width = "600px", eval=T}
knitr::include_graphics(c('HierarchicalClustering.png'))
```
(Reference:
1. https://en.wikipedia.org/wiki/Hierarchical_clustering
2. http://quantdare.com/wp-content/uploads/2016/06/AggloDivHierarClustering-800x389.png)
## Density-based Methods: DBSCAN
In density-based spatial clustering of applications with noise (DBSCAN), clusters are dense regions in the data space, separated by regions of lower density of points. The DBSCAN algorithm is based on this intuitive notion of clusters and noise. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
* eps: the neighborhood radius. If the distance between two points is lower than or equal to this value (eps), these points are considered neighbors. The value for eps can then be chosen by using a k-distance graph, plotting the distance to the k = minPts nearest neighbor.
* minPoints: the minimum number of points required to form a dense region. For example, if we set the minPoints parameter to 5, then we need at least 5 points to form a dense region. As a rule of thumb, minPts = 2 × dim can be used, but it may be necessary to choose larger values for very large data sets.
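The k-distance heuristic for choosing eps can be sketched with the `dbscan` package (assuming it is installed; the toy data below are our own):

```r
library(dbscan)
set.seed(7)
x <- matrix(rnorm(200), ncol = 2)   # 100 toy points in 2-D
minPts <- 4                         # rule of thumb: 2 * dim
# sorted distance to each point's k-th nearest neighbor;
# the "knee" of this curve suggests a value for eps
kNNdistplot(x, k = minPts - 1)
abline(h = 0.5, col = "red", lty = 2)  # a candidate eps read off the knee
```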
```{r echo=FALSE, fig.align="center", out.width = "600px", eval=T}
knitr::include_graphics(c('DBSCAN.jpg'))
```
The DBSCAN algorithm can be abstracted into the following steps:
* 1. Find the eps-neighborhood of every point, and identify the core points, i.e. those with more than minPts neighbors.
* 2. Find the connected components of the core points on the neighbor graph, ignoring all non-core points.
* 3. Assign each non-core point to a nearby cluster if the cluster contains an eps-neighbor of that point; otherwise assign it to noise.
(Reference:
1. https://en.wikipedia.org/wiki/DBSCAN#Parameter_estimation
2. http://www.sthda.com/english/articles/30-advanced-clustering/105-dbscan-density-based-clustering-essentials/
3. https://www.quora.com/How-does-DBSCAN-algorithm-work)
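These steps are what `dbscan::dbscan` carries out internally; a minimal run on synthetic data (our own example):

```r
library(dbscan)
set.seed(3)
# two dense blobs plus a handful of scattered noise points
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(runif(10, min = -2, max = 5), ncol = 2))
res <- dbscan(x, eps = 0.5, minPts = 5)
table(res$cluster)   # cluster 0 is noise; positive labels are clusters
```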
# Example 1: Use K-means to Find the Main Colors of an Image
```{r echo=FALSE, results='hide'}
suppressMessages(library(dplyr))
suppressMessages(library(OpenImageR))
suppressMessages(library(ClusterR))
options(warn=-1)
```
Read and show the image (a 526 × 800 image with 3 RGB channels)
```{r}
img <- readImage("ColorfulBird.jpg") # Read the image
dim(img)
```
Show the image in R
```{r}
imageShow(img)
```
Convert the image to an n × 3 matrix (the 3 columns hold the R, G, and B channels)
```{r}
im2 = apply(img, 3, as.vector)
dim(im2)
```
Find the optimal number of clusters
Some commonly used criteria:
* variance_explained : the sum of the within-cluster-sum-of-squares-of-all-clusters divided by the total sum of squares
* WCSSE : the sum of the within-cluster-sum-of-squares-of-all-clusters
* dissimilarity : the average intra-cluster-dissimilarity of all clusters (the distance metric defaults to euclidean)
* silhouette : the average silhouette width of all clusters (the distance metric defaults to euclidean)
* distortion_fK : this criterion is based on the following paper, 'Selection of K in K-means clustering' (https://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf)
* AIC : the Akaike information criterion
* BIC : the Bayesian information criterion
* Adjusted_Rsquared : the adjusted R^2 statistic
```{r}
opt = Optimal_Clusters_KMeans(im2, max_clusters = 10, plot_clusters = T, verbose = F, criterion = 'distortion_fK', fK_threshold = 0.9)
```
Values above the fixed threshold (here fK_threshold = 0.90) could be recommended for clustering, so we can choose 3, 6, or 8.
Perform K-means clustering with 6 clusters
```{r}
km_rc = KMeans_rcpp(im2, clusters = 6, num_init = 2, max_iters = 100,
initializer = 'optimal_init', verbose = T)
```
The ratio of between-cluster SS to total SS: the closer it is to 1, the better the clustering.
```{r}
km_rc$between.SS_DIV_total.SS
```
The center of each cluster:
```{r}
km_rc$centroids
```
Regenerate the image using the centroids.
```{r}
getcent = km_rc$centroids
getclust = km_rc$clusters
new_im = getcent[getclust, ]
# replace each pixel with the centroid of its cluster
dim(new_im) = c(nrow(img), ncol(img), 3)
# convert back to a 3-dimensional image array
imageShow(new_im)
```
The traditional K-means function (optional).
```{r}
km <- kmeans(im2,centers=3)
getcent2 = km$centers
getclust2 = km$cluster
new_im2 = getcent2[getclust2, ]
# replace each pixel with the centroid of its cluster
dim(new_im2) = c(nrow(img), ncol(img), 3)
# convert back to a 3-dimensional image array
imageShow(new_im2)
```
# Example 2: Use Clustering Algorithms to Draw Accident Map
Kaggle provides a set of UK accident data covering 2005 to 2015 (https://www.kaggle.com/silicon99/dft-accident-data).
```{r echo=FALSE, results='hide'}
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(geosphere))
suppressMessages(library(raster))
suppressMessages(library(dbscan))
```
Read the data
```{r echo=FALSE, results='hide'}
accident_features <- readRDS("accident_features.Rds")
vehicle_accidents_spots <- readRDS("vehicle_accidents_spots.Rds")
dim(accident_features)
head(accident_features)
head(vehicle_accidents_spots)
```
Scale the data
```{r}
scaled_features <- data.frame(apply(accident_features,2,scale))
```
## Use K-Means
First, we select an optimal value of K.
```{r}
opt = Optimal_Clusters_KMeans(as.matrix(scaled_features), max_clusters = 10, plot_clusters = T, verbose = F, criterion = 'distortion_fK')
```
The results show that K = 2 or K = 6 would be acceptable. For model simplicity, we select K = 2 here.
```{r}
df_matrix <- apply(accident_features,2,scale)
km2 <- kmeans(df_matrix,centers=2)
accident_features$cluster <- km2$cluster
accident_features$long <- vehicle_accidents_spots$Longitude
accident_features$lat <- vehicle_accidents_spots$Latitude
nc <- aggregate(x = accident_features[,3],by = list(class = accident_features$cluster), FUN = mean)
# mean number of casualties within each cluster
nc <- nc[,"x"]
accident_features$cluster <- factor(accident_features$cluster)
# relabel each cluster by its rounded mean casualty count
levels(accident_features$cluster) <- round(nc,0)
```
Load UK countries' shapefile into R
```{r}
shp <- shapefile("infuse_ctry_2011")
map.df <- fortify(shp)
ggplot(map.df)+
geom_path(aes(x=long, y=lat, group=group))+
geom_point(data=accident_features, aes(x=long, y=lat, color=cluster), size=2)+
scale_color_discrete("Cluster")+
coord_fixed()
```
## Use Hierarchical Clustering
```{r}
hc <- hclust(dist(scaled_features), method='complete')
# hierarchical clustering, dendrogram
plot(hc)
```
Cut the tree into two groups
```{r}
scaled_features$clust <- cutree(hc, k=2)
scaled_features$long <- vehicle_accidents_spots$Longitude
scaled_features$lat <- vehicle_accidents_spots$Latitude
```
Relabel the clusters by their mean number of casualties
```{r}
nc <- aggregate(x = accident_features[,3],by = list(class = scaled_features$clust), FUN = mean)
nc <- nc[,"x"]
scaled_features$clust <- factor(scaled_features$clust)
levels(scaled_features$clust) <- round(nc,0)
```
Plot the accident points on a UK map colored by the assigned class
```{r}
ggplot(map.df)+
geom_path(aes(x=long, y=lat, group=group))+
geom_point(data=scaled_features, aes(x=long, y=lat, color=clust), size=2)+
scale_color_discrete("Cluster")+
coord_fixed()
```
## Use DBSCAN
As you can see, the previous two methods do not cluster well in terms of geographic location. This time, we use DBSCAN to cluster by location. First, find an appropriate value for eps: a knee in the k-distance curve marks a sharp change and suggests a suitable threshold.
```{r}
df <- vehicle_accidents_spots[,1:2]
kNNdistplot(df, k = 5)
abline(h = 1, col = "red")
```
Use DBSCAN to do clustering
```{r}
db_clusters <- dbscan(df, eps=1, minPts=5)
print(db_clusters)
df$cluster <- db_clusters$cluster
```
Plot the classified accident points on a UK map
```{r}
ggplot(map.df)+
geom_path(aes(x=long, y=lat, group=group))+
geom_point(data=df, aes(x=Longitude, y=Latitude, color=factor(cluster)), size=2)+
scale_color_discrete("Cluster")+
coord_fixed()
```