1 Clustering Methods

1.1 Distance-based Methods: K-Means Clustering

The goal of the K-means algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

Given a set of observations \((x1, x2, …, xn)\), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into \(k (≤ n)\) sets \(S = {S1, S2, …, Sk}\) so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance). Formally, the objective is to find:

\[ \underset{S}{arg min}\sum_{i=1}^{n}\sum_{x \in S_{i}} \Vert {x - \mu_{i}}\Vert ^{2} \]

The standard algorithm for K-Means:

  • Initialization: randomly chooses k observations from the data set and uses these as the initial means;

  • Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance, this is intuitively the “nearest” mean;

  • Update step: Calculate the new means to be the centroids of the observations in the new clusters.