🌑

Stephen's Blog

Automatic Clustering with Silhouette Analysis on Agglomerative Hierarchical Clustering

 

Stephen Cheng

Intro

Automatic clustering algorithms are algorithms that can perform clustering without prior knowledge of data sets, and determine the optimal number of clusters even in the presence of noise and outlier points.

Silhouette Analysis

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. Thus silhouette analysis can be used to choose an optimal number for clusters automatically.

Agglomerative Hierarchical Clustering

Hierarchical clustering algorithms fall into 2 branches: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

The Steps of Hierarchical clustering:

  1. We start by treating each data point as a single cluster i.e. if there are N data points in our dataset then we have N clusters. We then select a distance metric that measures the distance between two clusters. As an example, we will use average linkage which defines the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster.

  2. On each iteration, two clusters are combined into one. The two clusters to be combined are selected as those with the smallest average linkage. I.e. according to our selected distance metric, these two clusters have the smallest distance between each other and therefore are the most similar and should be combined.

  3. Step 2 is repeated until we reach the root of the tree that one cluster contains all data points. In this way we can select how many clusters we want in the end simply by choosing when to stop combining the clusters i.e. when we stop building the tree!

Automatic Clustering

Since hierarchical clustering does not require us to specify the number of clusters and we can even select which number of clusters looks best since we are building a tree. Here we use the silhouette score to help us determine choosing the optimal number of clusters for clustering task automatically. An example of auto-clustering with silhouette analysis on agglomerative hierarchical clustering is shown as follows.

1
2
3
4
5
6
7
8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

Load and clean the data:

1
2
3
4
5
6
7
X = pd.read_csv('customer_info.csv')

# Dropping the CUST_ID column from the data
X = X.drop('CUST_ID', axis = 1)

# Handling the missing values
X.fillna(method ='ffill', inplace = True)

Preprocess the data:

1
2
3
4
5
6
7
8
9
10
# Scaling the data so that all the features become comparable
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalizing the data so that the data approximately
# follows a Gaussian distribution
X_normalized = normalize(X_scaled)

# Converting the numpy array into a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

Reduce the dimensionality:

1
2
3
4
pca = PCA(n_components = 2)
X_new = pca.fit_transform(X_normalized)
df_new = pd.DataFrame(X_new)
df_new.columns = ['P1', 'P2']

Visualize the dendograms:

1
2
3
plt.figure(figsize =(8, 8))
plt.title('Visualising Clustering')
Dendrogram = shc.dendrogram((shc.linkage(df_new, method ='ward')))

Evaluate the clustering models:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
k = range(2,10)

ac_list = [AgglomerativeClustering(n_clusters = i) for i in k]

# Appending the silhouette scores
silhouette_scores = {}
silhouette_scores.fromkeys(k)

for i,j in enumerate(k):
silhouette_scores[j] = silhouette_score(df_new,
ac_list[i].fit_predict(df_new))

# Plotting
y = list(silhouette_scores.values())
plt.bar(k, y)
plt.xlabel('Number of clusters', fontsize = 20)
plt.ylabel('S(i)', fontsize = 20)
plt.show()

From the result above, we can conclude that 3 clusters obtain the highest silhouette score so the optimal number of clusters is 3 in this case. The complete code and dataset can be found here

Reference

[1] Wikipedia-Automatic clustering algorithms
[2] https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
[3] The 5 Clustering Algorithms Data Scientists Need to Know

, — Mar 22, 2020

Search

    Made with ❤️ and ☀️ on Earth.