In hierarchical clustering, a dendrogram is a tree-like diagram that shows the process of merging clusters, with each cluster represented by a node. It can help you visualize the relationships between clusters and decide on an appropriate number of clusters for your data. In this tutorial, we will learn how to cut a dendrogram in Python using the SciPy library.
Step 1: Install the required libraries
First, you need to install numpy, scipy, and matplotlib if you haven’t already. You can do this using pip:

```shell
pip install numpy scipy matplotlib
```
Step 2: Import the necessary libraries
Next, you need to import the required libraries:
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
```
Step 3: Generate some sample data
For this tutorial, we will generate some random data points:
```python
np.random.seed(42)             # fix the seed so the example is reproducible
data = np.random.rand(10, 3)   # 10 points with 3 features each
```
Step 4: Perform Hierarchical Clustering
Now, we need to perform hierarchical clustering on the data using the linkage function from scipy:

```python
Z = linkage(data, 'ward')
```
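Here, we use Ward’s method as the linkage method for clustering. At each step it merges the two clusters whose union gives the smallest increase in total within-cluster variance.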
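Before plotting, it can help to look at what linkage actually returns. The sketch below rebuilds the same sample data and prints the linkage matrix row by row; the loop variable names (left, right, dist, count) are our own labels for the four columns, not names used by SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Same sample data as in the tutorial
np.random.seed(42)
data = np.random.rand(10, 3)

Z = linkage(data, 'ward')

# One row per merge: n observations produce n - 1 merges
print(Z.shape)  # (9, 4)

# Columns: index of the first cluster, index of the second cluster,
# merge distance, number of original observations in the new cluster.
# Indices below n refer to original points; n + i refers to the cluster
# created in row i.
for left, right, dist, count in Z:
    print(f"merge {int(left):2d} + {int(right):2d} "
          f"at distance {dist:.3f} -> {int(count)} points")
```

The third column holds the heights at which the dendrogram branches join, so these are the values a cut threshold is compared against.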
Step 5: Plot the dendrogram
Before cutting the dendrogram, let’s plot it first to visualize the hierarchical clustering:
```python
fig = plt.figure(figsize=(10, 5))
dn = dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("Data points")
plt.ylabel("Euclidean distances")
plt.show()
```
This code will generate the dendrogram for the given data.
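When reading the dendrogram, it can be useful to mark a candidate cut height directly on the plot. The sketch below picks the height with a common heuristic (the middle of the largest gap between consecutive merge heights), which is an assumption made for illustration and not a rule from this tutorial. The names heights, gaps, cut_height, and n_clusters are our own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Same sample data as in the tutorial
np.random.seed(42)
data = np.random.rand(10, 3)
Z = linkage(data, 'ward')

# Merge heights are the third column of Z (non-decreasing for Ward linkage)
heights = Z[:, 2]

# Heuristic (an assumption, not a rule): cut in the middle of the largest
# gap between two consecutive merge heights
gaps = np.diff(heights)
i = int(np.argmax(gaps))
cut_height = (heights[i] + heights[i + 1]) / 2

# Cutting inside gap i keeps the first i + 1 merges
n_clusters = len(data) - (i + 1)

fig, ax = plt.subplots(figsize=(10, 5))
# Links below color_threshold are colored per subtree; links above it
# share a single color
dendrogram(Z, ax=ax, color_threshold=cut_height)
ax.axhline(y=cut_height, color='gray', linestyle='--',
           label=f"cut at {cut_height:.2f}")
ax.set_title(f"Dendrogram with a candidate cut ({n_clusters} clusters)")
ax.set_xlabel("Data points")
ax.set_ylabel("Euclidean distances")
ax.legend()
plt.show()
```

The number of vertical branches the dashed line crosses equals the number of clusters that cut would produce. The largest-gap heuristic is only a starting point; domain knowledge may suggest a different height.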
Step 6: Cutting the dendrogram
You can cut the dendrogram at a specific distance or at a specific number of clusters. In this example, we will cut the dendrogram at a maximum distance of 1.5.
```python
max_distance = 1.5
clusters = fcluster(Z, max_distance, criterion='distance')
```
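Here, the fcluster function takes the linkage matrix Z, the maximum distance, and the cutting criterion as input arguments. With criterion='distance', observations end up in the same flat cluster when their cophenetic distance is no greater than the threshold, so a threshold above the highest merge in the dendrogram returns a single cluster. The function returns an array with one integer cluster label per data point.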
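The other option mentioned above, cutting at a specific number of clusters, can be sketched with the 'maxclust' criterion of fcluster. The value 3 and the names n_wanted and clusters_by_count are arbitrary choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same sample data as in the tutorial
np.random.seed(42)
data = np.random.rand(10, 3)
Z = linkage(data, 'ward')

# Ask for at most 3 flat clusters instead of giving a distance threshold
# (3 is an arbitrary value chosen for illustration)
n_wanted = 3
clusters_by_count = fcluster(Z, t=n_wanted, criterion='maxclust')

# One integer label per data point
print(clusters_by_count)

# Number of clusters actually formed (never more than n_wanted)
print(len(np.unique(clusters_by_count)))
```

With 'maxclust', SciPy chooses the cut height for you, which is convenient when you already know how many groups you want but not the distance scale of your data.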
Step 7: Visualize the clustered data
After cutting the dendrogram, we can visualize the clustered data. Since the sample data has three features, the scatter plot shows only the first two:

```python
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis')
plt.title("Clustered Data")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
```
This will show the data points with different colors representing different clusters.
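Colors alone do not tell you how large each cluster is or which rows belong to it. As a complement to the plot, the following sketch prints a short membership summary; it repeats the tutorial's setup so it can run on its own, and the names labels, counts, and members are our own.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same setup as in the tutorial
np.random.seed(42)
data = np.random.rand(10, 3)
Z = linkage(data, 'ward')
clusters = fcluster(Z, 1.5, criterion='distance')

# fcluster labels start at 1, with one label per row of data
labels, counts = np.unique(clusters, return_counts=True)
for label, count in zip(labels, counts):
    members = np.where(clusters == label)[0]
    print(f"cluster {label}: {count} points, rows {members.tolist()}")
```

If the summary shows a single cluster containing every row, the threshold lies above the top merge of the dendrogram and a lower value is needed to split the data.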
Full code
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

# Generate reproducible sample data: 10 points with 3 features
np.random.seed(42)
data = np.random.rand(10, 3)

# Hierarchical clustering with Ward's method
Z = linkage(data, 'ward')

# Plot the dendrogram
fig = plt.figure(figsize=(10, 5))
dn = dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("Data points")
plt.ylabel("Euclidean distances")
plt.show()

# Cut the dendrogram at a maximum distance
max_distance = 1.5
clusters = fcluster(Z, max_distance, criterion='distance')

# Plot the first two features, colored by cluster label
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis')
plt.title("Clustered Data")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
```
Conclusion
In this tutorial, we have learned how to cut a dendrogram in Python using the SciPy library. Cutting a dendrogram helps in deciding the number of clusters in hierarchical clustering. By following these steps, you can visualize and analyze the hierarchical clustering of your data more effectively.