K-means clustering is a popular machine-learning algorithm used in exploratory data analysis to find hidden patterns or groupings in data. Although it is straightforward to implement, the quality of its results can be hard to assess, because there are usually no ground-truth labels to compare against. This tutorial shows how to evaluate K-means clustering in Python.
Step 1: Install Necessary Libraries
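To begin, we need the following Python libraries: pandas, NumPy, scikit-learn, and Matplotlib. If these libraries aren’t installed yet, you can install them with pip as follows (the leading ! runs the command from a Jupyter notebook cell; omit it when typing the command in a terminal):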
    !pip install pandas numpy scikit-learn matplotlib
Step 2: Loading the Data
The next step is to import the libraries we need and load our dataset. In this tutorial, we’ll use the Iris dataset, a multivariate data set introduced by Sir Ronald Fisher. It contains 150 samples from three iris species, each described by four measurements.
    import pandas as pd
    from sklearn import datasets

    # Load the Iris data and wrap the feature matrix in a DataFrame
    data = datasets.load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)
Step 3: Apply the K-Means Algorithm
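After loading the data, we’ll run the K-Means clustering algorithm on it. For this, we’ll use the KMeans class from the sklearn.cluster module. We ask for three clusters with n_clusters=3 and fix random_state so that the result is reproducible from run to run.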
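    from sklearn.cluster import KMeans

    # Fit K-means with three clusters; random_state fixes the initialization
    kmeans = KMeans(n_clusters=3, random_state=0).fit(df)
    # Cluster index assigned to each sample (displayed automatically in a notebook)
    kmeans.labels_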
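Before moving on to metrics, it can help to look at what the fitted model actually contains. The following is a minimal sketch that repeats the fit from this step and then prints how many samples landed in each cluster and where the centroids sit. The variable names cluster_sizes and centers are illustrative and not part of the original tutorial.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Same fit as in the tutorial step above
kmeans = KMeans(n_clusters=3, random_state=0).fit(df)

# Number of samples assigned to each cluster label (0, 1, 2)
cluster_sizes = np.bincount(kmeans.labels_, minlength=3)
print("Cluster sizes:", cluster_sizes)

# One row per centroid, one column per feature
centers = pd.DataFrame(kmeans.cluster_centers_, columns=data.feature_names)
print(centers.round(2))
```

Note that the cluster indices are arbitrary: label 0 does not correspond to any particular species, and the numbering may differ between scikit-learn versions or random seeds.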
Step 4: Evaluating the Model
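To assess the performance of the model, we’ll use two metrics: inertia and the silhouette score. Inertia is the sum of squared distances from each sample to the centroid of its assigned cluster, so it measures how internally coherent the clusters are; lower values mean tighter clusters. The silhouette score compares, for each data point, its mean distance to the other points in its own cluster with its mean distance to the points in the nearest neighboring cluster. It ranges from -1 to 1, and values closer to 1 indicate compact, well-separated clusters.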
    from sklearn import metrics

    # Within-cluster sum of squared distances to the centroids
    print("Inertia: ", kmeans.inertia_)
    # Mean silhouette coefficient over all samples
    print("Silhouette Coefficients: ", metrics.silhouette_score(df, kmeans.labels_, metric='euclidean'))
Output:
    Inertia: 78.851441426146
    Silhouette Coefficients: 0.5528190123564091
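To make the two metrics less of a black box, here is a rough sketch that recomputes them by hand. It assumes the same Iris data and KMeans settings as above. The names X, assigned_centers, manual_inertia, and per_sample are illustrative; silhouette_samples is the scikit-learn function that returns one coefficient per data point.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = datasets.load_iris().data
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
labels = kmeans.labels_

# Inertia by hand: squared Euclidean distance from every sample
# to the centroid of the cluster it was assigned to, summed up.
assigned_centers = kmeans.cluster_centers_[labels]
manual_inertia = float(np.sum((X - assigned_centers) ** 2))
print("Manual inertia: ", manual_inertia)
print("kmeans.inertia_:", kmeans.inertia_)

# silhouette_score is the mean of the per-sample silhouette coefficients.
per_sample = silhouette_samples(X, labels, metric="euclidean")
print("Mean of per-sample values:", per_sample.mean())
print("silhouette_score:         ", silhouette_score(X, labels, metric="euclidean"))

# Samples with a negative coefficient sit closer, on average,
# to a neighboring cluster than to their own.
print("Samples with negative silhouette:", int((per_sample < 0).sum()))
```

The two inertia values should agree up to floating-point rounding, and likewise the two silhouette values. Looking at the per-sample coefficients rather than only their mean can point to individual samples that may have been assigned to the wrong cluster.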
Full Code:
    import pandas as pd
    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn import metrics

    # Load the Iris data into a DataFrame
    data = datasets.load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)

    # Fit K-means with three clusters and show the assigned labels
    kmeans = KMeans(n_clusters=3, random_state=0).fit(df)
    print(kmeans.labels_)

    # Evaluate the clustering
    print("Inertia: ", kmeans.inertia_)
    print("Silhouette Coefficients: ", metrics.silhouette_score(df, kmeans.labels_, metric='euclidean'))
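A single inertia or silhouette value is easier to interpret when it is set against alternatives. As a hedged illustration, the sketch below refits K-means for several values of k and prints both metrics side by side. The range 2 to 6 and the names results, model, and score are assumptions chosen for this example, not part of the original tutorial.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

results = []
# The silhouette score needs at least 2 clusters; the upper bound of 6
# is an arbitrary choice for illustration.
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_, metric="euclidean")
    results.append((k, model.inertia_, score))
    print(f"k={k}  inertia={model.inertia_:.2f}  silhouette={score:.3f}")
```

Inertia typically keeps falling as k grows, so it cannot simply be minimized; a common heuristic is to look for the point where the decrease levels off. The silhouette score does not have this built-in bias, which is one reason to read the two metrics together.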
Conclusion
Evaluating K-means clustering isn’t always straightforward, but with metrics such as inertia and the silhouette score you can assess how well the algorithm has grouped your data in Python. A well-evaluated clustering is more likely to reveal genuine patterns or trends in the data, which in turn supports better insights and decision-making.