Learn K-means Clustering

Abdul Baasith
3 min read · Aug 12, 2021

Introduction

One of the most popular Machine Learning algorithms is K-means clustering. It is an unsupervised learning algorithm, meaning that it is used for unlabeled datasets.

The K-means clustering algorithm is an unsupervised technique that groups data according to similarity. The patterns it finds in the data are expressed as k clusters.

What is K-means Clustering?

Formal definition: K-means clustering is an iterative algorithm that partitions a group of data containing n values into k subgroups. Each of the n values belongs to the cluster with the nearest mean.
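
To make “nearest mean” concrete, here is a rough sketch of a single K-means iteration in plain NumPy (illustrative only; the walkthrough below uses scikit-learn instead): assign every point to its nearest centroid, then move each centroid to the mean of the points assigned to it. These two steps repeat until the assignments stop changing.

import numpy as np

def kmeans_step(X, centroids):
    # Assignment step: each point joins the cluster whose centroid (mean) is nearest
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of the points assigned to it
    # (for simplicity, this sketch assumes no cluster ends up empty)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids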

The objective of K-means clustering is to minimize the Euclidean distance between each point and the centroid of its cluster.
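
Written as a formula, the standard objective is the within-cluster sum of squared Euclidean distances

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the i-th cluster and \mu_i is its centroid (mean).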

Advantages of k-means

  • Relatively simple to implement.
  • Scales to large data sets.
  • Guarantees convergence.
  • Can warm-start the positions of centroids.
  • Easily adapts to new examples.
  • Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

The term ‘K’ is a number: you need to tell the system how many clusters to create. For example, K = 2 refers to two clusters. There are ways of finding the best or optimum value of K for a given dataset, such as the elbow method sketched below.
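
The elbow method works like this: fit K-means for a range of K values, plot the within-cluster sum of squares (which scikit-learn exposes as the inertia_ attribute), and pick the K at the “elbow” of the curve. Here is a minimal sketch, assuming data holds your feature array (for example raw_data[0] from the walkthrough below):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# data: your feature array, e.g. raw_data[0] from the walkthrough below
inertias = [KMeans(n_clusters=k).fit(data).inertia_ for k in range(1, 11)]

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('inertia (within-cluster sum of squares)')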

😎😎😎😎 Now let’s try some code with K-means 😎😎😎😎

Using Python

#Create artificial data set

from sklearn.datasets import make_blobs

raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

#Data imports

import pandas as pd

import numpy as np

#Visualization imports

import seaborn

import matplotlib.pyplot as plt

%matplotlib inline

#Visualize the data

plt.scatter(raw_data[0][:,0], raw_data[0][:,1])

plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

#Build and train the model

from sklearn.cluster import KMeans

model = KMeans(n_clusters=4)

model.fit(raw_data[0])

#See the predictions

model.labels_

model.cluster_centers_

Code Explanation

First

Now let’s use the make_blobs function to create some artificial data!

More specifically, here is how you could create a data set with 200 samples that has 2 features and 4 cluster centers. The standard deviation within each cluster will be set to 1.8.
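
Note that make_blobs returns a tuple: the first element is the feature array and the second is the array of true cluster labels. That is why the code indexes raw_data[0] and raw_data[1]. You could also unpack the tuple for readability:

features, labels = raw_data              # make_blobs returns (features, labels)
print(features.shape, labels.shape)      # (200, 2) and (200,)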

Next

The best way to verify that this has been handled correctly is by creating some quick data visualizations.

plt.scatter(raw_data[0][:,0], raw_data[0][:,1])

If you want to color the points by their cluster label, use this code:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

Next

Let’s create an instance of the KMeans class with the parameter n_clusters=4 and assign it to the variable model:

from sklearn.cluster import KMeans

model = KMeans(n_clusters=4)
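
K-means starts from random centroid positions, so the exact labels can change from run to run. If you want reproducible results, you can also pass the optional random_state parameter:

model = KMeans(n_clusters=4, random_state=42)   # fixed seed makes the clustering reproducible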

Next

Now let’s train our model by invoking the fit method on it and passing in the first element of our raw_data tuple:

model.fit(raw_data[0])

Next

First, let’s see which cluster each data point was assigned to. To do this, access the labels_ attribute on our model using the dot operator:

model.labels_
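
To see how well the model did, you can plot the clusters found by K-means next to the original blobs:

# Left: clusters found by K-means; right: the true blobs generated by make_blobs.
# The colors may be permuted between the panels, since K-means numbers its clusters arbitrarily.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 6))
ax1.set_title('K-means labels')
ax1.scatter(raw_data[0][:, 0], raw_data[0][:, 1], c=model.labels_)
ax2.set_title('Original labels')
ax2.scatter(raw_data[0][:, 0], raw_data[0][:, 1], c=raw_data[1])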

To see where the center of each cluster lies, access the cluster_centers_ attribute, again using the dot operator:

model.cluster_centers_

Next

Finally, let’s predict the cluster for a new data point, for example one located at (4.0, 5.0):

test = np.array([4.0, 5.0])           # a single new data point
second_test = test.reshape(1, -1)     # reshape to (1, n_features); predict expects a 2D array
model.predict(second_test)            # returns the index of the closest cluster

😋😋😋😋 Finished! 😋😋😋😋

Conclusion

The goal of unsupervised learning using clustering is to discover significant correlations in data that you would not have seen otherwise.
It’s up to you to decide whether or not those connections provide a solid foundation for actionable knowledge.

Hope the tutorial was helpful. If there is anything we missed, do let us know in the comments. 😇
