Learn K-means Clustering
Introduction
One of the most popular machine learning algorithms is K-means clustering. It is an unsupervised learning algorithm, meaning it works on unlabeled datasets: it groups data points by their similarity, revealing patterns in the data as k clusters.
What is K-means Clustering?
A formal definition: K-means clustering is an iterative algorithm that partitions a dataset of n values into k subgroups, where each of the n values belongs to the cluster with the nearest mean (centroid).
The objective of K-means clustering is to minimize the sum of squared Euclidean distances between each point and the centroid of its cluster.
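To make the iteration concrete, here is a minimal NumPy sketch of the algorithm (illustrative only; it does not handle edge cases such as empty clusters, and in practice you would use scikit-learn's `KMeans`, shown later):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (assumes no cluster ends up empty, which a real implementation must handle)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids
```

The two steps alternate until the centroids stop moving, which is exactly the iterative minimization described above.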
Advantages of k-means
- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence.
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- With suitable extensions (generalized k-means), can handle clusters of different shapes and sizes, such as elliptical clusters; plain k-means works best on roughly spherical clusters.
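As an example of warm-starting, scikit-learn's `KMeans` accepts an explicit array of starting centroids through its `init` parameter. The starting positions below are hypothetical values chosen for illustration, as if carried over from an earlier run:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=4, random_state=42)

# Hypothetical starting centroids, e.g. saved from a previous fit
initial_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0], [5.0, -5.0]])

# n_init=1 because we supply the starting positions ourselves,
# so there is no need for multiple random restarts
model = KMeans(n_clusters=4, init=initial_centroids, n_init=1)
model.fit(X)
```

Starting near good centroids typically means fewer iterations to convergence.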
The term ‘K’ is the number of clusters you ask the algorithm to create. For example, K = 2 produces two clusters. There are also heuristics for finding a good or optimal value of K for a given dataset.
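One common heuristic for choosing K is the elbow method: fit K-means for a range of K values, plot the within-cluster sum of squared distances (exposed as `inertia_` in scikit-learn), and pick the K where the curve bends. A sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.8, random_state=42)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method")
```

Inertia always shrinks as K grows, so you look for the point where the gains flatten out rather than the minimum.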
😎😎😎😎Now let's try some K-means code😎😎😎😎
Using Python
#Create artificial data set
from sklearn.datasets import make_blobs
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)
#Data imports
import pandas as pd
import numpy as np
#Visualization imports
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline
#Visualize the data
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(raw_data[0])
#See the predictions
model.labels_
model.cluster_centers_
Code Explanation
First
Now let’s use the make_blobs function to create some artificial data! More specifically, here is how you could create a data set with 200 samples that has 2 features and 4 cluster centers. The standard deviation within each cluster is set to 1.8.
Next
The best way to verify that this has been handled correctly is by creating some quick data visualizations.
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
If you want to color the points by cluster, use this code:
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
Next
let's create an instance of the KMeans class with the parameter n_clusters=4 and assign it to the variable model:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
Next
Now let’s train our model by invoking the fit
method on it and passing in the first element of our raw_data
tuple
model.fit(raw_data[0])
Next
First, let’s see which cluster each data point was assigned to. To do this, access the labels_ attribute of our model using the dot operator:
model.labels_
To see where the center of each cluster lies, access the cluster_centers_ attribute, again using the dot operator:
model.cluster_centers_
Next
Finally, let's predict the cluster for a new data point, for example one located at [4.0, 5.0]:
test = np.array([4.0, 5.0])
second_test = test.reshape(1, -1)  # predict expects a 2D array
model.predict(second_test)
😋😋😋😋Finished!😋😋😋😋
Conclusion
The goal of unsupervised learning using clustering is to discover significant correlations in data that you would not have seen otherwise.
It’s up to you to decide whether or not those connections provide a solid foundation for actionable knowledge.
Hope the tutorial was helpful. If there is anything we missed out, do let us know through comments.😇