Retrieving Wikipedia articles using k-means clustering (Machine Learning)

Abdul Baasith
4 min read · Sep 7, 2021

Nearest neighbors is a simple algorithm widely used in predictive analysis. It groups data by assigning an item to a cluster based on which other items are most similar to it. k-means clustering is an unsupervised learning algorithm used for clustering. A basic similarity-based clustering procedure works as follows:

  1. Compare each item in the dataset to every other item and compute a similarity value for each pair.
  2. Store these values in a matrix, referred to as the distance matrix, which holds the similarity value for every pair of items in the dataset.
  3. Using the distance matrix, examine every item to see whether its distance to its neighbors is less than a value you have defined. This value is called the threshold.
  4. Start with each element in a separate cluster, then analyze the items, decide which ones are similar, and add similar items to the same cluster.
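The steps above can be sketched in plain Python. This is a minimal illustration with hypothetical 2-D points, using Euclidean distance and a tiny union-find to merge clusters; it is not the exact algorithm from the notebook:

```python
import math

# Toy 2-D points (hypothetical data for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Steps 1-2: compare every pair of items and store the results
# in a distance matrix.
n = len(points)
dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

# Steps 3-4: start with each item in its own cluster, then merge items
# whose distance is below the chosen threshold.
threshold = 1.0
cluster_of = list(range(n))           # each item starts in its own cluster

def find(i):                          # follow links to the cluster root
    while cluster_of[i] != i:
        i = cluster_of[i]
    return i

for i in range(n):
    for j in range(i + 1, n):
        if dist[i][j] < threshold:
            cluster_of[find(j)] = find(i)

labels = [find(i) for i in range(n)]
print(labels)  # items 0 and 1 share a cluster, 2 and 3 share another, 4 is alone
```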

K-Means Algorithm

The k-means algorithm is an unsupervised clustering algorithm. It takes a bunch of unlabeled points and tries to group them into “k” number of clusters.

It is unsupervised because the points have no external classification.

The “k” in k-means denotes the number of clusters you want to have in the end. If k = 3, you will have 3 clusters in the data set.
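Here is a minimal k-means sketch in plain Python on toy 2-D data. A production implementation would also check for convergence and handle empty clusters more carefully, but the two alternating steps are the whole idea:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))      # pick k initial centers
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, clusters

# Two well-separated toy blobs; k = 2 should recover them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```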

TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular. Common words like the, and, is, and to are therefore down-weighted.
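The two metrics multiplied together can be computed in a few lines of plain Python. This is a toy sketch with made-up sentences; real pipelines use a library implementation with smoothing and normalization:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are the best pets",
]

def tf_idf(docs):
    tokenized = [d.split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

scores = tf_idf(docs)
# "the" appears in every document, so its score collapses to zero,
# while "cat" appears in only one document and scores higher.
print(scores[0]["the"], scores[0]["cat"])
```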

Once you’ve transformed words into numbers in a way that machine learning algorithms can understand, the TF-IDF scores can be fed to algorithms such as Naive Bayes, Support Vector Machines, and nearest neighbors, greatly improving the results of more basic methods like raw word counts.

We use nearest neighbor search for this model.

Let’s learn it by coding, step by step.

Get the SFrame from the link below and start using it in your .ipynb notebook:

https://drive.google.com/drive/folders/1UJmDH2MdO539DeDF1oe978YtzZIrvN7A?usp=sharing

First, explore the dataset.
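The original notebook does this with a Turi Create SFrame; since the code screenshots are not reproduced here, this is a rough plain-Python equivalent. It assumes the data is exported as a CSV with columns URI, name, and text, and the inline sample rows are hypothetical stand-ins; with the downloaded data you would open the file instead:

```python
import csv
import io

# Tiny inline sample standing in for the real people_wiki data
# (hypothetical rows; replace io.StringIO(...) with open("people_wiki.csv")).
sample = io.StringIO(
    "URI,name,text\n"
    "<uri1>,Barack Obama,obama served as the 44th president of the united states\n"
    "<uri2>,George Clooney,clooney is an american actor and filmmaker\n"
)
rows = list(csv.DictReader(sample))
for row in rows[:2]:                  # peek at the first rows
    print(row["name"], "->", row["text"][:40])
```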

Next, take a look at the word counts.

Next, calculate the word counts and append them to the frame.
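In the notebook this step adds a word-count column to the SFrame; a plain-Python sketch of the same idea, assuming each article's text is available as a string (the articles below are shortened, hypothetical stand-ins):

```python
from collections import Counter

articles = {
    "Barack Obama": "obama served as the 44th president of the united states",
    "George Clooney": "clooney is an american actor and filmmaker",
}

# One word-count dictionary per article, playing the role of the
# word-count column appended to the frame.
word_counts = {name: dict(Counter(text.split()))
               for name, text in articles.items()}
print(word_counts["Barack Obama"]["the"])  # "the" appears twice
```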

Next, calculate the TF-IDF.

Now explore the dataset again and look at the TF-IDF values.
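To see what this buys us, here is a toy example of the TF-IDF representation: the highest-scoring words in an article are its distinctive ones, not the common filler words. The mini-articles are hypothetical:

```python
import math
from collections import Counter

articles = {
    "Obama": "obama was the president of the united states",
    "Clooney": "clooney is an actor in the united states",
    "Federer": "federer is a tennis player from switzerland",
}

tokenized = {name: text.split() for name, text in articles.items()}
n = len(articles)
# Document frequency across the corpus.
df = Counter(w for words in tokenized.values() for w in set(words))

tfidf = {
    name: {w: c * math.log(n / df[w]) for w, c in Counter(words).items()}
    for name, words in tokenized.items()
}

# "obama" outscores "the" even though "the" occurs more often.
print(tfidf["Obama"]["obama"], tfidf["Obama"]["the"])
```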

Next, let’s calculate the cosine distance: the smaller the distance, the more similar an article is to the Obama article.
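Cosine distance over sparse word-vectors is short to write out. A minimal sketch, with hypothetical TF-IDF-style vectors for three articles:

```python
import math

def cosine_distance(a, b):
    # a and b are word -> weight dictionaries (e.g. TF-IDF vectors).
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical weight vectors, for illustration only.
obama   = {"president": 3.0, "united": 1.0, "states": 1.0}
biden   = {"president": 2.0, "senator": 1.0, "united": 1.0}
federer = {"tennis": 3.0, "wimbledon": 2.0}

# The smaller the distance, the more similar the article is to Obama's.
print(cosine_distance(obama, biden) < cosine_distance(obama, federer))  # True
```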

Now apply the nearest neighbors model.
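In the notebook this is a one-liner with Turi Create's nearest-neighbors model; underneath, a brute-force version just ranks every article by its distance to the query. A plain-Python sketch with the same hypothetical vectors as above:

```python
import math

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, corpus, k=2):
    # Brute-force search: sort every article by distance to the query.
    ranked = sorted(corpus, key=lambda name: cosine_distance(query, corpus[name]))
    return ranked[:k]

corpus = {
    "Obama":   {"president": 3.0, "united": 1.0, "states": 1.0},
    "Biden":   {"president": 2.0, "senator": 1.0, "united": 1.0},
    "Federer": {"tennis": 3.0, "wimbledon": 2.0},
}
print(nearest_neighbors(corpus["Obama"], corpus))  # Obama first, then Biden
```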

Next, try someone else, and compare the word counts against the TF-IDF values.

Finally, create two models, one on word counts and one on TF-IDF, and compare them.
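A toy comparison of the two models shows why TF-IDF wins: with raw counts, filler words like “the” dominate the similarity, while TF-IDF down-weights them and matches on the meaningful words. The three mini-documents are hypothetical:

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

docs = {
    "A": "the the the president obama",
    "B": "the the the tennis federer",
    "C": "the president biden senator",
}
tokenized = {k: v.split() for k, v in docs.items()}

# Model 1: raw word counts.
counts = {k: dict(Counter(v)) for k, v in tokenized.items()}

# Model 2: TF-IDF weights.
n = len(docs)
df = Counter(w for words in tokenized.values() for w in set(words))
tfidf = {k: {w: c * math.log(n / df[w]) for w, c in Counter(v).items()}
         for k, v in tokenized.items()}

def nearest(query, vectors):
    others = {k: v for k, v in vectors.items() if k != query}
    return min(others, key=lambda k: cosine_distance(vectors[query], others[k]))

# Raw counts are dominated by "the", so B looks closest to A;
# TF-IDF zeroes out "the" and picks C, which shares "president".
print(nearest("A", counts), nearest("A", tfidf))
```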

GitHub code link:

https://gist.github.com/baasithshiyam/02ddd4c302956720d16c4d1d51e9c687

Done

Hope the tutorial was helpful. If there is anything we missed, do let us know in the comments. 😇

❤️❤️❤️❤️❤️❤️❤️Thanks for reading❤️❤️❤️❤️❤️❤️❤️❤️


Abdul Baasith

Hi there, I’m Abdul Baasith, a Software Engineer. I’m typically a person who thinks outside the box. If your only tool is a hammer, then every problem looks like a nail.