Retrieving Wikipedia articles using k-means clustering (Machine Learning)

Abdul Baasith
4 min read · Sep 7, 2021

Nearest neighbors is a simple algorithm widely used in predictive analysis. It groups data by assigning an item to a cluster based on which other items are most similar to it. k-means clustering is an unsupervised learning algorithm used for clustering. A basic similarity-based clustering procedure works as follows:

  1. Compare each item in the dataset to every other item and compute a similarity value for each pair.
  2. Store these values in a matrix, referred to as the distance matrix, which holds the similarity value for every pair of items in the dataset.
  3. Using the distance matrix, examine every item to see whether its distance to its neighbors is less than a value you have defined. This value is called the threshold.
  4. Start with each element in a separate cluster, then analyze the items, decide which ones are similar, and add similar items to the same cluster.
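The steps above can be sketched in plain Python. This is a minimal illustration with hypothetical 2-D points, using Euclidean distance and a tiny union-find to merge clusters; it is not the exact algorithm from the notebook:

```python
import math

# Toy 2-D points (hypothetical data for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Steps 1-2: compare every pair of items and store the results
# in a distance matrix.
n = len(points)
dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

# Steps 3-4: start with each item in its own cluster, then merge items
# whose distance is below the chosen threshold.
threshold = 1.0
cluster_of = list(range(n))           # each item starts in its own cluster

def find(i):                          # follow links to the cluster root
    while cluster_of[i] != i:
        i = cluster_of[i]
    return i

for i in range(n):
    for j in range(i + 1, n):
        if dist[i][j] < threshold:
            cluster_of[find(j)] = find(i)

labels = [find(i) for i in range(n)]
print(labels)  # items 0 and 1 share a cluster, 2 and 3 share another, 4 is alone
```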

K-Means Algorithm

The k-means algorithm is an unsupervised clustering algorithm. It takes a bunch of unlabeled points and tries to group them into “k” number of clusters.

It is unsupervised because the points have no external classification.

The “k” in k-means denotes the number of clusters you want to have in the end. If k = 3, you will have 3 clusters in the data set.
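Here is a minimal k-means sketch in plain Python on toy 2-D data. A production implementation would also check for convergence and handle empty clusters more carefully, but the two alternating steps are the whole idea:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))      # pick k initial centers
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, clusters

# Two well-separated toy blobs; k = 2 should recover them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```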

TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular. Common words like the, and, is, and to are therefore down-weighted.
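The two metrics multiplied together can be computed in a few lines of plain Python. This is a toy sketch with made-up sentences; real pipelines use a library implementation with smoothing and normalization:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are the best pets",
]

def tf_idf(docs):
    tokenized = [d.split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

scores = tf_idf(docs)
# "the" appears in every document, so its score collapses to zero,
# while "cat" appears in only one document and scores higher.
print(scores[0]["the"], scores[0]["cat"])
```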

Once you’ve transformed words into numbers in a way that machine learning algorithms can understand, the TF-IDF scores can be fed to algorithms such as Naive Bayes, Support Vector Machines, and nearest neighbors, greatly improving the results of more basic methods like raw word counts.

We use nearest neighbor search for this model.

Let’s learn it by coding, step by step.

Get the SFrame from the link below and start using it in your .ipynb notebook:

https://drive.google.com/drive/folders/1UJmDH2MdO539DeDF1oe978YtzZIrvN7A?usp=sharing

First, explore the dataset.
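The original notebook does this with a Turi Create SFrame; since the code screenshots are not reproduced here, this is a rough plain-Python equivalent. It assumes the data is exported as a CSV with columns URI, name, and text, and the inline sample rows are hypothetical stand-ins; with the downloaded data you would open the file instead:

```python
import csv
import io

# Tiny inline sample standing in for the real people_wiki data
# (hypothetical rows; replace io.StringIO(...) with open("people_wiki.csv")).
sample = io.StringIO(
    "URI,name,text\n"
    "<uri1>,Barack Obama,obama served as the 44th president of the united states\n"
    "<uri2>,George Clooney,clooney is an american actor and filmmaker\n"
)
rows = list(csv.DictReader(sample))
for row in rows[:2]:                  # peek at the first rows
    print(row["name"], "->", row["text"][:40])
```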

Next, take a look at the word counts.

Next, calculate the word counts and append them to the frame.
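In the notebook this step adds a word-count column to the SFrame; a plain-Python sketch of the same idea, assuming each article's text is available as a string (the articles below are shortened, hypothetical stand-ins):

```python
from collections import Counter

articles = {
    "Barack Obama": "obama served as the 44th president of the united states",
    "George Clooney": "clooney is an american actor and filmmaker",
}

# One word-count dictionary per article, playing the role of the
# word-count column appended to the frame.
word_counts = {name: dict(Counter(text.split()))
               for name, text in articles.items()}
print(word_counts["Barack Obama"]["the"])  # "the" appears twice
```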

Next, calculate the TF-IDF.

Now explore the dataset again and look at the TF-IDF values.
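To see what this buys us, here is a toy example of the TF-IDF representation: the highest-scoring words in an article are its distinctive ones, not the common filler words. The mini-articles are hypothetical:

```python
import math
from collections import Counter

articles = {
    "Obama": "obama was the president of the united states",
    "Clooney": "clooney is an actor in the united states",
    "Federer": "federer is a tennis player from switzerland",
}

tokenized = {name: text.split() for name, text in articles.items()}
n = len(articles)
# Document frequency across the corpus.
df = Counter(w for words in tokenized.values() for w in set(words))

tfidf = {
    name: {w: c * math.log(n / df[w]) for w, c in Counter(words).items()}
    for name, words in tokenized.items()
}

# "obama" outscores "the" even though "the" occurs more often.
print(tfidf["Obama"]["obama"], tfidf["Obama"]["the"])
```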

Next, let’s calculate the cosine distance: the smaller the distance, the more similar an article is to the Obama article.
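Cosine distance over sparse word-vectors is short to write out. A minimal sketch, with hypothetical TF-IDF-style vectors for three articles:

```python
import math

def cosine_distance(a, b):
    # a and b are word -> weight dictionaries (e.g. TF-IDF vectors).
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical weight vectors, for illustration only.
obama   = {"president": 3.0, "united": 1.0, "states": 1.0}
biden   = {"president": 2.0, "senator": 1.0, "united": 1.0}
federer = {"tennis": 3.0, "wimbledon": 2.0}

# The smaller the distance, the more similar the article is to Obama's.
print(cosine_distance(obama, biden) < cosine_distance(obama, federer))  # True
```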

Now apply the nearest neighbors model.
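In the notebook this is a one-liner with Turi Create's nearest-neighbors model; underneath, a brute-force version just ranks every article by its distance to the query. A plain-Python sketch with the same hypothetical vectors as above:

```python
import math

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, corpus, k=2):
    # Brute-force search: sort every article by distance to the query.
    ranked = sorted(corpus, key=lambda name: cosine_distance(query, corpus[name]))
    return ranked[:k]

corpus = {
    "Obama":   {"president": 3.0, "united": 1.0, "states": 1.0},
    "Biden":   {"president": 2.0, "senator": 1.0, "united": 1.0},
    "Federer": {"tennis": 3.0, "wimbledon": 2.0},
}
print(nearest_neighbors(corpus["Obama"], corpus))  # Obama first, then Biden
```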

Next, try someone else, and compare the word counts against the TF-IDF values.

Finally, create two models, one on word counts and one on TF-IDF, and compare them.
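A toy comparison of the two models shows why TF-IDF wins: with raw counts, filler words like “the” dominate the similarity, while TF-IDF down-weights them and matches on the meaningful words. The three mini-documents are hypothetical:

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

docs = {
    "A": "the the the president obama",
    "B": "the the the tennis federer",
    "C": "the president biden senator",
}
tokenized = {k: v.split() for k, v in docs.items()}

# Model 1: raw word counts.
counts = {k: dict(Counter(v)) for k, v in tokenized.items()}

# Model 2: TF-IDF weights.
n = len(docs)
df = Counter(w for words in tokenized.values() for w in set(words))
tfidf = {k: {w: c * math.log(n / df[w]) for w, c in Counter(v).items()}
         for k, v in tokenized.items()}

def nearest(query, vectors):
    others = {k: v for k, v in vectors.items() if k != query}
    return min(others, key=lambda k: cosine_distance(vectors[query], others[k]))

# Raw counts are dominated by "the", so B looks closest to A;
# TF-IDF zeroes out "the" and picks C, which shares "president".
print(nearest("A", counts), nearest("A", tfidf))
```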

GitHub code link:

https://gist.github.com/baasithshiyam/02ddd4c302956720d16c4d1d51e9c687

Done

Hope the tutorial was helpful. If there is anything we missed, do let us know in the comments. 😇

❤️❤️❤️❤️❤️❤️❤️Thanks for reading❤️❤️❤️❤️❤️❤️❤️❤️


Abdul Baasith

Hi there, I’m Abdul Baasith, a Software Engineer. I’m typically a person who thinks outside the box. If your only tool is a hammer, then every problem looks like a nail.