Decision Tree to Predict Attrition Using Machine Learning
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, a decision tree can be used to solve both regression and classification problems. Sounds interesting, right?
The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
We start at the root of the tree when using a Decision Tree to predict a class label for a record. The value of the root attribute is compared with the record’s attribute. Based on the comparison, we follow the branch that corresponds to that value and move to the next node. An illustration is given below.
Types of Decision Trees
Decision Trees are classified into two types, based on the target variables.
- Categorical Variable Decision Trees: This is where the algorithm has a categorical target variable. For example, suppose you are asked to predict the success of a company as one of three categories: low, medium, or high. Features could include assets, liabilities, debts, profit, etc. The decision tree will learn from these features, and after passing each data point through the nodes, it will end up at a leaf node with one of the three categorical targets: low, medium, or high.
- Continuous Variable Decision Trees: In this case the features input to the decision tree (e.g. qualities of a house) will be used to predict a continuous output (e.g. the price of that house).
Decision Tree Terminology
- Root Node: Represents the entire population or sample, which then gets divided into sub-nodes.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf (Terminal Node): A node that does not split any further.
- Pruning: Removing sub-nodes of a decision node; you can think of it as the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree.
How do Decision Trees work?
To decide whether to divide a node into two or more sub-nodes, decision trees employ a variety of methods. The aim is that the homogeneity of the resulting sub-nodes improves with each split. To put it another way, the purity of a node with respect to the target variable increases as we move down the tree.
So the million-dollar question: how do we decide which attribute should be the root node?
Attribute Selection Measures
If the dataset consists of n attributes, then deciding which attribute to place at the root, or at the different levels of the tree as internal nodes, is a complicated step. Just randomly selecting a node to be the root can’t solve the issue; a random approach may give us bad results with low accuracy.
For solving this attribute selection problem, researchers devised several measures. They suggested:
- Entropy
- Information Gain
- Gini Index
- Gain Ratio
- Reduction in Variance
- Chi-Square
Entropy
Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
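Formally, the entropy of a set is H = −Σ pᵢ log₂ pᵢ, where pᵢ is the fraction of samples in class i. A minimal sketch, using made-up labels rather than the attrition dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy(["Yes", "Yes", "No", "No"]))   # maximum impurity for two classes
print(entropy(["Yes", "Yes", "Yes", "Yes"])) # a pure node has zero entropy
```

A 50/50 split of two classes gives an entropy of 1 bit, while a pure node gives 0, which is why splits that reduce entropy are preferred.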
So what if we have multiple attributes? We compare them using information gain, the reduction in entropy obtained by splitting on an attribute, and choose the attribute with the highest gain.
Gini Index
- Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
- It means an attribute with a lower Gini index should be preferred.
So how is it calculated? For a node, Gini = 1 − Σ pᵢ², where pᵢ is the fraction of samples belonging to class i.
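A small sketch of that formula, again on toy labels rather than the real dataset:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(["Yes", "No", "Yes", "No"]))  # 0.5, the worst case for two classes
print(gini(["Yes", "Yes", "Yes"]))       # 0.0, a pure node
```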
There are some other measures as well; I’ll be uploading a new blog with more details so you can learn about them.
Now let’s see how it’s done in code.
First of all, get the dataset:
https://drive.google.com/file/d/1jRXlhfnuThD6QM5_GTheXmcVODHqI0Gs/view?usp=sharing
Now mount Google Drive in the Python notebook so it can access the file.
Next
import all the important libraries
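A sketch of the imports this tutorial relies on (the original notebook’s exact list isn’t shown, so this is an assumption covering the steps that follow):

```python
# pandas/numpy for data handling, scikit-learn for the model and metrics.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
```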
Next
Read all the data from the csv using pandas
Explore the dataset more to get a clear understanding.
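A self-contained sketch of the read-and-explore step. A tiny made-up stand-in CSV is written first so the snippet runs on its own; in the notebook you would point `read_csv` at the attrition file downloaded from the link above:

```python
import pandas as pd

# Hypothetical stand-in for the real attrition CSV (only for illustration).
sample = "Age,Attrition,Department\n41,Yes,Sales\n49,No,Research\n"
with open("attrition_sample.csv", "w") as f:
    f.write(sample)

df = pd.read_csv("attrition_sample.csv")
print(df.head())    # first few rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # column types; object columns will need encoding later
```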
Next
Now we should remove the unwanted columns from the data frame. In my case I can remove ‘EmployeeCount’, ‘EmployeeNumber’, ‘Over18’, and ‘StandardHours’. This will differ depending on the dataset we are using and the scenario.
axis is 1 because we are targeting columns, and inplace=True modifies the data frame in place.
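A sketch of that drop, on a toy frame with the same column names (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [41, 49],
    "EmployeeCount": [1, 1],
    "EmployeeNumber": [1, 2],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
})

# axis=1 targets columns; inplace=True modifies df itself instead of returning a copy.
df.drop(["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"],
        axis=1, inplace=True)
print(df.columns.tolist())  # only 'Age' remains in this toy example
```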
Next
Take only the object-type columns, because the values must be numeric to run a machine learning model.
So now we will check the type of each column and append the object columns to an array.
Now we can see that Attrition is also not an int, so we need to change it to 1 and 0.
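One way those two steps can look (the `object_col` name follows the tutorial; the frame itself is a toy stand-in):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "Department": ["Sales", "Research"],
})

# Collect the object (string-typed) feature columns that still need encoding.
object_col = [c for c in df.columns if df[c].dtype == "object" and c != "Attrition"]

# Attrition is the target: map Yes/No to 1/0.
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
print(object_col)                 # ['Department']
print(df["Attrition"].tolist())   # [1, 0]
```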
Next
Now let’s do the next data-processing step: change the categorical data into dummy variables for fitting the model.
Now we can see all the values are ints and the data is transformed.
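The dummy-variable step can be sketched with `pd.get_dummies` (a toy single-column frame here; in the notebook you would pass the full feature frame and its categorical columns):

```python
import pandas as pd

df = pd.DataFrame({"Department": ["Sales", "Research", "Sales"]})

# One indicator column per category replaces the original string column.
dummies = pd.get_dummies(df, columns=["Department"])
print(dummies.columns.tolist())  # ['Department_Research', 'Department_Sales']
```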
Next
encode the labels in the object_col
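Assuming the notebook uses scikit-learn’s `LabelEncoder` for this, a minimal sketch on made-up category values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the sorted set of categories and returns integer codes.
codes = le.fit_transform(["Sales", "Research", "Sales"])
print(list(codes))        # [1, 0, 1]
print(list(le.classes_))  # ['Research', 'Sales']
```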
Next
Split train and test data
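A sketch of the split with `train_test_split`; the toy arrays and the 70/30 ratio are assumptions, since the notebook’s exact parameters aren’t shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.array([0, 1] * 5)          # toy 0/1 target (like the encoded Attrition)

# 30% held out for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```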
Next
Create a function to evaluate the model with accuracy and a confusion matrix. I’ll be making a blog on the confusion matrix as well.
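One way such a helper can look (the `evaluate` name and the toy model are my own, for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_test, y_test):
    """Print the accuracy and confusion matrix of a fitted classifier."""
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print("Accuracy:", acc)
    print("Confusion matrix:\n", confusion_matrix(y_test, preds))
    return acc

# Toy demonstration; in the notebook you would pass the attrition model
# and its held-out test split instead.
toy = DecisionTreeClassifier(random_state=42).fit([[0], [1]], [0, 1])
evaluate(toy, [[0], [1]], [0, 1])
```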
Next
Now let’s create the decision tree classifier using scikit-learn.
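A minimal sketch of creating and fitting the classifier, on toy data in place of the real train split (`criterion="gini"` matches the Gini splits visible in the tree later, and is also scikit-learn’s default):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])  # stand-ins for X_train, y_train
print(clf.predict([[2.5]]))  # predict the class of a new sample
```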
Next
Now let’s see how the model performs.
Next
Let’s try to predict a result.
Wow, we got it right!
Next
Let’s try to visualize the decision tree.
If we zoom in, we can see that the Gini index was used at each split.
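The notebook draws the tree graphically (scikit-learn’s `plot_tree` with matplotlib produces that kind of figure); a text rendering of the same fitted structure, shown here on a toy tree, uses `export_text`:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])  # toy stand-in for the real data

# Prints the split thresholds and leaf classes, one line per node.
print(export_text(clf, feature_names=["Age"]))
```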
Hurray! We have completed the Decision Tree classifier on a relevant, large dataset.
Github link below:
YouTube video
Advantages and Disadvantages
Advantages
Decision trees take very little time in processing the data when compared to other algorithms. Few preprocessing steps like normalization, transformation, and scaling the data can be skipped.
Even when there are missing values in the dataset, the performance of the model won’t be affected much.
A Decision Tree model is intuitive and easy to explain to both technical teams and stakeholders, and can be implemented across several organizations.
Disadvantages
In decision trees, small changes in the data can cause a large change in the structure of the tree, which in turn leads to instability.
The training time increases drastically in proportion to the size of the dataset. In some cases, the calculations can become complex compared to other traditional algorithms.
Hope the tutorial was helpful. If there is anything we missed out, do let us know through comments.😇
❤️❤️❤️❤️❤️❤️❤️Thanks for reading❤️❤️❤️❤️❤️❤️❤️❤️