Decision Tree to Predict Attrition Using Machine Learning
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, a decision tree can be used to solve both regression and classification problems. Sounds interesting, right?
The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
We start at the root of the tree when using a Decision Tree to predict a class label for a record. The value of the root attribute is compared with the record’s attribute. Based on the comparison, we follow the branch that corresponds to that value and move to the next node. An illustration is given below.
Types of Decision Trees
Decision Trees are classified into two types, based on the target variables.
- Categorical Variable Decision Trees: This is where the algorithm has a categorical target variable. For example, suppose you are asked to predict the success of a company as one of three categories: low, medium, or high. Features could include assets, liabilities, debts, profit, etc. The decision tree will learn from these features, and after passing each data point through the nodes, it will end up at a leaf node with one of the three categorical targets: low, medium, or high.
- Continuous Variable Decision Trees: In this case the features input to the decision tree (e.g. qualities of a house) will be used to predict a continuous output (e.g. the price of that house).
Decision Tree Terminology
- Root Node: Represents the entire population or sample, which then gets divided into sub-nodes.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf (Terminal Node): A node that does not split any further.
- Pruning: Removing sub-nodes of a decision node; you can think of it as the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree.
How do Decision Trees work?
To decide whether to divide a node into two or more sub-nodes, decision trees employ a variety of methods. The aim is that the homogeneity of the resulting sub-nodes improves with each split. To put it another way, the purity of a node with respect to the target variable increases as we move down the tree.
So the million-dollar question: how do we decide which attribute should be the root node?
Attribute Selection Measures
If the dataset consists of n attributes, then deciding which attribute to place at the root, or at the different levels of the tree as internal nodes, is a complicated step. Just randomly selecting a node to be the root can’t solve the issue; a random approach may give us bad results with low accuracy.
For solving this attribute selection problem, researchers devised several measures. They suggested:
- Entropy
- Information Gain
- Gini Index
- Gain Ratio
- Reduction in Variance
- Chi-Square
Entropy
Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
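Formally, the entropy of a set is H = −Σ pᵢ log₂ pᵢ, where pᵢ is the fraction of samples in class i. A minimal sketch, using made-up labels rather than the attrition dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy(["Yes", "Yes", "No", "No"]))   # maximum impurity for two classes
print(entropy(["Yes", "Yes", "Yes", "Yes"])) # a pure node has zero entropy
```

A 50/50 split of two classes gives an entropy of 1 bit, while a pure node gives 0, which is why splits that reduce entropy are preferred.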
So what if we have multiple attributes? We compare them using information gain, the reduction in entropy obtained by splitting on an attribute, and choose the attribute with the highest gain.
Gini Index
- Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
- It means an attribute with a lower Gini index should be preferred.
So how is it calculated? For a node, Gini = 1 − Σ pᵢ², where pᵢ is the fraction of samples belonging to class i.
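A small sketch of that formula, again on toy labels rather than the real dataset:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(["Yes", "No", "Yes", "No"]))  # 0.5, the worst case for two classes
print(gini(["Yes", "Yes", "Yes"]))       # 0.0, a pure node
```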
There are some other measures as well; I’ll be uploading a new blog with more details so you can learn about them.
Now let’s see how it’s done in code.
First of all, get the dataset:
https://drive.google.com/file/d/1jRXlhfnuThD6QM5_GTheXmcVODHqI0Gs/view?usp=sharing
Now mount Google Drive in the Python notebook so it can access the file.
Next
import all the important libraries
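A sketch of the imports this tutorial relies on (the original notebook’s exact list isn’t shown, so this is an assumption covering the steps that follow):

```python
# pandas/numpy for data handling, scikit-learn for the model and metrics.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
```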
Next
Read all the data from the csv using pandas
Explore the dataset more to get a clear understanding.
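A self-contained sketch of the read-and-explore step. A tiny made-up stand-in CSV is written first so the snippet runs on its own; in the notebook you would point `read_csv` at the attrition file downloaded from the link above:

```python
import pandas as pd

# Hypothetical stand-in for the real attrition CSV (only for illustration).
sample = "Age,Attrition,Department\n41,Yes,Sales\n49,No,Research\n"
with open("attrition_sample.csv", "w") as f:
    f.write(sample)

df = pd.read_csv("attrition_sample.csv")
print(df.head())    # first few rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # column types; object columns will need encoding later
```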
Next
Now we should remove the unwanted columns from the data frame. In my case I can remove ‘EmployeeCount’, ‘EmployeeNumber’, ‘Over18’, and ‘StandardHours’. This will differ depending on the dataset we are using and the scenario.
axis is 1 because we are targeting columns, and inplace=True modifies the data frame in place.
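A sketch of that drop, on a toy frame with the same column names (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [41, 49],
    "EmployeeCount": [1, 1],
    "EmployeeNumber": [1, 2],
    "Over18": ["Y", "Y"],
    "StandardHours": [80, 80],
})

# axis=1 targets columns; inplace=True modifies df itself instead of returning a copy.
df.drop(["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"],
        axis=1, inplace=True)
print(df.columns.tolist())  # only 'Age' remains in this toy example
```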
Next
Take only the object-type columns, because the values must be numeric to run a machine learning model.
So now we will check the type of each column and append the object columns to an array.
Now we can see that Attrition is also not an int, so we need to change it to 1 and 0.
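One way those two steps can look (the `object_col` name follows the tutorial; the frame itself is a toy stand-in):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "Department": ["Sales", "Research"],
})

# Collect the object (string-typed) feature columns that still need encoding.
object_col = [c for c in df.columns if df[c].dtype == "object" and c != "Attrition"]

# Attrition is the target: map Yes/No to 1/0.
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
print(object_col)                 # ['Department']
print(df["Attrition"].tolist())   # [1, 0]
```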
Next
Now let’s do the next data-processing step: change the categorical data into dummy variables for fitting the model.
Now we can see all the values are ints and the data is transformed.
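The dummy-variable step can be sketched with `pd.get_dummies` (a toy single-column frame here; in the notebook you would pass the full feature frame and its categorical columns):

```python
import pandas as pd

df = pd.DataFrame({"Department": ["Sales", "Research", "Sales"]})

# One indicator column per category replaces the original string column.
dummies = pd.get_dummies(df, columns=["Department"])
print(dummies.columns.tolist())  # ['Department_Research', 'Department_Sales']
```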
Next
encode the labels in the object_col
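Assuming the notebook uses scikit-learn’s `LabelEncoder` for this, a minimal sketch on made-up category values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the sorted set of categories and returns integer codes.
codes = le.fit_transform(["Sales", "Research", "Sales"])
print(list(codes))        # [1, 0, 1]
print(list(le.classes_))  # ['Research', 'Sales']
```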
Next
Split train and test data
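A sketch of the split with `train_test_split`; the toy arrays and the 70/30 ratio are assumptions, since the notebook’s exact parameters aren’t shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.array([0, 1] * 5)          # toy 0/1 target (like the encoded Attrition)

# 30% held out for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```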
Next
Create a function to evaluate the model with accuracy and a confusion matrix. I’ll be making a blog on the confusion matrix as well.
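One way such a helper can look (the `evaluate` name and the toy model are my own, for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_test, y_test):
    """Print the accuracy and confusion matrix of a fitted classifier."""
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print("Accuracy:", acc)
    print("Confusion matrix:\n", confusion_matrix(y_test, preds))
    return acc

# Toy demonstration; in the notebook you would pass the attrition model
# and its held-out test split instead.
toy = DecisionTreeClassifier(random_state=42).fit([[0], [1]], [0, 1])
evaluate(toy, [[0], [1]], [0, 1])
```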
Next
Now let’s create the decision tree classifier using scikit-learn.
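A minimal sketch of creating and fitting the classifier, on toy data in place of the real train split (`criterion="gini"` matches the Gini splits visible in the tree later, and is also scikit-learn’s default):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])  # stand-ins for X_train, y_train
print(clf.predict([[2.5]]))  # predict the class of a new sample
```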
Next
Now let’s see how the model performs.
Next
Let’s try to predict a result.
Wow, we got it right!
Next
Let’s try to visualize the decision tree.
If we zoom in, we can see that the Gini index was used at each split.
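The notebook draws the tree graphically (scikit-learn’s `plot_tree` with matplotlib produces that kind of figure); a text rendering of the same fitted structure, shown here on a toy tree, uses `export_text`:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])  # toy stand-in for the real data

# Prints the split thresholds and leaf classes, one line per node.
print(export_text(clf, feature_names=["Age"]))
```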
Hurray! We have completed the Decision Tree classifier on a relevant, large dataset.
Github link below:
YouTube video
Advantages and Disadvantages
Advantages
Decision trees take very little time in processing the data when compared to other algorithms. Few preprocessing steps like normalization, transformation, and scaling the data can be skipped.
Even when there are missing values in the dataset, the performance of the model won’t be affected much.
A Decision Tree model is intuitive and easy to explain to both technical teams and stakeholders, and can be implemented across several organizations.
Disadvantages
In decision trees, small changes in the data can cause a large change in the structure of the tree, which in turn leads to instability.
The training time increases drastically in proportion to the size of the dataset. In some cases, the calculations can become complex compared to other traditional algorithms.
Hope the tutorial was helpful. If there is anything we missed out, do let us know through comments.😇
❤️❤️❤️❤️❤️❤️❤️Thanks for reading❤️❤️❤️❤️❤️❤️❤️❤️