What are Decision Trees in Machine Learning?
Table of contents:
1. What are decision trees?
2. What are Entropy and the Gini Index?
3. What is Information Gain?
4. How to avoid overfitting?
1. What are decision trees?
Decision Tree is a supervised machine learning algorithm that can be used to solve both classification and regression problems. It is a tree-structured algorithm, where the internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes represent the outcomes of those decisions and do not contain any further branches.
- Root Node: The root node is the starting point of the algorithm. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
- Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf node.
- Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
- Branch/Sub-Tree: A subsection of the tree formed by splitting a node.
- Pruning: Pruning is the process of removing unwanted branches from the tree.
- Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
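To make the terminology concrete, here is a minimal sketch of training a decision tree with scikit-learn; the tiny two-feature dataset, its feature names, and the labels are made up purely for illustration.

```python
# A minimal sketch: fit a decision tree on a made-up two-feature dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook (0 = sunny, 1 = rain), humidity (0 = normal, 1 = high)]
X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0], [0, 0]]
y = [0, 1, 0, 1, 0, 1, 1]  # 0 = "No", 1 = "Yes"

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Internal (decision) nodes test a feature; leaf nodes hold the final class.
print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[0, 1]]))  # classify a new sample
```

The printed tree starts at the root node, each indentation level corresponds to one split, and the lines ending in a class are the leaf nodes.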
2. What are Entropy and the Gini Index?
The purity of a split is checked using either the Entropy or the Gini Index.
i) Entropy is a metric that measures the impurity of a given attribute; it quantifies the randomness in the data. For a binary split, Entropy is calculated as follows,
H(S) = -(p+).log2(p+) - (p-).log2(p-)
In the above equation, p+ is the probability of the positive/Yes class, and p- is the probability of the negative/No class.
Consider an example where a node A containing 5 Yes and 2 No samples is split into node B, which receives 3 Yes and 2 No, and node C, which receives 2 Yes and 0 No.
The entropy of B ->
H(B) = -(3/5).log2(3/5) - (2/5).log2(2/5) ≈ 0.971
The entropy of C ->
H(C) = -(2/2).log2(2/2) - 0 = 0
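These values can be checked with a small helper; this is just a sketch that assumes each node is summarized by its Yes/No counts, as in the example above.

```python
import math

def entropy(yes: int, no: int) -> float:
    """Binary entropy H = -(p+).log2(p+) - (p-).log2(p-), with 0*log2(0) taken as 0."""
    total = yes + no
    h = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy(3, 2))  # node B: ~0.971 (impure)
print(entropy(2, 0))  # node C: 0.0 (pure)
```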
ii) The Gini index is calculated as,
Gini(S) = 1 - ((p+)^2 + (p-)^2)
For the same example,
Gini Index of B ->
Gini(B) = 1 - ((3/5)^2 + (2/5)^2) = 0.48
Gini Index of C ->
Gini(C) = 1 - ((2/2)^2 + 0) = 0
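A similar helper reproduces the Gini values, again assuming only the Yes/No counts of a node are known.

```python
def gini(yes: int, no: int) -> float:
    """Gini index = 1 - ((p+)^2 + (p-)^2)."""
    total = yes + no
    p_yes, p_no = yes / total, no / total
    return 1.0 - (p_yes ** 2 + p_no ** 2)

print(gini(3, 2))  # node B: 0.48 (impure)
print(gini(2, 0))  # node C: 0.0 (pure)
```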
It is recommended to use the Gini Index when there is a large number of features, as computing the Gini Index is faster than computing Entropy because it avoids the logarithm.
3. What is Information Gain?
The node to split upon is decided using Information Gain. It measures the change in entropy after segmenting a dataset on an attribute. A decision tree algorithm always tries to maximize information gain, and the node/attribute having the highest information gain is split first. Information Gain is calculated as follows,
Gain(S, A) = H(S) - Σ (|Sv|/|S|).H(Sv)
where the sum runs over the subsets Sv produced by splitting S on attribute A.
The feature with the greatest Information Gain is selected as the root node for splitting.
For the above example, the information gain of the split at node A can be calculated as follows,
* H(S) = -(5/7).log2(5/7) - (2/7).log2(2/7) ≈ 0.863
* H(node B) = -(3/5).log2(3/5) - (2/5).log2(2/5) ≈ 0.971
* H(node C) = -(2/2).log2(2/2) - 0 = 0
* Σ |Sv|/|S| . H(Sv) = (5/7).H(node B) + (2/7).H(node C) ≈ 0.694
* Gain = H(S) - Σ |Sv|/|S| . H(Sv) ≈ 0.863 - 0.694 = 0.169
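The whole calculation fits in a short sketch; it assumes the same Yes/No counts as the example and redefines the entropy helper so the snippet is self-contained.

```python
import math

def entropy(yes: int, no: int) -> float:
    """Binary entropy, with 0*log2(0) taken as 0."""
    total = yes + no
    return sum(-(c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def information_gain(parent, children):
    """Gain = H(parent) - sum(|child| / |parent| * H(child))."""
    total = sum(sum(c) for c in children)
    weighted = sum((sum(c) / total) * entropy(*c) for c in children)
    return entropy(*parent) - weighted

# Node A holds 5 Yes / 2 No and splits into B (3 Yes, 2 No) and C (2 Yes, 0 No).
print(information_gain((5, 2), [(3, 2), (2, 0)]))  # ~0.169
```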
4. How to avoid overfitting?
A decision tree can easily overfit during training because it keeps splitting nodes based on the features until the leaves are pure. The overfitting problem can be avoided by using the post-pruning or the pre-pruning method.
In the post-pruning method, the full decision tree is first grown on the given dataset, and branches that contribute little to generalization are then removed.
The pre-pruning method (early stopping) restricts the growth of the tree during training by tuning hyperparameters such as the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf.
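As a sketch of what both strategies look like in practice with scikit-learn (the hyperparameter values are illustrative, and X_train/y_train stand in for whatever training data is at hand):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning (early stopping): limit growth through hyperparameters.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,           # do not grow deeper than 4 levels
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
)

# Post-pruning: grow the full tree, then cut back weak branches with
# cost-complexity pruning (a larger ccp_alpha prunes more aggressively).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

# pre_pruned.fit(X_train, y_train)   # X_train, y_train: your training data
# post_pruned.fit(X_train, y_train)
```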