A decision tree is a supervised machine learning algorithm. Just like its name, a decision tree is a tree structure, and we can make a decision based on the tree structure we built. When we build a decision tree model, it will break down the data into smaller and smaller classes, leaves represent class labels and branches represent features that lead to those class labels. The idea of creating a decision tree seems fairly easy. However, how should we build a best-fitted decision tree model, and what algorithms we should use accordingly? Will the result change if we choose different metrics to build the decision tree? We will get the answer to those questions step by step.
First, we take a look at a simple example of a decision tree model. Supposed there are five people in a family, and then we want to find the person who likes playing basketball. There are some features we can use like age or gender. Therefore, for the root node, we first decide if this person is younger than 15 years old or not. If this person is younger than 15 years old, then we put this person into the left non-leaf node, otherwise the right non-leaf node. One thing we want to make sure of is that for each tree node, we just need a clear division index, for example, the person who is younger than 15 must love playing basketball, and the person older than 15 must dislike playing basketball. However, this did not help us to reach the end of the decision. Therefore, we use another feature, gender, to divide the node into smaller subsets. Assuming boys like to play basketball, girls do not like it. Finally, we would see a clear result. To sum up the whole process, we pass in five data points, each time we go down a little bit, and finally get the result, then we call this the process of building a decision tree.
The decision tree makes different decision attributes on each node. Make decisions from the first node, and then go down to make decisions from different nodes. The idea is fairly intuitive just input data, traverse the data to each node, and finally, the output will decide which category each person belongs to. However, let us back to the beginning. The very first question raised is how could we choose the root node? How about the next node? We want our root node to be the best feature that can divide the data into different groups. To evaluate the performance of different features on classifying the data points, we introduce some metrics to build the decision tree.
The first one is entropy; it is the measurement of the impurity or randomness in the data points. Entropy is the degree of randomness or uncertainty. If the sample is completely homogeneous then entropy is zero and if the sample is equally divided it has an entropy of one.
Entropy can be used for the core algorithm for building a decision tree is called ID3, and ID3 uses Entropy and Information Gain to construct a decision tree. Suppose P(X) is the probability that X happened, and H(X) measures the uncertainty of X happened. For example, X represents the earthquake happens in Rhode Island, we say that in this case, P(X) is small, and accordingly, H(X) would be large. The Entropy would be like H(X).
Now suppose we have two different lists. List1 = [1, 2, 1, 1, 4, 2, 3] and List2 = [1, 1, 1, 1, 1, 1, 1]. Then for List1 and List2, the impurity of List1 is relatively high, we said List1 is more random, the entropy of List1 is relatively high. Compared to List1, the purity of the List2 set is higher, and the randomness is lower, so entropy is lower. To prove these mathematics, we take a look at the formula of entropy. Then in List1, we could see that the absolute value of log-based data points would be large when we project them onto the formula of Entropy. However, for list 2, since we only have 1 in the list, the final entropy of List2 would be ln(0) which is 1. For List1 and List2, if the randomness inside the list is large, the entropy is large. If a set is very pure with only one category or only two categories inside, then the purity is high and the value of entropy is small.
Another similar metric I want to introduce is the Gini index. It is similar to entropy, but with a different calculation method. The larger the Gini index value, the list will be more random and the less pure. The larger the Gini index or entropy, the worse the current effect. However, we will have a smaller Gini index for the purer List.
Two widely used algorithms use entropy or Gini index to build a decision tree model. One is ID3, Iterative Dichotomies 3, which uses the Entropy function and Information gain as metrics. The other is CART, Classification, and Regression Trees, which uses the Gini Index as a metric. We can take a look at a real data set in which we will predict if a person have ‘Debt’ based on if he has ‘House’ and the ‘Marriage’ status:
ID3 -> Entropy and Information Gain:
If we use ID3 for a classification problem, the basic idea of building an efficient decision tree with comparable high accuracy is we want the entropy decrease rapidly as the increase if the depth of the decision tree. For example, with a faster entropy value, we only need 3 steps to reach the same result as we use 5 steps originally when building a decision tree. The faster the entropy decrease, the less depth we need for a decision tree model. First, we compute the entropy of the original target variables, Debt. We found that it is 0.6109.
Then we compute the entropy for each category we could use for the root node.
We pick the smallest one in these entropies we compute and compared it with the original entropy. The difference between those two entropies we would be able to compute is called the information gain. We could find that the entropy decreased most by using ‘Marriage’, and so ‘Marriage’ will have the largest information gain, so we use it as the root node.
This process is not only limited to computing the root node, we could also use it for computing another leaf node. Therefore, we keep repeating this process and finally reach the result we desired.
CART -> Gini Index:
CART is one of the most powerful algorithms to construct the decision tree model, and we use the Gini Index as the evaluation metric in CART. A CART tree is built by splitting a node into two leaf nodes repeatedly and finally constructed a binary tree. Although one feature may have many different values, CART still divides the data points into two-part. Since we already know how to compute the Gini index, the actual recursive division process will be like this:
We calculate the Gini gain by taking average information entropy for current features. Similarly, we would record the Gini gain for each feature and pick the best one. We could find that the Gini index is smallest by using ‘Marriage’, and we still use ‘Marriage’ as the root node like we conclude before using entropy and information gain.
There will not be so much difference no matter we use Entropy or Gini index as the evaluation metrics when building a decision tree model. But entropy performs well in some specific cases when we have extremely imbalanced data. According to the formula of entropy, it multiplies probabilities of the event with a log of probabilities. Therefore, lower probabilities will be scaled up under this circumstance. For example, if you have 2 events, A and B. The probability that A happened is 0.01, the probability that B happened is 0.99.
By using Gini index, we could see that the result is governed by majority probability, and lower probability have seldom effect on the result. However, by using entropy, lower probabilities weighted more. According to Laura Elena, some more minor differences will be that Entropy might be a little slower to compute because it uses the logarithm during computation. Besides, it only matters in 2% of the cases when you use different evaluation metrics. Also, the Gini index may be more appropriate to deal with the regression problem, and Entropy would be better to deal with the classification problem. Therefore, during the parameter tuning process, rather than making assumptions about which one would be better, trying both would be a better choice if you have enough time. Moreover, when we use uses Entropy as the base calculation, we have a wider range of results whereas the Gini Index will be scaled between 0 and 1. Therefore, the Gini Index may be easier to interpret.
Gini index measures the probability of a random sample being classified incorrectly if we randomly pick a class label, whereas Entropy is a measurement of information. They are all good evaluation metrics to be used in the decision tree. By using those algorithms, we build a decision tree without much data cleaning and the result is easy to read and interpret without even statistical knowledge. Some applications can be the use of demographic data to find prospective clients. Without a specific demographic in mind, the business may make mistakes in their marketing strategy and thus affect its total revenues. However, using entropy or Gini index can help them to reach a better result by performing a decision tree model. They can help the company to determine wise strategies and help the company to achieve its marketing goals. Besides using in business or finance fields, other fields like engineering, education, law, business, and healthcare can implement decision trees with entropy or Gini index as well. Both of them are powerful techniques to split the node in decision tree-based models.
CFI, “Decision Tree, a support tool with a tree-like structure that models probable outcomes, cost of resources, utilities, and possible consequences.” https://corporatefinanceinstitute.com/resources/knowledge/other/decision-tree/. Accessed Nov 20, 2020.
“Decision Tree.” Wikipedia, Wikimedia Foundation, Dec 21, 2011 https://en.wikipedia.org/wiki/Decision_tree. Accessed Nov 20, 2020.
L.E. Raileanu, K. Stoffel. “Theoretical comparison between the Gini Index and Information Gain criteria.” 2004 https://www.unine.ch/files/live/sites/imi/files/shared/documents/papers/Gini_index_fulltext.pdf. Accessed Nov 20, 2020.
Sam, T. “Entropy: How Decision Trees Make Decisions” Jan 10, 2019 https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8. Accessed 20 Nov. 2020.
Sanjeevi, Madhu. “Chapter 4: Decision Trees Algorithms.” Oct 6, 2017 https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1. Accessed 20 Nov. 2020.