
Decision Trees

peterwashington Nov 01 2021 6 min read

Keeping it 100


Another popular type of machine learning model is called a “decision tree”. This method is quite intuitive. If we wanted to build a classifier for deciding whether a self-driving car should pull over or keep driving, we might construct the following decision tree:

Using a decision tree for classification is simple: start at the root, answer the question at each node, and follow the matching branch until you reach a leaf, which gives the decision. The harder part is constructing the decision tree from training data. During training, a metric scores how well each variable splits the data, and the variables that split the data better are placed higher up in the tree.
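As a minimal sketch of the classification step, a tree can be stored as nested dictionaries and walked from the root to a leaf. The tree below is hypothetical: the attribute names (`sirens`, `engine_warning`) and the outcomes on each branch are invented for illustration and are not taken from the tree in this post.

```python
# Hypothetical tree: attribute names and branch outcomes are invented
# for illustration, not copied from the post's figure.
TREE = {
    "attribute": "sirens",
    "branches": {
        True: "pull over",
        False: {
            "attribute": "engine_warning",
            "branches": {True: "pull over", False: "keep driving"},
        },
    },
}


def classify(tree, example):
    """Follow the branch matching the example at each node until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node


decision = classify(TREE, {"sirens": False, "engine_warning": False})
```

Internal nodes are dictionaries and leaves are plain strings, so the loop stops as soon as it reaches a decision.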

A common metric used for this is called entropy. Entropy has the following formula:
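For a variable whose $c$ possible values occur with probabilities $p_1, \dots, p_c$:

$$E = -\sum_{i=1}^{c} p_i \log_{10} p_i$$

This is the standard Shannon entropy, where a term with $p_i = 0$ is taken to contribute 0. The base-10 logarithm is an assumption inferred from the values quoted in this post (0.301 and 0.602); base 2 is also common and only rescales the result.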

Conceptually, entropy measures the amount of uncertainty, or surprise, in the value of a variable. For example, a variable that takes value A 100% of the time and value B 0% of the time has entropy 0, which makes sense because there is no uncertainty about its outcome. By contrast, a variable that takes value A 50% of the time and value B 50% of the time has entropy 0.301 (using a base-10 logarithm), which is higher. If the variable could take 4 possible values, each with 25% probability, the entropy would be even higher: 0.602. With a base-2 logarithm the same three cases give 0, 1, and 2 bits.
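The three cases above can be checked with a few lines of Python. This is a sketch of the standard entropy calculation; the function name `entropy` and its `base` parameter are choices made here, and base 10 is assumed because it matches the numbers in the text.

```python
import math


def entropy(probabilities, base=10):
    """Shannon entropy of a discrete distribution; zero-probability terms contribute 0."""
    return sum(-p * math.log(p, base) for p in probabilities if p > 0)


print(round(entropy([1.0, 0.0]), 3))  # 0.0
print(round(entropy([0.5, 0.5]), 3))  # 0.301
print(round(entropy([0.25] * 4), 3))  # 0.602
```

Passing `base=2` instead reports the same quantities in bits.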

Let’s look at how entropy is used to construct a decision tree. We train the decision tree classifier with the following training data (assume these are decisions that other drivers have made while driving):

To select the root decision node, we choose the attribute with the highest gain. Gain is the amount of information gained after splitting a dataset by a certain attribute, and is defined as:
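$$\mathrm{Gain}(T, X) = E(T) - E(T, X), \qquad E(T, X) = \sum_{v} P(X = v)\, E(T \mid X = v)$$

Here $T$ is the target (pull over or not), $X$ is the candidate attribute, $E(T)$ is the entropy of the target over the whole dataset, and $E(T, X)$ is the weighted average entropy of the target within each value $v$ of $X$. This is the standard information-gain definition; the symbols $T$, $X$, and $v$ are notation assumed here.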

Let’s calculate the gain for each of our 3 potential decision variables:

We first need to calculate the entropy of pulling over:

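Now, we calculate the entropy of pulling over after splitting on sirens, that is, the weighted average of the entropy of pulling over within each value of sirens: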

The gain for pulling over and sirens is then:
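The three steps above (entropy of the target, entropy after splitting on an attribute, and their difference) can be sketched as follows. The rows are made up for illustration and are NOT the training table from this post; the column names `sirens`, `raining`, and `pull_over` and the helper names are assumptions.

```python
import math
from collections import Counter


def entropy(labels, base=10):
    """Entropy of a list of class labels, computed from their relative frequencies."""
    total = len(labels)
    return sum(-(n / total) * math.log(n / total, base)
               for n in Counter(labels).values())


def split_entropy(rows, attribute, target):
    """Weighted average entropy of the target within each value of the attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def gain(rows, attribute, target):
    """Information gain: target entropy before the split minus entropy after it."""
    return entropy([row[target] for row in rows]) - split_entropy(rows, attribute, target)


# Made-up rows, not the post's data.
ROWS = [
    {"sirens": True, "raining": False, "pull_over": True},
    {"sirens": True, "raining": True, "pull_over": True},
    {"sirens": False, "raining": True, "pull_over": False},
    {"sirens": False, "raining": False, "pull_over": False},
]

print(round(gain(ROWS, "sirens", "pull_over"), 3))   # 0.301
print(round(gain(ROWS, "raining", "pull_over"), 3))  # 0.0
```

In this toy data `sirens` separates the two outcomes perfectly, so splitting on it removes all uncertainty, while `raining` tells us nothing about the outcome.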

We calculate the gain of the other two variables in the same way:

We see that the variable with the largest gain is whether we hear sirens, so we select this variable as the top (root) node of the decision tree.

To construct the rest of the decision tree, we divide the dataset based on whether we hear sirens or not, and we repeat the process from there.
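The "split and repeat" procedure can be sketched as a short recursive function. This is one common way to do it (essentially the ID3 approach), not necessarily the exact procedure the author has in mind. The rows are made up and are NOT the post's training data, and the names `build_tree`, `sirens`, `raining`, and `pull_over` are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-10 entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log10(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Target entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain; predict the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the largest gain, then recurse on each subset.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Made-up rows, not the post's data.
ROWS = [
    {"sirens": True, "raining": False, "pull_over": True},
    {"sirens": True, "raining": True, "pull_over": True},
    {"sirens": True, "raining": False, "pull_over": True},
    {"sirens": False, "raining": True, "pull_over": True},
    {"sirens": False, "raining": False, "pull_over": False},
    {"sirens": False, "raining": False, "pull_over": False},
]

tree = build_tree(ROWS, ["sirens", "raining"], "pull_over")
```

On this toy data the root is `sirens`; the branch with no sirens is not yet pure, so it is split again on `raining`.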