
Stephen Cheng

What Are The Key Parameters of Tree Modelling and How to Avoid Over-fitting in Decision Trees?

 


Intro

Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:

1) Setting constraints on tree size.
2) Tree pruning.

Let’s discuss both of these briefly.
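Before doing so, here is a minimal sketch of the claim above, assuming scikit-learn and one of its built-in toy datasets (neither is required by the rest of this post): an unconstrained tree fits its training data perfectly.

```python
# A minimal sketch: an unconstrained tree memorizes its training data.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)  # no depth or size limits
tree.fit(X, y)
print(tree.score(X, y))  # typically 1.0 on the training set
```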

Setting Constraints on Tree Size

This can be done with the various parameters that are used to define a tree. First, recall the general structure of a decision tree: a root node is split into decision nodes, which are split further until the branches end in terminal nodes (leaves).

The parameters used for defining a tree are explained below. They are described irrespective of tool; what matters is understanding their role in tree modeling. Equivalents are available in both R and Python.

Minimum samples for a node split

1) Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
2) Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
3) Too high values can lead to under-fitting; hence this parameter should be tuned using CV, as in the sketch below.
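In scikit-learn this constraint is exposed as min_samples_split. A hedged sketch of tuning it with cross-validation (the dataset and the grid values are placeholders, not recommendations):

```python
# Sketch: tune min_samples_split with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_split": [2, 10, 50, 100, 200]},  # placeholder grid
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```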

Minimum samples for a terminal node (leaf)

1) Defines the minimum samples (or observations) required in a terminal node or leaf.
2) Used to control over-fitting similar to min_samples_split.
3) Generally lower values should be chosen for imbalanced class problems, because the regions in which the minority class is in the majority will be very small; a quick illustration follows.
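A brief illustration, again assuming scikit-learn's min_samples_leaf with arbitrary values: larger leaves give a smaller, coarser tree.

```python
# Sketch: larger min_samples_leaf values give fewer, coarser leaves.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for leaf in (1, 5, 25):  # placeholder values
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    print(leaf, tree.get_n_leaves(), tree.score(X, y))
```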

Maximum depth of tree (vertical depth)

1) The maximum depth of a tree.
2) Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
3) Should be tuned using CV, as in the sketch below.
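A sketch of such tuning, assuming scikit-learn and an illustrative range of depths:

```python
# Sketch: pick the depth with the best cross-validated score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (2, 4, 6, 8, None):  # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    print(depth, cross_val_score(tree, X, y, cv=5).mean())
```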

Maximum number of terminal nodes

1) The maximum number of terminal nodes or leaves in a tree.
2) Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves, as the sketch below demonstrates.
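The sketch below, assuming scikit-learn, shows the two caps side by side: a tree limited to depth 3 and a tree limited to 2^3 = 8 leaves are both capped at 8 leaves.

```python
# Sketch: max_leaf_nodes as an alternative cap to max_depth.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

by_depth = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
by_leaves = DecisionTreeClassifier(max_leaf_nodes=2 ** 3, random_state=0).fit(X, y)
print(by_depth.get_n_leaves(), by_leaves.get_n_leaves())  # both are at most 8
```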

Maximum features to consider for split

1) The number of features to consider while searching for the best split. These will be randomly selected.
2) As a rule of thumb, the square root of the total number of features works well, but we should check values up to 30-40% of the total number of features.
3) Higher values can lead to over-fitting, but this depends on the case; a tuning sketch follows.
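A comparison sketch, assuming scikit-learn, whose max_features accepts 'sqrt' or a float fraction of the features (the candidate values are placeholders):

```python
# Sketch: compare the square-root rule with fixed fractions of features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for mf in ("sqrt", 0.3, 0.4, None):  # None = consider all features
    tree = DecisionTreeClassifier(max_features=mf, random_state=0)
    print(mf, cross_val_score(tree, X, y, cv=5).mean())
```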

Tree Pruning

As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping conditions is reached. Let’s consider the following case when you’re driving.

There are 2 lanes:

1) A lane with cars moving at 80 km/h.
2) A lane with trucks moving at 30 km/h.

At this instant, suppose you are driving in the faster lane with 2 cars just ahead of you, and you have 2 choices:

1) Take a left and overtake the 2 cars quickly.
2) Keep moving in the present lane.

Let’s analyze these choices. With the former choice, you’ll immediately overtake the car ahead, reach the back of the truck, and start moving at 30 km/h while looking for an opportunity to move back right. All the cars originally behind you move ahead in the meanwhile. This would be the optimal choice if your objective is to maximize the distance covered in the next, say, 10 seconds. Greedy you! With the latter choice, you sail through at the same speed, cross the trucks, and then overtake depending on the situation ahead.

This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won’t see the truck ahead and will adopt the greedy approach of taking a left. If we use pruning, on the other hand, we in effect look a few steps ahead and make the choice.

How to prune?

So we know pruning is better. But how do we implement it in a decision tree? The idea is simple:

1) We first grow the decision tree to a large depth.
2) Then we start at the bottom and remove splits that give us negative returns overall when viewed from the top.
3) Suppose a split gives us a gain of, say, -10 (a loss of 10), and the next split on it gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning we see that the overall gain is +10 and keep both splits. A toy sketch of this idea follows the list.
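The sketch below is only an illustration of that idea, not any library's API; the Node class and the gain values are invented. A split is kept only if the total gain of the subtree below it is positive.

```python
# Toy post-pruning: drop a subtree when its total gain is negative.
class Node:
    def __init__(self, gain=0.0, children=None):
        self.gain = gain                # gain of the split at this node
        self.children = children or []  # sub-splits beneath it

def subtree_gain(node):
    """Total gain of this split plus everything below it."""
    return node.gain + sum(subtree_gain(c) for c in node.children)

def prune(node):
    """Bottom-up: keep only child subtrees whose overall gain is positive."""
    node.children = [prune(c) for c in node.children if subtree_gain(c) > 0]
    return node

# The example from the text: a split with gain -10 followed by one with gain +20.
root = Node(gain=5, children=[Node(gain=-10, children=[Node(gain=20)])])
prune(root)
print(len(root.children))  # 1 -- the -10 split survives because -10 + 20 = +10
```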

Note

Note that sklearn’s decision tree classifier does not currently support pruning. Advanced packages like xgboost have adopted tree pruning in their implementation. The rpart library in R, however, provides a function to prune. Good for R users!
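For Python users, xgboost exposes this behaviour mainly through its gamma parameter, the minimum loss reduction required to keep a split once the tree has been grown. A rough sketch, assuming the xgboost package is installed and using placeholder values:

```python
# Sketch: xgboost grows trees to max_depth, then prunes back splits
# whose loss reduction falls below gamma.
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

model = XGBClassifier(max_depth=6, gamma=1.0, n_estimators=50)  # placeholder values
model.fit(X, y)
print(model.score(X, y))
```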

— Jun 6, 2018

