
# Decision Trees Part 3: Pruning your Tree

Ok, last time we learned how to automatically grow a tree, using a greedy algorithm to choose splits that maximise a given ‘metric’.

The problem is that the trees become huge and undoubtedly overfit to our data, meaning that they will generalize poorly to unseen data. This is the age-old battle between signal and noise, where we have to build a level of fuzziness into the model – pixelating the trees to see the forest.

So first we come up with a way of measuring what we want to achieve, then we find an algorithm to optimise that quantity.

Let’s introduce some terminology and notation. For some model built on training data, the resubstitution error is the error (for some notion of error) on the training data. For a tree $T$ we may write
$R(T)$ for this quantity.

In our case we will use the misclassification rate for our error, since this is how predictions are scored in the Kaggle competition on the Titanic data.

Now we quantify tree complexity: write $|T|$ for the number of leaves. There is some balance to be struck between these quantities, which we will call the ‘$\alpha$-cost’ of the tree, given by
$R_{\alpha}(T) = R(T) + \alpha |T|$.

So $\alpha$ is the price we pay for adding an extra leaf, which must be paid for in reduced training error.

So given our tree $T$ and $\alpha$, we seek a subtree $T^*$ that minimises $R_{\alpha}$, or at least one that gives us a small value (because there are a lot of subtrees).
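To get a feel for the trade-off before touching any tree code, here is a toy calculation (all numbers invented, not from the Titanic tree) comparing the $\alpha$-cost of a large tree with that of a heavily pruned one:

```python
def alpha_cost(training_error, num_leaves, alpha):
    # R_alpha(T) = R(T) + alpha * |T|
    return training_error + alpha * num_leaves

big = (0.10, 40)    # low training error, many leaves
small = (0.16, 8)   # higher training error, far fewer leaves

for alpha in (0.0, 0.001, 0.005):
    winner = 'big' if alpha_cost(*big, alpha) < alpha_cost(*small, alpha) else 'small'
    print(alpha, winner)
# 0.0 big, 0.001 big, 0.005 small
```

At $\alpha=0$ only training error matters, so the big tree wins; once each leaf costs enough, the small tree overtakes it.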

One thing we can try is to look at each node that is not a leaf, fuse it, and calculate the resulting cost; we then choose the node (if any) which gives the best improvement in tree cost. If none of them gives an improvement, we’re done; otherwise we repeat the process.

Ok let’s get coding. First we want to be able to measure the error rate at a node. I call this the conditional error rate, because it assumes that you are already in the node. Since we will be calculating this quite a lot, I have cached it on the node so it won’t be recalculated when called again.

def conditional_error_rate(self, node, new_data=False):
    if node.error is None:
        if new_data:
            node.error = node.local_data[self.response].value_counts()
            if node.predicted_class in node.error.keys():
                node.error = node.error[node.predicted_class]/float(node.size)
                node.error = 1 - node.error
            else:
                node.error = 1
        else:
            node.error = 1 - node.predicted_prob
    return node.error

In the training or resubstitution case, where new_data=False, the misclassification rate is just one minus the predicted probability of the predicted class. In the case we want to look at later, where after we have built our tree we look at new data, we need to know how often the predicted class appears in the new data. If the predicted class is not in the new data at all, the misclassification rate is total: 1.

That was the conditional error rate, but what contribution does a node make to the overall error rate? To answer that we need to know the probability of being at that node, then multiply the conditional rate by that probability.

def node_prob(self, node):
    return node.size/float(self.data_size)

def local_error_rate(self, node, new_data=False):
    return self.conditional_error_rate(node, new_data)*self.node_prob(node)
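As a standalone sanity check (outside the tree class, with made-up labels), the two quantities look like this:

```python
from collections import Counter

node_labels = [1, 1, 1, 0, 0]   # the 5 training rows that reached this node
total_rows = 20                 # size of the whole training set

counts = Counter(node_labels)
predicted_class, majority = counts.most_common(1)[0]   # majority-vote prediction

conditional_error = 1 - majority / len(node_labels)  # error given we reached the node
node_probability = len(node_labels) / total_rows     # chance of reaching the node
local_error = conditional_error * node_probability
print(predicted_class, conditional_error, local_error)  # 1 0.4 0.1
```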

So local_error_rate tells us the error a node contributes if we have fused at that node. To compare with not fusing, we look at the subtree rooted at that node and calculate the sum of its leaves’ local errors.

def node_error(self, node):
    return sum(self.local_error_rate(leaf) for leaf in self.get_node_leaves(node))

def get_node_leaves(self, node):  # get leaves of subtree at a node
    leaves = set([])
    for descendant in self.node_iter_down(node):
        if descendant in self.leaves:
            leaves.add(descendant)
    return leaves

def node_iter_down(self, base, first=True):  # iterates top to bottom over the subtree at a node
    if first:
        yield base
    if base.children is None:
        return
    for child in base.children:
        yield child
        for node in self.node_iter_down(child, first=False):
            yield node

Great! Now if we want to know $R(T)$ we just sum all of the local errors at the leaves.

def error(self, new_data=False):
    return sum(self.local_error_rate(leaf, new_data) for leaf in self.leaves)
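A quick standalone check of that identity (numbers invented): $R(T)$ is just the size-weighted sum of the leaves’ conditional error rates.

```python
# (leaf size, conditional error rate) for each of three leaves,
# out of 20 training rows in total
leaf_stats = [(8, 0.125), (7, 0.0), (5, 0.4)]
total_rows = 20

# each leaf contributes node_prob * conditional_error to R(T)
R_T = sum(size / total_rows * cond_err for size, cond_err in leaf_stats)
# (8*0.125 + 7*0.0 + 5*0.4) / 20 = 3/20 = 0.15
```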

This is looking promising. Now what we need to do is find a way of iterating through each possible prune and calculating the change in tree cost. The good thing about the change in cost is that we only need to know what’s going on in the subtree rooted at the node: fusing removes $\alpha$ times the number of subtree leaves minus one from the complexity term, but adds the difference between the node’s local error rate and the subtree’s error to the error term.

The change in cost by fusing a node is given by

def node_cost(self, node, alpha):
    fused_error = self.local_error_rate(node)
    unfused_error = self.node_error(node)
    number_leaves_lost = len(self.get_node_leaves(node))-1
    error_diff = fused_error - unfused_error
    return error_diff - alpha*number_leaves_lost
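Plugging in some hypothetical numbers shows how $\alpha$ tips the decision (this is the same arithmetic as the method above, pulled out as a free function for illustration):

```python
def node_cost(fused_error, unfused_error, leaves_lost, alpha):
    # change in alpha-cost if we fuse this node
    return (fused_error - unfused_error) - alpha * leaves_lost

# Fusing this node would raise its error contribution from 0.10 to 0.12,
# but delete 4 of its subtree's 5 leaves.
print(node_cost(0.12, 0.10, 4, 0.001))  # positive: fusing isn't worth it
print(node_cost(0.12, 0.10, 4, 0.01))   # negative: fuse
```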

And now we iterate over possible prunes:

def get_best_prune(self, alpha):
    best_node_cost, best_node = min(self.iter_prune_cost(alpha),
                                    key=lambda x: x[0])
    if best_node_cost < 0:
        return best_node
    else:  # If no good prunes then we return zip.
        return None

def iter_prune_cost(self, alpha):
    check = True
    for node in self.vertices:
        if node not in self.leaves:
            check = False
            yield [self.node_cost(node, alpha), node]
    if check:  # This is a base case: if the only node left is the root, then no more can be done.
        yield [0, self.get_root()]

Having laid the groundwork, the final method is simplicity itself.

def prune_tree(self, alpha):
    best_prune = self.get_best_prune(alpha)
    if best_prune is None:
        return
    else:
        self.fuse_vertex(best_prune)
        self.prune_tree(alpha)
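To see the whole loop end to end, here is a self-contained miniature on a toy tree. It uses nested tuples instead of the tree class above, and makes the fuse-or-keep decision bottom-up rather than best-first (a simplification, but the per-node test is exactly the one node_cost computes):

```python
# Toy representation: a leaf is ('leaf', local_error); an internal node is
# ('node', local_error_if_fused, [children]). All error numbers are made up.
def leaves(t):
    # collect the leaves of the subtree rooted at t
    if t[0] == 'leaf':
        return [t]
    return [l for child in t[2] for l in leaves(child)]

def prune(t, alpha):
    # prune the children first, then decide whether to fuse this node
    if t[0] == 'leaf':
        return t
    t = ('node', t[1], [prune(child, alpha) for child in t[2]])
    subtree_leaves = leaves(t)
    error_diff = t[1] - sum(l[1] for l in subtree_leaves)  # fused minus unfused
    leaves_lost = len(subtree_leaves) - 1
    if error_diff - alpha * leaves_lost < 0:  # same test as node_cost
        return ('leaf', t[1])                 # fusing is cheaper: become a leaf
    return t

tree = ('node', 0.30, [('leaf', 0.05),
                       ('node', 0.12, [('leaf', 0.06), ('leaf', 0.04)])])
print(len(leaves(prune(tree, 0.001))))  # 3: leaves are too cheap to prune
print(len(leaves(prune(tree, 0.05))))   # 2: the inner node gets fused
```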

And that is that! Let’s give it a whirl. I’ve added a method called train to the class, which you can see in the code I have put up; this allows us to pass a dictionary of parameters to build the tree. You can also see that I have separated out the regression and classification trees into their own subclasses. First we try $\alpha=0$, meaning do no pruning.

import cleantitanic as ct

df = ct.cleaneddf()[0]
df = df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
         'fare', 'embarked']]
data_type_dict = {'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
                  'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
                  'fare':'ordinal', 'embarked':'nominal'}
g = ClassificationTree()
parameters = {'min_node_size':5, 'max_node_depth':20,
              'data_type_dict':data_type_dict, 'threshold':0,
              'response':'survived'}
parameters['alpha'] = 0
parameters['metric_kind'] = 'Gini'
g.train(data=df, data_type_dict=data_type_dict, parameters=parameters)
g.plot()

Which produces this monstrosity.

Yikes! Let’s see it with $\alpha=0.001$.

And $\alpha=0.00135$.

And finally  $\alpha=0.002$.

So you can see that the tree you get depends entirely on $\alpha$, so which one should you use? We will attempt to answer that question next time, where we will discuss cross-validation and OOB (out-of-bag) resampling.

Here is the code I have been promising, it also includes the means to test unseen data, which we will discuss next time.