Ok last time we learned how to automatically grow a tree, using a greedy algorithm to choose splits that maximise a given ‘metric’.

The problem is that the trees become huge and undoubtedly overfit to our data, meaning that they will generalize poorly to unseen data. This is the age-old battle between signal and noise, where we have to build a level of fuzziness into the model – pixelating the trees to see the forest.

So first we come up with a way of measuring what we want to achieve, then we find an algorithm to optimise that quantity.

Let’s introduce some terminology and notation. For some model built on training data, the resubstitution error is the error (for some notion of error) on the training data. For a tree $T$ we may write

$R(T)$

for this quantity.

In our case we will use the misclassification rate for our error, since in the Kaggle competition on the Titanic data this is how the predictions are scored.

Now we quantify tree complexity: write $|T|$ for the number of leaves of $T$. There is a balance to be struck between these two quantities, which we will call the ‘$\alpha$-cost’ of the tree, given by

$R_\alpha(T) = R(T) + \alpha|T|$.

So $\alpha$ is the price we pay for adding an extra leaf, which must be paid for in reduced training error.
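To make the trade-off concrete, here is the cost computed on made-up numbers (a hypothetical tree, not one from the Titanic data):

```python
# Made-up numbers for illustration: a tree with resubstitution error
# R(T) = 0.15 and |T| = 8 leaves, priced at alpha = 0.01 per leaf.
alpha = 0.01
resubstitution_error = 0.15
num_leaves = 8

# alpha-cost: R(T) + alpha * |T| = 0.15 + 0.08 = 0.23
alpha_cost = resubstitution_error + alpha * num_leaves
```

A subtree with, say, 5 leaves would only be preferred if its training error rises by less than $0.01 \times 3 = 0.03$.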

So given our tree and $\alpha$, we seek a subtree $T^*$ that minimises $R_\alpha$, or at least one that gives us a small value (because there are a lot of subtrees).

One thing we can try is this: look at each node that is not a leaf, fuse it (collapse the subtree below it into a single leaf), and calculate the resulting cost. Then choose the node (if any) which gives the best improvement in tree cost. If none of them gives an improvement, we’re done; otherwise we fuse that node and repeat the process.

Ok, let’s get coding. First we want to be able to measure the error rate at a node. I call this the conditional error rate, because it assumes that you have already arrived at the node. Since we will be calculating this quite a lot, I cache it on the node so it won’t be recalculated when called again.

```python
def conditional_error_rate(self, node, new_data=False):
    if node.error == None:
        if new_data:
            node.error = node.local_data[self.response].value_counts()
            if node.predicted_class in node.error.keys():
                node.error = node.error[node.predicted_class]/float(node.size)
                node.error = 1 - node.error
            else:
                node.error = 1
        else:
            node.error = 1 - node.predicted_prob
    return node.error
```

In the training or resubstitution case, where new_data=False, the misclassification rate is just one minus the predicted class probability. In the case we want to look at later, where after building the tree we evaluate on new data, we need to know how often the predicted class appears in the new data at that node. If the predicted class does not appear in the new data at all, the misclassification rate is total: 1.
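To see the new-data branch in isolation, here is the same calculation on a toy, hypothetical node (the Series stands in for node.local_data[self.response]):

```python
import pandas as pd

# Toy stand-in for the responses of the new data that reach one node.
new_responses = pd.Series(['survived', 'survived', 'died'])
counts = new_responses.value_counts()

predicted_class = 'survived'   # hypothetical: what the node predicts
node_size = len(new_responses)

if predicted_class in counts.keys():
    # 2 of 3 rows match the prediction, so the error is 1 - 2/3.
    error = 1 - counts[predicted_class] / float(node_size)
else:
    # Predicted class absent from the new data: every prediction is wrong.
    error = 1
```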

That was the conditional error rate, but what contribution does a node make to the overall error rate? To answer that we need to know the probability of being at that node, then multiply the conditional rate by that probability.
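With made-up numbers, the weighting looks like this:

```python
# Hypothetical node: 50 of 891 training rows reach it,
# and its conditional error rate is 0.2.
node_prob = 50 / 891.0
conditional_error = 0.2

# The node's contribution to the overall error rate.
local_error = conditional_error * node_prob
```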

```python
def node_prob(self, node):
    return node.size/float(self.data_size)

def local_error_rate(self, node, new_data=False):
    return self.conditional_error_rate(node, new_data)*self.node_prob(node)
```

So local_error_rate tells us the error a node would contribute if we fused the tree at that node. To compare with not fusing, we look at the subtree rooted at that node and sum the local errors of its leaves.

```python
def node_error(self, node):
    return sum(self.local_error_rate(leaf)
               for leaf in self.get_node_leaves(node))

def get_node_leaves(self, node):
    #Gets the leaves of the subtree rooted at a node.
    leaves = set([])
    for descendant in self.node_iter_down(node):
        if descendant in self.leaves:
            leaves.add(descendant)
    return leaves

def node_iter_down(self, base, first=True):
    #Iterates top to bottom over the subtree rooted at a node.
    if first:
        yield base
    if base.children == None:
        return
    for child in base.children:
        yield child
        for node in self.node_iter_down(child, first=False):
            yield node
```

Great! Now if we want to know $R(T)$ we just sum all of the local errors at the leaves.

```python
def error(self, new_data=False):
    return sum(self.local_error_rate(leaf, new_data)
               for leaf in self.leaves)
```

This is looking promising. Now what we need to do is find a way of iterating through each possible prune and calculating the change in tree cost. The good thing about the change in cost is that we only need to know what’s going on in the subtree rooted at a node: by fusing we subtract $\alpha$ times (the number of leaves of the subtree minus one) from the cost, but add the difference between the node’s fused error (its local error rate) and its unfused error (the sum over the subtree’s leaves).

The change in cost by fusing a node is given by

```python
def node_cost(self, node, alpha):
    fused_error = self.local_error_rate(node)
    unfused_error = self.node_error(node)
    number_leaves_lost = len(self.get_node_leaves(node)) - 1
    error_diff = fused_error - unfused_error
    return error_diff - alpha*number_leaves_lost
```
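Plugging in made-up numbers shows how the two terms trade off (these values are purely illustrative):

```python
# Hypothetical prune: fusing a node whose subtree has 4 leaves, alpha = 0.01.
alpha = 0.01
fused_error = 0.10        # local error rate if the node becomes a leaf
unfused_error = 0.08      # sum of the subtree's leaf errors
number_leaves_lost = 4 - 1

error_diff = fused_error - unfused_error          # training error gets 0.02 worse
cost_change = error_diff - alpha * number_leaves_lost
# 0.02 - 0.03 = -0.01: negative, so this prune lowers the alpha-cost.
```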

And now we iterate over possible prunes:

```python
def get_best_prune(self, alpha):
    best_node_cost, best_node = min(self.iter_prune_cost(alpha),
                                    key=lambda x: x[0])
    if best_node_cost < 0:
        return best_node
    else:
        #If no good prunes then we return zip.
        return None

def iter_prune_cost(self, alpha):
    check = True
    for node in self.vertices:
        if not node in self.leaves:
            check = False
            yield [self.node_cost(node, alpha), node]
    if check:
        #This is a base case: if the only node left is the root,
        #then no more can be done.
        yield [0, self.get_root()]
```

Having laid the groundwork, the final method is simplicity itself.

```python
def prune_tree(self, alpha):
    best_prune = self.get_best_prune(alpha)
    if best_prune == None:
        return
    else:
        self.fuse_vertex(best_prune)
        self.prune_tree(alpha)
```

And that is that! Let’s give it a whirl. I’ve added a method called train to the class, which you can see in the code I have put up; this allows us to pass a dictionary of parameters to build the tree. You can also see that I have separated out the regression and classification trees into their own subclasses. First we try $\alpha=0$, meaning do no pruning.

```python
import cleantitanic as ct

df = ct.cleaneddf()[0]
df = df[['survived', 'pclass', 'sex', 'age', 'sibsp',
         'parch', 'fare', 'embarked']]
data_type_dict = {'survived': 'nominal', 'pclass': 'ordinal',
                  'sex': 'nominal', 'age': 'ordinal',
                  'sibsp': 'ordinal', 'parch': 'ordinal',
                  'fare': 'ordinal', 'embarked': 'nominal'}

g = ClassificationTree()
parameters = {'min_node_size': 5, 'max_node_depth': 20,
              'data_type_dict': data_type_dict, 'threshold': 0,
              'response': 'survived'}
parameters['alpha'] = 0
parameters['metric_kind'] = 'Gini'
g.train(data=df, data_type_dict=data_type_dict, parameters=parameters)
g.plot()
```

Which produces this monstrosity.

Yikes! Let’s see it with a larger $\alpha$.

So you can see that the tree you get depends entirely on $\alpha$, so which one should you use? We will attempt to answer that question next time, when we discuss cross-validation and OOB (out-of-bag) resampling.

Here is the code I have been promising, it also includes the means to test unseen data, which we will discuss next time.

###### Related articles

- Decision Trees Part 1: Representation (triangleinequality.wordpress.com)
- Decision Trees Part 2: Growing Your Tree (triangleinequality.wordpress.com)
- Special Graphs and Dynamic Inheritance (triangleinequality.wordpress.com)
