
Decision Trees Part 2: Growing Your Tree

Ok so last time we looked at what a Decision Tree was, and how to represent one in Python using our DecisionNode, DecisionTree, PivotDecisionNode and PivotDecisionTree classes. We also built a very basic one for the Titanic Kaggle data with a single split on ‘sex’.

As might have been apparent even in that single example, building decision trees by hand is tedious, and what’s worse is that you have to guess what attributes and pivots to use. What we need is some automation.

Thinking aloud (so to speak) we want a new method for PivotDecisionTree that grows a particular node. Let’s just try and write such a method and fill it out as we go along.

    def grow_node(self, node):
        if node.parent==None:
            node.local_data = self.data #the root works with all of the data
        if self.stopping_condition(node):
            return
        else:
            try:
                best_split = self.get_best_split(node)
            except StopIteration: #no splits were possible
                return
            self.split_vertex(node, split_attribute=best_split[1],
                              pivot=best_split[2])
            for child in node.children:
                self.grow_node(child)

The idea is simple, start with a node, if some stopping condition is achieved, do nothing, else find the best attribute to split on, and the best pivot for that attribute, and split on it. This is a greedy algorithm idea.  Then repeat the process with each of the children.

Now how to find the best split.

    def get_best_split(self, node):
        gen = self.iter_split_eval(node)
        best_split = next(gen) #raises StopIteration if no split is possible
        for split in gen:
            if split[0] < best_split[0]:
                best_split = split
        return best_split

So this method iterates through the evaluations of the possible splits, and returns the lowest one, assuming lowest is what we want. Now we fill in the next function we haven’t defined.

    def iter_split_eval(self, node):
        for split in self.iter_split(node):
            if node.children==None: #the trial split produced no children
                continue
            else:
                for child in node.children:
                    child.size = len(child.local_data) #sizes weight the purities
                ret = [self.node_purity(node),
                       node.split_attribute, node.pivot]
                yield ret

    def iter_split(self, node):
        for attribute in self.data.columns:
            if attribute != self.response:
                for pivot in self.get_pivots(node.local_data, attribute):
                    self.split_vertex(vertex=node, pivot=pivot,
                                      split_attribute=attribute)
                    yield node

These are generator functions, since we use the keyword yield; this allows us to iterate over the splits without creating the whole list all at once.

We try each attribute, and each pivot, split on it, and report our evaluation of that split using something called ‘node_purity’ which takes a metric as an argument. The metric will be what controls how we evaluate a split.

First we need a way of iterating over all possible splits. Observe that there are two cases: nominal data, where there are 2^{k-1}-1 possible binary splits if there are k labels; and ordinal data, where there are k-1 splits. In the first case the pivot is a set, in the second it's a number.
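As a sanity check on those counts, we can enumerate them directly for small k. This is a standalone sketch (the function names are mine, not part of our tree classes):

```python
from itertools import combinations

def count_nominal_splits(labels):
    # each binary split is an unordered pair (subset, complement), so we
    # count non-empty proper subsets and halve: this gives 2^(k-1) - 1
    k = len(labels)
    subsets = [set(c) for r in range(1, k) for c in combinations(labels, r)]
    return len(subsets) // 2

def count_ordinal_splits(values):
    # one split per distinct value except the largest: k - 1 in total
    return len(set(values)) - 1

print(count_nominal_splits(['a', 'b', 'c']))  # 3, i.e. 2^(3-1) - 1
print(count_ordinal_splits([1, 2, 3, 4]))     # 3, i.e. 4 - 1
```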

So let’s try something like:

    #requires: from itertools import combinations
    def get_pivots(self, data, attribute):
        if self.data_type_dict[attribute]=='ordinal':
            max_pivot = max(data[attribute].unique())
            for pivot in data[attribute].unique():
                if pivot < max_pivot:
                    yield pivot
        elif self.data_type_dict[attribute]=='nominal':
            values = data[attribute].unique()
            n = len(values)
            if n<=1:
                return #nothing to split on
            for r in range(1, n//2 + 1):
                for pivot in combinations(values, r):
                    yield set(pivot) #we find all subsets of a certain size, but we only need to go
                                     #up to about half way, otherwise we duplicate

This is another generator, which separates the two cases we discussed above. In the 'nominal' case it makes use of a handy function, combinations, from the itertools module to produce all subsets of a given size.
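Stripped of the class machinery, the pivot enumeration can be tried out on a toy column. This is a standalone sketch using a plain list in place of a pandas column:

```python
from itertools import combinations

def pivots(values, data_type):
    values = sorted(set(values))
    if data_type == 'ordinal':
        for pivot in values[:-1]:  # every distinct value except the largest
            yield pivot
    elif data_type == 'nominal':
        n = len(values)
        # subsets up to about half the labels, since a subset and its
        # complement describe the same split (for even n this still
        # double-counts at r = n/2, as in the method above)
        for r in range(1, n // 2 + 1):
            for pivot in combinations(values, r):
                yield set(pivot)

print(list(pivots([1, 3, 2], 'ordinal')))        # [1, 2]
print(list(pivots(['C', 'Q', 'S'], 'nominal')))  # [{'C'}, {'Q'}, {'S'}]
```

For three port labels we get exactly the 2^{3-1}-1 = 3 splits predicted above: each singleton against the other two.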

Also notice that we need to tag our data columns with ‘ordinal’ or ‘nominal’ at some point.

Ok, so now that we've sorted out the logistics, we need to say how we want to evaluate splits. What we want in a split is for the data to be less mixed up with respect to the response.

In the classification case there are two main metrics we can use: Gini and Entropy.
Suppose our response can take k values: v_1, \dots , v_k , then we can associate to our data a probability vector p=(p_1, \dots ,p_k) that is the proportions of each response in the dataset.

The Gini metric tells us the probability of a random datum being mislabelled if we assign labels according to the probability vector : G(p) = \sum \limits _{i=1}^k p_i(1-p_i).

Entropy tells us, given that we know a data point is in the node, what is our expected information gain (in bits) of finding out what its response is:
E(p) = - \sum \limits _{i=1}^k p_i \log_2 p_i .

By convention we write 0 \log _2(0) = 0.

Notice that both of these metrics are maximised when the data is perfectly mixed up (or impure), that is p_i=p_j \ \forall i,j, and minimised when the node is not mixed at all (pure), that is \exists i: p_i=1.

So we are seeking the latter case: we want the data at a node to be mostly in a single class.

These are very easy to implement:

    def Gini(self, prob_vector):
        return sum(p*(1-p) for p in prob_vector)

    #requires: from math import log
    def Entropy(self, prob_vector):
        def entropy_summand(p):
            if p==0:
                return 0
            else:
                return -p*log(p, 2)
        return sum(entropy_summand(p) for p in prob_vector)
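As a quick check that both metrics behave as claimed at the extremes, here are standalone versions evaluated on a perfectly mixed node and a pure one:

```python
from math import log

def gini(prob_vector):
    return sum(p * (1 - p) for p in prob_vector)

def entropy(prob_vector):
    # skipping p == 0 implements the convention 0*log2(0) = 0
    return sum(-p * log(p, 2) for p in prob_vector if p > 0)

print(gini([0.5, 0.5]))     # 0.5 -- maximally mixed, metric maximised
print(entropy([0.5, 0.5]))  # 1.0
print(gini([1.0, 0.0]))     # 0.0 -- pure node, metric minimised
print(entropy([1.0, 0.0]))  # 0.0
```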

Now in the regression case we can measure mixed-upness using the residual sum of squares (RSS): this is the sum of the squared errors at a leaf, based on our prediction, which is the mean.

If m_l is the average response at a leaf, and y_i the response of the i^{th} data point at the node, then RSS(l)=\sum \limits _{i}(y_i-m_l)^2.

    #requires: import numpy as np
    def RSS(self, filtered_data):
        prediction = np.mean(filtered_data[self.response])
        dev = sum((y-prediction)**2 for y in filtered_data[self.response])
        return dev
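For example, a leaf whose responses are 1, 2 and 3 predicts their mean, 2, giving RSS = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2. A standalone version without the class or pandas:

```python
def rss(responses):
    # squared error of predicting the mean for every point in the leaf
    prediction = sum(responses) / float(len(responses))
    return sum((y - prediction) ** 2 for y in responses)

print(rss([1, 2, 3]))  # 2.0
print(rss([5, 5, 5]))  # 0.0 -- a constant leaf is perfectly pure
```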

We also make some helper functions for the metrics.

    def metric(self, filtered_data, kind='Entropy'):
        if kind=='RSS':
            return self.RSS(filtered_data=filtered_data)
        else:
            prob_vector = self.get_prob_vector(filtered_data)
            if kind=='Entropy':
                return self.Entropy(prob_vector)
            elif kind=='Gini':
                return self.Gini(prob_vector)

    def get_prob_vector(self, data):
        value_count = data[self.response].value_counts()
        size = float(len(data))
        prob_vector = [value_count[key]/size for key in value_count.keys()]
        return prob_vector

Now we have the metrics sorted, we can make a function to handle calculating node purity. The idea is that if a node is a leaf, it simply returns the metric evaluated on the subset of the data corresponding to its filter, if it is a parent, it returns the weighted sum of the node purities of its children.

    def node_purity(self, node):
        if node.children==None:
            return self.metric(node.local_data, kind=self.metric_kind)
        else:
            left_raw_purity = self.node_purity(node.left)
            right_raw_purity = self.node_purity(node.right)
            left_size = float(node.left.size)
            right_size = float(node.right.size)
            left_purity = (left_size/node.size)*left_raw_purity
            right_purity = (right_size/node.size)*right_raw_purity
            return left_purity + right_purity
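For instance, splitting ten passengers into a pure left leaf of four and a right leaf of six that is split 3/3 gives, under Gini, a weighted purity of (4/10)*0 + (6/10)*0.5 = 0.3. A standalone sketch of that calculation (the function names here are mine):

```python
def gini(prob_vector):
    return sum(p * (1 - p) for p in prob_vector)

def weighted_purity(children):
    # children: list of (size, prob_vector) pairs, one per leaf;
    # each leaf's metric is weighted by its share of the parent's data
    total = float(sum(size for size, _ in children))
    return sum((size / total) * gini(pv) for size, pv in children)

# pure left leaf (4 points), perfectly mixed right leaf (6 points)
print(weighted_purity([(4, [1.0, 0.0]), (6, [0.5, 0.5])]))  # 0.3
```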

And since it makes no reference to pivots and the like, it can go in DecisionTree.

Now we are pretty much there, we just need to say something about stopping conditions. Some natural candidates are node depth (how many levels down from the root it is), node size, total number of leaves, or a threshold for the metric.
The last one won't work because we can't know in advance how well the data can be partitioned by the attributes; but this is quite a general problem in machine learning: parameter-tuning.

Node depth sounds reasonable, so we could add that into the DecisionNode and have it be updated when splitting a node.

Node size is also something we are already keeping track of, but it has the disadvantage that we don't know in advance how small the nodes can become.

    def stopping_condition(self, node):
        if self.max_node_depth <= node.depth:
            return True
        elif node.size <= self.min_node_size:
            return True
        else:
            return False

Now we put it together.

    def grow_tree(self):
        self.grow_node(self.root) #kick off the recursion at the root

One other thing I want to do is interface with igraph a bit better on the plotting front, but I will hide the details from you guys for now as it's not so interesting; once I put the code up you can examine it at leisure.

Now we are ready to grow our tree.

    import cleantitanic as ct
    df = ct.cleaneddf()[0] #the cleaned training data (loader from the data-cleaning post)
    df = df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
             'fare', 'embarked']]
    data_type_dict = {'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
                      'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
                      'fare':'ordinal', 'embarked':'nominal'}
    tree = PivotDecisionTree(response='survived', data_type_dict=data_type_dict,
                             metric_kind='Gini', max_node_depth=3,
                             min_node_size=1, data=df)
    tree.grow_tree()

And it looks like:
Here left corresponds to the vertex test being True. So this is starting to look good. Next we need to learn how to prune the tree, because we don't yet have very informed criteria for stopping, but it is better to ask forgiveness than permission, as the Pythonistas say!

The code will be up soon when it is all put together.



