Home » machine learning » decision trees » Decision Trees Part 2: Growing Your Tree

Decision Trees Part 2: Growing Your Tree

Ok so last time we looked at what a Decision Tree was, and how to represent one in Python using our DecisionNode, DecisionTree, PivotDecisionNode and PivotDecisionTree classes. We also built a very basic one for the Titanic Kaggle data with a single split on ‘sex’.

As might have been apparent even in that single example, building decision trees by hand is tedious, and what’s worse is that you have to guess what attributes and pivots to use. What we need is some automation.

Thinking aloud (so to speak) we want a new method for PivotDecisionTree that grows a particular node. Let’s just try and write such a method and fill it out as we go along.

    def grow_node(self, node):
if node.parent==None:
node.local_data=node.local_filter(data=self.data)
if self.stopping_condition(node):
return
else:
try:
best_split=self.get_best_split(node)

except StopIteration:
return
self.split_vertex(node, split_attribute=best_split[1],
pivot=best_split[2])
for child in node.children:
child.local_data=child.local_filter(data=node.local_data)
self.grow_node(node=child)


The idea is simple, start with a node, if some stopping condition is achieved, do nothing, else find the best attribute to split on, and the best pivot for that attribute, and split on it. This is a greedy algorithm idea.  Then repeat the process with each of the children.

Now how to find the best split.

    def get_best_split(self, node):
gen = self.iter_split_eval(node)
first=next(gen)
best_split=first
for split in gen:
if split[0] < best_split[0]:
best_split = split
return best_split


So this method iterates through the evaluations of the possible splits, and returns the lowest one, assuming lowest is what we want. Now we fill in the next function we haven’t defined.

    def iter_split_eval(self, node):
for split in self.iter_split(node):
if node.children==None:
pass
else:
for child in node.children:
child.local_data=child.local_filter(node.local_data)
ret = [self.node_purity(node),
node.split_attribute, node.pivot]
yield ret

def iter_split(self, node):
for attribute in self.data.columns:
if attribute != self.response:
for pivot in self.get_pivots(node.local_data, attribute):
self.fuse_vertex(node)
self.split_vertex(vertex=node, pivot=pivot,
split_attribute=attribute)
yield


This is a generator object since we use the keyword yield, this allows to iterate over it, without creating the whole list all at once.

We try each attribute, and each pivot, split on it, and report our evaluation of that split using something called ‘node_purity’ which takes a metric as an argument. The metric will be what controls how we evaluate a split.

First we need a way of iterating over all possible splits. Observe that there are two cases: nominal data where there are $2^{k-1}-1$ possible binary splits if there are k labels; ordinal data where there are k-1 splits. In the first case the pivot is a set, in the second its a number.

So let’s try something like:

    def get_pivots(self, data, attribute):
if self.data_type_dict[attribute]=='ordinal':
max_pivot = max(data[attribute].unique())
for pivot in data[attribute].unique():
if pivot < max_pivot:
yield pivot
elif self.data_type_dict[attribute]=='nominal':
values = data[attribute].unique()
n=len(values)
if n<=1:
return
n=floor(float(n)/2)
n=int(n)
for r in range(1,n+1):
for pivot in combinations(values, r):
yield set(pivot) #we find all subsets of a certain size, but we only need to go
#up to about half way, otherwise we duplicate.



This is another generator object which separates the two cases we discussed above. In the ‘nominal’ case it makes use of a handy method in the itertools module to produce all subsets of a given size.

Also notice that we need to tag our data columns with ‘ordinal’ or ‘nominal’ at some point.

Ok so now we’ve sorted out the outline of the logistics, we need to say how we want to evaluate splits. What we want in a split is for the data to be less mixed up with respect to the response.

In the classification case there are two main metrics we can use: Gini and Entropy.
Suppose our response can take k values: $v_1, \dots , v_k$, then we can associate to our data a probability vector $p=(p_1, \dots ,p_k)$ that is the proportions of each response in the dataset.

The Gini metric tells us the probability of a random datum being mislabelled if we assign labels according to the probability vector : $G(p) = \sum \limits _{i=1}^k p_i(1-p_i)$.

Entropy tells us, given that we know a data point is in the node, what is our expected information gain (in bits) of finding out what its response is:
$E(p) = - \sum \limits _{i=1}^k p_i \log_2 p_i$.

By convention we write $0 \log _2(0) = 0$.

Notice that both of these metrics are maximised when the data is perfectly mixed up (or impure, that is $p_i=p_j \ \forall i,j$, and minimised when the node not mixed at all (pure), that is $\exists i: p_i=1$.

So we are seeking the former case, we want the data to be mostly in a single class.

These are very easy to implement:

    def Gini(self,prob_vector):
return sum(p*(1-p) for p in prob_vector)

def Entropy(self,prob_vector):
def entropy_summand(p):
if p==0:
return 0
else:
return -p*log(p,2)
return sum(entropy_summand(p) for p in prob_vector)


Now in the regression case we can measure mixed-upness using the residual some of squares (RSS), this is the sum of the squared errors at a leaf based on our prediction, the mean.

If $m_l$ is the average response at a leaf, and $y_i$ the response of the $i^{th}$ data point at the node, then $RSS(l)=\sum \limits _{i}(y_i-m_l)^2$.

    def RSS(self, filtered_data):
prediction = np.mean(filtered_data[self.response])
dev = sum((y-prediction)**2 for y in filtered_data[self.response])
return dev


We also make some helper functions for the metrics.

    def metric(self, filtered_data, kind='Entropy'):
else:
prob_vector=self.get_prob_vector(filtered_data)
if kind=='Entropy':
return self.Entropy(prob_vector)
elif kind=='Gini':
return self.Entropy(prob_vector)

def get_prob_vector(self, data):
size = float(len(data))
value_count=data[self.response].value_counts()
prob_vector = [value_count[key]/size for key in value_count.keys()]
return prob_vector


Now we have the metrics sorted, we can make a function to handle calculating node purity. The idea is that if a node is a leaf, it simply returns the metric evaluated on the subset of the data corresponding to its filter, if it is a parent, it returns the weighted sum of the node purities of its children.

    def node_purity(self, node):
if node.children==None:
return self.metric(node.local_data, kind=self.metric_kind)
else:
left_raw_purity=self.node_purity(node=node.left)
right_raw_purity=self.node_purity(node=node.right)
left_size= float(node.left.size)
right_size = float(node.right.size)
left_purity= (left_size/node.size)*left_raw_purity
right_purity = (right_size/node.size)*right_raw_purity
return left_purity+right_purity


And since it makes to reference to pivots and the like, it can go in DecisionTree.

Now we are pretty much there, we just need to say something about stopping conditions. Some natural candidates are node depth (how many levels down from the root it is), node size, total number of leaves,or a threshold for the metric.
The last one won’t work because we can’t know in advance how well the data can be partitioned by the attributes, but this is quite a general problem in machine learning, parameter-tuning.

Node depth sounds reasonable, so we could add that into the DecisionNode and have it be updated when splitting a node.

Node size is also something we already keeping track of, but it has the disadvantage that we don’t know in advance how small the nodes can become.

    def stopping_condition(self, node):
if self.max_node_depth <= node.depth:
return True
elif node.size <= self.min_node_size:
return True
else:
return False


Now we put it together.

    def grow_tree(self):
self.create_vertex()
self.set_root(self.vertices[0])
self.grow_node(node=self.get_root())
self.set_predictions


One other thing I want to do is interface to igraphs a bit better on the plotting front, but I will hide the details from you guys for now as it’s not so interesting, but once I put the code up you can examine it at leisure.

Now we are ready to grow our tree.

import cleantitanic as ct
df=ct.cleaneddf()[0]
df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
'fare', 'embarked']]
data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
'fare':'ordinal', 'embarked':'nominal'}

t=PivotDecisionTree(data_type_dict=data_type_dict,
metric_kind='Gini', max_node_depth=3, data=df)
t.response='survived'
t.grow_tree()
g=t.plot()


And it looks like:

Where left corresponds to the vertex test being True. So this is starting to look good, but next we need to
learn how to prune it because we don’t yet have very informed criteria for stopping, but it is better to
ask forgiveness than permission, as the Pythonistas say!

The code will be up soon when it is all put together.