Ok so last time we looked at what a Decision Tree was, and how to represent one in Python using our DecisionNode, DecisionTree, PivotDecisionNode and PivotDecisionTree classes. We also built a very basic one for the Titanic Kaggle data with a single split on ‘sex’.

As might have been apparent even in that single example, building decision trees by hand is tedious, and what’s worse is that you have to guess what attributes and pivots to use. What we need is some automation.

Thinking aloud (so to speak) we want a new method for PivotDecisionTree that grows a particular node. Let’s just try and write such a method and fill it out as we go along.

def grow_node(self, node): if node.parent==None: node.local_data=node.local_filter(data=self.data) if self.stopping_condition(node): return else: try: best_split=self.get_best_split(node) except StopIteration: return self.split_vertex(node, split_attribute=best_split[1], pivot=best_split[2]) for child in node.children: child.local_data=child.local_filter(data=node.local_data) self.grow_node(node=child)

The idea is simple, start with a node, if some stopping condition is achieved, do nothing, else find the best attribute to split on, and the best pivot for that attribute, and split on it. This is a greedy algorithm idea. Then repeat the process with each of the children.

Now how to find the best split.

def get_best_split(self, node): gen = self.iter_split_eval(node) first=next(gen) best_split=first for split in gen: if split[0] < best_split[0]: best_split = split return best_split

So this method iterates through the evaluations of the possible splits, and returns the lowest one, assuming lowest is what we want. Now we fill in the next function we haven’t defined.

def iter_split_eval(self, node): for split in self.iter_split(node): if node.children==None: pass else: for child in node.children: child.local_data=child.local_filter(node.local_data) ret = [self.node_purity(node), node.split_attribute, node.pivot] yield ret def iter_split(self, node): for attribute in self.data.columns: if attribute != self.response: for pivot in self.get_pivots(node.local_data, attribute): self.fuse_vertex(node) self.split_vertex(vertex=node, pivot=pivot, split_attribute=attribute) yield

This is a generator object since we use the keyword yield, this allows to iterate over it, without creating the whole list all at once.

We try each attribute, and each pivot, split on it, and report our evaluation of that split using something called ‘node_purity’ which takes a metric as an argument. The metric will be what controls how we evaluate a split.

First we need a way of iterating over all possible splits. Observe that there are two cases: nominal data where there are possible binary splits if there are k labels; ordinal data where there are k-1 splits. In the first case the pivot is a set, in the second its a number.

So let’s try something like:

def get_pivots(self, data, attribute): if self.data_type_dict[attribute]=='ordinal': max_pivot = max(data[attribute].unique()) for pivot in data[attribute].unique(): if pivot < max_pivot: yield pivot elif self.data_type_dict[attribute]=='nominal': values = data[attribute].unique() n=len(values) if n<=1: return n=floor(float(n)/2) n=int(n) for r in range(1,n+1): for pivot in combinations(values, r): yield set(pivot) #we find all subsets of a certain size, but we only need to go #up to about half way, otherwise we duplicate.

This is another generator object which separates the two cases we discussed above. In the ‘nominal’ case it makes use of a handy method in the itertools module to produce all subsets of a given size.

Also notice that we need to tag our data columns with ‘ordinal’ or ‘nominal’ at some point.

Ok so now we’ve sorted out the outline of the logistics, we need to say how we want to evaluate splits. What we want in a split is for the data to be less mixed up with respect to the response.

In the classification case there are two main metrics we can use: Gini and Entropy.

Suppose our response can take k values: , then we can associate to our data a probability vector that is the proportions of each response in the dataset.

The Gini metric tells us the probability of a random datum being mislabelled if we assign labels according to the probability vector : .

Entropy tells us, given that we know a data point is in the node, what is our expected information gain (in bits) of finding out what its response is:

.

By convention we write .

Notice that both of these metrics are maximised when the data is perfectly mixed up (or *impure*, that is , and minimised when the node not mixed at all (*pure*), that is .

So we are seeking the former case, we want the data to be mostly in a single class.

These are very easy to implement:

def Gini(self,prob_vector): return sum(p*(1-p) for p in prob_vector) def Entropy(self,prob_vector): def entropy_summand(p): if p==0: return 0 else: return -p*log(p,2) return sum(entropy_summand(p) for p in prob_vector)

Now in the regression case we can measure mixed-upness using the residual some of squares (RSS), this is the sum of the squared errors at a leaf based on our prediction, the mean.

If is the average response at a leaf, and the response of the data point at the node, then .

def RSS(self, filtered_data): prediction = np.mean(filtered_data[self.response]) dev = sum((y-prediction)**2 for y in filtered_data[self.response]) return dev

We also make some helper functions for the metrics.

def metric(self, filtered_data, kind='Entropy'): if kind=='RSS': return self.RSS(filtered_data=filtered_data) else: prob_vector=self.get_prob_vector(filtered_data) if kind=='Entropy': return self.Entropy(prob_vector) elif kind=='Gini': return self.Entropy(prob_vector) def get_prob_vector(self, data): size = float(len(data)) value_count=data[self.response].value_counts() prob_vector = [value_count[key]/size for key in value_count.keys()] return prob_vector

Now we have the metrics sorted, we can make a function to handle calculating node purity. The idea is that if a node is a leaf, it simply returns the metric evaluated on the subset of the data corresponding to its filter, if it is a parent, it returns the weighted sum of the node purities of its children.

def node_purity(self, node): if node.children==None: return self.metric(node.local_data, kind=self.metric_kind) else: left_raw_purity=self.node_purity(node=node.left) right_raw_purity=self.node_purity(node=node.right) left_size= float(node.left.size) right_size = float(node.right.size) left_purity= (left_size/node.size)*left_raw_purity right_purity = (right_size/node.size)*right_raw_purity return left_purity+right_purity

And since it makes to reference to pivots and the like, it can go in DecisionTree.

Now we are pretty much there, we just need to say something about stopping conditions. Some natural candidates are node depth (how many levels down from the root it is), node size, total number of leaves,or a threshold for the metric.

The last one won’t work because we can’t know in advance how well the data can be partitioned by the attributes, but this is quite a general problem in machine learning, parameter-tuning.

Node depth sounds reasonable, so we could add that into the DecisionNode and have it be updated when splitting a node.

Node size is also something we already keeping track of, but it has the disadvantage that we don’t know in advance how small the nodes can become.

def stopping_condition(self, node): if self.max_node_depth <= node.depth: return True elif node.size <= self.min_node_size: return True else: return False

Now we put it together.

def grow_tree(self): self.create_vertex() self.set_root(self.vertices[0]) self.leaves.add(self.vertices[0]) self.grow_node(node=self.get_root()) self.set_predictions

One other thing I want to do is interface to igraphs a bit better on the plotting front, but I will hide the details from you guys for now as it’s not so interesting, but once I put the code up you can examine it at leisure.

Now we are ready to grow our tree.

import cleantitanic as ct df=ct.cleaneddf()[0] df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']] data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal', 'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal', 'fare':'ordinal', 'embarked':'nominal'} t=PivotDecisionTree(data_type_dict=data_type_dict, metric_kind='Gini', max_node_depth=3, data=df) t.response='survived' t.grow_tree() g=t.plot()

And it looks like:

Where left corresponds to the vertex test being True. So this is starting to look good, but next we need to

learn how to prune it because we don’t yet have very informed criteria for stopping, but it is better to

ask forgiveness than permission, as the Pythonistas say!

The code will be up soon when it is all put together.

[…] last time we learned how to automatically grow a tree, using a greedy algorithm to choose splits that maximise […]

[…] learn about how they were created you can start here with representation, then here for growing the tree, here for pruning the […]