Home » Kaggle

# Category Archives: Kaggle

## Creating Dummy Variables in Pandas

Last time we implemented logistic regression, where the data is in the form of a numpy array. But what if your data is not of that form? What if it is a pandas dataframe like the Kaggle Titanic data?

Specifically the problem is variables like ‘Title’ where we have four strings ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’ as values. The solution is to create a column for each value, for example ‘Title_Mr’, with values 0,1 depending on whether the data point has that value. Reading Wes McKinney’s book again, he explains what to do, and so I have created a handy little function using his example.

It will accept as input a dataframe, and a dictionary telling which variables are nominal. It will then replace each nominal variable in your dataframe with a set of dummy columns, and also update your data type dictionary. Simple but effective.

def dummy_variables(data, data_type_dict):
#Loop over nominal variables.
for variable in filter(lambda x: data_type_dict[x]=='nominal',
data_type_dict.keys()):

#First we create the columns with dummy variables.
#Note that the argument 'prefix' means the column names will be
#prefix_value for each unique value in the original column, so
#we set the prefix to be the name of the original variable.
dummy_df=pd.get_dummies(data[variable], prefix=variable)

#Remove old variable from dictionary.
data_type_dict.pop(variable)

#Add new dummy variables to dictionary.
for dummy_variable in dummy_df.columns:
data_type_dict[dummy_variable] = 'nominal'

#Add dummy variables to main df.
data=data.drop(variable, axis=1)
data=data.join(dummy_df)

return [data, data_type_dict]


Easy. Now once everything is numeric, we can as use np.asarray to cast our dataframe to a numpy array.

## Random Forest

Today we will learn about the random forest algorithm, and how to implement in Python, using the decision trees we have made before.

History

The random forest algorithm was first introduced by Leo Breiman, and he (and Adam Cutler) have a great guide to them here.

The Algorithm

The idea is to train many decision trees on bootstrapped samples of your data, and when we grow a node we only consider a small random selection of variables. We don’t bother pruning the trees, we just trust that the overfit of each tree will be averaged out. This has the advantage of being fast, since few variables need to be searched, and stable since average of all the trees over the resampled training sets reduces the variability you get from any particular sample. This also gives a rich set of information about the model, since each tree has unseen data that was missed in the resampling, and this can be used to test each individual tree. This can be used to estimate the importance of a variable, and estimate the generalisation error of the random forest.

The best way to explain it is to just do it.

Random Forest in Python
First we’ll create random tree classes, one for the regression case and one for the classification case.

class ClassRandomTree(tic.ClassificationTree, object):
def iter_split(self, node): #We overwrite the classification trees' exhaustive search
#with a random sample.
random.shuffle(self.columns) #We shuffle the columns, then select the first few.
for i in range(self.number_random_vars): #This attribute will be set by another object.
attribute=self.columns[i]
for pivot in self.get_pivots(node.local_data, attribute):
self.fuse_vertex(node)
self.split_vertex(vertex=node, pivot=pivot,
split_attribute=attribute)
yield    def train(self,data, data_type_dict, min_node_size, max_depth,
number_random_vars,response):
self.number_random_vars=number_random_vars #The number of variables we select for each tree.
self.vertices=[]
self.edges=set([])
self.leaves=set([])
self.data=data
self.data_type_dict=data_type_dict #Tells the algorithm what kinds of variables we have.
self.columns=data_type_dict.keys()
if self.response in self.columns:
self.columns.remove(self.response)
self.min_node_size=min_node_size
self.max_node_depth=max_depth
self.response=response
self.grow_tree() #Grows the tree without pruning.


Now we do exactly the same, but inherit from the regression tree class.

class RegressRandomTree(tic.RegressionTree, object):
def iter_split(self, node):
random.shuffle(self.columns)
for i in range(self.number_random_vars):
attribute=self.columns[i]
for pivot in self.get_pivots(node.local_data, attribute):
self.fuse_vertex(node)
self.split_vertex(vertex=node, pivot=pivot,
split_attribute=attribute)
yield

def train(self,data, data_type_dict, min_node_size, max_depth,
number_random_vars,response):
self.number_random_vars=number_random_vars
self.vertices=[]
self.edges=set([])
self.leaves=set([])
self.data=data
self.data_type_dict=data_type_dict
self.columns=data_type_dict.keys()
if self.response in self.columns:
self.columns.remove(self.response)
self.min_node_size=min_node_size
self.max_node_depth=max_depth
self.response=response
self.grow_tree()


Now that we can make a tree, we grow a forest. The basic idea is simple, and we add a couple of features to keep track of variable importance and our OOB error.

class RandomForest(object):
def __init__(self, m, train, test, data_type_dict, min_node_size,
max_depth, response, no_trees, missing=False, classification=True):
self.m=m #Number of variables to select for each tree.
self.train=train
self.test=test
self.data_type_dict=data_type_dict
self.min_node_size=min_node_size
self.max_depth=max_depth
self.response=response
self.no_trees=no_trees
self.missing=missing
self.classification=classification
def predict(self, predict=False, OOBerror=False,
variable_importance=False):
N=len(self.train)
predictions=[]
oob_error=[] #Keeps track of the OOB error of each tree.
mean_oob_error=[] #Keeps track of the mean OOB error of the trees.
importance_dict ={key:0 for key in self.data_type_dict.keys()} #Will give an importance score
#for each variable.
for tree in range(self.no_trees):
data,leftovers = OOB_sample(self.train,N, True) #We resample the data.
if self.classification:
R=ClassRandomTree(missing=self.missing)
else:
R=RegressRandomTree(missing=self.missing)
R.train(data=data, data_type_dict=self.data_type_dict,
min_node_size=self.min_node_size,
max_depth=self.max_depth,
response=self.response,
number_random_vars=self.m)
if OOBerror or variable_importance:
if self.classification:
oob_error.append(R.test(leftovers)) #Tests tree on unsampled data.
mean_oob_error.append(1-np.mean(oob_error))
else:
oob_error.append(R.test(leftovers)/float(len(leftovers)))
mean_oob_error.append(np.mean(oob_error))
if variable_importance:
for attribute in self.data_type_dict.keys():
#To assess a variables importance, we see how the error of the tree is affected
#by randomly permuting the labels for each variable.
shuffled_leftovers=shuffle_attribute_labels(leftovers, attribute)
local_class_rate= 1-oob_error[tree] #Unshuffled error.
shuffled_rate = 1-R.test(shuffled_leftovers) #Shuffled error.
importance_dict[attribute]+=local_class_rate-shuffled_rate #Difference betweeen
#shuffled and unshuffled
if predict:
#Now we make our predictions from the tree, and record them so that
#we can either take the mean or mode of the trees' predictions at the end.
predictions.append(self.test.apply(R.predict, axis=1))
if predict:
predictions=tic.combine_predictions(predictions, not self.classification)

if OOBerror or variable_importance:
if self.classification:
print 'Classification rate: ',mean_oob_error[-1]
else:
print 'Mean error: ', mean_oob_error[-1]

if variable_importance:
#Creates a graph of variable importance, we just do a little fussing so that
#they are ranked, which is a little awkward in a dictionary.
importance_dict = {attribute: importance_dict[attribute]/float(self.no_trees)
for attribute in importance_dict.keys()}
D=sorted(importance_dict.items(), key=lambda x: x[1])
values= [x[1] for x in D]
keys = [x[0] for x in D]
import matplotlib.pyplot as plt
plt.barh(range(len(importance_dict)), values, align='center')
plt.yticks(range(len(importance_dict)), keys)
plt.show()

return [predictions, mean_oob_error, importance_dict, oob_error]


Now I just need to show you the code I used for shuffling the variable labels:

def shuffle_attribute_labels(df, attribute):
df = df.copy()
col = list(df[attribute])
np.random.shuffle(col)
df[attribute]=col
return df


To illustrate this let’s run it on the Titanic data set. I created a new variable ‘Spouse’ that records whether or not the spouse of the person is known to have survived or not, and I’d like to know if this is a useful attribute.

import cleantitanic2 as ct
train,test, data_type_dict=ct.clean()
R=RandomForest(4, train, test, data_type_dict, 1, 100,'Survived',100, missing=False)
D=R.predict(predict=True,OOBerror=True, variable_importance=True)
prediction_path = 'C:/Documents and Settings/DIGIT/My Documents/Google Drive/Blogs/triangleinequality/Titanic/prediction.csv'
prediction_csv['Survived']=D[0]
prediction_csv['Survived']=prediction_csv['Survived'].map(lambda x: int(x))
index =False)


Here I used 100 trees, minimum node size 1 and maximum depth 100. I also chose to look at only 4 variables at each step, a rule of thumb is use about the square root of the number of variables, although you will want to refine this using cross-validation or the OOB error given by the random forest.
This produced the following graph, and gave an estimate of 0.78942559327 for the OOB error.
As you can see the variable ‘Spouse’ has negative importance! That means that using it is harmful to the model, since randomly shuffling the labels actually improved things! So I should forget about that one. We also see what we already knew, that ‘Title’ and ‘Sex’ are very important, although of course ‘Title’ is really a refinement of the variable ‘Sex’, and so we see one of the problems with this measure: correlated variables!I have updated most of the modules I have written now, so if you want to run this code you should download all of these files, if you have any problems just let me know!

## Basic Feature Engineering with the Titanic Data

Today we are going to add a couple of features to the Titanic data set that I have discussed extensively, this will involve changing my data cleaning script. Following this I will test the new features using cross-validation to see if they made a difference. These new features come from reading the Kaggle forums and also this helpful blog post.

Titles

First up the Name column is currently not being used, but we can at least extract the title from the name. There are quite a few titles going around, but I want to reduce them all to Mrs, Miss, Mr and Master.  To do this we’ll need a function that searches for substrings. Thankfully the library ‘strings’ has just what we need.

import strings
def substrings_in_string(big_string, substrings):
for substring in substrings:
if string.find(big_string, substring) != -1:
return substring
print big_string
return np.nan


Using this iteratively I was able to get a full list of titles.

title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
'Don', 'Jonkheer']


Now that I have them, I recombine them to the four categories.

df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

#replacing all titles with mr, mrs, miss, master
def replace_titles(x):
title=x['Title']
if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
return 'Mr'
elif title in ['Countess', 'Mme']:
return 'Mrs'
elif title in ['Mlle', 'Ms']:
return 'Miss'
elif title =='Dr':
if x['Sex']=='Male':
return 'Mr'
else:
return 'Mrs'
else:
return title
df['Title']=df.apply(replace_titles, axis=1)


You may be interested to know that ‘Jonkheer’ is a male honorific for Dutch nobility. Also interesting is that I was tempted to just send ‘Dr’ -> ‘Mr’, but decided to check first, and there was indeed a female doctor aboard! It seems 1912 was further ahead of its time than Doctor Who!

Curious, I looked her up: her name was Dr. Alice Leader, and she and her husband were physicians in New York city.

But I digress. On to the decks.
Deck

This is going be very similar, we have a ‘Cabin’ column not doing much, only 1st class passengers have cabins, the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, and so we’re going to extract these just like the titles.

#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))


Family Size
One thing you can do to create new features is linear combinations of features. In a model like linear regression this should be unnecessary, but for a decision tree may find it hard to model such relationships. Reading on the forums at Kaggle, some people have considered the size of a person’s family, the sum of their ‘SibSp’ and ‘Parch’ attributes. Perhaps people traveling alone did better? Or on the other hand perhaps if you had a family, you might have risked your life looking for them, or even giving up a space up to them in a lifeboat. Let’s throw that into the mix.

#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']


Age*Class
This is an interaction term, since age and class are both numbers we can just multiply them.

df['Age*Class']=df['Age']*df['Pclass']


Fare per Person

Here we divide the fare by the number of family members traveling together, I’m not exactly sure what this represents, but it’s easy enough to add in.

df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)


Get The Code
I have incorporated these new features in a new data cleaning cleantitanic2.py script, which you can download here.

## A complete guide to getting 0.79903 in Kaggle’s Titanic Competition with Python

Ok so this is going to be a quick recap of all the work we have done so far in this blog, but it should be accessible to first time readers also.

Then simply run the last one, making sure that the other files are in your ‘pythonpath’ so that they can be imported. Otherwise read on.

Getting Started

Kaggle have a competition where you must predict the survivors of the titanic. The first step is to download the data,
you’ll need to grab the training data, and also the test data. This guide is going to be using Python, so you’ll also need that. I recommend Python (x,y) with Spyder, which you can download here.

Munging and Plotting

Once you’ve done that, you’ll need to clean the data, that involves getting rid of silly values, filling in missing values, discretising numeric attributes and also creating a template for a submission to Kaggle. The code that does all of that can be found here.

Ok great so now you’ve got your data, it is time to get to know it a little. The best way to get to know data is through visualisation, so here is some code to produce some plots of the data, which you can tinker with. To read a guide to creating these plots, click here.

Graphs and Trees in Python

So now you know something about Python, and also something about the Titanic data set. Our next goal is to produce a decision tree model for the data. Prior to that I created some Python code to represent graphs, to learn the very basics of graph theory you can click here.

Otherwise you will want to download the tigraph module I created from here. To learn about how I created this, and some related topics in python programming, you can read about object orientated programming here, decorators in python here,
and how to represent graphs in python here and here. Note that to produce plots of the graphs, you will need the igraph module for python.

Now that we can represent graphs in python we can make decision trees. The decision tree module can be downloaded here.

To learn about how they were created you can start here with representation, then here for growing the tree, here for pruning the tree.

Putting it all together

The way I have used decision trees is explained here: basically using training and test data to prune a decision tree, and then swapping the roles of training and test data and repeating several times, finally I combine the predictions by taking the mode.  The final piece of code to produce the model is here. And if you run it you should get a score of 0.79903, which the leaderboard tells me is 230th.

Here is a sample tree from those I took the mode of:

There is a great deal of room for improvement, I have not created any new features, such as extracting titles from the names,  and have done the crudest possible missing data imputation, as well as using decision trees, which are basically the simplest model you can make.

Please let me know what you think or if you have any problems.

## To prune or not to prune: cross-validation is the answer

Last time we developed a method for pruning decision trees based on a compromise between training error and tree complexity, the latter of which is weighted with a parameter $\alpha$. Choose a different $\alpha$ and get a different pruned tree. So how can we choose such a parameter?

If data is plentiful, we can split our training data into training and holdout data, and use the latter to test which $\alpha$ gives the best result. Our data is not so plentiful. The main problem is that the final model did not get to see the whole training set as it was built.

A similar idea is k-fold cross-validation. Partition the data into k equal sized subsets randomly. For each subset, use it as test data, and its complement as training data to build a model, test the model, record the average error over the k subsets and we have an estimate for the model’s generalisation error.

Now repeat (on the same k subsets) for each set of model parameters you wish to test. Choose the set of parameters minmising the generalisation error. Use these parameters to build a model on the whole data set. Done.

Ok so lets build a generator function to return the folded data.

import random
import pandas as pd
def cross_validate(no_folds, data, resample=False):
rows = list(data.index)
random.shuffle(rows)
N=len(data)
len_fold = int(N/no_folds)
start=0
for i in range(no_folds):
if i==no_folds-1:
stop =N
else:
stop = start +len_fold
test = data.ix[rows[start:stop]]
train = data.ix[rows[:start]+rows[stop:]]
if resample:
train_len=start+N-stop
no_resamples = N-train_len
train_rows = list(train.index)
random_extra_rows =[random.choice(train_rows) for row in range(no_resamples)]
train_rows = train_rows+random_extra_rows
train=train.ix[train_rows]
yield {'test':test, 'train':train}
start=stop


The idea is pretty simple, we take the list of row indexes and randomly shuffle them, then we just split the data set into k parts by choosing the rows from start*i to start+len(data)/k for the ith part, using the new mixed up row ordering.

Also observe that I have added an option to pump up the training sets to be the same size as the original data set using sampling with replacement.

I tried using this to select $\alpha$, but I found that the $\alpha$ you get over-prunes the tree trained on the full data set. I believe this is because cross-validation has fewer unique values, and so consequently the leaves make up a bigger slice of the error pie, and so they will be more resistant to pruning and hence need more ‘encouragement’ to fuse through larger $\alpha$. As a result cross-validation tends to overestimate the optimum $\alpha$.

The new approach I settled upon was using something like cross-validation: I split the data into training and
testing portions, grow the tree on the training data, and prune it using the test data instead of the training data.
In this case $\alpha$ is irrelevant, we set it to be 0.

The problem of course is that while our model has good information on how much to prune, it has only been trained
on half (say) the data. To alleviate this problem I also grew a tree on the test data, and pruned it using the
test data. This gives 2 models, or k-models if using k-fold cross-validation.

Then from each I take their predictions and combine them by taking the modal prediction. Of course the mode is
not very informative for only 2 predictions, so I repeated this process several times. This helps to reduce
variability from our random partitioning of the data. Here’s how to achieve this.

df=ct.cleaneddf(no_bins=10)[0]
df2=ct.cleaneddf(no_bins=10)[1]

df=df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
'Fare', 'Embarked']]
data_type_dict={'Survived':'nominal', 'Pclass':'ordinal', 'Sex':'nominal',
'Age':'ordinal', 'SibSp':'ordinal', 'Parch':'ordinal',
'Fare':'ordinal', 'Embarked':'nominal'}


First I’ve just loaded the training and test data, and created the dictionary of data types that the tree expects.

def tree_train(data_type_dict, train_data,test_data, response, no_folds,
min_node_size, max_depth, no_iter):
parameters={'min_node_size':min_node_size, 'max_node_depth':max_depth,
'threshold':0, 'metric_kind':'Gini', 'alpha':0,
'response':response}
model=ticart.ClassificationTree()
predictions=[]
for i in range(no_iter):
for fold in cross_validate(no_folds, train_data):
model=ticart.ClassificationTree()
model.train(fold['train'], data_type_dict, parameters, prune=False)
model.prune_tree(alpha=0, new_data=True)
predictions.append(test_data.apply(model.predict, axis=1))
return predictions


This is pretty straightforward, for each iteration we chop up the data, train on all but one portion and prune with the remaining portion, repeat this for each fold and take predictions, returning a list of predictions at the end.

Next we need a way to combine predictions. We combine the list of predictions into a dataframe, then apply the mode to each row.

def combine_predictions(predictions):
data_dict ={i:predictions[i] for i in range(len(predictions))}
d=pd.DataFrame(data_dict)
def mode(x):
key,value = max(x.value_counts().iteritems(), key=lambda x:x[1])
return key
pred=d.apply(mode, axis=1)
return predpredictions=tree_train(data_type_dict=data_type_dict, train_data=df,
test_data=df2, response='Survived', no_folds=10,
max_depth=50, min_node_size=5, no_iter=50)


And that’s it, next time I am going to walkthrough the whole process of cleaning the data through to building the model and making a submission to Kaggle.

## Decision Trees Part 3: Pruning your Tree

Ok last time we learned how to automatically grow a tree, using a greedy algorithm to choose splits that maximise a given ‘metric’.

The problem is that the trees become huge and undoubtedly overfit to our data, meaning that it will generalize to unseen data poorly. This is the age old battle between signal and noise, where we have to build in a level of fuzziness to the model – pixelating the trees to see the forest.

So first we come up with a way of measuring what we want to achieve, then we find an algorithm to maximise that quantity.

Let’s introduce some terminology and notation. For some model built on training data, the resubstitution error is the error (for some notion of error) on the training data. For a tree $T$ we may write
$R(T)$ for this quantity.

In our case we will use the misclassification rate for our error, since in the Kaggle competition on the Titanic data  this is how the predictions are scored.

Now we quantify tree complexity, write $|T|$ for the number of leaves. There is some balance to be struck between this quantities, which we will call the ‘$\alpha$-cost’ of the tree, given by
$R_{\alpha}(T) = R(T) = \alpha |T|$.

So $\alpha$ is the price we pay for adding an extra leaf, which must be paid for in reduced training error.

So given our tree $T$ and $\alpha$, we seek a subtree $T^*$ that minimises $R_{\alpha}$, or at least one that gives us a small value (because there are a lot of subtrees).

One thing we can try is look at each node that is not a leaf, fuse it, and calculate the cost, choose the node(if any) which gives the best improvement in tree cost, if none of them give an improvement, we’re done, else repeat the process.

Ok let’s get coding. First we want to be able to measure the error rate at a node, I call this the conditional error rate, because it assumes that you are already in the node. Since we will be calculating this quite a lot, I have added it to the node so it won’t recalculate it when called again.

    def conditional_error_rate(self, node, new_data=False):
if node.error==None:
if new_data:
node.error= node.local_data[self.response].value_counts()
if node.predicted_class in node.error.keys():
node.error = node.error[node.predicted_class]/float(node.size)
node.error= 1-node.error
else:
node.error=1
else:
node.error= 1-node.predicted_prob
return node.error


In the training or resubstitution case, where new_data=False, the missclassification rate is just one minus the predicted class. In the case we want to look at later, where after we have built our tree we look at new data, we need to know how often the class appears in the new data. If the predicted class is not in the new data at all, the missclassification rate is total, 1.

That was the conditional error rate, but what contribution does a node make to the overall error rate? To answer that we need to know the probability of being at that node, then multiply the conditional rate by that probability.

    def node_prob(self, node):
return node.size/float(self.data_size)

def local_error_rate(self, node, new_data=False):
return self.conditional_error_rate(node, new_data)*self.node_prob(node)


So local_error_rate tells us the error a node contributes if we have fused at that node. To compare with not fusing, we look at the subtree rooted at that node, and calculate the sum of its leaves local errors.

    def node_error(self,node):
return sum(self.local_error_rate(leaf) for leaf in self.get_node_leaves(node))

def get_node_leaves(self, node): #get leaves of subtree at a node
leaves=set([])
for descendant in self.node_iter_down(node):
if descendant in self.leaves:
return leaves

def node_iter_down(self,base, first=True): #iterates top to bottom over the subtree at a node
if first:
yield base
if base.children==None:
return
if base.children==None:
yield base
else:
for child in base.children:
yield child
for node in self.node_iter_down(child, first=False):
yield node


Great! Now if we want to know $R(T)$ we just sum all of the local errors at the leaves.

    def error(self, new_data=False):
return sum(self.local_error_rate(leaf,new_data) for leaf in self.leaves)


This is looking promising, now what we need to do is find a way of iterating through each possible prune, and calculate the change in tree cost. The good thing about the change in cost is that we only need to know about what’s going on in the subtree rooted at a node: we lose $\alpha$ times the number of leaves of the subtree minus one to the cost by fusing, but gain the difference between the node error and its local error rate.

The change in cost by fusing a node is given by

    def node_cost(self, node, alpha):
fused_error = self.local_error_rate(node)
unfused_error = self.node_error(node)
number_leaves_lost = len(self.get_node_leaves(node))-1
error_diff=fused_error-unfused_error
return error_diff - alpha*number_leaves_lost


And now we iterate over possible prunes:

    def get_best_prune(self, alpha):
best_node_cost, best_node = min(self.iter_prune_cost(alpha),
key=lambda x: x[0])
if best_node_cost             return best_node
else: #If no good prunes then we return zip.
return None

def iter_prune_cost(self, alpha):
check=True
for node in self.vertices:
if not node in self.leaves:
check=False
yield [self.node_cost(node, alpha),node]
if check:  #This is a base case, if the only node left is the root, then no more can be done.
yield[0,self.get_root()]


Having laid the groundwork, the final method is simplicity itself.

    def prune_tree(self, alpha):
best_prune=self.get_best_prune(alpha)
if best_prune==None:
return
else:
self.fuse_vertex(best_prune)
self.prune_tree(alpha)


And that is that! Let’s give it a whirl. I’ve added a method called train to the class which you can see in the code I have put up, this allows us to pass a dictionary of parameters to build the tree. You can also see that I have seperated out the regression and classificiation trees into their own subclasses. First we try $\alpha=0$, meaning do no pruning.

import cleantitanic as ct
df=ct.cleaneddf()[0]
df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
'fare', 'embarked']]
data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
'fare':'ordinal', 'embarked':'nominal'}
g=ClassificationTree()
parameters={'min_node_size':5, 'max_node_depth':20,
'data_type_dict':data_type_dict,'threshold':0,
'response':'survived'}
parameters['alpha']=0
parameters['metric_kind']='Gini'
g.train(data=df,data_type_dict=data_type_dict, parameters=parameters)
g.plot()



Which produces this monstrosity.

Yikes! Let’s see it with $\alpha=0.001$.

And $\alpha=0.00135$.

And $\alpha=0.00135$.

And finally  $\alpha=0.002$.

So you can see that the tree you get depends entirely on $\alpha$, so which one should you use? We will attempt to answer that question next time where we will discuss cross-validation and OOB(out of bag resampling).

Here is the code I have been promising, it also includes the means to test unseen data, which we will discuss next time.

## Decision Trees Part 2: Growing Your Tree

Ok so last time we looked at what a Decision Tree was, and how to represent one in Python using our DecisionNode, DecisionTree, PivotDecisionNode and PivotDecisionTree classes. We also built a very basic one for the Titanic Kaggle data with a single split on ‘sex’.

As might have been apparent even in that single example, building decision trees by hand is tedious, and what’s worse is that you have to guess what attributes and pivots to use. What we need is some automation.

Thinking aloud (so to speak) we want a new method for PivotDecisionTree that grows a particular node. Let’s just try and write such a method and fill it out as we go along.

    def grow_node(self, node):
if node.parent==None:
node.local_data=node.local_filter(data=self.data)
if self.stopping_condition(node):
return
else:
try:
best_split=self.get_best_split(node)

except StopIteration:
return
self.split_vertex(node, split_attribute=best_split[1],
pivot=best_split[2])
for child in node.children:
child.local_data=child.local_filter(data=node.local_data)
self.grow_node(node=child)


The idea is simple, start with a node, if some stopping condition is achieved, do nothing, else find the best attribute to split on, and the best pivot for that attribute, and split on it. This is a greedy algorithm idea.  Then repeat the process with each of the children.

Now how to find the best split.

    def get_best_split(self, node):
gen = self.iter_split_eval(node)
first=next(gen)
best_split=first
for split in gen:
if split[0] < best_split[0]:
best_split = split
return best_split


So this method iterates through the evaluations of the possible splits, and returns the lowest one, assuming lowest is what we want. Now we fill in the next function we haven’t defined.

    def iter_split_eval(self, node):
for split in self.iter_split(node):
if node.children==None:
pass
else:
for child in node.children:
child.local_data=child.local_filter(node.local_data)
ret = [self.node_purity(node),
node.split_attribute, node.pivot]
yield ret

def iter_split(self, node):
for attribute in self.data.columns:
if attribute != self.response:
for pivot in self.get_pivots(node.local_data, attribute):
self.fuse_vertex(node)
self.split_vertex(vertex=node, pivot=pivot,
split_attribute=attribute)
yield


This is a generator object since we use the keyword yield, this allows to iterate over it, without creating the whole list all at once.

We try each attribute, and each pivot, split on it, and report our evaluation of that split using something called ‘node_purity’ which takes a metric as an argument. The metric will be what controls how we evaluate a split.

First we need a way of iterating over all possible splits. Observe that there are two cases: nominal data where there are $2^{k-1}-1$ possible binary splits if there are k labels; ordinal data where there are k-1 splits. In the first case the pivot is a set, in the second its a number.

So let’s try something like:

    def get_pivots(self, data, attribute):
if self.data_type_dict[attribute]=='ordinal':
max_pivot = max(data[attribute].unique())
for pivot in data[attribute].unique():
if pivot < max_pivot:
yield pivot
elif self.data_type_dict[attribute]=='nominal':
values = data[attribute].unique()
n=len(values)
if n<=1:
return
n=floor(float(n)/2)
n=int(n)
for r in range(1,n+1):
for pivot in combinations(values, r):
yield set(pivot) #we find all subsets of a certain size, but we only need to go
#up to about half way, otherwise we duplicate.



This is another generator object which separates the two cases we discussed above. In the ‘nominal’ case it makes use of a handy method in the itertools module to produce all subsets of a given size.

Also notice that we need to tag our data columns with ‘ordinal’ or ‘nominal’ at some point.

Ok so now we’ve sorted out the outline of the logistics, we need to say how we want to evaluate splits. What we want in a split is for the data to be less mixed up with respect to the response.

In the classification case there are two main metrics we can use: Gini and Entropy.
Suppose our response can take k values: $v_1, \dots , v_k$, then we can associate to our data a probability vector $p=(p_1, \dots ,p_k)$ that is the proportions of each response in the dataset.

The Gini metric tells us the probability of a random datum being mislabelled if we assign labels according to the probability vector : $G(p) = \sum \limits _{i=1}^k p_i(1-p_i)$.

Entropy tells us, given that we know a data point is in the node, what is our expected information gain (in bits) of finding out what its response is:
$E(p) = - \sum \limits _{i=1}^k p_i \log_2 p_i$.

By convention we write $0 \log _2(0) = 0$.

Notice that both of these metrics are maximised when the data is perfectly mixed up (or impure, that is $p_i=p_j \ \forall i,j$, and minimised when the node not mixed at all (pure), that is $\exists i: p_i=1$.

So we are seeking the former case, we want the data to be mostly in a single class.

These are very easy to implement:

    def Gini(self,prob_vector):
return sum(p*(1-p) for p in prob_vector)

def Entropy(self,prob_vector):
def entropy_summand(p):
if p==0:
return 0
else:
return -p*log(p,2)
return sum(entropy_summand(p) for p in prob_vector)


Now in the regression case we can measure mixed-upness using the residual some of squares (RSS), this is the sum of the squared errors at a leaf based on our prediction, the mean.

If $m_l$ is the average response at a leaf, and $y_i$ the response of the $i^{th}$ data point at the node, then $RSS(l)=\sum \limits _{i}(y_i-m_l)^2$.

    def RSS(self, filtered_data):
prediction = np.mean(filtered_data[self.response])
dev = sum((y-prediction)**2 for y in filtered_data[self.response])
return dev


We also make some helper functions for the metrics.

    def metric(self, filtered_data, kind='Entropy'):
else:
prob_vector=self.get_prob_vector(filtered_data)
if kind=='Entropy':
return self.Entropy(prob_vector)
elif kind=='Gini':
return self.Entropy(prob_vector)

def get_prob_vector(self, data):
size = float(len(data))
value_count=data[self.response].value_counts()
prob_vector = [value_count[key]/size for key in value_count.keys()]
return prob_vector


Now we have the metrics sorted, we can make a function to handle calculating node purity. The idea is that if a node is a leaf, it simply returns the metric evaluated on the subset of the data corresponding to its filter, if it is a parent, it returns the weighted sum of the node purities of its children.

    def node_purity(self, node):
if node.children==None:
return self.metric(node.local_data, kind=self.metric_kind)
else:
left_raw_purity=self.node_purity(node=node.left)
right_raw_purity=self.node_purity(node=node.right)
left_size= float(node.left.size)
right_size = float(node.right.size)
left_purity= (left_size/node.size)*left_raw_purity
right_purity = (right_size/node.size)*right_raw_purity
return left_purity+right_purity


And since it makes to reference to pivots and the like, it can go in DecisionTree.

Now we are pretty much there, we just need to say something about stopping conditions. Some natural candidates are node depth (how many levels down from the root it is), node size, total number of leaves,or a threshold for the metric.
The last one won’t work because we can’t know in advance how well the data can be partitioned by the attributes, but this is quite a general problem in machine learning, parameter-tuning.

Node depth sounds reasonable, so we could add that into the DecisionNode and have it be updated when splitting a node.

Node size is also something we already keeping track of, but it has the disadvantage that we don’t know in advance how small the nodes can become.

    def stopping_condition(self, node):
if self.max_node_depth <= node.depth:
return True
elif node.size <= self.min_node_size:
return True
else:
return False


Now we put it together.

    def grow_tree(self):
self.create_vertex()
self.set_root(self.vertices[0])
self.grow_node(node=self.get_root())
self.set_predictions


One other thing I want to do is interface to igraphs a bit better on the plotting front, but I will hide the details from you guys for now as it’s not so interesting, but once I put the code up you can examine it at leisure.

Now we are ready to grow our tree.

import cleantitanic as ct
df=ct.cleaneddf()[0]
df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
'fare', 'embarked']]
data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
'fare':'ordinal', 'embarked':'nominal'}

t=PivotDecisionTree(data_type_dict=data_type_dict,
metric_kind='Gini', max_node_depth=3, data=df)
t.response='survived'
t.grow_tree()
g=t.plot()


And it looks like:

Where left corresponds to the vertex test being True. So this is starting to look good, but next we need to
learn how to prune it because we don’t yet have very informed criteria for stopping, but it is better to
ask forgiveness than permission, as the Pythonistas say!

The code will be up soon when it is all put together.