Home » Posts tagged 'Kaggle'

Tag Archives: Kaggle

Basic Feature Engineering with the Titanic Data

Today we are going to add a couple of features to the Titanic data set that I have discussed extensively, this will involve changing my data cleaning script. Following this I will test the new features using cross-validation to see if they made a difference. These new features come from reading the Kaggle forums and also this helpful blog post.

Titles

First up the Name column is currently not being used, but we can at least extract the title from the name. There are quite a few titles going around, but I want to reduce them all to Mrs, Miss, Mr and Master.  To do this we’ll need a function that searches for substrings. Thankfully the library ‘strings’ has just what we need.

import strings
def substrings_in_string(big_string, substrings):
    for substring in substrings:
        if string.find(big_string, substring) != -1:
            return substring
    print big_string
    return np.nan

Using this iteratively I was able to get a full list of titles.

title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                    'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                    'Don', 'Jonkheer']

Now that I have them, I recombine them to the four categories.

df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

#replacing all titles with mr, mrs, miss, master
def replace_titles(x):
    title=x['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 'Mr'
    elif title in ['Countess', 'Mme']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title =='Dr':
        if x['Sex']=='Male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title
df['Title']=df.apply(replace_titles, axis=1)

You may be interested to know that ‘Jonkheer’ is a male honorific for Dutch nobility. Also interesting is that I was tempted to just send ‘Dr’ -> ‘Mr’, but decided to check first, and there was indeed a female doctor aboard! It seems 1912 was further ahead of its time than Doctor Who!

Curious, I looked her up: her name was Dr. Alice Leader, and she and her husband were physicians in New York city.

But I digress. On to the decks.
Deck

This is going be very similar, we have a ‘Cabin’ column not doing much, only 1st class passengers have cabins, the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck, and so we’re going to extract these just like the titles.

#Turning cabin number into Deck
cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']
df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))

Family Size
One thing you can do to create new features is linear combinations of features. In a model like linear regression this should be unnecessary, but for a decision tree may find it hard to model such relationships. Reading on the forums at Kaggle, some people have considered the size of a person’s family, the sum of their ‘SibSp’ and ‘Parch’ attributes. Perhaps people traveling alone did better? Or on the other hand perhaps if you had a family, you might have risked your life looking for them, or even giving up a space up to them in a lifeboat. Let’s throw that into the mix.

#Creating new family_size column
df['Family_Size']=df['SibSp']+df['Parch']

Age*Class
This is an interaction term, since age and class are both numbers we can just multiply them.

df['Age*Class']=df['Age']*df['Pclass']

Fare per Person

Here we divide the fare by the number of family members traveling together, I’m not exactly sure what this represents, but it’s easy enough to add in.

df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)

Get The Code
I have incorporated these new features in a new data cleaning cleantitanic2.py script, which you can download here.

A complete guide to getting 0.79903 in Kaggle’s Titanic Competition with Python

Ok so this is going to be a quick recap of all the work we have done so far in this blog, but it should be accessible to first time readers also.

Quickstart: to get underway immediately just download the following python scripts:

Data Cleaning

Graph Theory

Decision Trees

Final Model

Then simply run the last one, making sure that the other files are in your ‘pythonpath’ so that they can be imported. Otherwise read on.

Getting Started

Kaggle have a competition where you must predict the survivors of the titanic. The first step is to download the data,
you’ll need to grab the training data, and also the test data. This guide is going to be using Python, so you’ll also need that. I recommend Python (x,y) with Spyder, which you can download here.

Munging and Plotting

Once you’ve done that, you’ll need to clean the data, that involves getting rid of silly values, filling in missing values, discretising numeric attributes and also creating a template for a submission to Kaggle. The code that does all of that can be found here.

To read a guide to cleaning the data and producing that code, click here.

Ok great so now you’ve got your data, it is time to get to know it a little. The best way to get to know data is through visualisation, so here is some code to produce some plots of the data, which you can tinker with. To read a guide to creating these plots, click here.

Graphs and Trees in Python

So now you know something about Python, and also something about the Titanic data set. Our next goal is to produce a decision tree model for the data. Prior to that I created some Python code to represent graphs, to learn the very basics of graph theory you can click here.

Otherwise you will want to download the tigraph module I created from here. To learn about how I created this, and some related topics in python programming, you can read about object orientated programming here, decorators in python here,
and how to represent graphs in python here and here. Note that to produce plots of the graphs, you will need the igraph module for python.

Now that we can represent graphs in python we can make decision trees. The decision tree module can be downloaded here.

To learn about how they were created you can start here with representation, then here for growing the tree, here for pruning the tree.

Putting it all together

The way I have used decision trees is explained here: basically using training and test data to prune a decision tree, and then swapping the roles of training and test data and repeating several times, finally I combine the predictions by taking the mode.  The final piece of code to produce the model is here. And if you run it you should get a score of 0.79903, which the leaderboard tells me is 230th.

Here is a sample tree from those I took the mode of:

sample_treeThere is a great deal of room for improvement, I have not created any new features, such as extracting titles from the names,  and have done the crudest possible missing data imputation, as well as using decision trees, which are basically the simplest model you can make.

Please let me know what you think or if you have any problems.

To prune or not to prune: cross-validation is the answer

Last time we developed a method for pruning decision trees based on a compromise between training error and tree complexity, the latter of which is weighted with a parameter \alpha. Choose a different \alpha and get a different pruned tree.pruning.jpg So how can we choose such a parameter?

If data is plentiful, we can split our training data into training and holdout data, and use the latter to test which \alpha gives the best result. Our data is not so plentiful. The main problem is that the final model did not get to see the whole training set as it was built.

A similar idea is k-fold cross-validation. Partition the data into k equal sized subsets randomly. For each subset, use it as test data, and its complement as training data to build a model, test the model, record the average error over the k subsets and we have an estimate for the model’s generalisation error.

Now repeat (on the same k subsets) for each set of model parameters you wish to test. Choose the set of parameters minmising the generalisation error. Use these parameters to build a model on the whole data set. Done.

Ok so lets build a generator function to return the folded data.

import random
import pandas as pd
def cross_validate(no_folds, data, resample=False):
    rows = list(data.index)
    random.shuffle(rows)
    N=len(data)
    len_fold = int(N/no_folds)
    start=0
    for i in range(no_folds):
        if i==no_folds-1:
            stop =N
        else:
            stop = start +len_fold
        test = data.ix[rows[start:stop]]
        train = data.ix[rows[:start]+rows[stop:]]
        if resample:
            train_len=start+N-stop
            no_resamples = N-train_len
            train_rows = list(train.index)
            random_extra_rows =[random.choice(train_rows) for row in range(no_resamples)]
            train_rows = train_rows+random_extra_rows
            train=train.ix[train_rows]
        yield {'test':test, 'train':train}
        start=stop

The idea is pretty simple, we take the list of row indexes and randomly shuffle them, then we just split the data set into k parts by choosing the rows from start*i to start+len(data)/k for the ith part, using the new mixed up row ordering.

Also observe that I have added an option to pump up the training sets to be the same size as the original data set using sampling with replacement.

I tried using this to select \alpha , but I found that the \alpha you get over-prunes the tree trained on the full data set. I believe this is because cross-validation has fewer unique values, and so consequently the leaves make up a bigger slice of the error pie, and so they will be more resistant to pruning and hence need more ‘encouragement’ to fuse through larger \alpha. As a result cross-validation tends to overestimate the optimum \alpha.

The new approach I settled upon was using something like cross-validation: I split the data into training and
testing portions, grow the tree on the training data, and prune it using the test data instead of the training data.
In this case \alpha is irrelevant, we set it to be 0.

The problem of course is that while our model has good information on how much to prune, it has only been trained
on half (say) the data. To alleviate this problem I also grew a tree on the test data, and pruned it using the
test data. This gives 2 models, or k-models if using k-fold cross-validation.

Then from each I take their predictions and combine them by taking the modal prediction. Of course the mode is
not very informative for only 2 predictions, so I repeated this process several times. This helps to reduce
variability from our random partitioning of the data. Here’s how to achieve this.

df=ct.cleaneddf(no_bins=10)[0]
df2=ct.cleaneddf(no_bins=10)[1]

df=df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 
       'Fare', 'Embarked']]
data_type_dict={'Survived':'nominal', 'Pclass':'ordinal', 'Sex':'nominal', 
                'Age':'ordinal', 'SibSp':'ordinal', 'Parch':'ordinal', 
                'Fare':'ordinal', 'Embarked':'nominal'}

First I’ve just loaded the training and test data, and created the dictionary of data types that the tree expects.

def tree_train(data_type_dict, train_data,test_data, response, no_folds,
                   min_node_size, max_depth, no_iter):
        parameters={'min_node_size':min_node_size, 'max_node_depth':max_depth, 
                    'threshold':0, 'metric_kind':'Gini', 'alpha':0,
                    'response':response}
        model=ticart.ClassificationTree()
        predictions=[]
        for i in range(no_iter):
            for fold in cross_validate(no_folds, train_data):
                model=ticart.ClassificationTree()
                model.train(fold['train'], data_type_dict, parameters, prune=False)
                model.load_new_data(fold['test'])
                model.prune_tree(alpha=0, new_data=True)
                predictions.append(test_data.apply(model.predict, axis=1))
        return predictions    

This is pretty straightforward, for each iteration we chop up the data, train on all but one portion and prune with the remaining portion, repeat this for each fold and take predictions, returning a list of predictions at the end.

Next we need a way to combine predictions. We combine the list of predictions into a dataframe, then apply the mode to each row.

def combine_predictions(predictions):
    data_dict ={i:predictions[i] for i in range(len(predictions))}
    d=pd.DataFrame(data_dict)
    def mode(x):
        key,value = max(x.value_counts().iteritems(), key=lambda x:x[1])
        return key
    pred=d.apply(mode, axis=1)
    return predpredictions=tree_train(data_type_dict=data_type_dict, train_data=df,
                           test_data=df2, response='Survived', no_folds=10,
                           max_depth=50, min_node_size=5, no_iter=50)

And that’s it, next time I am going to walkthrough the whole process of cleaning the data through to building the model and making a submission to Kaggle.

Decision Trees Part 3: Pruning your Tree

Ok last time we learned how to automatically grow a tree, using a greedy algorithm to choose splits that maximise a given ‘metric’.

The problem is that the trees become huge and undoubtedly overfit to our data, meaning that it will generalize to unseen data poorly. This is the age old battle between signal and noise, where we have to build in a level of fuzziness to the model – pixelating the trees to see the forest.

So first we come up with a way of measuring what we want to achieve, then we find an algorithm to maximise that quantity.

Let’s introduce some terminology and notation. For some model built on training data, the resubstitution error is the error (for some notion of error) on the training data. For a tree T we may write
R(T) for this quantity.

In our case we will use the misclassification rate for our error, since in the Kaggle competition on the Titanic data  this is how the predictions are scored.

Now we quantify tree complexity, write |T| for the number of leaves. There is some balance to be struck between this quantities, which we will call the ‘\alpha-cost’ of the tree, given by
R_{\alpha}(T) = R(T) = \alpha |T|.

So \alpha is the price we pay for adding an extra leaf, which must be paid for in reduced training error.

So given our tree T and $\alpha$, we seek a subtree $T^*$ that minimises R_{\alpha}, or at least one that gives us a small value (because there are a lot of subtrees).

One thing we can try is look at each node that is not a leaf, fuse it, and calculate the cost, choose the node(if any) which gives the best improvement in tree cost, if none of them give an improvement, we’re done, else repeat the process.

Ok let’s get coding. First we want to be able to measure the error rate at a node, I call this the conditional error rate, because it assumes that you are already in the node. Since we will be calculating this quite a lot, I have added it to the node so it won’t recalculate it when called again.

    def conditional_error_rate(self, node, new_data=False):
        if node.error==None:
            if new_data:
                node.error= node.local_data[self.response].value_counts()
                if node.predicted_class in node.error.keys():
                    node.error = node.error[node.predicted_class]/float(node.size)
                    node.error= 1-node.error
                else:
                    node.error=1
            else:
                node.error= 1-node.predicted_prob
        return node.error

In the training or resubstitution case, where new_data=False, the missclassification rate is just one minus the predicted class. In the case we want to look at later, where after we have built our tree we look at new data, we need to know how often the class appears in the new data. If the predicted class is not in the new data at all, the missclassification rate is total, 1.

That was the conditional error rate, but what contribution does a node make to the overall error rate? To answer that we need to know the probability of being at that node, then multiply the conditional rate by that probability.

    def node_prob(self, node):
        return node.size/float(self.data_size)

    def local_error_rate(self, node, new_data=False):
        return self.conditional_error_rate(node, new_data)*self.node_prob(node)

So local_error_rate tells us the error a node contributes if we have fused at that node. To compare with not fusing, we look at the subtree rooted at that node, and calculate the sum of its leaves local errors.

    def node_error(self,node):
        return sum(self.local_error_rate(leaf) for leaf in self.get_node_leaves(node))

    def get_node_leaves(self, node): #get leaves of subtree at a node
        leaves=set([])
        for descendant in self.node_iter_down(node):
            if descendant in self.leaves:
                leaves.add(descendant)
        return leaves

    def node_iter_down(self,base, first=True): #iterates top to bottom over the subtree at a node
        if first:
            yield base
            if base.children==None:
                return
        if base.children==None:
            yield base
        else:
            for child in base.children:
                yield child
                for node in self.node_iter_down(child, first=False):
                    yield node

Great! Now if we want to know R(T) we just sum all of the local errors at the leaves.

    def error(self, new_data=False):
        return sum(self.local_error_rate(leaf,new_data) for leaf in self.leaves)

This is looking promising, now what we need to do is find a way of iterating through each possible prune, and calculate the change in tree cost. The good thing about the change in cost is that we only need to know about what’s going on in the subtree rooted at a node: we lose \alpha times the number of leaves of the subtree minus one to the cost by fusing, but gain the difference between the node error and its local error rate.

The change in cost by fusing a node is given by

    def node_cost(self, node, alpha):
        fused_error = self.local_error_rate(node)
        unfused_error = self.node_error(node)
        number_leaves_lost = len(self.get_node_leaves(node))-1
        error_diff=fused_error-unfused_error
        return error_diff - alpha*number_leaves_lost

And now we iterate over possible prunes:

    def get_best_prune(self, alpha):
        best_node_cost, best_node = min(self.iter_prune_cost(alpha),
                                          key=lambda x: x[0])
        if best_node_cost             return best_node
        else: #If no good prunes then we return zip.
            return None

    def iter_prune_cost(self, alpha):
        check=True
        for node in self.vertices:
            if not node in self.leaves:
                check=False
                yield [self.node_cost(node, alpha),node]
        if check:  #This is a base case, if the only node left is the root, then no more can be done.
            yield[0,self.get_root()]

Having laid the groundwork, the final method is simplicity itself.

    def prune_tree(self, alpha):
        best_prune=self.get_best_prune(alpha)
        if best_prune==None:
            return
        else:
            self.fuse_vertex(best_prune)
            self.prune_tree(alpha)

And that is that! Let’s give it a whirl. I’ve added a method called train to the class which you can see in the code I have put up, this allows us to pass a dictionary of parameters to build the tree. You can also see that I have seperated out the regression and classificiation trees into their own subclasses. First we try \alpha=0, meaning do no pruning.

import cleantitanic as ct
df=ct.cleaneddf()[0]
df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
       'fare', 'embarked']]
data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
                'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
                'fare':'ordinal', 'embarked':'nominal'}
g=ClassificationTree()
parameters={'min_node_size':5, 'max_node_depth':20,
                'data_type_dict':data_type_dict,'threshold':0,
                'response':'survived'}
parameters['alpha']=0
parameters['metric_kind']='Gini'
g.train(data=df,data_type_dict=data_type_dict, parameters=parameters)
g.plot()

Which produces this monstrosity.

unprunedtree

Yikes! Let’s see it with \alpha=0.001.

alpha001treeAnd \alpha=0.00135.

alpha00135treeAnd \alpha=0.00135.

alpha0015treeAnd finally  \alpha=0.002.

alpha002treeSo you can see that the tree you get depends entirely on \alpha, so which one should you use? We will attempt to answer that question next time where we will discuss cross-validation and OOB(out of bag resampling).

Here is the code I have been promising, it also includes the means to test unseen data, which we will discuss next time.

Decision Trees Part 2: Growing Your Tree

Ok so last time we looked at what a Decision Tree was, and how to represent one in Python using our DecisionNode, DecisionTree, PivotDecisionNode and PivotDecisionTree classes. We also built a very basic one for the Titanic Kaggle data with a single split on ‘sex’.

As might have been apparent even in that single example, building decision trees by hand is tedious, and what’s worse is that you have to guess what attributes and pivots to use. What we need is some automation.

Thinking aloud (so to speak) we want a new method for PivotDecisionTree that grows a particular node. Let’s just try and write such a method and fill it out as we go along.

    def grow_node(self, node):
        if node.parent==None:
            node.local_data=node.local_filter(data=self.data)
        if self.stopping_condition(node):
            return
        else:
            try:
                best_split=self.get_best_split(node)

            except StopIteration:
                return
            self.split_vertex(node, split_attribute=best_split[1],
                              pivot=best_split[2])
            for child in node.children:
                child.local_data=child.local_filter(data=node.local_data)
                self.grow_node(node=child)

The idea is simple, start with a node, if some stopping condition is achieved, do nothing, else find the best attribute to split on, and the best pivot for that attribute, and split on it. This is a greedy algorithm idea.  Then repeat the process with each of the children.

Now how to find the best split.

    def get_best_split(self, node):
        gen = self.iter_split_eval(node)
        first=next(gen)
        best_split=first
        for split in gen:
            if split[0] < best_split[0]:
                best_split = split
        return best_split

So this method iterates through the evaluations of the possible splits, and returns the lowest one, assuming lowest is what we want. Now we fill in the next function we haven’t defined.

    def iter_split_eval(self, node):
        for split in self.iter_split(node):
            if node.children==None:
                pass
            else:
                for child in node.children:
                    child.local_data=child.local_filter(node.local_data)
                ret = [self.node_purity(node),
                node.split_attribute, node.pivot]
                yield ret

    def iter_split(self, node):
        for attribute in self.data.columns:
            if attribute != self.response:
                for pivot in self.get_pivots(node.local_data, attribute):
                    self.fuse_vertex(node)
                    self.split_vertex(vertex=node, pivot=pivot,
                                      split_attribute=attribute)
                    yield

This is a generator object since we use the keyword yield, this allows to iterate over it, without creating the whole list all at once.

We try each attribute, and each pivot, split on it, and report our evaluation of that split using something called ‘node_purity’ which takes a metric as an argument. The metric will be what controls how we evaluate a split.

First we need a way of iterating over all possible splits. Observe that there are two cases: nominal data where there are 2^{k-1}-1 possible binary splits if there are k labels; ordinal data where there are k-1 splits. In the first case the pivot is a set, in the second its a number.

So let’s try something like:

    def get_pivots(self, data, attribute):
        if self.data_type_dict[attribute]=='ordinal':
            max_pivot = max(data[attribute].unique())
            for pivot in data[attribute].unique():
                if pivot < max_pivot:
                    yield pivot
        elif self.data_type_dict[attribute]=='nominal':
            values = data[attribute].unique()
            n=len(values)
            if n<=1:
                return
            n=floor(float(n)/2)
            n=int(n)
            for r in range(1,n+1):
                for pivot in combinations(values, r):
                        yield set(pivot) #we find all subsets of a certain size, but we only need to go
                                         #up to about half way, otherwise we duplicate.

This is another generator object which separates the two cases we discussed above. In the ‘nominal’ case it makes use of a handy method in the itertools module to produce all subsets of a given size.

Also notice that we need to tag our data columns with ‘ordinal’ or ‘nominal’ at some point.

Ok so now we’ve sorted out the outline of the logistics, we need to say how we want to evaluate splits. What we want in a split is for the data to be less mixed up with respect to the response.

In the classification case there are two main metrics we can use: Gini and Entropy.
Suppose our response can take k values: v_1, \dots , v_k , then we can associate to our data a probability vector p=(p_1, \dots ,p_k) that is the proportions of each response in the dataset.

The Gini metric tells us the probability of a random datum being mislabelled if we assign labels according to the probability vector : G(p) = \sum \limits _{i=1}^k p_i(1-p_i).

Entropy tells us, given that we know a data point is in the node, what is our expected information gain (in bits) of finding out what its response is:
E(p) = - \sum \limits _{i=1}^k p_i \log_2 p_i .

By convention we write 0 \log _2(0) = 0.

Notice that both of these metrics are maximised when the data is perfectly mixed up (or impure, that is p_i=p_j \ \forall i,j, and minimised when the node not mixed at all (pure), that is \exists i: p_i=1.

So we are seeking the former case, we want the data to be mostly in a single class.

These are very easy to implement:

    def Gini(self,prob_vector):
        return sum(p*(1-p) for p in prob_vector)

    def Entropy(self,prob_vector):
        def entropy_summand(p):
            if p==0:
                return 0
            else:
                return -p*log(p,2)
        return sum(entropy_summand(p) for p in prob_vector)

Now in the regression case we can measure mixed-upness using the residual some of squares (RSS), this is the sum of the squared errors at a leaf based on our prediction, the mean.

If m_l is the average response at a leaf, and y_i the response of the i^{th} data point at the node, then RSS(l)=\sum \limits _{i}(y_i-m_l)^2.

    def RSS(self, filtered_data):
        prediction = np.mean(filtered_data[self.response])
        dev = sum((y-prediction)**2 for y in filtered_data[self.response])
        return dev

We also make some helper functions for the metrics.

    def metric(self, filtered_data, kind='Entropy'):
        if kind=='RSS':
            return self.RSS(filtered_data=filtered_data)
        else:
            prob_vector=self.get_prob_vector(filtered_data)
            if kind=='Entropy':
                return self.Entropy(prob_vector)
            elif kind=='Gini':
                return self.Entropy(prob_vector)

    def get_prob_vector(self, data):
        size = float(len(data))
        value_count=data[self.response].value_counts()
        prob_vector = [value_count[key]/size for key in value_count.keys()]
        return prob_vector

Now we have the metrics sorted, we can make a function to handle calculating node purity. The idea is that if a node is a leaf, it simply returns the metric evaluated on the subset of the data corresponding to its filter, if it is a parent, it returns the weighted sum of the node purities of its children.

    def node_purity(self, node):
        if node.children==None:
            return self.metric(node.local_data, kind=self.metric_kind)
        else:
            left_raw_purity=self.node_purity(node=node.left)
            right_raw_purity=self.node_purity(node=node.right)
            left_size= float(node.left.size)
            right_size = float(node.right.size)
            left_purity= (left_size/node.size)*left_raw_purity
            right_purity = (right_size/node.size)*right_raw_purity
            return left_purity+right_purity

And since it makes to reference to pivots and the like, it can go in DecisionTree.

Now we are pretty much there, we just need to say something about stopping conditions. Some natural candidates are node depth (how many levels down from the root it is), node size, total number of leaves,or a threshold for the metric.
The last one won’t work because we can’t know in advance how well the data can be partitioned by the attributes, but this is quite a general problem in machine learning, parameter-tuning.

Node depth sounds reasonable, so we could add that into the DecisionNode and have it be updated when splitting a node.

Node size is also something we already keeping track of, but it has the disadvantage that we don’t know in advance how small the nodes can become.

    def stopping_condition(self, node):
        if self.max_node_depth <= node.depth:
            return True
        elif node.size <= self.min_node_size:
            return True
        else:
            return False

Now we put it together.

    def grow_tree(self):
        self.create_vertex()
        self.set_root(self.vertices[0])
        self.leaves.add(self.vertices[0])
        self.grow_node(node=self.get_root())
        self.set_predictions

One other thing I want to do is interface to igraphs a bit better on the plotting front, but I will hide the details from you guys for now as it’s not so interesting, but once I put the code up you can examine it at leisure.

Now we are ready to grow our tree.

import cleantitanic as ct
df=ct.cleaneddf()[0]
df=df[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
       'fare', 'embarked']]
data_type_dict={'survived':'nominal', 'pclass':'ordinal', 'sex':'nominal',
                'age':'ordinal', 'sibsp':'ordinal', 'parch':'ordinal',
                'fare':'ordinal', 'embarked':'nominal'}

t=PivotDecisionTree(data_type_dict=data_type_dict,
                    metric_kind='Gini', max_node_depth=3, data=df)
t.response='survived'
t.grow_tree()
g=t.plot()

And it looks like:
titanic_decision_tree1
Where left corresponds to the vertex test being True. So this is starting to look good, but next we need to
learn how to prune it because we don’t yet have very informed criteria for stopping, but it is better to
ask forgiveness than permission, as the Pythonistas say!

The code will be up soon when it is all put together.

Decision Trees Part 1: Representation

Now that we have data structures describing graphs, we can put them to work and do some machine learning! We’re going to be looking at decision trees. The wikipedia page on Decison Tree learning has an example from our favourite data set, the Titanic survivors, so let’s steal their picture to illustrate.

The idea is you start at the root, each node describes an attribute, here the root attribute is ‘sex’. Then the downward edges represent a partition of the attribute, in a binary tree this just means a yes or no question, which leads you to another node, and so on, until you arrive at a leaf.

Each leaf represents a subset of the data, and in the picture the survival rates for this subset are given. So if you have a bunch of training data (where the response is known) you can use this to let each leaf give a prediction by taking the mode (or mean for numeric response).

A regression tree is where the response is numeric, and a classification tree is where the response is categorical.

The real challenge is coming up with a tree to use as your model, but the first order of business is to be able to represent a decision tree. So let’s get to work.

To simplify things we’ll insist our trees be binary, and we don’t lose any generality by doing so.

The basic idea is that the nodes will have a simple boolean test on an attribute, for numeric it returns true if datapoint.attribute <= pivot and false otherwise, and for a categorical it returns true if datapoint.attribute is in pivot and false otherwise.

Thus each node has a local ‘filter’ on the dataset, which determines its local data. Similarly we can ask a node: to which of your children does this datapoint belong?

I want the nodes themselves to keep track of a lot of this as it feels more natural, so I begin by creating a subclass DecisionNode.

This will be designed to accept a pandas dataframe so we can help ourself to all of its goodies.

import tigraphs as tig
import numpy as np
import pandas as pd
class DecisionNode(tig.BasicNode,object):
    def __init__(self, **kwargs):
        super(DecisionNode, self).__init__(**kwargs)
        self.left=None
        self.right=None
        self.children=None
        self.parent=None
        self.prediction=None
        self.tally={}
        self.total=0.0
        self.size=0
        self.depth =0
        self.local_data=None
    def local_filter(self, data): #filters data
        pass first=False)
    def get_next_node_or_predict(self, datapoint):
        pass

As you can see I have kept it non-specific on how it filters so we can reuse it later. Since it will be in a binary tree, I’ve given it some convenient class attributes to keep track of the left child (corresponding to passing the test) and right child (corresponding to failing the test).

Now we’ll create a binary tree, inheriting from the nary-tree class we made last time and also using our new DecisionNode class.

class DecisionTree(tig.return_nary_tree_class(directed=True), object):
    def __init__(self, data=None,response='',Vertex=DecisionNode,**kwargs):
        super(DecisionTree, self).__init__(N=2, Vertex=Vertex, **kwargs)
        self.data =data
        self.response=response #data attribute we're trying to predict
    def split_vertex(self, vertex):
        super(DecisionTree, self).split_vertex(vertex)
        vertex.left = vertex.children[0]
        vertex.right = vertex.children[1]
    def fuse_vertex(self, vertex):
        super(DecisionTree, self).fuse_vertex(vertex)
        vertex.left, vertex.right = None, None

That was pretty straightforward, now we’ll be more specific and introduce the pivots to chop-up the data with.

class PivotDecisionNode(DecisionNode,object):
    def __init__(self, **kwargs):
        super(PivotDecisionNode, self).__init__(**kwargs)
        self.pivot=None
        self.split_attribute = None
    def local_filter(self, data): #filters the data based on parent's pivot
        if self.parent==None:
            self.size = len(data)
            return data
        attribute = self.parent.split_attribute
        pivot = self.parent.pivot
        if type(pivot)==set:
            ret= data[attribute].isin(pivot)
        else:
            ret = data[attribute] &lt;= pivot
        if self == self.parent.left:
            ret=data[ret]
        else:
            ret=data[~ret]
        self.size=len(ret)
        return ret
    def get_next_node_or_predict(self, datapoint): #tells us where to find a prediction, or returns one
        if self.children == None:
            return self.prediction
        else:
            if type(self.pivot) ==set:
                if datapoint[self.split_attribute] in self.pivot:
                    return self.left
                else:
                    return self.right
            else:
                if datapoint[self.split_attribute] &lt;=self.pivot:
                    return self.left
                else:
                    return self.right

So the node can now filter the data based on an attribute and pivot, notice that left and right children from the same parent have opposite filters.

Now we’ll use this to create a PivotDecisionTree class.

class PivotDecisionTree(DecisionTree, object):
    def __init__(self, data_type_dict={},metric_kind='Gini',
                 tree_kind='classification',
                 Vertex=PivotDecisionNode, min_node_size=5,
                 max_node_depth=4, **kwargs):
        super(PivotDecisionTree, self).__init__(Vertex=Vertex, **kwargs)
        self.data_type_dict=data_type_dict
        self.metric_kind=metric_kind
        self.tree_kind=tree_kind
    def split_vertex(self, vertex, split_attribute, pivot):
        super(PivotDecisionTree, self).split_vertex(vertex)
        vertex.pivot, vertex.split_attribute = pivot, split_attribute
    def fuse_vertex(self, vertex):
        super(PivotDecisionTree, self).fuse_vertex(vertex)
        self.pivot, self.split_attribute = None, None
    def create_full_n_level(self, *args, **kwargs):
        raise AttributeError('This method is not appropriate as pivots are not specified')
    def set_node_prediction(self,node):
        if self.tree_kind == 'classification': #returns a probability for each class
            node.prediction=node.local_data[self.response].value_counts()
            node.size = sum(node.prediction[key] for key in node.prediction.keys())
            node.size=float(node.size)
            node.prediction={ key : node.prediction[key]/node.size
                              for key in node.prediction.keys() }
        elif self.tree_kind == 'regression': #returns mean of the responses
            node.prediction = node.local_data[self.response].mean()
            node.prediction=node.prediction[self.response].mean()
    def set_predictions(self):
        for node in self.vertices:
            self.set_node_prediction(node)

And now we can represent Decision Trees! To test it, let’s create a very simple one for the Titanic dataset from Kaggle. We’ll just have a single split on ‘sex’, which should be reasonable recalling this plot we produced awhile back.

So we’ll produce this decision tree ‘by hand’, and next time we’ll look at how to automate the process, ‘growing’ trees.

t=PivotDecisionTree()
t.create_vertex()
t.set_root(t.vertices[0])
root = t.get_root()
t.leaves.add(root)
t.split_vertex(vertex=t.get_root(), split_attribute='sex', pivot=set(['female']))
import cleantitanic as ct
data = ct.cleaneddf()[0]
t.response='survived'
root = t.get_root()
root.local_data=root.local_filter(data)
for child in root.children:
    child.local_data=child.local_filter(root.local_data)
t.set_predictions()
for leaf in t.leaves:
    print leaf.prediction
#producing the following output
{0: 0.25796178343949044, 1: 0.7420382165605095}
{0: 0.81109185441941078, 1: 0.18890814558058924}

From looking at the plot above I think you can guess which way around they are. As you can see creating a tree by hand is tedious and is best left to computers.

If you want to learn more I highly recommend having a read of this.

Plotting With Python

So we are continuing from last time, with the Titanic data set from Kaggle. Since our ‘response’, the survived column, is categorical or discrete, the easiest kind of plot to read will also be discrete, like bar charts.

First let’s load in the cleaned data we produced, and any libraries we might need.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import cleantitanic as ct #make sure that python knows
                                  #to look where the script is by changing PYTHONPATH
In [4]: import matplotlib.pylab as plt
In [5]: data = ct.cleaneddf()
In [6]: traindf, testdf =data[0], data[1]

Some natural columns to start with are embarked, pclass and sex, so let’s make a bar chart for each displaying the proportions of survived and not survived.

In [7]: def proportionSurvived(discreteVar):
   ...:     by_var = traindf.groupby([discreteVar,'survived']) #groups the data act on groups
                                                               #seperately
   ...:     table = by_var.size() #gets group size counts, hashed by the two variables
   ...:     table = table.unstack() #splits the data into 2 columns, 0, 1, each indexed by the
                                    #other variable
   ...:     normedtable = table.div(table.sum(1), axis=0) #divides the counts by the totals
   ...:     return normedtable
   ...:

In [8]: discreteVarList = ['sex', 'pclass', 'embarked']
In [9]: fig1, axes1 = plt.subplots(3,1) #creates a 3x1 blank plot
In [10]: for i in range(3): #now we fill in the subplots
   ....:     var = discreteVarList[i]
   ....:     table = proportionSurvived(var)
   ....:     table.plot(kind='barh', stacked=True, ax=axes1[i])
   ....:
In [11]: fig1.show() #displays the plot, might not need this if running in 'interactive' mode

You should get something that looks like:
proportionsurvivedgenderclassembarked

As you can see it was a dangerous time to be a male! You were also more likely to live if you were in a higher class, and even where you embarked seemed to have an effect. What we would like now is to understand the variation not explained by sex, i.e. how can we predict which females will die and males will live?

To do this, we can try and split the data into male and female, then see what effect pclass and embarked had on survival in each group. I’ll also show how to tinker with of the display
options of the plots.

In [12]: fig2, axes2 = plt.subplots(2,3)
In [13]: genders=traindf.sex.unique()
In [14]: classes=traindf.pclass.unique()
In [15]: def normrgb(rgb):   #this converts rgb codes into the format matplotlib wants
   ....:     rgb = [float(x)/255 for x in rgb]
   ....:     return rgb
   ....:
In [16]: darkpink, lightpink =normrgb([255,20,147]), normrgb([255,182,193])
In [16]: darkblue, lightblue = normrgb([0,0,128]),normrgb([135,206, 250])
In [17]: for gender in genders:
   ....:     for pclass in classes:
   ....:         if gender=='male':
   ....:             colorscheme = [lightblue, darkblue] #blue for boys
   ....:             row=0
   ....:         else:
   ....:             colorscheme = [lightpink, darkpink] #pink for a girl
   ....:             row=1
   ....:         group = traindf[(traindf.sex==gender)&(traindf.pclass==pclass)]
   ....:         group = group.groupby(['embarked', 'survived']).size().unstack()
   ....:         group = group.div(group.sum(1), axis=0)
   ....:         group.plot(kind='barh', ax=axes2[row, (int(pclass)-1)], color=colorscheme, stacked=True, legend=False).set_title('Class '+pclass).axes.get_xaxis().set_ticks([])
   ....:
In [18]: plt.subplots_adjust(wspace=0.4, hspace=1.3)
In [19]: fhandles, flabels = axes2[1,2].get_legend_handles_labels()
In [20]: mhandles, mlabels = axes2[0,2].get_legend_handles_labels()
In [21]: plt.figlegend(fhandles, ('die', 'live'), title='Female', loc='center', bbox_to_anchor=(0.06, 0.45, 1.1, .102))
Out[21]:
In [22]: plt.figlegend(mhandles, ('die', 'live'), 'center', title='Male',bbox_to_anchor=(-0.15, 0.45, 1.1, .102))
Out[22]:
In [23]: fig2.show()

And now you will have a plot that looks like:
genderembarkedpclassbarh

I’m sure that you’re sick of bar charts by now, but one last thing I want to demonstrate is binning or discretisation of numeric values. Let’s try that for age and fare.

This time I will show a basic plot to point out the problems. First we bin ages and fare:

bins = [0,5,14, 25, 40, 60, 100]
binNames =['Young Child', 'Child', 'Young Adult', 'Adult', 'Middle Aged', 'Older']
binAge = pd.cut(traindf.age, bins, labels=binNames)
#cut uses the given bins, or if passed an integer, divides the range evenly
binFare = pd.qcut(traindf.fare, 3, labels=['Cheap', 'Middle', 'Expensive'])
#qcut does quantiles

Let’s just try some plots and move it towards something presentable.

fig3, axes3 = plt.subplots(1,2)
binVars = [binAge, binFare]
for i in range(2):
    group = traindf.groupby([binVars[i], 'sex', 'survived'])
    group = group.size().unstack()
    group = group.div(group.sum(1), axis=0)
    ax = group.plot(kind='barh', stacked=True, ax=axes3[i])

Producing something like:
agefare1

Not very pretty. I would like to remove sex from the yticks, and have this be implied by the colouring of the bars, maybe also remove the xticks since they don’t add much. We may also have to sort out the spacing and legends.

fig3, axes3 = plt.subplots(1,2)
binVars = [binAge, binFare]
varNames = ['Age', 'Fare']
badStringList=['(', ')', 'female', 'male', ',']
def removeBadStringFromString(string, badStringList):
    for badString in ['(', ')', 'female', 'male']: #notice that you want female before male
        string = string.replace(badString, '')
    return string

def removeBadStringFromLabels(ax, badStringList):
    labels = [item.get_text() for item in ax.get_yticklabels()]
    labels = [removeBadStringFromString(label) for label in labels]
    return labels
for i in range(2):
    group = traindf.groupby([binVars[i], 'sex', 'survived'])
    group = group.size().unstack()
    group = group.div(group.sum(1), axis=0)
    cols = [[lightpink, lightblue],[darkpink, darkblue]]
    group.plot(kind='barh', stacked=True, ax=axes3[i],legend=False, color=cols)
    labels = removeBadStringFromLabels(axes3[i], badStringList)
    axes3[i].set_yticklabels(labels)
    axes3[i].get_xaxis().set_ticks([])
    axes3[i].set_ylabel('')
    axes3[i].set_title(varNames[i])

    if i==1:
        axes3[i].yaxis.tick_right()
        axes3[i].yaxis.set_label_position("right")

handles, labels = axes3[0].get_legend_handles_labels()
plt.figlegend(handles[0], ['die','die'], loc='upper center')
plt.figlegend(handles[1], ['live','live'], loc='lower center')
fig3.show()

And this looks like:
agefare2

And using very similar code, provided in the file at the end of the post, we can look at the number
of siblings/spouses and parents/children:

sibsparch

You can see that for females at least it pays to travel alone.

Here is the code to do the whole thing in one go, I encourage you to play around with the plotting options to get a feel for as it as it is a bit fussy.