
A complete guide to getting 0.79903 in Kaggle’s Titanic Competition with Python

OK, this is going to be a quick recap of all the work we have done so far on this blog, but it should be accessible to first-time readers too.

Quickstart: to get underway immediately just download the following python scripts:

Data Cleaning

Graph Theory

Decision Trees

Final Model

Then simply run the last one, making sure that the other files are on your PYTHONPATH so that they can be imported, as in the sketch below. Otherwise read on.
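If it helps, here is one way to do that from inside Python; the directory and module name are placeholders for wherever you saved the files and whatever the final script is called:

import sys
sys.path.append("/path/to/downloaded/scripts")  # your download directory

import final_model  # hypothetical name for the final model script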

Getting Started

Kaggle run a competition where you must predict the survivors of the Titanic. The first step is to download the data: you'll need to grab the training data and also the test data. This guide is going to be using Python, so you'll also need that. I recommend Python(x,y) with Spyder, which you can download here.

Munging and Plotting

Once you’ve done that, you’ll need to clean the data. That involves getting rid of spurious values, filling in missing values, discretising numeric attributes, and creating a template for a submission to Kaggle. The code that does all of that can be found here.

To read a guide to cleaning the data and producing that code, click here.
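If you just want the flavour of those steps without downloading anything, here is a minimal pandas sketch of the same ideas. It is not the linked script: the column names are Kaggle's standard ones, and the particular choices (median age, most common port, fixed age bands) are just placeholders.

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Age"].median())   # crude imputation
    df["Embarked"] = df["Embarked"].fillna("S")        # most common port
    df["AgeBand"] = pd.cut(df["Age"], [0, 16, 32, 48, 64, 100], labels=False)

# A submission template: predict that nobody survives, then overwrite
# the Survived column with real predictions later.
pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": 0}) \
  .to_csv("submission.csv", index=False)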

Great, so now you’ve got your data; it’s time to get to know it a little. The best way to get to know data is through visualisation, so here is some code to produce some plots of the data, which you can tinker with. To read a guide to creating these plots, click here.
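As a taster, this is the sort of one-off plot that is useful at this stage. It is not the code from the linked post, just a pandas/matplotlib sketch assuming Kaggle's standard column names:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
train.groupby("Pclass")["Survived"].mean().plot(kind="bar")
plt.ylabel("survival rate")
plt.title("Survival rate by passenger class")
plt.show()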

Graphs and Trees in Python

So now you know something about Python, and also something about the Titanic data set. Our next goal is to produce a decision tree model for the data. Before that I created some Python code to represent graphs; to learn the very basics of graph theory you can click here.

Otherwise you will want to download the tigraph module I created from here. To learn about how I created it, and some related topics in Python programming, you can read about object-oriented programming here, decorators in Python here, and how to represent graphs in Python here and here. Note that to produce plots of the graphs you will need the igraph module for Python.
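For a sense of what "representing graphs in Python" means, here is a bare-bones adjacency-list class. It is illustrative only; the tigraph module itself is richer and uses igraph for plotting.

class Graph:
    """A bare-bones undirected graph as an adjacency list."""
    def __init__(self):
        self.adj = {}
    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)
    def neighbours(self, v):
        return self.adj[v]

g = Graph()
g.add_edge("root", "male")
g.add_edge("root", "female")
print(g.neighbours("root"))  # {'male', 'female'}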

Now that we can represent graphs in Python, we can make decision trees. The decision tree module can be downloaded here.

To learn about how the trees were created, you can start here with their representation, then read here about growing the tree and here about pruning it.
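As a rough illustration of the kind of computation involved in growing a tree, here is entropy and information gain, a standard splitting criterion; the linked posts explain the module's own approach in detail, which may differ.

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting labels into groups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Six passengers, split on some candidate attribute:
survived = [1, 1, 1, 0, 0, 0]
print(information_gain(survived, [[1, 1, 0], [1, 0, 0]]))  # about 0.082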

Putting it all together

The way I have used decision trees is explained here: in essence, I use training and test data to prune a decision tree, then swap the roles of the training and test data and repeat several times; finally I combine the predictions by taking the mode. The final piece of code to produce the model is here. If you run it you should get a score of 0.79903, which the leaderboard tells me is 230th.
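If you want to see the shape of that procedure without the modules above, here is a runnable sketch that uses scikit-learn's decision trees as a stand-in; cost-complexity pruning replaces the pruning step described in the posts, which is not the same thing, and the feature set and alpha values are placeholders.

from collections import Counter
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
features = ["Pclass", "SibSp", "Parch"]  # placeholder feature set

X, y = train[features].values, train["Survived"].values
X_test = test[features].values

predictions = []
for fit_idx, prune_idx in KFold(5, shuffle=True, random_state=0).split(X):
    # Fit on one part and use the held-out part to choose a pruning
    # strength; then the roles swap on the next fold.
    best = max(
        (DecisionTreeClassifier(ccp_alpha=a, random_state=0)
             .fit(X[fit_idx], y[fit_idx])
         for a in (0.0, 0.001, 0.01, 0.1)),
        key=lambda clf: clf.score(X[prune_idx], y[prune_idx]),
    )
    predictions.append(best.predict(X_test))

# Combine the five sets of predictions by taking the per-passenger mode.
final = [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]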

Here is a sample tree from those I took the mode of:

[Figure: a sample decision tree from the ensemble]

There is a great deal of room for improvement: I have not created any new features, such as extracting titles from the names, and I have done the crudest possible missing data imputation, as well as using decision trees, which are basically the simplest model you can make.

Please let me know what you think or if you have any problems.


