OK, so this is going to be a quick recap of all the work we have done so far on this blog, but it should be accessible to first-time readers too.
Quickstart: to get underway immediately, just download the following Python scripts:
Then simply run the last one, making sure that the other files are on your PYTHONPATH so that they can be imported. Otherwise, read on.
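If you are not sure how to get the files onto the import path, one option is to add their folder at the top of the script you run. This is only a sketch: the folder name titanic_scripts is a made-up placeholder, so swap in wherever you saved the downloads.

```python
import sys
from pathlib import Path

# Hypothetical folder holding the downloaded scripts; change it to your own.
scripts_dir = Path.home() / "titanic_scripts"

# Add the folder to the import search path if it is not already there.
if str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))
```

Setting the PYTHONPATH environment variable to the same folder before starting Python has the same effect without touching the code.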
Kaggle have a competition in which you must predict who survived the sinking of the Titanic. The first step is to download the data: you’ll need to grab the training data and also the test data. This guide uses Python, so you’ll need that as well. I recommend Python(x,y) with Spyder, which you can download here.
Munging and Plotting
Once you’ve done that, you’ll need to clean the data. That involves getting rid of silly values, filling in missing values, discretising numeric attributes, and creating a template for a submission to Kaggle. The code that does all of that can be found here.
To read a guide to cleaning the data and producing that code, click here.
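To give a flavour of what two of those steps look like, here is a minimal sketch using only the standard library. It is not the linked code: the column names, the toy rows, the median fill and the bin edges are all assumptions made up for illustration.

```python
import csv
import io
from statistics import median

# Toy rows in the style of the training file (values made up for illustration).
SAMPLE = """survived,sex,age
1,female,29
0,male,
0,male,54
1,female,4
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per passenger."""
    return list(csv.DictReader(io.StringIO(text)))

def fill_missing_age(rows):
    """Replace blank ages with the median of the known ages (a crude imputation)."""
    known = [float(r["age"]) for r in rows if r["age"].strip()]
    fallback = median(known)
    for r in rows:
        r["age"] = float(r["age"]) if r["age"].strip() else fallback
    return rows

def discretise_age(age, edges=(12, 18, 40, 60)):
    """Map a numeric age to a bin index: 0 below the first edge, 1 below the second, and so on."""
    return sum(age >= e for e in edges)

rows = fill_missing_age(load_rows(SAMPLE))
for r in rows:
    r["age_bin"] = discretise_age(r["age"])
```

With real data you would read the downloaded training file instead of the SAMPLE string, and pick bin edges that suit the attribute.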
OK, great: now that you’ve got your data, it is time to get to know it a little. The best way to get to know data is through visualisation, so here is some code to produce some plots of the data, which you can tinker with. To read a guide to creating these plots, click here.
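As a rough idea of the kind of plot that is useful here, the sketch below works out the survival rate for each group and draws it as a bar chart. It is not the linked plotting code: the passenger tuples are made up, and the function names are just ones chosen for this example. matplotlib is only imported inside plot_rates, so the calculation runs without it.

```python
from collections import defaultdict

# Toy (sex, survived) pairs, made up for illustration.
passengers = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1),
]

def survival_rate_by(pairs):
    """Return the proportion of survivors for each group."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for group, outcome in pairs:
        totals[group] += 1
        survived[group] += outcome
    return {g: survived[g] / totals[g] for g in totals}

def plot_rates(rates):
    """Draw one bar per group; needs matplotlib installed."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it
    groups = list(rates)
    plt.bar(groups, [rates[g] for g in groups])
    plt.ylabel("Proportion surviving")
    plt.ylim(0, 1)
    plt.show()

rates = survival_rate_by(passengers)
# plot_rates(rates)  # uncomment to draw the bar chart
```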
Graphs and Trees in Python
So now you know something about Python, and also something about the Titanic data set. Our next goal is to produce a decision tree model for the data. Before that, I wrote some Python code to represent graphs; to learn the very basics of graph theory, you can click here.
Otherwise you will want to download the tigraph module I created from here. To learn how I created it, and about some related topics in Python programming, you can read about object-oriented programming here, decorators in Python here, and how to represent graphs in Python here and here. Note that to produce plots of the graphs, you will need the igraph module for Python.
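For readers who just want the gist, one common way to represent a graph in Python is a dictionary mapping each vertex to the set of its neighbours. The class below is a minimal sketch of that idea. It is not the tigraph module, and its method names are assumptions chosen for this example.

```python
class Graph:
    """A tiny undirected graph stored as a dict of vertex -> set of neighbours."""

    def __init__(self):
        self.adjacency = {}

    def add_vertex(self, v):
        # setdefault leaves an existing vertex and its neighbours untouched.
        self.adjacency.setdefault(v, set())

    def add_edge(self, u, v):
        self.add_vertex(u)
        self.add_vertex(v)
        self.adjacency[u].add(v)
        self.adjacency[v].add(u)

    def neighbours(self, v):
        return self.adjacency[v]

    def edges(self):
        # Each edge is a frozenset, so (u, v) and (v, u) are counted once.
        return {frozenset((u, v)) for u in self.adjacency for v in self.adjacency[u]}

g = Graph()
g.add_edge("root", "male")
g.add_edge("root", "female")
```

A tree is just a connected graph with no cycles, which is why a graph class like this can serve as the base for a decision tree.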
Now that we can represent graphs in Python, we can make decision trees. The decision tree module can be downloaded here.
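The heart of growing a decision tree is choosing which attribute to split on at each node. The sketch below uses information gain (the reduction in entropy), which is one standard criterion. Whether the linked module uses this criterion or another one is an assumption here, and the rows are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Pick the attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Toy rows, made up for illustration.
rows = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
]
```

In these toy rows sex separates the survivors perfectly while pclass tells you nothing, so best_attribute(rows, ["pclass", "sex"], "survived") picks "sex". A full tree repeats this choice recursively on each subset.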
Putting It All Together
The way I have used decision trees is explained here. Basically, I use the training and test data to prune a decision tree, then swap the roles of the training and test data, and repeat this several times. Finally, I combine the predictions by taking the mode. The final piece of code to produce the model is here. If you run it you should get a score of 0.79903, which the leaderboard tells me puts it in 230th place.
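The last step, taking the mode, can be sketched in a few lines. This assumes each tree's predictions are a list of 0s and 1s in the same passenger order. The function name and the toy predictions are made up for this example and are not taken from the linked code.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """Combine per-tree predictions by taking the mode for each passenger."""
    combined = []
    # zip(*...) groups together the votes that all the trees cast for one passenger.
    for votes in zip(*predictions_per_tree):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical trees, four passengers.
tree_predictions = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
final = majority_vote(tree_predictions)
```

Using an odd number of trees avoids ties when there are only two classes.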
Here is a sample tree from those I took the mode of:
There is a great deal of room for improvement. I have not created any new features, such as extracting titles from the names, and I have done the crudest possible missing-data imputation. I have also used decision trees, which are basically the simplest model you can make.
Please let me know what you think or if you have any problems.