Plotting With Python

We are continuing from last time with the Titanic data set from Kaggle. Since our ‘response’, the survived column, is categorical, the easiest plots to read will also be discrete ones, such as bar charts.

First let’s load in the cleaned data we produced, and any libraries we might need.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import cleantitanic as ct #make sure that python knows
                                  #to look where the script is by changing PYTHONPATH
In [4]: import matplotlib.pyplot as plt
In [5]: data = ct.cleaneddf()
In [6]: traindf, testdf = data[0], data[1]
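
If changing PYTHONPATH is inconvenient, the module search path can also be extended from inside Python before the import. This is only a sketch: the `scripts` directory below is a hypothetical location for cleantitanic.py, so substitute wherever you actually saved it.

```python
import os
import sys

# Hypothetical location of cleantitanic.py; replace with the real directory.
script_dir = os.path.abspath(os.path.join('.', 'scripts'))

# Put the directory at the front of the search path so that
# `import cleantitanic` looks there first.
if script_dir not in sys.path:
    sys.path.insert(0, script_dir)
```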

Some natural columns to start with are embarked, pclass and sex, so let’s make a bar chart for each displaying the proportions of survived and not survived.

In [7]: def proportionSurvived(discreteVar):
   ...:     by_var = traindf.groupby([discreteVar,'survived']) #groups the rows by both variables
   ...:     table = by_var.size() #gets group size counts, indexed by the two variables
   ...:     table = table.unstack() #splits the counts into 2 columns, 0 and 1, each indexed by
   ...:                             #the other variable
   ...:     normedtable = table.div(table.sum(1), axis=0) #divides the counts by the row totals
   ...:     return normedtable
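
To see what each step of that function produces, here is the same groupby, size, unstack, div chain on a tiny made-up table. The rows are invented purely for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Made-up rows purely for illustration; not the Titanic data.
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 0, 1],
})

counts = toy.groupby(["sex", "survived"]).size()    # Series indexed by (sex, survived)
table = counts.unstack()                            # rows: sex, columns: survived 0 and 1
proportions = table.div(table.sum(axis=1), axis=0)  # each row now sums to 1

print(proportions)
```

If some combination never occurs in the data, unstack leaves a NaN in that cell; passing `fill_value=0` to unstack is one way to turn those into zero counts.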

In [8]: discreteVarList = ['sex', 'pclass', 'embarked']
In [9]: fig1, axes1 = plt.subplots(3,1) #creates a 3x1 blank plot
In [10]: for i in range(3): #now we fill in the subplots
   ....:     var = discreteVarList[i]
   ....:     table = proportionSurvived(var)
   ....:     table.plot(kind='barh', stacked=True, ax=axes1[i])
In [11]: fig1.show() #displays the plot, might not need this if running in 'interactive' mode

You should get something that looks like:

As you can see it was a dangerous time to be a male! You were also more likely to live if you were in a higher class, and even where you embarked seemed to have an effect. What we would like now is to understand the variation not explained by sex, i.e. how can we predict which females will die and males will live?

To do this, we can try to split the data into male and female, then see what effect pclass and embarked had on survival in each group. I’ll also show how to tinker with some of the display options of the plots.
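
The colours in the listing that follows are written as 0-255 RGB triples, which matplotlib does not take directly: it wants the components as floats between 0 and 1, or a hex string. Here is a small sketch of both conversions. The helper names `scale_rgb` and `rgb_to_hex` are made up for this example.

```python
def scale_rgb(rgb):
    """Scale 0-255 RGB components to floats between 0 and 1."""
    return [x / 255 for x in rgb]

def rgb_to_hex(rgb):
    """Format 0-255 RGB components as a '#rrggbb' string."""
    return '#{:02x}{:02x}{:02x}'.format(*rgb)

dark_pink = [255, 20, 147]
scaled = scale_rgb(dark_pink)
hex_code = rgb_to_hex(dark_pink)

print(scaled)
print(hex_code)
```

Either form can be passed wherever a plot takes a colour, so which one you use is mostly a matter of taste.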

In [12]: fig2, axes2 = plt.subplots(2,3)
In [13]: genders=traindf.sex.unique()
In [14]: classes=traindf.pclass.unique()
In [15]: def normrgb(rgb):   #this converts rgb codes into the format matplotlib wants
   ....:     rgb = [float(x)/255 for x in rgb]
   ....:     return rgb
In [16]: darkpink, lightpink = normrgb([255,20,147]), normrgb([255,182,193])
In [17]: darkblue, lightblue = normrgb([0,0,128]), normrgb([135,206,250])
In [18]: for gender in genders:
   ....:     for pclass in classes:
   ....:         if gender=='male':
   ....:             colorscheme = [lightblue, darkblue] #blue for boys
   ....:             row=0
   ....:         else:
   ....:             colorscheme = [lightpink, darkpink] #pink for a girl
   ....:             row=1
   ....:         group = traindf[(traindf.sex==gender)&(traindf.pclass==pclass)]
   ....:         group = group.groupby(['embarked', 'survived']).size().unstack()
   ....:         group = group.div(group.sum(1), axis=0)
   ....:         ax = group.plot(kind='barh', ax=axes2[row, (int(pclass)-1)], color=colorscheme, stacked=True, legend=False)
   ....:         ax.set_title('Class '+str(pclass)) #str() in case pclass is stored as a number
   ....:         ax.get_xaxis().set_ticks([]) #hides the x ticks
In [19]: plt.subplots_adjust(wspace=0.4, hspace=1.3)
In [20]: fhandles, flabels = axes2[1,2].get_legend_handles_labels()
In [21]: mhandles, mlabels = axes2[0,2].get_legend_handles_labels()
In [22]: plt.figlegend(fhandles, ('die', 'live'), title='Female', loc='center', bbox_to_anchor=(0.06, 0.45, 1.1, .102))
In [23]: plt.figlegend(mhandles, ('die', 'live'), title='Male', loc='center', bbox_to_anchor=(-0.15, 0.45, 1.1, .102))
In [24]: fig2.show()

And now you will have a plot that looks like:

I’m sure that you’re sick of bar charts by now, but one last thing I want to demonstrate is binning or discretisation of numeric values. Let’s try that for age and fare.

This time I will show a basic plot to point out the problems. First we bin ages and fare:

bins = [0,5,14, 25, 40, 60, 100]
binNames =['Young Child', 'Child', 'Young Adult', 'Adult', 'Middle Aged', 'Older']
binAge = pd.cut(traindf.age, bins, labels=binNames)
#cut uses the given bins, or if passed an integer, divides the range evenly
binFare = pd.qcut(traindf.fare, 3, labels=['Cheap', 'Middle', 'Expensive'])
#qcut does quantiles
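
To make the difference between the two concrete, here is a small sketch on made-up ages and fares (not the Titanic values), using the same bin edges and labels as above.

```python
import pandas as pd

# Made-up values purely for illustration; not the Titanic data.
ages = pd.Series([3, 10, 20, 30, 50, 70])
fares = pd.Series([5, 7, 8, 10, 15, 30, 50, 80, 100])

bins = [0, 5, 14, 25, 40, 60, 100]
bin_names = ['Young Child', 'Child', 'Young Adult', 'Adult', 'Middle Aged', 'Older']

# cut: fixed edges; intervals are closed on the right by default,
# i.e. (0, 5], (5, 14], and so on.
age_groups = pd.cut(ages, bins, labels=bin_names)

# qcut: edges are chosen from the data so that each bin holds
# roughly the same number of rows.
fare_groups = pd.qcut(fares, 3, labels=['Cheap', 'Middle', 'Expensive'])

print(age_groups.tolist())
print(fare_groups.value_counts())
```

One thing to watch with cut: a value that falls outside the edges becomes NaN, and because the first interval is open on the left, that includes an age of exactly 0 unless you pass `include_lowest=True`.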

Let’s just try some plots and move it towards something presentable.

fig3, axes3 = plt.subplots(1,2)
binVars = [binAge, binFare]
for i in range(2):
    group = traindf.groupby([binVars[i], 'sex', 'survived'])
    group = group.size().unstack()
    group = group.div(group.sum(1), axis=0)
    ax = group.plot(kind='barh', stacked=True, ax=axes3[i])

Producing something like:

Not very pretty. I would like to remove sex from the yticks, and have this be implied by the colouring of the bars, maybe also remove the xticks since they don’t add much. We may also have to sort out the spacing and legends.
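
Cleaning the tick labels comes down to plain string replacement, and the order of the replacements matters. The sketch below assumes the labels look something like `(Child, female)`; the helper `strip_tokens` is made up for this illustration.

```python
def strip_tokens(label, tokens):
    """Remove each token from label in the order given, then trim whitespace."""
    for token in tokens:
        label = label.replace(token, '')
    return label.strip()

label = '(Child, female)'

# 'female' is removed before 'male', so the whole word disappears.
good = strip_tokens(label, ['(', ')', ',', 'female', 'male'])

# 'male' is removed first, which leaves the 'fe' of 'female' behind.
bad = strip_tokens(label, ['(', ')', ',', 'male', 'female'])

print(repr(good))
print(repr(bad))
```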

fig3, axes3 = plt.subplots(1,2)
binVars = [binAge, binFare]
varNames = ['Age', 'Fare']
badStringList = ['(', ')', 'female', 'male', ','] #notice that you want female before male
def removeBadStringFromString(string, badStringList):
    for badString in badStringList:
        string = string.replace(badString, '')
    return string.strip() #drops the space left behind by the removed text

def removeBadStringFromLabels(ax, badStringList):
    labels = [item.get_text() for item in ax.get_yticklabels()]
    labels = [removeBadStringFromString(label, badStringList) for label in labels]
    return labels

for i in range(2):
    group = traindf.groupby([binVars[i], 'sex', 'survived'])
    group = group.size().unstack()
    group = group.div(group.sum(1), axis=0)
    cols = [[lightpink, lightblue],[darkpink, darkblue]]
    group.plot(kind='barh', stacked=True, ax=axes3[i], legend=False, color=cols)
    labels = removeBadStringFromLabels(axes3[i], badStringList)
    axes3[i].set_yticklabels(labels) #applies the cleaned labels to the plot

    if i==1:
        pass #body not shown here; see the full script at the end of the post

handles, labels = axes3[0].get_legend_handles_labels()
#handles[0] is the group of 'die' bars and handles[1] the 'live' bars;
#the first two bars in each group are female and male
plt.figlegend(handles[0], ['die','die'], loc='upper center')
plt.figlegend(handles[1], ['live','live'], loc='lower center')

And this looks like:

And using very similar code, provided in the file at the end of the post, we can look at the number
of siblings/spouses and parents/children:


You can see that for females at least it pays to travel alone.

Here is the code to do the whole thing in one go. I encourage you to play around with the plotting options to get a feel for it, as it is a bit fussy.


