So we are continuing from last time, with the Titanic data set from Kaggle. Since our ‘response’, the survived column, is categorical or discrete, the easiest kind of plot to read will also be discrete, like bar charts.
First let’s load in the cleaned data we produced, and any libraries we might need.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    # cleantitanic is the cleaning script from last time; make sure Python
    # knows where to look for it, e.g. by changing PYTHONPATH
    import cleantitanic as ct

    data = ct.cleaneddf()
    traindf, testdf = data[0], data[1]
Some natural columns to start with are embarked, pclass and sex, so let’s make a bar chart for each displaying the proportions of survived and not survived.
    def proportionSurvived(discreteVar):
        # group the rows so that we can act on each group separately
        by_var = traindf.groupby([discreteVar, 'survived'])
        # get the group sizes, indexed by the two variables
        table = by_var.size()
        # split the counts into 2 columns, 0 and 1, each indexed by the other variable
        table = table.unstack()
        # divide the counts by the row totals to get proportions
        normedtable = table.div(table.sum(axis=1), axis=0)
        return normedtable

    discreteVarList = ['sex', 'pclass', 'embarked']
    fig1, axes1 = plt.subplots(3, 1)  # creates a 3x1 grid of blank subplots
    for i in range(3):  # now we fill in the subplots
        var = discreteVarList[i]
        table = proportionSurvived(var)
        table.plot(kind='barh', stacked=True, ax=axes1[i])
    fig1.show()  # displays the plot; might not be needed in 'interactive' mode
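As a cross-check on the groupby approach, here is a minimal sketch of the same row-normalised table built with `pd.crosstab`. The `toy` frame and the name `proportion_survived_crosstab` are made up for illustration and are not part of the original script; with the real data you would pass `traindf` instead.

```python
import pandas as pd

# Toy stand-in for traindf; these values are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

def proportion_survived_crosstab(df, discrete_var):
    """Return the proportion of each survival outcome within each level."""
    # normalize="index" divides every row by its own total
    return pd.crosstab(df[discrete_var], df["survived"], normalize="index")

table = proportion_survived_crosstab(toy, "sex")
print(table)
```

Each row of the result sums to 1, so it can be passed straight to `table.plot(kind='barh', stacked=True)` in the same way as the output of `proportionSurvived`.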
As you can see it was a dangerous time to be a male! You were also more likely to live if you were in a higher class, and even where you embarked seemed to have an effect. What we would like now is to understand the variation not explained by sex, i.e. how can we predict which females will die and males will live?
To do this, we can split the data into male and female, then see what effect pclass and embarked had on survival in each group. I’ll also show how to tinker with some of the display options of the plots.
    fig2, axes2 = plt.subplots(2, 3)
    genders = traindf.sex.unique()
    classes = traindf.pclass.unique()

    def normrgb(rgb):
        # converts 0-255 rgb codes into the 0-1 floats matplotlib wants
        rgb = [float(x) / 255 for x in rgb]
        return rgb

    darkpink, lightpink = normrgb([255, 20, 147]), normrgb([255, 182, 193])
    darkblue, lightblue = normrgb([0, 0, 128]), normrgb([135, 206, 250])

    for gender in genders:
        for pclass in classes:
            if gender == 'male':
                colorscheme = [lightblue, darkblue]  # blue for boys
                row = 0
            else:
                colorscheme = [lightpink, darkpink]  # pink for a girl
                row = 1
            group = traindf[(traindf.sex == gender) & (traindf.pclass == pclass)]
            group = group.groupby(['embarked', 'survived']).size().unstack()
            group = group.div(group.sum(axis=1), axis=0)
            ax = group.plot(kind='barh', ax=axes2[row, int(pclass) - 1],
                            color=colorscheme, stacked=True, legend=False)
            ax.set_title('Class ' + str(pclass))
            ax.get_xaxis().set_ticks([])  # an empty list hides the x ticks

    plt.subplots_adjust(wspace=0.4, hspace=1.3)
    fhandles, flabels = axes2[1, 2].get_legend_handles_labels()
    mhandles, mlabels = axes2[0, 2].get_legend_handles_labels()
    plt.figlegend(fhandles, ('die', 'live'), title='Female', loc='center',
                  bbox_to_anchor=(0.06, 0.45, 1.1, .102))
    plt.figlegend(mhandles, ('die', 'live'), title='Male', loc='center',
                  bbox_to_anchor=(-0.15, 0.45, 1.1, .102))
    fig2.show()
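As an aside, the four RGB triples used above appear to coincide with matplotlib's named CSS colours, so the conversion helper is optional. The sketch below checks this with `matplotlib.colors.to_rgb`; the mapping of names to triples is my own reading of the values, not something taken from the original script.

```python
from matplotlib import colors as mcolors

def normrgb(rgb):
    """Convert 0-255 RGB codes into the 0-1 floats matplotlib wants."""
    return [float(x) / 255 for x in rgb]

# Assumed correspondence between the post's triples and CSS colour names.
named = {
    "deeppink": [255, 20, 147],       # darkpink in the post
    "lightpink": [255, 182, 193],     # lightpink
    "navy": [0, 0, 128],              # darkblue
    "lightskyblue": [135, 206, 250],  # lightblue
}

matches = {}
for name, rgb in named.items():
    expected = normrgb(rgb)
    actual = mcolors.to_rgb(name)  # tuple of three floats in [0, 1]
    matches[name] = all(abs(a - e) < 1e-9 for a, e in zip(actual, expected))

print(matches)
```

If the names match on your matplotlib version, you could pass strings such as `color=['lightskyblue', 'navy']` to the plot call instead of the converted lists.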
I’m sure that you’re sick of bar charts by now, but one last thing I want to demonstrate is binning or discretisation of numeric values. Let’s try that for age and fare.
This time I will show a basic plot first to point out the problems. First we bin age and fare:
    bins = [0, 5, 14, 25, 40, 60, 100]
    binNames = ['Young Child', 'Child', 'Young Adult', 'Adult', 'Middle Aged', 'Older']
    # cut uses the given bins, or if passed an integer, divides the range evenly
    binAge = pd.cut(traindf.age, bins, labels=binNames)
    # qcut does quantiles
    binFare = pd.qcut(traindf.fare, 3, labels=['Cheap', 'Middle', 'Expensive'])
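To see what the two functions do, here is a small sketch on made-up numbers (the `ages` and `fares` series are invented, not taken from the Titanic data). By default `pd.cut` makes intervals that are open on the left and closed on the right, so an age of exactly 5 lands in the first bin, while `pd.qcut` picks the edges so that each bin holds roughly the same number of rows.

```python
import pandas as pd

bins = [0, 5, 14, 25, 40, 60, 100]
bin_names = ['Young Child', 'Child', 'Young Adult',
             'Adult', 'Middle Aged', 'Older']

# Invented ages, one per bin, to show the fixed-edge behaviour of cut.
ages = pd.Series([2, 10, 20, 30, 50, 70])
binned_ages = pd.cut(ages, bins, labels=bin_names)

# Values sitting exactly on an edge go to the bin on the left of that edge.
edge_ages = pd.cut(pd.Series([5, 14]), bins, labels=bin_names)

# Invented fares: qcut with 3 quantiles gives three equally sized groups.
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
binned_fares = pd.qcut(fares, 3, labels=['Cheap', 'Middle', 'Expensive'])

print(binned_ages.tolist())
print(edge_ages.tolist())
print(binned_fares.value_counts().sort_index())
```

One consequence of the left-open intervals is that a value of exactly 0 would fall outside the first bin and come back as missing; passing `include_lowest=True` to `pd.cut` avoids that.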
Let’s just try some plots and move it towards something presentable.
    fig3, axes3 = plt.subplots(1, 2)
    binVars = [binAge, binFare]
    for i in range(2):
        group = traindf.groupby([binVars[i], 'sex', 'survived'])
        group = group.size().unstack()
        group = group.div(group.sum(axis=1), axis=0)
        ax = group.plot(kind='barh', stacked=True, ax=axes3[i])
Not very pretty. I would like to remove sex from the yticks, and have this be implied by the colouring of the bars, maybe also remove the xticks since they don’t add much. We may also have to sort out the spacing and legends.
    fig3, axes3 = plt.subplots(1, 2)
    binVars = [binAge, binFare]
    varNames = ['Age', 'Fare']
    badStringList = ['(', ')', 'female', 'male', ',']

    def removeBadStringFromString(string, badStringList):
        for badString in badStringList:  # notice that you want female before male
            string = string.replace(badString, '')
        return string.strip()

    def removeBadStringFromLabels(ax, badStringList):
        labels = [item.get_text() for item in ax.get_yticklabels()]
        labels = [removeBadStringFromString(label, badStringList) for label in labels]
        return labels

    for i in range(2):
        group = traindf.groupby([binVars[i], 'sex', 'survived'])
        group = group.size().unstack()
        group = group.div(group.sum(axis=1), axis=0)
        cols = [[lightpink, lightblue], [darkpink, darkblue]]
        group.plot(kind='barh', stacked=True, ax=axes3[i], legend=False, color=cols)
        labels = removeBadStringFromLabels(axes3[i], badStringList)
        axes3[i].set_yticklabels(labels)
        axes3[i].get_xaxis().set_ticks([])  # an empty list hides the x ticks
        axes3[i].set_ylabel('')
        axes3[i].set_title(varNames[i])
        if i == 1:
            axes3[i].yaxis.tick_right()
            axes3[i].yaxis.set_label_position("right")

    # handles[0] holds the 'die' bars and handles[1] the 'live' bars;
    # the first two bars in each are the female and male colours
    handles, labels = axes3[0].get_legend_handles_labels()
    plt.figlegend(handles[0], ['die', 'die'], loc='upper center')
    plt.figlegend(handles[1], ['live', 'live'], loc='lower center')
    fig3.show()
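The label cleaning is easy to try on its own, without any plotting. This is a minimal sketch that assumes the tick labels look like `'(Young Child, female)'`, which is what the list of strings being removed suggests; `clean_label` is a hypothetical helper, not a function from the original script. It splits on the comma rather than replacing substrings, so the female-before-male ordering stops mattering.

```python
def clean_label(label, sexes=("female", "male")):
    """Strip the brackets and the sex level from a tick label.

    Assumes labels of the form '(Young Child, female)'.
    """
    # drop the surrounding brackets, then split into the index levels
    parts = [part.strip() for part in label.strip("()").split(",")]
    # keep every level except the sex, which the bar colours already show
    kept = [part for part in parts if part not in sexes]
    return ", ".join(kept)

raw_labels = ["(Young Child, female)", "(Young Child, male)", "(Cheap, male)"]
cleaned = [clean_label(label) for label in raw_labels]
print(cleaned)
```

The cleaned list can then be handed to `set_yticklabels` just as before.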
And using very similar code, provided in the file at the end of the post, we can look at the number of siblings/spouses and parents/children:
You can see that for females at least it pays to travel alone.
Here is the code to do the whole thing in one go. I encourage you to play around with the plotting options to get a feel for it, as it is a bit fussy.