Machine Learning with Python – First Steps: Munging the Titanic

So in this series I want to learn about Python and Machine Learning, implementing many of the algorithms ‘from scratch’ to really get a feel for how they work.

The data set I intend to examine in great detail is from Kaggle and concerns the survivors of the Titanic. You can find the data here.

In this first post I am going to go through the basics of loading a data set into something Python can work with and general data ‘munging’.

I am using the Spyder IDE with IPython, and learning from the book “Python for Data Analysis” by Wes McKinney.

Some of the packages you ought to have are numpy, pandas, scipy and matplotlib. Pandas is especially useful if you’ve ever used R, because it allows us to create dataframes.

Our goals are to: load the data in, understand what the variables mean, deal with missing or obviously erroneous entries and do some visualisations to get a feel for what’s going on.

First I will copy and paste here the descriptions of the variables given:

VARIABLE DESCRIPTIONS:
survival   Survival (0 = No; 1 = Yes)
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

Let’s get started.

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: path = 'E:/Datamining/Titanic/rawtrain.csv'
In [4]: traindf = pd.read_csv(path)
In [5]: traindf
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns:
survived    891  non-null values
pclass      891  non-null values
name        891  non-null values
sex         891  non-null values
age         714  non-null values
sibsp       891  non-null values
parch       891  non-null values
ticket      891  non-null values
fare        891  non-null values
cabin       204  non-null values
embarked    889  non-null values
dtypes: float64(2), int64(4), object(5)

In [6]: traindf.describe()
Out[6]:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

We use read_csv because the data is stored as comma-separated values (CSV). Pandas automatically took the column names from the header row and inferred the data types. For large data sets it is recommended that you specify the data types manually.
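As a rough sketch of what specifying the data types manually might look like, the snippet below passes a dtype mapping to read_csv. The two sample rows and the particular dtype choices are made up for illustration; they are assumptions, not part of the Kaggle file or the original script.

```python
import io

import numpy as np
import pandas as pd

# A tiny made-up sample in the same column layout as the Kaggle file (not real rows).
sample_csv = io.StringIO(
    "survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked\n"
    "0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "1,1,Passenger B,female,,1,0,T2,71.2833,C85,C\n"
)

# Hypothetical dtype choices: small integers for codes and counts, floats for measurements.
# age stays float64 because it has missing entries, and NaN is a float.
column_types = {
    "survived": np.int8,
    "pclass": np.int8,
    "sibsp": np.int8,
    "parch": np.int8,
    "age": np.float64,
    "fare": np.float64,
}

# Columns not listed in the mapping (name, sex, ticket, ...) are still inferred.
sample_df = pd.read_csv(sample_csv, dtype=column_types)
print(sample_df.dtypes)
```

Declaring the types up front saves pandas from having to guess them, and a column that does not match its declared type raises an error at load time instead of silently becoming an object column.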

Notice that the age, cabin and embarked columns have null values. Also we apparently have some free-loaders because the minimum fare is 0. We might think that these are babies, so let’s check that:

In [7]: traindf[['age','fare']][traindf.fare<5]
Out[7]:
     age    fare
179   36  0.0000
263   40  0.0000
271   25  0.0000
277  NaN  0.0000
302   19  0.0000
378   20  4.0125
413  NaN  0.0000
466  NaN  0.0000
481  NaN  0.0000
597   49  0.0000
633  NaN  0.0000
674  NaN  0.0000
732  NaN  0.0000
806   39  0.0000
815  NaN  0.0000
822   38  0.0000
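The two checks so far (which columns have nulls, and who paid a suspiciously low fare) can also be sketched on a toy frame. The rows below are invented for illustration, and using .loc instead of chained indexing is a choice made here, not the original code.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; they are not taken from the Kaggle file.
toy = pd.DataFrame({
    "age": [36.0, np.nan, 20.0, 2.0],
    "fare": [0.0, 0.0, 4.0125, 31.0],
    "cabin": [np.nan, np.nan, "C85", np.nan],
})

# Number of missing entries in each column.
missing_counts = toy.isnull().sum()

# Rows with a fare below 5, showing only the age and fare columns.
# .loc takes the row mask and the column list in a single step.
cheap = toy.loc[toy["fare"] < 5, ["age", "fare"]]

print(missing_counts)
print(cheap)
```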

These guys are surely old enough to know better! But notice that there is a jump from a fare of 0 to a fare of about 4, so something is going on here. Most likely these zeros are errors, so let’s replace them with the mean fare for their class, and do the same for null values.

#first we set those fares of 0 to nan
In [8]: traindf.fare = traindf.fare.map(lambda x: np.nan if x==0 else x)
#note that lambda just means a function we make on the fly
#calculate the mean fare for each class
#(newer pandas versions call the rows argument index)
In [9]: classmeans = traindf.pivot_table('fare', rows='pclass', aggfunc='mean')
In [11]: classmeans
Out[11]:
pclass
1         86.148874
2         21.358661
3         13.787875
Name: fare

In [12]: traindf.fare = traindf[['fare', 'pclass']].apply(lambda x: classmeans[x['pclass']] if pd.isnull(x['fare']) else x['fare'], axis=1 )
#apply acts on dataframes either row-wise or column-wise; axis=1 passes each row to the function
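In current pandas releases the same per-class fill is often written with groupby and transform rather than a pivot table plus a row-wise apply. The following is a minimal sketch on invented data, assuming the same column names as above; it is an alternative way to get the same result, not the code from the original script.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration; zeros play the role of the suspicious entries.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [80.0, 0.0, 20.0, 22.0, 0.0, 8.0],
})

# Treat a fare of exactly 0 as missing (mask puts NaN where the condition holds).
toy["fare"] = toy["fare"].mask(toy["fare"] == 0)

# Mean fare of each class, repeated on every row of that class.
# The mean skips NaN, so the missing fares do not drag it down.
class_mean_fare = toy.groupby("pclass")["fare"].transform("mean")

# Fill each missing fare with the mean of its own class.
toy["fare"] = toy["fare"].fillna(class_mean_fare)
print(toy)
```

Because transform returns a Series aligned with the original rows, the fill is a single vectorised fillna and no per-row lambda is needed.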

Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.

In [14]: meanAge=np.mean(traindf.age)
In [15]: traindf.age=traindf.age.fillna(meanAge)

Now for the cabin: since the majority of values are missing, it might be best to treat that as a piece of information in itself, so we’ll set these to ‘Unknown’. We’ll also set the missing values in embarked to the mode.

In [16]: traindf.cabin = traindf.cabin.fillna('Unknown')
In [17]: from scipy.stats import mode
In [18]: modeEmbarked = mode(traindf.embarked)[0][0]
In [19]: traindf.embarked = traindf.embarked.fillna(modeEmbarked)
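The mode can also be taken with pandas alone, which avoids handing a column of strings and NaN to scipy. This is a sketch on invented rows, offered as an alternative under that assumption rather than as the original approach.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cabin": ["C85", np.nan, np.nan, "E46"],
    "embarked": ["S", "C", np.nan, "S"],
})

# Missing cabins become their own category.
toy["cabin"] = toy["cabin"].fillna("Unknown")

# Series.mode() ignores NaN by default and returns a Series (ties are possible),
# so take the first entry as the fill value.
mode_embarked = toy["embarked"].mode().iloc[0]
toy["embarked"] = toy["embarked"].fillna(mode_embarked)
print(toy)
```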

And now we have sorted out the null values and the obvious outliers. Rather than saving the cleaned data set as a csv, I would prefer a workflow where we always start from the raw data, so that we never lose anything. You can download the script for the whole process, which also cleans the test data, here.
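One way to keep such a raw-data-first workflow is to wrap the cleaning steps in a single function that returns a cleaned copy. The function name clean_titanic and the toy rows below are hypothetical; this is a sketch of the idea and not the downloadable script.

```python
import numpy as np
import pandas as pd


def clean_titanic(raw):
    """Return a cleaned copy of raw; the raw frame itself is left untouched."""
    df = raw.copy()
    # Zero fares are treated as missing, then filled with the mean fare of the class.
    df["fare"] = df["fare"].mask(df["fare"] == 0)
    df["fare"] = df["fare"].fillna(df.groupby("pclass")["fare"].transform("mean"))
    # Missing ages get the overall mean age.
    df["age"] = df["age"].fillna(df["age"].mean())
    # Missing cabins become their own category; missing ports get the most common port.
    df["cabin"] = df["cabin"].fillna("Unknown")
    df["embarked"] = df["embarked"].fillna(df["embarked"].mode().iloc[0])
    return df


# Made-up rows standing in for the raw file.
raw = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "fare": [50.0, 0.0, 8.0, 10.0],
    "age": [30.0, np.nan, 20.0, 40.0],
    "cabin": ["C85", np.nan, np.nan, np.nan],
    "embarked": ["S", "S", np.nan, "Q"],
})

cleaned = clean_titanic(raw)
print(cleaned)
```

In practice the function would be called on the result of pd.read_csv at the start of every session, so the file on disk is never modified.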

Next time we will have some fun with plotting.
