So in this series I want to learn about Python and Machine Learning, implementing many of the algorithms ‘from scratch’ to really get a feel for how they work.
In this first post I am going to go through the basics of loading a data set into something Python can work with and general data ‘munging’.
Some of the packages you ought to have are numpy, pandas, scipy and matplotlib. Pandas is especially useful if you’ve ever used R, because it allows us to create dataframes.
Our goals are to: load the data in, understand what the variables mean, deal with missing or obviously erroneous entries and do some visualisations to get a feel for what’s going on.
First I will copy and paste here the descriptions of the variables given:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
Let’s get started.
In : import pandas as pd
In : path = 'E:/Datamining/Titanic/rawtrain.csv'
In : traindf = pd.read_csv(path)
In : traindf
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns:
survived    891 non-null values
pclass      891 non-null values
name        891 non-null values
sex         891 non-null values
age         714 non-null values
sibsp       891 non-null values
parch       891 non-null values
ticket      891 non-null values
fare        891 non-null values
cabin       204 non-null values
embarked    889 non-null values
dtypes: float64(2), int64(4), object(5)

In : traindf.describe()
Out:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
We use read_csv because the data is in CSV (comma-separated values) form. Pandas automatically took the column names from the header row and inferred the data types. For large data sets it is recommended that you specify the data types manually.
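To illustrate that last point, here is a minimal sketch of passing the types to read_csv yourself through its dtype argument. The inline CSV and its rows are made up so that the example runs on its own; only the column names are borrowed from the data set.

```python
import io

import pandas as pd

# A tiny made-up stand-in for the real file (illustration only).
sample_csv = io.StringIO(
    "survived,pclass,name,sex,age,fare\n"
    "0,3,Passenger A,male,22,7.25\n"
    "1,1,Passenger B,female,38,71.28\n"
    "1,3,Passenger C,female,,7.92\n"
)

# Column name -> type. Age is a float so that missing ages can be stored as NaN.
dtypes = {
    "survived": "int64",
    "pclass": "int64",
    "name": "object",
    "sex": "object",
    "age": "float64",
    "fare": "float64",
}

df = pd.read_csv(sample_csv, dtype=dtypes)
print(df.dtypes)
```

With the types given up front, pandas does not have to guess them from the contents of each column.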
Notice that the age, cabin and embarked columns have null values. Also we apparently have some free-loaders because the minimum fare is 0. We might think that these are babies, so let’s check that:
In : traindf[['age','fare']][traindf.fare<5]
Out:
     age    fare
179   36  0.0000
263   40  0.0000
271   25  0.0000
277  NaN  0.0000
302   19  0.0000
378   20  4.0125
413  NaN  0.0000
466  NaN  0.0000
481  NaN  0.0000
597   49  0.0000
633  NaN  0.0000
674  NaN  0.0000
732  NaN  0.0000
806   39  0.0000
815  NaN  0.0000
822   38  0.0000
These guys are surely old enough to know better! But notice that there is a jump from a fare of 0 to 4, so there is something going on here. Most likely these are errors, so let’s replace them with the mean fare for their class, and do the same for null values.
In : import numpy as np
# first we set those fares of 0 to NaN
In : traindf.fare = traindf.fare.map(lambda x: np.nan if x == 0 else x)
# note that lambda just means a function we make on the fly
# calculate the mean fare for each class
# (older pandas versions spelled the index argument rows='pclass')
In : classmeans = traindf.pivot_table('fare', index='pclass', aggfunc='mean')['fare']
In : classmeans
Out:
pclass
1    86.148874
2    21.358661
3    13.787875
Name: fare
# fill each missing fare with the mean fare of that passenger's class
In : traindf.fare = traindf[['fare', 'pclass']].apply(lambda x: classmeans[int(x['pclass'])] if pd.isnull(x['fare']) else x['fare'], axis=1)
# apply acts on dataframes, either row-wise or column-wise; axis=1 means the function is called once per row
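The same class-mean fill can also be written without a row-wise apply. What follows is a minimal sketch on a few made-up rows (not the real data), using groupby with transform. It is an alternative way of getting the same effect, not the method used in this post.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [80.0, 0.0, 20.0, np.nan, 0.0, 10.0],
})

# Treat zero fares as missing.
df["fare"] = df["fare"].replace(0, np.nan)

# For every row, the mean fare of that row's class (NaN is skipped in the mean).
class_mean = df.groupby("pclass")["fare"].transform("mean")

# Fill only the missing fares; existing fares are left alone.
df["fare"] = df["fare"].fillna(class_mean)
print(df)
```

Because transform returns a result aligned with the original rows, fillna can pick the right class mean for each missing entry in one step.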
Now let’s do a similar thing for age, replacing missing values with the overall mean. Later we’ll learn about more sophisticated techniques for replacing missing values and improve upon this.
In : meanAge = traindf.age.mean()
In : traindf.age = traindf.age.fillna(meanAge)
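One side effect of filling with the mean is worth seeing on a toy example: the mean of the column stays the same, but its spread shrinks, because every filled value sits exactly at the centre. A small sketch with made-up ages:

```python
import numpy as np
import pandas as pd

# Made-up ages, two of them missing.
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 30.0])

mean_age = ages.mean()           # NaN is skipped by default
filled = ages.fillna(mean_age)   # missing entries become the mean

print(filled.tolist())
print(ages.std(), filled.std())  # the spread is smaller after filling
```

This is one reason to treat mean filling as a first pass rather than the final word.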
Now for the cabin: since the majority of values are missing, it might be best to treat that as a piece of information in itself, so we’ll set these to ‘Unknown’. We’ll also set the missing values in embarked to the mode.
In : traindf.cabin = traindf.cabin.fillna('Unknown')
# Series.mode() ignores missing values and returns a Series, so take its first entry
In : modeEmbarked = traindf.embarked.mode()[0]
In : traindf.embarked = traindf.embarked.fillna(modeEmbarked)
And now we have sorted out the null values and the obvious outliers. Rather than saving the cleaned data set as a csv, I would prefer a workflow where we always start from the raw data, which means that we never lose anything. You can download the script for the whole process, which also cleans the test data, here.
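As a rough idea of what such a start-from-raw workflow can look like, here is a sketch that wraps the steps from this post in one function. It is not the downloadable script: the name clean_titanic and the sample rows are made up for illustration, and the fare step uses groupby with transform as a compact stand-in for the pivot table approach.

```python
import numpy as np
import pandas as pd


def clean_titanic(raw):
    """Return a cleaned copy of the raw frame; the input is left untouched."""
    df = raw.copy()
    # Zero fares are treated as missing, then filled with the class mean.
    df["fare"] = df["fare"].replace(0, np.nan)
    class_mean = df.groupby("pclass")["fare"].transform("mean")
    df["fare"] = df["fare"].fillna(class_mean)
    # Missing ages get the overall mean age.
    df["age"] = df["age"].fillna(df["age"].mean())
    # A missing cabin is kept as its own category.
    df["cabin"] = df["cabin"].fillna("Unknown")
    # Missing ports get the most common port.
    df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
    return df


# Made-up rows standing in for the raw file.
raw = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "fare": [50.0, 0.0, 8.0, np.nan],
    "age": [30.0, np.nan, 20.0, 40.0],
    "cabin": ["C85", None, None, None],
    "embarked": ["S", "C", None, "S"],
})

clean = clean_titanic(raw)
print(clean)
```

Since the function works on a copy, the raw frame can be cleaned again at any time, and the same function can be pointed at the test data.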
Next time we will have some fun with plotting.