Specifically, the problem is variables like ‘Title’, where we have four strings ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’ as values. The solution is to create a column for each value, for example ‘Title_Mr’, with value 0 or 1 depending on whether the data point has that value. Rereading Wes McKinney’s book, I found he explains exactly what to do, so I have written a handy little function based on his example.
It accepts as input a dataframe and a dictionary indicating which variables are nominal. It then replaces each nominal variable in your dataframe with a set of dummy columns, and also updates your data type dictionary. Simple but effective.
    def dummy_variables(data, data_type_dict):
        # Loop over nominal variables. We materialize the keys into a list
        # because we pop entries from the dictionary inside the loop, and
        # mutating a dict while iterating over its keys raises an error in
        # Python 3.
        nominal = [k for k in list(data_type_dict.keys())
                   if data_type_dict[k] == 'nominal']
        for variable in nominal:
            # First we create the columns with dummy variables.
            # Note that the argument 'prefix' means the column names will be
            # prefix_value for each unique value in the original column, so
            # we set the prefix to be the name of the original variable.
            dummy_df = pd.get_dummies(data[variable], prefix=variable)
            # Remove old variable from dictionary.
            data_type_dict.pop(variable)
            # Add new dummy variables to dictionary.
            for dummy_variable in dummy_df.columns:
                data_type_dict[dummy_variable] = 'nominal'
            # Add dummy variables to main df.
            data = data.drop(variable, axis=1)
            data = data.join(dummy_df)
        return [data, data_type_dict]
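To see it in action, here is a quick sanity check on a toy dataframe (the column names and values below are made up for illustration; the function is restated so the snippet runs on its own):

```python
import pandas as pd

def dummy_variables(data, data_type_dict):
    # Restated from above so this snippet is self-contained.
    nominal = [k for k in list(data_type_dict.keys())
               if data_type_dict[k] == 'nominal']
    for variable in nominal:
        dummy_df = pd.get_dummies(data[variable], prefix=variable)
        data_type_dict.pop(variable)
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'
        data = data.drop(variable, axis=1).join(dummy_df)
    return [data, data_type_dict]

df = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Master'],
                   'Age': [30, 28, 4, 2]})
types = {'Title': 'nominal', 'Age': 'numeric'}
df, types = dummy_variables(df, types)
print(sorted(df.columns))
```

The ‘Title’ column is gone, replaced by ‘Title_Master’, ‘Title_Miss’, ‘Title_Mr’, ‘Title_Mrs’, each marked as nominal in the updated dictionary.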
Easy. Now, once everything is numeric, we can use np.asarray to cast our dataframe to a numpy array.
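For example (a toy all-numeric frame; the column names here are just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [30.0, 4.0],
                   'Title_Mr': [1, 0]})

# Cast the dataframe to a plain numpy array.
X = np.asarray(df)
print(X.shape)   # one row per data point, one column per variable
print(X.dtype)   # mixed int/float columns are upcast to a common dtype
```

Note that because a numpy array has a single dtype, mixed integer and float columns get upcast to float64, which is usually what scikit-learn style estimators expect anyway.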