Last time we implemented logistic regression, where the data is in the form of a numpy array. But what if your data is not of that form? What if it is a pandas dataframe like the Kaggle Titanic data?

Specifically, the problem is nominal variables like ‘Title’, which takes the four string values ‘Mr’, ‘Mrs’, ‘Miss’ and ‘Master’. The solution is to create a column for each value, for example ‘Title_Mr’, holding 0 or 1 depending on whether the data point has that value. Wes McKinney explains how to do this in his book, and I have built a handy little function around his example.

It will accept as input a dataframe, and a dictionary telling which variables are nominal. It will then replace each nominal variable in your dataframe with a set of dummy columns, and also update your data type dictionary. Simple but effective.

import pandas as pd

def dummy_variables(data, data_type_dict):
    #Collect the nominal variables up front, so that we do not modify
    #the dictionary while iterating over it.
    nominal_variables = [variable for variable, data_type
                         in data_type_dict.items() if data_type == 'nominal']
    #Loop over nominal variables.
    for variable in nominal_variables:
        #First we create the columns with dummy variables.
        #Note that the argument 'prefix' means the column names will be
        #prefix_value for each unique value in the original column, so
        #we set the prefix to be the name of the original variable.
        dummy_df = pd.get_dummies(data[variable], prefix=variable)
        #Remove old variable from dictionary.
        data_type_dict.pop(variable)
        #Add new dummy variables to dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'
        #Add dummy variables to main df.
        data = data.drop(variable, axis=1)
        data = data.join(dummy_df)
    return [data, data_type_dict]
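
To see what the `pd.get_dummies` call at the heart of the function produces, here is a minimal sketch on a made-up ‘Title’ column (the toy values are my own, not the Kaggle data). I pass `dtype=int` on the assumption that you want 0/1 columns; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic 'Title' column.
df = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Master', 'Mr']})

# One dummy column per unique value, named Title_<value>.
# dtype=int requests 0/1 integers rather than booleans.
dummies = pd.get_dummies(df['Title'], prefix='Title', dtype=int)

print(dummies.columns.tolist())
print(dummies['Title_Mr'].tolist())
```

Each row has exactly one 1 across the four dummy columns, marking which title that passenger had.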

Easy. Now that everything is numeric, we can use np.asarray to cast our dataframe to a numpy array.
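
A rough sketch of that last step, using a small made-up frame that already has its dummy columns (the column names and values are illustrative only). Passing `dtype=float` is my own choice here: it makes the target type explicit, and it also covers the case where the dummy columns are booleans rather than 0/1 integers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame after the dummy columns have been added.
df = pd.DataFrame({'Age': [22.0, 38.0, 4.0],
                   'Title_Master': [0, 0, 1],
                   'Title_Mr': [1, 0, 0],
                   'Title_Mrs': [0, 1, 0]})

# Cast the all-numeric dataframe to a 2-D float array,
# one row per data point and one column per variable.
X = np.asarray(df, dtype=float)

print(X.shape)
print(X.dtype)
```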

Great posts about the Titanic competition. For the specific problem you address here (one-hot encoding as far as I get it), I’ve actually used some nice code from github, see: https://gist.github.com/kljensen/5452382