Home » machine learning » data cleaning » Creating Dummy Variables in Pandas

Creating Dummy Variables in Pandas

Last time we implemented logistic regression, where the data is in the form of a numpy array. But what if your data is not of that form? What if it is a pandas dataframe like the Kaggle Titanic data?

Specifically the problem is variables like ‘Title’ where we have four strings ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’ as values. The solution is to create a column for each value, for example ‘Title_Mr’, with values 0,1 depending on whether the data point has that value. Reading Wes McKinney’s book again, he explains what to do, and so I have created a handy little function using his example.

It will accept as input a dataframe, and a dictionary telling which variables are nominal. It will then replace each nominal variable in your dataframe with a set of dummy columns, and also update your data type dictionary. Simple but effective.

def dummy_variables(data, data_type_dict):
    #Loop over nominal variables.
    for variable in filter(lambda x: data_type_dict[x]=='nominal',
                           data_type_dict.keys()):

        #First we create the columns with dummy variables.
        #Note that the argument 'prefix' means the column names will be
        #prefix_value for each unique value in the original column, so
        #we set the prefix to be the name of the original variable.
        dummy_df=pd.get_dummies(data[variable], prefix=variable)

        #Remove old variable from dictionary.
        data_type_dict.pop(variable)

        #Add new dummy variables to dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'

        #Add dummy variables to main df.
        data=data.drop(variable, axis=1)
        data=data.join(dummy_df)

    return [data, data_type_dict]

Easy. Now once everything is numeric, we can as use np.asarray to cast our dataframe to a numpy array.

Advertisements

1 Comment

  1. Matthias says:

    Great posts about the Titanic competition. For the specific problem you address here (one-hot encoding as far as I get it), I’ve actually used some nice code from github, see: https://gist.github.com/kljensen/5452382

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: