Creating Dummy variables in Python with Pandas

Dummy variables are used to convert categorical variables into a numerical representation that can be easily used in mathematical equations. In this tutorial, we will discuss different techniques for creating dummy variables in Python with Pandas.

Most machine learning algorithms and mathematical models cannot use categorical variables directly, so an alternative representation is needed. Dummy variables fill that role: each one is a binary indicator for a single category, showing whether that category is present in a given row or not.
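To make the idea concrete, here is a minimal sketch that builds two dummy variables by hand. The colours Series is a hypothetical example and is not part of this tutorial's DataFrame.

```python
import pandas as pd

# Hypothetical example data, used only to illustrate the idea
colours = pd.Series(["red", "green", "red"], name="Colour")

# One binary indicator per category: 1 where the row holds that category, else 0
is_red = (colours == "red").astype(int)
is_green = (colours == "green").astype(int)

print(is_red.tolist())    # [1, 0, 1]
print(is_green.tolist())  # [0, 1, 0]
```

Writing one comparison per category quickly becomes tedious, which is the work that Pandas can automate.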


For this tutorial, the Pandas DataFrame provided below will be utilized:

import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"]
})

print(df)

Output:

      Name  Age      City
0     John   25  New York
1    Emily   30    London
2  Michael   28     Paris

Using get_dummies() for creating dummy variables in Python Pandas:

The get_dummies() function in Pandas converts categorical variables into dummy variables.

It creates a binary indicator column for each unique category in the specified column or DataFrame, representing the presence or absence of that category. This means that each new column holds only a 1 (meaning True/present) or a 0 (meaning False/absent) for a given row. In pandas 2.0 and later the new columns default to a boolean type, so the same values are displayed as True and False. Furthermore, get_dummies() treats integer columns as continuous variables and does not convert them into binary indicators. Only columns with a string, object, or category type are converted.
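One quick way to see which columns were converted is to compare the column labels before and after the call. The sketch below assumes the same DataFrame used throughout this tutorial; the names kept and encoded are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

dummy_df = pd.get_dummies(df)

# Columns that survive unchanged were not encoded
kept = [col for col in df.columns if col in dummy_df.columns]
# Columns that disappeared were replaced by indicator columns
encoded = [col for col in df.columns if col not in dummy_df.columns]

print(kept)     # ['Age']
print(encoded)  # ['Name', 'City']
```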

Example:

In the first row of df, the Name field contains John, so a new column called Name_John is created and assigned a value of 1 for that row. This means that the category John is present in this particular row. Age only contains integer values, so it is not given any binary indicators. A new column is created for each unique City value, which is how City_New York comes about.

Let’s use code to demonstrate this. We’ll print the original Pandas DataFrame along with the new dummy-coded DataFrame.

dummy_df = pd.get_dummies(df)

print(df)
print(dummy_df)

Output:

      Name  Age      City
0     John   25  New York
1    Emily   30    London
2  Michael   28     Paris

   Age  Name_Emily  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25           0          1             0            0              1           0
1   30           1          0             0            1              0           0
2   28           0          0             1            0              0           1

Additional Parameters for get_dummies():

get_dummies() takes several additional parameters, such as prefix, prefix_sep, columns, and dtype. We will briefly discuss each of them.

Prefix Example:

prefix sets the string that is added to the start of each new column label. It accepts a string, a list of strings, or a dictionary mapping column names to prefixes, and has a default value of None (in which case the original column name is used). When a list or dictionary is passed, its length must equal the number of columns being converted. Since Age does not contain categorical values, it is not converted, so the list/dictionary needs 2 entries instead of 3.

# Using a string
print(pd.get_dummies(df, prefix="Col"))

# Using a list of strings
print(pd.get_dummies(df, prefix=["N", "C"]))

# Using a dictionary of strings
print(pd.get_dummies(df, prefix={"Name": "Nm", "City": "Ct"}))

Output:

   Age  Col_Emily  Col_John  Col_Michael  Col_London  Col_New York  Col_Paris
0   25          0         1            0           0             1          0
1   30          1         0            0           1             0          0
2   28          0         0            1           0             0          1

   Age  N_Emily  N_John  N_Michael  C_London  C_New York  C_Paris
0   25        0       1          0         0           1        0
1   30        1       0          0         1           0        0
2   28        0       0          1         0           0        1

   Age  Nm_Emily  Nm_John  Nm_Michael  Ct_London  Ct_New York  Ct_Paris
0   25         0        1           0          0            1         0
1   30         1        0           0          1            0         0
2   28         0        0           1          0            0         1
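The length rule for prefix can be checked directly. The following is a small sketch, assuming the tutorial's DataFrame, that passes a list with only one prefix for the two columns being encoded. The raised flag is just a helper for illustration, and the exact error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

try:
    # Two columns (Name and City) are encoded, but only one prefix is given
    pd.get_dummies(df, prefix=["N"])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```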

Prefix_sep Example:

prefix_sep sets the separator placed between the prefix and the original category label. It follows the same rules as prefix, but its default value is "_".

dummy_df = pd.get_dummies(df, prefix_sep = "#")
print(dummy_df)

Output:

   Age  Name#Emily  Name#John  Name#Michael  City#London  City#New York  City#Paris
0   25           0          1             0            0              1           0
1   30           1          0             0            1              0           0
2   28           0          0             1            0              0           1
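Because prefix_sep follows the same rules as prefix, a separate separator can also be given per column. The sketch below assumes the tutorial's DataFrame and uses a dictionary; the "#" and "@" separators are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Map each encoded column to its own separator
dummy_df = pd.get_dummies(df, prefix_sep={"Name": "#", "City": "@"})

# Name columns are joined with "#", City columns with "@"
print(list(dummy_df.columns))
```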

Columns Example:

columns specifies which column(s) should be encoded. It accepts a list of column label(s) but, by default, encodes all columns with object, string, or category datatype.

dummy_df = pd.get_dummies(df, columns=["Name"])
print(dummy_df)

Output:

   Age      City  Name_Emily  Name_John  Name_Michael
0   25  New York           0          1             0
1   30    London           1          0             0
2   28     Paris           0          0             1

Dtype Example:

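dtype defines the data type of the values in the new columns. It accepts a single data type. In pandas 2.0 and later the default is bool; earlier versions defaulted to an unsigned 8-bit integer (np.uint8), which matches the 0/1 outputs shown in this tutorial.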

dummy_df = pd.get_dummies(df, dtype = bool)
print(dummy_df)

Output:

   Age  Name_Emily  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25       False       True         False        False           True       False
1   30        True      False         False         True          False       False
2   28       False      False          True        False          False        True
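If 0/1 columns are wanted regardless of the installed pandas version, one option is to request an integer type explicitly. This is a minimal sketch assuming the tutorial's DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Ask for integer indicators instead of relying on the version-dependent default
dummy_df = pd.get_dummies(df, dtype=int)

print(dummy_df["Name_John"].tolist())  # [1, 0, 0]
```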

Additional Information:

Forcing other datatypes to be converted:

If we need to convert columns with datatypes other than string, object, and category, we can do so by listing them in the columns parameter.

dummy_df = pd.get_dummies(df, columns = ["Name", "Age"])
print(dummy_df)

Output:

       City  Name_Emily  Name_John  Name_Michael  Age_25  Age_28  Age_30
0  New York           0          1             0       1       0       0
1    London           1          0             0       0       0       1
2     Paris           0          0             1       0       1       0

Hence, all Age values were converted into dummy variables.
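Another possible route, sketched here as an alternative rather than a recommendation, is to change the column's type to category first. Since category columns are encoded by default, the columns parameter is then not needed. The example works on a copy so the original DataFrame stays unchanged; the name df_cat is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Work on a copy and mark Age as categorical
df_cat = df.copy()
df_cat["Age"] = df_cat["Age"].astype("category")

# Age is now encoded along with Name and City
dummy_df = pd.get_dummies(df_cat)
print(list(dummy_df.columns))
```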

Unique Variables:

Even if a column contains the same value two or more times, only one new column will be made for that value. To demonstrate this, we will modify the DataFrame we have been using so that it contains a repeated value.

df = pd.DataFrame({
    "Name": ["John", "John", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"]
})

dummy_df = pd.get_dummies(df)
print(dummy_df)

Output:

   Age  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25          1             0            0              1           0
1   30          1             0            1              0           0
2   28          0             1            0              0           1

As you can see, we swapped out Emily for John, making John a repeated value. However, it is still only given one new column.
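Put differently, the number of dummy columns created for a column equals its number of unique values. The sketch below checks this for the Name column of the modified DataFrame; name_dummies is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Encode only the Name column, keeping the usual "Name" prefix
name_dummies = pd.get_dummies(df["Name"], prefix="Name")

print(df["Name"].nunique())         # 2
print(list(name_dummies.columns))   # ['Name_John', 'Name_Michael']
```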


This marks the end of the “Creating Dummy variables in Python with Pandas” Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.
