Dummy variables are used to convert categorical variables into a numerical representation that can be easily used in mathematical equations. In this tutorial, we will discuss different techniques for creating dummy variables in Python with Pandas.
Most machine learning algorithms and mathematical models cannot use categorical variables directly, so an alternative representation is needed. Dummy variables fill that role: each one is a binary indicator showing whether a particular category is present or not.
For this tutorial, the Pandas DataFrame provided below will be utilized:
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"]
})
      Name  Age      City
0     John   25  New York
1    Emily   30    London
2  Michael   28     Paris
Using get_dummies() for creating dummy variables in Python Pandas:
The get_dummies() function in Pandas converts categorical variables into dummy variables. It creates a binary indicator column for each unique category in the specified column or DataFrame, marking the presence or absence of that category. Each cell in the new columns therefore holds only a 1 (meaning True/present) or a 0 (meaning False/absent). In pandas 2.0 and later, the same information is shown as True/False unless a numeric dtype is requested. Integer columns are treated as continuous variables and are not converted into binary indicators. By default, only columns with a string, object, or category type are converted.
Example:
In the first row of df, the Name field contains John, so a new column called Name_John is created and assigned a value of 1 for that row. This means that John is present in this particular row. Age only contains integer values, so it is not given any binary indicators. A new column is also created for each unique City value, so the first row gets a 1 in City_New York.
Let’s use code to demonstrate this. We’ll print the original Pandas DataFrame along with the new dummy-coded DataFrame.
dummy_df = pd.get_dummies(df)
print(df)
print(dummy_df)
Output:
      Name  Age      City
0     John   25  New York
1    Emily   30    London
2  Michael   28     Paris

   Age  Name_Emily  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25           0          1             0            0              1           0
1   30           1          0             0            1              0           0
2   28           0          0             1            0              0           1
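To make the idea of a binary indicator more concrete, the sketch below builds the City columns by hand and compares them with what get_dummies() returns for the same column. The names manual and auto are only illustrative, and dtype=int is passed as an assumption so that the result shows 0/1 on any pandas version.

```python
import pandas as pd

# Same DataFrame as in the tutorial.
df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Build one indicator column per unique city by hand:
# 1 where the row's city matches, 0 otherwise.
manual = pd.DataFrame({
    f"City_{city}": (df["City"] == city).astype(int)
    for city in sorted(df["City"].unique())
})

# Let get_dummies() encode the same column for comparison.
auto = pd.get_dummies(df["City"], prefix="City", dtype=int)

print(manual)
print(auto)
```

Both printed DataFrames should contain the same three City_ columns, which is all a dummy variable is: a per-category yes/no flag.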
Additional Parameters for get_dummies():
get_dummies() takes several additional parameters, such as prefix, prefix_sep, columns, and dtype. We will briefly discuss each one.
Prefix Example:
prefix sets the string added to the start of each new column label. It accepts a string, a list of strings, or a dictionary mapping column names to prefixes, and has a default value of None, in which case the original column name is used as the prefix. A list or dictionary must contain one entry for each column being converted. Since Age does not contain categorical values, it is not converted, so the list or dictionary here needs 2 entries instead of 3.
# Using a string
print(pd.get_dummies(df, prefix="Col"))
# Using a list of strings
print(pd.get_dummies(df, prefix=["N", "C"]))
# Using a dictionary of strings
print(pd.get_dummies(df, prefix={"Name": "Nm", "City": "Ct"}))
Output:
   Age  Col_Emily  Col_John  Col_Michael  Col_London  Col_New York  Col_Paris
0   25          0         1            0           0             1          0
1   30          1         0            0           1             0          0
2   28          0         0            1           0             0          1

   Age  N_Emily  N_John  N_Michael  C_London  C_New York  C_Paris
0   25        0       1          0         0           1        0
1   30        1       0          0         1           0        0
2   28        0       0          1         0           0        1

   Age  Nm_Emily  Nm_John  Nm_Michael  Ct_London  Ct_New York  Ct_Paris
0   25         0        1           0          0            1         0
1   30         1        0           0          1            0         0
2   28         0        0           1          0            0         1
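The length rule for prefix can be checked directly. The sketch below deliberately passes a list with only one prefix even though two columns (Name and City) are being converted, and catches the resulting error. The variable error_message is just an illustrative name, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Same DataFrame as in the tutorial.
df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

error_message = None
try:
    # Only one prefix is given for two encodable columns (Name and City).
    pd.get_dummies(df, prefix=["N"])
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```

Using a dictionary keyed by column name, as in the third example above, tends to be the safer option because it does not depend on column order.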
Prefix_sep Example:
prefix_sep sets the separator placed between the prefix and the original category value. It follows the same rules as prefix, but its default value is "_".
dummy_df = pd.get_dummies(df, prefix_sep="#")
print(dummy_df)
Output:
   Age  Name#Emily  Name#John  Name#Michael  City#London  City#New York  City#Paris
0   25           0          1             0            0              1           0
1   30           1          0             0            1              0           0
2   28           0          0             1            0              0           1
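Since the separator follows the same rules as the prefix, it can also be set per column. The following is a small sketch of that, assuming a dictionary keyed by column name. The separators "#" and "/" and the name sep_df are arbitrary choices for illustration, and dtype=int is only there to keep the output as 0/1.

```python
import pandas as pd

# Same DataFrame as in the tutorial.
df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# A different separator for each encoded column, keyed by column name.
sep_df = pd.get_dummies(
    df,
    prefix_sep={"Name": "#", "City": "/"},
    dtype=int,
)

print(list(sep_df.columns))
```

A distinctive separator can make it easier to split a dummy column label back into its original column name and category later on.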
Columns Example:
columns specifies which column(s) should be encoded. It accepts a list of column labels. By default (None), all columns with object, string, or category datatype are encoded.
dummy_df = pd.get_dummies(df, columns=["Name"])
print(dummy_df)
Output:
   Age      City  Name_Emily  Name_John  Name_Michael
0   25  New York           0          1             0
1   30    London           1          0             0
2   28     Paris           0          0             1
Dtype Example:
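dtype defines the data type of the values in the new columns. It accepts a single data type. The default depends on the pandas version: it is bool in pandas 2.0 and later, and an 8-bit unsigned integer (np.uint8) in earlier versions, which is what produces the 0/1 output shown in the other examples.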
dummy_df = pd.get_dummies(df, dtype=bool)
print(dummy_df)
Output:
   Age  Name_Emily  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25       False       True         False        False           True       False
1   30        True      False         False         True          False       False
2   28       False      False          True        False          False        True
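Because the default dtype differs between pandas versions, it can help to set it explicitly. The sketch below requests integer and float indicators and prints the resulting column types. The names int_dummies and float_dummies are illustrative only.

```python
import pandas as pd

# Same DataFrame as in the tutorial.
df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Ask for integer indicators (0/1) regardless of the pandas default.
int_dummies = pd.get_dummies(df, dtype=int)

# Ask for floating-point indicators (0.0/1.0), which some models expect.
float_dummies = pd.get_dummies(df, dtype=float)

print(int_dummies.dtypes)
print(float_dummies.dtypes)
```

Note that dtype only affects the newly created indicator columns. Age keeps its original integer type in both results.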
Additional Information:
Forcing other datatypes to be converted:
If we need to convert columns with datatypes other than string, object, and category, we can do so by listing them in the columns parameter.
dummy_df = pd.get_dummies(df, columns=["Name", "Age"])
print(dummy_df)
Output:
       City  Name_Emily  Name_John  Name_Michael  Age_25  Age_28  Age_30
0  New York           0          1             0       1       0       0
1    London           1          0             0       0       0       1
2     Paris           0          0             1       0       1       0
Hence, all Age values were converted into dummy variables.
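Another way to get the same effect, sketched here as an alternative rather than something the tutorial itself relies on, is to change the column's type to category first. Since category columns are converted by default, get_dummies() then picks up Age without the columns parameter. The name df_cat is illustrative, and dtype=int is an assumption to keep the output as 0/1.

```python
import pandas as pd

# Same DataFrame as in the tutorial.
df = pd.DataFrame({
    "Name": ["John", "Emily", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# Work on a copy so the original DataFrame is left untouched.
df_cat = df.copy()

# Mark Age as categorical instead of numeric.
df_cat["Age"] = df_cat["Age"].astype("category")

# With no columns argument, every object/string/category column is encoded.
dummy_df = pd.get_dummies(df_cat, dtype=int)
print(list(dummy_df.columns))
```

Unlike the columns=["Name", "Age"] example, this also encodes City, because every categorical column is now picked up by default.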
Unique Variables:
Even if a column contains the same value two or more times, only one new column is created for that value. To demonstrate this, we will modify the DataFrame we have been using so that it contains a repeated value.
df = pd.DataFrame({
    "Name": ["John", "John", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"]
})
dummy_df = pd.get_dummies(df)
print(dummy_df)
Output:
   Age  Name_John  Name_Michael  City_London  City_New York  City_Paris
0   25          1             0            0              1           0
1   30          1             0            1              0           0
2   28          0             1            0              0           1
As you can see, we swapped out Emily for John, making John a repeated value. However, it is still only given one new column.
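A quick way to confirm this is to compare the number of unique values in a column with the number of dummy columns created for it. The sketch below does that for Name. The name name_columns is illustrative, and dtype=int is an assumption so the indicators can be summed as plain integers.

```python
import pandas as pd

# DataFrame with "John" repeated, as in the example above.
df = pd.DataFrame({
    "Name": ["John", "John", "Michael"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

dummy_df = pd.get_dummies(df, columns=["Name"], dtype=int)

# Collect the dummy columns that were created from Name.
name_columns = [col for col in dummy_df.columns if col.startswith("Name_")]

print(df["Name"].nunique())            # 2
print(name_columns)                    # ['Name_John', 'Name_Michael']
print(int(dummy_df["Name_John"].sum()))  # 2
```

The sum of an indicator column is simply the number of rows in which that category appears, so Name_John adds up to 2 here.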
This marks the end of the “Creating Dummy variables in Python with Pandas” Tutorial.