Data comparison is another important feature of Pandas. Its built-in functions make it easy to locate discrepancies between two DataFrames. In this tutorial, we cover how to check whether two Pandas DataFrames are equal and how to pinpoint any differences between them.
Using the equals() method to check if DataFrames are equal:
The equals() method compares two DataFrames (or Series) and returns True if they have the same shape and elements, and False otherwise. It takes the DataFrame to compare with as its argument. NaN values in the same location are treated as equal. The index and column labels do not need to be of the same type as long as their values are considered equal, but corresponding columns must have the same data type.
If all you need is a single True or False answer, this is the most straightforward of the approaches covered in this tutorial.
In the following code, we declared and initialized three Pandas DataFrames. We assigned identical values to df1 and df2, but changed “Oswald” to “Os” in df3. Then we used the equals() method to compare the DataFrames.
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df2 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df3 = pd.DataFrame([["Os", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])

print(df1.equals(df2))
print(df1.equals(df3))
Output:
True
False
As expected, when we compared df1 to df2, the method returned True as both DataFrames were identical. However, when we compared df1 to df3, the method returned False as their elements are not identical.
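The two rules described earlier for equals(), that NaN values in the same location count as equal and that corresponding columns must share a data type, can be checked with a small sketch. The frames a, b, c and d below are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Two frames with a NaN in the same position
a = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})
b = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})

# Same numbers, but stored as integers in c and as floats in d
c = pd.DataFrame({"x": [1, 2]})
d = pd.DataFrame({"x": [1.0, 2.0]})

nan_result = a.equals(b)    # NaNs in the same location are treated as equal
dtype_result = c.equals(d)  # int64 and float64 columns are not treated as equal

print(nan_result)
print(dtype_result)
```

The first call should print True and the second False, which is why equals() is a strict check: equal-looking numbers stored with different data types do not pass.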
Using the equality (==) operator to check if DataFrames are equal:
As you probably already know, the equality operator is used to compare two objects and return True if they are the same, else return False. If we compare two DataFrames using the equality operator, it will return a DataFrame that consists of boolean values. If two corresponding values are not the same, the DataFrame will store False in the corresponding cell, else it will be True.
One reason to use this over the equals() method is that it is not type sensitive, so it will treat 67 and 67.0 as equal. Because the result is cell by cell, it can also be used to find any differences in values for data analysis. Keep in mind that both DataFrames must have identical row and column labels, otherwise the comparison raises a ValueError, and that two NaN values are not considered equal by this operator.
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df2 = pd.DataFrame([["Oswald", "Male", 67.0],
                    ["Jack", "Male", 25.0],
                    ["Lacie", "Female", 32.0]])
df3 = pd.DataFrame([["Os", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])

print(df1 == df2)
print("")
print(df1 == df3)
Output:
      0     1     2
0  True  True  True
1  True  True  True
2  True  True  True

       0     1     2
0  False  True  True
1   True  True  True
2   True  True  True
Even though we changed all the integers in df2 to float, we still received True as the equality operator is not type sensitive.
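Two behaviours of the == operator are easy to miss: NaN never compares equal to NaN, and frames with different labels cannot be compared at all. The following is a minimal sketch with made-up frames a, b and c; the flag labels_ok is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan]})
b = pd.DataFrame({"x": [1.0, np.nan]})

# NaN == NaN is False, so the second row is reported as a difference
elementwise = a == b
print(elementwise)

# equals() treats NaNs in the same location as equal
print(a.equals(b))

# Frames whose labels differ cannot be compared with ==
c = pd.DataFrame({"z": [1.0, np.nan]})
try:
    a == c
    labels_ok = True
except ValueError:
    labels_ok = False
print(labels_ok)
```

If your data may contain missing values, equals() or compare() is usually the safer choice for an overall check.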
If you simply want to receive one boolean value while comparing two DataFrames, you can use the all() method. This method returns True if all the boolean values on an axis are True, but returns False if even one value is False. Since there are two axes in a DataFrame, we will use this method twice for each comparison.
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df2 = pd.DataFrame([["Oswald", "Male", 67.0],
                    ["Jack", "Male", 25.0],
                    ["Lacie", "Female", 32.0]])
df3 = pd.DataFrame([["Os", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])

# The first all() reduces each column, the second reduces the resulting Series
print((df1 == df2).all().all())
print((df1 == df3).all().all())
Output:
True
False
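As an alternative to chaining all() twice, the all() method also accepts axis=None, which reduces over both axes in a single call. The sketch below uses shortened versions of the frames above; the names same_12 and same_13 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67], ["Jack", "Male", 25]])
df2 = pd.DataFrame([["Oswald", "Male", 67.0], ["Jack", "Male", 25.0]])
df3 = pd.DataFrame([["Os", "Male", 67], ["Jack", "Male", 25]])

# axis=None reduces over rows and columns at once and yields a single value
same_12 = bool((df1 == df2).all(axis=None))
same_13 = bool((df1 == df3).all(axis=None))

print(same_12)
print(same_13)
```

Wrapping the result in bool() simply converts the NumPy boolean into a plain Python bool.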
compare() method:
The compare() method, available since Pandas 1.1.0, compares two DataFrames and shows the differences between them. It takes another DataFrame with the same row and column labels as an argument and returns a DataFrame containing only the rows and columns in which the two differ. Within that result, values that are equal are shown as NaN. It is not type sensitive.
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df2 = pd.DataFrame([["Oswald", "Male", 67.0],
                    ["Jack", "Male", 25.0],
                    ["Lacie", "Female", 32.0]])
df3 = pd.DataFrame([["Os", "Male", 67],
                    ["Alice", "Female", 25],
                    ["Lacie", "Female", 32]])

print(df1.compare(df2))
print("")
print(df1.compare(df3))
Output:
Empty DataFrame
Columns: []
Index: []

        0            1
     self  other  self   other
0  Oswald     Os   NaN     NaN
1    Jack  Alice  Male  Female
The comparison between df1 and df2 returned an empty DataFrame as there are no differences between the two. However, comparing df1 to df3 returned a DataFrame which displays the differences between the two.
self refers to df1 (the object that called the method) and other refers to df3 (the object passed as an argument).
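The compare() method also has optional parameters that change the layout of the result. The sketch below tries keep_shape, keep_equal and align_axis on the same data; the variable names full, with_values and stacked are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df3 = pd.DataFrame([["Os", "Male", 67],
                    ["Alice", "Female", 25],
                    ["Lacie", "Female", 32]])

# keep_shape=True keeps every row and column, even those without differences
full = df1.compare(df3, keep_shape=True)

# keep_equal=True shows the equal values themselves instead of NaN
with_values = df1.compare(df3, keep_shape=True, keep_equal=True)

# align_axis=0 stacks the self/other values vertically instead of side by side
stacked = df1.compare(df3, align_axis=0)

print(full)
print(with_values)
print(stacked)
```

Keeping the full shape can make the result easier to line up against the original data, while the default compact form is better for spotting the differences quickly.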
assert_frame_equal() function:
It compares two DataFrames and raises an AssertionError describing the differences if they are not equal. It takes several parameters, including left, right, and check_dtype. The first DataFrame is left and the second DataFrame is right. check_dtype is used to specify whether data types should also be compared, and is True by default.
It’s commonly used for unit testing. Because it raises an error instead of returning a value, we use it with exception handling to print “True” or “False”.
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]])
df2 = pd.DataFrame([["Oswald", "Male", 67.0],
                    ["Jack", "Male", 25.0],
                    ["Lacie", "Female", 32.0]])

# Default check_dtype=True: the integer and float columns do not match
try:
    assert_frame_equal(df1, df2)
    print("True")
except AssertionError:
    print("False")

# check_dtype=False: the data types are not compared
try:
    assert_frame_equal(df1, df2, check_dtype=False)
    print("True")
except AssertionError:
    print("False")
Output:
False
True
In the first test, check_dtype was True (by default) so it threw an error and printed “False”. But in the second test, check_dtype was False so float could be compared to integer, thus it printed “True”.
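If you need this check in several places, the try/except pattern can be wrapped in a small helper. The function frames_equal below is a hypothetical helper written for this sketch, not part of Pandas. It forwards any keyword arguments to assert_frame_equal, so options such as check_dtype and check_like (which ignores the order of the index and columns) can be passed through.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_equal(left, right, **kwargs):
    """Hypothetical helper: True if assert_frame_equal passes, else False."""
    try:
        assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


df1 = pd.DataFrame({"Name": ["Oswald", "Jack"], "Age": [67, 25]})
# Same data, but with the columns in a different order and Age stored as float
df2 = pd.DataFrame({"Age": [67.0, 25.0], "Name": ["Oswald", "Jack"]})

strict = frames_equal(df1, df2)
relaxed = frames_equal(df1, df2, check_dtype=False, check_like=True)

print(strict)
print(relaxed)
```

Catching only AssertionError keeps unrelated errors, such as passing something that is not a DataFrame, from being silently reported as "not equal".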
Additional Tips:
If you need to compare two DataFrames with different column labels or index labels, you can save the first DataFrame’s column labels and index in separate temporary variables and then assign the second DataFrame’s column labels and index to the first DataFrame. After comparing the DataFrames, re-assign the first DataFrame’s original labels and index from the temporary variables.
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]],
                   index=[1, 2, 3],
                   columns=["Name", "Gender", "Age"])
df2 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25],
                    ["Lacie", "Female", 32]],
                   index=[1, 4, 3],
                   columns=["User", "Gen", "Age"])

print(df1.equals(df2))

# Save df1's labels, then temporarily give df1 the labels of df2
temp_col = df1.columns
df1.columns = df2.columns
temp_row = df1.index
df1.index = df2.index

print(df1.equals(df2))

# Restore df1's original labels
df1.columns = temp_col
df1.index = temp_row
Output:
False
True
Since the labels and indices of df1 and df2 were different at the start, equals() returned False. But after equating the labels and indices, it returned True.
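A variation on this tip, shown here as a sketch rather than the approach used above, is to relabel a copy of the first DataFrame so the original never has to be restored. The names relabeled and values_match are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25]],
                   index=[1, 2],
                   columns=["Name", "Gender", "Age"])
df2 = pd.DataFrame([["Oswald", "Male", 67],
                    ["Jack", "Male", 25]],
                   index=[1, 4],
                   columns=["User", "Gen", "Age"])

# Work on a copy so df1 keeps its own labels
relabeled = df1.copy()
relabeled.columns = df2.columns
relabeled.index = df2.index

values_match = relabeled.equals(df2)

print(values_match)
print(list(df1.columns))  # df1 is unchanged
```

This costs an extra copy of the data, but it avoids leaving df1 with the wrong labels if an error occurs between the swap and the restore.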
This marks the end of the “How to check if Pandas DataFrames are equal” Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.