Since Pandas is used to create work with large sets of data, it is often necessary to extract selected columns, containing relevant information, from one DataFrame and store it in a new DataFrame. In this tutorial, we will go over several different techniques through which we can achieve this.
We will be using the following Pandas DataFrame to demonstrate these techniques:
import pandas as pd
df=pd.DataFrame({
"EmployeeID": [1, 2, 3, 4, 5],
"Name": ["Harper", "Aiden", "Evelyn", "James", "Charlotte"],
"Department": ["IT", "Finance", "HR", "QA", "Marketing"],
"Salary": [5000, 4000, 7500, 8000, 5500],
"Years_of_Service": [3, 5, 2, 8, 4]
})
EmployeeID Name Department Salary Years_of_Service
0 1 Harper IT 5000 3
1 2 Aiden Finance 4000 5
2 3 Evelyn HR 7500 2
3 4 James QA 8000 8
4 5 Charlotte Marketing 5500 4
Create a new Pandas DataFrame and add the selected columns using filter():
We’ll start off with the filter()
method. It takes items
as a parameter which accepts a list of column labels and is used to specify the columns to be extracted.
Example:
new_df = df.filter(items = ["EmployeeID", "Department"])
print(new_df)
Output:
EmployeeID Department
0 1 IT
1 2 Finance
2 3 HR
3 4 QA
4 5 Marketing
Assigning EmployeeID
and Department
to items
returned a Pandas DataFrame containing these two columns only.
Create a new Pandas DataFrame and add the selected columns with labels:
In case we only know the labels of the relevant columns, we can simply extract these fields/Series and add them into a new Pandas DataFrame.
Example 1:
For the first example, simply specify the column label of the original DataFrame and then assign that column to the new DataFrame.
new_df = pd.DataFrame()
new_df["Name"] = df["Name"]
new_df["Salary"] = df["Salary"]
print(new_df)
Output:
Name Salary
0 Harper 5000
1 Aiden 4000
2 Evelyn 7500
3 James 8000
4 Charlotte 5500
As we can see, the Name
and Salary
columns were successfully copied into the new DataFrame.
Example 2:
We can also pass a list of column labels, instead of individually assigning each column.
new_df = df[["EmployeeID", "Salary"]]
print(new_df)
Output:
Name Salary
0 Harper 5000
1 Aiden 4000
2 Evelyn 7500
3 James 8000
4 Charlotte 5500
Example 3:
Another option is to use the loc[]
method. It allows us to access rows and columns through their labels. It can accept both lists and slice objects.
Pass a slice object of the original DataFrame in the first parameter which will represent its contents. Then pass a list of the selected columns you wish to extract in the second parameter.
new_df = df.loc[ : , ["Name", "Department"]]
print(new_df)
Output:
Name Department
0 Harper IT
1 Aiden Finance
2 Evelyn HR
3 James QA
4 Charlotte Marketing
After passing the contents of the original DataFrame with :
, we specified the Name
and Department
columns using a list.
Example 4:
An alternative to passing a list is to pass a slice object representing all the columns labels, like so:
new_df = df.loc[ : , "Name" : "Salary"]
print(new_df)
Output:
Name Department Salary
0 Harper IT 5000
1 Aiden Finance 4000
2 Evelyn HR 7500
3 James QA 8000
4 Charlotte Marketing 5500
All columns between and including Name
and Salary
were extracted using a slice object.
Create a new Pandas DataFrame and add the selected columns with column indexes:
On the other hand, if we only know the column indexes of the columns we want to extract, we can use the iloc[]
method. This method is almost identical to the loc[]
method with the key difference being that it accepts integers/indexes instead of labels. Therefore, it works exactly like loc[]
and is able to accept lists as well as slice objects.
Example 1:
For this example, we will pass index 2 and 3 inside a list. 2 corresponds to the Department
column while 3 corresponds to the Salary
column.
new_df = df.iloc[ : , [2, 3]]
print(new_df)
Output:
Department Salary
0 IT 5000
1 Finance 4000
2 HR 7500
3 QA 8000
4 Marketing 5500
Example 2:
When using slice objects with iloc[]
, the DataFrame will only contain columns whose index is below the upper limit. For example, in the code written below, only columns with indexes of 1 and 2 are returned, but not 3.
new_df = df.iloc[ : , 1:3]
print(new_df)
Output:
Name Department
0 Harper IT
1 Aiden Finance
2 Evelyn HR
3 James QA
4 Charlotte Marketing
Create a new Pandas DataFrame by removing selected columns:
We can even remove certain columns to get the desired result using the drop()
method. It takes labels
, axis
, index
, columns
, and inplace
as optional parameters.
Example 1:
labels
refers to the column label or index to be removed from the DataFrame. It accepts a single label or a list of labels.
We use the axis
parameter to specify if we want to remove a label/field or an index/row. Passing 0 will remove the index while passing 1 will remove the column label. By default, axis
is assigned a value of 0.
# removing the 3rd row
new_df1 = df.drop(2, axis = 0)
print(new_df1)
# removing the "Department" column
new_df2 = df.drop("Department", axis = 1)
print(new_df2)
Output:
EmployeeID Name Department Salary Years_of_Service
0 1 Harper IT 5000 3
1 2 Aiden Finance 4000 5
3 4 James QA 8000 8
4 5 Charlotte Marketing 5500 4
EmployeeID Name Salary Years_of_Service
0 1 Harper 5000 3
1 2 Aiden 4000 5
2 3 Evelyn 7500 2
3 4 James 8000 8
4 5 Charlotte 5500 4
The row with index 2 was removed and stored in new_df1
while the column with label Department
was removed and stored in new_df2
.
Example 2:
index
and columns
are shorthand ways to avoid using axis
. The rules that apply on labels
are also applied to index
and columns
. We will use these new parameters to produce the previous example’s result in a more concise way.
# removing the 3rd row
new_df1 = df.drop(index = 2)
print(new_df1)
# removing the "Department" column
new_df2 = df.drop(columns = "Department")
print(new_df2)
Output:
EmployeeID Name Department Salary Years_of_Service
0 1 Harper IT 5000 3
1 2 Aiden Finance 4000 5
3 4 James QA 8000 8
4 5 Charlotte Marketing 5500 4
EmployeeID Name Salary Years_of_Service
0 1 Harper 5000 3
1 2 Aiden 4000 5
2 3 Evelyn 7500 2
3 4 James 8000 8
4 5 Charlotte 5500 4
The results of the current and previous examples are identical.
Example 3:
inplace
is used to choose between dropping the labels
internally and returning None
, or returning a copy with the labels
removed. If inplace
is False
it follows the latter, else it follows the former. It is False
by default.
# inplace is False by default
df.drop(index = 1)
print(df)
df.drop(index = 1, inplace = True)
print(df)
Output:
EmployeeID Name Department Salary Years_of_Service
0 1 Harper IT 5000 3
1 2 Aiden Finance 4000 5
2 3 Evelyn HR 7500 2
3 4 James QA 8000 8
4 5 Charlotte Marketing 5500 4
EmployeeID Name Department Salary Years_of_Service
0 1 Harper IT 5000 3
2 3 Evelyn HR 7500 2
3 4 James QA 8000 8
4 5 Charlotte Marketing 5500 4
Since inplace = False
in the first drop()
operation, a copy was returned. However, we did not store this copy in a variable so all the changes made were lost. In the second drop()
operation, inplace = True
, so the operation was performed internally, thus removing any need to store the DataFrame in a new variable.
One interesting thing to note is that Pandas sometimes returns the view of a DataFrame instead of a new copy. This can lead to problems later on if multiple DataFrames have the same reference. A change in one DataFrame will affect all other DataFrames. To remedy this, we create a Deep Copy of the DataFrame with the copy()
method.
This marks the end of the “Create New DataFrame in Pandas with Selected Columns” Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.