Create New DataFrame in Pandas with Selected Columns

Since Pandas is used to create work with large sets of data, it is often necessary to extract selected columns, containing relevant information, from one DataFrame and store it in a new DataFrame. In this tutorial, we will go over several different techniques through which we can achieve this.


We will be using the following Pandas DataFrame to demonstrate these techniques:

import pandas as pd

df=pd.DataFrame({
              "EmployeeID": [1, 2, 3, 4, 5],
              "Name": ["Harper", "Aiden", "Evelyn", "James", "Charlotte"],
              "Department": ["IT", "Finance", "HR", "QA", "Marketing"],
              "Salary": [5000, 4000, 7500, 8000, 5500],
              "Years_of_Service": [3, 5, 2, 8, 4]
               })
   EmployeeID       Name Department  Salary  Years_of_Service
0           1     Harper         IT    5000                 3
1           2      Aiden    Finance    4000                 5
2           3     Evelyn         HR    7500                 2
3           4      James         QA    8000                 8
4           5  Charlotte  Marketing    5500                 4

Create a new Pandas DataFrame and add the selected columns using filter():

We’ll start off with the filter() method. It takes items as a parameter which accepts a list of column labels and is used to specify the columns to be extracted.

Example:

new_df = df.filter(items = ["EmployeeID", "Department"])
print(new_df)

Output:

   EmployeeID Department
0           1         IT
1           2    Finance
2           3         HR
3           4         QA
4           5  Marketing

Assigning EmployeeID and Department to items returned a Pandas DataFrame containing these two columns only.


Create a new Pandas DataFrame and add the selected columns with labels:

In case we only know the labels of the relevant columns, we can simply extract these fields/Series and add them into a new Pandas DataFrame.

Example 1:

For the first example, simply specify the column label of the original DataFrame and then assign that column to the new DataFrame.

new_df = pd.DataFrame()
new_df["Name"] = df["Name"] 
new_df["Salary"] = df["Salary"]
print(new_df)

Output:

        Name  Salary
0     Harper    5000
1      Aiden    4000
2     Evelyn    7500
3      James    8000
4  Charlotte    5500

As we can see, the Name and Salary columns were successfully copied into the new DataFrame.

Example 2:

We can also pass a list of column labels, instead of individually assigning each column.

new_df = df[["EmployeeID", "Salary"]]
print(new_df)

Output:

        Name  Salary
0     Harper    5000
1      Aiden    4000
2     Evelyn    7500
3      James    8000
4  Charlotte    5500

Example 3:

Another option is to use the loc[] method. It allows us to access rows and columns through their labels. It can accept both lists and slice objects.

Pass a slice object of the original DataFrame in the first parameter which will represent its contents. Then pass a list of the selected columns you wish to extract in the second parameter.

new_df = df.loc[ : , ["Name", "Department"]]
print(new_df)

Output:

        Name Department
0     Harper         IT
1      Aiden    Finance
2     Evelyn         HR
3      James         QA
4  Charlotte  Marketing

After passing the contents of the original DataFrame with : , we specified the Name and Department columns using a list.

Example 4:

An alternative to passing a list is to pass a slice object representing all the columns labels, like so:

new_df = df.loc[ : , "Name" : "Salary"]
print(new_df)

Output:

        Name Department  Salary
0     Harper         IT    5000
1      Aiden    Finance    4000
2     Evelyn         HR    7500
3      James         QA    8000
4  Charlotte  Marketing    5500

All columns between and including Name and Salary were extracted using a slice object.


Create a new Pandas DataFrame and add the selected columns with column indexes:

On the other hand, if we only know the column indexes of the columns we want to extract, we can use the iloc[] method. This method is almost identical to the loc[] method with the key difference being that it accepts integers/indexes instead of labels. Therefore, it works exactly like loc[] and is able to accept lists as well as slice objects.

Example 1:

For this example, we will pass index 2 and 3 inside a list. 2 corresponds to the Department column while 3 corresponds to the Salary column.

new_df = df.iloc[ : , [2, 3]]
print(new_df)

Output:

  Department  Salary
0         IT    5000
1    Finance    4000
2         HR    7500
3         QA    8000
4  Marketing    5500

Example 2:

When using slice objects with iloc[], the DataFrame will only contain columns whose index is below the upper limit. For example, in the code written below, only columns with indexes of 1 and 2 are returned, but not 3.

new_df = df.iloc[ : , 1:3]
print(new_df)

Output:

        Name Department
0     Harper         IT
1      Aiden    Finance
2     Evelyn         HR
3      James         QA
4  Charlotte  Marketing

Create a new Pandas DataFrame by removing selected columns:

We can even remove certain columns to get the desired result using the drop() method. It takes labels, axis, index, columns, and inplace as optional parameters.

Example 1:

labels refers to the column label or index to be removed from the DataFrame. It accepts a single label or a list of labels.

We use the axis parameter to specify if we want to remove a label/field or an index/row. Passing 0 will remove the index while passing 1 will remove the column label. By default, axis is assigned a value of 0.

# removing the 3rd row
new_df1 = df.drop(2, axis = 0)
print(new_df1)

# removing the "Department" column
new_df2 = df.drop("Department", axis = 1)
print(new_df2)

Output:

   EmployeeID       Name Department  Salary  Years_of_Service
0           1     Harper         IT    5000                 3
1           2      Aiden    Finance    4000                 5
3           4      James         QA    8000                 8
4           5  Charlotte  Marketing    5500                 4


   EmployeeID       Name  Salary  Years_of_Service
0           1     Harper    5000                 3
1           2      Aiden    4000                 5
2           3     Evelyn    7500                 2
3           4      James    8000                 8
4           5  Charlotte    5500                 4

The row with index 2 was removed and stored in new_df1 while the column with label Department was removed and stored in new_df2.

Example 2:

index and columns are shorthand ways to avoid using axis. The rules that apply on labels are also applied to index and columns. We will use these new parameters to produce the previous example’s result in a more concise way.

# removing the 3rd row
new_df1 = df.drop(index = 2)
print(new_df1)

# removing the "Department" column
new_df2 = df.drop(columns = "Department")
print(new_df2)

Output:

   EmployeeID       Name Department  Salary  Years_of_Service
0           1     Harper         IT    5000                 3
1           2      Aiden    Finance    4000                 5
3           4      James         QA    8000                 8
4           5  Charlotte  Marketing    5500                 4


   EmployeeID       Name  Salary  Years_of_Service
0           1     Harper    5000                 3
1           2      Aiden    4000                 5
2           3     Evelyn    7500                 2
3           4      James    8000                 8
4           5  Charlotte    5500                 4

The results of the current and previous examples are identical.

Example 3:

inplace is used to choose between dropping the labels internally and returning None, or returning a copy with the labels removed. If inplace is False it follows the latter, else it follows the former. It is False by default.

# inplace is False by default
df.drop(index = 1)
print(df)

df.drop(index = 1, inplace = True)
print(df)

Output:

   EmployeeID       Name Department  Salary  Years_of_Service
0           1     Harper         IT    5000                 3
1           2      Aiden    Finance    4000                 5
2           3     Evelyn         HR    7500                 2
3           4      James         QA    8000                 8
4           5  Charlotte  Marketing    5500                 4


   EmployeeID       Name Department  Salary  Years_of_Service
0           1     Harper         IT    5000                 3
2           3     Evelyn         HR    7500                 2
3           4      James         QA    8000                 8
4           5  Charlotte  Marketing    5500                 4

Since inplace = False in the first drop() operation, a copy was returned. However, we did not store this copy in a variable so all the changes made were lost. In the second drop() operation, inplace = True, so the operation was performed internally, thus removing any need to store the DataFrame in a new variable.


One interesting thing to note is that Pandas sometimes returns the view of a DataFrame instead of a new copy. This can lead to problems later on if multiple DataFrames have the same reference. A change in one DataFrame will affect all other DataFrames. To remedy this, we create a Deep Copy of the DataFrame with the copy() method.



This marks the end of the “Create New DataFrame in Pandas with Selected Columns” Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.

Leave a Comment