Compress Pickle File Dumps in Python

Pickle is a powerful serialization module in Python, capable of serializing (pickling) objects to files for later use. When working with large datasets, these files can grow quite large, which also slows down loading and dumping the data. To address this, we can compress our pickle file dumps, which can drastically reduce their size.

In this tutorial we’ll be using the bz2 compression library. There is also the gzip library, which is generally a bit faster but usually does not reduce the file size as much.
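To get a feel for the trade-off on your own data, here is a minimal sketch that pickles some records in memory and compresses the result with both libraries. The records are plain dictionaries made up for this illustration (not the Student objects used later), and the exact sizes will depend on your data.

```python
import bz2
import gzip
import pickle
import random

# Hypothetical stand-in data: plain dicts instead of custom objects
random.seed(0)
names = ["Bob", "Rob", "Sam", "Jane", "Sally", "John", "Emilia"]
records = [{"ID": i, "Name": random.choice(names)} for i in range(1000)]

# Pickle to bytes in memory, then compress those bytes both ways
raw = pickle.dumps(records)
bz2_bytes = bz2.compress(raw)
gzip_bytes = gzip.compress(raw)

print("raw: ", len(raw), "bytes")
print("bz2: ", len(bz2_bytes), "bytes")
print("gzip:", len(gzip_bytes), "bytes")
```

Running this on a sample of your real data is a quick way to decide which library suits your case.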

Compressing Data

First let’s take a look at the data that we are going to dump into a file.

We’ve come up with a chunk of code that generates 1000 objects from our custom class “Student”. To add some realism, we also use the random library to vary the data in the Student objects by picking a random combination of names and genders.

# Data to be Compressed
import pickle
import bz2
import random
import os

class Student:
    def __init__(self, ID, Name, Gender):
        self.ID = ID
        self.Name = Name
        self.Gender = Gender

    def display(self):
        print("Name:", self.Name,
              "ID:", self.ID,
              "Gender:", self.Gender)

data = []
names = ["Bob", "Rob", "Sam", "Jane", "Sally", "John", "Emilia"]
genders = ["M", "F", "O"]

for x in range(1000):
    data.append(Student(x, random.choice(names), random.choice(genders)))

We now have our list called “data”, which has 1000 Student Objects. Let’s try dumping this to a file and see what happens.

# Writing data normally to a file (pickle)

ofile = open("BinaryData",'wb')
pickle.dump(data, ofile)
ofile.close()

print(os.path.getsize("BinaryData"))  # Return file size

The above code prints the size of the pickle file in bytes.

Compressing Pickle File Data

Now let’s see just how much we can compress that file using bz2. We’ll be doing the same procedure again, but this time through bz2's functions.

All we have to do is use the bz2.BZ2File class instead of the standard open() function seen in regular file handling. Alternatively, you can use the bz2.open() function, which provides the same compression.

For the unaware, the first parameter is the file path or filename, and the second is the mode the file is opened in. With the standard open() function, “w” stands for text write, whereas “wb” stands for binary write. Since pickle produces binary data, we use “wb” here. (BZ2File always works in binary mode, so “w” and “wb” are equivalent there.)

# Compressing Data

ofile = bz2.BZ2File("BinaryData",'wb')
pickle.dump(data,ofile)
ofile.close()

print(os.path.getsize("BinaryData"))

The getsize() function from the os module prints the new file size in bytes.

All in all, we observed a roughly 5x reduction compared to the original pickle file size, and all it took was a few small changes to the code. In the next section, we’ll explore how to read this content back into Python so we can use it.
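If you want to measure the reduction yourself, the following sketch writes the same data once with open() and once with bz2.open(), then compares the two file sizes. It is only an illustration: the records are plain dictionaries, the file names are hypothetical, and the files live in a temporary directory, so the ratio you see will differ from the one quoted above.

```python
import bz2
import os
import pickle
import tempfile

# Hypothetical stand-in data for this sketch
records = [{"ID": i, "Gender": "MFO"[i % 3]} for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.pkl")
    packed_path = os.path.join(tmp, "packed.pkl.bz2")

    # Regular pickle dump; the with block closes the file for us
    with open(plain_path, "wb") as f:
        pickle.dump(records, f)

    # Same dump, but written through bz2 compression
    with bz2.open(packed_path, "wb") as f:
        pickle.dump(records, f)

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

ratio = plain_size / packed_size
print(f"plain: {plain_size} bytes, compressed: {packed_size} bytes, ratio: {ratio:.1f}x")
```

Using with blocks here means the files are closed automatically, even if the dump raises an error.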

Check out the bz2 tutorial for alternate ways of compressing data, or the gzip tutorial for faster compression!

Decompressing Pickle Data in Python

Decompressing pickle file data in Python is pretty easy and can be done in two ways. The first method is shown below, where we use the bz2.BZ2File class to get a file object.

Just like before, the syntax is the same: the first parameter is the filename (use the file path if the file is in another directory), and the second parameter is “rb” for reading binary files.

Once we have our BZ2File object ready, we just pass it to pickle.load(), which takes a file object as its parameter. pickle.load() returns the pickled data in the same form it had when it was dumped.

# Reading the compressed data 

ifile = bz2.BZ2File("BinaryData",'rb')
newdata = pickle.load(ifile)
ifile.close()

for x in newdata:
    x.display()

Here are the first 6 results from the output of the above code.

Name: Sally ID: 0 Gender: O
Name: Emilia ID: 1 Gender: F
Name: Sam ID: 2 Gender: M
Name: Sam ID: 3 Gender: O
Name: Bob ID: 4 Gender: O
Name: Rob ID: 5 Gender: F
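As a quick sanity check, here is a small sketch of a full write-then-read round trip using bz2.open() and with blocks. The tuples and the file name are made up for this example, and the file is written to a temporary directory so nothing is left behind.

```python
import bz2
import os
import pickle
import tempfile

# Hypothetical sample records for this sketch
records = [("Sally", 0, "O"), ("Emilia", 1, "F"), ("Sam", 2, "M")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "students.pkl.bz2")

    # Write the compressed pickle
    with bz2.open(path, "wb") as f:
        pickle.dump(records, f)

    # Read it back; decompression happens transparently
    with bz2.open(path, "rb") as f:
        restored = pickle.load(f)

# The restored data should compare equal to what was dumped
print(restored == records)
```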

Alternate Method

Here is an alternate method of reading the data.

Instead of passing a file object to pickle.load(), we can pass the pickled data as a bytes object to pickle.loads(). Of course, we first have to decompress it (or pickle won’t recognize it) using the bz2.decompress() function, which takes bytes. To get those bytes, we call read() on a regular file object opened in binary mode.

Overall it’s a lot of function calls, but it’s an interesting showcase and an alternative way of handling things.

# Reading the compressed data

ifile = open("BinaryData",'rb')
newdata = pickle.loads(bz2.decompress(ifile.read()))
ifile.close()

for x in newdata:
    x.display()

And of course, the output of the above code is going to be the same as the previous one.
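The same idea also works in the other direction and without touching the disk at all. The sketch below pairs pickle.dumps() with bz2.compress() to produce compressed bytes, and reverses the steps to restore the object. The helper names pack and unpack are made up for this example and are not part of either library.

```python
import bz2
import pickle

def pack(obj):
    """Pickle an object to bytes, then compress those bytes (hypothetical helper)."""
    return bz2.compress(pickle.dumps(obj))

def unpack(blob):
    """Decompress the bytes, then unpickle the original object (hypothetical helper)."""
    return pickle.loads(bz2.decompress(blob))

# Hypothetical sample data for this sketch
records = [{"ID": i, "Name": "Sam"} for i in range(100)]

blob = pack(records)
restored = unpack(blob)

print(len(blob), "compressed bytes")
print(restored == records)
```

This can be handy when the compressed bytes are headed somewhere other than a file, such as a cache or a network socket.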

This marks the end of the Python Pickle File Compress Tutorial.