Python gzip module – compress and decompress files

When dealing with large files, loading times can cause delays in our applications. One solution is to compress the files with a compression library to produce smaller file sizes. In Python, we can use the gzip library to compress and decompress files.


What is Object Serialization?

Serialization is the process of converting an object to a byte stream. The inverse, de-serialization, converts a byte stream back into a Python object.

In simpler words, Object Serialization is the process of converting actual Python objects into bytes, allowing the whole object to be preserved (with all its current values). This is commonly known as “pickling” or “dumping”, where we save the byte stream into a file.

The reverse process is de-serialization, where we convert the byte stream back into objects that are recognized by Python.

Object Serialization is a useful tool in many situations, such as creating save files to store things like game data or trained models for AI/Machine Learning problems. It can take a long time for AI algorithms to generate a model, so instead of doing it every time you run the program, you can dump the model to a file once and read it back each time you need it, which can make the program start far faster than retraining would.
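To make the idea concrete, here is a minimal sketch of a serialization round trip using the standard pickle module. The save_state dictionary is a made-up stand-in for game data or a trained model, not something from this tutorial.

```python
import pickle

# Hypothetical object standing in for game data or a trained model.
save_state = {"player": "Alice", "level": 12, "inventory": ["sword", "potion"]}

# Serialize ("pickle") the object into a byte stream.
payload = pickle.dumps(save_state)

# De-serialize the byte stream back into an equal Python object.
restored = pickle.loads(payload)

print(type(payload))
print(restored == save_state)
```

pickle.dumps() and pickle.loads() work on bytes in memory, while pickle.dump() and pickle.load() do the same thing with a file object.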


Test Data

Here is the data we will be compressing throughout this tutorial. gzip works on bytes, so we declare the string below as a bytes literal by adding a “b” prefix before the quotation marks.

import gzip
import os

data = b''' Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Proin congue orci odio, non sagittis orci cursus vel.
Ut aliquam sapien non vulputate elementum.
Sed rutrum ante sit amet egestas ultrices.
Vivamus malesuada tincidunt justo, in ornare nibh porta ac.
Nullam interdum, quam at ornare hendrerit, ex augue rutrum dui, eu dictum diam velit eu urna.
Morbi malesuada elit sed laoreet varius.
Maecenas elementum vestibulum erat, sit amet molestie ex volutpat in.
Phasellus viverra lectus ante, vitae ultricies leo maximus fermentum.
Phasellus consectetur ex id ipsum iaculis, ut tempus risus porttitor.
Nam in ante blandit, feugiat magna a, ullamcorper nulla. Integer ac egestas justo.
Duis volutpat mattis nibh, ut semper justo luctus nec. '''

Compressing Files and Data

We’ll take a look at two different methods of compressing Files and Data with gzip in Python.

Opening a File with gzip

This method is used for creating files and storing compressed data in them. If you open a file with the gzip.GzipFile class, any data you write to it is automatically compressed.

gzip.GzipFile() is typically called with two arguments. The first is the filename (or file path if the file is not in the current directory), and the second is the mode to open it in, which defaults to “rb” if omitted. The syntax and the modes are similar to regular file handling in Python.

We’ll be using the “rb” (read binary) and “wb” (write binary) modes in this tutorial. GzipFile always works with binary data, so “r” and “w” behave the same way, but spelling out the “b” makes it explicit that we are dealing with bytes.

ofile = gzip.GzipFile("BinaryData", "wb")
ofile.write(data)
ofile.close()

print(os.path.getsize("BinaryData"))

Resulting File Size:

439

If you open the resulting binary file and check its contents, you will see something like this.

BZh91AY&SY_ª4  ×€ @'K ?÷ÿ@@ɹ@1O$Ó  =C@EOÁ=L“DÐ
4Si¥6 &@Ä 	M)‰Oše'©¡¸Uô#0iÚÑeôCh%1e‚Œ•æI!5 áêvº¢šIÆ0èû09£ØÒËWÉIØ¥Î+I¸µ°Ê³	¸qÏ‘p*×97hBÇ2A®ÑFÒdª~&[‚k‡öZrŒ '9Ž@¡iÖ›J,¶¼ÚÄ{½á,VDåÈ¥O“uüÒ°p„‰-CÎ{óÃWžŒß>½µÖäa²vX2a1Ì·]~BÌ]“ë•Eù…Z×V†ˆ^fÞ”pj}ôÚ}[-ÛFæý›{2dCµ•<¦(p–0¡ÏˆCª´Ö‰zŠpF’ßSM[:…ñÆçsݸ¿ëß×+à£_m6ÐE³¿":.îÊãœä¼Êҏ*!Ê(‚’—0"â‹Èò °K#bµôíe¥Ñ
4 r/$Ô¦Âä’5gm48xÅðÙEϳc¯õñõÇ”~wë‡ÝãT$V^|]Õ»}µÉ±’lÉ
ÓuÿrE8P_ª4

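This is a text editor’s attempt to display binary data, which it isn’t able to do in any readable format. The exact characters you see will differ: a file written by gzip begins with the bytes 0x1f 0x8b, whereas the “BZh” at the start of the sample above is the signature of a bzip2 stream.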

Now let’s try that again, but without gzip, and then compare the resulting file sizes.

ofile2 = open("BinaryData2", "wb")
ofile2.write(data)
ofile2.close()

print(os.path.getsize("BinaryData2"))

Resulting File Size:

741

The uncompressed file is roughly 1.7 times the size of the gzip version (741 bytes versus 439).

Generally speaking, the larger the dataset, the more noticeable the compression. Every gzip file carries a small fixed overhead (a header and a trailer), so for very small inputs the compressed version can actually be larger than the original. As the dataset grows, this overhead becomes negligible.

How repetitive the data is also plays a big role in how well it compresses. Highly repetitive data can shrink to a small fraction of its original size, while data that is random or already compressed barely shrinks at all.
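The overhead on small inputs is easy to see for yourself. The sketch below compresses a two-byte input and a much larger repetitive one; both inputs are made up for illustration.

```python
import gzip

tiny = b"hi"                       # only 2 bytes of input
large = b"hello world " * 1000     # 12,000 bytes of repetitive input

tiny_gz = gzip.compress(tiny)
large_gz = gzip.compress(large)

# The gzip header and trailer alone outweigh the tiny input.
print(len(tiny), "->", len(tiny_gz))

# For the larger input, the fixed overhead is negligible.
print(len(large), "->", len(large_gz))
```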
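As a rough illustration of how much the content matters, the following sketch compresses two inputs of equal length: one made of a single repeated byte, and one made of random bytes from os.urandom(). The sizes chosen here are arbitrary.

```python
import gzip
import os

size = 100_000
repetitive = b"A" * size          # as similar as data can get
random_bytes = os.urandom(size)   # essentially no repeated patterns

results = {}
for label, payload in (("repetitive", repetitive), ("random", random_bytes)):
    compressed = gzip.compress(payload)
    results[label] = len(compressed)
    ratio = len(compressed) / len(payload)
    print(f"{label}: {len(payload)} -> {len(compressed)} bytes (ratio {ratio:.3f})")
```

The repetitive input collapses to a tiny fraction of its size, while the random input stays about as large as it started.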

Alternatively, you can use gzip.open() instead of gzip.GzipFile(). In binary mode the two are pretty much the same thing, with gzip.open() returning a GzipFile object. gzip.open() additionally supports text modes such as “rt” and “wt”.
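Here is a small sketch of gzip.open() used with a with statement, which closes the file automatically. The file name and its contents are placeholders, and the file is written to a temporary directory so the example does not touch your working directory.

```python
import gzip
import os
import tempfile

# Hypothetical file path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.gz")

# Binary mode: write bytes, which are compressed on the way out.
with gzip.open(path, "wb") as f:
    f.write(b"first line\nsecond line\n")

# Text mode ("rt"): the data is decompressed and decoded to str for us.
with gzip.open(path, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
```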

Compression Setting

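You can adjust the compression level through the compresslevel parameter, which ranges from 0 to 9. A value of 0 means no compression, 1 is the fastest with the least compression, and 9 (the default) is the slowest with the most compression. The difference won’t be noticeable on a dataset as small as the one we are using right now, though.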

with gzip.GzipFile("BinaryData5", "wb", compresslevel=1) as ofile:
    ofile.write(data)

A lower compression level is faster. The difference in output size between 1 and 9 is often modest (around 5% on large datasets, though this depends heavily on the data), so if speed is your top priority you may want to use a low compresslevel setting.
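If you want to measure the trade-off on your own data, a sketch along these lines may help. The sample here is a made-up block of log-like records, so the sizes and timings you see will differ for real data.

```python
import gzip
import time

# Hypothetical sample: log-like records, large enough to show a difference.
sample = b"".join(
    b"record %d: status=OK value=%d\n" % (i, i * 7) for i in range(20000)
)

sizes = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    sizes[level] = len(gzip.compress(sample, compresslevel=level))
    elapsed = time.perf_counter() - start
    print(f"level {level}: {sizes[level]} bytes in {elapsed:.4f} s")

print("original:", len(sample), "bytes")
```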


Compress Function

The method we discussed earlier relies on opening a file with the gzip.GzipFile class. Basically, we are working with file streams. But what if we just want to compress some data in memory, without having anything to do with file streams?

This is where the compress() function comes in. It takes the data directly as an argument and returns a compressed version.

compressed = gzip.compress(data)
print(len(compressed))
print(compressed)

The output: (Truncated for readability purposes)

428
b"\x1f\x8b\x08\x00\xb5L}a\x02\xffM\x92Kn\xdc0\x0c\x86\xf7>\x05\x0f0\xf0)\xba)\xd0\x14\x01\x8av\xcf\x91\x18\x0f\x0b=\\\x8a4r\xfc\xferf&\xd9\x89\xef\x9f\x1fE\xbaI%\xddGT\xca\xbdt\xa3\xa1N\\\xc5/\x94z\x1b\x92.............
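Since compress() only accepts bytes, a regular string has to be encoded first. The sketch below shows one way to do that, assuming UTF-8 as the encoding; the sample text is made up.

```python
import gzip

text = "Compress me: héllo wörld. " * 20

# compress() needs bytes, so encode the str first.
raw = text.encode("utf-8")
compressed = gzip.compress(raw)

# Reverse both steps to get the original string back.
restored = gzip.decompress(compressed).decode("utf-8")

print(len(raw), "->", len(compressed))
print(restored == text)
```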

Decompressing Files and Data

Now let’s take a look at how to decompress data. Remember, don’t try decompressing regular data. Only try this on compressed data, such as the kind produced by gzip’s compression functions.
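To see what happens when the input is not gzip data, consider this short sketch. It catches OSError, which is the base class of the gzip.BadGzipFile exception raised on Python 3.8 and later.

```python
import gzip

not_gzip = b"just some plain bytes"
error_message = None

try:
    gzip.decompress(not_gzip)
except OSError as exc:
    # gzip.BadGzipFile (Python 3.8+) is a subclass of OSError.
    error_message = str(exc)
    print("Could not decompress:", error_message)
```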

Decompressing Data from Opened File

The first thing you can do is open a file containing compressed data with gzip.GzipFile, and then call the read() method on the returned file object. This returns the decompressed data exactly as it was before compression.

ofile = gzip.GzipFile("BinaryData5", "wb")
ofile.write(data)
ofile.close()

ifile = gzip.GzipFile("BinaryData5", "rb")
print(ifile.read())
ifile.close()

The output:

b' Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nProin congue orci odio, non sagittis orci cursus vel.\nUt aliquam sapien non vulputate elementum.\nSed rutrum ante sit amet egestas ultrices.\nVivamus malesuada tincidunt justo, in ornare nibh porta ac.\nNullam interdum, quam at ornare hendrerit, ex augue rutrum dui, eu dictum diam velit eu urna.\nMorbi malesuada elit sed laoreet varius.\nMaecenas elementum vestibulum erat, sit amet molestie ex volutpat in.\nPhasellus viverra lectus ante, vitae ultricies leo maximus fermentum.\nPhasellus consectetur ex id ipsum iaculis, ut tempus risus porttitor.\nNam in ante blandit, feugiat magna a, ullamcorper nulla. Integer ac egestas justo.\nDuis volutpat mattis nibh, ut semper justo luctus nec. '

gzip.open() and gzip.GzipFile are both interchangeable here.


Decompress Function

An alternative way of decompressing compressed data is to use the decompress() function. You will need this approach if the file was opened for reading with the regular open() function rather than gzip.open(), since read() then returns the bytes still compressed.

compressed = gzip.compress(data)
print(len(compressed))
print(compressed)

print("")

uncompressed = gzip.decompress(compressed)
print(len(uncompressed))
print(uncompressed)

The output will look something like this. (Run the code yourself to see the full output)

428
b"\x1f\x8b\x08\x00\xb5L}a\x02\xffM\x92Kn\xdc0\x0c\x86\xf7>\x05\x0f0\xf0)\xba)\xd0\x14\x01\x8av\xcf\x91\x18\x0f\x0b=\\\x8a4r\xfc\xferf&\xd9\x89\xef\x9f\x1fE\xbaI%\xddGT\xca\xbdt\xa3\xa1N\\\xc5/\x94z\x1b\x92.............


741
b' Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nProin congue orci odio, non sagittis orci cursus vel.\nUt aliquam sapien non vulputate elementum.......
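The case mentioned earlier, where a gzip file is read with the regular open() function, could look roughly like this. The file name and payload are placeholders, and the file lives in a temporary directory.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "payload.gz")
original = b"some bytes worth keeping\n" * 50

# Write a gzip file the usual way.
with gzip.open(path, "wb") as f:
    f.write(original)

# Regular open(): read() returns the bytes exactly as stored, still compressed.
with open(path, "rb") as f:
    raw = f.read()

# decompress() turns those raw bytes back into the original data.
restored = gzip.decompress(raw)
print(len(raw), "->", len(restored))
```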

Using gzip with Pickle

A rather common use of the gzip library is alongside Python’s pickle module, which converts Python objects to a byte stream (serializing) so they can be saved to a file. The gzip library is then used to compress these byte streams, which can reduce the size of the pickle files considerably.

# Compressing Pickle Data
import gzip
import os
import pickle

ofile = gzip.GzipFile("BinaryData", "wb")
pickle.dump(data, ofile)
ofile.close()

print(os.path.getsize("BinaryData"))
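Reading the object back is the mirror image of the code above. The following sketch dumps a made-up dictionary to a gzip file in a temporary directory and then loads it again; the model object and file name are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical object standing in for a trained model.
model = {"name": "demo", "weights": [0.1, 0.2, 0.3] * 1000}
path = os.path.join(tempfile.mkdtemp(), "model.pkl.gz")

# Pickle straight into the gzip stream.
with gzip.open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back: gzip decompresses, pickle de-serializes.
with gzip.open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)
print(os.path.getsize(path), "bytes on disk vs", len(pickle.dumps(model)), "uncompressed")
```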

For the full tutorial, please check out our Python Pickle Tutorial, or our bz2 tutorial dedicated to the compression of Pickle data. bz2 is an alternative to gzip that you can use to compress data streams.


This marks the end of the Python gzip Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.
