In this tutorial we will explore the Python bz2 library and its various functions for compressing and decompressing file data.
File size is an important detail in almost any data-related field. When you work with large datasets, you often need to move the data around (e.g. loading and saving it). If the files are too large, every one of those operations becomes slower and takes up more storage.
The solution is to use compression to reduce the size as much as possible. The bz2 library is what we will be using in this tutorial to compress file streams and in-memory data, and then decompress the same data.
This is the sample text data that we’ll be working with throughout the tutorial. bz2 only compresses bytes-like objects, hence the string below is declared as a bytes literal by adding a little “b” before the quotation marks.
import bz2
import os
data = b''' Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Proin congue orci odio, non sagittis orci cursus vel.
Ut aliquam sapien non vulputate elementum.
Sed rutrum ante sit amet egestas ultrices.
Vivamus malesuada tincidunt justo, in ornare nibh porta ac.
Nullam interdum, quam at ornare hendrerit, ex augue rutrum dui, eu dictum diam velit eu urna.
Morbi malesuada elit sed laoreet varius.
Maecenas elementum vestibulum erat, sit amet molestie ex volutpat in.
Phasellus viverra lectus ante, vitae ultricies leo maximus fermentum.
Phasellus consectetur ex id ipsum iaculis, ut tempus risus porttitor.
Nam in ante blandit, feugiat magna a, ullamcorper nulla. Integer ac egestas justo.
Duis volutpat mattis nibh, ut semper justo luctus nec. '''
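If your data starts out as a regular str rather than a bytes literal, it has to be encoded first. The following is a minimal sketch of that step; the sample text and the variable names are made up for illustration and are not part of the tutorial's data.

```python
import bz2

# Hypothetical sample text; any str behaves the same way.
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" * 20

# Passing a str directly fails, because bz2 expects a bytes-like object.
try:
    bz2.compress(text)
    str_rejected = False
except TypeError:
    str_rejected = True

# Encode the str to bytes, compress, then reverse both steps.
raw = text.encode("utf-8")
compressed = bz2.compress(raw)
restored = bz2.decompress(compressed).decode("utf-8")

print(str_rejected, len(raw), len(compressed))
print(restored == text)
```

The same encode/decode pair is needed whenever text read from elsewhere (user input, a text file) has to go through bz2.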
Compressing Files and Data
We’ll take a look at two different methods of compressing Files and Data with bz2 in Python.
Opening a File with bz2
This method is used for creating files and storing compressed data in them. If you open the file with the bz2.BZ2File() class, any data you write to it will be automatically compressed.
The bz2.BZ2File() constructor takes the filename (or filepath if not in the same directory) as its first argument and the mode as its second. The mode is optional and defaults to “r”. The syntax is just like regular File Handling in Python, as are the modes.
We’ll be using the “rb” (read binary) and “wb” (write binary) modes in this tutorial. BZ2File always works with binary data, so “r” and “w” behave the same as “rb” and “wb”; the explicit forms simply make that clear.
ofile = bz2.BZ2File("BinaryData", "wb")
ofile.write(data)
ofile.close()
print(os.path.getsize("BinaryData"))
Resulting File Size:
451
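The same write can also be done with a with statement, which closes the file even if an error occurs partway through. This is only a sketch: the payload, the file name, and the use of a temporary directory are assumptions made so the example leaves no files behind.

```python
import bz2
import os
import tempfile

# Hypothetical payload, used here instead of the tutorial's sample text.
payload = b"example payload " * 100

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "BinaryData.bz2")

    # The file is flushed and closed automatically when the block ends.
    with bz2.BZ2File(path, "wb") as ofile:
        ofile.write(payload)

    compressed_size = os.path.getsize(path)

print(len(payload), compressed_size)
```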
If you open up the resulting binary file and check its contents, you will see something like this.
BZh91AY&SY_ª4 ×€ @'K ?÷ÿ@@ɹ@1O$Ó =C@EOÁ=L“DÐ
4Si¥6 &@Ä M)‰Oše'©¡¸Uô#0iÚÑeôCh%1e‚Œ•æI!5 áêvº¢šIÆ0èû09£ØÒËWÉIØ¥Î+I¸µ°Ê³ ¸qÏ‘p*×97hBÇ2A®ÑFÒdª~&[‚k‡öZrŒ '9Ž@¡iÖ›J,¶¼ÚÄ{½á,VDåÈ¥O“uüÒ°p„‰-CÎ{óÃWžŒß>½µÖäa²vX2a1Ì·]~BÌ]“ë•Eù…Z×V†ˆ^fÞ”pj}ôÚ}[-ÛFæý›{2dCµ•<¦(p–0¡ÏˆCª´Ö‰zŠpF’ßSM[:…ñÆçsݸ¿ëß×+à£_m6ÐE³¿":.îÊãœä¼ÊÒ*!Ê(‚’—0"â‹Èò °K#bµôíe¥Ñ
4 r/$Ô¦Âä’5gm48xÅðÙEϳc¯õñõÇ”~wë‡ÝãT$V^|]Õ»}µÉ±’lÉ
ÓuÿrE8P_ª4
This is a text editor’s attempt to display binary data, which it isn’t able to do in any readable format.
Now let’s try that again, but without bz2, and then compare the resulting file sizes.
ofile2 = open("BinaryData2", "wb")
ofile2.write(data)
ofile2.close()
print(os.path.getsize("BinaryData2"))
Resulting File Size:
741
Here we can see a value roughly 1.6 times as large as the compressed bz2 version.
Generally speaking, the larger the dataset, the more noticeable the compression. Every bz2 stream carries some fixed overhead, so for very small inputs the compressed version can actually end up larger than the original. As the dataset grows, this becomes a non-issue.
How repetitive the data is also plays a big role in how well it compresses. On highly repetitive data, bz2 can produce a file 10x smaller than the original or better, while data with little repetition shrinks far less.
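To see both effects (overhead on tiny inputs, and the influence of how repetitive the data is), here is a rough sketch that compresses three made-up inputs. The inputs and names are illustrative assumptions; the exact sizes will vary between runs for the random sample.

```python
import bz2
import os

samples = {
    "tiny": b"hi",                       # too small to outweigh the overhead
    "repetitive": b"abcd" * 25_000,      # 100,000 highly similar bytes
    "random": os.urandom(100_000),       # 100,000 bytes with no pattern
}

sizes = {}
for label, blob in samples.items():
    packed = bz2.compress(blob)
    sizes[label] = (len(blob), len(packed))
    print(f"{label:<10} original={len(blob):>7} compressed={len(packed):>7}")
```

The tiny input comes out larger than it went in, the repetitive input shrinks dramatically, and the random input barely changes in size.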
Alternatively, you can use bz2.open() instead of bz2.BZ2File(). In binary mode both are pretty much the same thing, returning an object of type BZ2File. bz2.open() additionally supports text modes such as “rt” and “wt”.
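As a small illustration of bz2.open(), the sketch below writes and reads a compressed file in text mode, then reopens it in binary mode to confirm that a BZ2File object comes back. The file name and contents are assumptions for the example.

```python
import bz2
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "notes.txt.bz2")

    # Text mode: str in, str out; encoding is handled for us.
    with bz2.open(path, "wt", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    with bz2.open(path, "rt", encoding="utf-8") as f:
        text_back = f.read()

    # Binary mode: the returned object is a BZ2File.
    with bz2.open(path, "rb") as f:
        is_bz2file = isinstance(f, bz2.BZ2File)

print(text_back)
print(is_bz2file)
```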
Compression Setting
You can adjust the compression level, ranging from 1 to 9 (9 is the default). The higher the value, the better the compression. This won’t be noticeable on a dataset as small as the one we are using right now though.
ofile = bz2.BZ2File("BinaryData5", "wb", compresslevel=1)
A lower compression level is marginally faster. The difference in compressed size between levels 1 and 9 isn’t large (around 5% on large datasets), so if speed is your top priority you may want to use a low compresslevel setting.
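If you want to measure the trade-off on your own data, a rough sketch like the one below compares levels 1 and 9 using bz2.compress(), which accepts the same compresslevel argument. The sample here is made up and highly repetitive, so treat the numbers it prints as illustrative only, not as a benchmark.

```python
import bz2
import time

# Hypothetical sample; substitute your own data for meaningful numbers.
sample = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n" * 5_000

results = {}
for level in (1, 9):
    start = time.perf_counter()
    packed = bz2.compress(sample, compresslevel=level)
    elapsed = time.perf_counter() - start
    results[level] = (len(packed), elapsed)
    print(f"level={level} size={len(packed)} bytes time={elapsed:.4f}s")

# Whatever the level, decompression gives back the original data.
round_trip_ok = bz2.decompress(bz2.compress(sample, compresslevel=1)) == sample
```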
Compress Function
The method we discussed earlier relies on opening a file with the bz2.BZ2File class. Basically, we are working with file streams. But what if we just want to compress some regular data, without having anything to do with file streams?
This is where the compress() function comes in, which takes the data directly as a parameter and returns a compressed version.
compressed = bz2.compress(data)
print(len(compressed))
print(compressed)
The output: (Truncated for readability purposes)
451
b'BZh91AY&SY_\x19\xaa4\x00\x00\x06\xd7\x80\x00\x10@\x05\x04\'K\x00?\xf7\xff@@\x01\xc9\xb9@\x111O$\xd3 \x00=C@EO\xc1=L\x93D\xd0\r\x194\x03Si\xa56\xa0&@\xc4\x00\tM\x13)\x89O\x04\x9ae\x03\'\xa9\xa1\xb8\x1cU\xf4#0\x06i\x8f\xda\xd1\x14e\xf4Ch%1e\x82\x8c\x95\xe6I!5.............
Decompressing Files and Data
Now let’s take a look at how to decompress data. Remember, don’t try decompressing regular data. Only try this on compressed data, such as the type produced by bz2’s compression functions.
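To show what happens when regular data is passed in anyway, here is a minimal sketch. The looks_like_bz2 helper is a hypothetical convenience written for this example (it is not part of the bz2 module); it only checks for the “BZh” prefix that also appears at the start of the compressed output shown in this tutorial.

```python
import bz2

plain = b"this was never compressed"

# Decompressing data that is not a bz2 stream raises OSError.
try:
    bz2.decompress(plain)
    error_message = None
except OSError as exc:
    error_message = str(exc)
    print("Not valid bz2 data:", error_message)


def looks_like_bz2(blob: bytes) -> bool:
    """Hypothetical helper: cheap check for the bz2 magic prefix."""
    return blob.startswith(b"BZh")


print(looks_like_bz2(plain))
print(looks_like_bz2(bz2.compress(plain)))
```

The prefix check is only a quick filter; catching OSError remains the reliable way to handle bad input.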
Decompressing Data from Opened File
The first thing you can do is open up a file with compressed data, and then call the read() method on the resulting file object. This will return the decompressed data exactly as it was before.
ofile = bz2.BZ2File("BinaryData5", "wb")
ofile.write(data)
ofile.close()
ifile = bz2.BZ2File("BinaryData5", "rb")
print(ifile.read())
ifile.close()
The output:
b' Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nProin congue orci odio, non sagittis orci cursus vel.\nUt aliquam sapien non vulputate elementum.\nSed rutrum ante sit amet egestas ultrices.\nVivamus malesuada tincidunt justo, in ornare nibh porta ac.\nNullam interdum, quam at ornare hendrerit, ex augue rutrum dui, eu dictum diam velit eu urna.\nMorbi malesuada elit sed laoreet varius.\nMaecenas elementum vestibulum erat, sit amet molestie ex volutpat in.\nPhasellus viverra lectus ante, vitae ultricies leo maximus fermentum.\nPhasellus consectetur ex id ipsum iaculis, ut tempus risus porttitor.\nNam in ante blandit, feugiat magna a, ullamcorper nulla. Integer ac egestas justo.\nDuis volutpat mattis nibh, ut semper justo luctus nec. '
bz2.open() and bz2.BZ2File() are interchangeable here.
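For larger files you may not want to read() everything at once. The sketch below, using made-up lines and a temporary file, iterates over the compressed file line by line instead; each line is decompressed as it is read.

```python
import bz2
import os
import tempfile

# Hypothetical content for the example.
lines = [b"alpha\n", b"beta\n", b"gamma\n"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "lines.bz2")

    with bz2.open(path, "wb") as ofile:
        ofile.writelines(lines)

    # Iterating over the file object yields one decompressed line at a time.
    with bz2.open(path, "rb") as ifile:
        read_back = [line for line in ifile]

print(read_back)
```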
Decompress Function
An alternative way of decompressing compressed data is to use the decompress() function. You will especially need this if the file was opened for reading with the regular open() function, rather than the bz2.open() function.
compressed = bz2.compress(data)
print(len(compressed))
print(compressed)
print("")
uncompressed = bz2.decompress(compressed)
print(len(uncompressed))
print(uncompressed)
The output will look something like this. (Run the code yourself to see the full output)
451
b'BZh91AY&SY_\x19\xaa4\x00\x00\x06\xd7\x80\x00\x10@\x05\x04\'K\x00?\xf7\xff@@\x01\xc9\xb9@\x111O$\xd3 \x00=C@EO\xc1=L\x93D\xd0\r\x194\x03Si\xa56\xa0&@\xc4\x00\tM\x13)\x89O\x04\x9ae\x03\'\xa9\xa1\xb8\x1cU\xf4#0\x06i\x8f\xda\xd1\x14e\xf4Ch%1e\x82\x8c\x95\xe6I!5......
741
b' Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nProin congue orci odio, non sagittis orci cursus vel.\nUt aliquam sapien non vulputate elementum.......
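The case mentioned above, where the file is read with the regular open() function, could look something like this. It is a sketch with an assumed file name and payload, written to a temporary directory.

```python
import bz2
import os
import tempfile

# Hypothetical payload for the example.
original = b"Lorem ipsum dolor sit amet. " * 50

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "BinaryData.bz2")

    # Write a compressed file as before.
    with bz2.open(path, "wb") as ofile:
        ofile.write(original)

    # The regular open() returns the raw, still-compressed bytes.
    with open(path, "rb") as ifile:
        raw = ifile.read()

# decompress() turns the raw bytes back into the original data.
restored = bz2.decompress(raw)

print(len(raw), len(restored))
print(restored == original)
```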
Using bz2 with Pickle
A very popular use of the bz2 library is with the Python pickle library, which converts Python objects to a byte stream (serializing) so they can be saved to a file. The bz2 library is used to compress these byte streams and significantly reduce the size of the pickle files.
# Compressing Pickle Data
import pickle

ofile = bz2.BZ2File("BinaryData", "wb")
pickle.dump(data, ofile)
ofile.close()
print(os.path.getsize("BinaryData"))
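Reading the object back works the same way in reverse. The following sketch dumps a made-up dictionary into a bz2 file and loads it again with pickle.load(); the object, the file name, and the temporary directory are all assumptions for illustration.

```python
import bz2
import os
import pickle
import tempfile

# Hypothetical object to serialize.
record = {"name": "example", "values": list(range(1000))}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "record.pkl.bz2")

    # pickle writes into the BZ2File, which compresses on the fly.
    with bz2.BZ2File(path, "wb") as ofile:
        pickle.dump(record, ofile)

    # pickle reads from the BZ2File, which decompresses on the fly.
    with bz2.BZ2File(path, "rb") as ifile:
        loaded = pickle.load(ifile)

print(loaded == record)
```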
For the full tutorial, please check out our Python Pickle Tutorial, or even our gzip tutorial dedicated to the compression of Pickle data.
This marks the end of the Python bz2 file compression Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.