Speeding up data serialization in Python

Serialization is the process of converting an in-memory object into a format that can be stored in a file or sent over a network. Deserialization is the inverse process. In machine learning, data serialization is often used to store the parameters of a neural network, as well as to store and retrieve training data.

There are several formats for data serialization. Some are based on open source and open standards, such as HDF5 and JSON, and there are also a number of proprietary formats. In fact, most modern programming platforms, such as Java, the .NET framework and Python, offer their own data serialization modules. Let’s focus on the Python language here.

Any programmer who has been involved in data analysis using Python will know what the term “pickling” means. In short, “pickling” is the Pythonic term for data serialization. Python offers an out-of-the-box solution for data serialization, known as the pickle module. After pickling, say, an array, list, dictionary or any other picklable Python object, the result can be stored on disk as a pickle file or transmitted over the network. There are many tutorials on the web for learning to use the pickle module and its faster C-based cousin cPickle (which can be up to 1000 times faster!).
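
As a minimal sketch of the round trip described above: the snippet below pickles a small dictionary to a file and reads it back. It is written for Python 3, where the C implementation that Python 2 exposed as cPickle is used automatically by the pickle module (under Python 2 you would write import cPickle as pickle instead). The record contents and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# A small, made-up object to serialize (illustrative only).
record = {"name": "layer1", "weights": [0.1, 0.2, 0.3], "shape": (3, 1)}

# Pickle the object into a file on disk.
pkl_path = os.path.join(tempfile.mkdtemp(), "record.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(record, f)

# Unpickle it again; the restored object compares equal to the original.
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)
```

Note that the file is opened in binary mode ("wb" and "rb"), which is required for the binary protocols and is a safe habit for all of them.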

My objective here is to show how to speed up data serialization in Python, which could prove to be essential when dealing with large Python objects.

First of all, for most practical purposes, use cPickle instead of pickle.

Secondly, even with cPickle, there is in fact another way to speed up the process. The pickle module supports several protocols (formats) for data serialization. By default (in Python 2), protocol=0 is used, which is a human-readable ASCII text format, albeit a Python-specific one. Two additional binary formats are available, protocol=1 and protocol=2. When creating a pickle file, pass protocol=2 (or cPickle.HIGHEST_PROTOCOL), which provides a considerable speed-up, especially when de-serializing data from pickle files.
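
To get a feel for the difference between protocols, here is a small sketch that pickles the same object with protocol 0, protocol 2 and the highest protocol available, and prints the size of each result. It assumes Python 3, which already defaults to a binary protocol, so protocol=0 has to be requested explicitly there. The list of integers is only a synthetic stand-in for real data, and the exact sizes will depend on your data and Python version.

```python
import pickle

# Synthetic stand-in for real data: a list of 100,000 small integers.
data = list(range(100000))

sizes = {}
for protocol in (0, 2, pickle.HIGHEST_PROTOCOL):
    blob = pickle.dumps(data, protocol=protocol)
    # Every protocol must restore the same object.
    assert pickle.loads(blob) == data
    sizes[protocol] = len(blob)
    print("protocol=%d: %d bytes" % (protocol, len(blob)))
```

For this kind of data the text-based protocol 0 output is noticeably larger than the binary protocol 2 output, because every integer is written out as ASCII digits.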

While pickle and cPickle are simple to use and compatible with a large number of Python objects, their data format is Python-specific. This means that pickled data can only be unpickled by Python code. To overcome this limitation and make your serialized data usable from non-Python tools, we can use an open-standards-based data serialisation format.
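
To illustrate what a language-neutral format looks like, here is a small sketch using JSON, one of the open formats mentioned earlier, from the Python standard library. The record is hypothetical, with nested lists standing in for a tiny array.

```python
import json

# Hypothetical record; nested lists stand in for a small array.
record = {"label": "cat", "pixels": [[0, 128], [255, 64]]}

text = json.dumps(record)      # plain text that non-Python tools can parse
restored = json.loads(text)

print(text)
print(restored == record)
```

The trade-off is that JSON only covers basic types (dictionaries, lists, strings, numbers, booleans and null), so it is a poor fit for large numerical arrays, which is where a binary open format becomes attractive.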

One such popular open format is HDF5. Python supports HDF5 through the h5py module, but if you are looking to replace pickle in your existing code with HDF5, fortunately there is a module called hickle, which is essentially a wrapper around h5py that provides a syntax similar to the oft-used pickle module.

To install hickle, just run $ pip install hickle and you should be good to go.

I performed a simple test to compare the performance of the data serialisation techniques discussed in this article. I took 1000 jpeg images (320 x 240 pixels) from a public image dataset, loaded each file from disk in turn, stored the image in a NumPy array and serialised the array to either a pickle or a hickle file. Then I de-serialized the Python objects from those files one by one and stored each back in a NumPy array. The code below shows the idea:

import os

from PIL import Image
import cPickle
import numpy
import time
import hickle

path = '/path/to/file'

# we use Python Imaging Library to read the jpeg file
# we store the read image as a numpy array

def PIL2array(img):
	# the trailing -1 lets numpy infer the channel count (1 for greyscale, 3 for RGB)
	return numpy.array(img.getdata(),
		numpy.uint8).reshape(img.size[1], img.size[0], -1)

# this helper converts a numpy array back into a PIL image
# (it is not called in the test below)

def array2PIL(arr, size):
	mode = 'RGBA'
	arr = arr.reshape(arr.shape[0]*arr.shape[1], arr.shape[2])
	# add an opaque alpha channel when the array only holds RGB values
	if len(arr[0]) == 3:
		arr = numpy.c_[arr, 255*numpy.ones((len(arr),1), numpy.uint8)]
	return Image.frombuffer(mode, size, arr.tobytes(), 'raw', mode, 0, 1)

def main():

	# recursively traverse the folder and collect the image file names
	# (adjust the extension to match your dataset)
	fileList = [os.path.join(dirpath, f)
		for dirpath, dirnames, files in os.walk(path)
		for f in files if f.endswith('.tiff')]

	print "Preparing your pickle files. Pls wait..."

	t0 = time.time()
	for file_ in fileList:
		print file_
		img = Image.open(file_)
		arr = PIL2array(img)

		# the next two lines use cPickle or hickle to store the array
		# for simplicity, I just commented out either one to choose

		# cPickle.dump(arr,open(file_+"-prot0"+".pkl","wb"),protocol=0)
		hickle.dump(arr,file_+"-prot"+".hkl",mode='w')

	t1=time.time()
	total = t1-t0
	print "P(h)ickling execution time: %.2f sec" % total

	# recursively traverse the folder again and collect the serialized files
	# (switch to '.pkl' here and cPickle.load below when testing cPickle)
	pklList = [os.path.join(dirpath, f)
		for dirpath, dirnames, files in os.walk(path)
		for f in files if f.endswith('.hkl')]

	# here we load each serialized file back into memory and save it as a jpeg
	t3 = time.time()
	for file_ in pklList:
		arr2 = hickle.load(file_)
		print arr2.shape
		img2 = Image.fromarray(numpy.squeeze(arr2))
		img2.save(file_ +'.jpg')
		print file_ + '.jpg'


	t4=time.time()

	total2 = t4-t3

	print "Unp(h)ickling execution time: %.2f sec" % total2

if __name__ == "__main__":
	main()
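

The timing pattern used above can also be factored into a small helper, so that the serializer is passed in rather than commented in and out. The following is a sketch only, written for Python 3 with the standard library and a synthetic payload instead of the image dataset. The names benchmark, make_pickle_dump and pickle_load are hypothetical helpers introduced for this illustration, and the numbers it prints will not match the results reported in this article.

```python
import os
import pickle
import tempfile
import time


def benchmark(dump_fn, load_fn, obj, path):
    """Time one dump and one load of obj; also return the file size in kB."""
    t0 = time.perf_counter()
    dump_fn(obj, path)
    dump_sec = time.perf_counter() - t0

    t0 = time.perf_counter()
    restored = load_fn(path)
    load_sec = time.perf_counter() - t0

    return dump_sec, load_sec, os.path.getsize(path) / 1024.0, restored


def make_pickle_dump(protocol):
    # Returns a dump function bound to one pickle protocol.
    def dump(obj, path):
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=protocol)
    return dump


def pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Synthetic payload: 240 rows of 320 values, like one single-channel image.
payload = [[(x * y) % 256 for x in range(320)] for y in range(240)]
workdir = tempfile.mkdtemp()

results = {}
for protocol in (0, 2):
    path = os.path.join(workdir, "payload-prot%d.pkl" % protocol)
    dump_sec, load_sec, size_kb, restored = benchmark(
        make_pickle_dump(protocol), pickle_load, payload, path)
    assert restored == payload
    results[protocol] = (dump_sec, load_sec, size_kb)
    print("protocol=%d dump=%.4fs load=%.4fs size=%.1f kB"
          % (protocol, dump_sec, load_sec, size_kb))
```

Assuming hickle is installed, it could be plugged into the same helper by passing a function that calls hickle.dump(obj, path, mode='w') as dump_fn and hickle.load as load_fn, mirroring the calls in the listing above.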


The results of this test are as follows:

Pickling and unpickling time for 4 runs.

Dataset used: 1000 images, jpeg format (320 x 240 color image)

Method                  dump (sec)   load (sec)   stored file size (kB)
cPickle (protocol=0)    6.55         2.82         600
cPickle (protocol=2)    5.48         0.38         231
hickle                  5.63         0.45         232

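We can clearly see that cPickle with protocol=2 is the fastest of the three techniques compared. Relative to protocol=0, loading is roughly 7 times faster (0.38 sec vs 2.82 sec) and the stored files are less than half the size (231 kB vs 600 kB). The hickle module performs similarly to cPickle with protocol=2 (5.63 sec vs 5.48 sec to dump, 0.45 sec vs 0.38 sec to load) and has the added advantage of using a cross-platform, open-standard format.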