Serialization is the process of converting an in-memory object into a format that can be stored in a file or sent over a network; deserialization is the inverse process. In the context of machine learning, serialization is often used to store the parameters of a neural network, as well as to store and retrieve training data.
There are several formats for data serialization. Some are based on open source and open standards, such as HDF5 and JSON, while others are proprietary. In fact, most modern programming languages and platforms, such as Java, the .NET framework and Python, offer their own data serialization modules. Let's focus on the Python language here.
Any programmer who has done data analysis in Python will know what the term "pickling" means. In short, "pickling" is the Pythonic term for data serialization. Python offers an out-of-the-box solution for data serialization: the pickle module. After pickling, say, an array, list, dictionary or any other Python object, a pickle file is created that can be stored on disk or transmitted over a network. There are many tutorial resources on the web covering the pickle module and its faster C-based cousin cPickle (which can be up to 1000 times faster!).
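As a quick sketch of the round trip, the snippet below pickles a small object to a byte string and restores it. It uses the pickle module; in Python 3 the C implementation of cPickle was folded into pickle itself, so the same calls apply.

```python
import pickle

# a typical Python object: a dictionary holding mixed data
record = {"name": "sample", "values": [1, 2, 3], "weights": (0.1, 0.9)}

# serialize ("pickle") the object to a byte string;
# pickle.dump() would write it to a file instead
blob = pickle.dumps(record)

# deserialize ("unpickle") it back into an equivalent object
restored = pickle.loads(blob)
assert restored == record
```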
My objective here is to show how to speed up data serialization in Python, which could prove to be essential when dealing with large Python objects.
First of all, for most practical purposes, use cPickle instead of pickle. Secondly, even with cPickle there is another way to speed up the process. The pickle module supports several protocols for data serialization. By default, a human-readable format (i.e. an ASCII text file, albeit in a Python-specific layout) is used, and this is protocol=0. Two additional binary formats are available: protocol=1 and protocol=2. When creating a pickle file, pass the parameter protocol=2, which provides a considerable speed-up, especially when de-serializing data from pickle files.
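As a rough illustration of why the binary protocol helps, the sketch below compares the size of the two encodings for the same object. It uses plain pickle so it runs on both Python 2 and 3; with cPickle the calls are identical. Note that protocol=0 is the default only in Python 2; Python 3 defaults to a binary protocol.

```python
import pickle

data = list(range(1000))

# protocol 0: the human-readable ASCII format (Python 2's default)
ascii_blob = pickle.dumps(data, protocol=0)

# protocol 2: a compact binary format, faster to write and to load
binary_blob = pickle.dumps(data, protocol=2)

# the binary encoding is substantially smaller for the same object
assert len(binary_blob) < len(ascii_blob)
```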
While pickle and cPickle are simple to use and compatible with a large number of Python objects, their data format is Python-specific. This means pickled data can only be unpickled by Python scripts. To overcome this limitation and make your serialized data compatible with non-Python tools, we can use a data serialization format based on open standards.
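For instance, the JSON format mentioned earlier is one such open standard. Python's built-in json module produces plain text that almost any language can parse, although, unlike pickle, it only handles basic types such as dicts, lists, strings and numbers. A minimal sketch:

```python
import json

record = {"name": "sample", "values": [1, 2, 3]}

# JSON text can be parsed by virtually any language, not just Python
text = json.dumps(record)

restored = json.loads(text)
assert restored == record
```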
One popular open format is HDF5. Python supports the HDF5 format through the h5py module, but if you are looking to adopt HDF5 with minimal changes to your existing code, there is fortunately a module called hickle, which is essentially a wrapper around h5py that provides a syntax similar to the oft-used pickle module. To install hickle, just run pip install hickle and you should be good to go.
I performed a simple test to compare the performance of the data serialization techniques discussed in this article. I took 1000 JPEG images (320 x 240 pixels) from a public image dataset and iteratively loaded each file from disk, stored the image in a NumPy array and serialized the array to either a cPickle file or a hickle file. Then I de-serialized the Python objects from the files on disk iteratively and stored them back into a NumPy array. The code below shows the idea:
import os
import time

import cPickle
import hickle
import numpy
from PIL import Image

path = '/path/to/file'

# we use the Python Imaging Library to read the jpeg file
# and store the image as a numpy array
def PIL2array(img):
    return numpy.array(img.getdata(), numpy.uint8).reshape(img.size[1], img.size[0], 3)

# this function writes the contents of the numpy array back into a jpeg file
def array2PIL(arr, size):
    mode = 'RGBA'
    arr = arr.reshape(arr.shape[0] * arr.shape[1], arr.shape[2])
    if len(arr[0]) == 3:
        arr = numpy.c_[arr, 255 * numpy.ones((len(arr), 1), numpy.uint8)]
    return Image.frombuffer(mode, size, arr.tostring(), 'raw', mode, 0, 1)

def main():
    # recursively traverse the folder and collect the image file names
    fileList = [os.path.join(dirpath, f)
                for dirpath, dirnames, files in os.walk(path)
                for f in files if f.endswith('.jpg')]
    print "Preparing your pickle files. Pls wait..."
    t0 = time.time()
    for file_ in fileList:
        print file_
        img = Image.open(file_)
        arr = PIL2array(img)
        # the next two lines use cPickle or hickle to store the array;
        # for simplicity, comment out one or the other
        # cPickle.dump(arr, open(file_ + "-prot0" + ".pkl", "wb"), protocol=0)
        hickle.dump(arr, file_ + "-prot" + ".hkl", mode='w')
    t1 = time.time()
    total = t1 - t0
    print "P(h)ickling execution time: %.2f sec" % total

    # recursively traverse the folder and collect the serialized file names
    pklList = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.endswith('.hkl')]
    # here we load the serialized files back into memory
    t3 = time.time()
    for file_ in pklList:
        arr2 = hickle.load(file_)
        print arr2.shape
        img2 = Image.fromarray(numpy.squeeze(arr2))
        img2.save(file_ + '.jpg')
        print file_ + '.jpg'
    t4 = time.time()
    total2 = t4 - t3
    print "Unp(h)ickling execution time: %.2f sec" % total2

if __name__ == "__main__":
    main()
The results of this test are as follows:
Pickling and unpickling time for 4 runs.
Dataset used: 1000 JPEG images (320 x 240, color)
| dump (sec) | load (sec) | stored file size |
We can clearly see that cPickle (with protocol=2) is the fastest of the three serialization techniques compared, while the hickle module provides similar performance to cPickle (protocol=2) and has the added advantage of being based on a cross-platform open standard.