Saturday, December 5, 2020

Sequentially Writing Data to Zip Files with Python

The write() and writestr() methods of the ZipFile class in Python's zipfile library each add one complete member file to a zip file: write() copies an existing file from disk, and writestr() stores a single string or bytes object. Neither method allows separate data items to be written sequentially to a member file, which is a problem when a large amount of data is generated dynamically and should not be held in memory all at once.
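To make the limitation concrete, here is a minimal sketch of the writestr() approach. The in-memory buffer and the member name whole.txt are made up for illustration; the point is that every line has to be collected before the single call that creates the member.

```python
import io
import zipfile

# In-memory archive, used here only to avoid touching the disk.
buf = io.BytesIO()

with zipfile.ZipFile(buf, mode="w") as zf:
    # writestr() needs the complete contents of the member up front,
    # so all of the generated lines are joined into one string first.
    lines = ["line %d\n" % i for i in range(1, 4)]
    zf.writestr("whole.txt", "".join(lines))

# Read the archive back to confirm what was stored.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    content = zf.read("whole.txt").decode("utf-8")
```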

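The StreamableZipfile class in the following code snippet provides this missing capability. It is a thin wrapper around ZipFile.open() in write mode, which is available in Python 3.6 and later.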

import time
import zipfile

class StreamableZipfile(object):
	def __init__(self, zipfile_name, mode='a'):
		# ZIP_BZIP2 compression requires Python 3.3 or later; the
		# compresslevel argument requires Python 3.7 or later.
		self.zf = zipfile.ZipFile(zipfile_name, mode,
				compression=zipfile.ZIP_BZIP2, compresslevel=9)
	def close(self):
		self.zf.close()
	def member_file(self, member_filename):
		# Creates a ZipInfo object (file) within the zipfile
		# and opens it for writing.
		self.current_zinfo = zipfile.ZipInfo(filename=member_filename,
						date_time=time.localtime(time.time())[:6])
		self.current_zinfo.compress_type = self.zf.compression
		self.current_zinfo._compresslevel = self.zf.compresslevel
		self.current_zinfo.file_size = 0
		self.current_handle = self.zf.open(self.current_zinfo, mode='w')
	def close_member(self):
		self.current_handle.close()
	def write(self, str_data):
		# Writes the given text to the currently open member as UTF-8.
		# The write handle keeps track of the member's size itself and
		# records it when the member is closed.
		data = str_data.encode("utf-8")
		self.current_handle.write(data)

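This simple implementation does not include any error checking and always compresses the data with the bzip2 algorithm, using the highest compression level. In particular, close_member() must be called before close(): ZipFile.close() raises a ValueError while a member is still open for writing. The final size of a member is also unknown when it is opened, so a member that may grow beyond 2 GiB additionally needs force_zip64=True passed to ZipFile.open(). A more robust and flexible implementation would eliminate these limitations. Modifications are also needed for versions of Python prior to 3.7, which lack the compresslevel argument.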
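One possible direction for such an implementation is sketched below. The class name CheckedStreamableZipfile, the choice of ZIP_DEFLATED as the default compression, and the RuntimeError messages are all assumptions made for this illustration; the member setup mirrors the original class, including its use of the private _compresslevel attribute.

```python
import os
import tempfile
import time
import zipfile


class CheckedStreamableZipfile:
    """Hypothetical variant of StreamableZipfile with basic error checks."""

    def __init__(self, zipfile_name, mode="a",
                 compression=zipfile.ZIP_DEFLATED, compresslevel=None):
        # Compression settings are parameters instead of being fixed.
        self.zf = zipfile.ZipFile(zipfile_name, mode,
                                  compression=compression,
                                  compresslevel=compresslevel)
        self.current_handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # never suppress exceptions from the with block

    def member_file(self, member_filename):
        if self.current_handle is not None:
            raise RuntimeError("close the current member before opening another")
        zinfo = zipfile.ZipInfo(filename=member_filename,
                                date_time=time.localtime()[:6])
        zinfo.compress_type = self.zf.compression
        # Private attribute, set the same way as in the original class.
        zinfo._compresslevel = self.zf.compresslevel
        self.current_handle = self.zf.open(zinfo, mode="w")

    def close_member(self):
        if self.current_handle is not None:
            self.current_handle.close()
            self.current_handle = None

    def write(self, str_data):
        if self.current_handle is None:
            raise RuntimeError("no member file is open for writing")
        self.current_handle.write(str_data.encode("utf-8"))

    def close(self):
        # Close any open member first so that ZipFile.close() cannot fail
        # because of a dangling write handle.
        self.close_member()
        self.zf.close()


# Usage sketch: the archive is written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "Checked.zip")
with CheckedStreamableZipfile(demo_path) as zfile:
    zfile.member_file("file1.txt")
    zfile.write("This is file 1, line 1\n")
    zfile.write("This is file 1, line 2\n")
    # Leaving the with block closes the member and then the archive.

with zipfile.ZipFile(demo_path) as zf:
    text = zf.read("file1.txt").decode("utf-8")
```

Making the class a context manager means the archive is finalized even if an exception interrupts the writes, at the cost of a slightly larger class.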

Use of the StreamableZipfile class is illustrated by the following code, which creates a zip file containing two separate files, where lines are written sequentially to each of the files.

import os

zfname = "Test.zip"
if os.path.isfile(zfname):
	os.remove(zfname)

# Open the streamable zip file and write several lines to a file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file1.txt')
zfile.write("This is file 1, line 1\n")
zfile.write("This is file 1, line 2\n")
zfile.write("This is file 1, line 3\n")
zfile.close_member()
zfile.close()


# Open the same zip file and write lines to another file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file2.txt')
zfile.write("This is file 2, line 1\n")
zfile.write("This is file 2, line 2\n")
zfile.write("This is file 2, line 3\n")
zfile.close_member()
zfile.close()
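
To check the result, the archive can be read back with the standard ZipFile API. The sketch below is self-contained, so it builds a small stand-in archive (Demo.zip in a temporary directory, compressed with ZIP_DEFLATED) in two append sessions rather than relying on Test.zip; the helper name read_members is hypothetical.

```python
import os
import tempfile
import zipfile


def read_members(zip_path):
    """Return a dict mapping each member name to its decoded text."""
    with zipfile.ZipFile(zip_path) as zf:
        return {info.filename: zf.read(info).decode("utf-8")
                for info in zf.infolist()}


# Stand-in archive: two separate append sessions, one member each,
# with lines written one at a time through ZipFile.open(mode="w").
demo_path = os.path.join(tempfile.mkdtemp(), "Demo.zip")
for n in (1, 2):
    with zipfile.ZipFile(demo_path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        with zf.open("file%d.txt" % n, mode="w") as member:
            for line in (1, 2, 3):
                text = "This is file %d, line %d\n" % (n, line)
                member.write(text.encode("utf-8"))

contents = read_members(demo_path)
```

Calling read_members(zfname) on the Test.zip file created above should likewise return the three lines written to each of file1.txt and file2.txt.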