Using Python's gzip and StringIO to compress data in memory
I needed to gzip some data in memory that would eventually be saved to disk as a .gz file. I thought, "That's easy, just use Python's built-in gzip module."
However, I needed to pass the data to pycurl as a file-like object. I didn't want to write the data to disk and then read it back just to pass it to pycurl. I thought, "That's easy too: just use Python's cStringIO module."
The solution did end up being simple, but finding it was a lot harder than I expected. Below is my roundabout process of getting there.
Here is my setup/test code. I am running Python 2.7.3 on Ubuntu 12.04.
import cStringIO
import gzip
STUFF_TO_GZIP = """Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?"""
FILENAME = 'myfile.json.gz'
def pycurl_simulator(fileobj):
    # Get the file size
    fileobj.seek(0, 2)
    filesize = fileobj.tell()
    fileobj.seek(0, 0)

    # Read the file data
    fout = open(FILENAME, 'wb')
    fout.write(fileobj.read())
    fout.close()
    return filesize
Try 1: seek from the end fails
Here is my first attempt using cStringIO with the gzip module.
def try1_seek_from_end_fails():
    ftemp = cStringIO.StringIO()
    fgzipped = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=ftemp)
    fgzipped.write(STUFF_TO_GZIP)
    filesize = pycurl_simulator(fgzipped)
    print filesize
I got this exception:
Traceback (most recent call last):
  File "tmp.py", line 232, in <module>
    try1_seek_from_end_fails()
  File "tmp.py", line 83, in try1_seek_from_end_fails
    filesize = pycurl_simulator(fgzipped)
  File "tmp.py", line 25, in pycurl_simulator
    fileobj.seek(0, 2)
  File "/usr/lib/python2.7/gzip.py", line 415, in seek
    raise ValueError('Seek from end not supported')
ValueError: Seek from end not supported
It turns out the gzip object doesn't support seeking from the end. See this thread on the Python mailing list: http://mail.python.org/pipermail/python-list/2009-January/519398.html
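The same limitation appears to hold on Python 3. Here is a minimal sketch, assuming Python 3 and using io.BytesIO in place of cStringIO (the payload variable is a made-up stand-in for STUFF_TO_GZIP). It shows a write-mode GzipFile rejecting a seek relative to the end:

```python
import gzip
import io

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.StringIO.
payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP

buf = io.BytesIO()
gz = gzip.GzipFile(filename="myfile.json.gz", mode="wb", fileobj=buf)
gz.write(payload)

try:
    gz.seek(0, io.SEEK_END)  # same call as fileobj.seek(0, 2)
    seek_error = None
except ValueError as exc:
    # A GzipFile opened for writing cannot seek relative to the end.
    seek_error = exc

print("seek from end raised:", repr(seek_error))
gz.close()
```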
Try 2: data is not compressed
What if we don't seek() from the end and just tell() where we are? (It should be at the end after doing a write(), right?) Unfortunately, this gave me the uncompressed size.
Reading from the GzipFile object also gave me an error saying that I couldn't read from a writable object.
def try2_data_is_not_compressed():
    ftemp = cStringIO.StringIO()
    fgzipped = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=ftemp)
    fgzipped.write(STUFF_TO_GZIP)
    filesize = fgzipped.tell()
    print filesize
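Both observations can be reproduced in a small sketch, again assuming Python 3 and io.BytesIO rather than the Python 2 setup above (payload is a hypothetical stand-in). tell() on the GzipFile reports the position in the uncompressed stream, and read() is refused on an object opened for writing:

```python
import gzip
import io

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.StringIO.
payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP

buf = io.BytesIO()
gz = gzip.GzipFile(filename="myfile.json.gz", mode="wb", fileobj=buf)
gz.write(payload)

# tell() counts uncompressed bytes written so far, not compressed bytes.
uncompressed_position = gz.tell()

try:
    gz.read()
    read_error = None
except OSError as exc:
    # Reading from a write-only GzipFile is not allowed.
    read_error = exc

print(uncompressed_position, len(payload), repr(read_error))
gz.close()
```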
Try 5: file much too small
I googled, then looked at the source code for gzip.py. I found that the compressed data was in the StringIO object. So I performed my file operations on it instead of the GzipFile object. Now I was able to write the data out to a file. However, the size of the file was much too small.
def try5_file_much_too_small():
    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    filesize = pycurl_simulator(fgz)
    print filesize
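To see why the file came out too small, it helps to compare the size of the in-memory buffer before and after the GzipFile is closed. This is a sketch under the assumption of Python 3 with io.BytesIO (payload is a hypothetical stand-in); exact byte counts will vary, so it only compares the two sizes:

```python
import gzip
import io

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.StringIO.
payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP

buf = io.BytesIO()
gz = gzip.GzipFile(filename="myfile.json.gz", mode="wb", fileobj=buf)
gz.write(payload)

# Before close(): the buffer holds the gzip header plus whatever the
# compressor has emitted so far, which may be nothing at all.
size_before_close = len(buf.getvalue())

gz.close()

# After close(): the buffer holds the complete gzip stream.
size_after_close = len(buf.getvalue())

print(size_before_close, size_after_close)
```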
Try 6: unexpected end of file
I saw there was a flush() method in the source code, so I added a call to it. This time I got a reasonable file size. However, when I tried to gunzip the file from the command line, I got the following error:
gzip: myfile.json.gz: unexpected end of file
def try6_unexpected_end_of_file():
    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.flush()
    filesize = pycurl_simulator(fgz)
    print filesize
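The same truncation can be observed without leaving Python. In this sketch (assuming Python 3, io.BytesIO, and a hypothetical payload), decompressing the flushed-but-unclosed bytes fails, while the bytes captured after close() round-trip cleanly:

```python
import gzip
import io

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.StringIO.
payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP

buf = io.BytesIO()
gz = gzip.GzipFile(filename="myfile.json.gz", mode="wb", fileobj=buf)
gz.write(payload)
gz.flush()

# Snapshot of the stream after flush() but before close().
flushed_only = buf.getvalue()
try:
    gzip.decompress(flushed_only)
    truncated_error = None
except EOFError as exc:
    # The stream has no end-of-stream marker or trailer yet.
    truncated_error = exc

gz.close()
round_trip = gzip.decompress(buf.getvalue())

print(repr(truncated_error), round_trip == payload)
```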
Try 7: got it working
I knew that GzipFile worked properly when writing files directly, as opposed to reading from the StringIO object. It turns out the difference was code in the close() method of GzipFile that writes some extra required data: the last of the compressed output, followed by the CRC and the uncompressed size. Once I wrote that data myself, things worked.
def try7_got_it_working():
    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.flush()

    # Do stuff that GzipFile.close() does
    gzip_obj.fileobj.write(gzip_obj.compress.flush())
    gzip.write32u(gzip_obj.fileobj, gzip_obj.crc)
    gzip.write32u(gzip_obj.fileobj, gzip_obj.size & 0xffffffffL)

    filesize = pycurl_simulator(fgz)
    print filesize
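The two write32u() calls correspond to the 8-byte trailer of the gzip format: a CRC-32 of the uncompressed data followed by its length modulo 2**32, both little-endian. A quick way to check this, sketched here for Python 3 with io.BytesIO and a hypothetical payload, is to unpack the last 8 bytes of a properly closed stream:

```python
import gzip
import io
import struct
import zlib

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.StringIO.
payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP

buf = io.BytesIO()
gz = gzip.GzipFile(filename="myfile.json.gz", mode="wb", fileobj=buf)
gz.write(payload)
gz.close()

# Trailer: two little-endian unsigned 32-bit integers at the very end.
crc, size = struct.unpack("<II", buf.getvalue()[-8:])

expected_crc = zlib.crc32(payload) & 0xFFFFFFFF
expected_size = len(payload) & 0xFFFFFFFF

print(crc == expected_crc, size == expected_size)
```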
Try 8: (not really) final version
Here's the (not really) final version, using a subclass of GzipFile that adds a method to write the extra data at the end. It also overrides close() so that the data isn't written twice in case you need to call close(). The separate flush() call is no longer needed.
def try8_not_really_final_version():
    class MemoryGzipFile(gzip.GzipFile):
        """
        A GzipFile subclass designed to be used with in memory file like
        objects, i.e. StringIO objects.
        """
        def write_crc_and_filesize(self):
            """
            Flush and write the CRC and filesize. Normally this is done
            in the close() method. However, for in memory file objects,
            doing this in close() is too late.
            """
            self.fileobj.write(self.compress.flush())
            gzip.write32u(self.fileobj, self.crc)
            # self.size may exceed 2GB, or even 4GB
            gzip.write32u(self.fileobj, self.size & 0xffffffffL)

        def close(self):
            if self.fileobj is None:
                return
            self.fileobj = None
            if self.myfileobj:
                self.myfileobj.close()
                self.myfileobj = None

    fgz = cStringIO.StringIO()
    gzip_obj = MemoryGzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.write_crc_and_filesize()
    filesize = pycurl_simulator(fgz)
    print filesize
Try 9: didn't need to do that (final version)
It turns out I can close the GzipFile object and the StringIO object remains available. So that MemoryGzipFile class above is completely unnecessary. I am dumb. Here is the final iteration:
def try9_didnt_need_to_do_that():
    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.close()
    filesize = pycurl_simulator(fgz)
    print filesize
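For anyone on Python 3, a rough equivalent might look like the sketch below. It assumes io.BytesIO in place of cStringIO; gzip_to_memory and payload are hypothetical names used only for illustration. The with block calls close() on exit, and the BytesIO object stays open afterwards, which is what makes this approach work:

```python
import gzip
import io


def gzip_to_memory(data, filename="myfile.json.gz"):
    """Hypothetical helper: gzip bytes into a rewound in-memory buffer."""
    buf = io.BytesIO()
    # Leaving the with block closes the GzipFile, which writes the trailer.
    # The underlying BytesIO object is left open.
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf) as gz:
        gz.write(data)
    buf.seek(0)  # rewind so a consumer can read from the start
    return buf


payload = b"example payload " * 50  # hypothetical stand-in for STUFF_TO_GZIP
compressed = gzip_to_memory(payload)
restored = gzip.decompress(compressed.getvalue())

print(len(payload), len(compressed.getvalue()), restored == payload)
```

The returned buffer is a regular file-like object, so it can be handed to code that seeks to the end to measure its size and then reads from the start, as pycurl_simulator() does.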
References
Here is some googling I did:
- http://docs.python.org/2/library/gzip.html
- http://hg.python.org/cpython/file/2.7/Lib/gzip.py
- http://docs.python.org/2/library/stringio.html
- http://stackoverflow.com/questions/5829964/reading-gzip-file-that-is-currently-being-written-to
- http://mail.python.org/pipermail/python-list/2009-January/519398.html
- http://code.activestate.com/recipes/148292-controlling-gzipped-io/
Comments
Good article. I appreciate it, and it is nice to see your exploration / thought process.
fgz = cStringIO.StringIO()
with gzip.GzipFile(filename=FILENAME, mode='wb', fileobj=fgz) as gzip_obj:
    gzip_obj.write(STUFF_TO_GZIP)
filesize = pycurl_simulator(fgz)
print filesize
works as expected
pablo: Using a context manager makes it even simpler-- thanks!
Thanks for the article! It helped today
I found a very weird issue in the code below.
with gzip.open('test.gz', 'wb') as f:
    f.write('Hello world!\n')
After the code is run, the file 'test.gz' will be created with the content "Hello world!" but it is uncompressed. However, when I changed the filename to 'test' (omitting the extension), the file 'test' is created with the compressed content. Do you have any idea about this?
It seems that the "less" command on my Linux box decompresses the gzip file on the fly for easy viewing. So the file is in fact already compressed, and the Python gzip library is working.