compression - Streaming decompression of zip archives in Python

March 15, 2011

Is there a way to do streaming decompression of single-file zip archives?

I have arbitrarily large zipped archives (a single file per archive) in S3. I'd like to be able to process the files by iterating over them, without having to download them to disk or into memory.

A simple example:

    import boto

    def count_newlines(bucket_name, key_name):
        conn = boto.connect_s3()
        b = conn.get_bucket(bucket_name)
        # key is a .zip file
        key = b.get_key(key_name)

        count = 0
        for chunk in key:
            # How should decompress happen?
            count += decompress(chunk).count('\n')

        return count

This answer demonstrates a method of doing the same thing for gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, which seems to require random access to the entire file being unzipped.
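The linked answer is not reproduced here, but a common way to stream gzip data is an incremental `zlib` decompressor. The following is a minimal sketch under that assumption; the function name `count_newlines_gzip` and the in-memory test data are hypothetical stand-ins for the S3 key iterator.

```python
import gzip
import zlib


def count_newlines_gzip(chunks):
    """Count newlines in a gzip stream given as an iterable of byte chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    count = 0
    for chunk in chunks:
        # Only the current chunk and its decompressed output are held in memory.
        count += decompressor.decompress(chunk).count(b"\n")
    # Collect anything still buffered inside the decompressor.
    count += decompressor.flush().count(b"\n")
    return count


# In-memory data standing in for chunks read from S3 (an assumption).
payload = gzip.compress(b"a\nb\nc\n")
pieces = [payload[i:i + 7] for i in range(0, len(payload), 7)]
print(count_newlines_gzip(pieces))
```

This works for gzip because the format can be read front to back; the zipfile module instead looks for the central directory at the end of the archive, which is why it wants a seekable file.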

You can use https://pypi.python.org/pypi/tubing, which has built-in S3 source support using boto3.

    from tubing.ext import s3
    from tubing import pipes, sinks

    # Stream the S3 object, gunzip it, and split it into lines
    output = s3.S3Source(bucket, key) \
        | pipes.Gunzip() \
        | pipes.Split(on=b'\n') \
        | sinks.Objects()
    print(len(output))

If you didn't want to store the entire output in the returned sink, you could make your own sink that just counts. The implementation would look like:

    class CountWriter(object):
        def __init__(self):
            self.count = 0

        def write(self, chunk):
            # Called once per item that reaches the sink
            self.count += len(chunk)

    Counter = sinks.MakeSink(CountWriter)
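For archives in actual .zip format rather than gzip, one possible approach that uses only the standard library is to parse the 30-byte local file header at the start of the archive and feed the rest to a raw-deflate `zlib` decompressor. The sketch below is an illustration, not part of tubing or of the question's code: `stream_unzip_single` is a hypothetical helper, it handles only the deflate method, and it reads only the first member of the archive.

```python
import io
import struct
import zipfile
import zlib

# Zip local file header: signature, version, flags, method, mtime, mdate,
# crc32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def stream_unzip_single(chunks):
    """Yield decompressed bytes of the first member of a zip byte stream."""
    buf = b""
    data_start = None
    decomp = None
    for chunk in chunks:
        if decomp is None:
            # Still collecting the header, file name and extra field.
            buf += chunk
            if data_start is None:
                if len(buf) < LOCAL_HEADER.size:
                    continue
                fields = LOCAL_HEADER.unpack_from(buf)
                signature, method = fields[0], fields[3]
                name_len, extra_len = fields[9], fields[10]
                if signature != b"PK\x03\x04":
                    raise ValueError("not a zip local file header")
                if method != 8:
                    raise ValueError("only deflate (method 8) is handled")
                data_start = LOCAL_HEADER.size + name_len + extra_len
            if len(buf) < data_start:
                continue
            # Negative wbits selects raw deflate, as stored inside zip files.
            decomp = zlib.decompressobj(-zlib.MAX_WBITS)
            chunk, buf = buf[data_start:], b""
        out = decomp.decompress(chunk)
        if out:
            yield out
        if decomp.eof:
            # End of the deflate stream; what follows is zip bookkeeping.
            return
    raise ValueError("truncated zip stream")


# In-memory archive standing in for chunks read from S3 (an assumption).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", b"line\n" * 1000)
raw = archive.getvalue()
pieces = (raw[i:i + 64] for i in range(0, len(raw), 64))
print(sum(part.count(b"\n") for part in stream_unzip_single(pieces)))
```

Because the deflate stream marks its own end, the sketch does not depend on the size fields in the header, which may be zero when the archive was written with a trailing data descriptor. It does not verify the CRC-32, so corruption would go undetected unless that check is added.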
