compression - Streaming decompression of zip archives in Python
Is there a way to do streaming decompression of single-file zip archives?
I have arbitrarily large zipped archives (a single file per archive) in S3. I'd like to be able to process the files by iterating over them without having to download the files to disk or into memory.
A simple example:
    import boto

    def count_newlines(bucket_name, key_name):
        conn = boto.connect_s3()
        b = conn.get_bucket(bucket_name)
        # key is a .zip file
        key = b.get_key(key_name)

        count = 0
        for chunk in key:
            # How should decompress happen?
            count += decompress(chunk).count('\n')

        return count

This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.
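For reference, the gzip case mentioned above is commonly handled with a streaming `zlib` decompressor. The following is only a minimal sketch of that general technique and may not match the linked answer exactly; the function names `stream_gunzip` and `count_newlines_gzip` are made up for illustration, and `chunks` is assumed to be any iterable of byte strings (such as the key iteration in the example above).

```python
import zlib


def stream_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed byte chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    # Emit anything still buffered inside the decompressor.
    tail = decomp.flush()
    if tail:
        yield tail


def count_newlines_gzip(chunks):
    """Count newline bytes without holding the whole payload in memory."""
    return sum(part.count(b"\n") for part in stream_gunzip(chunks))
```

Only one compressed chunk and its decompressed output are held in memory at a time.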
You can use https://pypi.python.org/pypi/tubing, which has built-in S3 source support using boto3.
    from tubing.ext import s3
    from tubing import pipes, sinks

    output = s3.S3Source(bucket, key) \
        | pipes.Gunzip() \
        | pipes.Split(on=b'\n') \
        | sinks.Objects()
    print(len(output))

If you didn't want to store the entire output in the returned sink, you could make your own sink that just counts. The implementation would look like:
    class CountWriter(object):
        def __init__(self):
            self.count = 0

        def write(self, chunk):
            self.count += len(chunk)

    Counter = sinks.MakeSink(CountWriter)
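Separately from tubing, the single-file zip case from the question can be sketched with the standard library alone: read past the zip local file header, then feed the remaining bytes to a raw-deflate `zlib` decompressor. This is a minimal sketch under several assumptions, not a general zip reader: the archive starts with the local file header of its only member, the member is deflate-compressed (method 8) and not encrypted, and `chunks` is any iterable of byte strings. The names `stream_unzip_single` and `count_newlines_zip` are hypothetical.

```python
import struct
import zlib

# Fixed 30-byte part of a zip local file header: signature, version, flags,
# method, mod time, mod date, crc32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def stream_unzip_single(chunks):
    """Yield decompressed bytes of the first member of a zip byte stream."""
    # Negative wbits selects raw deflate, which is how zip stores member data.
    decomp = zlib.decompressobj(-zlib.MAX_WBITS)
    buf = b""
    header_done = False
    for chunk in chunks:
        if decomp.eof:
            break  # deflate stream ended; the rest is the central directory
        if not header_done:
            buf += chunk
            if len(buf) < LOCAL_HEADER.size:
                continue
            fields = LOCAL_HEADER.unpack_from(buf)
            sig, method, name_len, extra_len = (
                fields[0], fields[3], fields[9], fields[10])
            if sig != b"PK\x03\x04":
                raise ValueError("stream does not start with a zip local header")
            if method != 8:
                raise ValueError("this sketch only handles deflate (method 8)")
            data_start = LOCAL_HEADER.size + name_len + extra_len
            if len(buf) < data_start:
                continue
            # Everything after the variable-length header is compressed data.
            chunk, buf, header_done = buf[data_start:], b"", True
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail


def count_newlines_zip(chunks):
    """Count newline bytes in the archive member without buffering it all."""
    return sum(part.count(b"\n") for part in stream_unzip_single(chunks))
```

Because a deflate stream marks its own end, the sketch does not rely on the size fields in the header. Assuming that iterating the boto key yields byte chunks, as in the question's `for chunk in key` loop, the key could be passed in directly, e.g. `count_newlines_zip(key)`.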