Saturday, 29 August 2020

Creating large zip files in AWS S3 in chunks

So, this question ends up being about both Python and S3.

Let's say I have an S3 bucket with these files:

file1 --------- 2GB
file2 --------- 3GB
file3 --------- 1.9GB
file4 --------- 5GB

These files were uploaded using a presigned POST URL for S3.

What I need to do is give the client the ability to download them all in a ZIP (or similar), but I can't do it in memory, nor on server storage, as this is a serverless setup.

From my understanding, the server ideally needs to (see the boto3 sketch after this list):

  1. Start a multipart upload job on S3
  2. Probably send a chunk to the multipart job as the header of the zip file
  3. Download each file in the bucket chunk by chunk, in some sort of stream, so as not to overflow memory
  4. Use said stream to then create a zip chunk and send it to the multipart job
  5. Finish the multipart job and the zip file
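
Something like the following should cover steps 1, 4 and 5, at least for the multipart-upload side. A minimal sketch, assuming a hypothetical zip_chunks() generator that yields the archive in pieces of at least 5 MB each (an S3 requirement for every part except the last); the bucket and key names are made up:

  import boto3

  s3 = boto3.client('s3')
  BUCKET, KEY = 'my-bucket', 'archive.zip'  # hypothetical names

  # Step 1: start the multipart upload
  upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)['UploadId']

  # Steps 2-4: send each zip chunk as a part, remembering the ETags
  parts = []
  for number, chunk in enumerate(zip_chunks(), start=1):  # zip_chunks() is hypothetical
      part = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                            PartNumber=number, Body=chunk)
      parts.append({'ETag': part['ETag'], 'PartNumber': number})

  # Step 5: finish the multipart job
  s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                               MultipartUpload={'Parts': parts})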

Now, I honestly have no idea how to achieve this, or whether it is even possible, but some questions are:

  • How do I download a file from S3 in chunks? Preferably using boto3 or botocore (see the sketch after this list)
  • How do I create a zip file in chunks while freeing memory?
  • How do I connect this all in a multipart upload?
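
For the first question, boto3's get_object returns a streaming body that can be read in fixed-size chunks, so only one chunk is ever held in memory. A minimal sketch, with placeholder bucket/key names:

  import boto3

  s3 = boto3.client('s3')

  def iter_s3_object(bucket, key, chunk_size=1024 * 1024):
      # Yield the object's bytes one chunk at a time, never buffering it whole
      body = s3.get_object(Bucket=bucket, Key=key)['Body']
      yield from body.iter_chunks(chunk_size=chunk_size)

  for chunk in iter_s3_object('my-bucket', 'file4'):  # placeholder names
      ...  # feed the chunk into the zip stream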

Edit: Now that I think about it, maybe I don't even need to put the ZIP file in S3 at all; I can just stream it directly to the client, right? That would actually be much better.

Here's some hypothetical code, assuming my edit above:

  # Let's assume Flask
  from flask import Flask, Response

  app = Flask(__name__)

  @app.route('/download_bucket_as_zip')
  def stream_file():
      def stream():
          # Probably needs to yield zip headers/metadata?
          for file in getFilesFromBucket():       # hypothetical helper
              for chunk in file.readChunk(4000):  # hypothetical method
                  zipchunk = bytesToZipChunk(chunk)  # hypothetical helper
                  yield zipchunk
      return Response(stream(), mimetype='application/zip')
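
To make the hypothetical bytesToZipChunk part concrete: since Python 3.6, zipfile can write to an unseekable file-like object (it then emits data descriptors instead of seeking back to patch headers). So one plausible pattern, sketched below under those assumptions, is a small write-only buffer that ZipFile writes into and the generator drains; iter_s3_object is the chunked-download helper from earlier, and force_zip64=True is needed because some members here exceed 4 GB:

  import zipfile

  class StreamBuffer:
      # Write-only, unseekable sink: ZipFile writes in, the generator drains out
      def __init__(self):
          self._chunks = []
          self._pos = 0

      def write(self, data):
          self._chunks.append(bytes(data))
          self._pos += len(data)
          return len(data)

      def tell(self):
          # zipfile only needs tell(); having no seek() puts it in unseekable mode
          return self._pos

      def drain(self):
          data = b''.join(self._chunks)
          self._chunks = []
          return data

  def zip_stream(bucket, keys):
      buf = StreamBuffer()
      with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
          for key in keys:
              # force_zip64: some members here are larger than 4 GB
              with zf.open(key, mode='w', force_zip64=True) as member:
                  for chunk in iter_s3_object(bucket, key):  # helper from earlier
                      member.write(chunk)
                      yield buf.drain()  # pass compressed bytes on, freeing memory
      yield buf.drain()  # central directory is written when the ZipFile closes

The Flask view then just becomes return Response(zip_stream('my-bucket', keys), mimetype='application/zip'), and the same generator could equally feed the upload_part calls if the archive needs to land back in S3 after all.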


from Creating large zip files in AWS S3 in chunks
