Tuesday, 8 January 2019

AWS Lambda memory usage with temporary files in Python code

Does data written to temporary files contribute to memory usage in AWS Lambda? In a Lambda function, I'm streaming a file to a temporary file. In the Lambda logs, I see that the max memory used is larger than the file that was downloaded. Strangely, if the function is invoked multiple times in quick succession, the invocations that downloaded smaller files still report the max memory used from the invocation that downloaded the larger file (presumably because the same warm execution environment, and its memory high-water mark, is reused across invocations). I have the concurrency limit set to 2.

When I run the code locally, memory usage is, as expected, around 20 MB. On Lambda it is 180 MB, which is about the size of the file that is streamed. The code simply uses the Python requests library to stream the file download, shutil.copyfileobj() to write it to a tempfile.TemporaryFile(), and then pipes that file to the Postgres COPY ... FROM STDIN command.

This makes it seem like /tmp storage counts towards memory usage, but I haven't found any mention of this. The only mention of /tmp in the Lambda documentation is that there is a 512 MB limit.
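One way to check from inside the handler is to log resident memory next to /tmp usage. Here is a minimal diagnostic sketch; log_usage() is a hypothetical helper, and it assumes the Linux /proc filesystem that the Lambda runtime exposes:

import shutil

def log_usage(note=""):
    # VmRSS is the process's resident memory as reported by the kernel
    with open("/proc/self/status") as f:
        vmrss = next(line.strip() for line in f if line.startswith("VmRSS"))
    # shutil.disk_usage() shows how much of the 512 MB /tmp is in use
    used_mb = shutil.disk_usage("/tmp").used // (1024 * 1024)
    print(f"{note}: {vmrss}, /tmp used: {used_mb} MB")

Calling it before and after shutil.copyfileobj() would show whether the reported memory grows in step with the temporary file.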

Example code:

import sys
import json
import os
import io
import re
import traceback
import shutil
import tempfile

import boto3
import psycopg2
import requests

# POSTGRES_DSN, REPLACE_TABLE and the helpers get_token(), load_metadata(),
# ensure_table() and notify_failed() are defined elsewhere and omitted here.


def handler(event, context):
    try:
        import_data(event["report_id"])
    except Exception as e:
        notify_failed(e, event)
        raise

def import_data(report_id):
    token = get_token()
    conn = psycopg2.connect(POSTGRES_DSN, connect_timeout=30)
    cur = conn.cursor()

    metadata = load_metadata(report_id, token)
    table = ensure_table(metadata, cur, REPLACE_TABLE)
    conn.commit()
    print(f"report {report_id}: downloading")
    with download_report(report_id, token) as f:
        print(f"report {report_id}: importing data")
        with conn, cur:  # commit on success, roll back on error
            cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)
        print(f"report {report_id}: data import complete")
    conn.close()


def download_report(report_id, token):
    url = f"https://some_url"
    params = {"includeHeader": True}
    headers = {"authorization": f"Bearer {token['access_token']}"}

    with requests.get(url, params=params, headers=headers, stream=True) as r:
        r.raise_for_status()
        tmp = tempfile.TemporaryFile()  # created under /tmp by default on Linux
        print("streaming contents to temporary file")
        shutil.copyfileobj(r.raw, tmp)
        tmp.seek(0)
        # the response is closed when the function returns; the temp file
        # already holds the full contents, so that's safe
        return tmp


if __name__ == "__main__":
    if len(sys.argv) > 1:
        handler({"report_id": sys.argv[1]}, None)

UPDATE: After changing the code to skip the temporary file and stream the download directly to the Postgres COPY command, the memory usage was fixed. This makes me think the /tmp directory does contribute to the logged memory usage.
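For reference, here is a minimal sketch of that change, assuming the same URL, params, and token handling as download_report() above (stream_report() is a hypothetical name):

import contextlib

import requests


@contextlib.contextmanager
def stream_report(report_id, token):
    url = f"https://some_url"
    params = {"includeHeader": True}
    headers = {"authorization": f"Bearer {token['access_token']}"}
    with requests.get(url, params=params, headers=headers, stream=True) as r:
        r.raise_for_status()
        # r.raw is a file-like object, so copy_expert() can read from it
        # directly and no bytes are ever written to /tmp
        yield r.raw

import_data() then uses it in place of download_report():

    with stream_report(report_id, token) as f:
        with conn, cur:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)

One caveat: r.raw is the undecoded byte stream, so if the server compresses the response you'd need to set r.raw.decode_content = True before handing it to Postgres.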


