Thursday, 7 September 2023

Optimizing DynamoDB Operations for Multiple Record Types: Inserts and Updates

I need to process two types of JSON files and store their details in AWS DynamoDB. Let's call them type A and type B files. Type A includes process ID, start date, client ID, and process type, while type B includes process ID, end date, process type, and client ID. The only difference between them is that type A carries a start date and type B carries an end date.

The attributes of my DynamoDB table are process ID, start date, end date, and client ID.

For instance:

Type A record: 'process_id': 1234, 'start_date': '2023-08-01', 'client_id': 'C-1', 'process_type': 'A'

Type B record: 'process_id': 1234, 'end_date': '2023-08-10', 'client_id': 'C-1', 'process_type': 'B'

I will store these records as a single entry in the DynamoDB table:

'process_id': 1234, 'start_date': '2023-08-01', 'end_date': '2023-08-10', 'client_id': 'C-1'

Depending on which type comes first, I will insert that record. If another record with the same process ID and a different type comes later, I will update the date accordingly. For example:

If a type A record comes first, I will insert it with process ID, start date, and client ID (only the end date is missing). Later, when a type B record with the same process ID 1234 arrives, I will update the end date instead of creating a new entry. Similarly, if type B comes first, I will insert it with process ID, end date, and client ID (only the start date is missing), and update the start date when the type A record with the same process ID arrives. I don't know which type of record will arrive first.

Type A: `{'process_id': 1234, 'start_date': '2023-08-01', 'client_id': 'C-1'}` // Type A arrived first, so INSERT the record without the `end_date`.
Type B: `{'process_id': 1234, 'end_date': '2023-08-10', 'client_id': 'C-1'}` // When Type B arrives with the `end_date` value, UPDATE the record above to include `'end_date': '2023-08-10'`.
Actual record in table: `{'process_id': 1234, 'start_date': '2023-08-01', 'end_date': '2023-08-10', 'client_id': 'C-1'}` // Vice versa if type B comes first.
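
The intended end state can be sketched in plain Python, independent of DynamoDB. This is only an illustration of the merge rule described above; `merge_record` is a hypothetical helper name, not part of my Lambda code, and it assumes `process_type` is dropped because it is not one of the table attributes.

```python
def merge_record(existing, incoming):
    """Merge an incoming type A or type B record into the stored item."""
    # Start from the stored item, or an empty dict when this is the first arrival.
    merged = dict(existing or {})
    for key, value in incoming.items():
        # process_type is not one of the table attributes, so it is not stored.
        if key != "process_type":
            merged[key] = value
    return merged


type_a = {"process_id": 1234, "start_date": "2023-08-01",
          "client_id": "C-1", "process_type": "A"}
type_b = {"process_id": 1234, "end_date": "2023-08-10",
          "client_id": "C-1", "process_type": "B"}

# Apply the two records in both possible arrival orders.
a_first = merge_record(merge_record(None, type_a), type_b)
b_first = merge_record(merge_record(None, type_b), type_a)
```

Both `a_first` and `b_first` end up with the same four attributes, which is the property the table writes need to preserve.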

I have implemented a Lambda function to process records whenever an object is created in the S3 bucket. Using boto3, I insert each record using put_item, employing a ConditionExpression as follows:

import boto3

dynamodb = boto3.client('dynamodb')

# item, process_id, process_type, start_date and end_date come from the parsed record.
condition_expression = 'attribute_not_exists(process_id)'
try:
    # Insert only if no item with this process_id exists yet.
    dynamodb.put_item(
        TableName='my-table',
        Item=item,
        ConditionExpression=condition_expression
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    # The item already exists, so set only the date carried by this record.
    if process_type == 'A':
        update_expression = 'SET start_date = :start_date_val'
        expression_attribute_values = {':start_date_val': {'S': start_date}}
    elif process_type == 'B':
        update_expression = 'SET end_date = :end_date_val'
        expression_attribute_values = {':end_date_val': {'S': end_date}}
    else:
        raise ValueError(f'Unexpected process_type: {process_type}')
    dynamodb.update_item(
        TableName='my-table',
        Key={'process_id': {'S': process_id}},
        UpdateExpression=update_expression,
        ExpressionAttributeValues=expression_attribute_values
    )

I use the ConditionExpression to check whether a record with a certain process ID already exists. If it doesn't, I insert the record. If it does, I update the dates accordingly.
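
One variation worth noting: DynamoDB's UpdateItem creates the item when the key does not exist yet, so the insert-or-update decision could in principle collapse into a single `update_item` call per record. The sketch below only builds the request parameters; `build_upsert_params` is a hypothetical helper, and it assumes `process_id` is a string key (as in the `Key` above) and that `process_type` is `'A'` or `'B'` as in the sample records.

```python
def build_upsert_params(record, table_name="my-table"):
    """Build keyword arguments for a single boto3 client update_item call."""
    # Only the date carried by this record type is written, so a start_date
    # or end_date already stored on the same item is left untouched.
    date_attr = "start_date" if record["process_type"] == "A" else "end_date"
    return {
        "TableName": table_name,
        "Key": {"process_id": {"S": str(record["process_id"])}},
        "UpdateExpression": f"SET client_id = :client_id, {date_attr} = :date_val",
        "ExpressionAttributeValues": {
            ":client_id": {"S": record["client_id"]},
            ":date_val": {"S": record[date_attr]},
        },
    }


params = build_upsert_params(
    {"process_id": 1234, "end_date": "2023-08-10",
     "client_id": "C-1", "process_type": "B"}
)
# With a boto3 client this would be sent as: dynamodb.update_item(**params)
```

This halves the worst-case number of calls per record, but it is still one request per record, so it does not by itself solve the batching problem described next.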

However, I'm facing a performance issue. Each file type contains 50,000 records. I attempted to use batch_writer for inserting and updating, but it supports neither put_item with a ConditionExpression nor update_item.

I experimented with using get_item inside batch_writer to check if the record exists. If it doesn't, I use put_item to insert; if it does, I use update_item to update the date. However, using get_item seems to create individual calls for each item to check existence, and update_item is not supported by batch_writer. Is there a more optimized way to insert and update the records?

I've considered multithreading, but it still seems to result in individual insert requests to DynamoDB. Making 50,000+50,000 (type A + type B) records as individual requests for insert/update is a significant performance concern. Note that I will split the records into batches of 1000 to prevent Lambda timeouts. Thanks in advance.
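
For reference, the batching and multithreading idea looks roughly like the sketch below. `chunked` and `write_batch` are hypothetical helper names, `write_one` stands in for whatever per-record DynamoDB call is used, and the worker count of 16 is an arbitrary assumption. Threads only overlap the network latency; every record is still its own request.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size=1000):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def write_batch(batch, write_one, max_workers=16):
    """Call write_one for every record in the batch using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all calls and re-raises any exception from a worker.
        return list(pool.map(write_one, batch))


records = [{"process_id": i} for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(records)]

# Stand-in writer for illustration; a real one would issue the DynamoDB request.
results = write_batch(records[:10], lambda record: record["process_id"])
```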
