I need to process two types of JSON files and store their details in AWS DynamoDB. Let's call them type A and type B files. Type A includes process ID, start date, client ID, and process type, while type B includes process ID, end date, process type, and client ID. The only difference between them is that type A carries a start date and type B carries an end date.
The attributes of my DynamoDB table are process ID, start date, end date, and client ID.
For instance:
Type A record: `{'process_id': 1234, 'start_date': '2023-08-01', 'client_id': 'C-1', 'process_type': 'A'}`
Type B record: `{'process_id': 1234, 'end_date': '2023-08-10', 'client_id': 'C-1', 'process_type': 'B'}`
I will store these records as a single entry in the DynamoDB table:
`{'process_id': 1234, 'start_date': '2023-08-01', 'end_date': '2023-08-10', 'client_id': 'C-1'}`
Depending on which type comes first, I will insert that record. If another record with the same process ID and a different type comes later, I will update the date accordingly. For example:
If a type A record comes first, I will insert the record with process ID, start date, and client ID (only the end date missing). Later, if type B data with the same process ID 1234 arrives, I will update the end date instead of creating a new entry. Similarly, if type B comes first, I will insert with process ID, end date, and client ID (only the start date missing) and update the start date when a type A record with the same process ID arrives. I don't know which type of record will arrive first.
Type A: `{'process_id': 1234, 'start_date': '2023-08-01', 'client_id': 'C-1'}` // Type A arrived first, so INSERT the record without the `end_date`.
Type B: `{'process_id': 1234, 'end_date': '2023-08-10', 'client_id': 'C-1'}` // When Type B arrives with the `end_date` value, UPDATE the record above to include 'end_date': '2023-08-10'.
Actual record in the table: `{'process_id': 1234, 'start_date': '2023-08-01', 'end_date': '2023-08-10', 'client_id': 'C-1'}` // Vice versa if type B comes first.
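Per record, the intended flow is roughly the following sketch (the helper functions are hypothetical placeholders for the DynamoDB calls shown further below):

```python
def process_record(record):
    # Each record contributes either start_date (type A) or end_date (type B).
    date_attr = 'start_date' if record['process_type'] == 'A' else 'end_date'

    if not item_exists(record['process_id']):      # hypothetical existence check
        insert_partial_item(record, date_attr)     # stored with the other date missing
    else:
        set_date(record['process_id'], date_attr, record[date_attr])  # fill in the missing date
```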
I have implemented a Lambda function that processes records whenever an object is created in the S3 bucket. Using boto3, I insert each record with `put_item` and a `ConditionExpression`, as follows:
```python
import boto3

dynamodb = boto3.client('dynamodb')

# Start the update expression with the attributes every record carries,
# then append the date attribute that this record type provides.
update_expression = 'SET client_id = :client_id_val'
expression_attribute_values = {':client_id_val': {'S': client_id}}

condition_expression = 'attribute_not_exists(process_id)'
try:
    # Insert only if no item with this process_id exists yet.
    dynamodb.put_item(
        TableName='my-table',
        Item=item,
        ConditionExpression=condition_expression
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    # The item already exists, so fill in the date this record type carries.
    if process_type == 'typeA':
        update_expression += ', start_date = :start_date_val'
        expression_attribute_values[':start_date_val'] = {'S': start_date}
    elif process_type == 'typeB':
        update_expression += ', end_date = :end_date_val'
        expression_attribute_values[':end_date_val'] = {'S': end_date}
    dynamodb.update_item(
        TableName='my-table',
        Key={'process_id': {'S': process_id}},
        UpdateExpression=update_expression,
        ExpressionAttributeValues=expression_attribute_values
    )
```
I use the `ConditionExpression` to check whether a record with a given process ID already exists. If it doesn't, I insert the record; if it does, I update the dates accordingly.
However, I'm facing a performance issue: a single file of either type contains 50,000 records. I attempted to use `batch_writer` for inserting and updating, but it supports neither `put_item` with a `ConditionExpression` nor `update_item`.
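For reference, the `batch_writer` attempt is a sketch like the one below, using the higher-level Table resource (where `batch_writer` lives); it only buffers plain puts and deletes, with no condition expressions and no updates:

```python
import boto3

table = boto3.resource('dynamodb').Table('my-table')

with table.batch_writer() as batch:
    for item in items:  # `items`: the parsed records from the file (assumed)
        # batch_writer only accepts unconditional puts (or deletes), so an
        # existing item with the same process_id would simply be overwritten.
        batch.put_item(Item=item)
```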
I also experimented with calling `get_item` inside `batch_writer` to check whether the record exists: if it doesn't, I use `put_item` to insert; if it does, I use `update_item` to set the date. However, `get_item` results in an individual call per item just to check existence, and `update_item` is not supported by `batch_writer` at all. Is there a more optimized way to insert and update the records?
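The per-item check I experimented with is essentially the following (sketched with the Table resource for brevity, so attribute values are plain Python types rather than the `{'S': ...}` form used above; `records` is assumed to be the parsed list):

```python
table = boto3.resource('dynamodb').Table('my-table')

for record in records:
    # One GetItem round trip per record just to decide between insert and update.
    resp = table.get_item(Key={'process_id': record['process_id']})
    date_attr = 'start_date' if record['process_type'] == 'A' else 'end_date'

    if 'Item' not in resp:
        table.put_item(Item=record)  # first record for this process_id: insert
    else:
        table.update_item(           # second record: fill in the missing date
            Key={'process_id': record['process_id']},
            UpdateExpression=f'SET {date_attr} = :d',
            ExpressionAttributeValues={':d': record[date_attr]},
        )
```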
I've considered multithreading, but each record still ends up as an individual request to DynamoDB. Issuing 50,000 + 50,000 (type A + type B) individual insert/update requests is a significant performance concern. Note that I will split the records into batches of 1000 to prevent Lambda timeouts. Thanks in advance.
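For completeness, the batching/multithreading I have in mind looks like this sketch (batch size matches the 1000 mentioned above; the worker count and `process_record` are assumptions, the latter being the per-record routine sketched earlier). Each item still becomes its own PutItem/UpdateItem request, which is exactly the concern:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000  # split to stay within the Lambda timeout

def chunks(records, size=BATCH_SIZE):
    for i in range(0, len(records), size):
        yield records[i:i + size]

with ThreadPoolExecutor(max_workers=10) as pool:  # worker count is an assumption
    for batch in chunks(records):
        # Parallelism hides some latency but still issues one request per record.
        list(pool.map(process_record, batch))
```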