Sunday, 4 December 2022

Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

I have 2 dataframes consisting expected_orders and actual_orders details.

Input data:

input

I want to create a label field in both dataframe and split the rows based on following criteria:

  • Sort by country, product and date
  • Group both data frames by country and product
  • In both data frames, for each group if row's date and qty are matching then assign label same actual date/ same expected date
  • If qty is matching but dates are different then assign labels (earlier expected date/ later expected date) and (earlier actual date/ later actual date)
  • If qty is not an exact match but there are qty values remaining in other data frame of that group then split the row with greater qty value df to 2 rows: matching (less) qty value and remaining value
  • Repeat steps unless all rows have labels
  • If no quantity is remaining from other group then assign label no actual date or no expected date

Expected output:

expected output

I am trying to do this with nested loops but with millions of rows this is quite slow.

for key, exp in expected_grouped:
  act = actual_grouped.get_group(key)
  ...
  for i, outerrow in enumerate(exp.itertuples()):
    for j, innerrow in enumerate(act.itertuples()):
      if: ...
      elif: ...

Is there any better and faster way to do this? Any suggestions for improvement would be highly appreciated.



from Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

No comments:

Post a Comment