Hemant Vishwakarma: Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

Sunday, 4 December 2022

Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

I have 2 dataframes consisting expected_orders and actual_orders details.

Input data:

I want to create a label field in both dataframe and split the rows based on following criteria:

Sort by country, product and date
Group both data frames by country and product
In both data frames, for each group if row's date and qty are matching then assign label same actual date/ same expected date
If qty is matching but dates are different then assign labels (earlier expected date/ later expected date) and (earlier actual date/ later actual date)
If qty is not an exact match but there are qty values remaining in other data frame of that group then split the row with greater qty value df to 2 rows: matching (less) qty value and remaining value
Repeat steps unless all rows have labels
If no quantity is remaining from other group then assign label no actual date or no expected date

Expected output:

I am trying to do this with nested loops but with millions of rows this is quite slow.

for key, exp in expected_grouped:
  act = actual_grouped.get_group(key)
  ...
  for i, outerrow in enumerate(exp.itertuples()):
    for j, innerrow in enumerate(act.itertuples()):
      if: ...
      elif: ...

Is there any better and faster way to do this? Any suggestions for improvement would be highly appreciated.

from Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

Hemant Vishwakarma

Sunday, 4 December 2022

Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

No comments:

Post a Comment