I have 2 dataframes consisting expected_orders and actual_orders details.
Input data:
I want to create a label field in both dataframe and split the rows based on following criteria:
- Sort by country, product and date
- Group both data frames by country and product
- In both data frames, for each group if row's date and qty are matching then assign label same actual date/ same expected date
- If qty is matching but dates are different then assign labels (earlier expected date/ later expected date) and (earlier actual date/ later actual date)
- If qty is not an exact match but there are qty values remaining in other data frame of that group then split the row with greater qty value df to 2 rows: matching (less) qty value and remaining value
- Repeat steps unless all rows have labels
- If no quantity is remaining from other group then assign label no actual date or no expected date
Expected output:
I am trying to do this with nested loops but with millions of rows this is quite slow.
for key, exp in expected_grouped:
act = actual_grouped.get_group(key)
...
for i, outerrow in enumerate(exp.itertuples()):
for j, innerrow in enumerate(act.itertuples()):
if: ...
elif: ...
Is there any better and faster way to do this? Any suggestions for improvement would be highly appreciated.
from Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark


No comments:
Post a Comment