Sunday, 16 May 2021

Pandas - Updating columns based on several conditions - group by method

I am aiming to the below output.

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match
1111 AAA 123 AB1 1 first three-four col and offset
1111 AAA 123 CD1 -1 first three-four col and offset
2222 BBB 456 EF1 -4 first three-four col and offset
2222 BBB 456 GH1 -1 first three-four col and offset
2222 BBB 456 IL1 5 first three-four col and offset
3333 CCC 789 MN1 2 first two col and offset
3333 CCC 101 MN1 -2 first two col and offset
4444 DDD 121 UYT 6 first two col and offset
4444 DDD 131 FB1 -5 first two col and offset
4444 DDD 141 UYT -1 first two col and offset
5555 EEE 151 CB1 3 first two col and offset
5555 EEE 161 CR1 -3 first two col and offset
5555 EEE 161 CR1 -5 first or no match
6666 FFF 111 CB1 4 first or no match
7777 GGG 222 ZB1 10.5 first three-four col and small offset
7777 GGG 222 ZB1 -10 first three-four col and small offset

1st rule) the first three columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination must have the associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

2nd rule) the first two columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination must have the associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

3rd rule) no match.

4th rule) the first three columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination can have a difference of 0.5 AT MOST (col E) and NO offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

Please see my code below.

Through the below code, I am able to get a similar result, however, this way does not work with uneven entries, for instance the first three rows would result as following:

A B C D E F
0000 ZZZ 987 QW1 8 first or no match
0000 ZZZ 987 QW1 -8 first or no match
0000 ZZZ 987 QW1 -8 first or no match

Instead of the following:

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match

Code so far:

m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)  # Rule 1
m2 = df.groupby(['A', 'B'])['E'].transform('sum').eq(0)  # Rule 2
m3 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').abs().le(0.5)  # Rule 4

df['new'] = np.select([m1, m2, m3], ['first three-four col and offset',
                      'first two col and offset', 'first three-four col and small offset'], 'first or no match')


from Pandas - Updating columns based on several conditions - group by method

No comments:

Post a Comment