Friday 5 March 2021

How to manipulate data cell by cell in pandas df?

Let the sample df (df1) be,

sample data

We can achieve df2 or final data-frame by manipulating the data of df1 in the following manner,

Step 1: Remove all positive numbers including zeros

After Step 1 the sample data should look like,

sample data after Step1

Step 2: If column4 A row is a negative number and column4 B is blank, then remove the -ve number of column4 A row

Step 3: If column4 A row is blank and column4 B is a negative number, then keep the -ve number of column4 B row

sample data after step2 and 3

After Steps 1,2 and 3 are done,

Step 4: If both A and B of column4 are negative then,

For each A and B row of column4, check the left-side (LHS) value (for a given month) of the same A and B row of column4

Step 4.1: If either of the LHS values of A or B is a -ve number, then delete the current row value of B column4 and keep the current row value of A column4

After Step 4.1, the sample data should look like this,

sample data after step 4.1

Step 4.2:

If the LHS value of A and B column4 is blank, then keep the current row value of B column4 and delete the current row value of A column4

Sample data after Step 4.2 should look like,

sample data after step4.2

Since we see two negative numbers still, we perform Step 4.1 again and then the final data-frame or df2 will look like,

sample data after Step4.1

How may I achieve the above using pandas? I was able to achieve till Step 1 but have no idea as to how to proceed further. Any help would be greatly appreciated.

This is the approach that I took,

import pandas as pd
df = pd.read_excel('df1.xlsx', engine='openpyxl')
df.to_pickle('./df1.pkl')
unpickled_df = pd.read_pickle('./df1.pkl')
rem_cols = ['column2', 'column3', 'column5', 'column6', 'column7']
unpickled_df['g'] = unpickled_df.groupby(['column1', 'column4'] ).cumcount()
df1 = unpickled_df.drop(rem_cols, axis=1)
df1 = df1.set_index(['column1','g', 'column4'])
df1.columns = pd.to_datetime(df1.columns, format='%b-%y').strftime('%b-%y')
first_date = df1.columns[0]
df1 = df1.unstack(-1)
df1 = df1.mask(df1.ge(0))
m1 = (df1.xs('A', level=1, axis=1, drop_level=False).notna() & 
      df1.xs('B', level=1, axis=1, drop_level=False).rename(columns={'B':'A'}, level=1).isna())
m2 = (df1.xs('B', level=1, axis=1, drop_level=False).notna() &
      df1.xs('A', level=1, axis=1, drop_level=False).rename(columns={'A':'B'}, level=1).isna())

m = m1.join(m2)
df1 = df1.mask(m)
df2 = df1.groupby(level=1, axis=1).shift(1, axis=1)
mask1 = df1.notna() & df2.isna() & (df1.columns.get_level_values(1) == 'A')[ None, :]
mask1[first_date] = False
mask2 = df1.notna() & df2.notna() & (df1.columns.get_level_values(1) == 'B')[ None, :]
df1 = df1.mask(mask1).mask(mask2).stack(dropna=False)
unpickled_df = unpickled_df[rem_cols + ['column1','g', 'column4']].join(df1, on=['column1','g', 'column4'])
#print(unpickled_df)

Small Test data: df1,

{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [0, 10, 0, 0], 'Mar-21': [0, 0, 70, 70], 'Apr-21': [-10, -10, -8, 60], 'May-21': [-30, -60, -10, 40], 'Jun-21': [-20, 9, -40, -20], 'Jul-21': [30, -10, 0, -20], 'Aug-21': [-30, -20, 0, -20], 'Sep-21': [0, -15, 0, -20], 'Oct-21': [0, -15, 0, -20]}

df2 (expected output),

{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan], 'Apr-21': [nan, -10.0, nan, nan], 'May-21': [-30.0, nan, nan, nan], 'Jun-21': [nan, nan, nan, -20.0], 'Jul-21': [nan, -10.0, nan, -20.0], 'Aug-21': [-30.0, nan, nan, -20.0], 'Sep-21': [nan, -15.0, nan, -20.0], 'Oct-21': [nan, -15.0, nan, -20.0]}

Test data:

df1

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [11655, 0, 0, 0, 121117, 0, 14948, 0, 0, 0, 0, 0, 0, 0, 1838, 0, 0, 0], 'Mar-21': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -16474.0, -16474.0, 7000.0, 7000.0, -19946.0, -19946.0, 16084.44444444444, 0.0, 0.0, 0.0], 'Apr-21': [104815.0, 104815.0, 17949.0, 17949.0, 96132.0, 96132.0, 0.0, 0.0, -17001.0, -33475.0, -878.0, 6122.0, 8398.0, -11548.0, -5297.073170731703, -5297.073170731703, -282.0, -282.0], 'May-21': [78260.0, 183075.0, 42557.0, 60506.0, -15265.0, 80867.0, -18.0, -18.0, 21084.0, -12391.0, -1831.0, 4291.0, 2862.0, -8686.0, 5261.25, -35.8231707317027, -369.0, -651.0], 'Jun-21': [-52480.0, 130595.0, -13258.0, 47248.0, -35577.0, 45290.0, 2434.0, 2416.0, 31147.0, 18756.0, -4310.0, -19.0, -4750.0, -13436.0, -92.0, -127.8231707317027, -280.0, -931.0], 'Jul-21': [-174544.0, -43949.0, -38127.0, 9121.0, -124986.0, -79696.0, -9707.0, -7291.0, 13577.0, 32333.0, 0.0, -19.0, -15746.0, -29182.0, 93.0, -34.8231707317027, -319.0, -1250.0], 'Aug-21': [35498.0, -8451.0, -37094.0, -27973.0, 79021.0, -675.0, -1423.0, -8714.0, 32168.0, 64501.0, 0.0, -19.0, 18702.0, -10480.0, 4347.634146341465, 4312.810975609762, -341.0, -1591.0], 'Sep-21': [44195.0, 35744.0, 2039.0, -25934.0, 70959.0, 70284.0, 2816.0, -5898.0, 38359.0, 102860.0, 0.0, -19.0, 18119.0, 7639.0, 5302.222222222219, 9615.033197831981, 0.0, -1591.0], 'Oct-21': [-13163.0, 22581.0, -4773.0, -30707.0, 205080.0, 275364.0, -709.0, -6607.0, -1397.0, 101463.0, 0.0, -19.0, 0.0, 7639.0, -34.0, 9581.033197831981, 0.0, -1591.0]}

df2 (expected output) ,

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, -16474.0, nan, nan, nan, -19946.0, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, -17001.0, nan, nan, nan, nan, -11548.0, nan, -5297.073170731703, nan, -282.0], 'May-21': [nan, nan, nan, nan, nan, nan, nan, -18.0, nan, -12391.0, nan, nan, nan, -8686.0, nan, -35.8231707317027, -369.0, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -19.0, -4750.0, nan, -92.0, nan, -280.0, nan], 'Jul-21': [nan, -43949.0, nan, nan, nan, -79696.0, nan, -7291.0, nan, nan, nan, -19.0, -15746.0, nan, nan, -34.8231707317027, -319.0, nan], 'Aug-21': [nan, -8451.0, nan, -27973.0, nan, -675.0, -1423.0, nan, nan, nan, nan, -19.0, nan, -10480.0, nan, nan, -341.0, nan], 'Sep-21': [nan, nan, nan, -25934.0, nan, nan, nan, -5898.0, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0], 'Oct-21': [nan, nan, -4773.0, nan, nan, nan, -709.0, nan, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0]}

Note: I have implemented my code on the basis of the Test data provided. The sample data is merely to focus on the columns that are supposed to be manipulated.



from How to manipulate data cell by cell in pandas df?

No comments:

Post a Comment