Let the sample df (df1) be,
We can achieve df2 or final data-frame by manipulating the data of df1 in the following manner,
Step 1: Remove all positive numbers including zeros
After Step 1 the sample data should look like,
Step 2: If column4
A row is a negative number and column4
B is blank, then remove the -ve number of column4
A row
Step 3: If column4
A row is blank and column4
B is a negative number, then keep the -ve number of column4
B row
After Steps 1,2 and 3 are done,
Step 4: If both A and B of column4
are negative then,
For each A and B row of column4
, check the left-side (LHS) value (for a given month) of the same A and B row of column4
Step 4.1: If either of the LHS values of A or B is a -ve number, then delete the current row value of B column4
and keep the current row value of A column4
After Step 4.1, the sample data should look like this,
Step 4.2:
If the LHS value of A and B column4
is blank, then keep the current row value of B column4
and delete the current row value of A column4
Sample data after Step 4.2 should look like,
Since we see two negative numbers still, we perform Step 4.1 again and then the final data-frame or df2 will look like,
How may I achieve the above using pandas? I was able to achieve till Step 1 but have no idea as to how to proceed further. Any help would be greatly appreciated.
This is the approach that I took,
import pandas as pd
df = pd.read_excel('df1.xlsx', engine='openpyxl')
df.to_pickle('./df1.pkl')
unpickled_df = pd.read_pickle('./df1.pkl')
rem_cols = ['column2', 'column3', 'column5', 'column6', 'column7']
unpickled_df['g'] = unpickled_df.groupby(['column1', 'column4'] ).cumcount()
df1 = unpickled_df.drop(rem_cols, axis=1)
df1 = df1.set_index(['column1','g', 'column4'])
df1.columns = pd.to_datetime(df1.columns, format='%b-%y').strftime('%b-%y')
first_date = df1.columns[0]
df1 = df1.unstack(-1)
df1 = df1.mask(df1.ge(0))
m1 = (df1.xs('A', level=1, axis=1, drop_level=False).notna() &
df1.xs('B', level=1, axis=1, drop_level=False).rename(columns={'B':'A'}, level=1).isna())
m2 = (df1.xs('B', level=1, axis=1, drop_level=False).notna() &
df1.xs('A', level=1, axis=1, drop_level=False).rename(columns={'A':'B'}, level=1).isna())
m = m1.join(m2)
df1 = df1.mask(m)
df2 = df1.groupby(level=1, axis=1).shift(1, axis=1)
mask1 = df1.notna() & df2.isna() & (df1.columns.get_level_values(1) == 'A')[ None, :]
mask1[first_date] = False
mask2 = df1.notna() & df2.notna() & (df1.columns.get_level_values(1) == 'B')[ None, :]
df1 = df1.mask(mask1).mask(mask2).stack(dropna=False)
unpickled_df = unpickled_df[rem_cols + ['column1','g', 'column4']].join(df1, on=['column1','g', 'column4'])
#print(unpickled_df)
Small Test data: df1,
{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [0, 10, 0, 0], 'Mar-21': [0, 0, 70, 70], 'Apr-21': [-10, -10, -8, 60], 'May-21': [-30, -60, -10, 40], 'Jun-21': [-20, 9, -40, -20], 'Jul-21': [30, -10, 0, -20], 'Aug-21': [-30, -20, 0, -20], 'Sep-21': [0, -15, 0, -20], 'Oct-21': [0, -15, 0, -20]}
df2 (expected output),
{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan], 'Apr-21': [nan, -10.0, nan, nan], 'May-21': [-30.0, nan, nan, nan], 'Jun-21': [nan, nan, nan, -20.0], 'Jul-21': [nan, -10.0, nan, -20.0], 'Aug-21': [-30.0, nan, nan, -20.0], 'Sep-21': [nan, -15.0, nan, -20.0], 'Oct-21': [nan, -15.0, nan, -20.0]}
Test data:
df1
{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [11655, 0, 0, 0, 121117, 0, 14948, 0, 0, 0, 0, 0, 0, 0, 1838, 0, 0, 0], 'Mar-21': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -16474.0, -16474.0, 7000.0, 7000.0, -19946.0, -19946.0, 16084.44444444444, 0.0, 0.0, 0.0], 'Apr-21': [104815.0, 104815.0, 17949.0, 17949.0, 96132.0, 96132.0, 0.0, 0.0, -17001.0, -33475.0, -878.0, 6122.0, 8398.0, -11548.0, -5297.073170731703, -5297.073170731703, -282.0, -282.0], 'May-21': [78260.0, 183075.0, 42557.0, 60506.0, -15265.0, 80867.0, -18.0, -18.0, 21084.0, -12391.0, -1831.0, 4291.0, 2862.0, -8686.0, 5261.25, -35.8231707317027, -369.0, -651.0], 'Jun-21': [-52480.0, 130595.0, -13258.0, 47248.0, -35577.0, 45290.0, 2434.0, 2416.0, 31147.0, 18756.0, -4310.0, -19.0, -4750.0, -13436.0, -92.0, -127.8231707317027, -280.0, -931.0], 'Jul-21': [-174544.0, -43949.0, -38127.0, 9121.0, -124986.0, -79696.0, -9707.0, -7291.0, 13577.0, 32333.0, 0.0, -19.0, -15746.0, -29182.0, 93.0, -34.8231707317027, -319.0, -1250.0], 'Aug-21': [35498.0, -8451.0, -37094.0, -27973.0, 79021.0, -675.0, -1423.0, -8714.0, 32168.0, 64501.0, 0.0, -19.0, 18702.0, -10480.0, 4347.634146341465, 4312.810975609762, -341.0, -1591.0], 'Sep-21': [44195.0, 35744.0, 2039.0, -25934.0, 70959.0, 70284.0, 2816.0, -5898.0, 38359.0, 102860.0, 0.0, -19.0, 18119.0, 7639.0, 5302.222222222219, 9615.033197831981, 0.0, -1591.0], 'Oct-21': [-13163.0, 22581.0, -4773.0, -30707.0, 205080.0, 275364.0, -709.0, -6607.0, -1397.0, 101463.0, 0.0, -19.0, 0.0, 7639.0, -34.0, 9581.033197831981, 0.0, -1591.0]}
df2 (expected output) ,
{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, -16474.0, nan, nan, nan, -19946.0, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, -17001.0, nan, nan, nan, nan, -11548.0, nan, -5297.073170731703, nan, -282.0], 'May-21': [nan, nan, nan, nan, nan, nan, nan, -18.0, nan, -12391.0, nan, nan, nan, -8686.0, nan, -35.8231707317027, -369.0, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -19.0, -4750.0, nan, -92.0, nan, -280.0, nan], 'Jul-21': [nan, -43949.0, nan, nan, nan, -79696.0, nan, -7291.0, nan, nan, nan, -19.0, -15746.0, nan, nan, -34.8231707317027, -319.0, nan], 'Aug-21': [nan, -8451.0, nan, -27973.0, nan, -675.0, -1423.0, nan, nan, nan, nan, -19.0, nan, -10480.0, nan, nan, -341.0, nan], 'Sep-21': [nan, nan, nan, -25934.0, nan, nan, nan, -5898.0, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0], 'Oct-21': [nan, nan, -4773.0, nan, nan, nan, -709.0, nan, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0]}
Note: I have implemented my code on the basis of the Test data provided. The sample data is merely to focus on the columns that are supposed to be manipulated.
from How to manipulate data cell by cell in pandas df?
No comments:
Post a Comment