Tuesday, 23 August 2022

window timedelta not recording all records

This is in relations to a previous post from here:

I am trying to build a set of analytics that would catch students sharing answers. I need to have the format so that I can create some nice plots that would show the results. Not only that but I need to have the evidence to back up my findings. Here, I am working on an analytic that would look at the challenges that a student answered within 5 minutes from each other. I need the timedelta window of 5 minutes to ignore the times that have already been counted for. I have already scowled many pandas functions to see if there was something that would do the like of what I've been building from a modified version of a rolling list. I need the 5 minute window to be the span of one record back down to the second. My end goal is to have a dataframe looking like this:

ValCountTimeWin_df
                 Time   5   10
0 2022-06-22 14:00:06  11    0
1 2022-06-22 14:04:27   4    0
2 2022-06-22 14:15:08   4    0
3 2022-06-22 14:24:34   1    1
3 2022-06-22 14:26:49   0    2

I have a rolling class that should record a count for each of the various values seen within a timedelta window of 5 minutes.

import pandas as pd

df = pd.DataFrame(
    data={
        "Value": [0,0,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,10,10,10],
        "Time": ['2022-06-22 13:55:01',
                '2022-06-22 13:55:53',
                '2022-06-22 13:56:25',
                '2022-06-22 13:56:32',
                '2022-06-22 13:56:39',
                '2022-06-22 13:56:48',
                '2022-06-22 13:58:49',
                '2022-06-22 13:58:57',
                '2022-06-22 13:59:28',
                '2022-06-22 13:59:37',
                '2022-06-22 13:59:46',
                '2022-06-22 13:59:57',
                '2022-06-22 14:00:06',
                '2022-06-22 14:01:30',
                '2022-06-22 14:02:11',
                '2022-06-22 14:03:42',
                '2022-06-22 14:04:27',
                '2022-06-22 14:10:50',
                '2022-06-22 14:11:25',
                '2022-06-22 14:12:40',
                '2022-06-22 14:15:08',
                '2022-06-22 14:19:33',
                '2022-06-22 14:24:34',
                '2022-06-22 14:25:13',
                '2022-06-22 14:26:49',
                ],
    }   
)
df["Time"] = pd.to_datetime(df["Time"],format='%Y-%m-%d %H:%M:%S')
print(df.info(verbose=True))
print(df)

class ValueRollingList:
    def __init__(self,T='5T'):
        self.cur = pd.DataFrame(columns=['Value','Time'])
        self.window = pd.Timedelta(T)
        self.new_df = pd.DataFrame()
        self.last_window = list()
    
    def __add(self, row):
        idx = self.cur.index.max()
        new_idx = idx+1 if idx==idx else 0
        self.cur.loc[new_idx] = row[['Value','Time']]
    
    def handle_row(self, row):
        # Add the row before anything else to make sure we process all rows
        self.__add(row)
        
        self.cur = self.cur[~self.cur.Value.eq(0)]
        
        window_mask = (row['Time'] - self.cur['Time']).abs() <= self.window
        
        if ~window_mask.all():
            # Select all rows from 0 to one before the current row in "self.cur"
            # This would mean, process the rows from 0 to the current row (Including it) from the original data
            df = self.cur.iloc[:row.name]
            if len(df)>0:
                df= df[~df.Time.isin(self.last_window)]
                if len(self.cur)==11:
                    print("**************")
                    print("self.cur")
                    print(self.cur)
                    print("df")
                    print(df)
                    print('self.last_window')
                    print(self.last_window)
                    print('window_mask')
                    print(window_mask)
                    print('self.window')
                    print(self.window)
                    print((row['Time'] - self.cur['Time']).abs())
                    print((row['Time'] - self.cur['Time']))
                    print(row['Time'])
                    print(self.cur['Time'])
                    print("**************")
                df2=df[['Value']].value_counts().reset_index(inplace=False)
                df2.columns = ["Value","Count"]

                # Only record the time windows that have more than one record
                if len(df)>2:
                    print('#######################')
                    print(df['Time'])
                    print(len(df))
                    self.last_window = self.last_window + (df["Time"].tolist())
                    print('-------------------')
                    print("self.last_window count: %d" %(len(self.last_window)))
                    print('-------------------')
                    print("df count: %d"%(len(df)))
                    print(df['Time'])
                    print(df)
                    print('$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$')
                    print('about to be added')
                    print(df.tail(1).reset_index(drop=True).drop(columns='Value').join(df2.pivot_table(columns='Value').reset_index(drop=True),how='outer'))
                    if len(df)>0:
                        self.new_df=pd.concat([self.new_df,df.tail(1).reset_index(drop=True).drop(columns='Value').join(df2.pivot_table(columns='Value').reset_index(drop=True),how='outer')])   
        
        return
    
    def dump_last(self):
        return self.new_df.reset_index(inplace=False).drop('index',axis=1).fillna(0)
    
rolling_list = ValueRollingList('5T')
df.apply(rolling_list.handle_row, axis=1)
ValCountTimeWin_df = rolling_list.dump_last()
print("ValCountTimeWin_df")
print(ValCountTimeWin_df)

When changing the dataframe to contain different times and values as posted above I noticed that I get a final output for ValCOuntTimeWin_df:

ValCountTimeWin_df
                 Time   5   10
0 2022-06-22 14:00:06  11    0
1 2022-06-22 14:04:27   4    0
2 2022-06-22 14:15:08   4    0
3 2022-06-22 14:24:34   1    1
3 2022-06-22 14:26:49   0    2

Correct my math if I'm wrong but should the output not be something like this?

ValCountTimeWin_df
                 Time     5   10
0 2022-06-22 14:01:30  12.0  0.0
1 2022-06-22 14:04:27   3.0  0.0
2 2022-06-22 14:15:08   4.0  0.0
3 2022-06-22 14:19:33   1.0  0.0
3 2022-06-22 14:26:49   0.0  3.0

I believe it's because of

if len(df)>2:

That was supposed to only record the values for the timedelta windows more than 2 but it's actually only recording them in chunks of 3 and not more.I suppose it's doing exactly what I told it to do but not what I really want it to do.

I don't understand why the first record went up to 12 but all of the others did not...Can anyone help me?

##################################################### EDIT:

I might be wrong but I believe it has to do with the window_mask that is probably not resetting after it added a new count record into self.new_df.

I NEED the solution to be in the form of a class that reflects this one.



from window timedelta not recording all records

No comments:

Post a Comment