This is in relations to a previous post from here:
I am trying to build a set of analytics that would catch students sharing answers. I need to have the format so that I can create some nice plots that would show the results. Not only that but I need to have the evidence to back up my findings. Here, I am working on an analytic that would look at the challenges that a student answered within 5 minutes from each other. I need the timedelta window of 5 minutes to ignore the times that have already been counted for. I have already scowled many pandas functions to see if there was something that would do the like of what I've been building from a modified version of a rolling list. I need the 5 minute window to be the span of one record back down to the second. My end goal is to have a dataframe looking like this:
ValCountTimeWin_df
Time 5 10
0 2022-06-22 14:00:06 11 0
1 2022-06-22 14:04:27 4 0
2 2022-06-22 14:15:08 4 0
3 2022-06-22 14:24:34 1 1
3 2022-06-22 14:26:49 0 2
I have a rolling class that should record a count for each of the various values seen within a timedelta window of 5 minutes.
import pandas as pd
df = pd.DataFrame(
data={
"Value": [0,0,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,10,10,10],
"Time": ['2022-06-22 13:55:01',
'2022-06-22 13:55:53',
'2022-06-22 13:56:25',
'2022-06-22 13:56:32',
'2022-06-22 13:56:39',
'2022-06-22 13:56:48',
'2022-06-22 13:58:49',
'2022-06-22 13:58:57',
'2022-06-22 13:59:28',
'2022-06-22 13:59:37',
'2022-06-22 13:59:46',
'2022-06-22 13:59:57',
'2022-06-22 14:00:06',
'2022-06-22 14:01:30',
'2022-06-22 14:02:11',
'2022-06-22 14:03:42',
'2022-06-22 14:04:27',
'2022-06-22 14:10:50',
'2022-06-22 14:11:25',
'2022-06-22 14:12:40',
'2022-06-22 14:15:08',
'2022-06-22 14:19:33',
'2022-06-22 14:24:34',
'2022-06-22 14:25:13',
'2022-06-22 14:26:49',
],
}
)
df["Time"] = pd.to_datetime(df["Time"],format='%Y-%m-%d %H:%M:%S')
print(df.info(verbose=True))
print(df)
class ValueRollingList:
def __init__(self,T='5T'):
self.cur = pd.DataFrame(columns=['Value','Time'])
self.window = pd.Timedelta(T)
self.new_df = pd.DataFrame()
self.last_window = list()
def __add(self, row):
idx = self.cur.index.max()
new_idx = idx+1 if idx==idx else 0
self.cur.loc[new_idx] = row[['Value','Time']]
def handle_row(self, row):
# Add the row before anything else to make sure we process all rows
self.__add(row)
self.cur = self.cur[~self.cur.Value.eq(0)]
window_mask = (row['Time'] - self.cur['Time']).abs() <= self.window
if ~window_mask.all():
# Select all rows from 0 to one before the current row in "self.cur"
# This would mean, process the rows from 0 to the current row (Including it) from the original data
df = self.cur.iloc[:row.name]
if len(df)>0:
df= df[~df.Time.isin(self.last_window)]
if len(self.cur)==11:
print("**************")
print("self.cur")
print(self.cur)
print("df")
print(df)
print('self.last_window')
print(self.last_window)
print('window_mask')
print(window_mask)
print('self.window')
print(self.window)
print((row['Time'] - self.cur['Time']).abs())
print((row['Time'] - self.cur['Time']))
print(row['Time'])
print(self.cur['Time'])
print("**************")
df2=df[['Value']].value_counts().reset_index(inplace=False)
df2.columns = ["Value","Count"]
# Only record the time windows that have more than one record
if len(df)>2:
print('#######################')
print(df['Time'])
print(len(df))
self.last_window = self.last_window + (df["Time"].tolist())
print('-------------------')
print("self.last_window count: %d" %(len(self.last_window)))
print('-------------------')
print("df count: %d"%(len(df)))
print(df['Time'])
print(df)
print('$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$')
print('about to be added')
print(df.tail(1).reset_index(drop=True).drop(columns='Value').join(df2.pivot_table(columns='Value').reset_index(drop=True),how='outer'))
if len(df)>0:
self.new_df=pd.concat([self.new_df,df.tail(1).reset_index(drop=True).drop(columns='Value').join(df2.pivot_table(columns='Value').reset_index(drop=True),how='outer')])
return
def dump_last(self):
return self.new_df.reset_index(inplace=False).drop('index',axis=1).fillna(0)
rolling_list = ValueRollingList('5T')
df.apply(rolling_list.handle_row, axis=1)
ValCountTimeWin_df = rolling_list.dump_last()
print("ValCountTimeWin_df")
print(ValCountTimeWin_df)
When changing the dataframe to contain different times and values as posted above I noticed that I get a final output for ValCOuntTimeWin_df
:
ValCountTimeWin_df
Time 5 10
0 2022-06-22 14:00:06 11 0
1 2022-06-22 14:04:27 4 0
2 2022-06-22 14:15:08 4 0
3 2022-06-22 14:24:34 1 1
3 2022-06-22 14:26:49 0 2
Correct my math if I'm wrong but should the output not be something like this?
ValCountTimeWin_df
Time 5 10
0 2022-06-22 14:01:30 12.0 0.0
1 2022-06-22 14:04:27 3.0 0.0
2 2022-06-22 14:15:08 4.0 0.0
3 2022-06-22 14:19:33 1.0 0.0
3 2022-06-22 14:26:49 0.0 3.0
I believe it's because of
if len(df)>2:
That was supposed to only record the values for the timedelta windows more than 2 but it's actually only recording them in chunks of 3 and not more.I suppose it's doing exactly what I told it to do but not what I really want it to do.
I don't understand why the first record went up to 12 but all of the others did not...Can anyone help me?
##################################################### EDIT:
I might be wrong but I believe it has to do with the window_mask
that is probably not resetting after it added a new count record into self.new_df
.
I NEED the solution to be in the form of a class that reflects this one.
from window timedelta not recording all records
No comments:
Post a Comment