Hi Guys i am working on a dataset containing following example:
the data contains start_time, end_time, id and url. for one id and url group i have different in and out values the problem is that in and out values are in different rows, i want to fill the missing end_time/start_time values. for this i have to use the following logic:
- if i have values in start_time and end time is null then i have to fill the end_time with the closest end_time considering end_time >= start_time and delete the used/matched row
- after all the rows having star_time are filled and used/matched rows are deleted, and still remain some rows with empty start_time, then i have to fill the start_time with the same value as end_time.
- if no matching end_time value is found for the given start_time then i have to fill the end_time value with the same start_time value.
considering the above things in mind the expected result should be similar as following, i am giving output in two stages so that its easy to understand
currently i am using the following way to achieve this but i feel its not optimized:
def process(self, param, context):
df = context['data']
# df = df.drop_duplicates()
key_cols = param['keys_cols']
start_time_col = param['start_time_col']
end_time_col = param['end_time_col']
guid_col = param.get('guid_col','guid')
df_groupby = df.groupby(key_cols).size().reset_index()
final_dfs = []
condition = ''
for key in key_cols:
if condition == '':
condition = '(df[\''+str(key)+"\']==row[\'"+str(key)+"\'])"
else:
condition = condition + ' & ' +'(df[\'' + str(key) + "\']==row[\'" + str(key) + "\'])"
for index, row in df_groupby.iterrows():
sub_df = df[eval(condition)]
if sub_df[start_time_col].isnull().sum() != len(sub_df[start_time_col]) and (sub_df[end_time_col].isnull().sum() != len(sub_df[end_time_col])):
sub_df = sub_df.sort_values([start_time_col, end_time_col], ascending=True)
subdf_start_time_not_null = sub_df[sub_df[start_time_col].notnull()]
subdf_end_time_not_null = sub_df[sub_df[end_time_col].notnull()]
subdf_end_time_not_null['combined'] = subdf_end_time_not_null[end_time_col] +"__"+ subdf_end_time_not_null[guid_col]
end_time_values = subdf_end_time_not_null['combined'].values.tolist()
for row_number, (stime_index, stime_row) in enumerate(subdf_start_time_not_null.iterrows()):
delete_index = row_number
if row_number < len(end_time_values):
end_time_value = np.nan
if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[row_number]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
end_time_value = end_time_values[row_number]
subdf_start_time_not_null.at[stime_index,end_time_col] = str(end_time_values[row_number]).split("__")[0]
else:
prev_index = end_time_values.index(end_time_values[row_number])
for end_time in end_time_values:
current_index = end_time_values.index(end_time)
if current_index > prev_index:
if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[current_index]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
subdf_start_time_not_null.at[stime_index, end_time_col] = end_time_values[current_index]
delete_index = current_index
end_time_value = end_time_values.pop(delete_index)
break
subdf_end_time_not_null = subdf_end_time_not_null[subdf_end_time_not_null[guid_col]!=end_time_value.split("__")[1]]
else:
subdf_start_time_not_null.at[stime_index,end_time_col] = subdf_start_time_not_null.at[stime_index,start_time_col]
subdf_end_time_not_null.drop('combined', axis=1, inplace=True)
sub_df = pd.concat([subdf_start_time_not_null,subdf_end_time_not_null])
sub_df[start_time_col] = np.where(sub_df[start_time_col].isnull(),sub_df[end_time_col],sub_df[start_time_col])
sub_df[end_time_col] = np.where(sub_df[end_time_col].isnull(),sub_df[start_time_col],sub_df[end_time_col])
final_dfs.append(sub_df)
# LOGGER.info('do something' +str(index))
df = pd.concat(final_dfs)
context['data'] = df
context['continue'] = True
return context
where param is as following:
param = {"keys_cols":['id', 'url'], "start_time_col":"start_time","end_time_col":"end_time"}
and "df" is the data.
please help to review and suggest how to make it more optimized, i have more than 70000 rows of data with more than 12000 pairs of id and urls in one file
looking forward to you guys.
Thanks
from pandas combine rows based on conditions
No comments:
Post a Comment