I am looking to learn how to improve the performance of code over a large dataframe (10 million rows) and my solution loops over multiple dates (2023-01-10, 2023-01-20, 2023-01-30) for different combinations of category_a and category_b (can also use random_str to run the function over multiple cores using multiprocessing).
The working approach is shown below, which iterates over the dates for different pairings of the two-category data by first locating a subset of a particular pair. However, I would want to refactor it to see if there is an approach that is more efficient.
My input (df) looks like:
| date | category_a | category_b | outflow | open | inflow | max | close | buy | random_str | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-10 | 4 | 1 | 1 | 0 | 0 | 10 | 0 | 0 | a |
| 1 | 2023-01-20 | 4 | 1 | 2 | 0 | 0 | 20 | nan | nan | a |
| 2 | 2023-01-30 | 4 | 1 | 10 | 0 | 0 | 20 | nan | nan | a |
| 3 | 2023-01-10 | 4 | 2 | 2 | 0 | 0 | 10 | 0 | 0 | b |
| 4 | 2023-01-20 | 4 | 2 | 2 | 0 | 0 | 20 | nan | nan | b |
| 5 | 2023-01-30 | 4 | 2 | 0 | 0 | 0 | 20 | nan | nan | b |
with 2 pairs (4, 1) and (4,2) over the days and my expected output (results) looks like this:
| date | category_a | category_b | outflow | open | inflow | max | close | buy | random_str | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-10 | 4 | 1 | 1 | 0 | 0 | 10 | -1 | 23 | a |
| 1 | 2023-01-20 | 4 | 1 | 2 | -1 | 23 | 20 | 20 | 10 | a |
| 2 | 2023-01-30 | 4 | 1 | 10 | 20 | 10 | 20 | 20 | nan | a |
| 3 | 2023-01-10 | 4 | 2 | 2 | 0 | 0 | 10 | -2 | 24 | b |
| 4 | 2023-01-20 | 4 | 2 | 2 | -2 | 24 | 20 | 20 | 0 | b |
| 5 | 2023-01-30 | 4 | 2 | 0 | 20 | 0 | 20 | 20 | nan | b |
I have a working solution using pandas dataframes to take a subset then loop over it to get a solution but I would like to see how I can improve the performance of this using perhaps ;numpy, numba, pandas-multiprocessing or dask. Another great idea was to rewrite it in BigQuery SQL.
I am not sure what the best solution would be and I would appreciate any help in improving the performance.
Minimum working example
The code below generates the input dataframe.
import pandas as pd
import numpy as np
# prepare the input df
df = pd.DataFrame({
'date' : ['2023-01-10', '2023-01-20','2023-01-30', '2023-01-10', '2023-01-20','2023-01-30'] ,
'category_a' : [4, 4,4,4, 4, 4] ,
'category_b' : [1, 1,1, 2, 2,2] ,
'outflow' : [1, 2,10, 2, 2, 0],
'open' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'inflow' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'max' : [10, 20, 20 , 10, 20, 20] ,
'close' : [0, np.nan,np.nan, 0, np.nan, np.nan] ,
'buy' : [0, np.nan,np.nan, 0, np.nan,np.nan],
'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})
df['date'] = pd.to_datetime(df['date'])
# get unique pairs of category_a and category_b in a dictionary
unique_pairs = df.groupby(['category_a', 'category_b']).size().reset_index().rename(columns={0:'count'})[['category_a', 'category_b']].to_dict('records')
unique_dates = np.sort(df['date'].unique())
Using this input dataframe and Numpy, the code below is what I am trying to optmizize.
df = df.set_index('date')
day_0 = unique_dates[0] # first date
# Using Dictionary comprehension
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}
for count_pair, value in enumerate(unique_pairs):
# pair of category_a and category_b
category_a = value['category_a']
category_b = value['category_b']
# subset the dataframe for the pair
df_subset = df.loc[(df['category_a'] == category_a) & (df['category_b'] == category_b)]
log.info(f" running for {category_a} and {category_b}")
# day 0
df_subset.loc[day_0, 'close'] = df_subset.loc[day_0, 'open'] + df_subset.loc[day_0, 'inflow'] - df_subset.loc[day_0, 'outflow']
# loop over single pair using date
for count, date in enumerate(unique_dates[1:], start=1):
previous_date = unique_dates[count-1]
df_subset.loc[date, 'open'] = df_subset.loc[previous_date, 'close']
df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
# check if closing value is negative, if so, set inflow to buy for next weeks deficit
if df_subset.loc[date, 'close'] < df_subset.loc[date, 'max']:
df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'max'] - df_subset.loc[date, 'close'] + df_subset.loc[date, 'inflow']
elif df_subset.loc[date, 'close'] > df_subset.loc[date, 'max']:
df_subset.loc[previous_date, 'buy'] = 0
else:
df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'inflow']
df_subset.loc[date, 'inflow'] = df_subset.loc[previous_date, 'buy']
df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
# store all the dataframes in a container myset
myset[count_pair] = df_subset
# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
After which we can check that the solution is the same as what we expected.
from pandas.testing import assert_frame_equal
expected = pd.DataFrame({
'date' : [pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00'), pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00')] ,
'category_a' : [4, 4, 4, 4, 4, 4] ,
'category_b' : [1, 1, 1, 2, 2, 2] ,
'outflow' : [1, 2, 10, 2, 2, 0] ,
'open' : [0.0, -1.0, 20.0, 0.0, -2.0, 20.0] ,
'inflow' : [0.0, 23.0, 10.0, 0.0, 24.0, 0.0] ,
'max' : [10, 20, 20, 10, 20, 20] ,
'close' : [-1.0, 20.0, 20.0, -2.0, 20.0, 20.0] ,
'buy' : [23.0, 10.0, np.nan, 24.0, 0.0, np.nan] ,
'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})
# check that the result is the same as expected
assert_frame_equal(result, expected)
SQL to create first table
The solution can also be in sql, if so you can use the following code to create the initial table.
I am busy trying to implement a solution in big query sql using a user defined function to keep the logic going too. This would be a nice approach to solving the problem too.
WITH data AS (
SELECT
DATE '2023-01-10' as date, 4 as category_a, 1 as category_b, 1 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'a' as random_str
UNION ALL
SELECT
DATE '2023-01-20' as date, 4 as category_a, 1 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
UNION ALL
SELECT
DATE '2023-01-30' as date, 4 as category_a, 1 as category_b, 10 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
UNION ALL
SELECT
DATE '2023-01-10' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'b' as random_str
UNION ALL
SELECT
DATE '2023-01-20' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
UNION ALL
SELECT
DATE '2023-01-30' as date, 4 as category_a, 2 as category_b, 0 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
)
SELECT
ROW_NUMBER() OVER (ORDER BY date) as " ",
date,
category_a,
category_b,
outflow,
open,
inflow,
max,
close,
buy,
random_str
FROM data
from Improving performance for a nested for loop iterating over dates
No comments:
Post a Comment