Monday, 16 January 2023

Improving performance for a nested for loop iterating over dates

I am looking to learn how to improve the performance of code over a large dataframe (10 million rows) and my solution loops over multiple dates (2023-01-10, 2023-01-20, 2023-01-30) for different combinations of category_a and category_b (can also use random_str to run the function over multiple cores using multiprocessing).

The working approach is shown below, which iterates over the dates for different pairings of the two-category data by first locating a subset of a particular pair. However, I would want to refactor it to see if there is an approach that is more efficient.

My input (df) looks like:

date category_a category_b outflow open inflow max close buy random_str
0 2023-01-10 4 1 1 0 0 10 0 0 a
1 2023-01-20 4 1 2 0 0 20 nan nan a
2 2023-01-30 4 1 10 0 0 20 nan nan a
3 2023-01-10 4 2 2 0 0 10 0 0 b
4 2023-01-20 4 2 2 0 0 20 nan nan b
5 2023-01-30 4 2 0 0 0 20 nan nan b

with 2 pairs (4, 1) and (4,2) over the days and my expected output (results) looks like this:

date category_a category_b outflow open inflow max close buy random_str
0 2023-01-10 4 1 1 0 0 10 -1 23 a
1 2023-01-20 4 1 2 -1 23 20 20 10 a
2 2023-01-30 4 1 10 20 10 20 20 nan a
3 2023-01-10 4 2 2 0 0 10 -2 24 b
4 2023-01-20 4 2 2 -2 24 20 20 0 b
5 2023-01-30 4 2 0 20 0 20 20 nan b

I have a working solution using pandas dataframes to take a subset then loop over it to get a solution but I would like to see how I can improve the performance of this using perhaps ;numpy, numba, pandas-multiprocessing or dask. Another great idea was to rewrite it in BigQuery SQL.

I am not sure what the best solution would be and I would appreciate any help in improving the performance.

Minimum working example

The code below generates the input dataframe.

import pandas as pd
import numpy as np

# prepare the input  df
df = pd.DataFrame({
'date' : ['2023-01-10', '2023-01-20','2023-01-30', '2023-01-10', '2023-01-20','2023-01-30'] ,
'category_a' : [4, 4,4,4, 4, 4] ,
'category_b' : [1, 1,1, 2, 2,2] ,
'outflow' : [1, 2,10, 2, 2, 0],
'open' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'inflow' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'max' : [10, 20, 20 , 10, 20,  20] ,
'close' : [0, np.nan,np.nan, 0, np.nan, np.nan] ,
'buy' : [0, np.nan,np.nan, 0, np.nan,np.nan],
'random_str' : ['a', 'a', 'a', 'b', 'b', 'b'] 
})

df['date'] = pd.to_datetime(df['date'])

# get unique pairs of category_a and category_b in a dictionary
unique_pairs = df.groupby(['category_a', 'category_b']).size().reset_index().rename(columns={0:'count'})[['category_a', 'category_b']].to_dict('records')
unique_dates = np.sort(df['date'].unique())

Using this input dataframe and Numpy, the code below is what I am trying to optmizize.

df = df.set_index('date')
day_0 = unique_dates[0] # first date

# Using Dictionary comprehension

list_of_numbers = list(range(len(unique_pairs)))
myset  = {key: None for key in list_of_numbers}

for count_pair, value in enumerate(unique_pairs):
    
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = df.loc[(df['category_a'] == category_a) & (df['category_b'] == category_b)]

    log.info(f" running for {category_a} and {category_b}")

    # day 0
    df_subset.loc[day_0, 'close'] = df_subset.loc[day_0, 'open'] + df_subset.loc[day_0, 'inflow'] - df_subset.loc[day_0, 'outflow']
    
    
    # loop over single pair using date
    for count, date in enumerate(unique_dates[1:], start=1):
        previous_date = unique_dates[count-1]

        df_subset.loc[date, 'open'] = df_subset.loc[previous_date, 'close']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']

        # check if closing value is negative, if so, set inflow to buy for next weeks deficit

        if df_subset.loc[date, 'close'] < df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'max'] - df_subset.loc[date, 'close'] + df_subset.loc[date, 'inflow']
        elif df_subset.loc[date, 'close'] > df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = 0
        else:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'inflow']
        
        df_subset.loc[date, 'inflow'] = df_subset.loc[previous_date, 'buy']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
    
    # store all the dataframes in a container myset
    myset[count_pair] = df_subset
    
# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result

After which we can check that the solution is the same as what we expected.

from pandas.testing import assert_frame_equal

expected = pd.DataFrame({
'date' : [pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00'), pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00')] ,
'category_a' : [4, 4, 4, 4, 4, 4] ,
'category_b' : [1, 1, 1, 2, 2, 2] ,
'outflow' : [1, 2, 10, 2, 2, 0] ,
'open' : [0.0, -1.0, 20.0, 0.0, -2.0, 20.0] ,
'inflow' : [0.0, 23.0, 10.0, 0.0, 24.0, 0.0] ,
'max' : [10, 20, 20, 10, 20, 20] ,
'close' : [-1.0, 20.0, 20.0, -2.0, 20.0, 20.0] ,
'buy' : [23.0, 10.0, np.nan, 24.0, 0.0, np.nan] ,
'random_str' : ['a', 'a', 'a', 'b', 'b', 'b'] 
})

# check that the result is the same as expected
assert_frame_equal(result, expected)

SQL to create first table

The solution can also be in sql, if so you can use the following code to create the initial table.

I am busy trying to implement a solution in big query sql using a user defined function to keep the logic going too. This would be a nice approach to solving the problem too.

WITH data AS (
  SELECT 
    DATE '2023-01-10' as date, 4 as category_a, 1 as category_b, 1 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'a' as random_str
  UNION ALL
  SELECT 
    DATE '2023-01-20' as date, 4 as category_a, 1 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT 
    DATE '2023-01-30' as date, 4 as category_a, 1 as category_b, 10 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT 
    DATE '2023-01-10' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'b' as random_str
  UNION ALL
  SELECT 
    DATE '2023-01-20' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
  UNION ALL
  SELECT 
    DATE '2023-01-30' as date, 4 as category_a, 2 as category_b, 0 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
)

SELECT 
  ROW_NUMBER() OVER (ORDER BY date) as " ",
  date,
  category_a,
  category_b,
  outflow,
  open,
  inflow,
  max,
  close,
  buy,
  random_str
FROM data


from Improving performance for a nested for loop iterating over dates

No comments:

Post a Comment