Tuesday, 4 January 2022

Thread-safe way to use preloaded data in a for-loop

Suppose we apply a set of (in-place) operations within a for-loop on mostly the same fundamental data (which is mutable). What is a memory-efficient (and thread-safe) way to do so?

Note that the fundamental data must not be altered within the for-loop from iteration to iteration.
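To make this concrete: a shallow copy of the dictionary would not be enough, because the contained DataFrames would still be shared between the copies. A small toy illustration (with hypothetical data, just to show the difference):

import copy
import pandas as pd

base = {'price': pd.DataFrame({'IBM': [1.0, 2.0]})}

shallow = copy.copy(base)           # new dict, but the same DataFrame object
shallow['price'].iloc[0, 0] = 99.0
print(base['price'].iloc[0, 0])     # 99.0 -> the original data was altered

deep = copy.deepcopy(base)          # the DataFrame itself is copied as well
deep['price'].iloc[0, 0] = -1.0
print(base['price'].iloc[0, 0])     # still 99.0 -> the original is untouched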

Example Code:

Assume we have some Excel files containing fundamental data in a data directory. Further, we have some additional data in a some_more_data directory. I want to apply operations to the data retrieved from the data directory using the files from the some_more_data directory. Afterwards, I want to write the results to a new pickle file.

import copy
import pickle
import pandas as pd

# Excel import function to obtain a dictionary of pandas DataFrames.
def read_data(info_dict):
    data_dict = dict()
    for dname, dpath in info_dict.items():
        data_dict[dname] = pd.read_excel(dpath, index_col=0)
    return data_dict

# dictionary of data files
data_list = {'price': 'data/price.xlsx', 'eps': 'data/eps.xlsx'}
raw_data = read_data(data_list)

# dictionary of files used for the operations (they, for example, have different indices)
some_more_data = {
    'some_data_a': 'some_more_data/some_data_a.xlsx',
    'some_data_b': 'some_more_data/some_data_b.xlsx'
    }
some_more_data = read_data(some_more_data)

# Apply operations to the data (explicitly using a for-loop)
for smd_k, smd_v in some_more_data.items():
    rdata = copy.deepcopy(raw_data)

    rdata['price'] = rdata['price'].reindex(smd_v.index)
    rdata['eps'] = rdata['eps'].reindex(columns=smd_v.columns)

    with open(f'data/changed_{smd_k}.pkl', 'wb') as handle:
        pickle.dump(rdata, handle, protocol=pickle.HIGHEST_PROTOCOL)

Is the deepcopy operation in my example above thread-safe (assuming I want to use multithreading)? Or should I repeatedly load the data from Excel within the for-loop (very slow)? Or is there an even better way?
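For illustration, here is a minimal sketch of how I imagine the multithreaded version, assuming concurrent.futures.ThreadPoolExecutor is used and the loop body is moved into a hypothetical helper process(); each task works on its own deep copy of the shared raw_data and writes its own pickle file:

from concurrent.futures import ThreadPoolExecutor

def process(smd_k, smd_v):
    # private copy per task; raw_data itself is only read, never modified
    rdata = copy.deepcopy(raw_data)
    rdata['price'] = rdata['price'].reindex(smd_v.index)
    rdata['eps'] = rdata['eps'].reindex(columns=smd_v.columns)
    with open(f'data/changed_{smd_k}.pkl', 'wb') as handle:
        pickle.dump(rdata, handle, protocol=pickle.HIGHEST_PROTOCOL)

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(process, k, v) for k, v in some_more_data.items()]
    for future in futures:
        future.result()  # re-raise any exception from the worker threads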

Thank you for your help.


Code to Generate Sample DataFrames and Save Data in Excel Files

Note that the directories data and some_more_data must be created manually first (or programmatically; see the snippet after the code below).

import pandas as pd
import numpy as np


price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL', 'FB', 'TWTR'],
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -0.32, -0.84,  0.37],
                    [-0.56,  0.02,  0.56,  1.77,  0.99,  0.97],
                    [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                    [-0.80, -0.43, -0.12,  1.06,  1.59,  0.34]],
                   columns=['IBM', 'MSFT', 'APPL', 'ORCL', 'FB', 'TWTR'],
                   index=pd.date_range('2000', freq='D', periods=4))

some_data_a = pd.DataFrame(np.random.randint(0, 100, size=(4, 6)),
                           columns=['IBM', 'MSFT', 'APPL', 'ORCL', 'FB', 'TWTR'],
                           index=pd.date_range('2001', freq='D', periods=4))
some_data_b = pd.DataFrame(np.random.randint(0, 100, size=(20, 6)),
                           columns=['GM', 'TSLA', 'IBM', 'MSFT', 'APPL', 'ORCL'],
                           index=pd.date_range('2000', freq='D', periods=20))

price.to_excel('data/price.xlsx')
eps.to_excel('data/eps.xlsx')
some_data_a.to_excel('some_more_data/some_data_a.xlsx')
some_data_b.to_excel('some_more_data/some_data_b.xlsx')
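
As an aside, instead of creating the two directories by hand, they could also be created from within the script, for example:

import os

# create the target directories if they do not exist yet
os.makedirs('data', exist_ok=True)
os.makedirs('some_more_data', exist_ok=True)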



