Saturday, 22 August 2020

Hashing a pandas dataframe for calculated column caching

I am using composition to build a class that wraps a pandas DataFrame, as shown below. The class exposes a derived property that is computed from the base columns.

import numpy as np
import pandas as pd

class myclass:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
    @property
    def derived(self):
        return self.df.sum(axis=1)

myobj = myclass(np.random.randint(100, size=(100,6)))
d = myobj.derived

The calculation of derived is an expensive step, so I would like to cache its result with functools.lru_cache. However, lru_cache keys its cache on the hash of the arguments, so the object's hash has to reflect the contents of the DataFrame. I tried adding a __hash__ method to the class, as detailed in this answer: https://stackoverflow.com/a/47800021/3679377.
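For concreteness, here is a minimal sketch of the kind of setup described above. It is not necessarily the approach from the linked answer: the class name HashedFrame and the calls counter are hypothetical, and the hash is built with pd.util.hash_pandas_object, which is one way to hash the contents of a DataFrame.

```python
from functools import lru_cache

import numpy as np
import pandas as pd


class HashedFrame:
    """Hypothetical variant of myclass whose hash follows the frame contents."""

    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self.calls = 0  # counts real computations, for illustration only

    def __hash__(self):
        # Hashes every row on every call, so the cost grows with the frame.
        row_hashes = pd.util.hash_pandas_object(self.df, index=True)
        return hash(row_hashes.values.tobytes())

    @property
    @lru_cache(maxsize=8)
    def derived(self):
        # Runs only when lru_cache has no entry for the current hash.
        self.calls += 1
        return self.df.sum(axis=1)


obj = HashedFrame(np.arange(12).reshape(3, 4))
first = obj.derived   # computed
second = obj.derived  # served from the cache, but __hash__ still runs
obj.df.iloc[0, 0] = 100
third = obj.derived   # contents changed, so the hash changes and it recomputes
```

Every access to derived still calls __hash__, which walks the whole DataFrame, so the cache lookup itself costs time proportional to the size of the data.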

Now I run into a new problem: the hashing function is itself an expensive step. Is there any way around this, or have I reached a dead end?

Is there any better way to check if a dataframe has been modified and if not, keep returning the same hash?
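To make the idea of "checking whether the DataFrame has been modified" concrete, here is a rough sketch of one possible direction: a version counter that is bumped on every write, so no hashing is needed. The names VersionedFrame, set_value and calls are all hypothetical, and the sketch rests on a strong assumption: every write goes through set_value. Writes made directly on the underlying DataFrame would not be detected.

```python
import numpy as np
import pandas as pd


class VersionedFrame:
    """Hypothetical wrapper that invalidates its cache via a version counter."""

    def __init__(self, *args, **kwargs):
        self._df = pd.DataFrame(*args, **kwargs)
        self._version = 0          # bumped on every write made via set_value
        self._cached_version = -1  # version the cached result belongs to
        self._cached = None
        self.calls = 0             # counts real computations, for illustration

    def set_value(self, row, col, value):
        # Assumption: all writes go through this method. Writes that touch
        # self._df directly bypass the counter and leave a stale cache.
        self._df.loc[row, col] = value
        self._version += 1

    @property
    def derived(self):
        # Recompute only if the frame changed since the last computation.
        if self._cached_version != self._version:
            self.calls += 1
            self._cached = self._df.sum(axis=1)
            self._cached_version = self._version
        return self._cached


vobj = VersionedFrame(np.arange(12).reshape(3, 4))
before = vobj.derived      # computed once
again = vobj.derived       # cached, only an integer comparison is made
vobj.set_value(0, 0, 100)  # bumps the version
after = vobj.derived       # recomputed
```

The check here is a single integer comparison, but it only works if the DataFrame is never handed out for direct mutation, which may or may not be acceptable for a given design.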


