Suppose I have a function that generates a (Py)Spark data frame and caches it into memory as its last operation:
def gen_func(inputs):
    df = ...       # ... do stuff: build the DataFrame from `inputs` ...
    df.cache()     # mark the DataFrame for caching (lazy)
    df.count()     # action forces the cached data to materialize
    return df
In my understanding, Spark's caching works as follows:
- When cache/persist plus an action (count()) is called on a data frame, it is computed from its DAG and cached into memory, attached to the object that refers to it.
- As long as a reference to that object exists, possibly within other functions/other scopes, the df will continue to be cached, and all DAGs that depend on the df will use the in-memory cached data as a starting point.
- If all references to the df are deleted, Spark marks the cached data as eligible for garbage collection. It may not be collected immediately, which can cause short-term memory pressure (and in particular memory-leak-like behaviour if you generate cached data frames and discard them too quickly), but eventually it will be cleaned up. (See the sketch after this list.)
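To check my own mental model, here is a minimal sketch of that lifecycle on a toy data frame (the spark.range data and sizes are made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10**6)    # toy DataFrame, just for illustration
df.cache()                 # lazy: marks the plan for caching, nothing materialized yet
df.count()                 # action: computes the data and fills the cache
print(df.is_cached)        # True
print(df.storageLevel)     # e.g. StorageLevel(True, True, False, True, 1), i.e. MEMORY_AND_DISK

df.unpersist()             # explicitly drops the cached blocks
print(df.is_cached)        # False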
My question is: suppose I use gen_func to generate a data frame, but then overwrite the original data frame reference (perhaps with a filter or a withColumn):
df = gen_func(inputs)
df = df.filter("some_col = some_val")
In my understanding, RDDs/DataFrames in Spark are immutable, so the df after the filter and the df before the filter refer to two entirely different objects. In this case, the reference to the original df that was cached and counted has been overwritten. Does that mean that the cached data frame is no longer available and will be garbage collected? Does that mean that the new post-filter df will compute everything from scratch, even though it was generated from a previously cached data frame?
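For what it is worth, one way I have tried to reason about this is to look at the physical plan of the derived data frame and see whether it reads from the in-memory relation (again a toy sketch, not my real gen_func):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cached = spark.range(100).cache()
cached.count()                       # materialize the cache

derived = cached.filter("id > 50")   # new, immutable DataFrame object
derived.explain()                    # does the plan show InMemoryTableScan / InMemoryRelation,
                                     # or does it recompute the range from scratch?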
I am asking this because I have recently been fixing some out-of-memory issues in my code, and it seems to me that caching might be the problem. However, I do not yet fully understand what the safe ways to use cache are, and how one might accidentally invalidate one's cached data. What is missing in my understanding? Am I deviating from best practice in doing the above? Thanks!
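In case it is relevant, one mitigation I am considering is to unpersist explicitly once the downstream result has been produced, rather than relying on garbage collection; a rough sketch (the write step and output path are placeholders):

df = gen_func(inputs)                          # cached inside gen_func, as above
result = df.filter("some_col = some_val")      # derived DataFrame
result.write.parquet("/placeholder/output")    # some action that consumes it
df.unpersist()                                 # release the cached blocks explicitly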
From: If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?