I have a DataFrame produced by a query in PySpark, and I want to pass it to two different functions. Basically, I am doing something like the following, and I want to avoid running the same query to build initial_df twice.
def post_process_1(df):
    # use the passed-in DataFrame, not a global
    df = df.filter(...)  # 1st set of filters and group by
    write_to_table_1(df)

def post_process_2(df):
    df = df.filter(...)  # 2nd set of filters and group by
    write_to_table_2(df)

initial_df = df.filter(...).groupBy(...).orderBy(...)
post_process_1(initial_df)
post_process_2(initial_df)
I found this post, which describes a problem similar to mine: pyspark python dataframe reuse in different functions. The suggestion there is to use createOrReplaceTempView. But from what I have learned, createOrReplaceTempView requires queries written in SQL syntax, whereas in my example the queries in post_process_1 and post_process_2 (filter/groupBy) are written with the PySpark DataFrame API.
Is there any way I can avoid computing initial_df twice while keeping PySpark queries (not SQL queries) in my post_process_1 and post_process_2 functions?
from How to pass dataframe to different functions with filters and group by