Tuesday, 6 June 2023

How to pass dataframe to different functions with filters and group by

I have a DataFrame produced by a query in PySpark, and I want to pass that DataFrame to two different functions.

Basically, I am doing something like this, and I want to avoid running the same query to create initial_df twice.

def post_process_1(df):
    df = df.filter(...).groupBy(...)  # 1st set of filters and group-bys
    write_to_table_1(df)

def post_process_2(df):
    df = df.filter(...).groupBy(...)  # 2nd set of filters and group-bys
    write_to_table_2(df)

initial_df = df.filter(...).groupBy(...).orderBy(...)
post_process_1(initial_df)
post_process_2(initial_df)

I found this post, which talks about a problem similar to mine: pyspark python dataframe reuse in different functions

The suggestion there is to use createOrReplaceTempView. But from what I understand, createOrReplaceTempView requires the downstream queries to be written in SQL syntax.
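
If I understand correctly, the temp-view route would look roughly like this. The view name, column names, and SQL string below are only illustrative, not my actual query:

    # Sketch only: register the DataFrame as a temp view, then every downstream
    # query has to go through spark.sql with an SQL-syntax string.
    initial_df.createOrReplaceTempView("initial_view")

    def post_process_1_sql(spark):
        # "category" and "amount" are hypothetical columns; the point is that the
        # filter and group by are expressed as SQL text rather than DataFrame calls.
        result = spark.sql(
            "SELECT category, COUNT(*) AS cnt FROM initial_view "
            "WHERE amount > 0 GROUP BY category"
        )
        write_to_table_1(result)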

In my example, the queries (filter / group by) in post_process_1 and post_process_2 are written with the PySpark DataFrame API.

Is there any way I can avoid querying initial_df twice while keeping PySpark queries (not SQL queries) in my post_process_1 and post_process_2 functions?
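
Would something like caching work here? This is only my own guess, not something from the linked post: persist initial_df so that the second write reuses the data materialized by the first write instead of re-running the source query, while the filters and group-bys stay in the DataFrame API. The column names below are placeholders:

    from pyspark.sql import functions as F
    from pyspark import StorageLevel

    # Placeholder query; "amount" and "category" stand in for my real columns.
    initial_df = df.filter(F.col("amount") > 0).groupBy("category").count()

    # persist() marks the DataFrame for caching: the first action computes and
    # caches it, and later actions reuse the cached data instead of the lineage.
    initial_df.persist(StorageLevel.MEMORY_AND_DISK)  # or simply initial_df.cache()

    post_process_1(initial_df)
    post_process_2(initial_df)

    initial_df.unpersist()  # release the cached data once both writes are done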



from How to pass dataframe to different functions with filters and group by
