Wednesday, 15 February 2023

Pyspark: How to avoid python UDF as a driver operation?

I have a Python function that needs to run inside PySpark code. Is there a way to call it through mapPartitions, so that the Python operation is distributed across the full cluster instead of executing on a single node? If I just apply the UDF directly to the DataFrame, wouldn't that run as a driver operation? What is the efficient way of doing this?

class Some_class_name:
   @staticmethod
   def pyt_udf(x):
      <some python operation>
      return data

   def opr_to_be_done(self):
      df = spark.sql(f'''select col1, col2 from table_name''')
      # mapPartitions passes an iterator over the partition's rows,
      # so apply pyt_udf to each row rather than to the iterator itself
      rdd2 = df.rdd.mapPartitions(lambda part: map(Some_class_name.pyt_udf, part))

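For reference, here is a minimal runnable sketch of the mapPartitions pattern the question is asking about. The function handed to `mapPartitions` receives an iterator over one partition's rows and must return (or yield) an iterable; because it runs once per partition on an executor, any expensive setup is paid per partition rather than per row. The names `transform_value`, `process_partition`, `run_on_spark`, and the table/column names are hypothetical placeholders, not from the original post.

```python
def transform_value(x):
    # Stand-in for the real Python operation inside pyt_udf.
    return x * 2

def process_partition(rows):
    # Executed once per partition on an executor. `rows` is an
    # iterator of Row objects (or tuples); per-partition setup such
    # as opening a connection or loading a model would go here,
    # before the loop, so it is paid once per partition.
    for row in rows:
        yield (row[0], transform_value(row[1]))

def run_on_spark():
    # Spark wiring; call this where a SparkSession is available.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.sql("select col1, col2 from table_name")
    # mapPartitions distributes process_partition across executors.
    return df.rdd.mapPartitions(process_partition).toDF(["col1", "col2"])
```

Because `process_partition` is plain Python operating on an iterator, it can be tested locally with a list of tuples before wiring it into Spark.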

