I have a Python UDF that needs to run inside PySpark code. Is there a way to call that UDF with mapPartitions, so that the Python operation is not confined to the driver node and the full cluster is used? If I just apply the UDF directly to the DataFrame, wouldn't that run as a driver operation? What is the efficient way of doing this?
class Some_class_name:
    def pyt_udf(self, x):
        # <some python operation>
        return data

    def opr_to_be_done(self):
        df = spark.sql('''select col1, col2 from table_name''')
        rdd2 = df.rdd.mapPartitions(lambda x: self.pyt_udf(x))
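A minimal sketch of the usual pattern, assuming an active SparkSession named spark and a table table_name with columns col1 and col2 (the function name process_partition is illustrative, not from the original code): mapPartitions hands the function an iterator over the Rows of one partition, not a single value, so the function should consume that iterator and yield results; the work then runs as tasks on the executors rather than on the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_partition(rows):
    # rows is an iterator over the Rows of a single partition;
    # this body executes on the executors, one task per partition
    for row in rows:
        # <some python operation> on each row goes here
        yield (row.col1, row.col2)

df = spark.sql('''select col1, col2 from table_name''')
result = df.rdd.mapPartitions(process_partition).toDF(['col1', 'col2'])

Note that a plain Python UDF registered with pyspark.sql.functions.udf also executes on the executors, not the driver; mapPartitions is mainly worthwhile when there is expensive per-partition setup (for example, opening a connection or loading a model once per partition) that would be wasteful to repeat for every row.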
Source: Pyspark: How to avoid python UDF as a driver operation?