Saturday, 4 May 2019

Why does a PySpark UDF that operates on a column generated by rand() fail?

Given the following Python function:

def f(col):
    return col

If I turn it into a UDF and apply it to a column object, it works...

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()

df.withColumn('new', udf(F.lit(0))).show()

...except when the column is generated by rand():

df.withColumn('new', udf(F.rand())).show()  # fails

However, the following two work:

df.withColumn('new', F.rand()).show()
df.withColumn('new', F.rand()).withColumn('new2', udf(F.col('new'))).show()

The error:

Py4JJavaError: An error occurred while calling o469.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 34, localhost, executor driver): java.lang.NullPointerException

Why does this happen, and how can I pass a column expression created by rand() to a UDF?
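
A workaround that follows directly from the two variants above that do work is to materialize rand() into a real column with withColumn first, and only then hand that column to the UDF. A minimal sketch reusing the df and udf defined above (the helper column name rnd is arbitrary):

# Materialize rand() as a concrete column, then apply the UDF to that column.
result = (
    df.withColumn('rnd', F.rand())           # rand() is evaluated into real double values here
      .withColumn('new', udf(F.col('rnd')))  # the UDF receives those values, not the rand() expression
      .drop('rnd')                           # optional: hide the helper column in the output
)
result.show()

This sidesteps passing the rand() expression directly as the UDF argument, which is the only form that fails above.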



