Saturday, 4 May 2019

Why does a PySpark UDF that operates on a column generated by rand() fail?

Given the following Python function:

def f(col):
    return col

If I turn it into a UDF and apply it to a column object, it works...

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()

df.withColumn('new', udf(F.lit(0))).show()

...except when the column is generated by rand():

df.withColumn('new', udf(F.rand())).show()  # fails

However, the following two work:

df.withColumn('new', F.rand()).show()
df.withColumn('new', F.rand()).withColumn('new2', udf(F.col('new'))).show()

The error:

Py4JJavaError: An error occurred while calling o469.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 34, localhost, executor driver): java.lang.NullPointerException

Why does this happen, and how can I pass a column expression created by rand() to a UDF?
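
A workaround that follows directly from the two variants above that do work is to materialize rand() into a real column with withColumn first, and only then hand that column to the UDF. A minimal sketch reusing the df and udf defined above (the helper column name rnd is arbitrary):

# Materialize rand() as a concrete column, then apply the UDF to that column.
result = (
    df.withColumn('rnd', F.rand())           # rand() is evaluated into real double values here
      .withColumn('new', udf(F.col('rnd')))  # the UDF receives those values, not the rand() expression
      .drop('rnd')                           # optional: hide the helper column in the output
)
result.show()

This sidesteps passing the rand() expression directly as the UDF argument, which is the only form that fails above.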



