Tuesday, 5 November 2019

How to mock inner call to pyspark sql function

I have the following piece of PySpark code:

import pyspark.sql.functions as F

...

null_or_unknown_count = df.sample(0.01).filter(
    F.col('env').isNull() | (F.col('env') == 'Unknown')
).count()

In the test code, the DataFrame is mocked, so I am trying to set the return_value for this call like this:

from unittest import mock
from unittest.mock import ANY

import pyspark.sql  # needed for spec=pyspark.sql.DataFrame below

@mock.patch('pyspark.sql.DataFrame', spec=pyspark.sql.DataFrame)
def test_null_or_unknown_validation(self, mock_df):
    mock_df.sample(0.01).filter(ANY).count.return_value = 250

But this fails with the following:

File "/usr/local/lib/python3.7/site-packages/pyspark/sql/functions.py", line 44, in _
  jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

I also tried mock_df.sample().filter().count.return_value = 250, which gives the same error.

How do I mock the filter, i.e. F.col('env').isNull() | (F.col('env') == 'Unknown'), correctly?
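
One possible direction, sketched under assumptions: the traceback suggests that F.col is executed for real with no active SparkContext (sc is None), so mocking only the DataFrame may not be enough. The sketch below replaces the functions module with a MagicMock as well. The wrapper count_null_or_unknown, and passing F in as a parameter, are hypothetical and only there to keep the example runnable without Spark; in a real project a similar effect might be had by patching the F name in the module under test.

```python
from unittest import mock


def count_null_or_unknown(df, F):
    """Hypothetical wrapper around the snippet from the question.

    F is a parameter here so a test can pass a stand-in for
    pyspark.sql.functions (an assumption made for this sketch).
    """
    return df.sample(0.01).filter(
        F.col('env').isNull() | (F.col('env') == 'Unknown')
    ).count()


# Stand-in for pyspark.sql.functions: MagicMock implements the | operator,
# so the filter expression can be built without touching the JVM.
mock_F = mock.MagicMock()

# Stand-in for the DataFrame: configure the chained calls via return_value.
mock_df = mock.MagicMock()
mock_df.sample.return_value.filter.return_value.count.return_value = 250

result = count_null_or_unknown(mock_df, mock_F)

# Check the calls that were made on the mocks.
mock_df.sample.assert_called_once_with(0.01)
mock_F.col.assert_called_with('env')
```

With this setup the filter expression itself is not verified, only that the chain was called. If the expression matters, running the test against a small local SparkSession may be a better fit than mocking.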


