Wednesday, 2 January 2019

ModuleNotFoundError because PySpark serializer is not able to locate library folder

I have the following folder structure

 - libfolder
    - lib1.py
    - lib2.py
 - main.py

main.py imports libfolder.lib1, which in turn imports libfolder.lib2 and others.
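
For reference, a minimal sketch of that import chain looks like this (the function names here are placeholders, not my real code):

# libfolder/lib2.py
def helper():
    return "hello from lib2"

# libfolder/lib1.py
from libfolder import lib2

def run():
    return lib2.helper()

# main.py
from libfolder import lib1

print(lib1.run())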

It all works perfectly fine on my local machine, but after I deploy it to Dataproc I get the following error:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'libfolder'

I have zipped the folder into xyz.zip and run the following command:

spark-submit --py-files=xyz.zip main.py
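
(For reference, the programmatic equivalent of --py-files would be SparkContext.addPyFile; this is just a sketch of that idea, not the exact code I ran:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xyz").getOrCreate()
# ship the archive to the driver and every executor and put it on sys.path
spark.sparkContext.addPyFile("xyz.zip")

from libfolder import lib1  # import only after addPyFile has registered the zip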

The serializer is not able to find the location for libfolder. Is there a problem with the way I am packaging my folders?

This issue is similar to this one, but that question was never answered.

Edit: response to Igor's questions

unzip -l on the zip file returns the following:

 - libfolder
    - lib1.py
    - lib2.py
 - main.py
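
For the worker-side import to succeed, those entries have to sit at the root of the archive as libfolder/lib1.py and libfolder/lib2.py, with no extra parent directory in between. One way to double-check that from Python (a small sketch, assuming xyz.zip is in the current directory):

import zipfile

# entry names must start with "libfolder/" at the top level of the zip
with zipfile.ZipFile("xyz.zip") as zf:
    for name in zf.namelist():
        print(name)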

In main.py, lib1 is imported with this statement:

from libfolder import lib1


