Friday, 26 February 2021

Unable to initialize main class org.apache.spark.deploy.SparkSubmit when trying to run pyspark

I have a conda installation of Python 3.7:

$python3 --version
Python 3.7.6

pyspark was installed via pip3 install (I did not find a native conda package for it).

$conda list | grep pyspark
pyspark                   2.4.5                    pypi_0    pypi

Here is what pip3 tells me:

$pip3 install pyspark
Requirement already satisfied: pyspark in ./miniconda3/lib/python3.7/site-packages (2.4.5)
Requirement already satisfied: py4j==0.10.7 in ./miniconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
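
Since the error (shown further down) complains about a missing log4j class, I also wanted to sanity-check that the Spark jars bundled with the pip package are actually on disk. Here is a small check I would run (a sketch; _find_spark_home is an internal pyspark helper, so treat that import as an assumption):

import os
import pyspark
from pyspark.find_spark_home import _find_spark_home

# Where does pyspark think its Spark distribution lives?
print("pyspark version:", pyspark.__version__)
spark_home = os.environ.get("SPARK_HOME") or _find_spark_home()
print("SPARK_HOME resolves to:", spark_home)

# The jars directory should contain a log4j jar for Spark 2.4.x.
jars_dir = os.path.join(spark_home, "jars")
jars = sorted(os.listdir(jars_dir)) if os.path.isdir(jars_dir) else []
print(len(jars), "jars found")
print([j for j in jars if "log4j" in j])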

JDK 11 is installed:

$java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
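
For completeness, here is how I would check which JVM PySpark's launcher is going to pick up, and how to pin it to a specific JDK if needed (a sketch; as far as I know the launcher scripts honour JAVA_HOME, and the path in the comment is only an illustration):

import os, shutil

# Which java does the PATH resolve to, and is JAVA_HOME set at all?
print("java on PATH:", shutil.which("java"))
print("JAVA_HOME   :", os.environ.get("JAVA_HOME"))

# Illustration only: force a specific JDK before the SparkSession is built.
# Replace the path with the real home directory of the JDK you want.
# os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk-11.0.2.jdk/Contents/Home"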

When I attempt to actually use pyspark, things do not go so well. Here is a mini test program:

from pyspark.sql import SparkSession
import os

def setupSpark():
    # Run the plain embedded pyspark-shell; no extra packages or options.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
# Three rows, two columns, just to prove the session works.
df = sp.createDataFrame([(1, 4), (2, 5), (3, 6)], ["a", "b"])
df.show()

That results in:

Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter

Here are the full details:

$python3 sparktest.py 
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Traceback (most recent call last):
  File "sparktest.py", line 9, in <module>
    sp = setupSpark()
  File "sparktest.py", line 6, in setupSpark
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
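
In case it helps narrow things down, the same failure should be reproducible without my script by calling the spark-submit that ships inside the pip package directly (a sketch; the bin/ location is an assumption based on where pip puts the package):

import os
import subprocess
from pyspark.find_spark_home import _find_spark_home

# Invoke the bundled spark-submit directly, bypassing the Python gateway.
# If the log4j jar really is missing from the classpath, the same
# NoClassDefFoundError should show up here as well.
spark_home = _find_spark_home()
spark_submit = os.path.join(spark_home, "bin", "spark-submit")
subprocess.run([spark_submit, "--version"], check=False)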

Any pointers or info on a working pyspark environment under conda would be appreciated.



