I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations, but I get an exception:
ModuleNotFoundError: No module named 'numpy'
My code:
import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Playground") \
    .config("spark.master", "local") \
    .getOrCreate()

@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)

df = spark \
    .createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)]) \
    .withColumnRenamed("_1", "doc") \
    .withColumnRenamed("_2", "word1") \
    .withColumnRenamed("_3", "word2") \
    .withColumnRenamed("_4", "word3")

df2 = df.select("doc", array([c for c in df.columns if c not in {'doc'}]).alias("words"))
df2 = df2.withColumn("cosine", myfunction("words"))
df2.show()  # The exception is thrown here
However, if I run a different file that contains only:
import numpy as np
x = np.array([1, 3, 9])
then it works fine.
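A common cause of exactly this symptom (numpy imports fine in a plain script, but not inside a UDF) is that the Python worker processes Spark launches use a different interpreter than the one running the driver script, so they see a different site-packages. As a diagnostic sketch, not a guaranteed fix, pinning the workers to the driver's interpreter via the `PYSPARK_PYTHON` environment variable before the SparkSession is created may help:

```python
import os
import sys

# Sketch of a workaround: make Spark's worker processes use the same
# Python interpreter as the driver, so both resolve imports against the
# same site-packages (where numpy/scipy must be installed).
# These variables must be set BEFORE the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# ...then build the SparkSession exactly as in the snippet above.
```

If `sys.executable` points at the venv's `python.exe`, the workers will then import numpy from that venv as well.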
Edit:
As pissall suggested in a comment, I've installed both numpy and scipy in the venv. Now if I try to run it with spark-submit it fails on the first line (the numpy import), and if I run it using python.exe I keep getting the same error message as before.
I run it like this:
spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt
requirements.txt:
numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1
But it fails on the first line.
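One thing worth noting about the command above: spark-submit treats everything after the application file as arguments to the script itself, so a `--conf` placed after `spark\ricky2.py` never reaches Spark. The ordering would need to be as below; note also that the `spark.pyspark.virtualenv.*` settings are only honored on Spark builds that ship the virtualenv feature, so this is a sketch of the option ordering, not a guaranteed fix:

```shell
# All spark-submit options must come BEFORE the application file;
# anything after spark\ricky2.py is passed to the script, not to Spark.
spark-submit --master local --conf spark.pyspark.virtualenv.requirements=requirements.txt spark\ricky2.py
```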
(from "Cannot use numpy with Spark")