Sunday, 6 October 2019

Cannot use numpy with Spark

I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations inside a UDF, but I get an exception:

ModuleNotFoundError: No module named 'numpy'

My code:

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession\
      .builder\
      .appName("Playground")\
      .config("spark.master", "local")\
      .getOrCreate()

@udf("float")
def myfunction(x):
    y=np.array([1,3,9])
    x=np.array(x)
    return cosine(x,y)


df = spark\
    .createDataFrame([("doc_3",1,3,9), ("doc_1",9,6,0), ("doc_2",9,9,3) ])\
    .withColumnRenamed("_1","doc")\
    .withColumnRenamed("_2","word1")\
    .withColumnRenamed("_3","word2")\
    .withColumnRenamed("_4","word3")


df2=df.select("doc", array([c for c in df.columns if c not in {'doc'}]).alias("words"))

df2=df2.withColumn("cosine",myfunction("words"))

df2.show() # The exception is thrown here
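The likely cause (a hedged guess, not something confirmed in the post) is that the UDF body executes in Spark's worker Python processes, which may be a different interpreter from the one running the driver script, so numpy can be importable in one and missing in the other. A common workaround is to pin both driver and workers to the same interpreter, before the SparkSession is created:

```python
import os
import sys

# Sketch of a common workaround: force Spark to use the interpreter that
# is running this script (e.g. the venv's python, which has numpy) for
# both the driver and the worker processes. These environment variables
# must be set BEFORE the SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

With these set, the `ModuleNotFoundError` should disappear if the venv interpreter (the one with numpy and scipy installed) is the one launching the script.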

However, if I run a different file that only includes:

import numpy as np
x = np.array([1, 3, 9])

then it works fine.
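That plain script runs entirely in a single process, so it only proves numpy is installed for that one interpreter; it says nothing about the worker processes that execute the UDF. A quick diagnostic (a hypothetical snippet, not from the post) is to print which interpreter is running and whether numpy is visible to it:

```python
import sys
import importlib.util

# Which interpreter is executing this process?
print(sys.executable)

# Can this interpreter see numpy, without actually importing it?
spec = importlib.util.find_spec("numpy")
print("numpy location:", spec.origin if spec else "NOT FOUND")
```

Running the same two checks inside the UDF (and triggering it with `df2.show()`) would reveal the worker interpreter's path, which on Windows is often the system Python rather than the venv.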

Edit:

As pissall suggested in a comment, I've installed both numpy and scipy in the venv. Now if I try to run it with spark-submit it fails on the first line (the numpy import), and if I run it using python.exe I keep getting the same error message as before.

I run it like this:

spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt

requirements.txt:

numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1

But it fails on the first line.
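One thing worth checking in the command above (an observation about spark-submit generally, not something confirmed in the post): spark-submit treats everything after the application file as arguments passed to the script itself, so a `--conf` placed after `spark\ricky2.py` never reaches Spark at all. All options would need to come before the script, roughly:

```shell
# All spark-submit options must precede the application file;
# anything after it ends up in the script's sys.argv instead.
spark-submit --master local --conf spark.pyspark.virtualenv.requirements=requirements.txt spark\ricky2.py
```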



from Cannot use numpy with Spark

2 comments:

  1. Did this work for you? I'm also facing the same issue in udf
