Sunday, 14 August 2022

Pyspark: Invalid status code '400' when lazy loading the dataframe

I am getting an Invalid status code '400' error every time I try to show the pyspark dataframe. My AWS SageMaker driver and executor memory are both set to 32G.

-Env:

Python version : 3.7.6
pyspark version : '2.4.5-amzn-0'
Notebook instance : 'ml.t2.2xlarge'
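
For context, in a SageMaker notebook connected to EMR through sparkmagic/Livy, the session memory can be set with the %%configure magic. A minimal sketch, assuming the 32G values above were applied this way:

%%configure -f
{"driverMemory": "32G", "executorMemory": "32G"}

(The -f flag forcibly recreates the Livy session with the new settings.)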

-EMR cluster config

{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}

After some manipulation, I cleaned the data and reduced its size. The dataframe looks correct:

print((df.count(), len(df.columns)))
df.show()

(1642, 9)

stock   date    time   spread  time_diff  ...
  VOD  01-01    9:05     0.01       1132  ...
  VOD  01-01    9:12     0.03        465  ...
  VOD  01-02   10:04     0.02        245  ...
  VOD  01-02   10:15     0.01        364  ...
  VOD  01-02   10:04     0.02         12  ...

However, if I then apply a filter,

from pyspark.sql import functions as f

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()

then I get this error:

An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}
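
In case it helps diagnose, the session state can also be queried directly from the Livy REST API, using the host and session id from the error payload (a sketch; run from somewhere that can reach the cluster):

import requests

# Ask Livy for the state of session 34 (host/port taken from the error message)
resp = requests.get("http://11.146.133.8:8990/sessions/34/state")
print(resp.json())  # e.g. {"id": 34, "state": "dead"} if the session has been killed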

I really have no idea what's going on.

Can someone please advise?

Thanks



