I am getting Invalid status code '400' errors every time I try to show a PySpark DataFrame.
My AWS SageMaker driver and executor memory are 32G.
-Env:
Python version: 3.7.6
PySpark version: 2.4.5-amzn-0
Notebook instance: ml.t2.2xlarge
-EMR cluster config
{"classification":"livy-conf","properties":{"livy.server.session.timeout":"5h"}},
{"classification":"spark-defaults","properties":{"spark.driver.memory":"20G"}}
After some manipulation I cleaned the data and reduced its size. The DataFrame looks correct:
print(df.count(), len(df.columns))
df.show()  # show() prints the rows itself; wrapping it in print() also emits a stray None
(1642, 9)
stock  date   time   spread  time_diff  ...
VOD    01-01   9:05    0.01       1132  ...
VOD    01-01   9:12    0.03        465  ...
VOD    01-02  10:04    0.02        245  ...
VOD    01-02  10:15    0.01        364  ...
VOD    01-02  10:04    0.02         12  ...
However, if I go on to filter,

from pyspark.sql import functions as f  # assuming f is the usual functions alias

new_df = df.filter(f.col('time_diff') <= 1800)
new_df.show()
then I get this error:
An error was encountered:
Invalid status code '400' from http://11.146.133.8:8990/sessions/34/statements/8 with error payload: {"msg":"requirement failed: Session isn't active."}
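For what it's worth, the notebook reaches the cluster through a Livy session (SparkMagic), and my understanding is that session-level settings can also be passed per notebook with the %%configure magic before the session starts. The values below are placeholders, not what I actually have set:

%%configure -f
{"driverMemory": "20G", "executorMemory": "16G"}

I mention it only in case the session settings, rather than the filter itself, are what matters here.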
I really have no idea what's going on.
Can someone please advise?
Thanks
from Pyspark: Invalid status code '400' when lazy loading the dataframe