Wednesday, 29 August 2018

Predict Day-Ahead in a Parallelized and Scalable Environment - H2O Package - R or Python

This follows up on my answered question: R or Python - loop the test data - Prediction validation next 24 hours (96 values each day)

I want to predict the next day using the H2O package. You can find a detailed explanation of my dataset in the question linked above.

The data structure in H2O is different from a regular R data frame.

So, after making the prediction, I want to calculate the MAPE for each day.

First, I have to convert the training and testing data to H2O format:

train_h2o <- as.h2o(train_data)

test_h2o <- as.h2o(test_data)

mape_calc <- function(sub_df) {
  pred <- predict.glm(glm_model, sub_df)
  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))

  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)

  return(new_df)
}

# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)

# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)

The code above works well for the non-H2O day-ahead prediction validation, and it calculates the MAPE for every day.
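Since Python answers are also welcome, the per-day MAPE logic can be sketched in plain Python as well. This is only a minimal illustration, assuming the dates, actual values, and predictions are already available as ordinary parallel lists; the function name `daily_mape`, the toy data, and the choice to skip zero actuals are assumptions made for the example and are not part of the original code.

```python
from collections import defaultdict


def daily_mape(dates, actual, predicted):
    """Return {date: MAPE in percent} for parallel sequences of equal length."""
    errors = defaultdict(list)
    for day, a, p in zip(dates, actual, predicted):
        if a == 0:
            # Assumption: skip zero actuals, where the percentage error is undefined
            continue
        errors[day].append(abs((a - p) / a))
    # Average the absolute percentage errors within each day
    return {day: 100 * sum(errs) / len(errs) for day, errs in errors.items()}


# Hypothetical toy data: two days with two readings each
# (the real dataset has 96 values per day)
dates = ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [110.0, 180.0, 50.0, 100.0]

result = daily_mape(dates, actual, predicted)
for day, mape in sorted(result.items()):
    print(day, round(mape, 2))
```

The grouping works on plain local values only, so it does not depend on which library produced the predictions.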

I tried to convert the H2O model to a regular R model object, but according to https://stackoverflow.com/a/39221269/9341589 this is not possible.

To make a prediction in H2O, let's say, for instance, that we want to create a Random Forest model:

y <- "RealPtot" #target
x <- names(train_h2o) %>% setdiff(y) #features


rforest.model <- h2o.randomForest(y=y, x=x, training_frame = train_h2o, ntrees = 2000, mtries = 3, max_depth = 4, seed = 1122)

Then we can get the predictions for the complete dataset as shown below.

predict.rforest <- as.data.frame(h2o.predict(rforest.model, test_h2o))

But in my case I am trying to get one-day predictions using mape_calc.

So I modified the code to accept the H2O input format:

mape_calc <- function(sub_df) {
  pred <- predict(rforest.model, sub_df)
  #I modified this line
  actual <-sub_df[, "RealPtot"]
  mape <- 100 * mean(abs((actual - pred)/actual))
  #And I changed this line 
  new_df <- data.frame(date = sub_df[,"date"][[1]], mape = mape)

  return(new_df)
}

# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_h2o, test_h2o[, "RealPtot"], mape_calc )

# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)

I am getting an error at the df_list stage:

Error in unique.default(x, nmax = nmax) : 
  invalid type/length (environment/0) in vector allocation

NOTE: Any thoughts in R or Python will be appreciated.

Hemant Vishwakarma