Wednesday, 2 February 2022

Getting optimal vocab size and embedding dimensionality using GridSearchCV

I'm trying to use GridSearchCV to find the best hyperparameters for an LSTM model, including the best parameters for vocab size and the word embedding dimension. First, I prepared my training and testing data.

x = df['tweet_text']
y = df['potentially_harmful']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

x_train = x_train.to_numpy().reshape(-1, 1)
y_train = y_train.to_numpy().reshape(-1, 1)
x_test = x_test.to_numpy().reshape(-1, 1)
y_test = y_test.to_numpy().reshape(-1, 1)
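For reference, the reshape(-1, 1) calls just turn each one-dimensional array into a single-column 2-D array (one sample per row), which is the shape scikit-learn expects. A minimal sketch with made-up stand-in data:

```python
import numpy as np

# Hypothetical stand-ins for x_train / y_train after train_test_split:
# a 1-D array of tweet strings and a 1-D array of labels.
x_train = np.array(["first tweet", "second tweet", "third tweet"])
y_train = np.array([0, 1, 0])

# reshape(-1, 1) turns each (n,) vector into an (n, 1) column:
# one sample per row, one feature/target column.
x_train = x_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)

print(x_train.shape)  # (3, 1)
print(y_train.shape)  # (3, 1)
```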

And then I tried to create a model that I could use for my GridSearchCV. I know that to use a Keras model with the grid search, you need to wrap it in KerasClassifier or KerasRegressor. I also make sure to adapt on x, not x_train or anything, since x is the full dataset and I assume the layer needs to vectorize all of x so that all input docs have a consistent vectorized form.

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from scikeras.wrappers import KerasClassifier, KerasRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import StratifiedKFold

def build_model(max_tokens, max_len, dropout):

    model = Sequential()
    vectorize_layer = TextVectorization(
        max_tokens=max_tokens,
        output_mode="int",
        output_sequence_length=max_len,
    )
    vectorize_layer.adapt(x)
    model.add(vectorize_layer)  # maps raw strings to int sequences
    model.add(Embedding(max_tokens + 1, 128))
    model.add(LSTM(64, dropout=dropout, recurrent_dropout=dropout))
    model.add(Dense(64, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    return model
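As a rough mental model of what the vectorization step does (this is a simplified pure-Python stand-in, not the real Keras layer), TextVectorization with output_mode="int" builds a vocabulary capped at max_tokens during adapt(), reserving index 0 for padding and 1 for out-of-vocabulary words, and then maps each document to a fixed-length integer sequence of output_sequence_length:

```python
from collections import Counter

def adapt(docs, max_tokens):
    # Build a vocabulary from the most frequent words, capped at
    # max_tokens; indices 0 (padding) and 1 (OOV) are reserved.
    counts = Counter(w for doc in docs for w in doc.lower().split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_tokens - 2))}

def vectorize(doc, vocab, max_len):
    # Map words to ids (unknown words -> 1), truncate to max_len,
    # then pad with zeros up to max_len.
    ids = [vocab.get(w, 1) for w in doc.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

docs = ["great recipe", "great show tonight", "bad recipe"]
vocab = adapt(docs, max_tokens=25)
print(vectorize("great recipe wow", vocab, max_len=5))  # [2, 3, 1, 0, 0]
```

This is also why the Embedding layer needs an input dimension of at least max_tokens: every integer the vectorizer can emit must have a row in the embedding matrix.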

Here I try to instantiate the model with the params. The classifier complained until I also passed the default values dropout=0.2, max_len=5, max_tokens=25.

model = KerasClassifier(build_fn=build_model, dropout=0.2, max_len=5, max_tokens=25)
params = {
    "max_tokens" : [25, 50, 500, 5000],
    "max_len" : [5, 50, 500, 1000],
    "dropout" : [0.1, 0.2, 0.3, 0.4, 0.5],
}
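As a sanity check on the size of this search: the grid is the Cartesian product of the three lists, and each candidate gets fit once per CV fold, which matches the "80 candidates, totalling 240 fits" line in the log below:

```python
from itertools import product

params = {
    "max_tokens": [25, 50, 500, 5000],
    "max_len": [5, 50, 500, 1000],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
}

# GridSearchCV tries every combination: 4 * 4 * 5 = 80 candidates.
candidates = list(product(*params.values()))
print(len(candidates))      # 80
print(len(candidates) * 3)  # 240 total fits with cv=3
```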

grid = GridSearchCV(estimator=model, scoring='accuracy', param_grid=params, cv=3, verbose=2, error_score='raise')

grid.fit(x_train, y_train)

Then, I get this error:

Fitting 3 folds for each of 80 candidates, totalling 240 fits
ValueError: could not convert string to float: 'promo looks promising pls say absence means fauxfoodies r couple eliminated next round ugh cantstandthem mkr'

Which confuses me. The model works if I just instantiate it directly with something like model = build_model(...) and call model.fit(x_train, y_train), for example, and it has no trouble with the string inputs then. Why is it unable to handle them now?
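The error message itself is just Python's generic float-coercion failure, so something in the grid-search path must be trying to cast the tweet strings to a numeric array before Keras ever sees them (my guess is a scikit-learn-style input-validation step that a direct model.fit never runs; I haven't traced this through the scikeras source). To illustrate where that exact message comes from:

```python
# Anything that tries to coerce a tweet string to a number fails with
# precisely the ValueError shown in the traceback above.
try:
    float("promo looks promising pls say absence means fauxfoodies")
except ValueError as e:
    print(e)  # could not convert string to float: '...'
```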



from Getting optimal vocab size and embedding dimensionality using GridSearchCV
