Thursday, 13 May 2021

Text and dummy variables in ML - features selection

I have a dataframe like this:

       Text                                            A   B   C   Label
337    nobodi can explain gave what we did ...         0   1   0     1
338    provide an example                              1   1   0     0
339    another one????                                 1   0   0     1

I would like to understand how to build an ML classifier. Currently, I do the following:

X = train[['Text','A','B','C']]
y = train['Label']

# Split into training, validation, and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
# Split again from the training portion so the validation and test sets stay disjoint
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=40)
# Returning to one dataframe
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
valid_df = pd.concat([X_valid, y_valid], axis=1)
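One detail worth noting: the second train_test_split should draw from the training portion rather than from X and y again, otherwise the validation set overlaps the test set. A minimal sketch with hypothetical toy arrays, where taking 25% of the remaining 80% yields a 60/20/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples with 2 features each (values are hypothetical)
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First carve off 20% for the test set, then take 25% of what
# remains for validation: 0.25 * 0.8 = 0.2, i.e. a 60/20/20 split
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_train, X_valid, y_train, y_valid = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=40)

print(len(X_train), len(X_valid), len(X_test))  # 30 10 10
```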

Then I create features using BOW and TFIDF:

countV = CountVectorizer()
train_count = countV.fit_transform(train_df['Text'].values)

# To create tfidf frequency features

tfidfV = TfidfTransformer()
train_tfidf = tfidfV.fit_transform(train_count)

tfidf_ngram = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), use_idf=True, smooth_idf=True)
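As an aside, TfidfVectorizer is documented as equivalent to CountVectorizer followed by TfidfTransformer when the defaults match, so the two-step and one-step routes above produce the same matrix. A quick sketch on a hypothetical toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Hypothetical toy corpus
docs = ["provide an example",
        "another example here",
        "nobody can explain this one"]

# Two-step route: raw counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with matching default settings
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```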

However, when I build the models, for example a NB model:

nb_pipeline = Pipeline([
        ('NBCV', countV),
        ('nb_clf',MultinomialNB())])

nb_pipeline.fit(train_df['Text'],train_df['Label'])
predicted_nb = nb_pipeline.predict(test_df['Text'])
np.mean(predicted_nb == test_df['Label'])
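As written, the pipeline only ever receives the Text series, so the learned feature space is just the vectorizer's vocabulary. A self-contained sketch of the same pattern on hypothetical toy data makes this visible:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy stand-ins for train_df and test_df
train_df = pd.DataFrame({"Text": ["good clear example", "bad vague case",
                                  "good one", "bad answer"],
                         "Label": [1, 0, 1, 0]})
test_df = pd.DataFrame({"Text": ["good example", "bad case"],
                        "Label": [1, 0]})

nb_pipeline = Pipeline([("NBCV", CountVectorizer()),
                        ("nb_clf", MultinomialNB())])
nb_pipeline.fit(train_df["Text"], train_df["Label"])

predicted_nb = nb_pipeline.predict(test_df["Text"])

# Every learned feature is a token from Text; no dummy column appears
print(sorted(nb_pipeline.named_steps["NBCV"].vocabulary_))
```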

Something does not work: I lose the information in my dummy variables A, B, and C, and end up with features from Text only. I can see this when I look at the feature importances:

feature_names = nb_pipeline.named_steps["NBCV"].get_feature_names()
coefs = nb_pipeline.named_steps["nb_clf"].coef_.flatten()

import pandas as pd
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
df["ABS"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("ABS", ascending=True)

Can you explain why I am losing this information, and how I can keep my dummy variables in the model? Those variables should be very meaningful, so I cannot exclude them from the model build. I need to check the model's accuracy and see the impact of those variables on it.
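For context: the pipeline above is fit on train_df['Text'] alone, so A, B, and C never reach the vectorizer or the classifier. One common way to keep them, sketched here on hypothetical toy data with a ColumnTransformer that vectorizes Text and passes the dummy columns through (a sketch of one option, not necessarily the only approach):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy stand-in for train_df
train_df = pd.DataFrame({
    "Text": ["nobody can explain what we did",
             "provide an example",
             "another one"],
    "A": [0, 1, 1],
    "B": [1, 1, 0],
    "C": [0, 0, 0],
    "Label": [1, 0, 1],
})

# Vectorize the Text column and pass A, B, C through unchanged.
# Note the single string "Text" (the vectorizer wants a 1-D column)
# versus a list of columns for the passthrough.
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "Text"),
    ("dummies", "passthrough", ["A", "B", "C"]),
])

pipe = Pipeline([("prep", preprocess), ("nb_clf", MultinomialNB())])
pipe.fit(train_df[["Text", "A", "B", "C"]], train_df["Label"])

# The fitted feature space is the TF-IDF vocabulary plus the 3 dummies
n_text = len(pipe.named_steps["prep"].named_transformers_["text"].vocabulary_)
print(n_text + 3)
```

TF-IDF values and 0/1 dummies are both non-negative, so they remain valid inputs for MultinomialNB.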



from Text and dummy variables in ML - features selection
