Saturday, 20 February 2021

Are all the features correctly selected and used in a classifier?

I would like to know if when I use a classifier, for example:

random_forest_bow = Pipeline([
        ('rf_tfidf',Feat_Selection. countV),
        ('rf_clf',RandomForestClassifier(n_estimators=300,n_jobs=3))
        ])
    
random_forest_ngram.fit(DataPrep.train['Text'],DataPrep.train['Label'])
predicted_rf_ngram = random_forest_ngram.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_ngram == DataPrep.test_news['Label'])

I am also considering other features in the model. I defined X and y as follows:

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.25, random_state=40) 

df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

countV = CountVectorizer()
train_count = countV.fit_transform(df.train['Text'].values)

My dataset looks as follows

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

I would like to use as features also is_it_capital?,is_it_upper?,contains_num?, but since they have binary values (1 or 0, after encoding), I should apply BoW only on Text to extract extra features. Maybe my question is obvious, but since I am a new ML learner and I am not familiar with classifiers and encoding, I will be thankful for all the support and comments that you will provide. Thanks




from Are all the features correctly selected and used in a classifier?

No comments:

Post a Comment