I am using an e-commerce dataset to predict product categories, with the product description and supplier code as features.
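from sklearn import ensemble, model_selection, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# df is a pandas DataFrame with 'description', 'supplier' and 'category' columns
# combine the two text features into one string per product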
df['joined_features'] = df['description'].astype(str) + ' ' + df['supplier'].astype(str)
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['joined_features'], df['category'])
# encode target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping learned from train_y
# count vectorizer object
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(train_x)  # learn the vocabulary from the training split only
# transform training and validation data
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
# train a random forest and predict on the validation set
classifier = ensemble.RandomForestClassifier()
classifier.fit(xtrain_count, train_y)
predictions = classifier.predict(xvalid_count)
I get ~90% accuracy with this prediction. The categories are hierarchical, and the one I predicted is the top level. I now want to predict the levels below it as well.
As an example, I predicted Clothing. Now I want to predict: Clothing -> Shoes
I tried joining both categories into a single label: df['category1'] + df['category2']
and predicting that joined label, but accuracy drops to around 2%, which is really low.
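One thing that may be worth ruling out here (it may or may not explain the 2% figure): LabelEncoder assigns integer codes from the sorted set of labels it was fitted on, so an encoder fitted on the validation labels can disagree with one fitted on the training labels whenever the two splits contain different classes. With many joined categories, some of them rare, that becomes more likely. A minimal sketch with made-up labels (the label strings and the ' > ' separator are assumptions for illustration, not from my dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical joined labels; the validation split is missing one class.
train_labels = ["clothing > shirts", "clothing > shoes", "toys > puzzles"]
valid_labels = ["clothing > shoes", "toys > puzzles"]

# Two encoders fitted independently, one per split.
enc_train = LabelEncoder().fit(train_labels)
enc_valid = LabelEncoder().fit(valid_labels)

# The same label receives a different integer code from each encoder.
train_code = int(enc_train.transform(["clothing > shoes"])[0])
valid_code = int(enc_valid.transform(["clothing > shoes"])[0])
print(train_code, valid_code)
```

If the codes differ, predictions compared against the validation codes are scored against the wrong classes, even when the classifier itself is right. Fitting one encoder on the training labels and only calling transform on the validation labels avoids this.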
What is the proper way to make a classifier in a hierarchical fashion?
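For context, one framing I have seen for this is a "local classifier per parent node": train one model for the top level, then a separate model per top-level category that only chooses among its own children. The sketch below is an illustration under assumptions, not a confirmed answer. The toy DataFrame, the make_model and predict_path helpers, and the column names category1 and category2 are all hypothetical stand-ins.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data (values are made up).
toy = pd.DataFrame({
    "joined_features": [
        "leather running shoes ACME", "canvas sneakers ACME",
        "cotton t-shirt ACME", "linen button shirt ACME",
        "wooden jigsaw puzzle PLAYCO", "1000 piece puzzle PLAYCO",
    ],
    "category1": ["clothing", "clothing", "clothing", "clothing", "toys", "toys"],
    "category2": ["shoes", "shoes", "shirts", "shirts", "puzzles", "puzzles"],
})

def make_model():
    # Same ingredients as the flat model: bag of words + random forest.
    return make_pipeline(CountVectorizer(analyzer="word"),
                         RandomForestClassifier(random_state=0))

# Level 1: predict the main category from all rows.
top_model = make_model().fit(toy["joined_features"].tolist(),
                             toy["category1"].tolist())

# Level 2: one model per main category, trained only on that category's rows.
sub_models = {}
for parent, group in toy.groupby("category1"):
    children = group["category2"].unique()
    if len(children) > 1:
        sub_models[parent] = make_model().fit(group["joined_features"].tolist(),
                                              group["category2"].tolist())
    else:
        # Only one possible child, so no model is needed.
        sub_models[parent] = str(children[0])

def predict_path(text):
    """Predict the main category, then a subcategory within it."""
    parent = str(top_model.predict([text])[0])
    sub = sub_models[parent]
    child = sub if isinstance(sub, str) else str(sub.predict([text])[0])
    return parent, child

print(predict_path("leather running shoes ACME"))
```

By construction, the second-level prediction is always a valid child of the predicted parent, which a single flat model over joined labels does not guarantee. The trade-off is that a mistake at the top level cannot be recovered at the second level.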
from Turning a Multiclass Classifier into a Hierarchical Multiclass Classifier