Monday, 22 February 2021

Data pre-processing steps with different features

I would like to include multiple features in a classifier to improve model performance. I have a dataset similar to this one:

Text                               is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                 0               0             0              0
ANOTHER example of text            1               1             0              1
What's happening? Let's talk at 5  1               0             1              1

I am applying different pre-processing algorithms to Text (BoW, TF-IDF, ...). It was 'easy' to use only the Text column in my classifier by selecting X = df['Text'] and applying the pre-processing algorithm. However, I would now also like to include is_it_capital? and the other variables (except Label) as features, as I found them potentially useful for my classifier. What I tried was the following:

import pandas as pd
import numpy as np

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

from sklearn.base import TransformerMixin

# Turns the sparse output of CountVectorizer into a dense matrix
class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        return X.todense()

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])

# Vectorize Text; pass the remaining feature columns through unchanged
transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# This is the step that fails: after fit_transform, X_train is a
# matrix (sparse here), not a DataFrame, so pd.concat raises the
# TypeError below
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

#Logistic regression
from sklearn.linear_model import LogisticRegression

logR_pipeline = Pipeline([
    ('LogRCV', countV),  # countV is a CountVectorizer defined earlier (not shown)
    ('LogR_clf', LogisticRegression())
])

logR_pipeline.fit(df_train['Text'], df_train['Label'])
predicted_LogR = logR_pipeline.predict(df_test['Text'])
np.mean(predicted_LogR == df_test['Label'])

However, I got this error:

TypeError: cannot concatenate object of type '<class 'scipy.sparse.csr.csr_matrix'>'; only Series and DataFrame objs are valid

Has anyone dealt with a similar problem? How could I fix it? My goal is to include all the features in my classifier.
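
My suspicion is that ColumnTransformer.fit_transform returns a NumPy array or a sparse matrix rather than a DataFrame, so pd.concat has nothing to align. A minimal sketch of what I think might work instead (an untested guess; df is the toy dataframe above, and the names features and pipe are mine) is to keep the transformer and the classifier in a single pipeline and fit it on the raw columns:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Vectorize Text; append the three binary columns unchanged
features = ColumnTransformer(
    [('text', CountVectorizer(), 'Text')],
    remainder='passthrough')

# One pipeline end to end, so no manual concatenation of a sparse
# matrix with pandas is ever needed
pipe = Pipeline([
    ('features', features),
    ('clf', LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40)

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))

This way the vectorizer would be fit on the training fold only, inside the pipeline, but I am not sure it is the right approach.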

UPDATE:

I also tried this:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Wraps a vectorizer so it can sit inside a ColumnTransformer that
# selects the text column(s) as a DataFrame
class custom_count_v(BaseEstimator, TransformerMixin):
    def __init__(self, tfidf):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        # Join the selected columns row-wise into a single string
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)
        return self

    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        return self.tfidf.transform(joined_X)


count_v = CountVectorizer()

clmn = ColumnTransformer([("count", custom_count_v(count_v), ['Text'])], remainder="passthrough")
clmn.fit_transform(df)

This does not return any error, but it is not clear whether I am including all the features correctly, or whether I need to fit the transformer before or after the train/test split. It would be extremely helpful if you could show me the steps up to the application of the classifier:

#Logistic regression
logR_pipeline = Pipeline([
    ('LogRCV', ....),
    ('LogR_clf', LogisticRegression())
])

logR_pipeline.fit(....)
predicted_LogR = logR_pipeline.predict(...)
np.mean(predicted_LogR == ...)

where instead of the dots there should be a dataframe or a column (depending on the transformation and concatenation, I guess), so that I can better understand the steps and the mistakes I made.
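
To make the question concrete, here is my best guess at the filled-in template (a sketch that reuses clmn and df from the update above; note that I pass X without Label, since otherwise remainder="passthrough" would also pass the label through as a feature):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Split the raw columns first; fitting the pipeline afterwards means
# the vectorizer sees only the training fold
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40)

#Logistic regression
logR_pipeline = Pipeline([
    ('LogRCV', clmn),  # the ColumnTransformer from the update above
    ('LogR_clf', LogisticRegression())
])

logR_pipeline.fit(X_train, y_train)
predicted_LogR = logR_pipeline.predict(X_test)
np.mean(predicted_LogR == y_test)

Is this the right order of operations, or should the transformer be fit before the split?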


