Tuesday 14 January 2020

scikit-learn: FeatureUnion to include hand crafted features

I am performing multi-label classification on text data. I want to combine TF-IDF features with custom linguistic features using FeatureUnion, similar to the example here.

I have already generated the custom linguistic features. They are stored in a dictionary whose keys are the labels and whose values are lists of keywords for each label.

custom_features_dict = {'contact':['contact details', 'e-mail'], 
                       'demographic':['gender', 'age', 'birth'],
                       'location':['location', 'geo']}

Training data structure is as follows:

text                                            contact  demographic  location
---                                              ---      ---          ---
'provide us with your date of birth and e-mail'  1        1            0
'contact details and location will be stored'    1        0            1
'date of birth should be before 2004'            0        1            0

How can the above dict be incorporated into FeatureUnion? My understanding is that I need a user-defined function that, for each training text, returns a boolean per label indicating whether any of that label's strings (from custom_features_dict) occurs in the text.
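A minimal sketch of such a function, assuming plain case-insensitive substring matching is acceptable. The name keyword_flags is hypothetical, and substring matching is a simplification: a keyword such as 'age' would also match inside 'page' or 'storage', so word-boundary matching may be preferable on real data.

```python
custom_features_dict = {
    "contact": ["contact details", "e-mail"],
    "demographic": ["gender", "age", "birth"],
    "location": ["location", "geo"],
}


def keyword_flags(text, keyword_map):
    """Return {label: 0 or 1}; 1 if any keyword of that label occurs in text.

    Hypothetical helper: uses case-insensitive substring matching.
    """
    lowered = text.lower()
    return {
        label: int(any(keyword in lowered for keyword in keywords))
        for label, keywords in keyword_map.items()
    }


texts = [
    "provide us with your date of birth and e-mail",
    "contact details and location will be stored",
    "date of birth should be before 2004",
]

# One dict of 0/1 flags per training text
my_features = [keyword_flags(text, custom_features_dict) for text in texts]
```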

This gives the following list of dicts for the given training data:

[
    {
       'contact':1,
       'demographic':1,
       'location':0
    },
    {
       'contact':1,
       'demographic':0,
       'location':1
    },
    {
       'contact':0,
       'demographic':1,
       'location':0
    },
] 

How can the above list be used to implement fit and transform?

The code is given below:

from io import StringIO

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline  # FeatureUnion was missing
from sklearn.svm import LinearSVC

# stopwords.words() already returns a list, the type TfidfVectorizer documents for stop_words
stop_words = stopwords.words('english')

data = StringIO(u'''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,1,0,1
date of birth should be before 2004,0,1,0''')

df = pd.read_csv(data)

custom_features_dict = {'contact':['contact details', 'e-mail'], 
                        'demographic':['gender', 'age', 'birth'],
                        'location':['location', 'geo']}

my_features = [
    {
       'contact':1,
       'demographic':1,
       'location':0
    },
    {
       'contact':1,
       'demographic':0,
       'location':1
    },
    {
       'contact':0,
       'demographic':1,
       'location':0
    },
]

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        # This needs to be fixed: a step must be a transformer object, not a list of dicts
        ("custom_features", my_features),
        ("dict_vect", DictVectorizer()),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline(
    steps=[
        ("combined_features", combined_features),
        ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ]
)

labels = ['contact', 'demographic', 'location']

for label in labels:
    final_pipeline.fit(df['text'], df[label]) 
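
One possible way to fill in the step marked as needing a fix is to wrap the keyword lookup in a small custom transformer whose transform returns the list of dicts that DictVectorizer expects. The sketch below makes several assumptions: the class name KeywordPresence is hypothetical, matching is by case-insensitive substring, scikit-learn's built-in 'english' stop-word list stands in for the NLTK one, and the classifier is fitted once on the full label indicator matrix instead of once per label.

```python
from io import StringIO

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class KeywordPresence(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each text to a dict of 0/1 keyword flags."""

    def __init__(self, keyword_map):
        self.keyword_map = keyword_map

    def fit(self, X, y=None):
        # Nothing is learned from the data; the keyword lists are fixed.
        return self

    def transform(self, X):
        rows = []
        for text in X:
            lowered = text.lower()
            rows.append({
                label: int(any(kw in lowered for kw in keywords))
                for label, keywords in self.keyword_map.items()
            })
        return rows  # list of dicts, consumed by DictVectorizer


data = StringIO('''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,1,0,1
date of birth should be before 2004,0,1,0''')
df = pd.read_csv(data)

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

manual_pipeline = Pipeline(steps=[
    ("custom_features", KeywordPresence(custom_features_dict)),
    ("dict_vect", DictVectorizer()),
])

combined_features = FeatureUnion(transformer_list=[
    ("bow", TfidfVectorizer(stop_words="english")),
    ("manual", manual_pipeline),
])

# Combined matrix: TF-IDF columns first, then one column per label's keywords
X = combined_features.fit_transform(df['text'])

final_pipeline = Pipeline(steps=[
    ("combined_features", combined_features),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

labels = ['contact', 'demographic', 'location']
# A 2-D indicator matrix makes OneVsRestClassifier fit one classifier per label
final_pipeline.fit(df['text'], df[labels].to_numpy())
predictions = final_pipeline.predict(df['text'])
```

Because fit does nothing here, the transformer can be applied to unseen text without refitting. Note that the per-label loop in the question refits the same pipeline on each pass, so only the model for the last label survives; fitting on the indicator matrix keeps all three.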


