I am performing multi-label classification on text data. I want to combine TF-IDF features with custom linguistic features using FeatureUnion, similar to the example here.
I have already generated the custom linguistic features. They are stored in a dictionary whose keys are the labels and whose values are lists of feature strings.
custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}
The training data is structured as follows:
text                                             contact  demographic  location
---                                              ---      ---          ---
'provide us with your date of birth and e-mail'  1        1            0
'contact details and location will be stored'    1        0            1
'date of birth should be before 2004'            0        1            0
How can the above dict be incorporated into FeatureUnion? My understanding is that a user-defined function should be called that returns boolean values corresponding to the presence or absence of the string values (from custom_features_dict) in the training data.
This gives the following list of dict for the given training data:
[
    {
        'contact': 1,
        'demographic': 1,
        'location': 0
    },
    {
        'contact': 1,
        'demographic': 0,
        'location': 1
    },
    {
        'contact': 0,
        'demographic': 1,
        'location': 0
    },
]
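The list above could be produced by a small helper along the following lines. This is only a sketch: the function name keyword_flags is hypothetical, and it assumes that "presence" means a case-insensitive substring match of any keyword in the text.

```python
custom_features_dict = {
    'contact': ['contact details', 'e-mail'],
    'demographic': ['gender', 'age', 'birth'],
    'location': ['location', 'geo'],
}

texts = [
    'provide us with your date of birth and e-mail',
    'contact details and location will be stored',
    'date of birth should be before 2004',
]


def keyword_flags(text, keyword_dict):
    """Return {label: 1 or 0}, 1 if any of the label's keywords occurs in text."""
    lowered = text.lower()
    return {
        label: int(any(keyword in lowered for keyword in keywords))
        for label, keywords in keyword_dict.items()
    }


# One dict per training text, in the same order as the texts
my_features = [keyword_flags(text, custom_features_dict) for text in texts]
print(my_features)
```

Note that plain substring matching is crude: a keyword such as 'age' would also match 'page' or 'storage'. Matching on word boundaries may be preferable for real data.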
How can the above list be used to implement fit and transform?
The code is given below:
import pandas as pd
from io import StringIO
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
#from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
# FeatureUnion is used below, so it has to be imported as well
from sklearn.pipeline import FeatureUnion, Pipeline

# A list (rather than a set) of stop words is accepted across scikit-learn versions
stop_words = stopwords.words('english')
data = StringIO(u'''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,1,0,1
date of birth should be before 2004,0,1,0''')
df = pd.read_csv(data)
custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}
my_features = [
    {
        'contact': 1,
        'demographic': 1,
        'location': 0
    },
    {
        'contact': 1,
        'demographic': 0,
        'location': 1
    },
    {
        'contact': 0,
        'demographic': 1,
        'location': 0
    },
]
bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)
manual_pipeline = Pipeline(
    steps=[
        # This needs to be fixed: my_features is a plain list,
        # not a transformer with fit and transform methods
        ("custom_features", my_features),
        ("dict_vect", DictVectorizer()),
    ]
)
combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)
final_pipeline = Pipeline([
    ('combined_features', combined_features),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
labels = ['contact', 'demographic', 'location']
for label in labels:
    final_pipeline.fit(df['text'], df[label])
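One possible way to make the "custom_features" step work is to wrap the keyword lookup in a small transformer class that provides fit and transform, so that it can sit in front of DictVectorizer. The following is a minimal sketch under several assumptions, not a definitive answer: the class name KeywordPresence is hypothetical, keywords are matched as case-insensitive substrings, scikit-learn's built-in stop_words="english" is swapped in for the NLTK list to keep the example self-contained, and one cloned pipeline is kept per label instead of refitting the same object in the loop.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class KeywordPresence(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one 0/1 flag per key of keyword_dict."""

    def __init__(self, keyword_dict):
        self.keyword_dict = keyword_dict

    def fit(self, X, y=None):
        # Nothing is learned; the keyword lists are fixed up front.
        return self

    def transform(self, X):
        # Return one dict per text, which DictVectorizer turns into columns.
        return [
            {label: int(any(kw in text.lower() for kw in kws))
             for label, kws in self.keyword_dict.items()}
            for text in X
        ]


custom_features_dict = {
    'contact': ['contact details', 'e-mail'],
    'demographic': ['gender', 'age', 'birth'],
    'location': ['location', 'geo'],
}
df = pd.DataFrame({
    'text': ['provide us with your date of birth and e-mail',
             'contact details and location will be stored',
             'date of birth should be before 2004'],
    'contact': [1, 1, 0],
    'demographic': [1, 0, 1],
    'location': [0, 1, 0],
})
labels = ['contact', 'demographic', 'location']

manual_pipeline = Pipeline([
    ('custom_features', KeywordPresence(custom_features_dict)),
    ('dict_vect', DictVectorizer()),
])
combined_features = FeatureUnion([
    ('bow', TfidfVectorizer(stop_words='english')),
    ('manual', manual_pipeline),
])
final_pipeline = Pipeline([
    ('combined_features', combined_features),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

# Fit an independent copy of the pipeline for each label
models = {label: clone(final_pipeline).fit(df['text'], df[label])
          for label in labels}
```

Because the custom step receives the same raw texts as the TF-IDF branch, the flags are recomputed for any new text passed to predict, which a precomputed list such as my_features could not do.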
from scikit-learn: FeatureUnion to include hand crafted features