There are similar questions and libraries like ELI5 and LIME, but I couldn't find a solution to my problem. I have a set of documents and I am trying to cluster them with scikit-learn's DBSCAN. First, I vectorize the documents with TfidfVectorizer; then I cluster the vectors and receive the predicted labels. My question is: how can I explain why a cluster formed? Imagine there are two predicted clusters (cluster 1 and cluster 2). Which features (since the input data consists of vectorized documents, the features are vectorized "words") were important for the creation of cluster 1 (or cluster 2)?
Below is a minimal example of what I am currently working on. It is not a minimal working example of what I am trying to achieve, since I don't know how to achieve it.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)

# Peek at the raw documents alongside their true labels
visualize_train_data = pd.DataFrame(
    data=np.c_[twenty_train['data'], twenty_train['target']])
print(visualize_train_data.head())

# Vectorize the documents, then cluster the resulting vectors
vec = TfidfVectorizer(min_df=3, stop_words='english', ngram_range=(1, 2))
vectorized_train_data = vec.fit_transform(twenty_train.data)

clustering = DBSCAN(eps=0.6, min_samples=2).fit(vectorized_train_data)
print(f"Unique labels are {np.unique(clustering.labels_)}")
Side notes: the question I linked focuses specifically on the k-Means algorithm, and its answer isn't very intuitive (for me). ELI5 and LIME are great libraries, but their examples are regression- or classification-related (not clustering), and their regressors and classifiers support "predict" directly. DBSCAN doesn't...
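To illustrate that last point, the class interfaces themselves show the gap: KMeans exposes a `predict` method for new samples, while DBSCAN only offers `fit`/`fit_predict` on the data it was given.

```python
from sklearn.cluster import DBSCAN, KMeans

# KMeans learns centroids, so it can assign labels to unseen points
print(hasattr(KMeans(), "predict"))       # True
# DBSCAN has no model to apply to new points, hence no predict method
print(hasattr(DBSCAN(), "predict"))       # False
print(hasattr(DBSCAN(), "fit_predict"))   # True
```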
from How to explain text clustering result by feature importance? (DBSCAN)