Hemant Vishwakarma: What is the default smartirs for gensim TfidfModel?

Thursday, 27 December 2018

What is the default smartirs for gensim TfidfModel?

Using gensim:

from gensim.models import TfidfModel
from gensim.corpora import Dictionary

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

dataset = [sent0, sent1]
vocab = Dictionary(dataset)
corpus = [vocab.doc2bow(sent) for sent in dataset] 
model = TfidfModel(corpus)

# To retrieve the same pd.DataFrame format.
documents_tfidf_lol = [{vocab[word_idx]:tfidf_value for word_idx, tfidf_value in sent} for sent in model[corpus]]
documents_tfidf = pd.DataFrame(documents_tfidf_lol)
documents_tfidf.fillna(0, inplace=True)

documents_tfidf

[out]:

    dog mr  quick
0   0.707107    0.0 0.707107
1   0.000000    1.0 0.000000

If we do the TF-IDF computation manually,

sent0 = "The quick brown fox jumps over the lazy brown dog .".lower().split()
sent1 = "Mr brown jumps over the lazy fox .".lower().split()

documents = pd.DataFrame.from_dict(list(map(Counter, [sent0, sent1])))
documents.fillna(0, inplace=True, downcast='infer')
documents = documents.apply(lambda x: x/sum(x))  # Normalize the TF.
documents.head()

# To compute the IDF for all words.
num_sentences, num_words = documents.shape

idf_vector = [] # Lets save an ordered list of IDFS w.r.t. order of the column names.

for word in documents:
  word_idf = math.log(num_sentences/len(documents[word].nonzero()[0]))
  idf_vector.append(word_idf)

# Compute the TF-IDF table.
documents_tfidf = pd.DataFrame(documents.as_matrix() * np.array(idf_vector), 
                               columns=list(documents))
documents_tfidf

[out]:

    .   brown   dog fox jumps   lazy    mr  over    quick   the
0   0.0 0.0 0.693147    0.0 0.0 0.0 0.000000    0.0 0.693147    0.0
1   0.0 0.0 0.000000    0.0 0.0 0.0 0.693147    0.0 0.000000    0.0

If we use math.log2 instead of math.log:

    .   brown   dog fox jumps   lazy    mr  over    quick   the
0   0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1   0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

It looks like gensim:

remove the non-salient words from the TF-IDF model, it's evident when we print(model[corpus])
maybe the log base seem to be different from the log_2
maybe there's some normalization going on.

Looking at https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel , the smart scheme difference would have output different values but it's not clear in the docs what is the default value.

What is the default smartirs for gensim TfidfModel?

What are the other default parameters that've caused the difference between a natively implemented TF-IDF and gensim's?

from What is the default smartirs for gensim TfidfModel?

Hemant Vishwakarma

Thursday, 27 December 2018

What is the default smartirs for gensim TfidfModel?

No comments:

Post a Comment