Monday, 22 April 2019

Calculate TF-IDF for a single word in Textacy

I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but am a bit unclear about the result I am receiving.

I was expecting a single float which represented the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?

"acculer" is actually a French word, so I was expecting a result of 0 from an English corpus.

import logging
import textacy

logger = logging.getLogger(__name__)

word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)

Output

tf_idf:
(0, 0)  2.386294361119891
(1, 1)  1.9808292530117262
(2, 1)  1.9808292530117262
(3, 5)  2.386294361119891
(4, 3)  2.386294361119891
(5, 2)  2.386294361119891
(6, 4)  2.386294361119891

The second part of the question is how can I provide my own corpus to the TF-IDF function in Textacy, esp. one in a different language?

EDIT

As mentioned by @Vishal, I have logged the output using this line:

logger.info(vectorizer.vocabulary_terms)

It seems the provided word acculer has been split into characters.

{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}
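That would also explain the 7 rows: iterating over a Python string yields one character at a time, so each of the 7 characters seems to have been treated as its own one-term document. The sketch below reproduces the logged values in plain Python rather than through Textacy. It assumes the smoothed IDF is log((n_docs + 1) / (df + 1)) + 1, which is an assumption on my part that happens to match the numbers above; the names docs, df and idf are only illustrative.

```python
import math
from collections import Counter

word = "acculer"

# Iterating over a string yields characters, so a vectorizer that expects an
# iterable of documents ends up with 7 one-term "documents".
docs = [[ch] for ch in word]
n_docs = len(docs)

# Document frequency: how many "documents" contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Assumed smoothed IDF: log((n_docs + 1) / (df + 1)) + 1
idf = {term: math.log((n_docs + 1) / (df[term] + 1)) + 1 for term in df}

for row, doc in enumerate(docs):
    term = doc[0]
    # Each "document" holds its term exactly once, so tf * idf equals idf.
    print(row, term, idf[term])
```

Under that assumption, 'c' (which appears in two of the seven "documents") gets about 1.98 and every other character gets about 2.39, in line with the output logged earlier.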

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

(2) How can I provide my own corpus and point to it as a param?

(3) Can TF-IDF be used at a sentence level? i.e., what is the relative frequency of this sentence's terms against the corpus?
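To make the three questions concrete, here is a minimal sketch of what I am after, written in plain Python rather than with Textacy. It rests on several assumptions: the corpus is a list of already tokenized documents, the IDF is the same smoothed form as above, and the helper tf_idf and the tiny French corpus are made up purely for illustration. Note that TF-IDF is defined per term and per document, so a sentence has to be treated as a document to get a score for each of its terms.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF of `term` in one tokenized document, relative to a tokenized corpus."""
    tf = doc_tokens.count(term)                          # raw term frequency
    df = sum(1 for doc in corpus_tokens if term in doc)  # document frequency
    n_docs = len(corpus_tokens)
    idf = math.log((n_docs + 1) / (df + 1)) + 1          # assumed smoothed IDF
    return tf * idf

# Hypothetical mini-corpus: each document is a list of tokens.
corpus = [
    "il faut acculer l'adversaire".split(),
    "le chat dort".split(),
    "le chien dort aussi".split(),
]

# Sentence-level use: treat the sentence as a document and score each term.
sentence = "le chat veut acculer le chien".split()
scores = {term: tf_idf(term, sentence, corpus) for term in set(sentence)}

for term, score in sorted(scores.items()):
    print(term, round(score, 3))
```

With this definition, a word that does not occur in the document being scored gets 0 because its term frequency is 0, regardless of the corpus.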



