I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but I don't understand the result I am receiving.
I was expecting a single float representing the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?
"acculer" is actually a French word, so I was expecting a result of 0 from an English corpus.
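import logging
import textacy
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)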
Output
tf_idf:
(0, 0) 2.386294361119891
(1, 1) 1.9808292530117262
(2, 1) 1.9808292530117262
(3, 5) 2.386294361119891
(4, 3) 2.386294361119891
(5, 2) 2.386294361119891
(6, 4) 2.386294361119891
The second part of my question: how can I provide my own corpus to the TF-IDF function in Textacy, especially one in a different language?
EDIT
As suggested by @Vishal, I logged the output using this line:
logger.info(vectorizer.vocabulary_terms)
It seems the provided word acculer has been split into characters.
{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}
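The numbers in the output can be reproduced without Textacy, which supports the character-splitting explanation. The following is a minimal sketch, not Textacy's actual implementation. It assumes each character of the string is treated as its own one-term document, that vocabulary ids follow sorted term order, and that the smoothed IDF is ln((N + 1) / (df + 1)) + 1. These assumptions are inferred from the logged values, not taken from the library source.

```python
import math
from collections import Counter

word = "acculer"

# Iterating over a str yields single characters, so each character
# ends up as a separate "document" containing exactly one term.
docs = [[ch] for ch in word]

# Assumption: term ids are assigned in sorted order of the terms.
vocab = {term: i for i, term in enumerate(sorted({t for d in docs for t in d}))}

n_docs = len(docs)                               # 7 one-character documents
df = Counter(t for d in docs for t in set(d))    # document frequency per term


def smooth_idf(term):
    # Assumed smoothing: ln((N + 1) / (df + 1)) + 1
    return math.log((n_docs + 1) / (df[term] + 1)) + 1.0


# Linear term frequency is 1 for every document here, so weight == idf.
rows = [(i, vocab[d[0]], smooth_idf(d[0])) for i, d in enumerate(docs)]
for row, col, weight in rows:
    print((row, col), weight)
```

Under these assumptions, 'c' is the only term that occurs in two documents, which is why rows 1 and 2 carry the lower weight (about 1.98) while the other five rows share the higher one (about 2.39).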
(1) How can I get the TF-IDF for this word against the corpus, rather than for each character?
(2) How can I provide my own corpus and pass it in as a parameter?
(3) Can TF-IDF be used at the sentence level, i.e., what is the relative frequency of this sentence's terms against the corpus?
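To make the three questions concrete, here is a library-free sketch of what is being asked for. It is not Textacy's API: the function names `smooth_idf` and `tfidf` and the three-sentence French toy corpus are hypothetical, and the smoothing formula ln((N + 1) / (df + 1)) + 1 is the same assumption as before. The key point is the input shape: the corpus is a list of documents, each of which is a list of word tokens, not a bare string.

```python
import math


def smooth_idf(term, corpus):
    # Assumed smoothing: ln((N + 1) / (df + 1)) + 1
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs + 1) / (df + 1)) + 1.0


def tfidf(term, doc, corpus):
    # Linear term frequency (raw count in the document) times smoothed IDF.
    return doc.count(term) * smooth_idf(term, corpus)


# Hypothetical user-supplied corpus: a list of tokenized documents.
corpus = [
    ["le", "chat", "dort"],
    ["le", "chien", "veut", "acculer", "le", "chat"],
    ["un", "oiseau", "chante"],
]

# (1) One word against the corpus, scored within a document that contains it.
word_score = tfidf("acculer", corpus[1], corpus)

# A document that does not contain the word has term frequency 0, so scores 0.
missing_score = tfidf("acculer", corpus[0], corpus)

# (3) Sentence level: score every distinct term of one sentence.
sentence_scores = {t: tfidf(t, corpus[1], corpus) for t in set(corpus[1])}

print(word_score, missing_score)
print(sentence_scores)
```

In this framing, a word that never occurs in a document scores 0 for that document because its term frequency is 0, which is the behaviour the question expected for a French word against English text.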
from Calculate TD-IDF for a single word in Textacy