Saturday, 24 July 2021

Python - Using TF-IDF to summarise dataframe text column

I have a dataframe with a column containing text.

I want to create a new column that contains a tuple/list of the top 'n' TF-IDF scoring words in each row as a way of summarizing what is in the text.

A small, simplified example dataframe is:

import pandas as pd

df = pd.DataFrame({'Ref': [1, 2, 3, 4, 5],
                   'Text': ["the cow jumped off the other cow",
                            "the fox had a fox",
                            "the spanner was a tool to tool",
                            "the football player played football",
                            "the house had a house"]})

I have spent the last few days trying to find a solution, but I can only find examples that find the top TF-IDF words for the whole corpus, rather than for each row of a dataframe scored against the whole corpus.

Can anyone steer me in the right direction?
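
One possible direction, as a minimal sketch rather than a definitive answer: score every word in a row by its term frequency in that row times the log inverse document frequency over all rows, then keep the n best words per row. The function name top_tfidf_words, the whitespace tokenisation, the unsmoothed idf formula and the alphabetical tie-break are all assumptions made for illustration. A library such as scikit-learn's TfidfVectorizer tokenises, smooths and normalises differently, so its scores will not match these exactly.

```python
import math
from collections import Counter


def top_tfidf_words(texts, n=2):
    """Return, for each text, its n highest-scoring TF-IDF words.

    Hypothetical helper: tf = count / row length, idf = log(N / document frequency).
    """
    # Naive tokenisation (assumption): lower-case and split on whitespace.
    docs = [text.lower().split() for text in texts]
    n_docs = len(docs)

    # Document frequency: in how many rows does each word appear at least once?
    doc_freq = Counter(word for doc in docs for word in set(doc))

    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically (assumption).
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:n])
    return results


texts = [
    "the cow jumped off the other cow",
    "the fox had a fox",
    "the spanner was a tool to tool",
    "the football player played football",
    "the house had a house",
]
summary = top_tfidf_words(texts, n=2)
for text, words in zip(texts, summary):
    print(text, "->", words)
```

With the example dataframe, something like df['Top_words'] = top_tfidf_words(df['Text'].tolist(), n=2) should attach the per-row result as a new column. Note that "the" scores zero here because it appears in every row, which is why it never makes the top of any row.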



