Thursday, 29 October 2020

Sentence embeddings using word2vec

I'd like to compare how the meaning of the same word, for example "travel", differs across the sentences that mention it. What I would like to do is:

  • Take the sentences mentioning the term "travel" as plain text;
  • In each sentence, replace "travel" with travel_sent_x (see the sketch after this list);
  • Train a word2vec model on these sentences;
  • Calculate the distance between travel_sent1, travel_sent2, and the other relabelled mentions of "travel". This way each sentence's "travel" gets its own vector, which can then be used for comparison.
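
For illustration, the relabelling in the second step could look like this (a minimal sketch using two of the sample sentences shown further down; lowercasing before the replacement is my own assumption, so the match is case-insensitive):

    # Step 2: give each sentence's mention of "travel" its own token
    sentences = [
        "Hawaii makes a move to boost domestic travel and support local tourism",
        "Honolulu makes a move to boost travel and support local tourism",
    ]
    relabelled = [
        s.lower().replace("travel", f"travel_sent{i + 1}")
        for i, s in enumerate(sentences)
    ]
    # relabelled[1] -> "honolulu makes a move to boost travel_sent2 and support local tourism"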

I know that word2vec requires much more than a handful of sentences to train reliable vectors. The official page recommends datasets of billions of words, but I have nowhere near that many in my dataset (only thousands of words).

I was trying to test the model with the following few sentences:

    Sentences
    Hawaii makes a move to boost domestic travel and support local tourism
    Honolulu makes a move to boost travel and support local tourism
    Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses

My approach to building the vectors has been:

from gensim.models import Word2Vec

# Word2Vec expects tokenised sentences (lists of tokens), not raw strings
vocab = [s.lower().split() for s in df['Sentences']]

# min_count=1, otherwise the travel_sent_x tokens, each occurring only once,
# would be dropped from the vocabulary
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=1, workers=4, sg=0)

# look up the vector of one relabelled mention of "travel"
vec = model.wv['travel_sent1']
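
Once the model is trained, the distances from the last step can be read off directly with gensim's cosine similarity (a sketch; the token names assume the relabelling described above has been applied):

    # compare two relabelled "travel" tokens (cosine similarity)
    print(model.wv.similarity('travel_sent1', 'travel_sent2'))

    # or rank the vocabulary by closeness to one mention
    print(model.wv.most_similar('travel_sent1'))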

However, I do not know how to visualise the results to check their similarity and get some useful insight. Any help and advice will be welcome.

Update: I would use the Principal Component Analysis (PCA) algorithm to visualise the embeddings in 3-dimensional space. I know how to do this for each individual word, but I do not know how to do it for sentences.
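
For reference, a minimal sketch of such a projection with scikit-learn and matplotlib (illustrative only, not the post's own code; the token list assumes the relabelling above, and with such a tiny corpus the plot mainly demonstrates the mechanics):

    from sklearn.decomposition import PCA
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed on older matplotlib
    import matplotlib.pyplot as plt

    tokens = ['travel_sent1', 'travel_sent2', 'travel_sent3']
    vectors = [model.wv[t] for t in tokens]

    # project the 100-dimensional vectors onto 3 principal components
    coords = PCA(n_components=3).fit_transform(vectors)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
    for (x, y, z), token in zip(coords, tokens):
        ax.text(x, y, z, token)
    plt.show()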



