Wednesday, 6 October 2021

What is the best way to get accurate text similarity in python for comparing single words or bigrams?

I've got similar product data in both the products_a array and products_b array:

products_a = [{"color": "White", "size": "2' 3\""}, {"color": "Blue", "size": "5' 8\""}]
products_b = [{"color": "Black", "size": "2' 3\""}, {"color": "Sky blue", "size": "5' 8\""}]

I would like to be able to accurately measure the similarity between the colors in the two arrays, as a score between 0 and 1. For example, comparing "Blue" against "Sky blue" should score fairly close to 1.00 (something like 0.78).
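
To make the goal concrete, here is a minimal sketch of the comparison I have in mind: every color in products_a is scored against every color in products_b. The names placeholder_score and compare_colors are hypothetical, and the scorer is only a stand-in (the character-level ratio from the standard library's difflib), not a method I am claiming is accurate. Any function that returns a value between 0 and 1 could be swapped in.

```python
from difflib import SequenceMatcher
from itertools import product

products_a = [{"color": "White", "size": "2' 3\""}, {"color": "Blue", "size": "5' 8\""}]
products_b = [{"color": "Black", "size": "2' 3\""}, {"color": "Sky blue", "size": "5' 8\""}]


def placeholder_score(text_a, text_b):
    """Stand-in scorer (assumption): difflib's character-level ratio, 0.0 to 1.0."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()


def compare_colors(items_a, items_b, score=placeholder_score):
    """Score every color in items_a against every color in items_b."""
    return {
        (a["color"], b["color"]): score(a["color"], b["color"])
        for a, b in product(items_a, items_b)
    }


scores = compare_colors(products_a, products_b)
for (color_a, color_b), value in scores.items():
    print(f"{color_a!r} vs {color_b!r}: {value:.2f}")
```

Passing the scorer in as a parameter keeps the pairing logic separate from whichever similarity measure ends up being used.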

spaCy Similarity

I tried using spaCy to solve this:

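import spacy

# Load the small English pipeline once, at import time
nlp = spacy.load('en_core_web_sm')

def similarityscore(text1, text2):
    # Parse both strings and compare the resulting Doc objects
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    similarity = doc1.similarity(doc2)
    return similarity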

When passing in "Blue" against "Sky blue", it returns 0.6545742918773636. That seems acceptable, but what happens when passing in "White" against "Black"? The score is 0.8176945362451089, meaning spaCy considers "White" and "Black" to be ~82% similar! That is a failure when the goal is to make sure product colors are not treated as similar.

Jaccard Similarity

I tried Jaccard similarity on "White" against "Black" using the code below and got a score of 0.0 (maybe overkill for single words, but it leaves room for larger corpora in the future):

# remove punctuation and lowercase all words function
def simplify_text(text):
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

# Jaccard function
def jaccardSimilarity(text_a, text_b):
    # Build the set of unique lowercase words for each input
    word_set_a, word_set_b = [set(simplify_text(text).split())
                                for text in [text_a, text_b]]
    num_shared = len(word_set_a & word_set_b)
    num_total = len(word_set_a | word_set_b)
    # Guard against division by zero when both inputs are empty
    jaccard = num_shared / num_total if num_total else 0.0
    return jaccard
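
For reference, here is a self-contained sketch of the same word-level Jaccard calculation applied to the color pairs above. The name jaccard_similarity is a hypothetical standalone variant, written so the snippet runs on its own. Because the score is the size of the shared word set divided by the size of the combined word set, it only rewards exact word matches.

```python
def simplify_text(text):
    """Strip basic punctuation and lowercase the text."""
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()


def jaccard_similarity(text_a, text_b):
    """Word-level Jaccard: |A & B| / |A | B| over the sets of unique words."""
    word_set_a = set(simplify_text(text_a).split())
    word_set_b = set(simplify_text(text_b).split())
    num_total = len(word_set_a | word_set_b)
    if num_total == 0:
        # Both inputs were empty; treat as no similarity
        return 0.0
    return len(word_set_a & word_set_b) / num_total


print(jaccard_similarity("White", "Black"))    # 0.0 (no shared words)
print(jaccard_similarity("Blue", "Sky blue"))  # 0.5 (1 shared word out of 2)
```

With single words the result can only be 0.0 or 1.0, which is why "White" against "Black" comes out as exactly 0.0.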

Getting scores as different as 0.0 and 0.8176945362451089 for "White" against "Black" is not acceptable to me. I am still looking for a more accurate way of solving this. Even taking the mean of the two would not be accurate. Please let me know if you have any better approaches.



