Wednesday 23 September 2020

Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online so I'm looking for suggestions.

Suppose I want to collect a list containing over hundred thousand strings, values of these strings are sentences that a user has typed. The values are added to the list as soon as a user types a new message. For example:

["Hello world!", "Good morning, my name is John", "Good morning, everyone"]

But I also want to have a timeout for each string so if they are not repeated within 5 min, they should be removed, so I change it to following format:

[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]

Now suppose a user types the following message:

Good morning, everyBODY I want to compare this string to all the messages in list and if one is 70% or more similar, update the count of that message, otherwise insert it as a new message. For this message for example, the application should update the count for Good morning, everyone to be equal to 2.

Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deleting after the timeout.

What is the best way to implement this? or are there any libraries to help me with this?

NOTE: The strings do not need to be in an array, any data structure would work.

The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value. For example warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes

I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.

Update: I plan on using Levenshtein distance for string similarity, however the main problem is how to apply that to a list of strings in most time efficient way, without having to check every single string every time a new message is added.



from Find the most similar text in a list of strings

No comments:

Post a Comment