Sunday 31 January 2021

Query segmentation with spell check

Assuming I have a fixed list of multi-word names like:

  Water
  Tocopherol (Vitamin E)
  Vitamin D
  PEG-60 Hydrogenated Castor Oil

I want the following input/output results:

  1. Water, PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
  2. PEG-60 Hydrnated Castor Oil -> PEG-60 Hydrogenated Castor Oil
  3. wter PEG-60 Hydrnated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
  4. Vitamin E -> Tocopherol (Vitamin E)

I need it to be performant, and it must be able to recognize when there are either too many close matches or no close matches at all. Case 1 is relatively easy because I can split on the comma. The input list is comma-separated most of the time, so this works for roughly 80% of inputs, but even then there is a small issue. Take case 4: once separated, its ideal match is not returned by most spell-check libraries (I've tried a number of them) because the edit distance to Vitamin D is much smaller than the distance to Tocopherol (Vitamin E). Some websites do this well, but I'm lost as to how to do it myself.
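To make the matching requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that a parenthesised part such as (Vitamin E) can be treated as an alias of the full name, and it uses the similarity ratio from Python's standard difflib module rather than a true edit distance. The names build_index and match, and the cutoff and margin values, are hypothetical and would need tuning. It scans every key linearly, so it does not address the performance requirement for large lists.

```python
import re
from difflib import SequenceMatcher

NAMES = [
    "Water",
    "Tocopherol (Vitamin E)",
    "Vitamin D",
    "PEG-60 Hydrogenated Castor Oil",
]

def build_index(names):
    """Map every searchable key to its canonical name.

    Keys are the full name plus, for names shaped like "A (B)",
    the outer part "A" and the alias "B" (an assumption of this sketch).
    """
    index = {}
    for name in names:
        keys = {name}
        m = re.fullmatch(r"(.+?)\s*\((.+)\)", name)
        if m:
            keys.update(m.groups())
        for key in keys:
            index[key.lower()] = name
    return index

def match(query, index, cutoff=0.8, margin=0.05):
    """Return (status, candidates) where status is "ok", "ambiguous" or "none"."""
    q = query.strip().lower()
    best = {}  # canonical name -> best similarity over all of its keys
    for key, name in index.items():
        score = SequenceMatcher(None, q, key).ratio()
        if score >= cutoff:
            best[name] = max(score, best.get(name, 0.0))
    if not best:
        return "none", []
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    # Everything within `margin` of the top score counts as a close match.
    close = [name for name, score in ranked if top - score <= margin]
    return ("ok" if len(close) == 1 else "ambiguous"), close

index = build_index(NAMES)
print(match("Vitamin E", index))
print(match("PEG-60 Hydrnated Castor Oil", index))
print(match("Vitamin", index))
```

With the alias key in place, Vitamin E scores an exact match against its own alias and beats Vitamin D by more than the margin, while a bare Vitamin is equally close to both and is reported as ambiguous instead of being silently resolved.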

The second part of the problem is how to do word segmentation on top of this. If a given list doesn't contain commas, I need to be able to recognize that and split it myself. The simplest example is Water Vtamin D, which should become Water, Vitamin D. I could give many more examples, but I think this conveys the problem.
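The segmentation step could be sketched as a small dynamic program over the whitespace-separated tokens: try every span of up to MAX_WORDS tokens, keep the spans that fuzzy-match a known name, and pick the split that covers the whole input with the highest total score. This is only an illustration under stated assumptions. The helper names best_name and segment, the 0.8 cutoff, and the choice to weight each match by the number of tokens it covers are all hypothetical. It compares against full names only, so an alias such as Vitamin E would need extra keys.

```python
from difflib import SequenceMatcher
from functools import lru_cache

NAMES = [
    "Water",
    "Tocopherol (Vitamin E)",
    "Vitamin D",
    "PEG-60 Hydrogenated Castor Oil",
]
# Longest name, in words; no candidate span needs to be longer than this.
MAX_WORDS = max(len(name.split()) for name in NAMES)

def best_name(span, cutoff=0.8):
    """Return (name, score) for the closest name, or (None, 0.0) below the cutoff."""
    scored = [
        (SequenceMatcher(None, span.lower(), name.lower()).ratio(), name)
        for name in NAMES
    ]
    score, name = max(scored)
    return (name, score) if score >= cutoff else (None, 0.0)

def segment(text):
    """Split free text into known names; return None if no full cover exists."""
    tokens = text.replace(",", " ").split()

    @lru_cache(maxsize=None)
    def solve(i):
        # Best (total_score, names) for tokens[i:]; -inf marks "no valid split".
        if i == len(tokens):
            return 0.0, ()
        best = (float("-inf"), ())
        for j in range(i + 1, min(i + MAX_WORDS, len(tokens)) + 1):
            name, score = best_name(" ".join(tokens[i:j]))
            if name is None:
                continue
            rest_score, rest = solve(j)
            # Weight by span length so one long match beats several weak short ones.
            total = score * (j - i) + rest_score
            if total > best[0]:
                best = (total, (name,) + rest)
        return best

    score, names = solve(0)
    return list(names) if score != float("-inf") else None

print(segment("Water Vtamin D"))
print(segment("wter PEG-60 Hydrnated Castor Oil"))
```

Because commas are simply treated as whitespace here, the same routine handles both comma-separated and unseparated input. A None result signals that some part of the input had no close match.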


