Assuming I have a fixed list of multi-word names like:

Water
Tocopherol (Vitamin E)
Vitamin D
PEG-60 Hydrogenated Castor Oil
I want the following input/output results:
1. Water, PEG-60 Hydrogenated Castor Oil -> Water,PEG-60 Hydrogenated Castor Oil
2. PEG-60 Hydrnated Castor Oil -> PEG-60 Hydrogenated Castor Oil
3. wter PEG-60 Hydrnated Castor Oil -> Water,PEG-60 Hydrogenated Castor Oil
4. Vitamin E -> Tocopherol (Vitamin E)
I need it to be performant and able to recognize when there are either too many close matches or no close matches at all. Example 1 is relatively easy because I can split on the comma. Most of the time the input list is comma-separated, so this works about 80% of the time, but even then there is a small issue. Take example 4: once separated, its ideal match is not returned by most spellcheck libraries (I've tried a number of them) because the edit distance to Vitamin D is much smaller. Some websites do this well, but I'm lost as to how to do it.
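To make the issue with example 4 concrete, here is a minimal Python sketch. It first shows why plain edit distance prefers Vitamin D, then tries one possible workaround: scoring the query against each name and against the parts inside and outside its parentheses. The helper names (`variants`, `score`, `match`) and the `cutoff` and `margin` values are assumptions chosen for illustration, not a known library API, and the linear scan over the list is not tuned for performance.

```python
import re
from difflib import SequenceMatcher

NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def variants(name):
    """Hypothetical helper: full name plus the text inside/outside parentheses."""
    parts = [name] + re.split(r"[()]", name)
    return {p.strip().lower() for p in parts if p.strip()}

def score(query, name):
    """Best similarity ratio between the query and any variant of the name."""
    q = query.strip().lower()
    return max(SequenceMatcher(None, q, v).ratio() for v in variants(name))

def match(query, cutoff=0.8, margin=0.05):
    """Return ("match" | "ambiguous" | "none", names). Thresholds are guesses."""
    ranked = sorted(((score(query, n), n) for n in NAMES), reverse=True)
    best_score, best = ranked[0]
    if best_score < cutoff:
        return ("none", [])
    close = [n for s, n in ranked if s >= cutoff and best_score - s <= margin]
    if len(close) > 1:
        return ("ambiguous", close)
    return ("match", [best])

print(levenshtein("Vitamin E", "Vitamin D"))               # 1
print(levenshtein("Vitamin E", "Tocopherol (Vitamin E)"))  # 13
print(match("Vitamin E"))
print(match("Vitamin"))
```

With these thresholds, `match("Vitamin E")` picks Tocopherol (Vitamin E) because the parenthesised alias matches exactly, while a bare `"Vitamin"` is reported as ambiguous between the two vitamin entries. Both thresholds would need tuning against real data.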
The second part of this problem is how to do word segmentation on top of the spell check. If a given list has no commas, I need to be able to recognize where one name ends and the next begins. The simplest example: Water Vtamin D should become Water, Vitamin D. I can give a ton of examples, but I think this gives a good idea of the problem.
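One way the segmentation step could be framed is as dynamic programming over the whitespace-separated tokens: try every run of up to N tokens ending at each position, fuzzy-match the run against the fixed list, and keep the split with the best total score. The Python sketch below assumes that framing. The names `best_name` and `segment`, the 0.8 cutoff, and the token-count weighting are all illustrative assumptions rather than an established method.

```python
import re
from difflib import SequenceMatcher

NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]
MAX_TOKENS = max(len(n.split()) for n in NAMES)  # longest name, in tokens

def best_name(span, cutoff=0.8):
    """Closest known name for a run of tokens, or (0.0, None) below the cutoff."""
    q = span.lower()
    s, name = max((SequenceMatcher(None, q, n.lower()).ratio(), n) for n in NAMES)
    return (s, name) if s >= cutoff else (0.0, None)

def segment(text, cutoff=0.8):
    """Split text into known names; return None if no full segmentation exists."""
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    # best[i] holds (total score, names) for the first i tokens, or None.
    best = [None] * (len(tokens) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(tokens) + 1):
        for k in range(1, min(MAX_TOKENS, end) + 1):
            prev = best[end - k]
            if prev is None:
                continue
            s, name = best_name(" ".join(tokens[end - k:end]), cutoff)
            if name is None:
                continue
            # Weight by token count so longer names are not penalised.
            cand = (prev[0] + s * k, prev[1] + [name])
            if best[end] is None or cand[0] > best[end][0]:
                best[end] = cand
    return None if best[-1] is None else best[-1][1]

print(segment("Water Vtamin D"))
print(segment("wter PEG-60 Hydrnated Castor Oil"))
```

Because commas are treated like whitespace here, the same function handles both comma-separated and unseparated input. A result of `None` signals that at least one run of tokens had no close match, which is one way to surface the "no close matches" case.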
from Query segmentation with spell check