Assuming I have a fixed list of multi-word names like:
Water
Tocopherol (Vitamin E)
Vitamin D
PEG-60 Hydrogenated Castor Oil
I want the following input/output results:
1. Water, PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
2. PEG-60 Hydrnated Castor Oil -> PEG-60 Hydrogenated Castor Oil
3. wter PEG-60 Hydrnated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
4. Vitamin E -> Tocopherol (Vitamin E)
I need it to be performant and able to recognize when there are either too many close matches or no close matches. Example 1 is relatively easy because I can split on the comma. Most input lists are comma-separated, so this works 80% of the time, but even then there is a small issue. Take example 4: once separated, its ideal match is not returned by most spellcheck libraries (I've tried a number) because the edit distance to Vitamin D is much smaller. There are some websites that do this well, but I'm lost as to how to do it.
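To make the issue with example 4 concrete, here is a minimal sketch that ranks the names by plain Levenshtein distance. The hand-rolled `levenshtein` function and the lowercasing are my assumptions for illustration; the libraries I tried may use a different metric internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]

query = "Vitamin E"
# Distance from the query to every known name, ignoring case.
distances = {name: levenshtein(query.lower(), name.lower()) for name in NAMES}
closest = min(distances, key=distances.get)

print(distances["Vitamin D"])               # 1
print(distances["Tocopherol (Vitamin E)"])  # 13
print(closest)                              # Vitamin D
```

Even though the query is contained verbatim in Tocopherol (Vitamin E), the 13 extra characters count against it, so a pure edit-distance ranking picks Vitamin D.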
The second part of this problem is how to do word segmentation on top of that. If a given list doesn't have commas, I need to be able to recognize that. The simplest example is Water Vtamin D, which should become Water, Vitamin D. I can give a ton of examples, but I think this gives a good idea of the problem.
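For concreteness, here is a naive sketch of the segmentation behaviour I'm after. It assumes the input can be split into tokens on commas and whitespace, tries every way of grouping consecutive tokens, and keeps the grouping whose summed edit distance to the known names is smallest. The names `levenshtein`, `best_name`, and `segment` are hypothetical helpers written for this example, not from any library.

```python
import re
from functools import lru_cache

NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_name(chunk: str) -> tuple[int, str]:
    """Return (distance, name) for the known name closest to chunk."""
    return min((levenshtein(chunk.lower(), n.lower()), n) for n in NAMES)


def segment(text: str) -> list[str]:
    """Split text into known names by minimising the total edit distance."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    # best[i] holds (total distance, matched names) for tokens[:i].
    best = [(0, [])] + [None] * len(tokens)
    for end in range(1, len(tokens) + 1):
        candidates = []
        for start in range(end):
            dist, name = best_name(" ".join(tokens[start:end]))
            total, names = best[start]
            candidates.append((total + dist, names + [name]))
        best[end] = min(candidates, key=lambda c: c[0])
    return best[-1][1]


print(segment("Water Vtamin D"))
# ['Water', 'Vitamin D']
print(segment("wter PEG-60 Hydrnated Castor Oil"))
# ['Water', 'PEG-60 Hydrogenated Castor Oil']
```

This handles examples 1 to 3, but it is only a sketch: it is not tuned for performance, it has no threshold for reporting "no close match" or "too many close matches", and it still maps Vitamin E to Vitamin D because the score is plain edit distance.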
from Query segmentation with spell check