I have a pandas DataFrame with two columns, Potential Word and Fixed Word. The Potential Word column contains words from different languages, both misspelled and correctly spelled, and the Fixed Word column contains the correct word corresponding to each Potential Word.
Below is some sample data:
| Potential Word | Fixed Word |
|---|---|
| Exemple | Example |
| pipol | People |
| pimple | Pimple |
| Iunik | unique |
My vocab DataFrame contains 600K unique rows.
My Solution:
key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65
for each in df['Potential Word']:
    # match is a function that returns a similarity value of two strings
    match_value = match(each, key)
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
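For reference, the loop above can be written as a self-contained function. This is only a minimal sketch: the real `match` function is not shown here, so `difflib.SequenceMatcher.ratio` stands in for it as an assumption, and `best_match` is a hypothetical name. The sketch also assumes that the value wanted in the end is the Fixed Word paired with the best-matching Potential Word.

```python
from difflib import SequenceMatcher


def match(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the real match function is an unknown here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(pairs, key: str, match_threshold: float = 0.65) -> str:
    """Linear scan over (potential, fixed) pairs.

    Returns the fixed word whose potential word is most similar to key,
    or '' when nothing scores above match_threshold.
    """
    glob_match_value = 0.0
    best_fixed_word = ""
    for potential, fixed in pairs:
        match_value = match(potential, key)
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            best_fixed_word = fixed
    return best_fixed_word


# With a DataFrame, the pairs could come from:
#   list(zip(df["Potential Word"], df["Fixed Word"]))
sample_pairs = [
    ("Exemple", "Example"),
    ("pipol", "People"),
    ("pimple", "Pimple"),
    ("Iunik", "unique"),
]
print(best_match(sample_pairs, "pipul"))
```

Every call still compares the key against all rows, which is the cost described under Problem.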
Problem
The problem with my code is that it takes a lot of time to fix every word, because the loop runs through the large vocab list. When a word is missing from the vocab, it takes almost 5 or 6 seconds to fix a sentence of 10 to 12 words. The match function performs decently, so the objective of the optimization is the loop over the vocab list.
I need an optimized solution. Please help me here.
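One direction that might reduce the work per word is to skip most of the vocab before the expensive comparison runs. The sketch below is a rough illustration under stated assumptions, not a tested fix for the real data: it assumes `difflib.SequenceMatcher.ratio` as a stand-in for `match`, assumes case-insensitive matching, and uses the hypothetical names `build_index` and `fast_best_match`. It answers exact hits from a dictionary, groups words by length, and uses difflib's `real_quick_ratio` and `quick_ratio` upper bounds to discard candidates that cannot beat the current best score.

```python
from difflib import SequenceMatcher


def build_index(pairs):
    """Build the lookup structures once: exact-match dict and length buckets."""
    exact = {}
    by_length = {}
    for potential, fixed in pairs:
        p = potential.lower()
        exact.setdefault(p, fixed)
        by_length.setdefault(len(p), []).append((p, fixed))
    return exact, by_length


def fast_best_match(index, key, threshold=0.65):
    """Return the best fixed word for key, or '' if nothing beats threshold."""
    exact, by_length = index
    k = key.lower()
    if k in exact:  # exact hit: no similarity computation needed
        return exact[k]

    matcher = SequenceMatcher()
    matcher.set_seq2(k)  # difflib caches details about the second sequence
    best_value, best_fixed = threshold, ""
    n = len(k)
    for length, bucket in by_length.items():
        # ratio() can never exceed 2 * min(len) / (sum of lengths),
        # so a whole length bucket can be skipped at once.
        if 2 * min(n, length) / (n + length) <= best_value:
            continue
        for p, fixed in bucket:
            matcher.set_seq1(p)
            # Cheap upper bounds first; the full ratio() only for survivors.
            if matcher.real_quick_ratio() <= best_value:
                continue
            if matcher.quick_ratio() <= best_value:
                continue
            value = matcher.ratio()
            if value > best_value:
                best_value, best_fixed = value, fixed
    return best_fixed


sample_pairs = [
    ("Exemple", "Example"),
    ("pipol", "People"),
    ("pimple", "Pimple"),
    ("Iunik", "unique"),
]
index = build_index(sample_pairs)  # built once, reused for every word
print(fast_best_match(index, "pipul"))
```

The length bound and the quick-ratio bounds are specific to difflib's ratio. With a different `match` function, an equivalent cheap upper bound would be needed for the pruning to stay correct.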
from Finding best matched word from large Vocalist