Monday, 6 September 2021

Can anyone explain this code for computing Levenshtein distance?

I was given this code that quickly returns whether the Levenshtein distance between two strings is exactly 2.

def li(s, i):
    try:
        return s[i]
    except IndexError:
        return None
    
def f(str1, str2):
 t = [4, 4, 1, 2, 3]
 for i, str1_symb in enumerate(str1):
    p = 4
    res = []
    for j, t_val in enumerate(t):
        p = min(t_val - (str1_symb == li(str2, i + j - 2)), p, li(t, j + 1) or 4) + 1
        res.append(p)
    t = res
 return li(t, len(str2) - len(str1) + 2) == 3

You can test it with:

f("zzzzfood", "zzzzdodod") 

for example which will return True

and

f("zzzzfood", "zzzzdodo")

which will return False.

The standard algorithm for computing the Levenshtein distance builds a dynamic programming table and fills in the elements from left to right, top to bottom using the formula:

enter image description here

(from the wiki page linked above)

If you only want to return if the Levenshtein distance is at most 2 you can only look at cells of the dynamic programming that are at most 2 right or left from the diagonal.

The code above doesn't obviously do that and I can't work out what it is doing. Some particularly mysterious parts:

  • What is the role of t = [4, 4, 1, 2, 3]?
  • The li() function is taking both a string and a list in this code. It only returns None if the index i is greater than or equal to len(s). Sometimes i will be negative where it will still return a letter from s.
  • li(t, j + 1) or 4 returns 4 if li(t, j + 1) is None but I don't know what its purpose is.
  • What is the purpose/meaning of p?

Can anyone decipher it?



from Can anyone explain this code for computing Levenshtein distance?

No comments:

Post a Comment