Tuesday 2 May 2023

Simplifying ngram loops to compress the string given a fix set of ngrams

Given in list of characters, list('Hello▁world▁') and a list of character tuples, i.e.

[('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

The objective is to iterate through the tuples and collapse the list of characters if they match. I've tried this and it works:

import copy

def matcher(s, ngram):
  while s:
    window = tuple(s[:2]) # Since the string tuples are in pairs.
    if window == ngram:
      yield "".join(window)
      s = s[2:]
    else:
      yield s[0]
      s = s[1:]

def combine_ngrams(s, vocab):
  prev = copy.copy(s)
  while True:
    for v in vocab:
      s = list(matcher(s, v))
    if s == prev:
      break
    else:
      prev = s
  return s


vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), 
         ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

s = list('Hello▁world▁')

combine_ngrams(s, vocab)

[out]:

['Hello▁', 'world▁']

But the multiple while loops in both the outer function combine_ngrams() and inner matcher() looks like something that can be easily simplified.

Or maybe the operations doesn't need to loop through the tuples and maybe some regex methods to iteratively apply the vocab substitution would work. Is there a way to simply the nested while loops in the combine_ngrams function?


Here's more input/output examples:

[in]:

s = list('abcde'); vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('abcde'); vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('aaab'); vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]

s = list('Hello▁ポケモンセンター▁world▁'); vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

[out]:

['ab', 'c', 'd', 'e']

['abcde']

['aa', 'a', 'b']

['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']

P/S: For anyone interested this is related to the byte-pair encoding algorithm and if there's a more algorithmic rather than pythonic loop way to solve this problem, please do suggest.



from Simplifying ngram loops to compress the string given a fix set of ngrams

No comments:

Post a Comment