Wednesday, 7 October 2020

Reconstruct input string given ngrams of that string

Given a string, e.g. i am a string.

I can generate the n-grams of this string like so, using the nltk package, where n is variable as per a specified range.

from nltk import ngrams 

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)

Gives the output:

('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')

Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).

[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]

The question is similar to this one, though with an additional layer of complexity.

Working solution - adapted from here.

I have a working solution, but it's really slow for longer strings.

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst 

def to_sum_k(s):
    for len_ in range(1, len(s.split())+1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')


from Reconstruct input string given ngrams of that string

No comments:

Post a Comment