Given a string, e.g. i am a string
.
I can generate the n-grams of this string like so, using the nltk
package, where n is variable as per a specified range.
from nltk import ngrams
s = 'i am a string'
for n in range(1, 3):
for grams in ngrams(s.split(), n):
print(grams)
Gives the output:
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).
[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
The question is similar to this one, though with an additional layer of complexity.
Working solution - adapted from here.
I have a working solution, but it's really slow for longer strings.
def get_ngrams(s, min_=1, max_=4):
token_lst = []
for n in range(min_, max_):
for idx, grams in enumerate(ngrams(s.split(), n)):
token_lst.append(' '.join(grams))
return token_lst
def to_sum_k(s):
for len_ in range(1, len(s.split())+1):
for i in itertools.permutations(get_ngrams(s), r=len_):
if ' '.join(i) == s:
print(i)
to_sum_k('a b c')
from Reconstruct input string given ngrams of that string
No comments:
Post a Comment