Hemant Vishwakarma: Sequitur on multiple strings

Thursday, 22 April 2021

Sequitur on multiple strings

I would like to use Sequitur (and in particular its very nice implementation in the Python scikit package) to infer a CFG from a large number of strings (“sentences”). Sequitur expects a single string as entry. Of course I can incluse my strings into BEGIN and END symbols and concatenate them into a single long string without ny loss of information. But in that case the structure Sequitur will find will cross sentence boundaries, and this is not what I want.

How can I make Sequitur look only inside sentences and produce no rule the projection of which includes a BEGIN symbol followed by an END symbol? (Except for the first rule, that would be an umbrella for all sentences…)

Example: If I have sentences "The boy eats the apple" and "The girl sleeps", and I replace words by their POS tags ("a" for article, "n" for noun, "v" for verb) I get "anvan" and "anv". If I merge them including B (BEGIN) and E (END) symbols I get the string "BanvanEBanvE". When I supply this string to Sequitur I get

0 → 1 2 E 1 E 
1 → B 2 v
2 → a n

where the first rule has E symbols but no B symbols, and the second one a B symbol, an intermediate symbol (essentially the noun phrase ART+NOUN) and a v.

What I would like would be

0 → B 1 E B 2 E

etc. so that the rest of the grammar has only sentence-internal rules. In our simplistic example, it would be

0 → B 1 E B 2 E
1 → 3 v 3
2 → 3 v
3 → a n

Is there a way to achieve that without altering the algorithm's code? If not, is there some other (implemented) algorithm that allows to obtain exactly that?

from Sequitur on multiple strings

Hemant Vishwakarma

Thursday, 22 April 2021

Sequitur on multiple strings

No comments:

Post a Comment