I would like to use Sequitur (and in particular its very nice implementation in the Python scikit package) to infer a CFG from a large number of strings (“sentences”). Sequitur expects a single string as entry. Of course I can incluse my strings into BEGIN and END symbols and concatenate them into a single long string without ny loss of information. But in that case the structure Sequitur will find will cross sentence boundaries, and this is not what I want.
How can I make Sequitur look only inside sentences and produce no rule the projection of which includes a BEGIN symbol followed by an END symbol? (Except for the first rule, that would be an umbrella for all sentences…)
Example: If I have sentences "The boy eats the apple" and "The girl sleeps", and I replace words by their POS tags ("a" for article, "n" for noun, "v" for verb) I get "anvan" and "anv". If I merge them including B (BEGIN) and E (END) symbols I get the string "BanvanEBanvE". When I supply this string to Sequitur I get
0 → 1 2 E 1 E
1 → B 2 v
2 → a n
where the first rule has E symbols but no B symbols, and the second one a B symbol, an intermediate symbol (essentially the noun phrase ART+NOUN) and a v.
What I would like would be
0 → B 1 E B 2 E
etc. so that the rest of the grammar has only sentence-internal rules. In our simplistic example, it would be
0 → B 1 E B 2 E
1 → 3 v 3
2 → 3 v
3 → a n
Is there a way to achieve that without altering the algorithm's code? If not, is there some other (implemented) algorithm that allows to obtain exactly that?
from Sequitur on multiple strings
No comments:
Post a Comment