Monday, 22 April 2019

Recursion in nltk's RegexpParser

Based on the grammar in chapter 7 of the NLTK Book:

grammar = r"""
      NP: {<DT|JJ|NN.*>+} # ...
"""

I want to expand NP (noun phrase) to cover multiple NPs joined by CC (a coordinating conjunction such as "and") or by commas, so that phrases like these are captured as one chunk:

  • The house and tree
  • The apple, orange and mango
  • Car, house, and plane

I cannot get my modified grammar to capture those as a single NP:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""

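sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)  # split the sentence into tokens
tagged = nltk.pos_tag(words)          # attach a POS tag to each token
print(chunkParser.parse(tagged))      # chunk the tagged tokens with the grammar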

Results in:

(S (NP The/DT house/NN) and/CC (NP tree/NN))
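For comparison, here is a minimal sketch of a flat pattern that repeats the base tag sequence instead of referring to NP inside its own rule. It rests on two assumptions that are not part of the original attempt: the tokens are tagged by hand (so no tokenizer or tagger models are needed), and a single flat NP with no nested structure is acceptable. The names chunk_parser and flat_tree are made up for this illustration.

```python
import nltk

# Flat alternative (an assumption, not the original grammar): after the
# first tag sequence, allow any number of "<CC|,>+ followed by another
# tag sequence" groups, so no rule has to refer to NP itself.
grammar = r"""
  NP: {<DT|JJ|NN.*>+(<CC|,>+<DT|JJ|NN.*>+)*}
"""
chunk_parser = nltk.RegexpParser(grammar)

# Hand-tagged tokens for 'The house and tree' (assumed tags).
tagged = [("The", "DT"), ("house", "NN"), ("and", "CC"), ("tree", "NN")]

flat_tree = chunk_parser.parse(tagged)
print(flat_tree)
```

With these tags the whole phrase should land in one NP chunk, but the inner noun phrases are not marked separately.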

I've also tried moving the NP reference to the beginning, NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}, but I get the same result:

(S (NP The/DT house/NN) and/CC (NP tree/NN))
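A possible direction, sketched here with some caution: as far as I understand it, a rule is matched against the tag sequence that its stage receives as input, and chunks only show up as <NP> to a later stage. Under that assumption, a cascaded grammar with a second NP stage can group the chunks built by the first one. The tokens below are tagged by hand (an assumption), and the names cascaded, tagged and nested_tree are made up for this illustration.

```python
import nltk

# Two stages: the first builds base noun phrases, the second groups the
# NP chunks produced by the first when they are joined by CC or commas.
cascaded = r"""
  NP: {<DT|JJ|NN.*>+}        # stage 1, base noun phrases
  NP: {<NP>(<CC|,>+<NP>)+}   # stage 2, NPs joined by conjunctions or commas
"""
chunk_parser = nltk.RegexpParser(cascaded)

# Hand-tagged tokens for 'Car, house, and plane' (assumed tags).
tagged = [
    ("Car", "NN"), (",", ","), ("house", "NN"),
    (",", ","), ("and", "CC"), ("plane", "NN"),
]

nested_tree = chunk_parser.parse(tagged)
print(nested_tree)
```

Unlike a flat pattern, this keeps each inner noun phrase as its own subtree inside the outer NP. RegexpParser also accepts a loop argument for running the stages more than once, which may matter for deeper nesting.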


