I'm working on replicating an algorithm described in this paper: https://arxiv.org/pdf/1811.11008.pdf
On the last page, it describes extracting a leaf, defined in the grammar and labelled 'NP JJ', using the following example: Operating profit margin was 8.3%, compared to 11.8% a year earlier.
I'm expecting to see a leaf labelled 'NP JJ' in the output, but I don't. I'm tearing my hair out as to why (I'm relatively new to regular expressions).
import nltk
from nltk.tokenize import word_tokenize

def split_sentence(sentence_as_string):
    '''Split a sentence into a list of word tokens.'''
    words = word_tokenize(sentence_as_string)
    return words

def pos_tagging(sentence_as_list):
    '''Tag each token with its part of speech.'''
    words = nltk.pos_tag(sentence_as_list)
    return words

def get_regex(sentence, grammar):
    '''Tokenize and tag the sentence, then chunk it with the given grammar.'''
    sentence = pos_tagging(split_sentence(sentence))
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(sentence)
    return result

example_sentence = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
grammar = """JJ : {< JJ.∗ > ∗}
V B : {< V B.∗ >}
NP : {(< NNS|NN >)∗}
NP P : {< NNP|NNP S >}
RB : {< RB.∗ >}
CD : {< CD >}
NP JJ : : {< NP|NP P > +(< (>< .∗ > ∗ <) >) ∗ (< IN >< DT > ∗ < RB > ∗ < JJ > ∗ < NP|NP P >) ∗ < RB > ∗(< V B >< JJ >< NP >)∗ < V B > (< DT >< CD >< NP >) ∗ < NP|NP P > ∗ < CD > ∗ < .∗ > ∗ < CD > ∗| < NP|NP P >< IN >< NP|NP P >< CD >< .∗ > ∗ <, >< V B > < IN >< NP|NP P >< CD >}"""
grammar = grammar.replace('∗','*')
tree = get_regex(example_sentence, grammar)
print(tree)
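
For comparison, here is a much reduced sketch of the same cascading idea: a later stage that refers to labels produced by earlier stages. It is not the paper's grammar. The (word, tag) pairs are assigned by hand rather than taken from nltk.pos_tag, and the label NPJJ (written without a space) is a hypothetical name chosen for illustration. It may be worth checking whether the labels in the full grammar are spelled the same way in the rule that defines them and in the rules that reference them.

```python
import nltk

# Hand-assigned (word, tag) pairs for the start of the example sentence.
# Assumption: these tags are illustrative, not actual nltk.pos_tag output.
tagged = [
    ("Operating", "NN"),
    ("profit", "NN"),
    ("margin", "NN"),
    ("was", "VBD"),
    ("8.3", "CD"),
]

# Three stages applied in order. The last stage refers to the chunk
# labels NP and VB created by the first two stages.
# NPJJ is a hypothetical label name used only in this sketch.
grammar = r"""
NP: {<NN|NNS>+}
VB: {<VB.*>}
NPJJ: {<NP><VB><CD>}
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the label of every subtree, starting with the root.
labels = [subtree.label() for subtree in tree.subtrees()]
print(tree)
print(labels)
```

Printing the labels this way makes it easy to see which stages actually fired before looking at the full 'NP JJ' rule.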
from How to use NLTK Regex patterns to annotate financial news with UP/DOWN indicator?