Thursday, 24 December 2020

How to use NLTK Regex patterns to annotate financial news with UP/DOWN indicator?

I'm working on replicating an algorithm described in this paper: https://arxiv.org/pdf/1811.11008.pdf

On the last page, the paper describes extracting a leaf, defined in the grammar and labelled 'NP JJ', using the following example sentence: "Operating profit margin was 8.3%, compared to 11.8% a year earlier."

I'm expecting to see a leaf labelled 'NP JJ' in the printed tree, but it isn't there. I'm tearing my hair out as to why (I'm relatively new to regular expressions).

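# Requires NLTK's tokenizer and POS-tagger data to be downloaded beforehand.
import nltk
from nltk.tokenize import word_tokenize

def split_sentence(sentence_as_string):
    '''Split a sentence into a list of word tokens.'''
    return word_tokenize(sentence_as_string)

def pos_tagging(sentence_as_list):
    '''Tag each token with its part of speech; returns (word, tag) pairs.'''
    return nltk.pos_tag(sentence_as_list)

def get_regex(sentence, grammar):
    '''Tokenize and tag the sentence, then chunk it with the given grammar.'''
    tagged = pos_tagging(split_sentence(sentence))
    cp = nltk.RegexpParser(grammar)
    return cp.parse(tagged)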


example_sentence = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."

grammar = """JJ : {< JJ.∗ > ∗}
            V B : {< V B.∗ >}
            NP : {(< NNS|NN >)∗}
            NP P : {< NNP|NNP S >}
            RB : {< RB.∗ >}
            CD : {< CD >}
            NP JJ : : {< NP|NP P > +(< (>< .∗ > ∗ <) >) ∗ (< IN >< DT > ∗ < RB > ∗ < JJ > ∗ < NP|NP P >) ∗ < RB > ∗(< V B >< JJ >< NP >)∗ < V B > (< DT >< CD >< NP >) ∗ < NP|NP P > ∗ < CD > ∗ < .∗ > ∗ < CD > ∗| < NP|NP P >< IN >< NP|NP P >< CD >< .∗ > ∗ <, >< V B > < IN >< NP|NP P >< CD >}"""

grammar = grammar.replace('∗','*')

tree = get_regex(example_sentence, grammar)

print(tree)
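
One thing that might be worth ruling out is copy-paste damage in the grammar itself. The sketch below is only an assumption on my part: it supposes that the spaces inside labels such as 'V B', 'NP P' and 'NP JJ', the doubled colon, and the '∗' characters are artifacts of copying from the PDF, and that the paper means VB, NPP and NPJJ. The helper name normalize_grammar is hypothetical and not part of NLTK. It only tidies the text; it does not guarantee that the cleaned rules match the example sentence.

```python
import re

def normalize_grammar(raw):
    """Tidy a chunk grammar pasted from a PDF (hypothetical helper).

    Assumes each non-blank line has the form 'LABEL : {rule}' and that
    spaces inside labels and tags are extraction artifacts.
    """
    cleaned_lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything before the first colon is the stage label.
        label, _, rule = line.partition(":")
        label = re.sub(r"\s+", "", label)    # "NP JJ" -> "NPJJ"
        rule = rule.lstrip(": ")             # drop a doubled colon, if any
        rule = rule.replace("\u2217", "*")   # U+2217 asterisk operator -> "*"
        rule = re.sub(r"\s+", "", rule)      # "< NP|NP P >" -> "<NP|NPP>"
        cleaned_lines.append(f"{label}: {rule}")
    return "\n".join(cleaned_lines)

sample = "JJ : {< JJ.\u2217 > \u2217}\n    V B : {< V B.\u2217 >}"
print(normalize_grammar(sample))
```

Running the full grammar through a helper like this before passing it to nltk.RegexpParser would make the stage labels and the tags that reference them (for example NPP in <NP|NPP>) spell the same way, which is a precondition for later stages to match chunks produced by earlier ones.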

