Wednesday, 19 April 2023

How to get a partial subtree depending on dependency relations with spaCy?

I have parsed the dependency relations of some text with spaCy. How can I filter on those dependency relations when extracting the subtree of a given token or span?

For example, I would like to get the subtree of a given token but exclude all portions of the subtree where the immediate child of my original token has a conjunction ("conj") dependency relation with that token.

To give an even more concrete example: I would like to extract the names and the corresponding attributes from the following sentence: "The entrepreneur and philanthropist Bill Gates and the Apple's Steve Jobs ate hamburgers."

person      attribute
Bill Gates  entrepreneur and philanthropist
Steve Jobs  Apple's

The dependency relations look like this: [image: dependency graph]

The following code succeeds at extracting the person entities, but Bill Gates' subtree also contains the whole subtree of Steve Jobs:

import spacy
nlp = spacy.load("en_core_web_trf")

s = "The entrepreneur and philanthropist Bill Gates and the Apple's Steve Jobs ate hamburgers."
doc = nlp(s)

# Keep only the named entities labelled as persons
persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]
# [Bill Gates, Steve Jobs]

# Span.subtree yields the span's tokens plus all their syntactic descendants
[list(p.subtree) for p in persons]
# [[The, entrepreneur, and, philanthropist, Bill, Gates, and, the, Apple, 's, Steve, Jobs], [the, Apple, 's, Steve, Jobs]]

So I would like to either get only the parts of Bill Gates' subtree where the first child has an nmod dependency relation, or remove the parts that are connected to a first child with the conj dependency relation. In R, the package rsyntax would get the job done, so I assume something similar is already built into spaCy.

(Any tips for smarter ways to get the table above are also appreciated; I'm not very well-versed in spaCy or in Python in general.)
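
To make the request concrete, here is a minimal sketch of the filtering described above. It is plain tree traversal, not a built-in spaCy feature. The function names partial_subtree and attribute_tokens are hypothetical, and so is the choice to drop "cc" children along with "conj" ones. The code only relies on the attributes children, dep_ and i of a token and on root of a span. Only the immediate children of the root are filtered, so a conjunction deeper in the tree (such as "entrepreneur and philanthropist") is left intact.

```python
def partial_subtree(root, excluded_deps=("conj", "cc")):
    """Return root and its descendants in document order, skipping every
    branch that hangs off an immediate child with an excluded label."""
    kept = [root]
    # Filter only the first level below root (assumption: deeper levels
    # should be kept whole, as the question asks).
    stack = [child for child in root.children if child.dep_ not in excluded_deps]
    while stack:
        token = stack.pop()
        kept.append(token)
        stack.extend(token.children)  # no filtering below the first level
    # Token.i is the token's position in the document
    return sorted(kept, key=lambda t: t.i)


def attribute_tokens(entity, excluded_deps=("conj", "cc")):
    """Tokens of the filtered subtree that are not part of the entity itself."""
    entity_ids = {t.i for t in entity}
    return [t for t in partial_subtree(entity.root, excluded_deps)
            if t.i not in entity_ids]
```

With the code from the question, this could be called as [(p.text, attribute_tokens(p)) for p in persons]. What comes back depends on the parse the model produces, and determiners such as "The" stay in the result unless "det" children are filtered out as well.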


