I have parsed the dependency relations of some text with spaCy. How can I impose a condition on those dependency relations when extracting the subtree of a given token or span?
For example, I would like to get the subtree of a given token but exclude all portions of the subtree where the immediate child of my original token has a conjunction ("conj") dependency relation with that token.
To give an even more concrete example: I would like to extract the names and the corresponding attributes from the following sentence: "The entrepreneur and philanthropist Bill Gates and the Apple's Steve Jobs ate hamburgers."
| person | attribute |
|---|---|
| Bill Gates | entrepreneur and philanthropist |
| Steve Jobs | Apple's |
The dependency relations look like this: 
The following code succeeds at extracting the person entities, but Bill Gates' subtree overlaps that of Steve Jobs:
```python
import spacy

nlp = spacy.load("en_core_web_trf")
s = "The entrepreneur and philanthropist Bill Gates and the Apple's Steve Jobs ate hamburgers."
doc = nlp(s)

persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]
# [Bill Gates, Steve Jobs]

[[token for token in p.subtree] for p in persons]
# [[The, entrepreneur, and, philanthropist, Bill, Gates, and, the, Apple, 's, Steve, Jobs], [the, Apple, 's, Steve, Jobs]]
```
So I would like to either keep only the parts of Bill Gates' subtree that hang off an immediate child with an nmod dependency relation, or remove the parts that hang off an immediate child with the conj dependency relation. In R, the package rsyntax would get the job done, so I assume something similar is already built into spaCy.
(Any tips for smarter ways to get the table above are also appreciated; I'm not well-versed in either spaCy or Python in general.)
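One possible way to do this, as a minimal sketch rather than a built-in spaCy feature: walk the immediate children of the span's root token, skip any child whose `dep_` is `cc` or `conj`, and collect the full subtrees of the remaining children. The helper names `partial_subtree` and `attribute_text` are hypothetical. The parse below is built by hand so the example runs without downloading a model; its heads and labels are an assumption about what the parser returns for this sentence, not actual model output. Dropping `det` tokens is also just a choice made to match the desired table.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hand-built parse (an assumption, not model output), so no model is required.
words = ["The", "entrepreneur", "and", "philanthropist", "Bill", "Gates", "and",
         "the", "Apple", "'s", "Steve", "Jobs", "ate", "hamburgers", "."]
spaces = [True, True, True, True, True, True, True,
          True, False, True, True, True, True, False, False]
# Absolute index of each token's head; the root points at itself.
heads = [1, 5, 1, 1, 5, 12, 5, 8, 11, 8, 11, 5, 12, 12, 12]
deps = ["det", "nmod", "cc", "conj", "compound", "nsubj", "cc",
        "det", "poss", "case", "compound", "conj", "ROOT", "dobj", "punct"]
doc = Doc(Vocab(), words=words, spaces=spaces, heads=heads, deps=deps)
doc.ents = [Span(doc, 4, 6, label="PERSON"), Span(doc, 10, 12, label="PERSON")]


def partial_subtree(token, exclude_deps=("cc", "conj")):
    """Return token's subtree in document order, skipping every immediate
    child whose dependency label is in exclude_deps (and that child's subtree)."""
    kept = [token]
    for child in token.children:
        if child.dep_ not in exclude_deps:
            kept.extend(child.subtree)
    return sorted(kept, key=lambda t: t.i)


def attribute_text(span, drop_deps=("det",)):
    """Text of the span root's partial subtree, minus the span itself
    and minus tokens whose label is in drop_deps."""
    inside = range(span.start, span.end)
    tokens = [t for t in partial_subtree(span.root)
              if t.i not in inside and t.dep_ not in drop_deps]
    return "".join(t.text_with_ws for t in tokens).strip()


rows = [(ent.text, attribute_text(ent)) for ent in doc.ents if ent.label_ == "PERSON"]
print(rows)
# [('Bill Gates', 'entrepreneur and philanthropist'), ('Steve Jobs', "Apple's")]
```

With a real pipeline, the hand-built `doc` would be replaced by `nlp(s)`. The result then depends on the labels the model actually assigns, so the excluded and dropped labels may need adjusting.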
from How to get partial subtree depending on dependency relations with SpaCy?