Wednesday 27 October 2021

(Re-)tokenize entities in a dataframe without losing the associated label

Context

I'm looking for a way to tokenize entities (in a CONLL file) from a dataframe following this rule: ["d'Angers"] => ["d'", "Angers"], ["l'impératrice"] => ["l'", "impératrice"] (i.e. split entities on the apostrophe).
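
For instance, on plain strings the intended rule behaves like this (a minimal sketch; split_apostrophe is just an illustrative name, not part of the code below):

import re

def split_apostrophe(token):
    # The capturing group keeps the clitic ("d'", "l'") as its own token
    return [part for part in re.split(r"(\w')", token) if part]

print(split_apostrophe("d'Angers"))       # ["d'", 'Angers']
print(split_apostrophe("l'impératrice"))  # ["l'", 'impératrice']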

Code

My initial dataframe looks like this (corresponding to the CONLL file):

    Sentence  Mention       Tag
9   3         Vincennes     B-LOCATION
10  3         .             O
12  4         Confirmation  O
13  4         des           O
14  4         privilèges    O
15  4         de            O
16  4         la            O
17  4         ville         O
18  4         d'Aire        O
19  4         1             O
20  4         ,             O
21  4         au            O
22  4         bailliage     B-ORGANISATION
23  4         d'Amiens      I-ORGANISATION

First, create a Retokenization class:

import re

class Retokenization:
    def __init__(self, mention) -> None:
        self.mention = mention
        self.tokens = self.split_off_apostrophes()

    def split_off_apostrophes(self):
        # Only split mentions that contain an apostrophe and more than one character
        if "'" in self.mention and len(self.mention) > 1:
            # Leave hyphenated forms and in-word elisions (two or more word
            # characters before the apostrophe, e.g. "aujourd'hui") untouched
            if not re.search(r"[\-]", str(self.mention)) and not re.search(r"[\w]{2,}['][\w]+", str(self.mention)):
                # The capturing group keeps the clitic ("d'", "l'", ...) as its own token
                inter = re.split(r"(\w')", self.mention)
                tokens = [tok for tok in inter if tok != '']
                return tokens
            else:
                return self.mention.split()
        else:
            return self.mention.split()
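
As a quick sanity check of the class (the sample mentions here are my own):

for mention in ["d'Angers", "l'impératrice", "ville", "aujourd'hui"]:
    print(mention, '=>', Retokenization(mention).tokens)
# d'Angers => ["d'", 'Angers']
# l'impératrice => ["l'", 'impératrice']
# ville => ['ville']
# aujourd'hui => ["aujourd'hui"]

The second guard regex is what keeps in-word elisions such as "aujourd'hui" intact.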

Then apply this class to the dataframe:

mentions = df['Mention'].apply(lambda mention: Retokenization(mention).tokens).fillna(value='_')

OUT:

9           [Vincennes]
10                  [.]
12       [Confirmation]
13                [des]
14         [privilèges]
15                 [de]
16                 [la]
17              [ville]
18           [d', Aire]
19                  [1]
20                  [,]
21                 [au]
22          [bailliage]
23         [d', Amiens]

Then I recreate the dataframe with my tokenized mentions and the associated labels:

import pandas as pd
from itertools import chain

df_retokenized = pd.DataFrame({
    # Repeat each row's Sentence and Tag once per token it produced
    'Sentence': df['Sentence'].values.repeat(mentions.str.len()),
    # Flatten the per-mention token lists into a single column
    'Mention': list(chain.from_iterable(mentions.tolist())),
    'Tag': df['Tag'].values.repeat(mentions.str.len())
})
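
The alignment relies on values.repeat duplicating each row's Sentence and Tag once per produced token, while chain.from_iterable flattens the token lists in the same order. A toy illustration with made-up values:

import numpy as np
from itertools import chain

tokens = [["d'", 'Amiens'], ['au']]
tags = np.array(['I-ORGANISATION', 'O'])
lengths = [len(t) for t in tokens]        # [2, 1]

print(tags.repeat(lengths))               # ['I-ORGANISATION' 'I-ORGANISATION' 'O']
print(list(chain.from_iterable(tokens)))  # ["d'", 'Amiens', 'au']

Note that repeating the tags gives every token of a mention the same label, including its 'B-' or 'I-' prefix.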

I also add a boolean mask to restore some labels that go missing while recreating the dataframe:

import numpy as np

# Where a token tagged 'O' is directly followed by an 'I-' token,
# promote it to the matching 'B-' tag
m1 = df['Tag'].eq('O')
m2 = m1 & df['Tag'].shift(-1).str.startswith('I-')
# Strip the IOB prefix from the following tag to recover the entity type
add_tag = df['Tag'].shift(-1).str.replace(r"\w[-](\w+)", r"\1", regex=True)

df['Tag'] = np.select([m2], ['B-' + add_tag], df['Tag'])
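
To make the intent of the mask concrete before showing the full output, here is the same logic applied to a toy Series (made-up tags):

import numpy as np
import pandas as pd

tags = pd.Series(['O', 'O', 'I-TITLE', 'B-PERSON'])

m1 = tags.eq('O')
m2 = m1 & tags.shift(-1).str.startswith('I-')
add_tag = tags.shift(-1).str.replace(r"\w[-](\w+)", r"\1", regex=True)

print(np.select([m2], ['B-' + add_tag], tags).tolist())
# ['O', 'B-TITLE', 'I-TITLE', 'B-PERSON']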

OUT:

Mention Tag
JJ O
/ O
/ O
226 O
/ O
A O

465 O

Vincennes B-LOCATION
. O

Confirmation O
des O
privilèges O
de O
la O
ville O
d' O
Aire O
1 O
, O
au O
bailliage B-ORGANISATION
d' I-ORGANISATION
Amiens I-ORGANISATION

Problem

The code works but is not fully operational. When I look further into the output file, I notice that some retokenized entities have lost their IOB tag, for example:

Example #1

Before:

Projet O
de O
" O
tour O
de O
l'impératrice B-TITLE
Eugénie B-PERSON

After:

Projet O
de O
" O
tour O
de O
l' O
impératrice 
Eugénie B-PERSON

Expected output:

Projet O
de O
" O
tour O
de O
l' O
impératrice B-TITLE
Eugénie B-PERSON

Example #2

Before:

à O
l'ONU B-ORGANISATION
, O
durée O

After:

à O
l' O
ONU O
, O
durée O

Expected output:

à O
l' O
ONU B-ORGANISATION
, O
durée O

I have other examples, but I don't want to overload the question.

Question

Is there a way to apply my (re-)tokenization to the dataframe without losing the NER tags on the re-tokenized entities?

Excuse me for the length (I have not found another way to summarize it properly); if anyone has any leads, thank you in advance.



from (Re-)tokenize entities in a dataframe without losing the associated label
