Context
I'm looking for a way to tokenize entities (from a CoNLL file) in a dataframe following this rule: ["d'Angers"] => ["d'", "Angers"] / ["l'impératrice"] => ["l'", "impératrice"] (split entities on the apostrophe).
Code
My initial dataframe looks like this (corresponding to the CoNLL file):
Sentence Mention Tag
9 3 Vincennes B-LOCATION
10 3 . O
12 4 Confirmation O
13 4 des O
14 4 privilèges O
15 4 de O
16 4 la O
17 4 ville O
18 4 d'Aire O
19 4 1 O
20 4 , O
21 4 au O
22 4 bailliage B-ORGANISATION
23 4 d'Amiens I-ORGANISATION
First, create a Retokenization class:
import re

class Retokenization:
    def __init__(self, mention) -> None:
        self.mention = mention
        self.tokens = self.split_off_apostrophes()

    def split_off_apostrophes(self):
        # Split single-letter elisions such as "d'" or "l'" off the token,
        # but leave hyphenated forms and words like "aujourd'hui" untouched
        if "'" in self.mention and len(self.mention) > 1:
            if not re.search(r"[\-]", str(self.mention)) and not re.search(r"[\w]{2,}['][\w]+", str(self.mention)):
                inter = re.split(r"(\w')", self.mention)
                tokens = [tok for tok in inter if tok != '']
                return tokens
            else:
                return self.mention.split()
        else:
            return self.mention.split()
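For example, running the class on a few standalone mentions (made-up inputs, just as a quick check) gives:

for mention in ["d'Angers", "l'impératrice", "Vincennes", "aujourd'hui"]:
    print(mention, '=>', Retokenization(mention).tokens)

# d'Angers => ["d'", 'Angers']
# l'impératrice => ["l'", 'impératrice']
# Vincennes => ['Vincennes']
# aujourd'hui => ["aujourd'hui"]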
Then apply this retokenization class to the dataframe:
mentions = df['Mention'].apply(lambda mention : Retokenization(mention).tokens).fillna(value='_')
OUT :
9 [Vincennes]
10 [.]
12 [Confirmation]
13 [des]
14 [privilèges]
15 [de]
16 [la]
17 [ville]
18 [d', Aire]
19 [1]
20 [,]
21 [au]
22 [bailliage]
23 [d', Amiens]
Then I recreate the dataframe with my tokenized mentions and the associated labels:
from itertools import chain

df_retokenized = pd.DataFrame({
    'Sentence': df['Sentence'].values.repeat(mentions.str.len()),
    'Mention': list(chain.from_iterable(mentions.tolist())),
    'Tag': df['Tag'].values.repeat(mentions.str.len())
})
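The idea is that each original row's Sentence and Tag are repeated once per token produced for that mention, so the flattened token list stays aligned with the repeated columns. A minimal sketch of that alignment on toy values (not taken from the real file):

import pandas as pd
from itertools import chain

toy_tags = pd.Series(['O', 'B-ORGANISATION'])
toy_tokens = pd.Series([['ville'], ["d'", 'Amiens']])

print(toy_tags.values.repeat(toy_tokens.str.len()))
# ['O' 'B-ORGANISATION' 'B-ORGANISATION']
print(list(chain.from_iterable(toy_tokens.tolist())))
# ['ville', "d'", 'Amiens']

In principle, every sub-token should therefore inherit the tag of the row it came from.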
and I add a boolean mask to patch some labels that go missing when recreating the dataframe:
import numpy as np

m1 = df['Tag'].eq('O')                                                        # rows currently tagged 'O'
m2 = m1 & df['Tag'].shift(-1).str.startswith('I-')                            # ...whose next row carries an 'I-' tag
add_tag = df['Tag'].shift(-1).str.replace(r"\w[-](\w+)", r"\1", regex=True)   # entity type of the next row
df['Tag'] = np.select([m2], ['B-' + add_tag], df['Tag'])                      # promote those rows to 'B-<type>'
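Before showing the full output, here is what this masking step does on a three-row toy series (same logic as above; I only added na=False to startswith so the NaN produced by shift(-1) on the last row is treated as False):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'Tag': ['O', 'I-TITLE', 'B-PERSON']})

next_tag = toy['Tag'].shift(-1)                                      # tag of the following row
m2 = toy['Tag'].eq('O') & next_tag.str.startswith('I-', na=False)    # 'O' row followed by an 'I-' row
add_tag = next_tag.str.replace(r"\w[-](\w+)", r"\1", regex=True)     # 'I-TITLE' -> 'TITLE'
toy['Tag'] = np.select([m2], ['B-' + add_tag], toy['Tag'])           # promote it to 'B-<type>'

print(toy['Tag'].tolist())
# ['B-TITLE', 'I-TITLE', 'B-PERSON']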
OUT :
Mention Tag
JJ O
/ O
/ O
226 O
/ O
A O
465 O
Vincennes B-LOCATION
. O
Confirmation O
des O
privilèges O
de O
la O
ville O
d' O
Aire O
1 O
, O
au O
bailliage B-ORGANISATION
d' I-ORGANISATION
Amiens I-ORGANISATION
Problem
The code works but is not fully operational. When I look further in the output file, I notice that some retokenized entities have lost their IOB tag, for example:
Example # 1
Before :
Projet O
de O
" O
tour O
de O
l'impératrice B-TITLE
Eugénie B-PERSON
After :
Projet O
de O
" O
tour O
de O
l' O
impératrice
Eugénie B-PERSON
Expected output :
Projet O
de O
" O
tour O
de O
l' O
impératrice B-TITLE
Eugénie B-PERSON
Example #2
Before :
à O
l'ONU B-ORGANISATION
, O
durée O
After :
à O
l' O
ONU O
, O
durée O
Expected output :
à O
l' O
ONU B-ORGANISATION
, O
durée O
I have other examples (but I don't want to overload the question).
Question
Is there a way to apply my (re-)tokenization to the dataframe without losing NER tags on re-tokenized entities?
Excuse me for the length (I have not found another way to summarize it correctly); if anyone has any leads, thank you in advance.