How does SpaCy keep track of character and token offsets during tokenization?
In SpaCy, there's a Span object that keeps the start and end offsets of the token/span: https://spacy.io/api/span#init
There's a _recalculate_indices method that seems to look up token_by_start and token_by_end, but that looks like all the recalculation is doing.
When the text contains extraneous spaces, it does some smart alignment of the spans.
Does it recalculate after every regex execution? Does it keep track of each character's movement? Or does it search for the spans after all the regexes have run?
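To make the options above concrete, here is a minimal sketch of one way offsets can survive tokenization without any recalculation: never edit the input string, and record each token's position as the regexes match. This is an assumption for illustration, not spaCy's actual code. TokenOffsets, tokenize_with_offsets and char_span_to_token_span are hypothetical names, and the two lookup tables only mimic the idea behind token_by_start and token_by_end.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class TokenOffsets(NamedTuple):
    """Hypothetical record for this sketch; not a spaCy class."""
    text: str
    start_char: int  # offset of the first character in the original string
    end_char: int    # offset one past the last character


def tokenize_with_offsets(text: str) -> List[TokenOffsets]:
    """Split on whitespace, then on punctuation, without editing `text`."""
    tokens = []
    # re.finditer reports positions in the string it was given, so the
    # offsets never have to be recomputed after a regex has run.
    for chunk in re.finditer(r"\S+", text):
        for piece in re.finditer(r"\w+|[^\w\s]", chunk.group()):
            # piece.start() is relative to the chunk; shift it back.
            start = chunk.start() + piece.start()
            tokens.append(TokenOffsets(piece.group(), start, start + len(piece.group())))
    return tokens


def char_span_to_token_span(
    tokens: List[TokenOffsets], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map character offsets to (start, end) token indices, or None."""
    by_start = {tok.start_char: i for i, tok in enumerate(tokens)}
    by_end = {tok.end_char: i for i, tok in enumerate(tokens)}
    if start_char not in by_start or end_char not in by_end:
        return None  # the offsets do not fall on token boundaries
    return by_start[start_char], by_end[end_char] + 1


text = "Hello   world, again"  # note the extra spaces after "Hello"
tokens = tokenize_with_offsets(text)
for tok in tokens:
    # Every token can be sliced back out of the untouched original text.
    print(tok, repr(text[tok.start_char:tok.end_char]))
print(char_span_to_token_span(tokens, 8, 14))  # character span of "world,"
```

In this sketch the extra spaces need no special alignment step: they are simply never part of any token, so the offsets of later tokens already account for them.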