Trying to understand the char_to_word method on transformers.BatchEncoding. The description in the docs is:
Get the word in the original string corresponding to a character in the original string of a sequence of the batch.
Is a “word” defined to be a collection of nonwhitespace characters mapped, by the tokenizer, to a single token?
The notion of a word depends on the tokenizer: the words of a text are the result of the pre-tokenization operation. Depending on the tokenizer, the text can be split on whitespace, on whitespace and punctuation, or by more advanced rules.
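To make the difference concrete, here is a minimal sketch of two pre-tokenization rules written with plain regular expressions. It only illustrates the idea; it is not the code any particular tokenizer uses, and both function names are made up for this example.

```python
import re


def split_on_whitespace(text):
    """Return (word, (start, end)) pairs for runs of non-whitespace characters."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def split_on_whitespace_and_punctuation(text):
    """Return (word, (start, end)) pairs, isolating each punctuation mark."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


text = "Hello, world!"

print(split_on_whitespace(text))
# [('Hello,', (0, 6)), ('world!', (7, 13))]

print(split_on_whitespace_and_punctuation(text))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

Under the first rule the character at index 5 (the comma) belongs to word 0; under the second it is its own word with index 1. The same character can therefore map to different word indices depending on the pre-tokenizer.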
Does each tokenizer override char_to_word? Is this even possible, since char_to_word is defined on BatchEncoding? How do the subtleties of each tokenizer get transmitted to
BatchEncoding has the correspondence from chars/tokens to words stored inside it; the fast tokenizer puts it there.
Is this the
No, it’s the
offset_mapping is the map from each token to its span of characters. You need to combine it with the word_ids to get the map from characters to words.
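As a rough sketch of that combination: with a fast tokenizer, the offsets can be requested with return_offsets_mapping=True and the word indices read from encoding.word_ids(). The values below are hand-written for illustration, not real tokenizer output, and build_char_to_word is a hypothetical helper, not part of transformers.

```python
def build_char_to_word(offset_mapping, word_ids, text_length):
    """Combine token character spans with per-token word ids.

    Returns a list where entry i is the word index of character i,
    or None if no token covers that character (e.g. whitespace).
    """
    char_to_word = [None] * text_length
    for (start, end), word_id in zip(offset_mapping, word_ids):
        if word_id is None:
            # Special tokens have no word id; skip them.
            continue
        for i in range(start, end):
            char_to_word[i] = word_id
    return char_to_word


# Hand-written illustrative values for "Hello, world!" (assumed, not real output):
# one special token at each end, and "Hello" split into two sub-word tokens.
text = "Hello, world!"
offset_mapping = [(0, 0), (0, 3), (3, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
word_ids = [None, 0, 0, 1, 2, 3, None]

char_to_word = build_char_to_word(offset_mapping, word_ids, len(text))
print(char_to_word)
# [0, 0, 0, 0, 0, 1, None, 2, 2, 2, 2, 2, 3]
```

Both tokens of "Hello" carry word id 0, so characters 0 through 4 all map to the same word. The space at index 6 is covered by no token and stays None.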