Trying to understand the `char_to_word` method on `transformers.BatchEncoding`. The description in the docs is:

> Get the word in the original string corresponding to a character in the original string of a sequence of the batch.
Is a “word” defined as a run of non-whitespace characters that the tokenizer maps to a single token?
The notion of a word depends on the tokenizer: the text words are the result of the pre-tokenization step. Depending on the tokenizer, the text can be split on whitespace, on whitespace and punctuation, or by other, more advanced rules.
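For instance, here is a minimal sketch assuming the `bert-base-uncased` fast tokenizer, whose pre-tokenizer splits on whitespace *and* punctuation, showing how pre-tokenization defines the words that `char_to_word` refers to:

```python
# Sketch assuming the "bert-base-uncased" fast tokenizer; its pre-tokenizer
# splits on whitespace and punctuation, so "," and "!" become words of their own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello, world!")

# Words after pre-tokenization: "Hello" (0), "," (1), "world" (2), "!" (3).
print(encoding.word_ids())       # [None, 0, 1, 2, 3, None] (None = special tokens)
print(encoding.char_to_word(0))  # 0 -> char 'H' belongs to the word "Hello"
print(encoding.char_to_word(5))  # 1 -> char ',' counts as its own word here
```

With a pre-tokenizer that splits only on whitespace (e.g. `WhitespaceSplit`), `char_to_word(5)` would instead return 0, because `"Hello,"` would count as a single word.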
Does each tokenizer override `char_to_word`? Is that even possible, since `char_to_word` is defined on `BatchEncoding`? How do the subtleties of each tokenizer get transmitted to `char_to_word`?
The `BatchEncoding` has the correspondence from chars/tokens to words stored inside of it; the fast tokenizer puts it there when it builds the encoding. So `char_to_word` is just a lookup into that stored data, and nothing needs to be overridden per tokenizer.
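As a rough illustration (again assuming the `bert-base-uncased` fast tokenizer), the alignment data lives on the encoding itself, so `char_to_word` never goes back to the tokenizer:

```python
# Sketch: the word alignment is stored on the BatchEncoding (in the backing
# tokenizers.Encoding objects), so char_to_word is a plain lookup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello, world!")

# The underlying tokenizers.Encoding carries the token -> word alignment...
print(encoding.encodings[0].word_ids)  # [None, 0, 1, 2, 3, None]
# ...and char_to_word resolves against it, with no tokenizer involved:
print(encoding.char_to_word(7))        # 2 -> char 'w' belongs to "world"
```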
Is this the `offset_mapping`?
No, the `offset_mapping` maps each token to its span of characters. You need to combine it with the `word_ids` to get the char-to-word map.
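To make that concrete, here is a sketch (same `bert-base-uncased` assumption) that rebuilds the char-to-word map by hand from `offset_mapping` plus `word_ids`, then checks it against the built-in method:

```python
# Sketch: combine offset_mapping (token -> char span) with word_ids
# (token -> word) to derive char -> word manually.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, world!"
encoding = tokenizer(text, return_offsets_mapping=True)

char_to_word = [None] * len(text)
for (start, end), word_id in zip(encoding["offset_mapping"], encoding.word_ids()):
    if word_id is None:               # special tokens map to no word
        continue
    for char_idx in range(start, end):
        char_to_word[char_idx] = word_id

# Agrees with the built-in lookup (whitespace chars map to None in both):
assert all(char_to_word[i] == encoding.char_to_word(i) for i in range(len(text)))
```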