What is a "word"?

Trying to understand the char_to_word method on transformers.BatchEncoding. The description in the docs is:

Get the word in the original string corresponding to a character in the original string of a sequence of the batch.

Is a “word” defined to be a collection of nonwhitespace characters mapped, by the tokenizer, to a single token?

The notion of a word depends on the tokenizer: the words of a text are the result of the pre-tokenization operation. Depending on the tokenizer, the text can be split on whitespace, on whitespace and punctuation, or by something more advanced.
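
To make the difference concrete, here is a minimal sketch of two pre-tokenization rules in plain Python. It is not the code any particular tokenizer runs; the function names and the regular expressions are assumptions chosen for illustration only.

```python
import re

def split_on_whitespace(text):
    """Hypothetical pre-tokenizer: a word is a run of non-whitespace characters."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]

def split_on_whitespace_and_punctuation(text):
    """Hypothetical pre-tokenizer: punctuation marks become words of their own."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Hello, world!"

# Each entry is (word, (start_char, end_char)).
print(split_on_whitespace(text))
# [('Hello,', (0, 6)), ('world!', (7, 13))]

print(split_on_whitespace_and_punctuation(text))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

With the first rule the character at index 5 (the comma) belongs to word 0; with the second it belongs to word 1. So the same character can map to a different word index depending on how the tokenizer pre-tokenizes.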

Does each tokenizer override char_to_word? Is this even possible, since char_to_word is defined on BatchEncoding? How do the subtleties of each tokenizer get transmitted to char_to_word?

The BatchEncoding has the correspondence from chars/tokens to words stored inside of it, which the fast tokenizer put there.


Is this the offset_mapping?

No, the offset_mapping maps each token to its span of characters. You need to combine it with the word_ids to get the map from char to word.
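
A rough sketch of that combination, in plain Python. The helper name build_char_to_word is made up for this example, and the offsets and word_ids below are hand-written to look like what a fast tokenizer might return for "Hello, world!" (with return_offsets_mapping=True and encoding.word_ids()); they are not real tokenizer output.

```python
def build_char_to_word(offsets, word_ids, text_length):
    """Combine token->char spans with token->word ids into a char->word list.

    Hypothetical helper: characters not covered by any token (e.g. spaces)
    and special tokens (word id None) are left as None.
    """
    char_to_word = [None] * text_length
    for (start, end), word_id in zip(offsets, word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] / [SEP] belong to no word
        for i in range(start, end):
            char_to_word[i] = word_id
    return char_to_word

text = "Hello, world!"

# Assumed tokens: [CLS] hello , wor ##ld ! [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (7, 10), (10, 12), (12, 13), (0, 0)]
word_ids = [None, 0, 1, 2, 2, 3, None]

char_to_word = build_char_to_word(offsets, word_ids, len(text))
print(char_to_word[8])  # 2: the 'o' in "world" is in word 2
print(char_to_word[6])  # None: the space is not part of any word
```

Note that the two sub-word tokens "wor" and "##ld" share word id 2, which is why the offsets alone are not enough to recover the word.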
