Trying to understand the `char_to_word` method on `transformers.BatchEncoding`. The description in the docs is:

> Get the word in the original string corresponding to a character in the original string of a sequence of the batch.
Is a “word” defined as a run of non-whitespace characters that the tokenizer maps to a single token?
The notion of a word depends on the tokenizer: the text words are the result of the pre-tokenization step. Depending on the tokenizer, the text can be split on whitespace, on whitespace and punctuation, or by other, more advanced rules.
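For instance, here is a minimal sketch assuming the `bert-base-uncased` fast tokenizer, whose pre-tokenizer splits on whitespace *and* punctuation, showing how pre-tokenization defines the words that `char_to_word` refers to:

```python
# Sketch assuming the "bert-base-uncased" fast tokenizer; its pre-tokenizer
# splits on whitespace and punctuation, so "," and "!" become words of their own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello, world!")

# Words after pre-tokenization: "Hello" (0), "," (1), "world" (2), "!" (3).
print(encoding.word_ids())       # [None, 0, 1, 2, 3, None] (None = special tokens)
print(encoding.char_to_word(0))  # 0 -> char 'H' belongs to the word "Hello"
print(encoding.char_to_word(5))  # 1 -> char ',' counts as its own word here
```

With a pre-tokenizer that splits only on whitespace (e.g. `WhitespaceSplit`), `char_to_word(5)` would instead return 0, because `"Hello,"` would count as a single word.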
Does each tokenizer override `char_to_word`? Is that even possible, since `char_to_word` is defined on `BatchEncoding`? How do the subtleties of each tokenizer get transmitted to `char_to_word`?
The `BatchEncoding` has the correspondence from chars/tokens to words stored inside of it; the fast tokenizer puts it there when it builds the encoding. So `char_to_word` is just a lookup into that stored data, and nothing needs to be overridden per tokenizer.
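As a rough illustration (again assuming the `bert-base-uncased` fast tokenizer), the alignment data lives on the encoding itself, so `char_to_word` never goes back to the tokenizer:

```python
# Sketch: the word alignment is stored on the BatchEncoding (in the backing
# tokenizers.Encoding objects), so char_to_word is a plain lookup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Hello, world!")

# The underlying tokenizers.Encoding carries the token -> word alignment...
print(encoding.encodings[0].word_ids)  # [None, 0, 1, 2, 3, None]
# ...and char_to_word resolves against it, with no tokenizer involved:
print(encoding.char_to_word(7))        # 2 -> char 'w' belongs to "world"
```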
Is this the `offset_mapping`?
No, the `offset_mapping` maps each token to its span of characters. You need to combine it with the `word_ids` to get the char-to-word map.
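To make that concrete, here is a sketch (same `bert-base-uncased` assumption) that rebuilds the char-to-word map by hand from `offset_mapping` plus `word_ids`, then checks it against the built-in method:

```python
# Sketch: combine offset_mapping (token -> char span) with word_ids
# (token -> word) to derive char -> word manually.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, world!"
encoding = tokenizer(text, return_offsets_mapping=True)

char_to_word = [None] * len(text)
for (start, end), word_id in zip(encoding["offset_mapping"], encoding.word_ids()):
    if word_id is None:               # special tokens map to no word
        continue
    for char_idx in range(start, end):
        char_to_word[char_idx] = word_id

# Agrees with the built-in lookup (whitespace chars map to None in both):
assert all(char_to_word[i] == encoding.char_to_word(i) for i in range(len(text)))
```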