Here (see the attached picture), and in the original paper, it is stated that the last 100 byte tokens are "reused" as sentinel tokens.
However, upon inspecting the embeddings and the tokenizer, I see that there are in fact 384 tokens, around 126 of which are sentinels. Perhaps I misunderstood something, but why the discrepancy?
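For concreteness, here is my back-of-envelope check of the numbers I'm seeing. This is purely my own guess (the 256 + 3 token count and the round-up-to-128 padding are assumptions on my part, not something stated in the paper):

```python
# Guess: 256 byte values + 3 special tokens (pad/eos/unk) = 259 "real"
# tokens, and the embedding matrix might be padded up to a multiple of
# 128 for hardware efficiency -- which would land exactly on 384.
real_tokens = 256 + 3                   # bytes + special tokens (my assumption)
padded = -(-real_tokens // 128) * 128   # ceil-round 259 up to a multiple of 128
leftover = padded - real_tokens         # slots not accounted for by byte tokens

print(padded)    # 384
print(leftover)  # 125
```

If that guess is right, the 384 figure would just be padding, but then I'd expect roughly 125 spare slots rather than exactly 100 sentinels, which is the part that confuses me.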