Tokenizer producing token ids greater than the vocabulary size

I am using a tokenizer with a vocab size of 30522, but the tokenized dataset contains token ids of 50,000 and above. Is that even possible? What might I be doing wrong?
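
For concreteness, here is a minimal sketch of the kind of check I have in mind (assuming a Hugging Face tokenizer; `bert-base-uncased` and the sample texts are just illustrative stand-ins for my actual setup, picked because its vocab size is also 30522):

```python
from transformers import AutoTokenizer

# Illustrative tokenizer with a 30522-token base vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)  # base vocabulary size
print(len(tokenizer))        # base vocabulary plus any added special tokens

# Encode a couple of placeholder sentences.
texts = ["An example sentence.", "Another example sentence."]
encoded = tokenizer(texts)

# Largest token id the tokenizer actually produced.
max_id = max(max(ids) for ids in encoded["input_ids"])
print(max_id)

# My understanding is that every id should satisfy max_id < len(tokenizer),
# so ids of 50,000+ would suggest the dataset was encoded with a different
# (larger-vocabulary) tokenizer than the one I am loading here.
assert max_id < len(tokenizer)
```

In my actual run, the equivalent of this assertion fails, which is what prompted the question.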