Tokenizer.pad_token=what?

MLEnthusiast · October 13, 2022, 12:49am

Hey everyone! I’m trying out some fine-tuning and I’m not sure how to fix this error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'

I get this error while trying to fine-tune openai-gpt.

I’m relatively new to fine-tuning and don’t really understand the concepts behind tokenizers and padding yet. Would love some clarification, with any examples/solutions!

(I’ve found the Hugging Face tutorials on this to be too high level for a beginner).

Cheers!

lianghsun · October 25, 2022, 8:14pm

Actually, there is no shortcut to learning the HF Tokenizers library… I’d suggest taking a look at the HF Course and the HF Tokenizers documentation to learn how to use it. That said, here is a hint on how to deal with padding. First of all, specify a [PAD] token in BpeTrainer:

trainer = BpeTrainer(special_tokens=[..., '[PAD]', ...])

Second, after training the tokenizer, call tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]')). Sentences you encode after that can then contain padding.
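To make those two steps concrete, here is a minimal sketch of how they might fit together. The tiny in-memory corpus, the [UNK] token, and the Whitespace pre-tokenizer are assumptions made only for illustration; a real setup would train on your own data.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus, assumed purely for illustration.
corpus = ["hello world", "hello there", "padding comes after tokenizing"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Step 1: reserve [PAD] as a special token when training the vocabulary.
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Step 2: only after training, tell the tokenizer which token to pad with.
pad_id = tokenizer.token_to_id("[PAD]")
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")

# Encoding a batch now pads the shorter sentence to the longest one.
encodings = tokenizer.encode_batch(["hello world", "hello"])
for enc in encodings:
    print(enc.tokens, enc.ids, enc.attention_mask)
```

The attention_mask is 0 at the padded positions, which is how a model can be told to ignore them.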

Why do we call .enable_padding() only after training? Because learning the BPE vocabulary does not need to take padding into account. Padding (like truncation) is something to consider after a sentence has been tokenized.
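If it helps to see why padding is a post-tokenization step, here is a rough plain-Python sketch of what padding and truncation do to token ids that already exist. The helper name pad_and_truncate and the example ids are made up for illustration; this is not how the library implements it internally.

```python
def pad_and_truncate(batch, pad_id, max_length=None):
    """Bring already-tokenized id lists to a common length.

    Hypothetical helper for illustration only.
    """
    # Truncation: cut sequences that are longer than max_length.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]

    # Padding: extend every sequence to the longest one in the batch.
    target = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = target - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Token ids as they might come out of a tokenizer (made-up values).
token_ids = [[5, 8, 13], [7], [4, 9, 2, 6, 11]]
padded, masks = pad_and_truncate(token_ids, pad_id=0, max_length=4)
print(padded)
print(masks)
```

Nothing here touches the vocabulary: the function only needs an id to fill with, which is why the pad token has to exist in the vocabulary but plays no part in learning the merges.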

MLEnthusiast · November 8, 2022, 11:20am

Thanks very much for your response Lianghsun
