Hey everyone! Trying out some fine-tuning and I'm not exactly sure how to fix this error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`

I'm trying to fine-tune openai-gpt and hit this error.

I'm relatively new to fine-tuning and don't really understand the concepts behind tokenizers and padding yet. I'd love some clarification with any examples/solutions!

(I've found the Hugging Face tutorials on this to be too high level for a beginner.)


There's no real shortcut to learning the HF Tokenizers library; I'd suggest taking a look at the HF Course and the HF Tokenizers documentation to learn how to use it. But here's a hint on how to deal with padding: first, reserve the [PAD] token as a special token in BpeTrainer:

trainer = BpeTrainer(special_tokens=[..., '[PAD]', ...])

Second, after training the tokenizer, call tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]')). Any batch of sentences you encode afterwards will then be padded automatically.
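Putting the two steps together, here's a minimal end-to-end sketch (the tiny in-memory corpus, the Whitespace pre-tokenizer, and the [UNK]/[PAD] token names are just placeholder choices for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# build a BPE tokenizer and train it on a toy in-memory corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
corpus = ["a tiny toy corpus", "just enough text to learn a few merges"]
tokenizer.train_from_iterator(corpus, trainer)

# step two: enable padding, pointing at the [PAD] token reserved above
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
)

# shorter sequences in a batch are now padded to the longest one
batch = tokenizer.encode_batch(["a tiny toy corpus", "text"])
```

After this, every `Encoding` in `batch` has the same length, with the shorter one filled out with [PAD] ids.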

Why do we call .enable_padding() only after training? Because when training the BPE vocabulary there's no need to consider padding; padding (like truncation) is something applied after a sentence has been tokenized.
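For the openai-gpt error in the original question, the same idea applies on the transformers side. A hedged sketch: as far as I know, openai-gpt ships with no special tokens at all, so the `tokenizer.pad_token = tokenizer.eos_token` shortcut from the error message won't work there (`eos_token` is None); adding a new [PAD] token does:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")

# openai-gpt defines no eos_token, so register a brand-new [PAD] token
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# padding a batch now works instead of raising the ValueError
batch = tokenizer(["short", "a somewhat longer sentence"], padding=True)

# if you fine-tune, grow the model's embedding matrix to match the
# enlarged vocabulary, e.g.:
# model.resize_token_embeddings(len(tokenizer))
```

The `resize_token_embeddings` call matters because the newly added token id has no row in the pretrained embedding matrix until you resize it.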


Thanks very much for your response Lianghsun 🙂
