Can't set pad_token by adding special token to Llama's tokenizer

I’m trying to fine-tune a Llama model. As we know, the model’s tokenizer does not set pad_token by default, which results in an error when we want to pad input samples. As suggested by others, I’m trying to add the pad_token by adding a special token, as follows:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(args.model_path)
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Avoid a mismatch between the model's embedding size and the tokenizer's vocabulary size
    model.resize_token_embeddings(len(tokenizer))

The model in this case is meta-llama/Llama-2-7b-hf. When I check the tokenizer after running my code, I don’t see [PAD] added to it, and in special_tokens_map.json I see "pad_token": "</s>". I wonder why this happens.
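For reference, the other workaround that is commonly suggested is to reuse the EOS token as the pad token, which would at least explain the </s> I’m seeing in special_tokens_map.json (just my guess that something like this ran somewhere in my pipeline):

    # Alternative workaround: reuse the existing EOS token as the pad token.
    # No new token is added, so the embeddings don't need to be resized.
    tokenizer.pad_token = tokenizer.eos_token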

Also, once I fine-tune my model (that part does run successfully), I wonder whether, when loading the model, I need to add the padding token again, or whether the loader will automatically load the updated tokenizer with the added padding token?

I have the exact same questions as you, and evidently there isn’t an answer to this anywhere on this forum.

But I do see that my pad token has been updated after executing:

    print(repr(tokenizer.pad_token))

When I load the tokenizer after fine-tuning my model, the pad token is set, and tokenizer.pad_token shows </s> (even though I expect it to be [PAD]). The problem is that when I try to run inference, I get the following error: ValueError: Cannot handle batch sizes > 1 if no padding token is defined. But the pad token is clearly set (regardless of its value). I don’t know why this happens!
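One thing I’m not sure about (this is only a guess) is whether the error is coming from the model side rather than the tokenizer, i.e. the model config’s pad_token_id may still be unset even though the tokenizer has a pad token. A minimal sketch of what I mean, assuming that is the cause:

    # Assumption: mirror the tokenizer's pad token into the model config,
    # since the batch-size error seems to be raised based on the model's config.
    model.config.pad_token_id = tokenizer.pad_token_id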

After fine-tuning, along with the fine-tuned model, are you also saving the updated tokenizer (with [PAD])? I think you’re supposed to save both, then load both during inference; that will probably avoid the issue. Unless, of course, you’re already doing this and the issue still persists.
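Something like this is what I have in mind (output_dir is just a placeholder for wherever you save your fine-tuned model):

    # Save both the fine-tuned model and the updated tokenizer (with [PAD]).
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # At inference time, load both from the same directory so the pad token
    # and the resized embeddings stay in sync.
    model = AutoModelForCausalLM.from_pretrained(output_dir)
    tokenizer = AutoTokenizer.from_pretrained(output_dir)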