Can't set pad_token by adding special token to Llama's tokenizer

I’m trying to fine-tune a Llama model. As is well known, the model’s tokenizer does not set `pad_token` by default, which results in an error when padding input samples. As suggested by others, I’m trying to add the pad token as a special token, like the following:

    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Avoid a mismatch between the model's embedding size and the tokenizer's vocabulary size
    model.resize_token_embeddings(len(tokenizer))

And the model in this case is meta-llama/Llama-2-7b-hf. When I check the tokenizer after running my code, I don’t see [PAD] added to the tokenizer, and in the special_tokens_map.json I see that "pad_token": "</s>". I wonder why this happens.

Also, once I fine-tune my model (that part runs successfully), I wonder: when I load my model, do I need to add the padding token again, or will the loader automatically load the updated tokenizer with the added padding token?

I have the exact same questions as you, and evidently there isn’t an answer to this anywhere on this forum.

But I do see that my pad token has been updated after executing the `tokenizer.add_special_tokens({'pad_token': '[PAD]'})` call from the original post.
When I load the tokenizer after fine-tuning my model, the pad token is set, and `tokenizer.pad_token` shows `</s>` (even though I expected it to be `[PAD]`). The problem is that at inference time I get the following error: `ValueError: Cannot handle batch sizes > 1 if no padding token is defined.` But the pad token is clearly set (regardless of its value). I don’t know why this happens!
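One thing worth checking (an assumption on my part, not a confirmed fix): that particular `ValueError` is raised based on `model.config.pad_token_id`, not on the tokenizer, so setting the pad token on the model config as well may resolve it. A minimal runnable sketch, using a tiny test checkpoint as a stand-in for your fine-tuned model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny stand-in checkpoint so this sketch is cheap to run;
# substitute the path to your own fine-tuned model.
model_path = "hf-internal-testing/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# The check behind that ValueError looks at model.config.pad_token_id,
# so set the pad token on the config, not just on the tokenizer.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Batched input (batch size > 1) now pads without error.
inputs = tokenizer(["Hello", "A somewhat longer prompt"],
                   padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5,
                         pad_token_id=tokenizer.pad_token_id)
```

If your model is a sequence-classification head rather than a causal LM, the same `model.config.pad_token_id` fix applies.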

After fine-tuning, along with the fine-tuned model, are you also saving the updated tokenizer (with `[PAD]`)? I think you’re supposed to save both, and at inference load both; that should avoid the issue. Unless, of course, you’re already doing this and the issue still persists.
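The save-both/load-both advice can be sketched like this (a sketch under assumptions: `transformers` installed, a causal LM, and a tiny test checkpoint standing in for the real base model):

```python
import tempfile
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny stand-in checkpoint; substitute your actual base model.
base = "hf-internal-testing/tiny-random-LlamaForCausalLM"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model = AutoModelForCausalLM.from_pretrained(base)
model.resize_token_embeddings(len(tokenizer))  # keep embeddings in sync

with tempfile.TemporaryDirectory() as output_dir:
    # ... fine-tuning would happen here ...

    # Save BOTH artifacts to the same directory, not just the model.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)  # writes special_tokens_map.json etc.

    # At inference time, load both from that directory.
    reloaded = AutoTokenizer.from_pretrained(output_dir)
    print(reloaded.pad_token)  # '[PAD]' survives the round trip
```

If you only save the model, `from_pretrained` on the original model name gives you the original tokenizer, without the added `[PAD]`, which would explain the mismatch described above.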