I’m trying to fine-tune a Llama model. By default, the model’s tokenizer does not set a pad_token, which leads to an error when padding input samples. As suggested by others, I’m adding a pad_token as a special token like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(args.model_path)
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Avoid a mismatch between the model's embedding matrix size and the tokenizer's vocabulary size
model.resize_token_embeddings(len(tokenizer))
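As a sanity check, I expected something like the following to hold after adding the token. This is a minimal sketch using the ungated gpt2 tokenizer as a stand-in for meta-llama/Llama-2-7b-hf (which requires gated access); gpt2 also ships without a pad_token, so the behavior should be comparable:

```python
from transformers import AutoTokenizer

# gpt2 stands in for the gated Llama-2 checkpoint in this sketch
tokenizer = AutoTokenizer.from_pretrained("gpt2")
assert tokenizer.pad_token is None  # no pad token by default

# add_special_tokens returns the number of tokens actually added
num_added = tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(num_added)               # expected: 1, since [PAD] is a new token
print(tokenizer.pad_token)     # expected: [PAD]
print(tokenizer.pad_token_id)  # the id assigned to the new token
```

With Llama-2 I expected the same: pad_token set to [PAD] and the vocabulary grown by one, which is why resize_token_embeddings is called afterwards.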
The model in this case is meta-llama/Llama-2-7b-hf. When I check the tokenizer after running my code, I don’t see [PAD] added to it, and in special_tokens_map.json I see "pad_token": "</s>". I wonder why this happens.
Also, once I fine-tune my model (that part runs successfully), do I need to add the padding token again when loading it, or will the loader automatically pick up the updated tokenizer with the added padding token?