How to set the Pad Token for meta-llama/Llama-3 Models

The title of the post is pretty much all there is to my question.

I have seen some conflicting pieces of information floating around the internet:

  1. Some people recommend setting tokenizer.pad_token = tokenizer.eos_token
  2. Some people recommend setting tokenizer.pad_token = tokenizer.unk_token
  3. Some people have noted that the Llama3 model tokenizers have both an <|eot_id|> and an <|end_of_text|> token, and suggest setting the pad token to the former
  4. Some people have suggested setting an entirely new pad_token and resizing model dimensions to allow this
    a. This is an approach I would personally like to avoid

Personally, I have also noticed that the Llama3 tokenizers contain a <|finetune_right_pad_id|> token and was wondering if this is what I should be using for the pad_token, and if the naming implies that padding should be added to the right.

Is there any “official” approach to performing padding for this set of models? I would greatly appreciate any tips or resource suggestions!

3 Likes

You could try asking the model authors in a discussion on the model page or on their GitHub, but I doubt you would get a response.

The short answer is that the choice of padding token is not that important as long as you are consistent. Moreover, if you use Flash Attention 2, the padding tokens are stripped out entirely before attention, so they don’t matter at all.

If you aren’t using flash attention 2, you should be careful about the padding tokens because some data collators will mask out padding tokens from the loss. If you set the padding token to be the same as the eos token, then the model will never learn when to stop because the stop token will not be included in the loss.
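To make that concrete, here is a minimal sketch of what a typical collator does when pad and EOS share an id (toy token ids, not the real Llama-3 ids; PyTorch only):

    import torch

    eos_id = 2         # stand-in EOS id, for illustration only
    pad_id = eos_id    # the problematic choice: pad_token == eos_token

    # A padded sequence: real tokens, then EOS, then padding.
    input_ids = torch.tensor([[10, 11, 12, eos_id, pad_id, pad_id]])
    labels = input_ids.clone()

    # Typical collator behavior: mask padding positions out of the loss.
    labels[input_ids == pad_id] = -100

    print(labels)  # tensor([[  10,   11,   12, -100, -100, -100]])
    # The genuine EOS at position 3 is masked as well, so the model never
    # receives a training signal about when to stop generating.

With a distinct pad token, only the trailing pad positions would be set to -100 and the EOS would still contribute to the loss.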

If you use a model in TGI or vLLM, the padding tokens don’t matter.

3 Likes

It really depends on your use case.

If you want to fine-tune the model, setting tokenizer.pad_token = tokenizer.eos_token or tokenizer.pad_token = tokenizer.unk_token is actually a bad idea, because I think the model would then learn to ignore EOS and UNK, and that is not what you really want, right? I think tokenizer.pad_token = tokenizer.eos_token can be correct at inference time.

What I personally did was use one of the tokenizer’s reserved special tokens as the pad token; however, I think your solution (setting it to <|finetune_right_pad_id|>) would be a better idea.
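For reference, a minimal sketch of that setup (the repo id is just an example, the gated meta-llama checkpoints require access, and right padding is only my reading of the token name):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

    # <|finetune_right_pad_id|> already exists in the Llama-3.1 vocabulary,
    # so no new token and no embedding resize are needed.
    tokenizer.pad_token = "<|finetune_right_pad_id|>"
    tokenizer.padding_side = "right"   # what the token name seems to imply

    print(tokenizer.pad_token, tokenizer.pad_token_id)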

Adding a new token is really not needed IMHO.

Hello @nbroad,
I had been left incredibly confused because so many people seemed to set pad_token = eos_token, despite the issue you’ve pointed out here.

Thank you for the clarification.

Hello @morenolq,

I have attempted fine-tuning the LLaMA-3.1-8B model with the SFTTrainer, setting tokenizer.pad_token = '<|finetune_right_pad_id|>'. I ended up asking about the pad token because of the fine-tuned model’s performance on the EleutherAI Evaluation Harness.

For some reason, the fine-tuned model’s performance on HellaSwag dropped to nearly a third of what the original model’s performance was, and I had thought that the pad token might be what was causing the issue.

I’ve recently found out that the LLaMA 3 tokenizers do not add an eos_token_id at the end of inputs, even if you try to set it manually with tokenizer.add_eos_token = True. I suspect that the SFTTrainer’s internal logic does not account for cases like this, and that manually appending an eos_token to the fine-tuning prompts will help mitigate the issue. If you know anything in this regard, I’d love to hear your advice on that as well.
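For example, this is roughly how I convinced myself of that (a sketch; it assumes access to the gated checkpoint, and the repo id is simply the one I used):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

    ids = tok("Hello world")["input_ids"]
    print(ids[-1] == tok.eos_token_id)   # False: no EOS is appended by default

    # Workaround: append the EOS token to the text yourself before tokenizing.
    ids = tok("Hello world" + tok.eos_token)["input_ids"]
    print(ids[-1] == tok.eos_token_id)   # True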

The only thoughts I have on this are related to my personal experience:

  1. I agree, the tokenizer does not seem to add the final EOS token, so I added it manually (I have a custom PyTorch dataset with a custom __getitem__).
  2. I trained with the pad token set to the first of the reserved special tokens. Using the token you suggested seemed like a better option to me, though I didn’t try it.
  3. If it helps, I used this setup with DoRA.

To wrap up: I would do explicit tokenization and pass the token IDs to SFTTrainer, adding an extra EOS token manually (see the sketch below). As for padding, either a reserved (unused) token or the token you mentioned should do the trick.
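A rough sketch of what I mean, with a toy dataset (the repo id, pad token, and max length are placeholders; adapt them to your setup and trl version):

    from datasets import Dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    tokenizer.pad_token = "<|finetune_right_pad_id|>"   # or a reserved special token

    raw = Dataset.from_dict({"text": ["First training example.", "Second training example."]})

    def tokenize_with_eos(example):
        # Append EOS explicitly, since the tokenizer will not add it on its own.
        return tokenizer(example["text"] + tokenizer.eos_token,
                         truncation=True, max_length=1024)

    train_dataset = raw.map(tokenize_with_eos, remove_columns=["text"])
    # train_dataset now holds input_ids/attention_mask ending in the EOS id,
    # ready to pass to SFTTrainer (or a plain Trainer with a causal-LM collator).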

Just to understand: have you checked the generated sentences? Are they very long (i.e., generation that never stops)?

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.