How to set the Pad Token for meta-llama/Llama-3 Models

The title of the post is pretty much all there is to my question.

I have seen some conflicting pieces of information floating around the internet:

  1. Some people recommend setting tokenizer.pad_token = tokenizer.eos_token
  2. Some people recommend setting tokenizer.pad_token = tokenizer.unk_token
  3. Some people have noted that the Llama 3 model tokenizers have both an <|eot_id|> and an <|end_of_text|> token, and suggest setting the pad token to the former
  4. Some people have suggested setting an entirely new pad_token and resizing the model’s token embeddings to accommodate it
    a. This is an approach I would personally like to avoid

Personally, I have also noticed that the Llama3 tokenizers contain a <|finetune_right_pad_id|> token and was wondering if this is what I should be using for the pad_token, and if the naming implies that padding should be added to the right.
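
For concreteness, this is roughly what I see when I inspect the tokenizer (the checkpoint name is just the variant I happen to be looking at):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint

    print(tokenizer.pad_token)  # None by default
    print(tokenizer.eos_token)  # <|end_of_text|> for the base model

    vocab = tokenizer.get_vocab()
    for tok in ("<|eot_id|>", "<|end_of_text|>", "<|finetune_right_pad_id|>"):
        print(tok, vocab.get(tok))  # all three already exist in the vocabulary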

Is there any “official” approach to performing padding for this set of models? I would greatly appreciate any tips or resource suggestions!

6 Likes

You could try asking the model authors in a discussion on the model page or on their GitHub, but I doubt you would get a response.

The short answer is that the padding tokens are not that important as long as you are consistent. Moreover, if you use Flash Attention 2, the padding tokens are removed entirely, so they don’t matter at all.

If you aren’t using Flash Attention 2, you should be careful about the padding token, because some data collators mask out padding tokens from the loss. If you set the padding token to be the same as the EOS token, the model will never learn when to stop, because the stop token will never be included in the loss.
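
To make the masking issue concrete, here is a minimal sketch using DataCollatorForLanguageModeling as an example of a collator that behaves this way (the checkpoint name is just an example and requires access to the gated repo):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint
    tokenizer.pad_token = tokenizer.eos_token  # the problematic choice

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    batch = collator([
        tokenizer("short example" + tokenizer.eos_token),
        tokenizer("a somewhat longer training example" + tokenizer.eos_token),
    ])

    # The collator sets labels to -100 wherever input_ids == pad_token_id.
    # Since pad_token_id == eos_token_id here, the real EOS positions are masked too,
    # so the model gets no gradient signal for emitting EOS.
    print(batch["labels"])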

If you use a model in TGI or vLLM, the padding tokens don’t matter.

3 Likes

It actually really depends on the case.

If you want to fine-tune the model, setting tokenizer.pad_token = tokenizer.eos_token or tokenizer.pad_token = tokenizer.unk_token is actually a bad idea, because I think the model would then learn to ignore EOS and UNK, which is not what you really want, right? I think tokenizer.pad_token = tokenizer.eos_token can be correct at inference time.

What I personally did was use one of the reserved tokens from the tokenizer as the pad token; however, I think your solution (setting it to <|finetune_right_pad_id|>) would be an even better idea (rough sketch below).

Adding a new token is really not needed IMHO.
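
For reference, both options look something like this (the checkpoint name is just an example, and <|reserved_special_token_0|> is one of the reserved tokens that ship with the Llama 3 vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint

    # Option A: one of the reserved special tokens (what I did)
    tokenizer.pad_token = "<|reserved_special_token_0|>"

    # Option B: the dedicated padding token (probably the cleaner choice)
    tokenizer.pad_token = "<|finetune_right_pad_id|>"
    tokenizer.padding_side = "right"  # matches the token's name; adjust if your setup pads left

    # Both tokens already exist in the vocabulary, so no embedding resize is needed.
    print(tokenizer.pad_token_id, len(tokenizer))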

Hello @nbroad,
I had been left incredibly confused because so many people seemed to set pad_token = eos_token, despite the reality you’ve mentioned here.

Thank you for the clarification.

1 Like

Hello @morenolq,

I have attempted fine-tuning the LLaMA-3.1-8B model with the SFTTrainer by setting tokenizer.pad_token = '<|finetune_right_pad_id|>'. The reason I ended up inquiring about the pad token was the fine-tuned model’s performance on the EleutherAI Evaluation Harness.

For some reason, the fine-tuned model’s performance on HellaSwag dropped to nearly a third of what the original model’s performance was, and I had thought that the pad token might be what was causing the issue.

I’ve recently found out that the LLaMA 3 model tokenizers do not add an eos_token_id at the end of inputs, even if you attempt to set it manually with tokenizer.add_eos_token = True. I am thinking that the SFTTrainer’s internal logic does not take cases like this into account, and that manually adding an eos_token to the fine-tuning prompts will help mitigate the issue. If you know anything in this regard, I’d love to hear your advice on that as well.
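
This is roughly how I verified it and how I plan to work around it (the checkpoint name is again just an example; the exact SFTTrainer wiring depends on the trl version, so I am only showing the tokenization part):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint

    text = "Some training example"
    ids = tokenizer(text)["input_ids"]
    print(ids[-1] == tokenizer.eos_token_id)  # False: no EOS is appended by default

    # Workaround: append the EOS token to the raw text (or the id to the id list) myself
    ids_with_eos = tokenizer(text + tokenizer.eos_token)["input_ids"]
    print(ids_with_eos[-1] == tokenizer.eos_token_id)  # True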

1 Like

The only thoughts I have on this are related to my personal experience:

  1. I agree, the tokenizer does not seem to add the final EOS token, so I added it manually (I have a custom PyTorch dataset with a custom __getitem__).
  2. I trained with the pad token set to the first of the reserved special tokens. Using the token you suggested just seemed like a better option to me, though I didn’t try it.
  3. If it helps, I used this setup with DoRA.

To wrap up: I would do explicit tokenization, add the extra EOS token manually, and pass the token IDs to SFTTrainer. Setting the pad token to an unused token should do the trick, whether that is one of the reserved tokens or the <|finetune_right_pad_id|> token you mentioned.
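
Something along these lines is what I mean by explicit tokenization (the names and the max length are made up, just to show the idea):

    from torch.utils.data import Dataset

    class SFTTextDataset(Dataset):
        # Hypothetical sketch: tokenize explicitly and append the EOS id myself,
        # since the tokenizer does not add it.
        def __init__(self, texts, tokenizer, max_length=1024):
            self.texts = texts
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.texts)

        def __getitem__(self, idx):
            ids = self.tokenizer(
                self.texts[idx], truncation=True, max_length=self.max_length - 1
            )["input_ids"]
            ids = ids + [self.tokenizer.eos_token_id]  # explicit EOS at the end
            return {"input_ids": ids, "attention_mask": [1] * len(ids)}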

Just to understand: have you checked the generated sentences? Are they very long (i.e., generation that never ends)?

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.