How to set the Pad Token for meta-llama/Llama-3 Models

You could try asking the model authors in a discussion on the model page or on their GitHub, but I doubt you would get a response.

The short answer is that the choice of padding token is not that important as long as you are consistent. Moreover, if you use Flash Attention 2, the inputs are unpadded before attention is computed, so the padding tokens are effectively removed and don't matter at all.
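For context, here is roughly how Flash Attention 2 is enabled in transformers (a minimal sketch; it assumes the flash-attn package is installed and your GPU supports it):

```python
import torch
from transformers import AutoModelForCausalLM

# With this attention implementation, padded batches are unpadded
# internally before the attention kernel runs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```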

If you aren't using Flash Attention 2, you should be careful about the padding token, because many data collators mask padding tokens out of the loss (typically by setting their labels to -100). If you set the padding token to be the same as the EOS token, those collators will mask the EOS token too, so the model will never learn when to stop generating.
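One common workaround (a sketch of a community convention, not an official recommendation from the model authors) is to repurpose one of Llama-3's unused reserved special tokens as the pad token, which keeps the pad id distinct from EOS without touching the embedding matrix:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Llama-3's vocabulary ships unused tokens of the form
# <|reserved_special_token_N|>; reusing one keeps pad != eos
# without growing the vocabulary.
tokenizer.pad_token = "<|reserved_special_token_0|>"
model.config.pad_token_id = tokenizer.pad_token_id

# Sanity check: collators can now mask padding from the loss
# without also masking the stop token.
assert tokenizer.pad_token_id != tokenizer.eos_token_id
```

The alternative is to add a brand-new token via `tokenizer.add_special_tokens({"pad_token": "<pad>"})` and then call `model.resize_token_embeddings(len(tokenizer))`, but that changes the vocabulary size, which some deployment stacks don't handle gracefully.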

If you only use the model for inference in TGI or vLLM, the padding token doesn't matter, since those servers batch requests without relying on padding.
