Hi all! There’s an interesting story here.
In general you are correct that causal LMs like Falcon are not trained with a pad token, and so the tokenizer does not have one set. This is true for a lot of causal LMs on the Hub. During training, these models are often fed sequences that have been concatenated together and truncated at the maximum sequence length, so there is never any empty space that needs padding.
The reason we add one later is that a lot of downstream methods use padding and attention masks in some way. However, in many cases it doesn't really matter what you set the padding token to! This is because the padded positions will generally be masked by setting the `attention_mask` to 0 there, so those tokens will not be attended to by the rest of the sequence.
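As a quick sketch of what that looks like in practice (assuming the `tiiuae/falcon-7b` checkpoint here, but any causal LM without a pad token behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Falcon's tokenizer has no pad token, so we reuse the eos token for padding.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Short sequence", "A somewhat longer sequence that needs less padding"],
    padding=True,
    return_tensors="pt",
)

# Padded positions get attention_mask == 0, so they are never attended to,
# which is why the exact choice of pad token usually doesn't matter.
print(batch["input_ids"])
print(batch["attention_mask"])
```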
However, one place the choice of padding token can matter is in the labels when fine-tuning the model. This is because in standard CLM training, the labels are the inputs shifted by a single position, which means the label for the final real token before the padding is the padding token itself. When training models on shorter sequences (such as for chat), we generally want them to mark the end of the text they've generated with a token like `eos_token`. As a result, we commonly just use `eos_token` as the padding token.
However, depending on your fine-tuning task, you may not want the model to learn to predict `eos_token` at the end of a sequence. If that's the case, simply change the label at that position to the token you do want, or set the label to `-100` so that position is ignored by the loss.
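For example, one common way to do that masking, reusing the toy ids from the sketch above:

```python
import torch

input_ids = torch.tensor([[5, 8, 3, 11, 11]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# Set the labels at padded positions to -100, which the cross-entropy loss in
# transformers ignores, so the model is not trained to predict eos/pad there.
labels = input_ids.clone()
labels[attention_mask == 0] = -100
print(labels)  # tensor([[   5,    8,    3, -100, -100]])
```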
Does that answer the questions you had? Feel free to let me know if I missed anything here!