Why does the falcon QLoRA tutorial code use eos_token as pad_token?

Hi @brando @maxolotl @Rocketknight1
The best way to fix this issue is to change the tokenizer's post-processing template:

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

text = "Random text"
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

print(tokenizer(text)) # base tokenizer
# {'input_ids': [25070, 2288], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

# Append the EOS token to every sequence (and after each segment of a pair) at tokenization time
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    pair="$A " + tokenizer.eos_token + " $B:1 " + tokenizer.eos_token + ":1",
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)

print(tokenizer(text)) # Updated tokenizer with EOS token
# {'input_ids': [25070, 2288, 11], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
tokenizer.model_max_length = 5  # tiny max length just to demonstrate padding

print(tokenizer(text, padding="max_length")) # Updated tokenizer with EOS token and padding
# {'input_ids': [25070, 2288, 11, 11, 11], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}

Note that the model still has to learn to predict the EOS token through causal language modeling, so make sure the appended EOS token is not masked out of the labels just because it doubles as the padding token.
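For example (a minimal sketch, not from the original post; it assumes the default behaviour of DataCollatorForLanguageModeling with mlm=False, which replaces every pad_token_id position in the labels with -100):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([tokenizer("Random text"), tokenizer("Some longer random text")])

print(batch["labels"])
# Because pad_token_id == eos_token_id (11), every EOS position is set to -100,
# including the genuine end-of-sequence token, so it contributes nothing to the loss.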
