Hi @brando @maxolotl @Rocketknight1
The best way to fix this issue is to change the post-processing template so the tokenizer appends the EOS token:
```python
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

text = "Random text"
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
print(tokenizer(text))  # base tokenizer: no EOS appended
# {'input_ids': [25070, 2288], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

# $A/$B stand for the first/second input segment; ":1" assigns type ID 1
# to the second segment and its EOS token
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    pair="$A " + tokenizer.eos_token + " $B:1 " + tokenizer.eos_token + ":1",
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)
print(tokenizer(text))  # updated tokenizer now appends the EOS token (id 11)
# {'input_ids': [25070, 2288, 11], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 5  # short max length just for this demo
print(tokenizer(text, padding="max_length"))  # EOS appended, then padded to max length
# {'input_ids': [25070, 2288, 11, 11, 11], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
```
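As a quick sanity check of the `pair` template (not part of the original fix, just continuing from the snippet above; the exact token IDs depend on the input):

```python
# The pair template "$A <eos> $B:1 <eos>:1" kicks in when two texts are passed
enc = tokenizer("first", "second")
print(enc["input_ids"])       # [<first ids>, 11, <second ids>, 11] -- EOS after each segment
print(enc["token_type_ids"])  # 0s for the first segment and its EOS, 1s for the second
```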
Note that the model still has to learn to predict the EOS token through causal language modeling; appending it at tokenization time only makes it part of the training sequences.
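One caveat if you build the labels yourself (a minimal sketch, not from the original post): since `pad_token == eos_token` here, masking labels by `pad_token_id` would also mask the real EOS out of the loss. Masking via the attention mask keeps the appended EOS as a prediction target:

```python
import torch

# Continues from the snippet above (model_max_length=5, pad == eos)
enc = tokenizer(text, padding="max_length", return_tensors="pt")
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100  # ignore padding only, keep the real EOS
print(labels)  # tensor([[25070, 2288, 11, -100, -100]])
```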