The post-processor has to be added manually: the official pretrained tokenizer ships with it already attached, but a tokenizer you train yourself does not, so no `</s>` (EOS) token gets appended to your sequences.
from tokenizers.processors import TemplateProcessing

# `tokenizer` is assumed to be an already-loaded fast tokenizer
# (e.g. one you trained yourself, which lacks the post-processor).
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", tokenizer.eos_token_id)],
)

inputs = tokenizer(
    ["Hello world", "Hello"],
    padding=True, truncation=True, max_length=100, return_tensors="pt",
)
labels = inputs["input_ids"]
print(labels)  # each sequence now ends with the </s> id (plus padding)
Credit goes to arr10 on Stack Overflow.
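To see the effect end to end without downloading a pretrained model, here is a minimal self-contained sketch. The tiny word-level vocabulary and the token ids are made up purely for illustration; the point is only that the `TemplateProcessing` post-processor is what appends `</s>`:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

# Tiny hypothetical vocabulary; a real tokenizer would be trained on a corpus.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "Hello": 3, "world": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    pad_token="<pad>",
    eos_token="</s>",
    unk_token="<unk>",
)

# Without a post-processor, no </s> is appended:
print(tokenizer("Hello world")["input_ids"])  # [3, 4]

# Attach the template, exactly as in the snippet above:
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", tokenizer.eos_token_id)],
)
print(tokenizer("Hello world")["input_ids"])  # [3, 4, 1] — </s> appended
```

The same idea applies to any fast tokenizer: the template string (`"$A </s>"`) controls where special tokens are placed around the encoded sequence.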