Tokenizer causes TRL completion data collator failure

Tokenizers can cause a failure in TRL's DataCollatorForCompletionOnlyLM class, because encoding the plain instruction and response templates on their own can produce different token ids than encoding them as part of a longer input.

To use the data collator for completion-only training, we need to define an instruction template and a response template. I followed the Open Assistant format and defined the following data collator and formatting function for the tatsu-lab/alpaca_farm dataset, which has an "input" field:

    from trl import DataCollatorForCompletionOnlyLM

    instruction_template = "<|prompter|>"
    response_template = "<|assistant|>"
    collator = DataCollatorForCompletionOnlyLM(
        instruction_template=instruction_template,
        response_template=response_template,
        tokenizer=tokenizer,
    )

    # Batched formatting function: every field in `example` is a list.
    def formatting_prompts_func(example):
        output_texts = []
        for i in range(len(example['instruction'])):
            if example['input'][i] != "":
                text = f"{instruction_template} {example['instruction'][i]} {example['input'][i]}\n{response_template} {example['output'][i]}"
            else:
                text = f"{instruction_template} {example['instruction'][i]}\n{response_template} {example['output'][i]}"
            output_texts.append(text)
        return output_texts
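
For context, this is roughly how the collator and formatting function plug into SFTTrainer. It is only a sketch: the base model is a placeholder, the tokenizer loaded here is the one passed to the collator above, and the alpaca_farm config/split names may need adjusting:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    model_name = "facebook/opt-350m"  # placeholder; any causal LM works here
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Adjust the config/split names to whatever layout the dataset actually uses.
    dataset = load_dataset("tatsu-lab/alpaca_farm", "alpaca_instructions", split="sft")

    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        formatting_func=formatting_prompts_func,
        data_collator=collator,
    )
    trainer.train()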

The particular string formatting in the example above works, but only after a number of manual tweaks. If you were to use some other formatting such as "{response_template}:", the data collator would fail immediately, complaining that it cannot find the response template in any of the inputs.

This is because the data collator first encodes the template into a sequence of token ids and then searches for that id sequence inside the input_ids. However, when the tokenizer encodes the input as a whole, it may produce ids for the template region that differ from the ids of the template encoded alone (due to how BPE merges work). For example, the last one or two ids in the encoding of "<|assistant|>" differ from the corresponding ids in the encoding of "<|assistant|>:".
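
A quick way to see the mismatch (a rough sketch; the tokenizer here is just an example, and the exact ids depend on the tokenizer you actually use):

    from transformers import AutoTokenizer

    # Any BPE-based tokenizer can show the effect; this one is only an example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    response_template = "<|assistant|>"

    # Ids the collator searches for: the template encoded on its own.
    template_ids = tokenizer.encode(response_template, add_special_tokens=False)

    # Ids that actually appear when the template is followed by ":" in context.
    in_context_ids = tokenizer.encode(f"{response_template}: Sure, here it is.",
                                      add_special_tokens=False)

    print(template_ids)
    print(in_context_ids[:len(template_ids)])
    # If BPE merges the template's trailing characters with the following ":",
    # the two sequences differ and the collator never finds the template.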

Not sure if it's just me, or whether others have run into this and opted for some other way to avoid the behavior? For example, you could presumably register the instruction and response templates as two new tokens in the tokenizer and fine-tune their embeddings, but that sounds rather dubious.
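
For completeness, registering the templates as special tokens would look roughly like this (untested sketch with a placeholder model; my doubt is whether training the fresh embeddings from scratch is worth it):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model/tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Register the templates as atomic special tokens so they always encode
    # to a single, context-independent id.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<|prompter|>", "<|assistant|>"]}
    )

    # Grow the embedding matrix to cover the new ids; their embeddings start
    # out untrained and would have to be learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))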