Tokenizer causes TRL completion data collator failure

Tokenizers can cause a failure in TRL's DataCollatorForCompletionOnlyLM class, because encoding the plain instruction and response templates on their own can produce different token ids than encoding them as part of a longer input.

To use the data collator for completion-only training, we need to define an instruction template and a response template. I followed the Open Assistant format and defined the following data collator and formatting function for the tatsu-lab/alpaca_farm dataset, which has an "input" field:

    from trl import DataCollatorForCompletionOnlyLM

    instruction_template = "<|prompter|>"
    response_template = "<|assistant|>"
    collator = DataCollatorForCompletionOnlyLM(
        instruction_template=instruction_template,
        response_template=response_template,
        tokenizer=tokenizer,
    )

    # Batched formatting function: every field in `example` is a list.
    def formatting_prompts_func(example):
        output_texts = []
        for i in range(len(example['instruction'])):
            if example['input'][i] != "":
                text = f"{instruction_template} {example['instruction'][i]} {example['input'][i]}\n{response_template} {example['output'][i]}"
            else:
                text = f"{instruction_template} {example['instruction'][i]}\n{response_template} {example['output'][i]}"
            output_texts.append(text)
        return output_texts
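
For context, this is roughly how the collator and formatting function plug into SFTTrainer. It is only a sketch: the base model is a placeholder, the tokenizer loaded here is the one passed to the collator above, and the alpaca_farm config/split names may need adjusting:

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    model_name = "facebook/opt-350m"  # placeholder; any causal LM works here
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Adjust the config/split names to whatever layout the dataset actually uses.
    dataset = load_dataset("tatsu-lab/alpaca_farm", "alpaca_instructions", split="sft")

    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        formatting_func=formatting_prompts_func,
        data_collator=collator,
    )
    trainer.train()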

The particular string formatting in the example above works, but only after a number of manual tweaks. If you were to use some other formatting such as "{response_template}:", the data collator would fail immediately, complaining that it cannot find the response template in any of the inputs.

This is because the data collator first encodes the template into a sequence of token ids and then searches for that id sequence inside the input_ids. However, when the tokenizer encodes the input as a whole, it may produce ids for the template region that differ from the ids of the template encoded alone (due to how BPE merges work). For example, the last one or two ids in the encoding of "<|assistant|>" differ from the corresponding ids in the encoding of "<|assistant|>:".
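
A quick way to see the mismatch (a rough sketch; the tokenizer here is just an example, and the exact ids depend on the tokenizer you actually use):

    from transformers import AutoTokenizer

    # Any BPE-based tokenizer can show the effect; this one is only an example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    response_template = "<|assistant|>"

    # Ids the collator searches for: the template encoded on its own.
    template_ids = tokenizer.encode(response_template, add_special_tokens=False)

    # Ids that actually appear when the template is followed by ":" in context.
    in_context_ids = tokenizer.encode(f"{response_template}: Sure, here it is.",
                                      add_special_tokens=False)

    print(template_ids)
    print(in_context_ids[:len(template_ids)])
    # If BPE merges the template's trailing characters with the following ":",
    # the two sequences differ and the collator never finds the template.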

Not sure if it's just me, or whether others have run into this and opted for some other way to avoid the behavior? For example, you could presumably register the instruction and response templates as two new tokens in the tokenizer and fine-tune their embeddings, but that sounds rather dubious.
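
For completeness, registering the templates as special tokens would look roughly like this (untested sketch with a placeholder model; my doubt is whether training the fresh embeddings from scratch is worth it):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model/tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Register the templates as atomic special tokens so they always encode
    # to a single, context-independent id.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<|prompter|>", "<|assistant|>"]}
    )

    # Grow the embedding matrix to cover the new ids; their embeddings start
    # out untrained and would have to be learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))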