Not able to add data_collator to Trainer

I am trying the example: Google Colab

The only thing I did was add a data_collator:

    from transformers import DataCollatorWithPadding
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=data_collator,
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc=2,
        packing=False,  # Can make training 5x faster for short sequences.
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            max_steps=60,  # Set num_train_epochs = 1 for full training runs
            learning_rate=2e-4,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=1,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            seed=3407,
            output_dir="outputs",
        ),
    )

But when I call trainer.train(), I get the following error:

    ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask.

Hi!

SFTTrainer adds a DataCollatorForLanguageModeling by default:

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

DataCollatorForLanguageModeling pads the batch and adds labels, which the model then uses to compute the loss.
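
You can verify this yourself. Here is a minimal sketch (the "gpt2" checkpoint is just illustrative; any causal-LM tokenizer works the same way):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    features = [tokenizer("Hello world"), tokenizer("A longer example sentence")]
    batch = collator(features)

    print(batch.keys())  # input_ids, attention_mask, labels
    # labels are a copy of input_ids with padded positions set to -100,
    # so the loss ignores the padding.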

DataCollatorWithPadding, on the other hand, only pads the provided inputs to the longest sequence in the batch and does not add labels. That is why the model cannot compute and return a loss.
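
Running the same kind of check on DataCollatorWithPadding (same illustrative tokenizer as above) shows the difference:

    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
    tokenizer.pad_token = tokenizer.eos_token

    collator = DataCollatorWithPadding(tokenizer=tokenizer)
    features = [tokenizer("Hello world"), tokenizer("A longer example sentence")]
    batch = collator(features)

    print(batch.keys())  # input_ids and attention_mask only -- no labels,
                         # which is exactly why trainer.train() raises the ValueError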

If your goal was just to pad the dataset, then I recommend using DataCollatorForLanguageModeling, since it already does the padding along with adding labels. Otherwise, you can write your own custom data collator similar to the one below, where the __call__ method returns the inputs ready for the model's forward pass.

    from typing import List

    class DataCollator:
        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def __call__(self, examples: List[str]):
            # Tokenize and pad to the longest example in the batch.
            inputs = self.tokenizer(examples, padding=True, return_tensors="pt")
            # For causal LM training, labels are a copy of the input ids.
            inputs["labels"] = inputs["input_ids"].clone()
            return inputs
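
You can then pass the custom collator to the trainer in place of DataCollatorWithPadding. A minimal usage sketch (model, tokenizer, train_dataset, and the TrainingArguments all come from your original snippet; training_args is just a name for that same TrainingArguments object):

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=DataCollator(tokenizer),
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        args=training_args,  # the TrainingArguments object from your snippet
    )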