Not able to add data_collator to Trainer

I am trying the example: Google Colab

The only thing I did was add a data_collator:

    from transformers import DataCollatorWithPadding
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=data_collator,
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc=2,
        packing=False,  # Can make training 5x faster for short sequences.
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            max_steps=60,  # Set num_train_epochs = 1 for full training runs
            learning_rate=2e-4,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=1,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            seed=3407,
            output_dir="outputs",
        ),
    )

But when I call trainer.train(), I get the following error:

    ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask.

Hi!

SFTTrainer adds a DataCollatorForLanguageModeling by default:

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

DataCollatorForLanguageModeling pads the batch and adds labels, which the model then uses to compute the loss.
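
You can verify this yourself. Here is a minimal sketch (the "gpt2" checkpoint is just illustrative; any causal-LM tokenizer works the same way):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    features = [tokenizer("Hello world"), tokenizer("A longer example sentence")]
    batch = collator(features)

    print(batch.keys())  # input_ids, attention_mask, labels
    # labels are a copy of input_ids with padded positions set to -100,
    # so the loss ignores the padding.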

DataCollatorWithPadding, on the other hand, only pads the provided inputs to the longest sequence in the batch and does not add labels. That is why the model cannot compute and return a loss.
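
Running the same kind of check on DataCollatorWithPadding (same illustrative tokenizer as above) shows the difference:

    from transformers import AutoTokenizer, DataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
    tokenizer.pad_token = tokenizer.eos_token

    collator = DataCollatorWithPadding(tokenizer=tokenizer)
    features = [tokenizer("Hello world"), tokenizer("A longer example sentence")]
    batch = collator(features)

    print(batch.keys())  # input_ids and attention_mask only -- no labels,
                         # which is exactly why trainer.train() raises the ValueError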

If your goal was just to pad the dataset, then I recommend using DataCollatorForLanguageModeling, since it already does the padding along with adding labels. Otherwise, you can write your own custom data collator similar to the one below, where the __call__ method returns the inputs ready for the model's forward pass.

    from typing import List

    class DataCollator:
        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def __call__(self, examples: List[str]):
            # Tokenize and pad to the longest example in the batch.
            inputs = self.tokenizer(examples, padding=True, return_tensors="pt")
            # For causal LM training, labels are a copy of the input ids.
            inputs["labels"] = inputs["input_ids"].clone()
            return inputs
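
You can then pass the custom collator to the trainer in place of DataCollatorWithPadding. A minimal usage sketch (model, tokenizer, train_dataset, and the TrainingArguments all come from your original snippet; training_args is just a name for that same TrainingArguments object):

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=DataCollator(tokenizer),
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        args=training_args,  # the TrainingArguments object from your snippet
    )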