Using DataCollatorForCompletionOnlyLM requires more memory

I’m following the philschmid blog post "Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora"
to fine-tune Llama 3 8B on completions only, ignoring prompts.
I’m using trl’s `DataCollatorForCompletionOnlyLM`, as explained in the Supervised Fine-tuning Trainer docs.

For some reason, I get CUDA out-of-memory errors at a much smaller batch size than when I ran the same script without the data collator.

Does this have to do with the `packing=False` argument passed to the trainer? Training seems to consume much more memory without packing.
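For context, here is a small self-contained sketch (plain Python, no trl required, with made-up example lengths) of why `packing=False` can inflate memory: with packing, examples are concatenated and chunked into fixed-size blocks, so every batch holds a predictable number of tokens; without packing, each batch is padded to its longest example, so one long sample inflates the whole batch.

```python
def tokens_without_packing(example_lengths, batch_size):
    """Token count when each batch is padded to its longest example
    (what happens with packing=False)."""
    total = 0
    for i in range(0, len(example_lengths), batch_size):
        batch = example_lengths[i:i + batch_size]
        total += max(batch) * len(batch)  # every row padded to the batch max
    return total


def tokens_with_packing(example_lengths, max_seq_length):
    """Token count when examples are concatenated and chunked into
    fixed-size blocks (what happens with packing=True)."""
    n = sum(example_lengths)
    return (n // max_seq_length) * max_seq_length  # trailing remainder dropped


# Hypothetical tokenized prompt+completion lengths
lengths = [128, 2048, 96, 256]

print(tokens_without_packing(lengths, batch_size=2))      # → 4608
print(tokens_with_packing(lengths, max_seq_length=512))   # → 2048
```

In this illustration the unpacked batches carry more than twice the tokens of the packed ones, because the 2048-token outlier forces its whole batch up to that length. That extra activation memory is consistent with hitting OOM at a smaller batch size.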

Thanks in advance for any clarification.