When to use SFTTrainer

In the recent QLoRA blog post, the Colab notebooks use the standard Trainer class; however, SFTTrainer was mentioned briefly at the end of the post. Why wasn’t it used in the Colab notebooks associated with that blog post, and when would you advise using it over Trainer?


I’m not sure when SFTTrainer should be used. My guess is that SFTTrainer makes it easier to fine-tune a pretrained model, compared to the standard Trainer, which is designed for training from scratch and may thus be more complex to use.

An example of using it with QLoRA is given here

I’m also curious: Discord

My guess, though I’m not sure: the SFTTrainer takes in a peft_config, so it probably knows which adapter weights to introduce and sets the whole thing up so that only those get trained, saved, etc.

Otherwise I don’t know why, to be honest. It saves time, I suppose. The price is less control, but managing saving the weights etc. yourself is annoying. Plus, fewer bugs if it’s already been done for you.

The answer here makes the most sense tbh:

linch — Today at 2:02 PM
it inherits from the original transformers.Trainer class, but it also accepts param peft_config to directly initialize the model for PEFT, I’d use it if I wanted to benchmark PEFT and non-PEFT models with a uniform interface. From the class docstring:
Class definition of the Supervised Finetuning Trainer (SFT Trainer).
This class is a wrapper around the transformers.Trainer class and inherits all of its attributes and methods.
The trainer takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.

tldr; same as trainer but accepts a peft config so it can run lora fine-tuning.


@ybelkada The docs tell us that the SFTTrainer is useful for preparing models for supervised fine-tuning, and there are some nice examples with decoder-only models. Does the SFT trainer support encoder-decoder architectures (e.g. FLAN-T5)?

I also had a question about packing. Does your implementation ensure that packed examples are not arbitrarily truncated? For clarity: suppose we want to train T5 with a maximum sequence length of 2048, and we have an example that is about 1800 tokens. Will the packing algorithm select a second example that is < 248 tokens (if one exists), or will it just choose something at random and truncate it? The latter would be a disaster for me, because I work on dialogue, where the most recent information is the most relevant.
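To illustrate the concern, here is a toy sketch of naive packing (this is not TRL’s actual implementation, just the failure mode being asked about): concatenating tokenized examples into one stream and cutting it into fixed-size chunks means an example can be split across a chunk boundary.

```python
# Toy illustration of naive packing, NOT TRL's implementation.
# Examples are concatenated into one token stream, then sliced into
# fixed-length chunks, so an example may be split mid-sequence.

def naive_pack(examples, max_seq_length):
    """Concatenate token lists and slice into fixed-length chunks."""
    stream = [tok for ex in examples for tok in ex]
    return [stream[i:i + max_seq_length]
            for i in range(0, len(stream), max_seq_length)]

# Three "examples" of 1800, 200, and 600 tokens, packed to length 2048:
examples = [[0] * 1800, [1] * 200, [2] * 600]
chunks = naive_pack(examples, 2048)
# The first chunk holds all of example 0, all of example 1, and only the
# first 48 tokens of example 2 -- the rest of example 2 lands in chunk 1,
# which is exactly the truncation-across-boundaries problem for dialogue.
```

A packer that avoids this would instead do first-fit selection: only place a whole example into a chunk if it fits in the remaining budget.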
