When to use SFTTrainer

In the recent QLoRA blog post, the Colab notebooks use the standard Trainer class, but SFTTrainer was mentioned briefly at the end of the post. Why wasn’t it used in the Colab notebooks associated with that blog post, and when would you advise using it over Trainer?


I’m not sure when SFTTrainer should be used. My guess is that SFTTrainer makes it easier to fine-tune a pretrained model, as compared to the standard Trainer, which is designed for training from scratch and may therefore be more complex to use.

An example of using it with QLoRA is given here.
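
For reference, here is a minimal sketch of what that usage could look like. It assumes a TRL version whose SFTTrainer accepts tokenizer, dataset_text_field, max_seq_length, and peft_config directly (the exact arguments have shifted between releases), and the model name and dataset are just placeholders:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

# Load a base model in 4-bit (QLoRA-style); "facebook/opt-350m" is only a placeholder.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# LoRA adapter config; SFTTrainer wraps the base model into a PeftModel itself.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

dataset = load_dataset("imdb", split="train[:1%]")  # any dataset with a text column

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(output_dir="./sft-qlora", per_device_train_batch_size=4),
)
trainer.train()
```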

I’m also curious; there’s a related discussion on Discord.

My guess (though I’m not sure) is that since SFTTrainer takes a peft_config, it probably knows which adapter weights to introduce and sets everything up so that only those get trained, saved, etc.

Otherwise, I don’t know why, to be honest. It saves time, I suppose. The price is less control, but managing which weights get saved and so on is annoying to do yourself, and there are probably fewer bugs if it has already been handled upstream.

The answer here makes the most sense to me:

linch — Today at 2:02 PM
it inherits from the original transformers.Trainer class, but it also accepts a peft_config param to directly initialize the model for PEFT. I’d use it if I wanted to benchmark PEFT and non-PEFT models with a uniform interface. From the class docstring:
Class definition of the Supervised Finetuning Trainer (SFT Trainer).
This class is a wrapper around the transformers.Trainer class and inherits all of its attributes and methods.
The trainer takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.

tl;dr: same as Trainer, but it accepts a PEFT config so it can run LoRA fine-tuning.
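
In other words, passing a peft_config saves you from wrapping the model yourself. A rough sketch (not TRL’s actual internals, and the model name is a placeholder):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# With the plain Trainer, you wrap the model into a PeftModel yourself:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder model
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # only the LoRA weights are trainable
# Trainer(model=peft_model, args=..., train_dataset=...)

# With SFTTrainer, you hand over the base model plus the config, and it does the
# wrapping internally:
# SFTTrainer(model=model, peft_config=peft_config, train_dataset=..., args=...)
```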


@ybelkada The docs tell us that SFTTrainer is useful for supervised fine-tuning of models, and there are some nice examples with decoder-only models. Does the SFT trainer support encoder-decoder architectures (e.g., FLAN-T5)?

I also had a question about packing. Does your implementation ensure that packed examples are not arbitrarily truncated? For clarity: suppose we want to train T5 with a maximum sequence length of 2048, and we have an example that is about 1800 tokens. Will the packing algorithm look for an example shorter than 248 tokens (if one exists) to fill the remaining space, or will it just pick something at random and truncate it? The latter would be a disaster for me, because I work on dialogue, where the most recent information is the most relevant.
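
To make the worry concrete, here is a toy sketch of the naive concat-and-chunk packing I’m hoping is not what happens (a hypothetical helper, not TRL’s actual code):

```python
# Illustration of the concern: naive packing concatenates tokenized examples into
# one long stream and slices it into fixed-length chunks, so an example can end up
# split across two chunks instead of being paired with a shorter example that fits.
def naive_pack(tokenized_examples, seq_length):
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]   # this chunk may cut an example mid-way
            buffer = buffer[seq_length:]

# e.g. an 1800-token dialogue followed by other examples: its tail may land in the
# next 2048-token chunk rather than being matched with a short (<248-token) example.
```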
