Fine tune with SFTTrainer

I noticed that, according to the trainer’s documentation, when fine-tuning the model, I am required to provide a text field (trl/trl/trainer/sft_trainer.py at 18a33ffcd3a576f809b6543a710e989333428bd3 · huggingface/trl · GitHub). However, this does not seem to be a supervised task!

Upon further examination, I observed that the model’s labels are the same as the input_ids, except they are shifted. This leads me to ask how this can be considered supervised learning. In my understanding, the prompt should serve as the input, and the completion should be the label. However, in this case, there are no distinct prompts and completions, only raw text.

Could you clarify what I am missing here?


Hi,

So SFT (supervised fine-tuning) is called supervised because the data is collected from humans. However, we’re still training the model using the same cross-entropy loss as during pre-training (i.e. predicting the next token).

We now just make it more likely that the model will generate a useful completion given an instruction. For an instruction like “what are 10 things to do in London”, the model should learn to generate “in London, you can visit (…)”, for instance.

Since the model is still trained to predict the next token, we just concatenate the instruction and completion into a single “text” column, and the labels can then be created by shifting the inputs one position to the right (as is done during pre-training). One can then decide to only train the model on the completions, rather than the instructions, but the default SFTTrainer of TRL trains the model to predict both instructions and completions.
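For concreteness, here is a minimal sketch of that concatenation step, assuming a dataset with hypothetical "prompt" and "completion" columns (the column names and separator are illustrative, not something prescribed by TRL):

from datasets import Dataset

raw = {
    "prompt": ["What are 10 things to do in London?"],
    "completion": ["In London, you can visit the British Museum, ..."],
}
dataset = Dataset.from_dict(raw)

def merge(example):
    # Concatenate instruction and answer into one string; the model is then
    # trained with plain next-token prediction over the whole sequence.
    return {"text": example["prompt"] + "\n" + example["completion"]}

dataset = dataset.map(merge)
print(dataset[0]["text"])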


I have the same question. Can you show me how to check how the dataset is created after putting the “text” field into the trainer?

Hi, not sure if you have tried or seen this. When I try to do SFT on only the completions using DataCollatorForCompletionOnlyLM, I get NaN in the gradients very quickly. However, when I use the default SFT setup, which trains on the entire input, everything works well. Do you happen to have any ideas why?

My issue is linked here: TRL SFT super prone to nan when using data collator


You could check this by calling trainer.get_train_dataloader() and then inspecting the first batch.
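For example (a quick sketch, assuming trainer and tokenizer are the objects you already created):

# Inspect how the SFTTrainer has prepared the training data.
train_dataloader = trainer.get_train_dataloader()
batch = next(iter(train_dataloader))
print(batch.keys())                    # typically input_ids, attention_mask, labels
print(batch["input_ids"][0])
print(batch["labels"][0])              # same ids, with ignored positions set to -100
print(tokenizer.decode(batch["input_ids"][0]))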

I had the same surprise as ron5569 when I looked at the SFTTrainer code.

One can then decide to only train the model on the completions, rather than the instructions, but the default SFTTrainer of TRL trains the model to predict both instructions and completions.

Questions:

  • Is it fair to say that most people fine-tune on both the instruction and the completion?
  • If so, does that mean fine-tuning on both leads to roughly the same performance as fine-tuning on just the completions?
  • Fine-tuning on the instruction as well as the completion would seem like unnecessary computation, no? Shouldn’t the default be to fine-tune only on the completions?
  • DataCollatorForCompletionOnlyLM works by setting the labels of the instruction tokens to -100. But the forward pass over those tokens still happens; they are just not included in the loss calculation. Again, would that not be wasteful of compute?

Btw, thank you @nielsr for Tutorials/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling! Very nice to see the complete example, but it raises the above questions about why we don’t train only on the completions by default.


Hi!
I’ve been trying to fine-tune a GPT-2-based model using the SFT trainer. However, when I pass my dataset with only one column, named “text”, it raises the error: "you should provide a list of encodings but you have provided none." What could be the problem here?

That looks like an issue with data preparation. Are you using the tokenizer to prepare data for the model?

The dataset contains only non-tokenized data under a column named “text”. When initializing the SFTTrainer, I set the “tokenizer” parameter equal to my tokenizer and dataset_text_field to “text”.

a snippet of my code:


from torch.utils.data import Dataset
from accelerate import Accelerator
from transformers import get_linear_schedule_with_warmup
from trl import SFTConfig, SFTTrainer


class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_text = self.data.dataset.iloc[idx]


        batch = {
            "text" : input_text["text"]
        }

        return batch


current_device = Accelerator().local_process_index

# Define training arguments
training_args = SFTConfig(
    output_dir= datasetPath,
    # overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=8,   
    per_device_eval_batch_size=8,
    save_total_limit=5,
    evaluation_strategy="steps",
    save_strategy = "epoch",
    # save_steps=5000,
    eval_steps=50,
    logging_dir= datasetPath,
    logging_strategy="steps",
    logging_steps=10,
    do_eval=True,
    do_train=True,
    learning_rate=5e-4,
    adam_epsilon=1e-08,
    warmup_steps=100,               
    eval_accumulation_steps=1,
    gradient_checkpointing=False,
    auto_find_batch_size=False,
    gradient_accumulation_steps = 1,
    dataloader_drop_last=True,
    save_safetensors=False,
    dataset_text_field="text",  
)  


# Define optimizer
from torch.optim import AdamW as PyTorchAdamW
params = model.parameters()

# Define the optimizer with specified parameters
optimizer = PyTorchAdamW(
    params,
    lr=5e-4,
    # betas=(0.9, 0.999),  ## the default value
    eps=1e-08,
    # weight_decay=0.1,  ## maybe later
    # correct_bias=True,
)

t_total = (len(train_dataloader) // training_args.gradient_accumulation_steps) * training_args.num_train_epochs


# Create the scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=training_args.warmup_steps,
    num_training_steps=t_total,
)






# Create Trainer instance
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    optimizers=(optimizer, scheduler),
    max_seq_length = 768,
    tokenizer=tokenizer,

)


trainer.train()

the error:
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   3297     # The model's main input name, usually input_ids, has be passed for padding
   3298     if self.model_input_names[0] not in encoded_inputs:
-> 3299         raise ValueError(
   3300             "You should supply an encoding or a list of encodings to this method "
   3301             f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided

Yes the issue here is that you’re not preparing the text in the format that the model expects.

I’d recommend taking a look at the example script here regarding preparing the data in the right format: alignment-handbook/scripts/run_sft.py at main · huggingface/alignment-handbook · GitHub
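In rough terms (a sketch only; the variable names are illustrative and argument names can differ between TRL versions), the idea is to pass a datasets.Dataset with a “text” column rather than a custom torch Dataset that returns raw strings:

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# A datasets.Dataset with a single "text" column containing prompt + completion.
train_dataset = Dataset.from_dict({"text": ["<instruction> ... <completion> ..."]})

training_args = SFTConfig(
    output_dir="out",
    dataset_text_field="text",   # SFTTrainer tokenizes this column itself
)

trainer = SFTTrainer(
    model=model,          # assumed to be loaded already
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # assumed to be loaded already
)
trainer.train()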

Thanks, I’ll check that out.
Also, what would be the correct dataset format if I wanted to pass a tokenized dataset? Should “input_ids” be equal to tokenized(prompt + completion)? Should an explicit “labels” be defined in the dataset, or would just the input_ids and the corresponding attention_mask be sufficient?

Yes, explicit labels should be passed to the model, as otherwise no loss is calculated.

The SFTTrainer will do this for you, however, as it uses DataCollatorForLanguageModeling as the data collator by default.

So should the labels be exactly the same as the input_ids?
Is this the format you have in mind?

text = prompt + completion
encoding = tokenizer(text, max_length=max_length)  # plus whatever other tokenizer kwargs
example = {
    "input_ids": encoding["input_ids"],
    "attention_mask": encoding["attention_mask"],
    "labels": encoding["input_ids"],
}

Also, I once tried training the model with only input_ids and attention_mask for 2 epochs, and the loss was calculated and logged. Is this because of the data collator you mentioned? If so, does it mean it makes no difference whether labels are defined in the dataset we pass to the SFTTrainer or not?

Yes typically the labels are the same as the input_ids, with padding tokens (or other special tokens to be ignored) replaced by -100 (as -100 is the ignore index of the CrossEntropyLoss in PyTorch).

If you use a data collator like DataCollatorForLanguageModeling, this automatically happens for you as can be seen here. The SFTTrainer class uses this data collator by default as seen here, which explains why it works for you even though you did not specify labels.
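To make the default behaviour concrete, here is a small sketch of what that collator does (using GPT-2’s tokenizer purely as an example; GPT-2 has no pad token, so one is assigned first):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
features = [tokenizer("a short example"), tokenizer("a somewhat longer example sentence")]
batch = collator(features)

print(batch["input_ids"])
print(batch["labels"])   # same as input_ids, but padding positions are replaced by -100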

Dear @nielsr,

Thank you for your all responses. All your answers were very informative. I just want to confirm a few more points:

Training a GPT-2 Model with SFTTrainer
As I understand from this topic, I can train a GPT-2 model with SFTTrainer on a single text string, where the input and label are the same. I can tokenize the text and feed it directly to the model without needing any specific data collator or other preparation steps.

Handling Padding Token IDs
Additionally, if the padding token ID in the tokenizer is not -100, I can ignore those tokens using the attention mask.

Using the Trained Model
Finally, the resulting trained model can be used in the PPO trainer. If I am correct, everything in SFTTrainer is the same as a plain Trainer class, right?

Please let me know if I have understood everything correctly.

Thank you.

The SFTTrainer is optimized for the task of SFT (supervised fine-tuning) of large language models, whereas the Trainer class is a general training class which can be used for LLMs but also for models like BERT, the Vision Transformer (ViT), audio models, etc.

Yes that works, although the labels are the same as the inputs (input_ids) with padding tokens or other tokens you want the model to ignore replaced by -100. For instance, in supervised fine-tuning of LLMs, we typically also mask (ignore) the prompt tokens so that the model only needs to learn the completions. The TRL library by default doesn’t do this, but you can use the DataCollatorForCompletionOnlyLM class for this purpose: Supervised Fine-tuning Trainer.
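For reference, a minimal sketch of hooking that collator up (the response template string is illustrative and has to match whatever prompt format your “text” column uses; note that packing must stay disabled when using this collator):

from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

# The collator searches for the response template in each tokenized example and
# sets the labels of everything before it to -100, so only completion tokens
# contribute to the loss.
response_template = "### Answer:"   # illustrative, must match your prompt format
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,                     # assumed to be loaded already
    args=SFTConfig(output_dir="out", dataset_text_field="text", packing=False),
    train_dataset=train_dataset,     # "text" column containing prompt + completion
    data_collator=collator,
    tokenizer=tokenizer,
)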

Handling Padding Token IDs
Additionally, if the padding token ID in the tokenizer is not -100, I can ignore those tokens using the attention mask.

So setting labels to -100 means those positions will be ignored by the loss calculation, whereas the attention_mask marks tokens that should not be involved in the attention computation inside the model. Typically, padding tokens are both ignored in the loss calculation and excluded from the attention computation (but these are two different things).
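A tiny illustration of the difference (hypothetical token ids, sequence padded to length 5):

import torch

pad_id = 0                                               # hypothetical pad token id
input_ids      = torch.tensor([12, 34, 56, pad_id, pad_id])
attention_mask = torch.tensor([1, 1, 1, 0, 0])           # pads excluded from attention
labels         = torch.tensor([12, 34, 56, -100, -100])  # pads excluded from the loss

# attention_mask controls what the model attends to in the forward pass;
# -100 in labels controls which positions the cross-entropy loss ignores.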

Using the Trained Model
Finally, the resulting trained model can be used in the PPO trainer. If I am correct, everything in SFTTrainer is the same as a plain Trainer class, right?

The PPO Trainer is another subclass of the Trainer class which is specifically optimized for fine-tuning LLMs using reinforcement learning (using the PPO algorithm). One typically first performs supervised fine-tuning (SFT) of an LLM, followed by fine-tuning on human preferences (using PPO, though more recent alternatives include DPO and KTO). After that, your model is ready to be used for inference.


Please, I would like to be clear about something. I am training a SmolLM model on grammatical error correction, and my output is the entire prompt I passed in to train the model. This includes the instruction and the corrected sentence. To make matters worse, the responses are repeated. I am fairly certain this is because there is no end token. I only want the output to be the corrected sentence.

Would you recommend I keep a single string input containing the instruction and the expected output, and mask the prompt in some way with the attention mask?

Or should I just create labels to be passed to the model containing the tokenized output?

What advice do you have for the EOS token?