What's happening in the SFT trainer?

Hello,

I am new to the Hugging Face libraries and I stumbled upon the SFTTrainer for fine-tuning, which seems really great but a bit opaque about what it is actually doing. I checked the docs but I still don't get what is happening.

So let's say I have a dataset `data` with features `prompt`, `answer`, and `text`, where `text` is just a combination of `prompt` and `answer` in a nice format. I want the model to train on generating these texts so that it knows what to say when it receives prompts similar to those in the dataset.

If I were to use the SFTTrainer, I would pass `train_dataset=data` and `dataset_text_field='text'` as arguments, but why? Does that indicate that, given the prompt, it needs to generate the answer in the `text` format?
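
For context, this is roughly the call I have in mind (the model name and dataset contents are just placeholders, and depending on the TRL version `dataset_text_field` may instead go into an `SFTConfig`):

```python
from datasets import Dataset
from trl import SFTTrainer

# Toy dataset in the shape described above (contents are placeholders)
data = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "text": ["### Prompt:\nWhat is the capital of France?\n\n### Answer:\nThe capital of France is Paris."],
})

trainer = SFTTrainer(
    model="gpt2",                # placeholder model name
    train_dataset=data,
    dataset_text_field="text",   # column holding the full prompt+answer string
)
trainer.train()
```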

Hello Adl8,

To make it simple: when training an LLM, you feed it the complete text built by concatenating the prompt and the answer into a single string. However, be sure to concatenate them in a consistent format, which is often defined in `tokenizer.chat_template`. If your model does not have one, you can define your own prompting strategy as you wish.
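
For example, here is a sketch of building such a text with a chat template (the model name is just an example; the exact rendered string depends on the model's template):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation into the single training string the model expects
text = tok.apply_chat_template(messages, tokenize=False)
print(text)  # roughly "<s>[INST] What is the capital of France? [/INST] The capital of France is Paris.</s>"
```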

Inside the SFTTrainer, you define a model, a tokenizer, training arguments, a dataset, and the column to use as input. The SFTTrainer will:

  • Use the arguments to define a training procedure (epochs, steps, logging, saving, …)
  • Process each batch using your tokenizer and, optionally, a formatting function
  • Use this processed input to compute the logits and the loss
  • Finally, optimize the model

This is the very big picture, but in short, this class lets you bundle everything into a single object so that you can just run `trainer.train()`, which is far more convenient than a training loop built with your own dirty hands.
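
To make the last two bullets concrete, here is a rough sketch of a single optimization step (a simplification, not the actual `Trainer` internals):

```python
def train_step(model, batch, optimizer):
    # For causal LM fine-tuning, the labels are the input ids themselves;
    # the model shifts them internally so position t predicts token t+1.
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],
    )
    outputs.loss.backward()   # backpropagate the causal LM loss
    optimizer.step()          # update the weights
    optimizer.zero_grad()
    return outputs.loss.item()
```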

Hope this helps

But I don't understand what the labels are. Does the model train using a sliding context window to generate only the answer, the whole text, or neither of them?

The labels are computed directly within the SFTTrainer. The model takes the input ids and shifts them by one position, so that the input at time t is used to predict the token at time t+1.

There is no sliding window; it is just a shift of the same sequence.
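
In code, that shift inside the loss computation looks roughly like this (a toy sketch with a made-up vocabulary size):

```python
import torch
import torch.nn.functional as F

input_ids = torch.tensor([[5, 12, 7, 9]])  # one toy sequence
logits = torch.randn(1, 4, 100)            # (batch, seq_len, vocab_size), as if from the model

# Logits at position t are scored against the token at position t+1
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, 100), shift_labels.reshape(-1))
```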

OK, I get it, so the model trains on generating the whole text, that is, prompt + answer. But shouldn't it train on generating only the answer, given the prompt?


Hey! Late answer.

When using a chat template, the instruction is wrapped in special tokens (e.g. `[INST] … [/INST]`) that mark where the prompt ends and the answer begins. These markers make it possible to mask the prompt tokens out of the labels (they are set to -100 and ignored by the loss), so the model does indeed learn to generate only the answer and not the prompt itself.
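
In TRL, this masking is done with a completion-only collator; here is a sketch using `DataCollatorForCompletionOnlyLM` (the model name and response template are examples and depend on your chat format):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Everything up to and including the response template gets label -100,
# so the loss is computed on the answer tokens only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="[/INST]",  # marker where the answer starts; template-specific
    tokenizer=tok,
)
# Pass data_collator=collator to the SFTTrainer to enable it.
```

The prompt still conditions the answer through attention; it just no longer contributes to the loss.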