I am new to the Hugging Face ecosystem and stumbled upon the SFTTrainer for fine-tuning, which seems really great but a bit obscure as to what it is actually doing. I checked the docs but I still don't get what is happening.
So let's say I have a dataset ‘data’ with the features ‘prompt’, ‘answer’ and ‘text’, where ‘text’ is just a combination of ‘prompt’ and ‘answer’ in a nice format. I want the model to train on generating these texts so that it knows what to say when it receives prompts similar to those in the dataset.
If I were to use the SFTTrainer, I would pass train_dataset=data and dataset_text_field=‘text’ as arguments, but why? Does that indicate that, given the prompt, it needs to generate the answer in the ‘text’ format?
To keep it simple: when training an LLM, you directly feed it the complete text built by concatenating the prompt and the answer into a single string. However, be sure to concatenate them in a proper format, which is often defined in tokenizer.chat_template. If your model does not have one, then you can define your prompting strategy as you wish.
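For instance, this is roughly how you could build the ‘text’ column with a chat template (a minimal sketch: the model name is just an example, and the column names are the ones from your dataset):

```python
from transformers import AutoTokenizer

# Example model; any model whose tokenizer ships a chat_template works the same way
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def build_text(example):
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    # apply_chat_template inserts the special tokens (e.g. [INST] ... [/INST]) for you
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

data = data.map(build_text)  # 'data' is the dataset from the question
```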
Inside the SFTTrainer, you define a model, a tokenizer, training arguments, a dataset and the column to use as input. The SFTTrainer will:
- Use the arguments to define a training procedure (epochs, steps, logging, saving, ...)
- Process each batch with your tokenizer and possibly a formatting function (optional)
- Use this processed input to compute the logits and the loss
- Finally, optimize the model
This is a very big picture, but to make it short, this class lets you set everything up in a single object and eventually just run "trainer.train()", which is far more convenient than a training loop built with your own dirty hands.
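As a rough sketch (the exact argument names depend on your TRL version, so double-check the docs; the model and hyperparameters are placeholders), the whole thing looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token              # gpt2 has no pad token

training_args = TrainingArguments(
    output_dir="sft-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=data,            # your dataset
    dataset_text_field="text",     # the column the trainer tokenizes
)

trainer.train()
```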
But I don't understand what the labels are. Does the model train with a sliding context window to generate only the answer, the whole text, or neither of them?
The labels are computed directly within the SFTTrainer. The model takes the inputs and shifts them one to the right, so that the input at time t is used to predict the token at time t+1.
There is no sliding window; it is just a shift of the values.
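In other words (just an illustration with toy token ids of what happens inside the model's loss computation):

```python
import torch

input_ids = torch.tensor([[101, 7, 42, 13, 5]])  # toy token ids for one sequence
labels = input_ids.clone()                       # the labels are simply a copy of the inputs

# Inside the loss, everything is shifted by one position:
context = input_ids[:, :-1]   # what the model sees at step t
targets = labels[:, 1:]       # what it must predict, i.e. the token at step t+1
```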
Ok, I get it, so the model trains on generating the whole text, that is prompt+answer, but shouldn't it train on generating only the answer, given the prompt?
When you use a chat template, the instruction is delimited by special tokens (e.g. [INST]), which makes it possible to mask the prompt tokens accordingly. Hence, the model does indeed learn to generate only the answer and not the prompt itself.
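In TRL this prompt masking is typically done with the completion-only collator; the response_template below is an assumption and has to match whatever marks the start of the answer in your chat template:

```python
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# "[/INST]" is an assumption: use the string that precedes the answer in YOUR template
collator = DataCollatorForCompletionOnlyLM(response_template="[/INST]", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    dataset_text_field="text",
    data_collator=collator,   # prompt tokens get label -100, so only the answer contributes to the loss
)
```

Note that packing has to stay disabled when you use this collator.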
Because I am running into the following error when I do this:
result = trainer.evaluate(dataset_test_final)
result = trainer.evaluate(test_dataset)
Also, as per the Hugging Face docs, they mention that we don't need to explicitly encode the columns and that the SFTTrainer will handle it. Please help here, thanks.
ValueError: You should supply an encoding or a list of encodings to this method that includes
input_ids, but you provided ['output', 'input', 'instruction', 'text']
What's the text? Is it the concatenation of input and output? What's instruction?
Don't use packing if your model is an instruct one, because with a short sequence length like yours (2048), packing will truncate too much text.
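Something like this (argument names vary across TRL versions, so treat it as a sketch):

```python
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,   # keep packing off here so examples are not merged and truncated
)
```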
Indeed, the error is saying that the input should be something with input_ids, i.e. a tokenized text. The problem is that since you already defined an eval dataset inside the trainer, you don't need to pass it again when calling evaluate. If you do, you overwrite the eval dataset (which has been tokenized) with a raw text dataset made of strings.
Hence, just call trainer.evaluate() and everything will be fine for the evaluation part.
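So, concretely (reusing the two calls from your post):

```python
# Wrong: this replaces the tokenized eval_dataset with raw strings
result = trainer.evaluate(dataset_test_final)

# Right: reuse the eval_dataset you already passed to the SFTTrainer
result = trainer.evaluate()
```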
Just do some extra checks on how you are tokenizing and packing the data, as it has a big impact on how the model learns.
But the issue is that for other metrics like recall, precision, etc., I tried to use compute_metrics=compute_metrics, which is not working with the SFTTrainer.
If I call evaluate() with custom metrics, the following issue arises:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-3de37008debf> in <cell line: 0>()
1 # result = trainer.evaluate(dataset_test_final)
----> 2 result = trainer.evaluate()
5 frames
/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
127 return func(data, *args, **kwargs)
128 elif error_on_other_type:
--> 129 raise TypeError(
130 f"Unsupported types ({type(data)}) passed to `{func.__name__}`. Only nested list/tuple/dicts of "
131 f"objects that are valid for `{test_type.__name__}` should be passed."
TypeError: Unsupported types (<class 'unsloth.models._utils.EmptyLogits'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
Just do some extra checks on how you are tokenizing and packing the data, as it has a big impact on how the model learns
Tokenization is taken care of by the SFTTrainer itself, correct? Also, should packing be true or false? I am using a base model, not an instruct model. Please help me understand your statement, thanks.
I don't understand your function, because it looks like your labels and predictions come from the same tensor, which is weird. Maybe this can help you:
Add some intermediate steps to check what the input of the function actually is, because you'll need both the labels and the predictions to compute your metrics correctly.
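As an illustration only (the metric choice and the preprocessing below are assumptions, adapt them to your task): compute_metrics receives an EvalPrediction whose .predictions holds the logits and .label_ids holds the labels, with -100 at masked positions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)               # (batch, seq_len)

    # shift so that position t is compared with the token it is supposed to predict
    preds, labels = preds[:, :-1].reshape(-1), labels[:, 1:].reshape(-1)

    mask = labels != -100                            # drop padded / masked positions
    preds, labels = preds[mask], labels[mask]

    return {
        "precision": precision_score(labels, preds, average="micro", zero_division=0),
        "recall": recall_score(labels, preds, average="micro", zero_division=0),
    }
```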
For the packing, if you don't use an instruct model then it's fine. However, I don't understand why you are feeding instructions to a base model, as most of the time such training requires millions of examples.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 2
1 # result = trainer.evaluate(dataset_test_final)
----> 2 result = trainer.evaluate()
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py:4050, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
4047 start_time = time.time()
4049 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 4050 output = eval_loop(
4051 eval_dataloader,
4052 description="Evaluation",
4053 # No point gathering the predictions if there are no metrics, otherwise we defer to
4054 # self.args.prediction_loss_only
4055 prediction_loss_only=True if self.compute_metrics is None else None,
4056 ignore_keys=ignore_keys,
4057 metric_key_prefix=metric_key_prefix,
4058 )
4060 total_batch_size = self.args.eval_batch_size * self.args.world_size
4061 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py:4266, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
4264 labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
4265 if logits is not None:
-> 4266 logits = self.accelerator.pad_across_processes(logits, dim=1, pad_index=-100)
...
131 f"objects that are valid for `{test_type.__name__}` should be passed."
132 )
133 return data
TypeError: Unsupported types (<class 'unsloth.models._utils.EmptyLogits'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.