What's happening in the SFT trainer?

Hello,

I am new to the Hugging Face libraries and I stumbled upon the SFTTrainer for fine-tuning, which seems really great but a bit obscure about what it is actually doing. I checked the docs but I still don't get what is happening.

So let's say I have a dataset 'data' with features 'prompt', 'answer', and 'text', where 'text' is just a combination of 'prompt' and 'answer' in a nice format. I want the model to train on generating these texts so that it knows what to say when it receives prompts similar to those in the dataset.

If I were to use the SFTTrainer, I would pass train_dataset=data and dataset_text_field='text' as arguments, but why? Does it indicate that, given the prompt, the model needs to generate the answer in the 'text' format?

Hello Adl8,

To make it simple, when training an LLM you feed it the complete text built by concatenating the prompt and the answer into a single string. However, be sure to concatenate them into a nice format, which is often defined in tokenizer.chat_template. If your model does not have one, then you can define your prompting strategy as you wish.
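For instance, with a model whose tokenizer ships a chat_template, the concatenation can be built like this (a minimal sketch; the model name is just an example of a non-gated instruct model, and the messages are purely illustrative):

from transformers import AutoTokenizer

# Any instruct model that defines a chat_template works here; this one is only an example
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Returns the prompt and the answer concatenated in the format the model expects
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)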

Inside the SFTTrainer, you define a model, a tokenizer, training arguments, a dataset, and the column to use as input. The SFTTrainer will:

  • Use the arguments to define a training procedure (epochs, steps, logging, saving, …)
  • Process each batch using your tokenizer and, optionally, a formatting function
  • Use the processed inputs to compute the logits and the loss
  • Finally, optimize the model

This is a very big picture, but to make it short, this class lets you configure everything in a single object and eventually run "trainer.train()", which is far more convenient than a training loop built with your own dirty hands.
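For reference, here is a minimal sketch of that with the TRL API used elsewhere in this thread (model, tokenizer, and the dataset data with its 'text' column are assumed to be defined already; note that in recent TRL versions some of these options moved to SFTConfig):

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(output_dir="sft-output", num_train_epochs=1)

trainer = SFTTrainer(
    model=model,                 # assumed to be loaded already
    tokenizer=tokenizer,         # assumed to be loaded already
    train_dataset=data,          # dataset with a 'text' column
    dataset_text_field="text",   # column used as the training text
    args=training_args,
)
trainer.train()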

Hope this helps

But I don't understand what the labels are. Does the model train with a sliding context window to generate only the answer, the whole text, or neither of them?

The labels are computed directly within the SFTTrainer. The model takes the inputs and shifts them one to the right, so that the input at time t is used to predict the output at time t+1.

There is no sliding window; it is just shifting values.
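To make the shifting concrete, here is an illustrative sketch (not the exact TRL internals): for causal language modeling the labels are simply a copy of the input ids, and the model shifts them internally when computing the loss.

import torch

input_ids = torch.tensor([[101, 7592, 2088, 102]])  # hypothetical token ids
labels = input_ids.clone()                          # labels are the same sequence

# Inside the model, logits at position t are compared with labels at position t+1:
# shift_logits = logits[:, :-1, :]
# shift_labels = labels[:, 1:]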

Ok, I get it, so the model trains on generating the whole text, that is prompt+answer. But shouldn't it train on generating only the answer, given the prompt?


Hey! Late answer.

When using a chat template, the instruction is wrapped in special tokens (e.g. [INST]) that let the trainer tell prompt tokens from answer tokens and mask the prompt out of the loss. Hence, the model does indeed learn to generate only the answer and not the prompt itself.
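In TRL, that prompt masking is typically done with a completion-only data collator. A hedged sketch, reusing the "### Response:" marker from the alpaca-style prompt that appears later in this thread (model, tokenizer, train_dataset, and training_args are assumed to be defined; packing is turned off because this collator does not support it):

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Everything before the response marker is masked out of the loss
collator = DataCollatorForCompletionOnlyLM(response_template="### Response:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,
    args=training_args,
)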


A follow-up question with an example:

I have a dataset like this after formatting it:

train_dataset, val_dataset, test_dataset

(Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 41925
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 4659
 }),
 Dataset({
     features: ['output', 'input', 'instruction', 'text'],
     num_rows: 5176
 }))

Before using this dataset with the SFTTrainer, do I need to drop the other columns 'output', 'input', and 'instruction'?

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        # ignored_columns=ignored_columns,
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=True,
        # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
        args=training_args
    )

because I am running into the following error when I do this:

result = trainer.evaluate(dataset_test_final)

result = trainer.evaluate(test_dataset)

Also, as per the Hugging Face docs, we do not need to explicitly encode the columns; the SFTTrainer will handle it. Please help here, thanks.

ValueError: You should supply an encoding or a list of encodings to this method that includes
input_ids, but you provided ['output', 'input', 'instruction', 'text']

Hey !

First, I see two things that seem risky:

  • What is 'text'? Is it the concatenation of 'input' and 'output'? What is 'instruction'?
  • Don't use packing if your model is an instruct one, because with a short sequence length like yours (2048) it will truncate too much text when packing is done.

Indeed, the error is stating that the input should be something with input_ids, meaning a tokenized text. The problem is that since you already defined an eval dataset within the trainer, you don't need to pass one again when calling evaluate. If you do so, you replace the eval dataset (which has been tokenized) with a raw text dataset made of strings.

Hence, just call trainer.evaluate() and everything will be fine for the evaluation part.
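Concretely, that is just:

# Reuses the eval_dataset that the trainer has already processed/tokenized
metrics = trainer.evaluate()
print(metrics)

If you also want metrics on the raw test split, it has to go through the same processing first (for example by building a trainer with eval_dataset=test_dataset); passing the raw dataset directly is what triggers the input_ids error above.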

Just do some extra checks on how you are tokenizing and packing the data, as it strongly impacts how the model learns.

Hope this helps !

  1. 'text' is the concatenation of instruction, input, and output for instruct tuning:
from datasets import load_dataset, DatasetDict

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return {"text": [alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
                      for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"])]}

# Load the dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["instruction", "input", "output"])

# Split the dataset into train, validation, and test sets
train_valid_test_split = dataset.train_test_split(test_size=0.1, seed=42)
train_valid_dataset = train_valid_test_split['train']
test_dataset = train_valid_test_split['test']

train_valid_split = train_valid_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_valid_split['train']
val_dataset = train_valid_split['test']
  2. I am using the base model to train with the prompt text that I prepared above.
    So I think I can set packing=True?
from unsloth import FastLanguageModel
import torch
max_seq_length = 10 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B", # or choose "unsloth/Llama-3.2-1B"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
  3. Regarding "The problem is that since you already defined an eval dataset within the trainer, you don't need to pass one again when calling evaluate":

Yes, this is correct. I am just using it like this and I can see a few metrics, but not all of them.

  4. Finally, this is my SFTTrainer code:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # dataset_text_field="text",
    # ignored_columns=ignored_columns,
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
    args=training_args,
    # compute_metrics=compute_metrics
)

But the issue is that for other metrics like recall, precision, etc., I tried to use compute_metrics=compute_metrics, which is not working with the SFTTrainer.

If I call evaluate() with custom metrics, the following issue arises:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-3de37008debf> in <cell line: 0>()
      1 # result = trainer.evaluate(dataset_test_final)
----> 2 result = trainer.evaluate()

5 frames
/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py in recursively_apply(func, data, test_type, error_on_other_type, *args, **kwargs)
    127         return func(data, *args, **kwargs)
    128     elif error_on_other_type:
--> 129         raise TypeError(
    130             f"Unsupported types ({type(data)}) passed to `{func.__name__}`. Only nested list/tuple/dicts of "
    131             f"objects that are valid for `{test_type.__name__}` should be passed."

TypeError: Unsupported types (<class 'unsloth.models._utils.EmptyLogits'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.

Hello

Can you please share your compute_metrics function?

Thanks!


Here it is:

import numpy as np
import evaluate

# Load the desired metrics
accuracy_metric = evaluate.load("accuracy")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    # Compute individual metrics
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    
    # Combine metrics into a single dictionary
    metrics = {
        'accuracy': accuracy['accuracy'],
        'precision': precision['precision'],
        'recall': recall['recall'],
        'f1': f1['f1'],
    }
    
    return metrics

Please, can you explain this line?

Just do some extra checks on how you are tokenizing and packing the data, as it strongly impacts how the model learns

Tokenization in the SFTTrainer is taken care of by the library itself, correct? Also, should packing be true or false? I am using a base model, not an instruct model. Please help me understand your statement, thanks.


I don't understand the function, because it seems like your labels and predictions come from the same tensor, which is weird. Maybe this can help you:

import torch
from torch.nn import CrossEntropyLoss
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    # num_labels and lang are assumed to be defined globally elsewhere
    global num_labels
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Classification-style metrics computed from the predicted class ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    # Recompute the cross-entropy loss from the raw logits
    loss_fct = CrossEntropyLoss()
    logits = torch.tensor(pred.predictions)
    labels = torch.tensor(labels)
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    return {
        'accuracy@' + lang: acc,
        'f1@' + lang: f1,
        'precision@' + lang: precision,
        'recall@' + lang: recall,
        'loss@' + lang: loss,
    }

Do some intermediate checks to be sure what the input of the function is, because you'll need both labels and predictions to compute your metrics correctly.
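As a concrete intermediate check, a throwaway version of compute_metrics (an illustrative sketch, not a final metric function) can just print what the trainer passes in:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # For a causal LM, logits are usually (batch, seq_len, vocab_size) and labels (batch, seq_len)
    print(type(logits), getattr(logits, "shape", None))
    print(type(labels), getattr(labels, "shape", None))
    # Positions where the label is -100 are padding/prompt and must be filtered out
    # before computing token-level accuracy, precision, recall, etc.
    return {}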

For packing, if you don't use an instruct model then it's fine. However, I don't understand why you feed instructions to a base model, as that kind of training usually requires millions of examples.

Hope this helps !


Same error when I use the SFTTrainer's evaluate():

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 2
      1 # result = trainer.evaluate(dataset_test_final)
----> 2 result = trainer.evaluate()

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py:4050, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   4047 start_time = time.time()
   4049 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 4050 output = eval_loop(
   4051     eval_dataloader,
   4052     description="Evaluation",
   4053     # No point gathering the predictions if there are no metrics, otherwise we defer to
   4054     # self.args.prediction_loss_only
   4055     prediction_loss_only=True if self.compute_metrics is None else None,
   4056     ignore_keys=ignore_keys,
   4057     metric_key_prefix=metric_key_prefix,
   4058 )
   4060 total_batch_size = self.args.eval_batch_size * self.args.world_size
   4061 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py:4266, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   4264     labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
   4265 if logits is not None:
-> 4266     logits = self.accelerator.pad_across_processes(logits, dim=1, pad_index=-100)
...
    131         f"objects that are valid for `{test_type.__name__}` should be passed."
    132     )
    133 return data

TypeError: Unsupported types (<class 'unsloth.models._utils.EmptyLogits'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
