Bad Performance Finetuning Llama Chat and Instruct Models on GSM8K

Hi all!

I’m having an extremely difficult time finetuning Llama-2-7B-Chat and Llama-3.1-8B-Instruct models on the GSM8K dataset. I have spent a week optimizing hyperparameters but never seem to obtain satisfactory results when evaluating on GSM8K using lm-eval-harness.

I have trained LLMs before, so this is really frustrating at this point. I am wondering whether there is an error in my finetuning script and/or in the way I evaluate and process the results.

What I want to do:
Specifically, I want to finetune using PEFT/LoRA without the overhead of Unsloth and other finetuning frameworks. The goal is to investigate a research question targeting Chat/Instruct models.

How I do it:
Here is my finetuning script for Llama-2-7B-Chat-HF:

import functools
from typing import Dict, Any

import yaml
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    set_seed
)
from trl import SFTTrainer, SFTConfig

def chat_format(example: Dict[str, Any], tokenizer) -> Dict[str, str]:
    """Format the example to include the question and answer in the text field."""
    prompt = example['question']
    answer = example['answer']
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {'role': 'user', 'content': prompt},
        {'role': 'assistant', 'content': answer}
    ]
    text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt = False)
    return {'text': text}

def finetune() -> None:
    """Fine-tune the model based on the config."""

    # Seed
    seed = 42
    set_seed(seed)

    # Model and tokenizer
    model_name = "meta-llama/Llama-2-7b-chat-hf"
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    # Handle pad token
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
        tokenizer.pad_token = tokenizer.eos_token

    # Load and preprocess dataset
    dataset = load_dataset("openai/gsm8k", "main")
    chat_dataset = dataset.map(functools.partial(chat_format, tokenizer=tokenizer), remove_columns=dataset['train'].column_names)

    # LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=16,
        lora_dropout=0.01,
        bias="none"
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Huggingface
    hf_output_dir = "./results/llama-2-7b-chat-hf"
    push_to_hub = hf_output_dir is not None

    # SFTConfig
    training_args = SFTConfig(
        output_dir=hf_output_dir if push_to_hub else "./results",
        max_seq_length=1024,
        dataset_text_field='text',
        auto_find_batch_size=False,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        logging_steps=1,
        save_steps=100,
        eval_strategy="steps",
        eval_steps=50,
        bf16=True,
        lr_scheduler_type="cosine",
        learning_rate=5e-5,
        warmup_ratio=0.1,
        weight_decay=0.0,
        push_to_hub=push_to_hub,
        load_best_model_at_end=False,
        ddp_find_unused_parameters=False
    )

    # SFTTrainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=chat_dataset['train'],
        eval_dataset=chat_dataset['test'],
        tokenizer=tokenizer,
        args=training_args
    )

    trainer.train()

    # Evaluation
    print(trainer.evaluate())

    # Save locally
    save_path = "./results/llama-2-7b-chat-hf"
    trainer.save_model(save_path)
    tokenizer.save_pretrained(save_path)


if __name__ == "__main__":
    finetune()

I have found the current hyperparams (LR, epochs, etc.) to be the sweet spot.

How I finetune:
I finetune using FSDP as follows:

accelerate launch --config_file "configs/fsdp.yaml" --num_processes=8 src/finetune_basic.py --config configs/llama-2-7B-chat-hf-gsm8k.yaml

How I evaluate:
I evaluate using lm-eval-harness (0-shot, gsm8k task)

accelerate launch --num_processes=8 -m lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf,peft=ketchup123/llama-2-7b-chat-hf --tasks gsm8k --num_fewshot 0 --batch_size 16 --apply_chat_template

For Llama-3.1-8B-Instruct, I have used the gsm8k_cot_llama task from lm-eval-harness, as suggested.
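As a side note, a quick way to double-check that the adapter is actually picked up before running the full harness is to load the base model plus the LoRA adapter and generate on a single question. This is just a minimal sketch (the adapter path and the example question are placeholders), not my exact evaluation code:

# Minimal sanity check: load the base model plus the LoRA adapter and generate once.
# The adapter path and the example question below are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-2-7b-chat-hf"
adapter_path = "./results/llama-2-7b-chat-hf"  # or the Hub repo id passed to lm-eval

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)

chat = [{"role": "user", "content": "A baker sells 12 muffins per tray and bakes 7 trays. How many muffins is that?"}]
input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))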

My problem:
I do not encounter a specific error per se, but compared to the non-finetuned Llama-2-7B-Chat-hf, I only get a 2-3 percentage point increase, from 22% to 25% exact match on GSM8K. I have seen other works achieve much more (e.g. the neuralmagic model even gets 37%). I would expect better performance after finetuning, especially for the larger 7B and 8B models.

My questions:

  1. Is my chat templating ok?
  2. Are there any severely wrong settings?
  3. Is evaluation using lm-eval-harness the right way to go?

I have exhausted hyperparameter tuning (1-7 epochs, LR from 1e-4 to 1e-5, several batch sizes and gradient accumulations, etc.). I am finetuning on 8xA100 (80GB) GPUs.

I would appreciate any help and suggestions! :slight_smile: Currently, I am exhausted and cannot seem to find the error, if any.

Thank you!


Man, this reminds me of when Llama 3 came out and I finetuned it 8 times, thinking I had messed up :confused:

The good news is that you have already nailed your issue: you have to adjust the embeddings after adding the pad token, plus a few other little things…

Also, I’d recommend always printing a sample so you can actually see what goes into the model and what is in the model config. Anyway, here is a Colab that will hopefully get you where you want: Google Colab
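Roughly what I mean, as a minimal sketch (not the exact code from the Colab), shown on a toy example:

# Sketch: if you add a brand-new pad token, resize the embeddings so the model knows about it,
# then print one formatted sample to see exactly what goes into the model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    # Option A: reuse the EOS token as padding (no new embeddings needed)
    # tokenizer.pad_token = tokenizer.eos_token
    # Option B: add a dedicated pad token AND resize the embedding matrix to match
    tokenizer.add_special_tokens({"pad_token": "<pad>"})
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id

# Always print a sample of what the chat template actually produces
chat = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4. #### 4"},
]
text = tokenizer.apply_chat_template(chat, tokenize=False)
print(text)
print(tokenizer(text).input_ids[:20])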

P.S. Why not use Unsloth? (I’m trying to implement the SimPO experiment and was considering that as one option.)


Thank you very much for the Colab link, my friend @arthrod !

However, after trying out the different tokenizations and finetuning more than 30 times with different configurations, hyperparameters and tokenizer settings, I have in the end opted for Unsloth. It works out of the box and gives me good results, i.e. I was able to replicate other papers. This is quite unfortunate, because Unsloth currently only supports a single GPU (no FSDP/DDP).

For the curious, I have also come to the following conclusion:

  • the problem must lie in the way tokenization is handled during training
  • I have also tried plugging in the pre-processed dataset from Unsloth (chat template already applied, etc.) and still got the same sub-par results. Vice versa, I have also plugged the dataset pre-processed with my chat templating into Unsloth and got much better results. Thus, I can safely rule out my chat templating; the problem must be in either the model or the tokenizer.
  • Since the model is loaded similarly, Unsloth only performs hardware-efficiency optimizations, and the Unsloth training pipeline is basically the standard HF one, I am very sure that the tokenizer is the problem (see the sketch after this list).
  • In the end, Unsloth just performs so much better with the same hyperparameters (lr=1e-4, linear scheduler, batch size 32, warmup steps 64, epochs=6). I can replicate relevant papers with Unsloth.
  • Also, I have synced with the authors of said papers, and most of them also use lm-eval for their evaluations with more or less the same configuration as mine. This solidifies my point.
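For anyone debugging something similar, the next check on my list is to look at what the trainer actually feeds the model, i.e. pull a single batch from the dataloader and inspect padding, special tokens and the label mask. A rough sketch, assuming the trainer and tokenizer from the script in my first post:

# Sketch: inspect one batch as produced by the SFTTrainer data collator.
# Assumes `trainer` and `tokenizer` are the objects built in the finetuning script above.
batch = next(iter(trainer.get_train_dataloader()))

print(batch.keys())                            # typically input_ids, attention_mask, labels
print(batch["input_ids"].shape)
print(tokenizer.decode(batch["input_ids"][0])) # full sequence incl. special/pad tokens
print(batch["labels"][0][:50])                 # -100 marks positions excluded from the loss
print("pad token:", tokenizer.pad_token, tokenizer.pad_token_id)
print("padding side:", tokenizer.padding_side)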

I am beyond frustrated at this point; every HF tutorial handles tokenization of Llama chat/instruct models differently. At least for my training of Llama-2-7b-chat-hf and the newer Llama-3.1 Instruct models (on different datasets such as GSM8K, PubMedQA, etc.), nothing seems to work. In fact, most tutorials do not evaluate the model on benchmarks after training and merely print out a response, which gives me no way to verify that the finetuning is actually good and can understandably be misleading.

As said, I have tried multiple Llama chat/instruct models on different datasets. I honestly don’t know what’s happening. It would be really unfortunate to have to rely on llama-recipes or Unsloth for such a “standard” finetuning procedure.

But maybe some of the legends around here have a clue and can tell me what I am missing, e.g. @philschmid ? :slight_smile:

Until then, Unsloth/llama-recipes is the way to go for chat and instruct Llama models, at least for me.

I will attach my current lm-eval results on GSM8K for comparison and completeness:

- Llama-3.1-8B-Instruct (Non-Finetuned)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.7384|±  |0.0121|
|     |       |strict-match    |     0|exact_match|↑  |0.0000|±  |0.0000|

- Llama-3.1-8B-Instruct Finetuned (mine)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.6899|±  |0.0127|
|     |       |strict-match    |     0|exact_match|↑  |0.6770|±  |0.0129|

- Llama-3.1-8B-Instruct Finetuned (Unsloth)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.7551|±  |0.0118|
|     |       |strict-match    |     0|exact_match|↑  |0.7460|±  |0.0120|

- Llama-2-7B-Chat-HF (Non-Finetuned)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.2267|±  |0.0115|
|     |       |strict-match    |     0|exact_match|↑  |0.0000|±  |0.0000|

- Llama-2-7B-Chat-HF (Mine)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.1501|±  |0.0098|
|     |       |strict-match    |     0|exact_match|↑  |0.2578|±  |0.0120|

- Llama-2-7B-Chat-HF (Unsloth)
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     0|exact_match|↑  |0.2737|±  |0.0123|
|     |       |strict-match    |     0|exact_match|↑  |0.2737|±  |0.0123|

Thanks so much so far. Maybe we can resolve this at some later point!


Thank you so much for the detailed follow-up, much appreciated. I have three other ideas (to make you waste even more time!):

  1. Use Unsloth up until the training function.
  2. Print samples of the tokenized data (or better, save them into a CSV file) and use F1 to compare en masse whether they are the same (rough sketch at the end of this post).
  3. (This is just a wild guess:) you may be having memory leak issues due to FSDP, so maybe you could pretokenize everything, run a few checks, and then send it to the model.

Anyways, thanks again for this, always learning something.
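For (2), something like this would do; just a sketch that checks exact equality rather than F1, and the dataset names are placeholders for your two preprocessed datasets:

# Sketch for idea 2: dump both tokenizations to a CSV and compare row by row.
# `unsloth_dataset` and `my_dataset` are placeholders for your two preprocessed datasets.
import csv

with open("tokenized_comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "unsloth", "mine", "identical"])
    for i, (a, b) in enumerate(zip(unsloth_dataset["text"], my_dataset["text"])):
        writer.writerow([i, a, b, a == b])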


Also, maybe try Axolotl and LLaMA-Factory?


Thank you @arthrod for your swift replies!

If I have some more time, I’ll dig deeper, of course :slight_smile:

Good point! I forgot to mention this, hence also my confusion. I did use the same dataset and applied the same ShareGPT chat template as in the Unsloth tutorial. Long story short: both my tokenization approach and the one from Unsloth yield the same tokenized outputs. Here is some rudimentary code:

Unsloth chat templating applied to a ShareGPT version of GSM8K:

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = True, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("McLuian/GSM8K-Train-ShareGPT", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Mine, for the same dataset, with a workaround mapping function:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model and tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")

# Handle pad token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<pad>"})
    tokenizer.pad_token = tokenizer.eos_token

# Dataset
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    formatted_convos = []
    for convo in convos:
        formatted_convo = []
        last_role = None
        for message in convo:
            role = message["from"]
            if role == "human":
                role = "user"
            elif role == "gpt":
                role = "assistant"

            # Ensure alternation
            if last_role == role:
                raise ValueError(f"Invalid alternation: Role '{role}' repeated in conversation.")
            last_role = role

            formatted_convo.append({"role": role, "content": message["value"]})
        formatted_convos.append(formatted_convo)
    
    texts = [tokenizer.apply_chat_template(convo, tokenize=True, add_generation_prompt=False) for convo in formatted_convos]
    return {"text": texts}

dataset_n = load_dataset("McLuian/GSM8K-Train-ShareGPT", split="train")
chat_dataset = dataset_n.map(formatting_prompts_func, batched=True)

Then we can do:

for c,d in zip(dataset["text"], chat_dataset["text"]):
    assert c == d

which passes without a single assertion error.

So, as said, this really confuses me!
I have also tried training on a single GPU for an apples-to-apples comparison and to rule out FSDP/DDP issues. No change …

2 Likes