Different results from checkpoint evaluation when loading fine-tuned LLM model

I am fine-tuning Llama for binary sequence classification with PEFT & LoRA using the Trainer class. The loss decreases nicely and accuracy on the validation data reaches ~93% in the final epoch:

{
  "epoch": 4.98,
  "eval_accuracy": 0.9346576058546785,
  "eval_loss": 0.18449442088603973,
  "eval_runtime": 496.2064,
  "eval_samples_per_second": 7.711,
  "eval_steps_per_second": 0.965,
  "step": 595
}
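For context, here is a minimal sketch of the kind of setup used (the base checkpoint, LoRA hyperparameters, and the tokenized datasets train_ds / val_ds shown below are illustrative, not the exact script):

import numpy as np
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    LlamaForSequenceClassification,
    LlamaTokenizer,
    Trainer,
    TrainingArguments,
)

OUTPUT_DIR = "results/experiments_openllama"  # same directory later used as peft_model_id
base_model = "openlm-research/open_llama_7b"  # assumption: any Llama-style checkpoint

tokenizer = LlamaTokenizer.from_pretrained(base_model)
tokenizer.pad_token_id = tokenizer.eos_token_id  # Llama has no pad token by default

model = LlamaForSequenceClassification.from_pretrained(base_model, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Wrap the base model with LoRA adapters for sequence classification.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)

# Report accuracy on the validation set at the end of every epoch.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=5,
        evaluation_strategy="epoch",
    ),
    train_dataset=train_ds,  # assumption: pre-tokenized datasets
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)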
After training finishes I save the model as follows:

# Override state_dict so that only the LoRA adapter weights are saved
# (pattern taken from the alpaca-lora repository).
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))

model = torch.compile(model)

trainer.train()
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
torch.save(model.state_dict(), "torch_openllama_saver")

I subsequently try to reload the model and reproduce the evaluation on the same validation set, but this time I get an accuracy of only 55%. I load the model as follows:

import os
import sys
import textwrap

import torch
from peft import PeftConfig, PeftModel
from transformers import LlamaForSequenceClassification, LlamaTokenizer

CUTOFF_LEN = 512

# Pick the best available device.
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

try:
    if torch.backends.mps.is_available():
        device = "mps"
except:  # noqa: E722
    pass

load_8bit = True
# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results/experiments_openllama"
config = PeftConfig.from_pretrained(peft_model_id)
config.inference_mode = True


base_model = config.base_model_name_or_path
lora_weights = peft_model_id
if device == "cuda":
    model = LlamaForSequenceClassification.from_pretrained(
        base_model,
        load_in_8bit=load_8bit,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(
        model, lora_weights, torch_dtype=torch.float16, config=config
    )
elif device == "mps":
    model = LlamaForSequenceClassification.from_pretrained(
        base_model, device_map={"": device}, torch_dtype=torch.float16,
    )
    model = PeftModel.from_pretrained(
        model,
        lora_weights,
        device_map={"": device},
        torch_dtype=torch.float16,
        config=config,
    )  # must set inference_mode=True
else:
    model = LlamaForSequenceClassification.from_pretrained(
        base_model, device_map={"": device}, low_cpu_mem_usage=True
    )
    model = PeftModel.from_pretrained(
        model, lora_weights, device_map={"": device}, config=config
    )


# Reload the adapter weights explicitly. Note that strict=False silently skips
# any keys that do not match, so it is worth checking load_result.missing_keys
# and load_result.unexpected_keys if the accuracy looks wrong.
load_result = model.load_state_dict(
    torch.load(os.path.join(lora_weights, "adapter_model.bin")), strict=False
)

tokenizer = LlamaTokenizer.from_pretrained(peft_model_id)

model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.sep_token_id = tokenizer.sep_token_id
model.config.unk_token_id = tokenizer.unk_token_id
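One way to re-run the evaluation on the reloaded model is a plain Trainer.evaluate call. A minimal sketch, assuming val_ds is the same tokenized validation set and compute_metrics the same accuracy function used during training:

import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

model.eval()
eval_trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval_tmp", per_device_eval_batch_size=8),  # throwaway output dir
    compute_metrics=compute_metrics,
)
print(eval_trainer.evaluate(eval_dataset=val_ds))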

Why is that, and what am I doing wrong? Is the original evaluation accuracy incorrect, or is there an issue with loading the trained model weights?

I recently ran into a problem where the LoRA adapters weren't being saved correctly.
If you are running fine-tuning code similar to the alpaca-lora repository, there is apparently a bug where the adapter weights aren't saved.

Check out the related issues on the alpaca-lora repository.

Basically, this part seems to cause the weights not to be saved properly:

model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))

model = torch.compile(model)
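A workaround consistent with those reports is to drop the state_dict override (torch.compile also rewrites the state_dict key names, which can interfere with saving) and let PEFT write the adapter itself. A minimal sketch, assuming model is the PeftModel returned by get_peft_model:

# Sketch: save the LoRA adapter without overriding state_dict.
# Assumes `model` is the PeftModel returned by get_peft_model and has not been
# wrapped with torch.compile before saving.
trainer.train()
model.save_pretrained(OUTPUT_DIR)   # writes adapter_config.json plus the adapter weights
tokenizer.save_pretrained(OUTPUT_DIR)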

@ilektram Hi, can you share your fine-tuning script for Llama? In my code only the training loss is reported during fine-tuning, with no accuracy info. And although I trained for 3 epochs, the loss falls very slowly; after 3 epochs of training it is still 0.96. I don't know what the problem is.

Hey @00BER -

I was dealing with a similar issue and set up this gist to explain:

If you look at it, you can see that it saves 443-byte adapter files that do not change at all from checkpoint to checkpoint.
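As a sanity check, something like the following can compare the saved adapter files across checkpoints; the glob pattern is an assumption based on the output directory mentioned above:

# Sketch: compare size and hash of adapter_model.bin across checkpoints to see
# whether the LoRA weights are actually being written. The path pattern below
# is a placeholder for your own output directory.
import glob
import hashlib
import os

for path in sorted(glob.glob("results/experiments_openllama/checkpoint-*/adapter_model.bin")):
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    print(path, os.path.getsize(path), "bytes", digest)
# Identical tiny sizes (e.g. a few hundred bytes) and identical hashes across
# checkpoints indicate that the adapter weights were never saved.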

I did this on the latest peft version; do you have any ideas about what can be done to resolve the issue?

@Thinkcru Not sure if it is the same issue, but for me commenting out that part solved the problem.

probably related: