Difference in model prediction before saving and after loading

Hi @nielsr, I’m seeing a difference in the model’s predictions before saving the model and after loading it.

I’m fine-tuning Google’s Gemma 2B model.

Please find the reproducible code below.

Here I’m fine-tuning the Gemma model with my dataset.

import os
import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset, load_dataset

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field = "text",
    max_seq_length = 512,
    args=transformers.TrainingArguments(
        num_train_epochs = 10,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        seed = 12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)

trainer.train()

After training finished, I immediately tested the model with two examples to check how it predicts on them, and I noticed that it generated the expected output.

# Below is with example 1 input

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Below is with example 2 input

text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After checking the performance of the fine-tuned model, I saved it with the step below:

trainer.save_model("finetuned_model")

After saving the model, I restarted the kernel and loaded the fine-tuned model:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import os
import torch

new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
# reload the tokenizer as well, since the kernel was restarted
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ["HF_TOKEN"])

After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I’m not sure where I’m making a mistake. Could you please help me here?

Expected behavior

The model should generate the same answer before saving and after loading.

Hi,

that’s because during fine-tuning it looks like you used Q-LoRA (with the base model loaded in 4 bits and adapters added on top in float16), whereas at inference time you load the model in 16-bit precision.
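If you want the reloaded model to behave like the one you tested right after training, one option (a rough sketch, reusing the paths from your snippet; small numerical differences may still remain) is to reload the base model with the same BitsAndBytesConfig you trained with and then attach the saved adapters, instead of loading everything in float16:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Same 4-bit setup as during fine-tuning, so the base weights are quantized identically
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map={"": 0},
    token=os.environ["HF_TOKEN"],
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ["HF_TOKEN"])

# Attach the LoRA adapters that trainer.save_model("finetuned_model") wrote to disk
model = PeftModel.from_pretrained(base, "finetuned_model")
model.eval()

With the base model quantized the same way in both sessions and greedy decoding (the default when do_sample is not enabled), the generations should match what you saw right after training.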

Hi @nielsr and @Iamexperimenting,

Given that we use LoRA for fine-tuning, how should we save and load the model to avoid these issues? Currently, I save and load like this:

# saving it
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
model.save_pretrained(path_to_model)

# load the model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
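Would something along these lines be the right pattern instead? A rough sketch (the names base_model_id, adapter_dir and merged_model_dir are placeholders of mine), saving only the adapters and re-attaching them to the base model loaded the same way as during training:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_model_id = "base-model-checkpoint"   # placeholder for the original base checkpoint
adapter_dir = "adapter_dir"               # placeholder local path for the saved adapters

# Saving: on a PEFT-wrapped model, save_pretrained / push_to_hub only write the adapter weights
# (`model` here is the PEFT-wrapped model from the snippet above)
model.save_pretrained(adapter_dir)

# Loading: reload the base model (same precision/quantization as training) and attach the adapters
base = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base, adapter_dir)

# Alternatively, merge the adapters into the base weights and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("merged_model_dir")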

I’m experiencing something similar with BART. For simplicity, I’m not even fine-tuning it; I’m just loading the model, saving it, and then reloading it.

from transformers import (BartForConditionalGeneration, BartTokenizer)
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart.save_pretrained(save_directory="./bart")
reloaded = BartForConditionalGeneration.from_pretrained("./bart")

sentences = ["Hi, how are you?", "Hello, how re you?", "How's it going?"]
for sample in sentences:

    print("Input: ", sample)
    input = tokenizer(sample, max_length=512, truncation=True, return_tensors="pt").input_ids

    with torch.no_grad():
        bart_output = bart.generate(input)
    print(bart_output.shape)
    print("Generated bart: ", tokenizer.decode(bart_output[0], skip_special_tokens=True))

    with torch.no_grad():
        reloaded_output = reloaded.generate(input)
    print(reloaded_output.shape)
    print("Generated reloaded: ", tokenizer.decode(reloaded_output[0], skip_special_tokens=True))

    print("*" * 10)

The issue is that “bart” and “reloaded” generate different outputs even though they are essentially the same model. What am I missing? @nielsr @mitra-mir @Iamexperimenting
Any help is greatly appreciated. This is the first time I’m encountering this issue.


I generally pass generation kwargs while doing inference. This was my first time using generate() with just the input IDs, so the results confused me. Here’s what happened:

  1. I assumed that when I load the model using from_pretrained(), the generation config present inside the saved model folder gets loaded as well and is eventually used if no decoding-strategy-specific arguments are passed to generate(). This is where I was wrong.
  2. I needed to load the generation config separately and pass it along with the inputs (see the sketch below).
    After doing this, the outputs of the original model and the reloaded model were the same.
    Problem solved!
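For reference, this is roughly what the fix looked like (a sketch mirroring the BART example above):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer, GenerationConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart.save_pretrained(save_directory="./bart")
reloaded = BartForConditionalGeneration.from_pretrained("./bart")

# Load the checkpoint's generation config explicitly and pass it to generate() on both models
gen_config = GenerationConfig.from_pretrained("facebook/bart-large-cnn")

input_ids = tokenizer("Hi, how are you?", max_length=512, truncation=True, return_tensors="pt").input_ids
with torch.no_grad():
    original_out = bart.generate(input_ids, generation_config=gen_config)
    reloaded_out = reloaded.generate(input_ids, generation_config=gen_config)

print("Generated bart:     ", tokenizer.decode(original_out[0], skip_special_tokens=True))
print("Generated reloaded: ", tokenizer.decode(reloaded_out[0], skip_special_tokens=True))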