Difference in model prediction before saving and after loading

Hi @nielsr, I’m seeing a difference in the model’s predictions before saving the model and after loading it.

I’m fine-tuning Google’s Gemma 2B model.

Please find the reproducible code below.

Here I’m fine-tuning the Gemma model with my dataset.

import os
import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset, load_dataset

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field = "text",
    max_seq_length = 512,
    args=transformers.TrainingArguments(
        num_train_epochs = 10,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        seed = 12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)

trainer.train()

After training finished, I immediately tested the model with two examples to check how it predicts on them, and I noticed that it generated the expected output.

# Below is with example 1 input

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Below is with example 2 input

text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After checking the performance of the fine-tuned model, I saved it with the step below:

trainer.save_model("finetuned_model")

After saving the model, I restarted the kernel and loaded the fine-tuned model:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import os
import torch

new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
# reload the tokenizer as well, since the kernel was restarted
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ["HF_TOKEN"])

After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I’m not sure where I’m making a mistake. Could you please help me here?

Expected behavior

The model should generate the same answer before saving and after loading.

Hi,

that’s because during fine-tuning it looks like you used Q-LoRA (with the base model loaded in 4 bits and adapters added on top in float16), whereas at inference time you load the model in 16-bit precision.
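If you want the reloaded model to behave like the one you tested right after training, one option (a rough sketch, reusing the paths from your snippet; small numerical differences may still remain) is to reload the base model with the same BitsAndBytesConfig you trained with and then attach the saved adapters, instead of loading everything in float16:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Same 4-bit setup as during fine-tuning, so the base weights are quantized identically
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map={"": 0},
    token=os.environ["HF_TOKEN"],
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ["HF_TOKEN"])

# Attach the LoRA adapters that trainer.save_model("finetuned_model") wrote to disk
model = PeftModel.from_pretrained(base, "finetuned_model")
model.eval()

With the base model quantized the same way in both sessions and greedy decoding (the default when do_sample is not enabled), the generations should match what you saw right after training.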

Hi @nielsr and @Iamexperimenting,

Given that we use LoRA for fine-tuning, how should we save and load the model to avoid these issues? Currently, I save and load like this:

# saving it
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
model.save_pretrained(path_to_model)

# load the model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
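Would something along these lines be the right pattern instead? A rough sketch (the names base_model_id, adapter_dir and merged_model_dir are placeholders of mine), saving only the adapters and re-attaching them to the base model loaded the same way as during training:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

base_model_id = "base-model-checkpoint"   # placeholder for the original base checkpoint
adapter_dir = "adapter_dir"               # placeholder local path for the saved adapters

# Saving: on a PEFT-wrapped model, save_pretrained / push_to_hub only write the adapter weights
# (`model` here is the PEFT-wrapped model from the snippet above)
model.save_pretrained(adapter_dir)

# Loading: reload the base model (same precision/quantization as training) and attach the adapters
base = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base, adapter_dir)

# Alternatively, merge the adapters into the base weights and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("merged_model_dir")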

I’m experiencing something similar with BART. For simplicity, I’m not even fine-tuning it; I’m just loading the model, saving it, and then reloading it.

from transformers import (BartForConditionalGeneration, BartTokenizer)
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart.save_pretrained(save_directory="./bart")
reloaded = BartForConditionalGeneration.from_pretrained("./bart")

sentences = ["Hi, how are you?", "Hello, how re you?", "How's it going?"]
for sample in sentences:

    print("Input: ", sample)
    input = tokenizer(sample, max_length=512, truncation=True, return_tensors="pt").input_ids

    with torch.no_grad():
        bart_output = bart.generate(input)
    print(bart_output.shape)
    print("Generated bart: ", tokenizer.decode(bart_output[0], skip_special_tokens=True))

    with torch.no_grad():
        reloaded_output = reloaded.generate(input)
    print(reloaded_output.shape)
    print("Generated reloaded: ", tokenizer.decode(reloaded_output[0], skip_special_tokens=True))

    print("*" * 10)

The issue is that “bart” and “reloaded” generate different outputs even though they are essentially the same model. What am I missing? @nielsr @mitra-mir @Iamexperimenting
Any help is greatly appreciated. This is the first time I’m encountering this issue.


I generally pass generation kwargs while doing inference. This was my first time using generate() with just the input IDs, so the results confused me. Here’s what happened:

  1. I assumed that when I load the model using from_pretrained(), the generation config present inside the saved model folder gets loaded as well and is eventually used if no decoding-strategy-specific arguments are passed to generate(). This is where I was wrong.
  2. I needed to load the generation config separately and pass it along with the inputs (see the sketch below).
    After doing this, the outputs of the original model and the reloaded model were the same.
    Problem solved!
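For reference, this is roughly what the fix looked like (a sketch mirroring the BART example above):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer, GenerationConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart.save_pretrained(save_directory="./bart")
reloaded = BartForConditionalGeneration.from_pretrained("./bart")

# Load the checkpoint's generation config explicitly and pass it to generate() on both models
gen_config = GenerationConfig.from_pretrained("facebook/bart-large-cnn")

input_ids = tokenizer("Hi, how are you?", max_length=512, truncation=True, return_tensors="pt").input_ids
with torch.no_grad():
    original_out = bart.generate(input_ids, generation_config=gen_config)
    reloaded_out = reloaded.generate(input_ids, generation_config=gen_config)

print("Generated bart:     ", tokenizer.decode(original_out[0], skip_special_tokens=True))
print("Generated reloaded: ", tokenizer.decode(reloaded_out[0], skip_special_tokens=True))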