Difference in model prediction before saving and after loading

Hi @nielsr , I’m seeing a difference in the model’s predictions before saving the model and after loading it back.

I’m fine-tuning Google’s Gemma 2B model (google/gemma-2b).

Please find the reproducible code below.

Here I’m fine-tuning the Gemma model on my dataset.

import os  # needed for os.environ['HF_TOKEN'] below

import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset, load_dataset

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field="text",
    max_seq_length=512,
    args=transformers.TrainingArguments(
        num_train_epochs=10,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        seed=12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
)

trainer.train()

After training finished, I immediately tested the model with two examples to check its predictions, and it generated the expected output.

# Below is with example 1 input

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Below is with example 2 input

text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After checking the performance of the fine-tuned model, I saved it with the step below:

trainer.save_model("finetuned_model")
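
As a side note, the tokenizer could also be written into the same directory as the adapter so that everything reloads from one folder after the kernel restart. I did not do this in the run above; it is just the standard tokenizer.save_pretrained call, shown here as a sketch:

# Optional: save the tokenizer next to the adapter so it can be
# reloaded from "finetuned_model" after the kernel restart
tokenizer.save_pretrained("finetuned_model")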

After saving the model, I restarted the kernel and loaded the fine-tuned model:

import os
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# The kernel restart cleared the tokenizer, so reload it as well
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", token=os.environ['HF_TOKEN'])

new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I’m not sure where I’m making a mistake. Could you please help me here?

Expected behavior

I expect the model to generate the same answer before saving the model and after loading it back.
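
For reference, this is the kind of check I expect to pass. It is only a minimal sketch, assuming the in-memory model, the reloaded model, and the tokenizer are all available in one session, and using greedy decoding (the generate default) so the output is deterministic:

# Compare the in-memory fine-tuned model with the reloaded one on the same input.
# With greedy decoding (do_sample=False) both generations are deterministic,
# so the decoded texts should be identical.
text = "trained input example text 1"
inputs = tokenizer(text, return_tensors="pt").to("cuda:0")

before = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=200, do_sample=False)[0],
    skip_special_tokens=True,
)
after = tokenizer.decode(
    new_finetuned_model.generate(**inputs, max_new_tokens=200, do_sample=False)[0],
    skip_special_tokens=True,
)
assert before == after, "Generated text differs between the saved and reloaded model"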