Loading the model directly from a checkpoint path vs. loading it from the Hub gives inconsistent generations (the pushed model seems worse)

I fine-tuned the model "elyza/ELYZA-japanese-Llama-2-7b-instruct" with PEFT (LoRA tuning) to generate three-line summaries of Japanese news articles, and pushed my best checkpoint to the Hub:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "elyza/ELYZA-japanese-Llama-2-7b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)
model_4bit.config.use_cache = False
model = model_4bit 

# Add special tokens
special_tokens = ["[R_START]", "[R_END]"]
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(special_tokens, special_tokens=True)

# this creates new learnable embedding rows for the added special tokens
model.resize_token_embeddings(len(tokenizer))

tokenizer.pad_token = tokenizer.unk_token

# Right-side padding is the fix for fp16 training
tokenizer.padding_side = "right"

# ##### Training done (training loop omitted) #####

checkpoint = "output/path_to_my_best_checkpoint"

trainedmodel = PeftModel.from_pretrained(
    model,
    checkpoint,
    torch_dtype=torch.float16,   
    device_map={'':0}
)

if torch.cuda.is_available():
    trainedmodel = trainedmodel.to("cuda")

trainedmodel.push_to_hub("three-line-summarization-ja")
tokenizer.push_to_hub("three-line-summarization-ja")
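
For reference, the files that push_to_hub actually uploaded can be listed; a minimal sketch using huggingface_hub (for a PeftModel this should be just the adapter config and adapter weights, plus the tokenizer files pushed separately):

from huggingface_hub import list_repo_files

# Show which files ended up in the repo created by the push above
print(list_repo_files("waddledee/three-line-summarization-ja"))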

Then I loaded the pushed model:

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

peft_model_id = "waddledee/three-line-summarization-ja"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-instruct")  # note: full precision here, unlike the 4-bit base used above
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))

model_from_hub = PeftModel.from_pretrained(
    model, 
    peft_model_id, 
    torch_dtype=torch.float16,   
    device_map={'':0}
)

I expected model_from_hub and trainedmodel to generate the same result; however, model_from_hub seems much worse than trainedmodel. Is there any problem in my procedure for pushing/loading the model?

def gen(text, model):
    prompt = f"""<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。
<</SYS>>

以下の入力文を3行で要約しなさい。
入力文:
{text} [/INST] [R_START] 
"""
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    token_ids = token_ids.to("cuda")  # .to() is not in-place, so the result must be reassigned

    with torch.no_grad():
        output_ids = model.generate(
            inputs=token_ids,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=256
        )
    output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1) :], skip_special_tokens=True)
    return output

gen(text, model_from_hub)
>> 3行要約:
中国メディアの中国網は2016年1月28日、日本が中国の競合となると主張した理由を紹介した
日本がシルクロード文化に最も興味を示していること、冷戦後も早くシルクロードに商機を見出したこと、日本が中国に対して懐疑的であることを挙げた   

gen(text, trainedmodel)
>> 3行要約:
中国メディアが日本が一帯一路構想に対抗するのかを論じた
日本がシルクロード文化に最も興味を示していると指摘
日本は経済面、政治外交面、軍事面で中国に対して懐疑的   

I tried other texts as well, and model_from_hub consistently performs worse than trainedmodel, which was loaded directly from my checkpoint path.
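
To check whether the uploaded weights themselves differ from my local checkpoint (rather than, say, the 4-bit vs. full-precision base model), the trainable parameters of the two PeftModel instances can be compared directly; a minimal sketch (the name filters assume the usual PEFT/Llama parameter naming, so treat them as an assumption):

import torch

# Compare LoRA and embedding parameters between the locally loaded model and
# the hub-loaded model; the base weights are skipped because trainedmodel sits
# on a 4-bit quantized base while model_from_hub uses a full-precision base.
hub_params = dict(model_from_hub.named_parameters())
for name, param in trainedmodel.named_parameters():
    if "lora_" in name or "embed_tokens" in name:
        other = hub_params.get(name)
        if other is None:
            print(f"missing in hub model: {name}")
        elif not torch.allclose(param.detach().float().cpu(),
                                other.detach().float().cpu(), atol=1e-5):
            print(f"differs: {name}")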


I am also facing the same issue. Any suggestions would help.
