Loading the model directly from a checkpoint path vs. loading it from the Hub gives inconsistent generations (the pushed model seems worse)

I fine-tuned the model "elyza/ELYZA-japanese-Llama-2-7b-instruct" with PEFT (LoRA tuning) to generate three-line summaries of Japanese news articles, and pushed my best checkpoint to the Hub:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "elyza/ELYZA-japanese-Llama-2-7b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config, 
    trust_remote_code=True
)
model_4bit.config.use_cache = False
model = model_4bit 

# Add special tokens
special_tokens = ["[R_START]", "[R_END]"]
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(special_tokens, special_tokens=True)

# this creates new learnable embedding rows for the added special tokens
model.resize_token_embeddings(len(tokenizer))

tokenizer.pad_token = tokenizer.unk_token

# Right-side padding is the fix for fp16 training
tokenizer.padding_side = "right"

# ##### Training done (training loop omitted) #####

checkpoint = "output/path_to_my_best_checkpoint"

trainedmodel = PeftModel.from_pretrained(
    model,
    checkpoint,
    torch_dtype=torch.float16,   
    device_map={'':0}
)

if torch.cuda.is_available():
    trainedmodel = trainedmodel.to("cuda")

trainedmodel.push_to_hub("three-line-summarization-ja")
tokenizer.push_to_hub("three-line-summarization-ja")
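
For reference, the files that push_to_hub actually uploaded can be listed; a minimal sketch using huggingface_hub (for a PeftModel this should be just the adapter config and adapter weights, plus the tokenizer files pushed separately):

from huggingface_hub import list_repo_files

# Show which files ended up in the repo created by the push above
print(list_repo_files("waddledee/three-line-summarization-ja"))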

Then I loaded the pushed model:

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

peft_model_id = "waddledee/three-line-summarization-ja"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-instruct")  # note: full precision here, unlike the 4-bit base used above
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))

model_from_hub = PeftModel.from_pretrained(
    model, 
    peft_model_id, 
    torch_dtype=torch.float16,   
    device_map={'':0}
)

I expected model_from_hub and trainedmodel to generate the same result; however, model_from_hub seems much worse than trainedmodel. Is there any problem in my procedure for pushing/loading the model?

def gen(text, model):
    prompt = f"""<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。
<</SYS>>

以下の入力文を3行で要約しなさい。
入力文:
{text} [/INST] [R_START] 
"""
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    token_ids = token_ids.to("cuda")  # .to() is not in-place, so the result must be reassigned

    with torch.no_grad():
        output_ids = model.generate(
            inputs=token_ids,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=256
        )
    output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1) :], skip_special_tokens=True)
    return output

gen(text, model_from_hub)
>> 3行要約:
中国メディアの中国網は2016年1月28日、日本が中国の競合となると主張した理由を紹介した
日本がシルクロード文化に最も興味を示していること、冷戦後も早くシルクロードに商機を見出したこと、日本が中国に対して懐疑的であることを挙げた   

gen(text, trainedmodel)
>> 3行要約:
中国メディアが日本が一帯一路構想に対抗するのかを論じた
日本がシルクロード文化に最も興味を示していると指摘
日本は経済面、政治外交面、軍事面で中国に対して懐疑的   

I tried other texts as well, and model_from_hub consistently performs worse than trainedmodel, which was loaded directly from my checkpoint path.
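
To check whether the uploaded weights themselves differ from my local checkpoint (rather than, say, the 4-bit vs. full-precision base model), the trainable parameters of the two PeftModel instances can be compared directly; a minimal sketch (the name filters assume the usual PEFT/Llama parameter naming, so treat them as an assumption):

import torch

# Compare LoRA and embedding parameters between the locally loaded model and
# the hub-loaded model; the base weights are skipped because trainedmodel sits
# on a 4-bit quantized base while model_from_hub uses a full-precision base.
hub_params = dict(model_from_hub.named_parameters())
for name, param in trainedmodel.named_parameters():
    if "lora_" in name or "embed_tokens" in name:
        other = hub_params.get(name)
        if other is None:
            print(f"missing in hub model: {name}")
        elif not torch.allclose(param.detach().float().cpu(),
                                other.detach().float().cpu(), atol=1e-5):
            print(f"differs: {name}")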


I am also facing the same issue. Any suggestions would help.
