I fine-tuned "elyza/ELYZA-japanese-Llama-2-7b-instruct" with PEFT (LoRA) to generate three-line summaries of Japanese news articles, and pushed my best checkpoint to the Hub:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_name = "elyza/ELYZA-japanese-Llama-2-7b-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model_4bit.config.use_cache = False
model = model_4bit
# Add special tokens
special_tokens = ["[R_START]", "[R_END]"]
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(special_tokens, special_tokens=True)
# resizing adds new learnable embedding rows for the added special tokens
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token = tokenizer.unk_token
# This is the fix for fp16 training
tokenizer.padding_side = "right"
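As a quick sanity check that the new tokens were registered (a small sketch):

# both tokens should map to fresh ids at the end of the vocab,
# and len(tokenizer) should have grown by 2
print(tokenizer.convert_tokens_to_ids(["[R_START]", "[R_END]"]))
print(len(tokenizer))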
#####
# Training done (the training loop itself is omitted; a sketch follows)
#####
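For reference, the omitted training step was along these lines. This is a minimal sketch, and the LoraConfig values below are illustrative assumptions, not my exact hyperparameters:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# illustrative only: r/alpha/dropout/target_modules are assumptions
peft_model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
    # resized embeddings are only saved with the adapter if listed here
    modules_to_save=["embed_tokens", "lm_head"],
)
peft_model = get_peft_model(peft_model, lora_config)
# ... trainer run omitted; best checkpoint saved under output/ ...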
checkpoint = "output/path_to_my_best_checkpoint"
trainedmodel = PeftModel.from_pretrained(
    model,
    checkpoint,
    torch_dtype=torch.float16,
    device_map={'':0}
)
if torch.cuda.is_available():
    trainedmodel = trainedmodel.to("cuda")
trainedmodel.push_to_hub("three-line-summarization-ja")
tokenizer.push_to_hub("three-line-summarization-ja")
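To sanity-check the push, the files that actually landed in the repo can be listed (a quick sketch using huggingface_hub, with the repo id as loaded below):

from huggingface_hub import list_repo_files

# a PEFT push normally uploads adapter_config.json plus the adapter
# weights (adapter_model.safetensors or adapter_model.bin); any
# modules_to_save weights are bundled into the same adapter file
print(list_repo_files("waddledee/three-line-summarization-ja"))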
Then I loaded the pushed model:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
peft_model_id = "waddledee/three-line-summarization-ja"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-instruct")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# resize to the training-time vocab before attaching the adapter,
# so the embedding shapes match the checkpoint
model.resize_token_embeddings(len(tokenizer))
model_from_hub = PeftModel.from_pretrained(
    model,
    peft_model_id,
    torch_dtype=torch.float16,
    device_map={'':0}
)
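To narrow down where the two models diverge, here is a rough sketch that compares the weights that should round-trip through the Hub: the LoRA matrices and the resized embeddings. (The full base weights are not directly comparable, since trainedmodel sits on a 4-bit base while model_from_hub sits on a full-precision one.)

import torch

hub_params = dict(model_from_hub.named_parameters())
for name, p in trainedmodel.named_parameters():
    # only the adapter and embedding weights are expected to match
    if "lora" not in name and "embed_tokens" not in name:
        continue
    q = hub_params.get(name)
    if q is None:
        print("missing from hub model:", name)
    elif p.shape != q.shape:
        print("shape mismatch:", name, p.shape, q.shape)
    elif not torch.allclose(p.detach().float().cpu(), q.detach().float().cpu(), atol=1e-5):
        print("values differ:", name)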
I expected model_from_hub and trainedmodel to generate the same output, but model_from_hub seems much worse than trainedmodel. Is there a problem in my procedure for pushing/loading the model? Here is the generation function I used:
def gen(text, model):
    # The prompt is in Japanese: the system message says "You are a
    # sincere and excellent Japanese assistant.", and the instruction
    # asks the model to summarize the input text in three lines.
    prompt = f"""<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。
<</SYS>>
以下の入力文を3行で要約しなさい。
入力文:
{text} [/INST] [R_START]
"""
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    # Tensor.to() is not in-place, so the result has to be reassigned
    token_ids = token_ids.to("cuda")
    with torch.no_grad():
        output_ids = model.generate(
            inputs=token_ids,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=256,
        )
    # decode only the newly generated tokens, skipping the prompt
    output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):], skip_special_tokens=True)
    return output
gen(text, model_from_hub)
>> 3行要約:
中国メディアの中国網は2016年1月28日、日本が中国の競合となると主張した理由を紹介した
日本がシルクロード文化に最も興味を示していること、冷戦後も早くシルクロードに商機を見出したこと、日本が中国に対して懐疑的であることを挙げた
(English: "Three-line summary: / Chinese media outlet China Net introduced, on January 28, 2016, its reasons for claiming that Japan will become a competitor to China / It cited Japan's strong interest in Silk Road culture, its quickness to find business opportunities on the Silk Road after the Cold War, and its skepticism toward China")
gen(text, trainedmodel)
>> 3行要約:
中国メディアが日本が一帯一路構想に対抗するのかを論じた
日本がシルクロード文化に最も興味を示していると指摘
日本は経済面、政治外交面、軍事面で中国に対して懐疑的
(English: "Three-line summary: / Chinese media discussed whether Japan will counter the Belt and Road Initiative / Pointed out that Japan shows the strongest interest in Silk Road culture / Japan is skeptical of China economically, in politics and diplomacy, and militarily")
I tried other texts as well, and model_from_hub consistently performed worse than trainedmodel, which was loaded directly from my checkpoint path.