Hi all,
I’m using the code below to generate Haskell code with CodeLlama. With the original pre-trained CodeLlama 7B the code runs fine, and it still works after I fine-tune on my own dataset and add the `PeftModel` line. But when I add `merge_and_unload()`, as I did for some other fine-tuned models (such as StarCoder), I get an `Expected a cuda device, but got: cpu` error during inference, and I can see that some layers get moved to the CPU. Why is this happening? And does the model behave the same without `merge_and_unload()`, just applying the LoRA adapter on every forward pass?
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Tokenizer: reuse EOS as the padding token and pad on the left for generation
tokenizer = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    cache_dir=cdir,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# 4-bit NF4 quantization with double quantization and fp16 compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Base model, sharded automatically across the available devices
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
    cache_dir=cdir,
)

# Attach the fine-tuned LoRA adapter
model = PeftModel.from_pretrained(model, "path_to_finetuned_checkpoint")
# model = model.merge_and_unload()  # <-- adding this line triggers the CPU/CUDA error

self.tokenizer = tokenizer
self.model = model
```
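
For context, the error shows up during a generation call along these lines. This is only a minimal sketch of the inference step; the prompt and the generation arguments are placeholders rather than my exact settings:

```python
# Minimal inference sketch; prompt and generation settings are placeholders
prompt = "-- Haskell: a function that reverses a list\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```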