`merge_and_unload` moves some layers to CPU

Hi all,

I’m using the following code to generate Haskell code with CodeLlama. With the original pre-trained CodeLlama 7B everything runs fine, and after fine-tuning on my dataset it still works when I add the `PeftModel.from_pretrained` line. But as soon as I also call `merge_and_unload()`, as I did for some other fine-tuned models (such as StarCoder), I get an `Expected a cuda device, but got: cpu` error during inference, and I can see that some layers have been moved to CPU. Why is this happening? And without `merge_and_unload()`, does the model behave the same, just applying the LoRA adapter on every forward pass?

        # Imports used by this snippet
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
        from peft import PeftModel

        # Tokenizer: reuse EOS as the pad token, left-pad for generation
        tokenizer = AutoTokenizer.from_pretrained(
            "codellama/CodeLlama-7b-hf",
            cache_dir=cdir  # cdir is my local cache directory
        )
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
        tokenizer.padding_side = "left"

        # 4-bit NF4 quantization with double quantization; compute in fp16
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )

        model = AutoModelForCausalLM.from_pretrained(
            "codellama/CodeLlama-7b-hf",
            device_map="auto",
            torch_dtype=torch.float16,
            quantization_config=quant_config,
            low_cpu_mem_usage=True,
            cache_dir=cdir
        )

        # Attach the fine-tuned LoRA adapter
        model = PeftModel.from_pretrained(model, "path_to_finetuned_checkpoint")
        # model = model.merge_and_unload()  # <- uncommenting this causes the error

        self.tokenizer = tokenizer
        self.model = model
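
For context, this is roughly how I run inference afterwards. The prompt and generation arguments below are simplified placeholders, not my exact settings, and the device check at the end is just how I confirm that some parameters end up on CPU after merging:

        # Simplified inference call (placeholder prompt and settings)
        prompt = "-- reverse a list\nrev :: [a] -> [a]\n"
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")

        with torch.no_grad():
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
        print(self.tokenizer.decode(output_ids[0], skip_special_tokens=True))

        # List any parameters that were placed on CPU
        for name, param in self.model.named_parameters():
            if param.device.type == "cpu":
                print(name, param.device)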