Merged LoRA & text generation inference issues

Hi,

I have finetuned falcon-7b for a specific task using the PEFT library, specifically a LoRA adapter. The finetuning works well, and I wanted to use the result with text-generation-inference (here).
The Falcon model is supported, but PEFT is not, so I merged my LoRA weights into the base model. I can use this merged model with transformers AutoModel, but not with text-generation-inference. Here are the error messages:

Torch: RuntimeError: weight transformer.word_embeddings.weight does not exist
Safetensors: RuntimeError: weight lm_head.weight does not exist, and indeed there is no lm_head field in the config.
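
For context, the merge itself was done roughly like this (a minimal sketch; the adapter path is a placeholder, using PEFT's merge_and_unload):

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", trust_remote_code=True, torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, "path/to/my_lora_adapter")  # placeholder path
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights

merged.save_pretrained("falcon-7b-merged", safe_serialization=True)
AutoTokenizer.from_pretrained("tiiuae/falcon-7b").save_pretrained("falcon-7b-merged")
```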

Any clues on what I should do?

Could you show the finetuning code? It's hard to see where this error happens. It looks to me like you finetuned the model with the wrong attention blocks, but maybe I'm wrong.

Of course, here it is.

def train(self, training_text: str, lora_name: str, **kwargs):
        assert self.model is not None
        assert self.tokenizer is not None

        kwargs = {**TRAINING_PARAMS, **LORA_TRAINING_PARAMS, **kwargs}
        train_dataset = self.tokenize_dataset(
            training_text, kwargs["dataset_max_size"], kwargs["max_sequence_length"]
        )

        args = {}

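        # Falcon layer names for LoRA: attention projections (query_key_value, dense)
        # and the MLP layers (dense_h_to_4h, dense_4h_to_h)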
        if "tiiuae/falcon" in self.model_name:
            args = {
                "target_modules": [
                    "query_key_value",
                    "dense",
                    "dense_h_to_4h",
                    "dense_4h_to_h",
                ]
            }
        if kwargs["is_gpt2"] == True:
            args["fan_in_fan_out"] = True

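        # Freeze the base weights for k-bit training, then wrap the model with the LoRA adapter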
        self.model = peft.prepare_model_for_kbit_training(self.model)
        self.model = peft.get_peft_model(
            self.model,
            peft.LoraConfig(
                r=kwargs["lora_r"],
                lora_alpha=kwargs["lora_alpha"],
                lora_dropout=kwargs["lora_dropout"],
                bias="none",
                task_type="CAUSAL_LM",
                **args,
            ),
        )

        if not os.path.exists(LORA_DIR):
            os.makedirs(LORA_DIR)

        sanitized_model_name = sanitize_model_name(self.model_name)
        output_dir = f"{LORA_DIR}/{sanitized_model_name}_{lora_name}"

        training_args = TrainingArguments(
            per_device_train_batch_size=kwargs["micro_batch_size"],
            gradient_accumulation_steps=kwargs["gradient_accumulation_steps"],
            num_train_epochs=kwargs["epochs"],
            learning_rate=kwargs["learning_rate"],
            warmup_steps=math.floor(len(train_dataset) * 0.05),
            fp16=True,
            optim="adamw_torch",
            logging_steps=5,
            save_total_limit=3,
            save_steps=0,
            output_dir=output_dir,
        )

        self.trainer = Trainer(
            model=self.model,
            train_dataset=train_dataset,
            args=training_args,
            data_collator=DataCollatorForLanguageModeling(
                self.tokenizer,
                mlm=False,
            )
        )

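        # The generation (KV) cache isn't needed during training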
        self.model.config.use_cache = False

        self.trainer.train(resume_from_checkpoint=False)

        self.model.save_pretrained(output_dir)

Honestly I'm not really sure, but I think you are replacing some layers with others by setting `args["fan_in_fan_out"] = True`, and this could be the reason.

For tiiuae/falcon-7b I've had luck fine-tuning and performing inference with PEFT.

The main difference in my code is that I'm only targeting "query_key_value".
Maybe give that a shot. I just read the original LoRA paper last night, and their finding was that targeting just the query and value projections is likely sufficient. I recommend giving it a read; it's pretty quick and was very informative.

One of the things I learned (that was hard to find a definitive answer for elsewhere) was the implication of lora_alpha on training. In the paper they indicate that they keep it at a 1-to-1 ratio with r, since changing alpha is equivalent to scaling the learning rate. Thus if r=16 they set lora_alpha=16, if r=8 they set lora_alpha=8, etc.
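
Putting those two points together, here's roughly the LoraConfig I mean (a sketch; the numbers are just examples, not tuned values):

```
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=16,                       # keep alpha equal to r, per the 1-to-1 ratio above
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # only the fused attention projection
)
```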

You are missing this line when using LoRA:
`"modules_to_save": ["embed_tokens", "lm_head"]`  # without these, the saved model won't have the newly resized embeddings
You want PEFT to save these as well (embed_tokens in my case, because I was adding some special tokens).
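
A sketch of where that goes, assuming you added special tokens and resized the embeddings first (module names depend on the model; for Falcon the input embedding is named word_embeddings rather than embed_tokens):

```
from peft import LoraConfig

# Hypothetical setup: special tokens were added, so the embedding matrix was resized beforehand:
# tokenizer.add_special_tokens({"additional_special_tokens": ["<my_token>"]})
# model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=8,
    lora_alpha=8,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
    modules_to_save=["embed_tokens", "lm_head"],  # full copies of these are saved with the adapter
)
```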