QLoRA Llama2 additional special tokens

I am trying to fine-tune the meta-llama/Llama-2-7b-hf model on a recipe dataset using QLoRA and SFTTrainer. My dataset contains special tokens (such as <RECIPE_TITLE>, <END_TITLE>, <STEPS>, <END_STEPS>, etc.) which help with structuring the recipes. Before fine-tuning I added these additional tokens to the tokenizer:

special_tokens_dict = {
    "additional_special_tokens": [
        "<RECIPE_TITLE>", "<END_TITLE>",
        "<INGREDIENTS>", "<END_INGREDIENTS>",
        "<STEPS>", "<END_STEPS>",
    ],
    "pad_token": "",
}
tokenizer.add_special_tokens(special_tokens_dict)

I also resized the model's token embeddings to match the new tokenizer length. The fine-tuned model does place these structural markers in the right positions (the generated recipe is well-structured), but it spells them out as sequences of ordinary sub-word token ids instead of using the newly added token ids.
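For context, the resize step looks roughly like this (the 4-bit quantization settings below are illustrative, not necessarily my exact config):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative QLoRA-style 4-bit load; exact settings may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Grow the embedding matrix (and output head) so the new token ids have rows.
model.resize_token_embeddings(len(tokenizer))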

From my knowledge, LoRA does not automatically update the embedding matrix, so I made sure to specify this in the LoRA config:

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj"],
    # Keep full trainable copies of the embeddings and output head
    # so the new token embeddings can be learned and saved.
    modules_to_save=["embed_tokens", "lm_head"],
)
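One way to check that modules_to_save actually makes those copies trainable is to wrap the model with PEFT and inspect the parameters (a sketch, reusing the config above):

from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# The embedding and lm_head copies created by modules_to_save should
# appear here with requires_grad=True.
for name, param in model.named_parameters():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, param.requires_grad, tuple(param.shape))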

What is the reason behind the model not being able to learn the embeddings of the newly added tokens?


Hey, did you manage to solve this?

I put my tokens in a list and use tokenizer.add_tokens(new_tokens) instead, and it works properly.
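Roughly, that looks like this (a sketch; the token names are taken from the question above, not from my actual dataset):

new_tokens = ["<RECIPE_TITLE>", "<END_TITLE>", "<INGREDIENTS>",
              "<END_INGREDIENTS>", "<STEPS>", "<END_STEPS>"]
# add_tokens registers them as regular tokens and returns how many were new.
num_added = tokenizer.add_tokens(new_tokens)
# The embedding matrix still has to be resized to cover the new ids.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")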