LoRA fine-tuning and special tokens

Hello everyone,

I have been playing around with PEFT and LoRA fine-tuning, using the SFTTrainer for instruction fine-tuning of LLaMA-7B. I use the dolly-15k annotated dataset, which I have processed to add special tokens (lionelchg/dolly15k_special_tokens on the Hugging Face Hub). There are six special tokens:

    special_tokens = [
        "<START_INST>", "<END_INST>",
        "<START_CTX>", "<END_CTX>",
        "<START_A>", "<END_A>",
    ]

These are added with the following lines of code (where tokenizer is a LlamaTokenizer):

    # Add special tokens
    special_tokens = [
        "<START_INST>", "<END_INST>",
        "<START_CTX>", "<END_CTX>",
        "<START_A>", "<END_A>",
    ]
    tokenizer.add_tokens(special_tokens, special_tokens=True)
    # resize the embedding matrix so the added tokens get new learnable rows
    model.resize_token_embeddings(len(tokenizer))
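
As a quick sanity check (a minimal sketch), both of these should report 32006, i.e. the original 32000 LLaMA tokens plus the six added ones:

    # Sanity check (sketch): confirm the tokenizer and the embedding matrix agree
    print(len(tokenizer))                                # expected 32006
    print(model.get_input_embeddings().weight.shape[0])  # expected 32006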

I have two questions:

  • when I save the model and tokenizer with the .save_pretrained() method and then load them back with from_pretrained(), I only see the original 32000-row embedding matrices instead of the resized 32006 ones. Do I need to pass additional parameters to tell the tokenizer to look for added_tokens.json? Why don't the model embeddings have the correct size? (My save/reload flow is sketched below.)
  • I saw this token-adding approach in a full fine-tuning (not LoRA-based) setup, but if I use LoRA, the embedding matrices won't be updated, right? (My LoRA config is sketched below.)
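
For the first question, my save/reload flow looks roughly like the sketch below (output_dir is a placeholder path, and I am assuming LlamaForCausalLM for the model class):

    # Sketch of the save/reload flow (output_dir is a placeholder path)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # Later, reload both
    from transformers import LlamaForCausalLM, LlamaTokenizer
    model = LlamaForCausalLM.from_pretrained(output_dir)
    tokenizer = LlamaTokenizer.from_pretrained(output_dir)

    # I would expect 32006 here, but I only get the original 32000
    print(len(tokenizer))
    print(model.get_input_embeddings().weight.shape[0])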
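
For the second question, my LoRA setup looks roughly like this (a sketch; r, lora_alpha, lora_dropout and target_modules are placeholder values). Since embed_tokens and lm_head are not LoRA target modules here, I assume the newly created embedding rows stay frozen unless I also list them in modules_to_save — is that correct?

    from peft import LoraConfig, get_peft_model

    # Sketch of the LoRA setup; the hyperparameters and target_modules are placeholders
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
        # modules_to_save=["embed_tokens", "lm_head"],  # would this be needed to train the new rows?
    )
    model = get_peft_model(model, lora_config)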