LoRA fine-tuning and special tokens

Hello everyone,

I have been playing around with PEFT and LoRA fine-tuning, using the SFTTrainer for instruction fine-tuning of LLaMA-7B. I use the dolly-15k annotated dataset, which I have processed to add special tokens (lionelchg/dolly15k_special_tokens on the Hugging Face Hub). There are six special tokens:

    special_tokens = [
        "<START_INST>", "<END_INST>",
        "<START_CTX>", "<END_CTX>",
        "<START_A>", "<END_A>",
    ]

These are added with the following lines of code (where tokenizer is a LlamaTokenizer):

    # Add special tokens
    special_tokens = [
        "<START_INST>", "<END_INST>",
        "<START_CTX>", "<END_CTX>",
        "<START_A>", "<END_A>",
    ]
    tokenizer.add_tokens(special_tokens, special_tokens=True)
    # resize the embedding matrix so the added tokens get new learnable rows
    model.resize_token_embeddings(len(tokenizer))
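
As a quick sanity check (a minimal sketch), both of these should report 32006, i.e. the original 32000 LLaMA tokens plus the six added ones:

    # Sanity check (sketch): confirm the tokenizer and the embedding matrix agree
    print(len(tokenizer))                                # expected 32006
    print(model.get_input_embeddings().weight.shape[0])  # expected 32006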

I have two questions:

  • when I save the model and tokenizer with the .save_pretrained() method and then load them back with from_pretrained(), I only see the original 32000-row embedding matrices instead of the resized 32006 ones. Do I need to pass additional parameters to tell the tokenizer to look for added_tokens.json? Why don't the model embeddings have the correct size? (My save/reload flow is sketched below.)
  • I saw this token-adding approach in a full fine-tuning (not LoRA-based) setup, but if I use LoRA, the embedding matrices won't be updated, right? (My LoRA config is sketched below.)
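
For the first question, my save/reload flow looks roughly like the sketch below (output_dir is a placeholder path, and I am assuming LlamaForCausalLM for the model class):

    # Sketch of the save/reload flow (output_dir is a placeholder path)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # Later, reload both
    from transformers import LlamaForCausalLM, LlamaTokenizer
    model = LlamaForCausalLM.from_pretrained(output_dir)
    tokenizer = LlamaTokenizer.from_pretrained(output_dir)

    # I would expect 32006 here, but I only get the original 32000
    print(len(tokenizer))
    print(model.get_input_embeddings().weight.shape[0])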
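
For the second question, my LoRA setup looks roughly like this (a sketch; r, lora_alpha, lora_dropout and target_modules are placeholder values). Since embed_tokens and lm_head are not LoRA target modules here, I assume the newly created embedding rows stay frozen unless I also list them in modules_to_save — is that correct?

    from peft import LoraConfig, get_peft_model

    # Sketch of the LoRA setup; the hyperparameters and target_modules are placeholders
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
        # modules_to_save=["embed_tokens", "lm_head"],  # would this be needed to train the new rows?
    )
    model = get_peft_model(model, lora_config)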