Hello everyone,
I have been playing around with PEFT and LoRA fine-tuning using the SFTTrainer for instruction fine-tuning of LLaMA-7B. I use the dolly-15k annotated dataset, which I have processed to add special tokens: lionelchg/dolly15k_special_tokens on the Hugging Face Hub. There are six special tokens, and they are added with the following lines of code (where `tokenizer` is a `LlamaTokenizer`):
```python
# Add the six special tokens that delimit instruction, context and answer
special_tokens = [
    "<START_INST>", "<END_INST>",
    "<START_CTX>", "<END_CTX>",
    "<START_A>", "<END_A>",
]
tokenizer.add_tokens(special_tokens, special_tokens=True)

# Resize the embedding matrix so the new tokens get learnable parameters
model.resize_token_embeddings(len(tokenizer))
```
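Right after these lines, the sizes look correct to me (a minimal check; 4096 is LLaMA-7B's hidden size):

```python
# Quick sanity check right after resizing
print(len(tokenizer))                             # 32006
print(model.get_input_embeddings().weight.shape)  # torch.Size([32006, 4096])
```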
I have two questions:
- when saving the model and tokenizer with the `.save_pretrained()` method, the model and tokenizer loaded back with `from_pretrained()` do not show the 32006-row embedding matrices but only the original 32000-row ones. Do I need to pass additional parameters to tell the tokenizer to search for `added_tokens.json`? Why are the model embeddings not the correct size? (See the round-trip sketch after this list for what I mean.)
- I saw this token-adding methodology used in a full fine-tuning (not LoRA-based) setup, but if I use LoRA, the embedding matrices won't be updated, right? (A sketch of the kind of LoRA config I mean is also below.)
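To make the first question concrete, this is roughly the round trip I have in mind (the output directory name is just a placeholder, and I am leaving out the PEFT adapter handling):

```python
from transformers import LlamaTokenizer, LlamaForCausalLM

output_dir = "llama7b-dolly-sft"  # placeholder path

# Saving: tokenizer.save_pretrained() writes an added_tokens.json
# containing the 6 special tokens next to tokenizer_config.json
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Reloading
tokenizer = LlamaTokenizer.from_pretrained(output_dir)
model = LlamaForCausalLM.from_pretrained(output_dir)

# Here I would expect 32006 / [32006, 4096], but I only see the original 32000
print(len(tokenizer))
print(model.get_input_embeddings().weight.shape)
```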
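And for the second question, my understanding is that with a setup along these lines, the LoRA adapters are only attached to the attention projections, so the resized `embed_tokens` / `lm_head` weights (including the six new rows) stay frozen. The values and target modules here are illustrative, not necessarily exactly what I use:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # illustrative values
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapters only on attention projections
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)

# Only the LoRA A/B matrices are listed as trainable here;
# embed_tokens and lm_head are frozen
peft_model.print_trainable_parameters()
```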