I have been experimenting with PEFT and LoRA fine-tuning, using the `SFTTrainer` for instruction fine-tuning of LLaMA-7B. I use the annotated dolly-15k dataset, which I have processed to add special tokens: lionelchg/dolly15k_special_tokens · Datasets at Hugging Face. There are six special tokens:
```python
special_tokens = [
    "<START_INST>", "<END_INST>",
    "<START_CTX>", "<END_CTX>",
    "<START_A>", "<END_A>",
]
```
These are added with the following lines of code (where `tokenizer` is the LLaMA tokenizer):
```python
# Add special tokens
special_tokens = [
    "<START_INST>", "<END_INST>",
    "<START_CTX>", "<END_CTX>",
    "<START_A>", "<END_A>",
]
tokenizer.add_tokens(special_tokens, special_tokens=True)
# resizing creates new learnable embedding rows for the added tokens
model.resize_token_embeddings(len(tokenizer))
```
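To sanity-check the token-adding and save/reload cycle in isolation, here is a toy round-trip using a small word-level tokenizer instead of the LLaMA one (the vocabulary and names here are assumptions for illustration, not my actual setup):

```python
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy stand-in for the real tokenizer: a three-word vocabulary.
base = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok = PreTrainedTokenizerFast(tokenizer_object=base, unk_token="[UNK]")
orig_size = len(tok)

special_tokens = [
    "<START_INST>", "<END_INST>",
    "<START_CTX>", "<END_CTX>",
    "<START_A>", "<END_A>",
]
tok.add_tokens(special_tokens, special_tokens=True)
assert len(tok) == orig_size + 6  # vocab grew by the six special tokens

with tempfile.TemporaryDirectory() as d:
    tok.save_pretrained(d)
    reloaded = PreTrainedTokenizerFast.from_pretrained(d)
    # the added tokens should survive the save/load round trip
    assert len(reloaded) == orig_size + 6
```

In this toy case the round trip preserves the added tokens, which is the behavior I would expect from the LLaMA tokenizer as well.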
I have two questions:
- when saving the model and tokenizer with the `save_pretrained()` method and then reloading them with the `from_pretrained()` method, the embedding matrices come back with the original size of 32000 instead of 32006. Do I need to pass additional parameters to tell the tokenizer to look for `added_tokens.json`? Why are the model embeddings not the correct size?
- I saw this token-adding methodology in a full fine-tuning (not LoRA-based) setup, but if I use LoRA, the embedding matrices won't be updated, right?