How to train a LlamaTokenizer?

Yes.

You can use either add_special_tokens or add_tokens.

However, remember that you have to resize the embedding layer of your model after doing that, and preferably train those new embeddings.
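For example, something along these lines (a minimal sketch; the checkpoint name and the added tokens are just placeholders, swap in whatever you are actually using):

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder checkpoint; use whichever Llama model you are working with
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
model = LlamaForCausalLM.from_pretrained(checkpoint)

# add_tokens for regular vocabulary entries, add_special_tokens for special ones
tokenizer.add_tokens(["<domain_term_1>", "<domain_term_2>"])
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})

# Resize the model's embedding layer so it matches the new vocabulary size,
# then (ideally) continue training so the new embeddings learn something useful
model.resize_token_embeddings(len(tokenizer))
```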


@nicholasKluge thank you so much! Really appreciate you taking the time to write such a detailed response! That has cleared things up a lot.

A follow-up question on usage:

I have found that llama.cpp offers an option to train a tiny llama from scratch. Is there a specific framework you would recommend to facilitate training from scratch? Or should I just train it with PyTorch & HuggingFace like any other model? The vast majority of online resources focus on fine-tuning, and I would prefer to have my own tokenizer rather than expanding the default one.

Hello @MikeMpapa! I don't know much about training Llamas via llama.cpp (I thought it was just an inference framework), but for something as small as a TinyLlama (< 1.1B parameters), you can definitely rely on bare-bones PyTorch and Transformers and still get decent MFU out of it. Something more elaborate like PyTorch Lightning (what they used in the TinyLlama paper) can also do the trick. I also have an implementation using Accelerate if you want to check it out.
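If you go the plain Transformers route, a from-scratch run is roughly shaped like this. This is only a sketch under some assumptions: the corpus path `my_corpus.txt`, the `hf-internal-testing/llama-tokenizer` base used to retrain a fresh tokenizer, and the tiny model sizes are all illustrative, not a recipe from the TinyLlama paper.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaConfig,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

# Plain-text corpus; "my_corpus.txt" is a placeholder path
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]

# Train your own tokenizer from scratch on the corpus, reusing an existing
# Llama tokenizer only as a template for its type and special tokens
base = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer = base.train_new_from_iterator(
    (raw[i : i + 1000]["text"] for i in range(0, len(raw), 1000)),
    vocab_size=32_000,
)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding in the collator

# Define a small Llama architecture and initialize it with random weights
config = LlamaConfig(
    vocab_size=len(tokenizer),
    hidden_size=512,
    intermediate_size=1376,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tiny-llama-from-scratch",
        per_device_train_batch_size=8,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Lightning or Accelerate wrap essentially the same loop and mainly add the multi-GPU and mixed-precision plumbing, so you can start with something like the above and move to one of them once you need to scale out.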