How to train a LlamaTokenizer?

Yes.

You can use either add_special_tokens or add_tokens.

However, remember that you have to resize the embedding layer of your model after doing that, and preferably train those new embeddings.
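For context, here is a minimal sketch of that workflow, assuming a transformers LlamaTokenizer and LlamaForCausalLM; the checkpoint name and the added tokens are just placeholders:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder checkpoint; substitute whichever Llama weights you are using.
checkpoint = "meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
model = LlamaForCausalLM.from_pretrained(checkpoint)

# add_tokens for ordinary vocabulary entries, add_special_tokens for control tokens.
tokenizer.add_tokens(["<domain_term_a>", "<domain_term_b>"])
tokenizer.add_special_tokens({"additional_special_tokens": ["<|my_marker|>"]})

# Grow the embedding matrix so the new token ids have rows to look up.
model.resize_token_embeddings(len(tokenizer))
```

The new embedding rows are randomly initialized, which is why some further training on data containing the new tokens is usually needed before they become useful.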


@nicholasKluge thank you so much! I really appreciate you taking the time to write such a detailed response! That has cleared things up a lot.

A follow-up question on usage:

I have found that llama.cpp offers an option to train a tiny llama from scratch. Is there a specific framework you would recommend to facilitate training from scratch? Or should I just train it with PyTorch & HuggingFace like any other model? The vast majority of online resources focus on fine-tuning, and I would prefer to have my own tokenizer rather than expand the default one.
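For reference, a minimal sketch of one way to build your own tokenizer with HuggingFace tooling (assuming a fast Llama tokenizer is available; the checkpoint name, corpus, and vocabulary size below are illustrative placeholders, not a recommendation from this thread):

```python
from transformers import AutoTokenizer

# Placeholder corpus; in practice this would stream your own training text.
corpus = [
    "first training document ...",
    "second training document ...",
]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Start from an existing fast Llama tokenizer so the algorithm and special
# tokens carry over, then learn a fresh vocabulary from your own data.
base = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("my-llama-tokenizer")
```

The resulting tokenizer can then be loaded with from_pretrained like any other and paired with a model trained from scratch.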