How to train a LlamaTokenizer?

Yes.

You can use either add_special_tokens or add_tokens.

However, remember that you have to resize the embedding layer of your model after doing that, and preferably train those new embeddings.
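For example, something along these lines (a minimal sketch; the checkpoint name and the added tokens are just placeholders, swap in whatever you are actually using):

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder checkpoint; use whichever Llama model you are working with
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
model = LlamaForCausalLM.from_pretrained(checkpoint)

# add_tokens for regular vocabulary entries, add_special_tokens for special ones
tokenizer.add_tokens(["<domain_term_1>", "<domain_term_2>"])
tokenizer.add_special_tokens({"additional_special_tokens": ["<|sep|>"]})

# Resize the model's embedding layer so it matches the new vocabulary size,
# then (ideally) continue training so the new embeddings learn something useful
model.resize_token_embeddings(len(tokenizer))
```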


@nicholasKluge thank you so much! Really appreciate you taking the time to write such a detailed response! That has cleared things up a lot.

A follow-up question on usage:

I have found that llama.cpp offers an option to train a tiny llama from scratch. Is there a specific framework you would recommend to facilitate training from scratch? Or should I just train it with PyTorch & HuggingFace like any other model? The vast majority of online resources focus on fine-tuning, and I would prefer to have my own tokenizer rather than expanding the default one.

Hello @MikeMpapa! I don't know much about training Llamas via llama.cpp (I thought it was just an inference framework), but for something as small as a TinyLlama (< 1.1B parameters), you can definitely rely on bare-bones PyTorch and Transformers and still get decent MFU out of it. Something more elaborate like PyTorch Lightning (what they used in the TinyLlama paper) can also do the trick. I also have an implementation using Accelerate if you want to check it out.
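If you go the plain Transformers route, a from-scratch run is roughly shaped like this. This is only a sketch under some assumptions: the corpus path `my_corpus.txt`, the `hf-internal-testing/llama-tokenizer` base used to retrain a fresh tokenizer, and the tiny model sizes are all illustrative, not a recipe from the TinyLlama paper.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaConfig,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

# Plain-text corpus; "my_corpus.txt" is a placeholder path
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]

# Train your own tokenizer from scratch on the corpus, reusing an existing
# Llama tokenizer only as a template for its type and special tokens
base = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer = base.train_new_from_iterator(
    (raw[i : i + 1000]["text"] for i in range(0, len(raw), 1000)),
    vocab_size=32_000,
)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding in the collator

# Define a small Llama architecture and initialize it with random weights
config = LlamaConfig(
    vocab_size=len(tokenizer),
    hidden_size=512,
    intermediate_size=1376,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tiny-llama-from-scratch",
        per_device_train_batch_size=8,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Lightning or Accelerate wrap essentially the same loop and mainly add the multi-GPU and mixed-precision plumbing, so you can start with something like the above and move to one of them once you need to scale out.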