Why are there way more merges than vocabulary tokens in the Llama-3 tokenizer?

I am trying to understand how tokenization is done for the Llama-3 model, given that there are way more merges than tokens in the vocab. I am talking about the https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer.json file.
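For reference, this is roughly how I am counting them, a minimal sketch assuming a local download of that tokenizer.json (note the merge entries can be serialized as `"left right"` strings or as `[left, right]` pairs depending on the tokenizers version):

```python
import json

# Assumes tokenizer.json from the link above has been downloaded locally
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # mapping: token -> id
merges = tok["model"]["merges"]  # list of merge rules

print("vocab tokens:", len(vocab))
print("merge rules: ", len(merges))
```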

What happens to the merges that are in the merges list but not in the vocabulary? Why even have a bunch of merges that are not in the vocabulary?

Thank you very much!

I have been trying to figure out how this squares with the idea that there should be one merge per new token. In practice, the reason there are more merges for the Llama tokenizer is the way the model is converted from the tokenizer.model file into a tokenizer.json file (which contains both merges and vocab).
The script goes through each word in the vocab and lists the merge candidates, i.e. all of the possible merges that could have created the word: any merge where both subwords are also in the vocab (try this yourself and you'll see the merges line up; see the sketch below). This is done because it is not possible to know which of the candidate merges was actually used to create that word.
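
To illustrate that candidate enumeration, here is a minimal sketch (the `merge_candidates` helper and the toy vocab are hypothetical, not the actual conversion script):

```python
def merge_candidates(word: str, vocab: set[str]) -> list[tuple[str, str]]:
    """Every (left, right) split of `word` where both halves are in the vocab."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]

# Toy vocab: the single token "the" yields two candidate merges, not one
vocab = {"t", "h", "e", "th", "he", "the"}
print(merge_candidates("the", vocab))
# [('t', 'he'), ('th', 'e')]
```

Since a single vocab token can yield several candidate merges like this, the reconstructed merges list ends up larger than the vocabulary itself.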
