Why are there way more merges than vocabulary tokens in the Llama-3 tokenizer?

I am trying to understand how tokenization is done for the Llama-3 model, given that there are way more merges than tokens in the vocab. I am talking about the https://huggingface.co/meta-llama/Meta-Llama-3-8B/raw/main/tokenizer.json file.
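For reference, this is roughly how I am counting them, a minimal sketch assuming a local download of that tokenizer.json (note the merge entries can be serialized as `"left right"` strings or as `[left, right]` pairs depending on the tokenizers version):

```python
import json

# Assumes tokenizer.json from the link above has been downloaded locally
with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # mapping: token -> id
merges = tok["model"]["merges"]  # list of merge rules

print("vocab tokens:", len(vocab))
print("merge rules: ", len(merges))
```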

What happens to the merges that are in the merges list but not in the vocabulary? Why even have a bunch of merges that are not in the vocabulary?

Thank you very much!

I have been trying to figure out how this squares with the idea that there should be one merge per new token. In practice, the reason there are more merges for the Llama tokenizer is the way the model is converted from the tokenizer.model file into a tokenizer.json file (which contains both merges and vocab).
The script goes through each word in the vocab and lists the merge candidates, i.e. all of the possible merges that could have created the word: any merge where both subwords are also in the vocab (try this yourself and you'll see the merges line up; see the sketch below). This is done because it is not possible to know which of the candidate merges was actually used to create that word.
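
To illustrate that candidate enumeration, here is a minimal sketch (the `merge_candidates` helper and the toy vocab are hypothetical, not the actual conversion script):

```python
def merge_candidates(word: str, vocab: set[str]) -> list[tuple[str, str]]:
    """Every (left, right) split of `word` where both halves are in the vocab."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]

# Toy vocab: the single token "the" yields two candidate merges, not one
vocab = {"t", "h", "e", "th", "he", "the"}
print(merge_candidates("the", vocab))
# [('t', 'he'), ('th', 'e')]
```

Since a single vocab token can yield several candidate merges like this, the reconstructed merges list ends up larger than the vocabulary itself.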
