What is the difference between the tiktoken and sentencepiece implementations of BPE?

I found that Llama 3 uses tiktoken, and here is how Hugging Face describes the difference:

The tokenizer is a BPE model based on tiktoken (vs the one based on sentencepiece implementation for Llama2). The main difference that it ignores BPE merge rules when an input token is part of the vocab. This means that if no merge exist to produce “hugging”, instead of having the smallest units, like [“hug”,“ging”] form 2 tokens, if “hugging” is part of the vocab, it will be automatically returned as a token.

I don’t quite understand this description. Is there a more detailed explanation?


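In tiktoken, some commonly used words are added directly to the vocabulary as tokens, and the tokenizer first checks whether the whole input piece is already one of those tokens. If it is, that token is returned as-is. In contrast, the sentencepiece-based implementation strictly follows the BPE procedure and builds every token by applying the merge rules step by step.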

Using the example of “hugging”: in tiktoken, if “hugging” is already in the vocabulary, it is returned as the single token “hugging” and the BPE merge rules are skipped. In the sentencepiece-based implementation, the merge rules are always applied, so if no merge rule produces “hugging”, the result is [“hug”, “ging”] instead.
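To make the “hugging” case concrete, here is a minimal toy sketch in Python. It is not the actual tiktoken or sentencepiece code: the function `bpe_encode`, the merge list `MERGES`, the vocabulary `VOCAB`, and the `ignore_merges` flag are all made up for illustration. The merge list deliberately contains no rule that joins “hug” and “ging”, even though “hugging” is in the vocabulary.

```python
# Toy merge rules, in priority order (earlier = applied first).
# Note: there is NO rule ("hug", "ging") -> "hugging".
MERGES = [("h", "u"), ("hu", "g"), ("i", "n"), ("in", "g"), ("g", "ing")]

# Toy vocabulary: single characters, merge results, plus the whole word.
VOCAB = {"h", "u", "g", "i", "n", "hu", "hug", "in", "ing", "ging", "hugging"}


def bpe_encode(word, merges, vocab, ignore_merges=False):
    """Encode one pre-tokenized word with a toy BPE.

    ignore_merges=True  -> tiktoken-like shortcut described above
    ignore_merges=False -> always apply the merge rules
    """
    # Shortcut: if the whole word is a vocabulary entry, return it directly.
    if ignore_merges and word in vocab:
        return [word]

    ranks = {pair: rank for rank, pair in enumerate(merges)}
    parts = list(word)  # start from single characters

    while len(parts) > 1:
        # Collect every adjacent pair that has a merge rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(parts, parts[1:]))
            if pair in ranks
        ]
        if not candidates:
            break  # no applicable merge rule is left
        # Apply the highest-priority (lowest-rank) merge.
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]

    return parts


print(bpe_encode("hugging", MERGES, VOCAB, ignore_merges=False))  # ['hug', 'ging']
print(bpe_encode("hugging", MERGES, VOCAB, ignore_merges=True))   # ['hugging']
```

With the merge rules alone, the encoder can only get as far as “hug” + “ging”, because no rule joins those two pieces. With the vocabulary check in front, the whole word is returned as one token without looking at the merge rules at all.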

Some references: x.com
