I found that Llama 3 uses tiktoken, and here is the difference as described by Hugging Face:
The tokenizer is a BPE model based on tiktoken (vs. the one based on the sentencepiece implementation for Llama 2). The main difference is that it ignores BPE merge rules when an input token is part of the vocab. This means that if no merge rule exists to produce “hugging”, then instead of splitting it into smaller units such as [“hug”, “ging”] (2 tokens), the tokenizer will return “hugging” directly as a single token, as long as “hugging” is part of the vocab.
I don’t quite understand this description. Is there a more detailed explanation?
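To pin down what I think the docs are saying, here is a minimal toy sketch of the two behaviors. This is not the real Llama 2 / Llama 3 tokenizer code; the vocab, the merge list, and both function names are made up purely for illustration:

```python
# Hypothetical toy vocab: "hugging" exists as a whole token,
# but no merge rule ever produces it.
vocab = {"hug", "ging", "hugging"}
merges = [("h", "u"), ("hu", "g"), ("g", "i"), ("gi", "n"), ("gin", "g")]

def bpe_encode(word, merges):
    """Classic BPE: only the merge rules decide the output.
    (Real BPE applies merges by learned priority; this toy version
    just scans the list in order, which is enough for the example.)"""
    pieces = list(word)
    while True:
        for a, b in merges:
            for i in range(len(pieces) - 1):
                if pieces[i] == a and pieces[i + 1] == b:
                    pieces[i:i + 2] = [a + b]  # apply the merge
                    break
            else:
                continue  # this rule didn't apply; try the next one
            break  # a merge was applied; rescan from the first rule
        else:
            return pieces  # no rule applies anymore; done

def tiktoken_style_encode(word, merges, vocab):
    """Behavior per the quoted docs: if the whole input chunk is
    already in the vocab, return it directly and skip the merges."""
    if word in vocab:
        return [word]
    return bpe_encode(word, merges)

print(bpe_encode("hugging", merges))                   # ['hug', 'ging']
print(tiktoken_style_encode("hugging", merges, vocab)) # ['hugging']
```

If I am reading the description right, the sentencepiece-based Llama 2 tokenizer behaves like `bpe_encode` here (the merge rules alone determine the split, so “hugging” comes out as 2 tokens), while the tiktoken-based Llama 3 tokenizer behaves like `tiktoken_style_encode` (the vocab lookup wins, so “hugging” comes out as 1 token). Is that the correct reading, or is there more to it?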