If I download the SentencePiece tokenizer file from HF,
for instance this one: codellama/CodeLlama-7b-Python-hf/tokenizer.model,
and then tokenize with it directly, I get very different results from the AutoTokenizer loaded from the same repo.
from sentencepiece import SentencePieceProcessor
from transformers import AutoTokenizer, AutoModelForCausalLM
sentencepiece_tok = SentencePieceProcessor(model_file="checkpoints/codellama/CodeLlama-7b-Python-hf/tokenizer.model")
hf_tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Python-hf", trust_remote_code=True)
prompt = '<unk> <s> </s>'
encoded_string_sp = sentencepiece_tok.encode(prompt)
encoded_string_hf = hf_tok.encode(prompt)
print(encoded_string_sp, encoded_string_hf)
out> [529, 2960, 29958, 529, 29879, 29958, 1533, 29879, 29958] [1, 0, 259, 1, 259, 2]
I’m really struggling to figure out what’s going on here.
Shouldn’t the two tokenizers produce the same ids for the same string?
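My current guess (a minimal, self-contained sketch, not the actual transformers implementation): the HF tokenizer first splits the input on registered special-token strings like `<s>` and maps them straight to their reserved ids, running SentencePiece only on the text in between, while the raw SentencePieceProcessor sees `<unk> <s> </s>` as literal characters and tokenizes them as ordinary text. The ids here (`<unk>`=0, `<s>`=1, `</s>`=2) are what the HF tokenizer reports for this repo; the splitting function is hypothetical.

```python
import re

# Special tokens and their ids as reported by the HF tokenizer for this repo
# (assumption: <unk>=0, <s>=1, </s>=2, the standard Llama reserved ids).
SPECIAL = {"<unk>": 0, "<s>": 1, "</s>": 2}

def split_on_specials(text):
    # Hypothetical helper: split the input on special-token strings, keeping
    # the matched tokens as standalone chunks (re.split keeps capture groups).
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_specials("<unk> <s> </s>"))
# → ['<unk>', ' ', '<s>', ' ', '</s>']
# The special-token chunks map directly to ids 0, 1, 2; only the plain " "
# chunks would be fed through the SentencePiece model. A raw
# SentencePieceProcessor never does this split, so "<s>" is encoded as the
# literal pieces for "<", "s", ">" instead.
```

If that's right, the 9 ids from SentencePiece are just the character-level encoding of the literal string, and the behaviors only coincide on text that contains no special-token strings.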