If I download the SentencePiece tokenizer file from HF,
for instance this one: codellama/CodeLlama-7b-Python-hf/tokenizer.model,
and then tokenize with it directly, I get very different results from the AutoTokenizer loaded from the same repo.
from sentencepiece import SentencePieceProcessor
from transformers import AutoTokenizer, AutoModelForCausalLM
sentencepiece_tok = SentencePieceProcessor(model_file="checkpoints/codellama/CodeLlama-7b-Python-hf/tokenizer.model")
hf_tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Python-hf", trust_remote_code=True)
prompt = '<unk> <s> </s>'
encoded_string_sp = sentencepiece_tok.encode(prompt)
encoded_string_hf = hf_tok.encode(prompt)
print(encoded_string_sp, encoded_string_hf)
out> [529, 2960, 29958, 529, 29879, 29958, 1533, 29879, 29958] [1, 0, 259, 1, 259, 2]
I’m really struggling to figure out what’s going on here.
Shouldn’t the two tokenizers produce the same ids for the same string?
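My current guess (a minimal, self-contained sketch, not the actual transformers implementation): the HF tokenizer first splits the input on registered special-token strings like `<s>` and maps them straight to their reserved ids, running SentencePiece only on the text in between, while the raw SentencePieceProcessor sees `<unk> <s> </s>` as literal characters and tokenizes them as ordinary text. The ids here (`<unk>`=0, `<s>`=1, `</s>`=2) are what the HF tokenizer reports for this repo; the splitting function is hypothetical.

```python
import re

# Special tokens and their ids as reported by the HF tokenizer for this repo
# (assumption: <unk>=0, <s>=1, </s>=2, the standard Llama reserved ids).
SPECIAL = {"<unk>": 0, "<s>": 1, "</s>": 2}

def split_on_specials(text):
    # Hypothetical helper: split the input on special-token strings, keeping
    # the matched tokens as standalone chunks (re.split keeps capture groups).
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_specials("<unk> <s> </s>"))
# → ['<unk>', ' ', '<s>', ' ', '</s>']
# The special-token chunks map directly to ids 0, 1, 2; only the plain " "
# chunks would be fed through the SentencePiece model. A raw
# SentencePieceProcessor never does this split, so "<s>" is encoded as the
# literal pieces for "<", "s", ">" instead.
```

If that's right, the 9 ids from SentencePiece are just the character-level encoding of the literal string, and the behaviors only coincide on text that contains no special-token strings.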