Tokenization compared to sentencepiece

Hi!
Recently I tried to use the microsoft/Phi-3-mini-4k-instruct tokenizer and found that the output of the transformers tokenizer differs from the output of the sentencepiece tokenizer. Here's a sample reproducing the issue:

from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is capital city of France?"}]

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
formatted_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# '<|user|>\nWhat is capital city of France?<|end|>\n<|assistant|>\n'
tokenizer(formatted_text, return_tensors="pt")
# {'input_ids': tensor([[32010,  1724,   338,  7483,  4272,   310,  3444, 29973, 32007, 32001]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="...")
sp.encode(formatted_text)
# [32010, 13, 5618, 338, 7483, 4272, 310, 3444, 29973, 32007, 13, 32001, 13]

Do you know why transformers modifies the original text? Is this the correct behaviour? It removes the newlines and adds a space: 13 is the newline token, while 5618 → "What" and 1724 → "▁What" (with the sentencepiece word-boundary marker).
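For illustration, here is a minimal sketch (my assumption about the mechanism, not the actual transformers code path) of how such a discrepancy can arise: if the tokenizer first splits the text on added special tokens, and those tokens are configured to strip the whitespace that follows them, then the newlines never reach the sentencepiece model, and the remaining segment "What is …" gets sentencepiece's dummy-prefix space, turning "What" (5618) into "▁What" (1724). The `split_on_specials` helper below is hypothetical:

```python
import re

# Hypothetical sketch: specials are split out first, and the trailing \s*
# mimics rstrip on an added token -- the newline after each special token
# is captured together with it and never reaches the subword model.
SPECIALS = ["<|user|>", "<|end|>", "<|assistant|>"]

def split_on_specials(text):
    alts = "|".join(re.escape(s) for s in SPECIALS)
    pattern = r"((?:%s)\s*)" % alts
    # re.split with a capture group keeps the matched specials in the result
    return [seg for seg in re.split(pattern, text) if seg]

text = "<|user|>\nWhat is capital city of France?<|end|>\n<|assistant|>\n"
print(split_on_specials(text))
# The newlines travel with the special tokens; the plain text segment is
# then encoded on its own, so sentencepiece prepends its dummy prefix "▁".
```

Under this assumption the two pipelines see different inputs, which would explain both the missing 13s and the 5618 → 1724 swap, but I'd like to confirm whether that is really what transformers does here.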