Tokenization compared to sentencepiece

Hi!
Recently I tried to use the microsoft/Phi-3-mini-4k-instruct tokenizer and found that the output of the transformers tokenizer differs from the output of the sentencepiece tokenizer. Here's a sample reproducing the issue:

from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is capital city of France?"}]

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
formatted_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# '<|user|>\nWhat is capital city of France?<|end|>\n<|assistant|>\n'
tokenizer(formatted_text, return_tensors="pt")
# {'input_ids': tensor([[32010,  1724,   338,  7483,  4272,   310,  3444, 29973, 32007, 32001]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="...")
sp.encode(formatted_text)
# [32010, 13, 5618, 338, 7483, 4272, 310, 3444, 29973, 32007, 13, 32001, 13]

Do you know why transformers modifies the original text? Is this the correct behaviour? It removes the newlines and adds a space: 13 is the newline token, while 5618 → "What" and 1724 → "▁What" (with the sentencepiece word-boundary marker).
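For illustration, here is a minimal sketch (my assumption about the mechanism, not the actual transformers code path) of how such a discrepancy can arise: if the tokenizer first splits the text on added special tokens, and those tokens are configured to strip the whitespace that follows them, then the newlines never reach the sentencepiece model, and the remaining segment "What is …" gets sentencepiece's dummy-prefix space, turning "What" (5618) into "▁What" (1724). The `split_on_specials` helper below is hypothetical:

```python
import re

# Hypothetical sketch: specials are split out first, and the trailing \s*
# mimics rstrip on an added token -- the newline after each special token
# is captured together with it and never reaches the subword model.
SPECIALS = ["<|user|>", "<|end|>", "<|assistant|>"]

def split_on_specials(text):
    alts = "|".join(re.escape(s) for s in SPECIALS)
    pattern = r"((?:%s)\s*)" % alts
    # re.split with a capture group keeps the matched specials in the result
    return [seg for seg in re.split(pattern, text) if seg]

text = "<|user|>\nWhat is capital city of France?<|end|>\n<|assistant|>\n"
print(split_on_specials(text))
# The newlines travel with the special tokens; the plain text segment is
# then encoded on its own, so sentencepiece prepends its dummy prefix "▁".
```

Under this assumption the two pipelines see different inputs, which would explain both the missing 13s and the 5618 → 1724 swap, but I'd like to confirm whether that is really what transformers does here.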