SentencePiece to Tokenizers conversion

Hey! I have trained a BPE model with the sentencepiece library. I am converting it to tokenizers and have an issue with whitespace recognition.

I am using this script to convert the tokenizer:

import sentencepiece as spm
from transformers import convert_slow_tokenizer
from transformers import PreTrainedTokenizerFast

# Load the trained SentencePiece model and expose its path the way SpmConverter expects
spm_tokenizer = spm.SentencePieceProcessor('tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model')
spm_tokenizer.vocab_file = 'tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model'

# Convert to a tokenizers-backed (fast) tokenizer and save it
spm_converter = convert_slow_tokenizer.SpmConverter(spm_tokenizer)
converted = spm_converter.converted()
converted.save('converted.json')

# Wrap the converted tokenizer in PreTrainedTokenizerFast with my special tokens
tok = PreTrainedTokenizerFast.from_pretrained(
    pretrained_model_name_or_path="HuggingFaceTB/SmolLM-1.7B",
    tokenizer_file='converted.json',
    clean_up_tokenization_spaces=False,
    pad_token='<|finetune_right_pad_id|>',
    unk_token='<unknown>',
    bos_token='<|start_of_sequence|>',
    eos_token='<|end_of_sequence|>',
    model_max_length=1024,
    padding_side='right',
    truncation_side='right',
)
tok.save_pretrained('ConvertedTokenizer')

Generally tokenisation aligns well, except for whitespaces.
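
This is a minimal sketch of how I compare the two (the test string is arbitrary, just for illustration; the paths are the same ones used in the conversion script above):

import sentencepiece as spm
from transformers import PreTrainedTokenizerFast

text = "Hello  world, this  has double spaces."  # arbitrary test string

spm_tok = spm.SentencePieceProcessor('tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model')
fast_tok = PreTrainedTokenizerFast.from_pretrained('ConvertedTokenizer')

print(spm_tok.encode(text, out_type=str))  # pieces from the original SentencePiece model
print(fast_tok.tokenize(text))             # pieces from the converted fast tokenizer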

Does anyone have any idea what could be the issue? The desired behaviour of the new tokenizer is what spm_tokenizer produces.
