Hey! I have trained a BPE model with the sentencepiece library. I am converting it to the tokenizers format and am running into an issue with whitespace handling.
I am using this script to do the conversion:
import sentencepiece as spm
from transformers import convert_slow_tokenizer, PreTrainedTokenizerFast

spm_tokenizer = spm.SentencePieceProcessor(model_file='tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model')
# SpmConverter expects the model path on a vocab_file attribute, so attach it manually
spm_tokenizer.vocab_file = 'tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model'

spm_converter = convert_slow_tokenizer.SpmConverter(spm_tokenizer)
converted = spm_converter.converted()
converted.save('converted.json')
tok = PreTrainedTokenizerFast.from_pretrained(
    pretrained_model_name_or_path='HuggingFaceTB/SmolLM-1.7B',
    tokenizer_file='converted.json',
    clean_up_tokenization_spaces=False,
    pad_token='<|finetune_right_pad_id|>',
    unk_token='<unknown>',
    bos_token='<|start_of_sequence|>',
    eos_token='<|end_of_sequence|>',
    model_max_length=1024,
    padding_side='right',
    truncation_side='right',
)
tok.save_pretrained('ConvertedTokenizer')
Generally, the tokenisation aligns well, except for whitespace.
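For example, here is a quick comparison I run on both tokenizers (the sample sentence is arbitrary):

text = 'The quick  brown fox'  # note the double space
print(spm_tokenizer.encode(text, out_type=str))  # pieces from the original SPM model
print(tok.tokenize(text))  # pieces from the converted fast tokenizer; the whitespace pieces come out differently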
Does anyone have any idea what the issue could be? The desired behaviour for the new tokenizer is to match spm_tokenizer.
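In case it helps with diagnosing, this is how I inspect the parts of converted.json that I believe are responsible for whitespace (the Metaspace pre-tokenizer and the matching decoder, though I may be looking in the wrong place):

import json

with open('converted.json') as f:
    spec = json.load(f)
print(spec['normalizer'])     # normalisation applied before pre-tokenisation
print(spec['pre_tokenizer'])  # typically a Metaspace step that maps spaces to '▁'
print(spec['decoder'])        # should mirror the pre-tokenizer so decoding restores spaces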