SentencePiece to Tokenizers conversion

Hey! I have trained a BPE model with the sentencepiece library. I am converting it to tokenizers and have an issue with whitespace recognition.

I am using this script to convert the tokenizer:

import sentencepiece as spm
from transformers import convert_slow_tokenizer
from transformers import PreTrainedTokenizerFast

# Load the trained SentencePiece model and expose its path the way SpmConverter expects
spm_tokenizer = spm.SentencePieceProcessor('tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model')
spm_tokenizer.vocab_file = 'tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model'

# Convert to a tokenizers-backed (fast) tokenizer and save it
spm_converter = convert_slow_tokenizer.SpmConverter(spm_tokenizer)
converted = spm_converter.converted()
converted.save('converted.json')

# Wrap the converted tokenizer in PreTrainedTokenizerFast with my special tokens
tok = PreTrainedTokenizerFast.from_pretrained(
    pretrained_model_name_or_path="HuggingFaceTB/SmolLM-1.7B",
    tokenizer_file='converted.json',
    clean_up_tokenization_spaces=False,
    pad_token='<|finetune_right_pad_id|>',
    unk_token='<unknown>',
    bos_token='<|start_of_sequence|>',
    eos_token='<|end_of_sequence|>',
    model_max_length=1024,
    padding_side='right',
    truncation_side='right',
)
tok.save_pretrained('ConvertedTokenizer')

Generally tokenisation aligns well, except for whitespaces.
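
This is a minimal sketch of how I compare the two (the test string is arbitrary, just for illustration; the paths are the same ones used in the conversion script above):

import sentencepiece as spm
from transformers import PreTrainedTokenizerFast

text = "Hello  world, this  has double spaces."  # arbitrary test string

spm_tok = spm.SentencePieceProcessor('tokeniser_training/tokenizer_fineweb_balanced_bpe_128000.model')
fast_tok = PreTrainedTokenizerFast.from_pretrained('ConvertedTokenizer')

print(spm_tok.encode(text, out_type=str))  # pieces from the original SentencePiece model
print(fast_tok.tokenize(text))             # pieces from the converted fast tokenizer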

Does anyone have any idea what could be the issue? The desired behaviour of the new tokenizer is what spm_tokenizer produces.
