SentencePiece user_defined_symbols and fast tokenizers

I have a trained SentencePiece (BPE) model that I want to load through AutoTokenizer as a fast tokenizer. This works with the slow T5 or Llama tokenizer classes, but when I load it with one of the fast classes, the encoding ignores the user_defined_symbols I specified at training time. Is there a way around this? Thank you.


Train with spm_train, adding one user-defined symbol:

spm_train --input=./botchan.txt --model_prefix=my_test --vocab_size=8000 --character_coverage=1.0 --model_type=bpe --user_defined_symbols=kwyjibo

Load and test with HF: the slow version encodes the custom word as a single token, but the fast version breaks it up into several tokens.

>>> from transformers import AutoTokenizer, T5Tokenizer, T5TokenizerFast
>>> slow = T5Tokenizer(vocab_file="./my_test.model", use_fast=False)
>>> e = slow.encode("there goes a kwyjibo")
>>> e
[178, 1263, 6, 7917, 3, 2]
>>> print(slow.convert_ids_to_tokens(e))
['▁there', '▁goes', '▁a', '▁', 'kwyjibo', '</s>']
>>> fast = T5TokenizerFast(vocab_file="./my_test.model", use_fast=True)
>>> f = fast.encode("there goes a kwyjibo")
>>> f
[178, 1263, 6, 152, 7932, 7935, 7953, 1136, 7920, 2]
>>> print(fast.convert_ids_to_tokens(f))
['▁there', '▁goes', '▁a', '▁k', 'w', 'y', 'j', 'ib', 'o', '</s>']

I’m no expert, but as a suggestion: you could omit “user_defined_symbols” when training the SentencePiece (BPE) model and instead use the add_tokens() method to add the custom tokens after loading. Would that achieve your intended outcome?
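
A minimal sketch of that add_tokens() idea, for illustration only. Since the my_test.model file from the question isn't available here, it assumes a tiny stand-in BPE tokenizer trained in memory with the tokenizers library; the corpus and vocab_size are made up. With the real model, the same add_tokens() call would presumably go on the `fast` object from the question.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Hypothetical toy corpus standing in for botchan.txt (assumption, not from the question).
corpus = ["there goes a cat", "there goes a dog", "a cat goes there"] * 10

# Train a small BPE tokenizer in memory so the example needs no model file.
backend = Tokenizer(models.BPE(unk_token="<unk>"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=60, special_tokens=["<unk>"], show_progress=False
)
backend.train_from_iterator(corpus, trainer=trainer)

# Wrap it as a Hugging Face fast tokenizer.
fast = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="<unk>")

# Before: the custom word is not in the vocabulary, so it is split up.
before = fast.tokenize("there goes a kwyjibo")

# Register the custom word as an added token; returns how many were new.
num_added = fast.add_tokens(["kwyjibo"])

# After: added tokens are matched before the BPE model runs.
after = fast.tokenize("there goes a kwyjibo")

print(before)
print(num_added)
print(after)
```

Note that tokens added this way get new ids at the end of the vocabulary, so if a model is already trained against the old vocabulary, its embedding matrix would need resizing (for example with resize_token_embeddings) before use.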