I have a trained SentencePiece (BPE) model that I want to load through AutoTokenizer as a fast tokenizer. Loading works with the slow T5 or Llama tokenizer classes, but when I load the model with one of the fast classes, the encoding ignores the user_defined_symbols I specified when training. Is there a way around this? Thank you.
Example:
Train with spm_train, adding one user-defined symbol:

```
spm_train --input=./botchan.txt --model_prefix=my_test --vocab_size=8000 --character_coverage=1.0 --model_type=bpe --user_defined_symbols=kwyjibo
```
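(As a sanity check, the trained model itself looks fine: a user-defined symbol should always come back as a single piece when encoding with the plain sentencepiece Python API. A minimal check, assuming the sentencepiece package is installed:)

```python
import sentencepiece as spm

# Encode with sentencepiece directly; a user-defined symbol
# should come back as one piece, never split by BPE.
sp = spm.SentencePieceProcessor(model_file="./my_test.model")
print(sp.encode("there goes a kwyjibo", out_type=str))
# expected: ['▁there', '▁goes', '▁a', '▁', 'kwyjibo']
```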
Load and test with HF: the slow version encodes the custom word as a single token, but the fast version breaks it up into many tokens.
```
>>> from transformers import AutoTokenizer, T5Tokenizer, T5TokenizerFast
>>> slow = T5Tokenizer(vocab_file="./my_test.model")
>>> e = slow.encode("there goes a kwyjibo")
>>> e
[178, 1263, 6, 7917, 3, 2]
>>> print(slow.convert_ids_to_tokens(e))
['▁there', '▁goes', '▁a', '▁', 'kwyjibo', '</s>']
>>> fast = T5TokenizerFast(vocab_file="./my_test.model")
>>> f = fast.encode("there goes a kwyjibo")
>>> f
[178, 1263, 6, 152, 7932, 7935, 7953, 1136, 7920, 2]
>>> print(fast.convert_ids_to_tokens(f))
['▁there', '▁goes', '▁a', '▁k', 'w', 'y', 'j', 'ib', 'o', '</s>']
```
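One workaround I've been considering (a sketch only, using the standard add_tokens API) is to register the symbol as an added token on the fast tokenizer after loading. Since the string is already in the vocabulary, the existing id should be reused rather than a new one appended, though I haven't confirmed that's guaranteed:

```python
from transformers import T5TokenizerFast

fast = T5TokenizerFast(vocab_file="./my_test.model")
# Register the user-defined symbol as an added token so the fast
# tokenizer's pre-tokenization splits it out as a single piece.
fast.add_tokens(["kwyjibo"])
print(fast.convert_ids_to_tokens(fast.encode("there goes a kwyjibo")))
```

That feels like papering over the conversion step, though; ideally the converted fast tokenizer would pick up user_defined_symbols automatically.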