I have a trained SentencePiece (BPE) model that I want to load through AutoTokenizer as a fast tokenizer. Loading works with the slow T5 or Llama tokenizer classes, but when I load the model with one of the fast classes, the encoding ignores the user_defined_symbols I specified when training. Is there a way around this? Thank you.
Example:
Train with spm_train, adding one user-defined symbol:

```
spm_train --input=./botchan.txt --model_prefix=my_test --vocab_size=8000 --character_coverage=1.0 --model_type=bpe --user_defined_symbols=kwyjibo
```
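(As a sanity check, the trained model itself looks fine: a user-defined symbol should always come back as a single piece when encoding with the plain sentencepiece Python API. A minimal check, assuming the sentencepiece package is installed:)

```python
import sentencepiece as spm

# Encode with sentencepiece directly; a user-defined symbol
# should come back as one piece, never split by BPE.
sp = spm.SentencePieceProcessor(model_file="./my_test.model")
print(sp.encode("there goes a kwyjibo", out_type=str))
# expected: ['▁there', '▁goes', '▁a', '▁', 'kwyjibo']
```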
Load and test with HF: the slow version encodes the custom word as a single token, but the fast version breaks it up into many tokens.
```
>>> from transformers import AutoTokenizer, T5Tokenizer, T5TokenizerFast
>>> slow = T5Tokenizer(vocab_file="./my_test.model")
>>> e = slow.encode("there goes a kwyjibo")
>>> e
[178, 1263, 6, 7917, 3, 2]
>>> print(slow.convert_ids_to_tokens(e))
['▁there', '▁goes', '▁a', '▁', 'kwyjibo', '</s>']
>>> fast = T5TokenizerFast(vocab_file="./my_test.model")
>>> f = fast.encode("there goes a kwyjibo")
>>> f
[178, 1263, 6, 152, 7932, 7935, 7953, 1136, 7920, 2]
>>> print(fast.convert_ids_to_tokens(f))
['▁there', '▁goes', '▁a', '▁k', 'w', 'y', 'j', 'ib', 'o', '</s>']
```
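One workaround I've been considering (a sketch only, using the standard add_tokens API) is to register the symbol as an added token on the fast tokenizer after loading. Since the string is already in the vocabulary, the existing id should be reused rather than a new one appended, though I haven't confirmed that's guaranteed:

```python
from transformers import T5TokenizerFast

fast = T5TokenizerFast(vocab_file="./my_test.model")
# Register the user-defined symbol as an added token so the fast
# tokenizer's pre-tokenization splits it out as a single piece.
fast.add_tokens(["kwyjibo"])
print(fast.convert_ids_to_tokens(fast.encode("there goes a kwyjibo")))
```

That feels like papering over the conversion step, though; ideally the converted fast tokenizer would pick up user_defined_symbols automatically.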