Tokenizer from tokenizers library cannot be used in transformers.Trainer

Hi,

I am trying to train my own model with Trainer, using a pre-trained SentencePieceBPETokenizer from the tokenizers library. However, it is missing several attributes and methods (e.g., pad), which makes it incompatible with transformers.Trainer. Is there an easy way to convert it to a PreTrainedTokenizer from transformers?
Thanks!

If you want a SentencePiece tokenizer, you could train it with the sentencepiece library and then pass the trained model as a parameter to the tokenizer class you want, such as T5 or Bart. That way the vocab will be yours and the chosen tokenizer class will handle the padding. I am not sure whether it will handle the special tokens, though.

sgugger helped with the solution; we simply need
transformers.PreTrainedTokenizerFast(tokenizer_object=my_tokenizer).
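For anyone landing here later, here is a minimal sketch of that wrapping step, end to end. The tiny corpus, the vocab size, and the special-token strings are made up for illustration. The round trip through to_str() / Tokenizer.from_str() is my own precaution, on the assumption that tokenizer_object is documented as a plain tokenizers.Tokenizer; it may not be needed in your version.

```python
from tokenizers import SentencePieceBPETokenizer, Tokenizer
from transformers import PreTrainedTokenizerFast

# Toy corpus and special tokens: illustrative assumptions, not from the thread.
corpus = ["hello world", "hello tokenizers", "padding makes batches rectangular"]
special_tokens = ["<unk>", "<pad>", "<s>", "</s>"]

# Train a small SentencePiece-style BPE tokenizer in memory.
my_tokenizer = SentencePieceBPETokenizer(unk_token="<unk>")
my_tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,
    min_frequency=1,
    special_tokens=special_tokens,
    show_progress=False,
)

# Precaution (assumption): convert the wrapper class into a plain
# tokenizers.Tokenizer by serializing it to JSON and loading it back.
raw_tokenizer = Tokenizer.from_str(my_tokenizer.to_str())

# Wrap it so that it exposes the transformers tokenizer interface
# (pad, __call__, special-token attributes, and so on).
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
)

# Padding now works because a pad token is registered on the wrapper.
batch = hf_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Note that the special tokens have to be passed to PreTrainedTokenizerFast explicitly; the wrapper does not infer which vocabulary entries play the pad/unk/bos/eos roles. The resulting hf_tokenizer is the object to hand to Trainer.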
