I am trying to train my own model with Trainer, using a pre-trained SentencePieceBPETokenizer from the tokenizers library. However, it is missing several attributes and methods (e.g., pad), which makes it incompatible with transformers.Trainer. Is there an easy way to convert it to a PreTrainedTokenizer from transformers?
Thanks!
If you want a SentencePiece tokenizer, you should train it with the sentencepiece library, then pass the trained model file as a parameter to a tokenizer class that is built on SentencePiece, such as the T5 or MBart tokenizer. That way the vocabulary will be yours and the chosen tokenizer class will handle the padding. I am not sure whether it will handle the special tokens, though.
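A rough sketch of that route, in case it helps. It assumes sentencepiece and transformers are installed and that T5Tokenizer is the target class. The function names, the corpus path, the vocabulary size and extra_ids=0 are placeholders I picked for illustration, not anything from the original post. As far as I know T5 expects <pad>, </s> and <unk> at ids 0, 1 and 2, so the sketch reserves them at training time, which should also take care of the special-token question for this particular class.

```python
def sentencepiece_train_args(corpus_path, model_prefix, vocab_size=8000):
    """Keyword arguments for spm.SentencePieceTrainer.train (illustrative values)."""
    return {
        "input": str(corpus_path),          # plain-text corpus, one sentence per line
        "model_prefix": str(model_prefix),  # training writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
        # Assumption: reserve special-token ids in the layout T5Tokenizer uses
        # (<pad>=0, </s>=1, <unk>=2, no BOS). Adjust for other tokenizer classes.
        "pad_id": 0,
        "eos_id": 1,
        "unk_id": 2,
        "bos_id": -1,
    }


def train_and_wrap(corpus_path, model_prefix, vocab_size=8000):
    """Train a SentencePiece BPE model and load it into a transformers tokenizer."""
    # Imported here so the helper above stays usable without the heavy dependencies.
    import sentencepiece as spm
    from transformers import T5Tokenizer

    spm.SentencePieceTrainer.train(
        **sentencepiece_train_args(corpus_path, model_prefix, vocab_size)
    )
    # extra_ids=0 skips T5's <extra_id_N> sentinel tokens; keep the default
    # if the model you train actually needs them.
    return T5Tokenizer(vocab_file=f"{model_prefix}.model", extra_ids=0)
```

Something like tok = train_and_wrap("corpus.txt", "my_sp") followed by tok(["first text", "a second, longer text"], padding=True) should then return padded batches, and the same object can be handed to Trainer or a data collator. The model's embedding size has to match len(tok).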