I am trying to train my own model with Trainer and a pre-trained SentencePieceBPETokenizer from the tokenizers library. However, it is missing several attributes as well as methods (e.g., pad), which makes it incompatible with transformers.Trainer. Is there an easy way to convert it to
If you want a SentencePiece tokenizer, you should use the sentencepiece library, then pass the trained model file as a parameter to the desired tokenizer class, such as the one for T5 or Bart. This way the vocabulary will be yours and the chosen tokenizer will handle the padding. I am not sure whether it will handle the special tokens, though.
sgugger helped with the solution, we simply need