Tokenizer from tokenizers library cannot be used in transformers.Trainer

Aktsvigun · July 28, 2021, 5:04pm

Hi,

I am trying to train my own model with Trainer using a pre-trained SentencePieceBPETokenizer from the tokenizers library. However, it is missing several attributes and methods (e.g., pad), which makes it incompatible with transformers.Trainer. Is there an easy way to convert it to a PreTrainedTokenizer from transformers?
Thanks!

prikmm · July 28, 2021, 5:59pm

If you want a SentencePiece tokenizer, you should use the sentencepiece library, then pass the trained model as a parameter to the desired tokenizer class, such as T5 or Bart. That way the vocab will be yours and the chosen tokenizer will handle the padding. I am not sure whether it will handle the special tokens, though.

Aktsvigun · July 30, 2021, 11:21am

sgugger helped with the solution: we simply need
transformers.PreTrainedTokenizerFast(tokenizer_object=my_tokenizer).
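
For anyone landing here later, a minimal sketch of that wrapping step might look like the following. It is not the exact setup from this thread: it assumes a plain tokenizers.Tokenizer with a BPE model (rather than SentencePieceBPETokenizer), a toy in-memory corpus, and the special-token strings "<unk>" and "<pad>", all chosen only for illustration. Passing pad_token explicitly is what lets the wrapped tokenizer pad batches.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Toy corpus, purely illustrative (an assumption, not from the thread).
corpus = [
    "hello world",
    "hello tokenizers",
    "training a tiny tokenizer from scratch",
]

# Build and train a small BPE tokenizer with the tokenizers library.
my_tokenizer = Tokenizer(BPE(unk_token="<unk>"))
my_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["<unk>", "<pad>"],  # assumed special tokens
    show_progress=False,
)
my_tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap it so it exposes the transformers tokenizer interface (pad, __call__, ...).
# The special tokens must be declared again here; the wrapper does not infer them.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=my_tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
)

# Padding now works, so the result can be batched the way Trainer expects.
batch = wrapped(["hello world", "training a tiny tokenizer"], padding=True)
print(batch["input_ids"])
```

The wrapped object can then be handed to Trainer in place of the raw tokenizer. If your tokenizer needs other special tokens (bos, eos, mask), they would presumably have to be passed to PreTrainedTokenizerFast in the same way.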
