T5 model tokenizer

Do T5 models use BPE tokenizers? Is it possible to use another type of tokenizer with a T5 model, or are they designed to work only with BPE?

AFAIK T5 uses SentencePiece (GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation), which has BPE implemented, and therefore depends on it.
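For reference, you can inspect the tokenizer that ships with a pretrained T5 checkpoint to see that it is SentencePiece-based. A minimal sketch (the checkpoint name `t5-small` is just an example):

```python
# Load the stock T5 tokenizer from 🤗 Transformers and look at its output.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
print(type(tok))                      # T5TokenizerFast, backed by a SentencePiece model
print(tok.tokenize("Hello world!"))   # subword pieces prefixed with "▁"
```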

Why would you like to use another tokenizer?

The question was answered on Discord:

If you’re training from scratch, then you would typically train a tokenizer on your own data, in which case you can choose which tokenizer training algorithm to use (BPE, WordPiece, or Unigram LM if you’re using 🤗 Tokenizers) and how to preprocess the data before tokenizing it. I can recommend this chapter of the HF course to learn more about tokenizers: Introduction - Hugging Face Course
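As a rough sketch (not from the original thread), training your own tokenizer with the 🤗 Tokenizers library looks something like the snippet below. The corpus file name, vocabulary size, and special tokens are placeholders you would adapt to your own setup; Unigram is shown because it is the closest match to T5’s SentencePiece tokenizer, but you could swap in `models.BPE()` with `trainers.BpeTrainer` or `models.WordPiece()` with `trainers.WordPieceTrainer`:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Choose the training algorithm: BPE, WordPiece, or Unigram.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "</s>", "<unk>"],  # T5-style special tokens
    unk_token="<unk>",
)

# Train on your own raw-text files and save the result.
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")
```

The saved tokenizer can then be wrapped with `PreTrainedTokenizerFast(tokenizer_object=tokenizer)` from Transformers if you want to use it alongside a T5 model you are training from scratch.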
