PreTrainedTokenizerFast.convert_tokens_to_string always assumes the presence of decoder

According to the `tokenizers` documentation, `decoder` is an optional property of `Tokenizer`, but `PreTrainedTokenizerFast.convert_tokens_to_string` (in `src/transformers/tokenization_utils_fast.py` at main · huggingface/transformers · GitHub) never checks whether `decoder` is `None`. Some tokenizers (such as those trained with `WordLevelTrainer`) do not have decoders, and this causes problems for projects such as TGI and Outlines, which rely on the `convert_tokens_to_string` method.
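
Here is a minimal reproduction sketch (the toy vocabulary is made up for illustration): a fast tokenizer wrapped around a `WordLevel` model with no decoder configured leaves `backend_tokenizer.decoder` as `None`, so `convert_tokens_to_string` blows up.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# A WordLevel tokenizer with no decoder configured (toy vocabulary).
backend = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
fast = PreTrainedTokenizerFast(tokenizer_object=backend)

print(fast.backend_tokenizer.decoder)  # None

# Raises AttributeError: 'NoneType' object has no attribute 'decode'
fast.convert_tokens_to_string(["hello", "world"])
```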

What is the correct approach here? Should I convert the fast tokenizer to a slow one, or should I open a PR that checks whether `decoder` is `None` and, if so, falls back to a simple join of the tokens? A sketch of the second option is below.
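
For the second option, this is the general shape of the check I have in mind (a sketch, not a final patch):

```python
def convert_tokens_to_string(self, tokens):
    # Fall back to a plain whitespace join when no decoder is configured.
    if self.backend_tokenizer.decoder is None:
        return " ".join(tokens)
    return self.backend_tokenizer.decoder.decode(tokens)
```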


> Or should I create a PR that checks for None in decoder

Newly added parts of transformers sometimes have bugs or unimplemented pieces. If you can, please do open a PR.

I opened a PR: Fix convert_tokens_to_string when decoder is None by dszeto · Pull Request #34569 · huggingface/transformers · GitHub. The fix is working in my production environment; it is just waiting to be reviewed and merged.
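
Until it is merged, one interim workaround is to attach a decoder to the backend tokenizer yourself. The decoder choice here is just an example that assumes a space-joined, word-level vocabulary:

```python
from tokenizers import decoders

# Example only: the WordPiece decoder joins tokens with spaces (stripping
# the "##" continuation prefix), which is usually what you want for a
# word-level vocabulary.
fast.backend_tokenizer.decoder = decoders.WordPiece()
fast.convert_tokens_to_string(["hello", "world"])  # "hello world"
```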
