According to the `Tokenizer` documentation, `decoder` is an optional property, but `PreTrainedTokenizerFast.convert_tokens_to_string` (`src/transformers/tokenization_utils_fast.py` on `main` in huggingface/transformers on GitHub) never checks whether `decoder` is `None`. Some tokenizers (such as those trained by `WordLevelTrainer`) have no decoder, and this breaks projects such as TGI and Outlines because they call the `convert_tokens_to_string` method.
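
For context, here is a minimal sketch of how I can reproduce the failure; the training corpus and special tokens are just placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Train a minimal word-level tokenizer; WordLevelTrainer does not attach a decoder.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["hello world", "hello there"], trainer=trainer)

assert tokenizer.decoder is None  # decoder is optional and absent here

fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")
# Raises AttributeError: 'NoneType' object has no attribute 'decode'
fast.convert_tokens_to_string(["hello", "world"])
```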
What is the correct approach here? Should I convert the fast tokenizer to a slow one, or should I open a PR that checks whether `decoder` is `None` and, if so, falls back to a simple join of the tokens?
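
For reference, a rough sketch of the `None` check I have in mind, assuming the current method body simply delegates to `self.backend_tokenizer.decoder.decode` (the whitespace separator in the fallback is my assumption, since word-level tokens are usually space-delimited):

```python
from typing import List

# Hypothetical patch sketch for PreTrainedTokenizerFast.convert_tokens_to_string.
def convert_tokens_to_string(self, tokens: List[str]) -> str:
    decoder = self.backend_tokenizer.decoder
    if decoder is None:
        # No decoder (e.g. WordLevelTrainer-trained tokenizers):
        # fall back to a plain whitespace join (assumed separator).
        return " ".join(tokens)
    return decoder.decode(tokens)
```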