PreTrainedTokenizerFast.convert_tokens_to_string always assumes the presence of decoder

According to the `tokenizers` documentation, `decoder` is an optional property of `Tokenizer`, but `PreTrainedTokenizerFast.convert_tokens_to_string` (in `src/transformers/tokenization_utils_fast.py` at main · huggingface/transformers · GitHub) never checks whether `decoder` is `None`. Some tokenizers (such as those trained with `WordLevelTrainer`) do not have decoders, and this causes problems for projects such as TGI and Outlines, which rely on the `convert_tokens_to_string` method.
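
Here is a minimal reproduction sketch (the toy vocabulary is made up for illustration): a fast tokenizer wrapped around a `WordLevel` model with no decoder configured leaves `backend_tokenizer.decoder` as `None`, so `convert_tokens_to_string` blows up.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# A WordLevel tokenizer with no decoder configured (toy vocabulary).
backend = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
fast = PreTrainedTokenizerFast(tokenizer_object=backend)

print(fast.backend_tokenizer.decoder)  # None

# Raises AttributeError: 'NoneType' object has no attribute 'decode'
fast.convert_tokens_to_string(["hello", "world"])
```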

What is the correct approach here? Should I convert the fast tokenizer to a slow one, or should I open a PR that checks whether `decoder` is `None` and, if so, falls back to a simple join of the tokens? A sketch of the second option is below.
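
For the second option, this is the general shape of the check I have in mind (a sketch, not a final patch):

```python
def convert_tokens_to_string(self, tokens):
    # Fall back to a plain whitespace join when no decoder is configured.
    if self.backend_tokenizer.decoder is None:
        return " ".join(tokens)
    return self.backend_tokenizer.decoder.decode(tokens)
```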


> Or should I create a PR that checks for None in decoder

Newly added parts of transformers sometimes have bugs or unimplemented pieces. If you can, please do open a PR.

I opened a PR: Fix convert_tokens_to_string when decoder is None by dszeto · Pull Request #34569 · huggingface/transformers · GitHub. The fix is working in my production environment; it is just waiting to be reviewed and merged.
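
Until it is merged, one interim workaround is to attach a decoder to the backend tokenizer yourself. The decoder choice here is just an example that assumes a space-joined, word-level vocabulary:

```python
from tokenizers import decoders

# Example only: the WordPiece decoder joins tokens with spaces (stripping
# the "##" continuation prefix), which is usually what you want for a
# word-level vocabulary.
fast.backend_tokenizer.decoder = decoders.WordPiece()
fast.convert_tokens_to_string(["hello", "world"])  # "hello world"
```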
