Hi!
When I use the above code and print the result, I see 'å¦Ĥä½ķ': 'how' in the ids_tokens, and I'm confused, so I tried to debug tokenizer.decode() to see how 'å¦Ĥä½ķ' is converted to 'how'.
But I see that there is no implementation of decode() in transformers/models/qwen2/tokenization_qwen2_fast.py, and no implementation of _decode() either. How can I find the implementation of tokenizer.decode()?
Thank you in advance🤗
As with most libraries, class inheritance is used here, so methods and attributes of the parent classes are also available on the child. That means you have to go back and debug the parent classes, and their parents in turn. Like this:
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/tokenization_qwen2_fast.py
class Qwen2TokenizerFast(PreTrainedTokenizerFast):
# https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
…
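If tracing the chain by hand gets tedious, you can also ask Python directly which class in the method resolution order defines decode / _decode and which file it lives in. A minimal sketch using the standard inspect module (the checkpoint name below is just an example, substitute your own):

import inspect

from transformers import AutoTokenizer

# Example checkpoint; any Qwen2 tokenizer you already have will do.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Walk the class hierarchy (MRO) and report which class defines each method.
for cls in type(tokenizer).__mro__:
    for name in ("decode", "_decode"):
        if name in cls.__dict__:
            print(f"{name} defined in {cls.__name__} -> {inspect.getsourcefile(cls.__dict__[name])}")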
Maybe, but it's often faster to find code that is known to work, diff it against yours, and do some trial and error. It could be a model-dependent problem. In this case, I suspect the decoding part.
I don’t even know what the correct code is.
It shouldn't be a character-encoding problem like in the old days, but each tokenizer has its own slight quirks.
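For what it's worth, tokens that look like mojibake in a fast tokenizer are usually just the byte-level BPE representation, and decode() maps them back to readable text. A quick round-trip check along these lines (the checkpoint name is only an example) can rule out a real encoding problem:

from transformers import AutoTokenizer

# Example checkpoint; use the model you are actually working with.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

text = "如何"  # Chinese for "how"
ids = tokenizer(text)["input_ids"]

# The raw tokens use byte-level BPE, so they can look garbled (e.g. 'å¦Ĥä½ķ')...
print(tokenizer.convert_ids_to_tokens(ids))
# ...but decode() turns the bytes back into the original string.
print(tokenizer.decode(ids))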
Thank you so much!
I looked at the parent classes and didn't find an implementation of _decode(); I only found an unimplemented _decode() in SpecialTokensMixin of PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin), with the following code:
Trace it back further through the parent classes; eventually you'll reach the ancestor class that actually implements it.
There may be a convenient feature for this on GitHub, but I don't use it, so I can't say.
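Alternatively, you can stay inside Python and print the source of the _decode that attribute lookup actually resolves to for your tokenizer instance, wherever it lives in the hierarchy. A small sketch (again, the checkpoint name is just an example):

import inspect

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # example checkpoint

# Attribute lookup on the class follows the MRO, so this is the _decode that really runs.
print(inspect.getsource(type(tokenizer)._decode))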