How can I check the implementation of tokenizer.decode()

tokenizer = AutoTokenizer.from_pretrained("../../../Models/Qwen/Qwen2-7B-Instruct")  # tokenizer: Qwen2TokenizerFast
input = "如何学习大模型"
input_ids = tokenizer.encode(input)  # encode the text into token IDs
ids_tokens = {tokenizer.convert_ids_to_tokens(input_id): tokenizer.decode([input_id]) for input_id in input_ids}
print(type(tokenizer))
print(ids_tokens)

>>>
<class 'transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast'>
{'å¦Ĥä½ķ': '如何', 'åŃ¦ä¹ł': '学习', 'å¤§': '大', 'æ¨¡åŀĭ': '模型'}

Hi!
When I run the code above, I see 'å¦Ĥä½ķ': '如何' in ids_tokens, which confused me, so I tried to debug tokenizer.decode() to see how 'å¦Ĥä½ķ' is converted to '如何'.
But there is no implementation of decode() or _decode() in transformers/models/qwen2/tokenization_qwen2_fast.py. How can I check the implementation of tokenizer.decode()?
Thank you in advance🤗


As is the case with most libraries, class inheritance is used here, so the methods and attributes of the parent classes are also available. That means you have to go back and debug the parent classes, and even their parent classes. Like this:

# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/tokenization_qwen2_fast.py
class Qwen2TokenizerFast(PreTrainedTokenizerFast):
# https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
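For example, Python can tell you which ancestor actually defines a method by walking the method-resolution order. The classes below are illustrative stand-ins, not the real transformers classes; with a real tokenizer you would iterate over type(tokenizer).__mro__ instead:

```python
import inspect

# Stand-in classes mimicking the inheritance chain (illustrative only,
# not the actual transformers implementation):
class PreTrainedBase:
    def decode(self, ids):
        return self._decode(ids)

class FastTokenizer(PreTrainedBase):
    def _decode(self, ids):
        return "<decoded>"

class Qwen2Like(FastTokenizer):
    pass

# Walk the method-resolution order and report which class defines each
# method; inspect.getsourcefile(owner) would then tell you which file to open.
for name in ("decode", "_decode"):
    owner = next(c for c in inspect.getmro(Qwen2Like) if name in c.__dict__)
    print(name, "->", owner.__name__)  # decode -> PreTrainedBase, _decode -> FastTokenizer
```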

Maybe, but it’s often faster to find code that works correctly, compare it with yours, and then do some trial and error. It could be a model-dependent problem; in this case, I suspect the decoding part.
I don’t even know what the correct code is.
It shouldn’t be a character-encoding issue like in the old days, but each tokenizer has its own slight quirks.

Thank you so much!
I looked at the parent classes and didn’t find an implementation of _decode(); I only found a stub _decode() in PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin) that raises NotImplementedError, with the following code:

    def _decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = None,
        **kwargs,
    ) -> str:
        raise NotImplementedError

I see that PreTrainedTokenizerFast is a “Fast” implementation based on the Rust library 🤗 Tokenizers, so I’ll take another look at the Tokenizers code.
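Incidentally, the odd-looking tokens themselves come from byte-level BPE: every UTF-8 byte is mapped to a printable character, so multi-byte Chinese text shows up as Latin-looking strings until decode() reverses the mapping. A minimal sketch of that reverse step (the table mirrors the GPT-2-style bytes_to_unicode construction used by byte-level tokenizers; this is illustrative, not the actual Qwen2 Rust decoder):

```python
def bytes_to_unicode():
    # Map every byte 0-255 to a printable unicode character, GPT-2 style:
    # printable latin-1 bytes map to themselves, the rest to chr(256 + n).
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table to go from token characters back to raw bytes.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

token = "å¦Ĥä½ķ"  # what convert_ids_to_tokens() returned
text = bytes(byte_decoder[ch] for ch in token).decode("utf-8")
print(text)  # 如何
```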


Keep tracing back through the parent classes; you’ll reach the ancestor class eventually.
There may be a handy feature for this on GitHub, but I don’t use one, so I don’t know. 🥶

I haven’t heard of it, so could you tell me the name of this feature on GitHub?

No, I just thought one might exist.
In Python, we’ll have to do it this way:

print(ClassName.__bases__)

or
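to see the whole lookup chain at once you can print the MRO. A toy example (the class names are made up, not the real tokenizer classes):

```python
class A:            # stand-in for PreTrainedTokenizerBase
    pass

class B(A):         # stand-in for PreTrainedTokenizerFast
    pass

class C(B):         # stand-in for Qwen2TokenizerFast
    pass

print(C.__bases__)  # direct parents only: just B
print(C.__mro__)    # the full lookup chain: C, B, A, object
```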

Thanks for the reply, I will try what you say :hugs:
