How can I check the implementation of tokenizer.decode()

tokenizer = AutoTokenizer.from_pretrained("../../../Models/Qwen/Qwen2-7B-Instruct")  # tokenizer: Qwen2TokenizerFast
input = "如何学习大模型"
input_ids = tokenizer.encode(input)  # encode the text into token IDs
ids_tokens = {tokenizer.convert_ids_to_tokens(input_id): tokenizer.decode([input_id]) for input_id in input_ids}
print(type(tokenizer))
print(ids_tokens)

>>>
<class 'transformers.models.qwen2.tokenization_qwen2_fast.Qwen2TokenizerFast'>
{'å¦Ĥä½ķ': '如何', 'åŃ¦ä¹ł': '学习', 'å¤§': '大', 'æ¨¡åŀĭ': '模型'}

Hi!
When I run the code above, I see 'å¦Ĥä½ķ': '如何' in ids_tokens, which confused me, so I tried to debug tokenizer.decode() to see how 'å¦Ĥä½ķ' is converted to '如何'.
But there is no implementation of decode() or _decode() in transformers/models/qwen2/tokenization_qwen2_fast.py. How can I check the implementation of tokenizer.decode()?
Thank you in advance🤗


As is the case with most libraries, class inheritance is used here, so the methods and attributes of the parent classes are also available. That means you have to go back and debug the parent classes, and even their parent classes. Like this:

# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/tokenization_qwen2_fast.py
class Qwen2TokenizerFast(PreTrainedTokenizerFast):
# https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
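For example, Python can tell you which ancestor actually defines a method by walking the method-resolution order. The classes below are illustrative stand-ins, not the real transformers classes; with a real tokenizer you would iterate over type(tokenizer).__mro__ instead:

```python
import inspect

# Stand-in classes mimicking the inheritance chain (illustrative only,
# not the actual transformers implementation):
class PreTrainedBase:
    def decode(self, ids):
        return self._decode(ids)

class FastTokenizer(PreTrainedBase):
    def _decode(self, ids):
        return "<decoded>"

class Qwen2Like(FastTokenizer):
    pass

# Walk the method-resolution order and report which class defines each
# method; inspect.getsourcefile(owner) would then tell you which file to open.
for name in ("decode", "_decode"):
    owner = next(c for c in inspect.getmro(Qwen2Like) if name in c.__dict__)
    print(name, "->", owner.__name__)  # decode -> PreTrainedBase, _decode -> FastTokenizer
```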

Maybe, but it’s often faster to find code that works correctly, compare it with yours, and then do some trial and error. It could be a model-dependent problem; in this case, I suspect the decoding part.
I don’t even know what the correct code is.
It shouldn’t be a character-encoding issue like in the old days, but each tokenizer has its own slight quirks.

Thank you so much!
I looked at the parent classes and didn’t find an implementation of _decode(); I only found a stub _decode() in PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin) that raises NotImplementedError, with the following code:

    def _decode(
        self,
        token_ids: Union[int, List[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = None,
        **kwargs,
    ) -> str:
        raise NotImplementedError

I see that PreTrainedTokenizerFast is a “Fast” implementation based on the Rust library 🤗 Tokenizers, so I’ll take another look at the Tokenizers code.
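Incidentally, the odd-looking tokens themselves come from byte-level BPE: every UTF-8 byte is mapped to a printable character, so multi-byte Chinese text shows up as Latin-looking strings until decode() reverses the mapping. A minimal sketch of that reverse step (the table mirrors the GPT-2-style bytes_to_unicode construction used by byte-level tokenizers; this is illustrative, not the actual Qwen2 Rust decoder):

```python
def bytes_to_unicode():
    # Map every byte 0-255 to a printable unicode character, GPT-2 style:
    # printable latin-1 bytes map to themselves, the rest to chr(256 + n).
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table to go from token characters back to raw bytes.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

token = "å¦Ĥä½ķ"  # what convert_ids_to_tokens() returned
text = bytes(byte_decoder[ch] for ch in token).decode("utf-8")
print(text)  # 如何
```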


Keep tracing back through the parent classes; you’ll reach the ancestor class eventually.
There may be a handy feature for this on GitHub, but I don’t use one, so I don’t know. 🥶

I haven’t heard of it, so could you tell me the name of this feature on GitHub?

No, I just thought one might exist.
In Python, we’ll have to do it this way:

print(ClassName.__bases__)

or
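to see the whole lookup chain at once you can print the MRO. A toy example (the class names are made up, not the real tokenizer classes):

```python
class A:            # stand-in for PreTrainedTokenizerBase
    pass

class B(A):         # stand-in for PreTrainedTokenizerFast
    pass

class C(B):         # stand-in for Qwen2TokenizerFast
    pass

print(C.__bases__)  # direct parents only: just B
print(C.__mro__)    # the full lookup chain: C, B, A, object
```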

Thanks for the reply, I will try what you say :hugs:
