Chapter 7 questions

Hello again. When running this code from this section:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
I get

ValueError: Unrecognized model in huggingface-course/code-search-net-tokenizer. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, (...).

In the repo I indeed don't see a config.json file, but I don't know whether that is unusual. It looks like the file has been missing from the beginning.

So I'm not sure: was AutoTokenizer able to work without model_type in some earlier version? I tried searching for information about a deprecation or a renamed field, but didn't find anything.
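One workaround I considered, but have not verified, is to name the tokenizer type explicitly instead of relying on config.json (I'm assuming the tokenizer_type keyword of AutoTokenizer.from_pretrained applies here):

from transformers import AutoTokenizer

# Unverified sketch: skip the config.json model_type lookup by telling
# AutoTokenizer which tokenizer family to load
tokenizer = AutoTokenizer.from_pretrained(
    "huggingface-course/code-search-net-tokenizer", tokenizer_type="gpt2"
)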

In any case, I investigated the tokenizer_config.json file and found "tokenizer_class": "GPT2Tokenizer" there, so I loaded the tokenizer via

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

and in theory it works, but instead of the overflow_to_sample_mapping key from the tutorial it creates a key called overflowing_tokens, and it also seems to tokenize differently. If I take e.g. n=10 examples it returns

Input IDs length: 10
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128]
Chunk mapping: [[692, 63, 2077, 173, 173, 612, 536, 612, 233, 2558, 77, (...)

For n=2 there are two elements in Input chunk lengths, for n=3 three, and so on. This is different tokenization behavior from the one presented in the tutorial (the exact call I'm making is sketched below).
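For reference, here is roughly the call I'm making, following the course section (a sketch: I'm assuming context_length = 128 and that raw_datasets is the CodeSearchNet dataset loaded as in the course; since GPT2Tokenizer returns no overflow_to_sample_mapping, the last print uses the overflowing_tokens key it returns instead):

context_length = 128

# Course-style tokenization call on the first n=10 examples
outputs = tokenizer(
    raw_datasets["train"][:10]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {outputs['length']}")
# with the fast tokenizer this would be outputs['overflow_to_sample_mapping']
print(f"Chunk mapping: {outputs['overflowing_tokens']}")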

Then I created my own GPT-2-based tokenizer as presented in Chapter 6 and used it instead, and it tokenizes the same as in the course; it even creates the overflow_to_sample_mapping key.
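Concretely, I built it roughly like this, following Chapter 6 (training_corpus being the generator over the CodeSearchNet functions from that chapter):

from transformers import AutoTokenizer

# Retrain the GPT-2 BPE tokenizer on the CodeSearchNet corpus, as in Chapter 6;
# train_new_from_iterator returns a fast (Rust-backed) tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)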

So my question is: why do I get such different tokenization behavior when I load huggingface-course/code-search-net-tokenizer with GPT2Tokenizer? 🙈
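My only guess so far, which I haven't verified, is that this is a slow-vs-fast tokenizer difference: the Chapter 6 tokenizer is a fast (Rust-backed) one, while GPT2Tokenizer is the slow Python class. If that's right, the fair comparison would be the fast variant, something like:

from transformers import GPT2TokenizerFast

# Unverified hypothesis: the fast tokenizer is the one that produces
# overflow_to_sample_mapping and splits each example into 128-token chunks
tokenizer = GPT2TokenizerFast.from_pretrained(
    "huggingface-course/code-search-net-tokenizer"
)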
