Hello again. When running this line from this section:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
I get
ValueError: Unrecognized model in huggingface-course/code-search-net-tokenizer. Should have a
model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, (...).
The repo indeed has no config.json
file, but I don't know whether that is unusual; it looks like it has been that way from the beginning.
So I wonder: was some earlier version of AutoTokenizer able to work without the model_type
key? I tried googling for information about a deprecation or a renamed field, but didn't find anything.
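One way to double-check what the repo actually contains (a minimal sketch using huggingface_hub; it just lists the files in the Hub repo):

from huggingface_hub import list_repo_files

# list every file in the repo to confirm that config.json is absent
print(list_repo_files("huggingface-course/code-search-net-tokenizer"))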
Nonetheless, I inspected the tokenizer_config.json
file and found "tokenizer_class": "GPT2Tokenizer"
there. I then loaded the tokenizer via
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
and in principle it works, but instead of the key overflow_to_sample_mapping
from the tutorial it produces a key overflowing_tokens
, and it also seems to tokenize differently. If I take e.g. n=10 examples it returns
Input IDs length: 10
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128]
Chunk mapping: [[692, 63, 2077, 173, 173, 612, 536, 612, 233, 2558, 77, (...)
For n=2 there are two elements in Input chunk lengths, for n=3 three, and so on, i.e. I always get exactly one chunk per example. That is a different tokenization behavior from the one presented in the tutorial, where long examples are split into several chunks.
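For reference, the call that produced the output above looks roughly like the tutorial's (a sketch; raw_datasets and max_length=128 follow the course's setup, and the "Chunk mapping" line here prints overflowing_tokens, since overflow_to_sample_mapping is missing):

# roughly the tutorial's tokenization call (sketch; raw_datasets and
# max_length=128 are assumed from the course setup)
outputs = tokenizer(
    raw_datasets["train"][:10]["content"],
    truncation=True,
    max_length=128,
    return_overflowing_tokens=True,
    return_length=True,
)
print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {outputs['length']}")
# the slow tokenizer has no overflow_to_sample_mapping key, so this
# prints the overflowing_tokens it returns instead
print(f"Chunk mapping: {outputs['overflowing_tokens']}")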
Then I trained my own GPT-2-based tokenizer as presented in chapter 6 and used it instead, and it tokenizes exactly as in the course; it even creates the key overflow_to_sample_mapping
(see the sketch below).
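Concretely, I built it roughly like this (a sketch following chapter 6; training_corpus is an iterator over the CodeSearchNet texts and 52000 is the vocabulary size used in the course):

from transformers import AutoTokenizer

# retrain the GPT-2 tokenizer on the code corpus, as in chapter 6
# (training_corpus is assumed to be the course's generator of texts)
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)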
So my question is: why do I get such different tokenization behavior when I load huggingface-course/code-search-net-tokenizer
via GPT2Tokenizer?