Hello again. When running this line from this section:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
I get
ValueError: Unrecognized model in huggingface-course/code-search-net-tokenizer. Should have a
model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, (...).
The repo indeed has no config.json
file, but I don't know whether that is unusual; it looks like it has been that way from the beginning.
So I wonder: was some earlier version of AutoTokenizer able to work without the model_type
key? I tried googling for information about a deprecation or a renamed field, but didn't find anything.
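One way to double-check what the repo actually contains (a minimal sketch using huggingface_hub; it just lists the files in the Hub repo):

from huggingface_hub import list_repo_files

# list every file in the repo to confirm that config.json is absent
print(list_repo_files("huggingface-course/code-search-net-tokenizer"))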
Nonetheless, I inspected the tokenizer_config.json
file and found "tokenizer_class": "GPT2Tokenizer"
there. I then loaded the tokenizer via
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
and in principle it works, but instead of the key overflow_to_sample_mapping
from the tutorial it produces a key overflowing_tokens
, and it also seems to tokenize differently. If I take e.g. n=10 examples it returns
Input IDs length: 10
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128]
Chunk mapping: [[692, 63, 2077, 173, 173, 612, 536, 612, 233, 2558, 77, (...)
For n=2 there are two elements in Input chunk lengths, for n=3 three, and so on, i.e. I always get exactly one chunk per example. That is a different tokenization behavior from the one presented in the tutorial, where long examples are split into several chunks.
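For reference, the call that produced the output above looks roughly like the tutorial's (a sketch; raw_datasets and max_length=128 follow the course's setup, and the "Chunk mapping" line here prints overflowing_tokens, since overflow_to_sample_mapping is missing):

# roughly the tutorial's tokenization call (sketch; raw_datasets and
# max_length=128 are assumed from the course setup)
outputs = tokenizer(
    raw_datasets["train"][:10]["content"],
    truncation=True,
    max_length=128,
    return_overflowing_tokens=True,
    return_length=True,
)
print(f"Input IDs length: {len(outputs['input_ids'])}")
print(f"Input chunk lengths: {outputs['length']}")
# the slow tokenizer has no overflow_to_sample_mapping key, so this
# prints the overflowing_tokens it returns instead
print(f"Chunk mapping: {outputs['overflowing_tokens']}")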
Then I trained my own GPT-2-based tokenizer as presented in chapter 6 and used it instead, and it tokenizes exactly as in the course; it even creates the key overflow_to_sample_mapping
(see the sketch below).
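Concretely, I built it roughly like this (a sketch following chapter 6; training_corpus is an iterator over the CodeSearchNet texts and 52000 is the vocabulary size used in the course):

from transformers import AutoTokenizer

# retrain the GPT-2 tokenizer on the code corpus, as in chapter 6
# (training_corpus is assumed to be the course's generator of texts)
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)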
So my question is: why do I get such different tokenization behavior when I load huggingface-course/code-search-net-tokenizer
via GPT2Tokenizer?