Well, I checked, and the JSON vocabulary loads as gibberish:
import json

# Load the BPE vocabulary and peek at a slice of it
with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)
list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
('Ġпи', 1001),
('Ġn', 1002),
('ĠнеÑĤ', 1003),
('иÑĤа', 1004),
('ÑĢÑĥп', 1005),
('ec', 1006),
('енÑĭ', 1007),
('ĠÑıв', 1008),
('Ðĵ', 1009)]
The exact same gibberish appears in the tokenizer itself, which I inspect with list(tok1.get_vocab().items())[1000:1010].
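For what it's worth, a quick round trip through the tokenizer decodes back cleanly (a minimal sketch, assuming tok1 is a Hugging Face GPT2Tokenizer loaded from this model directory; the test sentence is arbitrary):

text = "Привет, мир!"                    # arbitrary Russian test sentence
ids = tok1.encode(text)
print(tok1.convert_ids_to_tokens(ids))   # the same odd-looking symbols as in vocab.json
print(tok1.decode(ids))                  # round-trips back to readable Russian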
Yet the model somehow seems to work and produces meaningful results in Russian.
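One plausible explanation (a sketch, assuming the vocabulary uses GPT-2-style byte-level BPE, which ru-gpts builds on): tokens are stored as printable stand-in characters for raw UTF-8 bytes, so multi-byte Cyrillic sequences render as Latin-looking gibberish while the mapping stays fully reversible. The table below is adapted from the GPT-2 reference tokenizer:

def bytes_to_unicode():
    # Map each raw byte 0-255 to a printable unicode character:
    # printable Latin-1 bytes keep their own character, the rest
    # are shifted above 255 so nothing collides.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()

# "у" is UTF-8 bytes 0xD1 0x83, which the table renders as "Ñĥ",
# the prefix of token 1000 above; a space (0x20) becomes "Ġ".
print("".join(byte_encoder[b] for b in "у".encode("utf-8")))  # Ñĥ
print(byte_encoder[ord(" ")])                                 # Ġ

If that is what is happening here, the vocabulary entries are an internal byte-level alphabet rather than corruption, which would explain why the model still produces clean Russian.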