Which encoding does the GPT2 vocabulary file use?

I wanted to investigate a little how the tokenizer works. But the model is trained (mostly) on Russian, and I see garbage instead of tokens. I tried the obvious variants like opening the file as UTF-8; that didn't help.

Well, I checked that the JSON vocabulary loads as gibberish:

import json

with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
 ('Ġпи', 1001),
 ('Ġn', 1002),
 ('ĠнеÑĤ', 1003),
 ('иÑĤа', 1004),
 ('ÑĢÑĥп', 1005),
 ('ec', 1006),
 ('енÑĭ', 1007),
 ('ĠÑıв', 1008),
 ('Ðĵ', 1009)]

The exact same gibberish is in the tokenizer itself, which I view with list(tok1.get_vocab().items())[1000:1010].
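(tok1 isn't defined in the snippets above; a minimal sketch of how it would be loaded, assuming the standard GPT2Tokenizer from transformers and the same model directory as vocab.json:)

from transformers import GPT2Tokenizer

# hypothetical loading step, assuming the tokenizer files sit next to vocab.json
tok1 = GPT2Tokenizer.from_pretrained("/content/notebooks/ru-gpts/models/gpt3large")

list(tok1.get_vocab().items())[1000:1010]  # same byte-level strings as in vocab.json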

Yet, the model seems to work somehow and produces meaningful results in Russian.

It looks to me like UTF-8 that was saved as cp1252.

# take one token from the vocabulary and try to undo a suspected cp1252 -> UTF-8 mix-up
test = u'ĠоÑĤвеÑĩаеÑĤ'

test.encode('cp1252', errors='replace').decode('utf8', errors='replace')

?о�?ве�?ае�?

Not sure what to do, since obviously some letters are simply destroyed by that conversion.
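A quick way to see why: several of the characters in that token do not exist in cp1252 at all, so encode() has to replace them before the UTF-8 decode even runs. A small check, using the test string from above:

# which characters of the token cp1252 cannot represent at all
for c in test:
    try:
        c.encode('cp1252')
    except UnicodeEncodeError:
        print(hex(ord(c)), repr(c), 'is not representable in cp1252')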

Well, I figured out that there is more going on than just a "latin-1" <-> "utf-8" round trip.

There are reasons why GPT2 remaps some byte values to different Unicode code points, see for example: Why \u0120 (Ġ) is in so many pairs? · Issue #80 · openai/gpt-2 · GitHub
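In short, GPT2 tokenizes at the byte level: every UTF-8 byte of the text is mapped to a single printable Unicode character, and the tokens in vocab.json are stored in that mapped alphabet. A minimal sketch of the mapping and its inverse, adapted from the bytes_to_unicode() helper in OpenAI's encoder.py (the same helper exists in Hugging Face's tokenization_gpt2.py); the sample word is just for illustration:

def bytes_to_unicode():
    # printable bytes keep their code point, everything else is shifted up past 255;
    # that is how byte 0x20 (space) ends up displayed as 'Ġ' (U+0120)
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def decode_token(token):
    # map every character back to its original byte, then decode the bytes as UTF-8
    return bytearray(byte_decoder[c] for c in token).decode("utf-8", errors="replace")

word = " пример"  # any Russian string
stored = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(stored)                # the "gibberish" form, as it would appear in vocab.json
print(decode_token(stored))  # round-trips back to " пример"

Characters like 'Ñ' map to themselves because they are printable latin-1, while shifted ones like 'Ġ' and 'Ĥ' are not, which is why a plain cp1252/latin-1 round trip can't fully reverse the mapping.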

Some Russian letters were shifted too, for the same reason. I guess the easiest way to decode the tokens into something meaningful is like this:

rus = []
for token in tok1.get_vocab().keys():
    # convert_tokens_to_string expects a list of tokens and maps the
    # byte-level characters back to real UTF-8 text
    rus.append(tok1.convert_tokens_to_string([token]))
rus
' даже',
 'зы',
 'вал',
 'стро',
 ' очень',
 ' ник',
 ' р',
 ' можно',
 ' произ',
 'еле',
 'руд',
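An equivalent sketch, assuming tok1 is the GPT2Tokenizer loaded earlier, that builds an id -> readable-string table so the slice can be compared directly with the gibberish listing from vocab.json:

# build an id -> readable-string table in one pass
readable = {idx: tok1.convert_tokens_to_string([token])
            for token, idx in tok1.get_vocab().items()}

[readable[i] for i in range(1000, 1010)]  # the same ids as the vocab.json slice above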