I wanted to investigate a little how the tokenizer works. But the model is trained (mostly) on Russian text, and I see garbage instead of tokens. I tried the obvious things like opening the file in UTF-8; that didn't help.
Well, I checked that the JSON vocabulary loads as gibberish:
import json
with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
  vocab = json.load(f)
list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
 ('Ġпи', 1001),
 ('Ġn', 1002),
 ('ĠнеÑĤ', 1003),
 ('иÑĤа', 1004),
 ('ÑĢÑĥп', 1005),
 ('ec', 1006),
 ('енÑĭ', 1007),
 ('ĠÑıв', 1008),
 ('Ðĵ', 1009)]
The exact same gibberish is in the tokenizer itself, which I view with list(tok1.get_vocab().items())[1000:1010].
Yet the model seems to work somehow and produces meaningful results in Russian.
Looks like UTF-8 saved as cp1252 to me:
test = u'ĠоÑĤвеÑĩаеÑĤ'
test.encode('cp1252', errors='replace').decode('utf8', errors='replace')
?о�?ве�?ае�?
Not sure what to do, since obviously some letters are just destroyed by that conversion.
Well, I figured out that there is more to it than just a “latin-1” <-> “utf-8” round trip.
There are reasons why GPT-2 remaps some Unicode code points, for example: Why \u0120 (Ġ) is in so many pairs? · Issue #80 · openai/gpt-2 · GitHub
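For reference, here is a minimal sketch of that byte-to-unicode table (this is the mapping used by GPT-2’s encoder.py and by transformers’ GPT2Tokenizer; the ru-gpts vocab may add its own quirks on top of it):

import json  # not needed here, just keeping the snippet self-contained

# Sketch of GPT-2's byte<->unicode table: every byte 0..255 gets a printable
# character; "unprintable" bytes (space, control bytes, bytes in 0x80..0x9F, ...)
# are shifted up into the U+0100 range, which is what produces Ġ, Ĥ, ĩ, etc.
def bytes_to_unicode():
  bs = (list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1)))
  cs = bs[:]
  n = 0
  for b in range(2 ** 8):
    if b not in bs:
      bs.append(b)
      cs.append(2 ** 8 + n)
      n += 1
  return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

print(byte_encoder[ord(" ")])                  # 'Ġ' -- the space byte, shifted to U+0120
print(byte_encoder[0xD1], byte_encoder[0x82])  # 'Ñ' 'Ĥ' -- the two UTF-8 bytes of 'т'

# Undo the mapping for a vocab entry that consists only of mapped bytes,
# e.g. 'Ðĵ' (id 1009 in the dump above):
print(bytes(byte_decoder[c] for c in "Ðĵ").decode("utf-8"))  # 'Г'

Note that the vocab above also seems to mix raw Cyrillic letters in with the mapped bytes (e.g. 'ĠнеÑĤ'), so a manual byte_decoder pass won't cover every entry; going through the tokenizer's own convert_tokens_to_string, as below, is the safer route.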
Some Russian letters were shifted too, for some reason. I guess the best way to decode them into something meaningful is like this:
rus = []
for token in tok1.get_vocab().keys():
  # undo the byte-level mapping for each vocab entry
  rus.append(tok1.convert_tokens_to_string([token]))
rus
' даже',
 'зы',
 'вал',
 'стро',
 ' очень',
 ' ник',
 ' р',
 ' можно',
 ' произ',
 'еле',
 'руд',
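A possibly tidier variant of the same idea (just a sketch, assuming tok1 is the ru-gpts tokenizer loaded above) is to decode the whole vocab in id order, so the readable strings line up with the token ids:

# Decode every vocab entry, sorted by id, into {id: readable text}.
readable = {
  idx: tok1.convert_tokens_to_string([token])
  for token, idx in sorted(tok1.get_vocab().items(), key=lambda kv: kv[1])
}
print([readable[i] for i in range(1000, 1010)])  # same slice as the vocab.json dump above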