I wanted to investigate a little how the tokenizer works. But the model is trained (mostly) on Russian, and I see garbage instead of tokens. I tried the obvious variants like opening the file as UTF-8, but that didn't help.
Well, I checked, and the JSON vocabulary loads as gibberish:
import json

# Load the BPE vocabulary shipped with the model
with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
('Ġпи', 1001),
('Ġn', 1002),
('ĠнеÑĤ', 1003),
('иÑĤа', 1004),
('ÑĢÑĥп', 1005),
('ec', 1006),
('енÑĭ', 1007),
('ĠÑıв', 1008),
('Ðĵ', 1009)]
The exact same gibberish is in the tokenizer itself, which I view with list(tok1.get_vocab().items())[1000:1010].
Yet the model seems to work somehow and produces meaningful results in Russian.
It looks to me like UTF-8 that was saved as cp1252. Trying to undo that on one token:
test = u'ĠоÑĤвеÑĩаеÑĤ'
test.encode('cp1252', errors='replace').decode('utf8', errors='replace')
?о�?ве�?ае�?
Not sure what to do, since obviously some letters are simply destroyed by that conversion.
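For what it's worth, here is why the conversion is lossy: a character like Ġ is U+0120, which has no byte in cp1252 (or latin-1) at all, so errors='replace' throws the information away before it can be decoded back:

# Ġ is U+0120 -- outside cp1252/latin-1, so the replace step destroys it
hex(ord('Ġ'))                              # '0x120'
'Ġ'.encode('cp1252', errors='replace')     # b'?'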
Well, I figured out that there is more to it than just a "latin-1" <-> "utf-8" round trip.
There are reasons why GPT-2 remaps some Unicode code points, for example: Why \u0120 (Ġ) is in so many pairs? · Issue #80 · openai/gpt-2 · GitHub
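As far as I understand it, GPT-2 works on raw UTF-8 bytes and first maps every possible byte to a printable Unicode character, so the vocabulary never contains control characters or raw whitespace. Bytes that are already "nice" printable latin-1 characters keep their own code point; the rest (control codes, the space, 0x7F–0xA0, 0xAD) get consecutive code points starting at 0x100, which is how the space byte 0x20 ends up as Ġ (U+0120). A rough sketch of that table and its reverse, modeled on bytes_to_unicode from OpenAI's encoder.py (variable names are mine):

# Sketch of GPT-2's byte <-> unicode table (adapted from bytes_to_unicode
# in openai/gpt-2 encoder.py)
def bytes_to_unicode():
    # bytes kept as-is: printable ASCII and most printable latin-1
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # everything else is moved to unused code points starting at 0x100
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

byte_encoder[ord(" ")]                                      # 'Ġ'
"".join(byte_encoder[b] for b in " нет".encode("utf-8"))    # 'ĠÐ½ÐµÑĤ'
bytes(byte_decoder[c] for c in "ĠÐ½ÐµÑĤ").decode("utf-8")   # ' нет'

So the entries in the vocabulary are not corrupted at all; they are just the byte-mapped form of perfectly normal Russian UTF-8 strings, which is also why the model itself works fine.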
Some of the bytes making up Russian letters were shifted too. I guess the best way to decode the tokens to something meaningful is like this:
# Turn each byte-level BPE token back into readable text
rus = []
for token in tok1.get_vocab().keys():
    rus.append(tok1.convert_tokens_to_string([token]))
rus
' даже',
'зы',
'вал',
'стро',
' очень',
' ник',
' р',
' можно',
' произ',
'еле',
'руд',
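If tok1 is the slow GPT2Tokenizer class (I haven't checked which class ru-gpts actually instantiates), it keeps that reverse table as tok1.byte_decoder, so single tokens can also be inspected directly:

# Sketch, assuming tok1 exposes byte_decoder like the slow GPT2Tokenizer does
def readable(token):
    return bytearray(tok1.byte_decoder[c] for c in token).decode("utf-8", errors="replace")

[readable(t) for t in list(tok1.get_vocab())[1000:1010]]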