Well, I checked, and the JSON vocabulary loads as gibberish:
import json

# Load the BPE vocabulary and peek at a slice of it
with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)
list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
('Ġпи', 1001),
('Ġn', 1002),
('ĠнеÑĤ', 1003),
('иÑĤа', 1004),
('ÑĢÑĥп', 1005),
('ec', 1006),
('енÑĭ', 1007),
('ĠÑıв', 1008),
('Ðĵ', 1009)]
The exact same gibberish appears in the tokenizer itself, which I inspect with list(tok1.get_vocab().items())[1000:1010].
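For what it's worth, a quick round trip through the tokenizer decodes back cleanly (a minimal sketch, assuming tok1 is a Hugging Face GPT2Tokenizer loaded from this model directory; the test sentence is arbitrary):

text = "Привет, мир!"                    # arbitrary Russian test sentence
ids = tok1.encode(text)
print(tok1.convert_ids_to_tokens(ids))   # the same odd-looking symbols as in vocab.json
print(tok1.decode(ids))                  # round-trips back to readable Russian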
Yet the model somehow seems to work and produces meaningful results in Russian.
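One plausible explanation (a sketch, assuming the vocabulary uses GPT-2-style byte-level BPE, which ru-gpts builds on): tokens are stored as printable stand-in characters for raw UTF-8 bytes, so multi-byte Cyrillic sequences render as Latin-looking gibberish while the mapping stays fully reversible. The table below is adapted from the GPT-2 reference tokenizer:

def bytes_to_unicode():
    # Map each raw byte 0-255 to a printable unicode character:
    # printable Latin-1 bytes keep their own character, the rest
    # are shifted above 255 so nothing collides.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()

# "у" is UTF-8 bytes 0xD1 0x83, which the table renders as "Ñĥ",
# the prefix of token 1000 above; a space (0x20) becomes "Ġ".
print("".join(byte_encoder[b] for b in "у".encode("utf-8")))  # Ñĥ
print(byte_encoder[ord(" ")])                                 # Ġ

If that is what is happening here, the vocabulary entries are an internal byte-level alphabet rather than corruption, which would explain why the model still produces clean Russian.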