Which encoding does the GPT2 vocabulary file use?

I wanted to investigate a little how the tokenizer works. But the model is trained (mostly) on Russian, and I see garbage instead of tokens. I tried the obvious variants like opening the file as UTF-8; that didn't help.

Well, I checked that the JSON vocabulary loads as gibberish:

import json

with open("/content/notebooks/ru-gpts/models/gpt3large/vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

list(vocab.items())[1000:1010]
[('Ñĥма', 1000),
 ('Ġпи', 1001),
 ('Ġn', 1002),
 ('ĠнеÑĤ', 1003),
 ('иÑĤа', 1004),
 ('ÑĢÑĥп', 1005),
 ('ec', 1006),
 ('енÑĭ', 1007),
 ('ĠÑıв', 1008),
 ('Ðĵ', 1009)]

The exact same gibberish is in the tokenizer itself, which I view with list(tok1.get_vocab().items())[1000:1010].
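(tok1 isn't defined in the snippets above; a minimal sketch of how it would be loaded, assuming the standard GPT2Tokenizer from transformers and the same model directory as vocab.json:)

from transformers import GPT2Tokenizer

# hypothetical loading step, assuming the tokenizer files sit next to vocab.json
tok1 = GPT2Tokenizer.from_pretrained("/content/notebooks/ru-gpts/models/gpt3large")

list(tok1.get_vocab().items())[1000:1010]  # same byte-level strings as in vocab.json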

Yet, the model seems to work somehow and produces meaningful results in Russian.

It looks to me like UTF-8 that was saved as cp1252.

# take one token from the vocabulary and try to undo a suspected cp1252 -> UTF-8 mix-up
test = u'ĠоÑĤвеÑĩаеÑĤ'

test.encode('cp1252', errors='replace').decode('utf8', errors='replace')

?о�?ве�?ае�?

Not sure what to do, since obviously some letters are simply destroyed by that conversion.
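A quick way to see why: several of the characters in that token do not exist in cp1252 at all, so encode() has to replace them before the UTF-8 decode even runs. A small check, using the test string from above:

# which characters of the token cp1252 cannot represent at all
for c in test:
    try:
        c.encode('cp1252')
    except UnicodeEncodeError:
        print(hex(ord(c)), repr(c), 'is not representable in cp1252')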

Well, I figured out that there is more going on than just a "latin-1" <-> "utf-8" round trip.

There are reasons why GPT2 remaps some byte values to different Unicode code points, see for example: Why \u0120 (Ġ) is in so many pairs? · Issue #80 · openai/gpt-2 · GitHub
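In short, GPT2 tokenizes at the byte level: every UTF-8 byte of the text is mapped to a single printable Unicode character, and the tokens in vocab.json are stored in that mapped alphabet. A minimal sketch of the mapping and its inverse, adapted from the bytes_to_unicode() helper in OpenAI's encoder.py (the same helper exists in Hugging Face's tokenization_gpt2.py); the sample word is just for illustration:

def bytes_to_unicode():
    # printable bytes keep their code point, everything else is shifted up past 255;
    # that is how byte 0x20 (space) ends up displayed as 'Ġ' (U+0120)
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

def decode_token(token):
    # map every character back to its original byte, then decode the bytes as UTF-8
    return bytearray(byte_decoder[c] for c in token).decode("utf-8", errors="replace")

word = " пример"  # any Russian string
stored = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(stored)                # the "gibberish" form, as it would appear in vocab.json
print(decode_token(stored))  # round-trips back to " пример"

Characters like 'Ñ' map to themselves because they are printable latin-1, while shifted ones like 'Ġ' and 'Ĥ' are not, which is why a plain cp1252/latin-1 round trip can't fully reverse the mapping.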

Some Russian letters were shifted too, for the same reason. I guess the easiest way to decode the tokens into something meaningful is like this:

rus = []
for token in tok1.get_vocab().keys():
    # convert_tokens_to_string expects a list of tokens and maps the
    # byte-level characters back to real UTF-8 text
    rus.append(tok1.convert_tokens_to_string([token]))
rus
' даже',
 'зы',
 'вал',
 'стро',
 ' очень',
 ' ник',
 ' р',
 ' можно',
 ' произ',
 'еле',
 'руд',
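An equivalent sketch, assuming tok1 is the GPT2Tokenizer loaded earlier, that builds an id -> readable-string table so the slice can be compared directly with the gibberish listing from vocab.json:

# build an id -> readable-string table in one pass
readable = {idx: tok1.convert_tokens_to_string([token])
            for token, idx in tok1.get_vocab().items()}

[readable[i] for i in range(1000, 1010)]  # the same ids as the vocab.json slice above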