OPT special tokens

Hello,

I can’t understand something about the OPT tokenizer and its special tokens. I came across the <unk> token in the OPT vocabulary, but when I encode it with opt_tokenizer it isn’t treated as a single token; instead it gets split into three tokens: <, unk, and >. How does this make sense?

Below is some code to reproduce my findings:

from transformers import AutoTokenizer

opt_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# '<unk>' is present in the vocabulary
print('<unk>' in opt_tokenizer.get_vocab())  # True

# ...but encoding the literal string splits it into '<', 'unk', '>'
ids = opt_tokenizer.encode("<unk>", add_special_tokens=False)
print(opt_tokenizer.convert_ids_to_tokens(ids))
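For what it’s worth, the string does seem to have its own id in the vocabulary, it just never comes out of encode(). A minimal check, assuming the usual convert_tokens_to_ids API:

# '<unk>' maps to an ordinary vocabulary id when looked up directly...
unk_entry_id = opt_tokenizer.convert_tokens_to_ids('<unk>')
print(unk_entry_id)
# ...but that id does not appear in the ids returned by encode("<unk>") above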

Another odd thing is that the <unk> token doesn’t appear anywhere in opt_tokenizer.special_tokens_map:

print(opt_tokenizer.special_tokens_map)

{'bos_token': '</s>',
 'eos_token': '</s>',
 'unk_token': '</s>',
 'pad_token': '<pad>'}
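So, if I read this correctly, the unknown token the tokenizer is actually configured with is </s>, not <unk>. A quick check, assuming the standard unk_token / unk_token_id attributes:

print(opt_tokenizer.unk_token)     # '</s>' according to the map above
print(opt_tokenizer.unk_token_id)  # the id of '</s>', not of the '<unk>' vocabulary entry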

Best,
AP