Hello,
I can't understand something about the OPT tokenizer and its special tokens. I came across the <unk> token in the OPT vocabulary. But when I encode it with opt_tokenizer, it isn't produced as a single token but as three tokens: <, unk, >. How does this make sense?
Below is some code to reproduce my findings:
from transformers import AutoTokenizer

opt_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
print('<unk>' in opt_tokenizer.get_vocab())  # True: the token does have an id in the vocab
ids = opt_tokenizer.encode("<unk>", add_special_tokens=False)
print(opt_tokenizer.convert_ids_to_tokens(ids))  # ['<', 'unk', '>']
Another funny thing is that the <unk> token isn't present in opt_tokenizer.special_tokens_map:
print(opt_tokenizer.special_tokens_map)
{'bos_token': '</s>', 'eos_token': '</s>', 'unk_token': '</s>', 'pad_token': '<pad>'}
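My current guess, sketched as a toy segmenter (this is my own simplification, not the real OPT byte-level BPE; the function and vocab below are made up for the example): even if the string <unk> has an id in the vocab, the encoder would only emit it whole if it were handled as a special token; otherwise ordinary subword segmentation runs over the characters, and if no merge rule ever produces "<unk>", its vocab id is simply unreachable.

```python
# Toy illustration (NOT the actual OPT tokenizer): special tokens are
# matched as whole strings first; everything else goes through ordinary
# greedy longest-match segmentation over the pieces the merges can reach.
def toy_encode(text, merge_vocab, special_tokens=()):
    if text in special_tokens:      # special tokens stay whole
        return [text]
    tokens, i = [], 0
    while i < len(text):            # greedy longest-match segmentation
        for j in range(len(text), i, -1):
            if text[i:j] in merge_vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to itself
            i += 1
    return tokens

merge_vocab = {"<", "unk", ">"}     # pieces reachable by ordinary merges
print(toy_encode("<unk>", merge_vocab))                             # ['<', 'unk', '>']
print(toy_encode("<unk>", merge_vocab, special_tokens=("<unk>",)))  # ['<unk>']
```

Is this roughly what is happening with OPT, i.e. <unk> sits in the vocab but is neither registered as a special token nor reachable through the merges?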
Best,
AP