Convert_tokens_to_ids produces <unk>

Hi. I am trying to tokenize single words with a RoBERTa BPE sub-word tokenizer. I expected some words to map to multiple ids, but in those cases the method convert_tokens_to_ids just returns the <unk> id. However, calling the tokenizer directly (its __call__ method) does produce the multiple ids. To reproduce the problem, run:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
token_id = tokenizer.convert_tokens_to_ids("exam")
print(f"{token_id} => {tokenizer.decode([token_id])}")
token_ids = tokenizer("exam").input_ids[1:3]
print(f"{token_ids} => {tokenizer.decode(token_ids)}")

Is there a way to make convert_tokens_to_ids behave the same as tokenizer(token).input_ids[1:3]? Thanks in advance for any help you can provide.

I think you are misunderstanding tokenizer.convert_tokens_to_ids(). That function maps a token to its id, but exam is not a token in the vocabulary; it is a word. You can check with the following code:

tokenizer.convert_ids_to_tokens([3463, 424])
> ['ex', 'am'] # exam is tokenized into two tokens!

So, obviously, there is no token exam in the vocabulary, which is why you get the <unk> id instead.
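If you want the sub-word ids for a whole word, a sketch of one way to do it: split the word into tokens first with tokenizer.tokenize(), then pass that list to convert_tokens_to_ids(). Passing add_special_tokens=False to the tokenizer call should give the same ids without the <s>/</s> wrapper, so no index slicing is needed (the exact sub-word split below is what your ids suggest, but verify on your setup):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Split the word into sub-word tokens first, then map each token to its id.
tokens = tokenizer.tokenize("exam")                  # e.g. ['ex', 'am']
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # one id per sub-word token
print(f"{tokens} => {token_ids}")

# Equivalent: encode directly, skipping the special <s>/</s> tokens.
same_ids = tokenizer("exam", add_special_tokens=False).input_ids
print(same_ids)
```

convert_tokens_to_ids() accepts either a single token string or a list of tokens, which is why handing it the output of tokenize() works while handing it a raw word does not.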