Hi. I am trying to tokenize single words with a RoBERTa BPE sub-word tokenizer. I was expecting some words to map to multiple ids, but when that should be the case, the method `convert_tokens_to_ids` just returns the `<unk>` id. However, `__call__` on the tokenizer does produce the multiple ids. To reproduce the problem, run:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Direct vocabulary lookup of the whole word
token_id = tokenizer.convert_tokens_to_ids("exam")
print(f"{token_id} => {tokenizer.decode([token_id])}")

# Full tokenization; [1:3] skips the leading <s> token
token_ids = tokenizer("exam").input_ids[1:3]
print(f"{token_ids} => {tokenizer.decode(token_ids)}")
```
Is there a way to make `convert_tokens_to_ids` behave the same as `tokenizer(token).input_ids[1:3]`? Thanks in advance for any help you can provide.
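In case it helps, here is a sketch of what I believe should be equivalent, on the assumption that `convert_tokens_to_ids` is a plain vocabulary lookup and that `tokenizer.tokenize` is what performs the BPE split:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# convert_tokens_to_ids only looks tokens up in the vocabulary, so a word
# that BPE would split into several sub-words is not found and maps to <unk>.
# Running the BPE step explicitly with tokenize() first yields sub-word
# tokens that convert_tokens_to_ids can then look up one by one.
tokens = tokenizer.tokenize("exam")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"{tokens} => {token_ids} => {tokenizer.decode(token_ids)}")
```

If that assumption is right, the `<unk>` result would just mean "exam" is not itself an entry in the vocabulary, even though its sub-words are.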