Hi. I am trying to tokenize single words with a RoBERTa BPE sub-word tokenizer. I was expecting some words to map to multiple ids, but when that should be the case, the method `convert_tokens_to_ids` just returns the `<unk>` id. However, `__call__` on the tokenizer does produce the multiple ids. To reproduce the problem, run:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Direct vocabulary lookup of the whole word
token_id = tokenizer.convert_tokens_to_ids("exam")
print(f"{token_id} => {tokenizer.decode([token_id])}")

# Full tokenization; [1:3] skips the leading <s> token
token_ids = tokenizer("exam").input_ids[1:3]
print(f"{token_ids} => {tokenizer.decode(token_ids)}")
```
Is there a way to make `convert_tokens_to_ids` behave the same as `tokenizer(token).input_ids[1:3]`? Thanks in advance for any help you can provide.
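In case it helps, here is a sketch of what I believe should be equivalent, on the assumption that `convert_tokens_to_ids` is a plain vocabulary lookup and that `tokenizer.tokenize` is what performs the BPE split:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# convert_tokens_to_ids only looks tokens up in the vocabulary, so a word
# that BPE would split into several sub-words is not found and maps to <unk>.
# Running the BPE step explicitly with tokenize() first yields sub-word
# tokens that convert_tokens_to_ids can then look up one by one.
tokens = tokenizer.tokenize("exam")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"{tokens} => {token_ids} => {tokenizer.decode(token_ids)}")
```

If that assumption is right, the `<unk>` result would just mean "exam" is not itself an entry in the vocabulary, even though its sub-words are.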