Hi @kadaj13
You can check which tokens correspond to which words with the tokenizer's `word_ids` function:
```python
from transformers import AutoTokenizer

# Any fast tokenizer works; the output below is from a SentencePiece-based one
tokenizer = AutoTokenizer.from_pretrained(...)
inputs = tokenizer("This is a loooong word")
print(f"Word IDs: {inputs.word_ids()}")
print(f"Tokens: {inputs.tokens()}")
```
```
Word IDs: [0, 1, 2, 3, 3, 3, 3, 4, None, None]
Tokens: ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']
```
You can see that the tokens `▁`, `loo`, `o`, and `ong` all belong to the word with ID 3 (in other words, the 4th word). This also helps you spot the positions of the special tokens, indicated with `None`, which you probably don't want to embed.
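For example, you can use those `None` entries to filter out the special tokens before embedding. A quick sketch using the lists from above:

```python
# Word IDs from the example above; None marks special tokens
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
tokens = ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

# Keep only the positions that belong to an actual word
keep = [i for i, wid in enumerate(word_ids) if wid is not None]
print([tokens[i] for i in keep])
# → ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word']
```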
Hope this helps!