Extracting embedding values of NLP pretrained models from tokenized strings

Hi @kadaj13

You can check out which words correspond to which tokens with the tokenizer and the word_ids() method:

inputs = tokenizer('This is a loooong word')
print(f"Word IDs: {inputs.word_ids()}")
print(f"Tokens: {inputs.tokens()}")

>>> Word IDs: [0, 1, 2, 3, 3, 3, 3, 4, None, None]
>>> Tokens: ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

You can see that the tokens ▁, loo, o, and ong all belong to the word with ID 3 (in other words, the 4th word).

This also helps you spot the positions of the special tokens, which you probably don’t want to embed; they are indicated with None.
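If your goal is one embedding per word, a common approach is to average the embeddings of all tokens sharing a word ID and skip the None entries. Here is a minimal sketch of that pooling step; the function name is made up, and the embeddings are dummy values standing in for whatever your model outputs (e.g. last_hidden_state):

```python
def pool_word_embeddings(word_ids, token_embeddings):
    """Average the embeddings of all tokens that share a word ID.

    Tokens whose word ID is None (special tokens like <sep>/<cls>)
    are skipped entirely.
    """
    grouped = {}
    for wid, emb in zip(word_ids, token_embeddings):
        if wid is None:
            continue
        grouped.setdefault(wid, []).append(emb)
    # Average each word's token embeddings element-wise.
    return {
        wid: [sum(dim) / len(embs) for dim in zip(*embs)]
        for wid, embs in grouped.items()
    }

# Word IDs from the example above: tokens 3..6 all belong to word 3.
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
# Dummy 2-dimensional "embeddings", one per token, just for illustration.
token_embeddings = [[float(i), float(i) * 2] for i in range(len(word_ids))]

word_embs = pool_word_embeddings(word_ids, token_embeddings)
print(word_embs[3])  # average of tokens 3, 4, 5, 6 -> [4.5, 9.0]
```

Averaging is just one pooling choice; taking only the first sub-token of each word is another common option.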

Hope this helps!
