Extracting embedding values of NLP pretrained models from tokenized strings

Hi @kadaj13

You can check out which words correspond to which tokens with the tokenizer and the word_ids() method:

inputs = tokenizer('This is a loooong word')
print(f"Word IDs: {inputs.word_ids()}")
print(f"Tokens: {inputs.tokens()}")

>>> Word IDs: [0, 1, 2, 3, 3, 3, 3, 4, None, None]
>>> Tokens: ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

You can see that the tokens ▁, loo, o, and ong all belong to the word with ID 3 (in other words, the 4th word).

This also helps you spot the positions of the special tokens, which you probably don’t want to embed; they are indicated with None.
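If your goal is one embedding per word, a common approach is to average the embeddings of all tokens sharing a word ID and skip the None entries. Here is a minimal sketch of that pooling step; the function name is made up, and the embeddings are dummy values standing in for whatever your model outputs (e.g. last_hidden_state):

```python
def pool_word_embeddings(word_ids, token_embeddings):
    """Average the embeddings of all tokens that share a word ID.

    Tokens whose word ID is None (special tokens like <sep>/<cls>)
    are skipped entirely.
    """
    grouped = {}
    for wid, emb in zip(word_ids, token_embeddings):
        if wid is None:
            continue
        grouped.setdefault(wid, []).append(emb)
    # Average each word's token embeddings element-wise.
    return {
        wid: [sum(dim) / len(embs) for dim in zip(*embs)]
        for wid, embs in grouped.items()
    }

# Word IDs from the example above: tokens 3..6 all belong to word 3.
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
# Dummy 2-dimensional "embeddings", one per token, just for illustration.
token_embeddings = [[float(i), float(i) * 2] for i in range(len(word_ids))]

word_embs = pool_word_embeddings(word_ids, token_embeddings)
print(word_embs[3])  # average of tokens 3, 4, 5, 6 -> [4.5, 9.0]
```

Averaging is just one pooling choice; taking only the first sub-token of each word is another common option.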

Hope this helps!
