Why are my special tokens not appearing as predictions?

anon58275033 · July 29, 2021, 2:51pm

Hi,

I trained a masked language model on a Twitter dataset in which each tweet contains one emoji. I then used the following code to add the emojis to the tokenizer as special tokens:

num_added_toks = tokenizer.add_tokens(['😃',
'😄',
'😁',
'😆',
'😅',
'😂',
'🤣',
'🥲',
'☺️',
'😊',
'😇',
'🙂',
'🙃',
'😉',
'😌',
'😍',
'🥰',
'😘',
'😗',
'😙',
'😚',
'😋',
'😛',
'😝',
'😜',
'🤪',
'🤨',
'🧐',
'🤓',
'😎',
'🥸',
'🤩',
'🥳',
'😏',
'😒',
'😞',
'😔',
'😟',
'😕',
'🙁',
'☹️',
'😣',
'😖',
'😫',
'😩',
'🥺',
'😢',
'😭',
'😤',
'😠',
'😡',
'🤬',
'🤯',
'😳',
'🥵',
'🥶',
'😱',
'😨',
'😰',
'😥',
'😓',
'🤗',
'🤔',
'🤭',
'🤫',
'🤥',
'🧔🏿‍♂️'])
print('We have added', num_added_toks, 'tokens')
# Note: resize_token_embeddings expects the full size of the new vocabulary, i.e. len(tokenizer)
model.resize_token_embeddings(len(tokenizer))

Adding the tokens succeeded: 3311 different emojis were added, which increased the embedding matrix to (53575, 768), as shown below:

We have added 3311 tokens

Embedding(53575, 768)
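For anyone checking the numbers, here is a small arithmetic sketch of which token ids the new emojis should occupy. It assumes the added tokens are appended after the existing vocabulary and that the embedding matrix had exactly one row per token before resizing; added_token_id_range is a hypothetical helper, not part of any library.

```python
def added_token_id_range(total_rows, num_added):
    """Return the id range the added tokens would occupy.

    Assumption: new tokens are appended at the end of the vocabulary,
    so they take the last `num_added` rows of the embedding matrix.
    """
    base_vocab_size = total_rows - num_added
    return range(base_vocab_size, total_rows)


# Figures reported in the post: Embedding(53575, 768) after adding 3311 tokens
ids = added_token_id_range(53575, 3311)
print("first added id:", ids.start)
print("last added id:", ids.stop - 1)
print("number of added ids:", len(ids))
```

If the ids the tokenizer reports for the emojis fall outside this range, the assumption above does not hold for this setup.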

Now, here’s the issue I am facing: when I add the <mask> token to a sentence and set top_k to the total number of embeddings (53575), not a single emoji shows up in the predictions.

I used this line of code:

mask_filler("Are you happy today <mask>", top_k=53575)

As you can see in the code above, the top_k is 53575, the total number of embeddings which should include the 3311 emojis I added, right?

However, when I make the predictions and scroll through the list of 53575, not a single emoji is there!
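Rather than scrolling through the list by hand, the check can be done in code. This is a minimal sketch, assuming the output of mask_filler is a list of dicts with "token_str" and "score" keys; predictions_in_added is a hypothetical helper, and sample_predictions is made-up stand-in data, not real model output.

```python
def predictions_in_added(predictions, added_tokens):
    """Return (rank, token, score) for every prediction that is an added token.

    Assumption: each prediction is a dict with "token_str" and "score" keys.
    Whitespace is stripped because some tokenizers decode tokens with a
    leading space.
    """
    added = set(added_tokens)
    hits = []
    for rank, pred in enumerate(predictions, start=1):
        token = pred["token_str"].strip()
        if token in added:
            hits.append((rank, token, pred["score"]))
    return hits


# Hypothetical stand-in for: mask_filler("Are you happy today <mask>", top_k=53575)
sample_predictions = [
    {"token_str": " today", "score": 0.41},
    {"token_str": " 😊", "score": 0.07},
    {"token_str": "?", "score": 0.05},
]
added = ["😃", "😊", "😂"]

hits = predictions_in_added(sample_predictions, added)
print(hits)
```

With the real output in place of sample_predictions, an empty result would confirm that none of the added emojis appear anywhere in the ranking.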

I am so confused as to why this is happening! I have added the emojis to the vocabulary, but they are simply not there when making predictions.

Can someone help me please?

Thanks!
