[HELP] Special tokens not appearing as predicted tokens!

Hello,

I require some urgent help!

I trained a masked language model on a Twitter dataset, with each tweet containing one emoji. Then, I used the following code to add the emojis as special tokens:

num_added_toks = tokenizer.add_tokens([
    '😃', '😄', '😁', '😆', '😅', '😂', '🤣', '🧔🏿‍♂️'
])
print('We have added', num_added_toks, 'tokens')
# Note: resize_token_embeddings expects the full size of the new vocabulary, i.e. the length of the tokenizer
model.resize_token_embeddings(len(tokenizer))
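
A quick way to confirm the added tokens really are in the vocabulary (a minimal sketch; it assumes the tokenizer and model from the code above, and the emoji strings are just examples):

# Each added emoji should map to a single token id at or above the original vocab size
for tok in ['😃', '😂', '🧔🏿‍♂️']:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(tok, tok_id, tokenizer.convert_ids_to_tokens(tok_id))

print('len(tokenizer):', len(tokenizer))
print('embedding rows:', model.get_input_embeddings().num_embeddings)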

In total, I added 3311 different emojis successfully, which increased the embedding matrix to (53575, 768), as shown below:

We have added 3311 tokens

Embedding(53575, 768)

Now, here's the issue I am facing… When I add the <mask> token to a sentence and set top_k to the total number of embeddings, which is 53575, not a single emoji shows up in the predictions.

I used this line of code:

mask_filler("Are you happy today <mask>", top_k=53575)

As you can see in the code above, top_k is 53575, the total number of embeddings, which should include the 3311 emojis I added, right?

However, when I make the predictions and scroll through the list of 53575, not a single emoji is there!

Why is this happening? I have added the emojis to the vocabulary, but they are simply not there when making predictions.
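
For completeness, here is a programmatic way to scan the returned predictions instead of scrolling (a sketch; mask_filler is the fill-mask pipeline built in the notebook, and added_emojis is an assumed list holding the tokens added earlier):

added_emojis = ['😃', '😄', '😁', '😆', '😅', '😂', '🤣']  # assumed: the full list of added emoji tokens
predictions = mask_filler("Are you happy today <mask>", top_k=53575)

# Keep only predictions whose token string is one of the added emojis
emoji_preds = [p for p in predictions if p['token_str'].strip() in added_emojis]
print(len(emoji_preds), 'emoji predictions out of', len(predictions))
for p in emoji_preds[:10]:
    print(p['token_str'], p['score'])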

SEE FULL CODE HERE: MLM-EMOJIS/mlm_emojis.ipynb at main · saucyhambon/MLM-EMOJIS · GitHub

I see one emoji in the last predictions. Mostly, it's that the model is not used to seeing those, so it probably needs to be trained longer.

Hello @sgugger. I appreciate the response. How much data would you recommend?

So if I use more data to train, will all the emojis show up as predictions? I'm only asking because I am doing this as part of my MSc project, and I am having no success.

@sgugger Shouldn't all tokens have at least some output probability? The output predictions are over the full size of the vocabulary, after all?

@BramVanroy That is what I was thinking! If emojis are in the vocab, which they are, they should be predicted tokens when making predictions, right? My vocab size with the emojis is 53575, and when making predictions, there is only 1 predicted emoji across those 53575 tokens, which means something is going wrong. What do you think?

I don't understand the question: if there is one prediction that is an emoji, clearly they have output probabilities. It's just that for the sentence used ("I am feeling good today my friend"), those probabilities are lower than those of punctuation symbols, which seems logical to me given the fact that BERT has learned to do that over a huge corpus and a long training, a habit it won't lose in a small fine-tuning of three epochs over 1,700 samples.

Yes, but even changing the sentence, I still get that same predicted emoji.

Update: So… I trained my model on 70,000 records, and I still only get 1 emoji. Something is definitely not right.

If OP adds 3311 emojis to the vocabulary, then you'd expect that the probability vector for the mask token in Are you happy today <mask> includes probabilities for all the original tokens of the model as well as probabilities for all the 3311 emojis.

OP seems to suggest that the output probabilities for <mask> do not include probabilities for those added emojis, i.e. the output probabilities do not have the same size as the vocabulary. That is at least what I understand from OP's post.
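
One way to settle that is to check the shape of the MLM head output directly. A minimal sketch, assuming model and tokenizer are the fine-tuned masked LM and the resized tokenizer, and that the checkpoint uses <mask> as its mask token:

import torch

inputs = tokenizer("Are you happy today <mask>", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The last dimension should equal len(tokenizer), i.e. 53575 after resizing
print(logits.shape)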

Then how did the pipeline predict an emoji if it's not in the vocab?

There is no shape computation anywhere in the notebook, just a pipeline call, so there is no way to know for sure whether the model output probabilities for each emoji (I'm pretty sure it did, they are just low).

I would think the same, but OP wrote that they tried

mask_filler("Are you happy today <mask>", top_k=53575)

where 53575 is the resized embedding size, and that in that output they can only find one emoji.

That is correct indeed. After adding the emojis to the vocab and resizing the embeddings, I'd expect all 3311 emojis to appear when viewing the top_k=53575 predictions. As far as I am aware, those 53575 tokens include all the emojis, so they should all show up somewhere in the predictions.

Not sure that's a use case the pipeline properly supports, as it's a very large value. In any case, to debug further, you should dig into the predictions manually and go through the model and its output.

What do you mean by 'dig into the predictions manually'? I have already gone through all 53575 predicted tokens when making predictions, and there was only 1 emoji, as you know.
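
In case it helps anyone reading this later, here is a minimal sketch of that kind of manual check. It assumes the model and tokenizer from the notebook; the emoji strings and variable names are only illustrative:

import torch

text = "Are you happy today <mask>"
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits

# Probabilities over the full vocabulary at the masked position
probs = logits[0, mask_pos].softmax(dim=-1).squeeze(0)

# Rank of every token id, from most to least likely
ranking = probs.argsort(descending=True).tolist()
rank_of = {tok_id: rank for rank, tok_id in enumerate(ranking)}

for tok in ['😂', '🤣', '😅']:  # illustrative; loop over the full list of added emojis
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(tok, 'id:', tok_id, 'prob:', float(probs[tok_id]), 'rank:', rank_of[tok_id])

If the added emojis all show up with valid ids and nonzero probabilities but very deep ranks, the vocabulary and embedding resizing worked, and the issue is just that the fine-tuned model assigns them low probability.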