Hello,
I require some urgent help!
I trained a masked language model on a Twitter dataset, where each tweet contains one emoji. Then I used the following code to add the emojis as new tokens to the tokenizer:
```python
num_added_toks = tokenizer.add_tokens(['๐', '๐', '๐', '๐', '๐', '๐', '๐คฃ', '๐ง๐ฟโโ๏ธ'])
print('We have added', num_added_toks, 'tokens')
# Notice: resize_token_embeddings expects to receive the full size of the new
# vocabulary, i.e. the length of the tokenizer
model.resize_token_embeddings(len(tokenizer))
```
Adding the tokens this way, I successfully added 3311 different emojis, which increased the embedding matrix to (53575, 768), as shown below:
```
We have added 3311 tokens
Embedding(53575, 768)
```
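For reference, a quick sanity check along these lines confirms that an added emoji ends up in the vocabulary as a single token (the checkpoint name and the emoji below are just placeholders, not the fine-tuned tokenizer and full emoji list from my notebook):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and example emoji; in the notebook this is the
# fine-tuned tokenizer and the full emoji list from above
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
num_added = tokenizer.add_tokens(["🤣"])
print(num_added)  # 1 if the emoji was not already in the vocabulary

# The added emoji maps to a single new id at the end of the vocabulary
emoji_id = tokenizer.convert_tokens_to_ids("🤣")
print(emoji_id, len(tokenizer))  # e.g. 50265 50266 for distilroberta-base
print(tokenizer.tokenize("Are you happy today 🤣"))  # the emoji stays one token
```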
Now, here's the issue I am facing… When I add the `<mask>` token to a sentence and set `top_k` to the total number of embeddings (53575), not a single emoji shows up in the predictions.
I used this line of code:
```python
mask_filler("Are you happy today <mask>", top_k=53575)
```
As you can see in the code above, `top_k` is 53575, the total number of embeddings, which should include the 3311 emojis I added, right?
However, when I make the predictions and scroll through the full list of 53575, not a single emoji is there!
Why is this happening? I have added the emojis to the vocabulary, but they are simply not there when making predictions.
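Since `top_k=53575` covers the whole vocabulary, I would expect every added emoji id to at least get some score at the `<mask>` position. Here is a minimal sketch of how that score can be read off the model directly, without the pipeline (again, the checkpoint name and the example emoji are placeholders for my fine-tuned model and the emojis added above):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint and example emoji; in the notebook these are the
# fine-tuned model/tokenizer and the emojis added above
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
tokenizer.add_tokens(["🤣"])
model.resize_token_embeddings(len(tokenizer))

text = f"Are you happy today {tokenizer.mask_token}"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Scores over the full vocabulary at the <mask> position
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
mask_scores = logits[0, mask_pos]

# How many tokens score higher than the added emoji (0 = top prediction)
emoji_id = tokenizer.convert_tokens_to_ids("🤣")
better = (mask_scores > mask_scores[emoji_id]).sum().item()
print(f"{better} of {len(tokenizer)} tokens score higher than the emoji")
```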
SEE FULL CODE HERE: MLM-EMOJIS/mlm_emojis.ipynb at main · saucyhambon/MLM-EMOJIS · GitHub