Hi,
I trained a masked language model on a Twitter dataset, with each tweet containing one emoji. Then, I used the following code to add the emojis as new tokens:
```python
num_added_toks = tokenizer.add_tokens([
    '😀', '😃', '😄', '😁', '😅', '😂', '🤣', '🥲', '☺️', '😊',
    '😇', '🙂', '🙃', '😉', '😌', '😍', '🥰', '😘', '😗', '😙',
    '😚', '😋', '😛', '😝', '😜', '🤪', '🤨', '🧐', '🤓', '😎',
    '🥸', '🤩', '🥳', '😏', '😒', '😞', '😔', '😟', '😕', '🙁',
    '☹️', '😣', '😖', '😫', '😩', '🥺', '😢', '😭', '😤', '😠',
    '😡', '🤬', '🤯', '😳', '🥵', '🥶', '😱', '😨', '😰', '😥',
    '😓', '🤗', '🤔', '🤭', '🤫', '🤥', '🧏🏿‍♀️'])
print('We have added', num_added_toks, 'tokens')
# Note: resize_token_embeddings expects the full size of the new vocabulary,
# i.e. the length of the extended tokenizer.
model.resize_token_embeddings(len(tokenizer))
```
In total, I added 3311 different emojis successfully, which increased the embedding matrix to (53575, 768), as shown below:
```
We have added 3311 tokens
Embedding(53575, 768)
```
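For reference, here is a quick sanity check that an added emoji really maps to its own token id rather than `<unk>` (a minimal sketch using one example emoji from the list):

```python
# Sketch: confirm an added emoji maps to a real (non-UNK) token id.
emoji = '😀'  # example emoji from the list above
token_id = tokenizer.convert_tokens_to_ids(emoji)
print(emoji, '->', token_id)
print('is <unk>:', token_id == tokenizer.unk_token_id)
print(tokenizer.tokenize('Are you happy today 😀'))
```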
Now, here's the issue I am facing… When I add the `<mask>` token to a sentence and set `top_k` to the total number of embeddings, which is 53575, not a single emoji shows up in the predictions.
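For context, `mask_filler` is the fill-mask pipeline built on the fine-tuned model and the extended tokenizer, along these lines:

```python
from transformers import pipeline

# Assumption: mask_filler is a standard fill-mask pipeline built from the
# fine-tuned model and the extended tokenizer (this step is not shown above).
mask_filler = pipeline('fill-mask', model=model, tokenizer=tokenizer)
```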
I used this line of code:
```python
mask_filler("Are you happy today <mask>", top_k=53575)
```
As you can see in the code above, `top_k` is 53575, the total number of embeddings, which should include the 3311 emojis I added, right? However, when I make the predictions and scroll through the list of 53575 results, not a single emoji is there!
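Rather than eyeballing the list, here is a quick way to scan all the returned predictions for the added emojis programmatically (a sketch; `added_emojis` stands in for the full list passed to `add_tokens` above):

```python
# Sketch: scan every returned prediction for the added emojis instead of
# scrolling; added_emojis stands in for the full list from add_tokens.
added_emojis = {'😀', '😂', '🤣'}  # ...plus the rest of the list above
preds = mask_filler("Are you happy today <mask>", top_k=53575)
emoji_hits = [p for p in preds if p['token_str'].strip() in added_emojis]
print(len(emoji_hits), 'of', len(preds), 'predictions are emojis')
```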
I am so confused as to why this is happening! I have added the emojis to the vocabulary, but they are simply not there when making predictions.
Can someone help me please?
Thanks!