I have trained a masked language model on my own dataset of 20,000 entries, which contains sentences with emojis.
When I make predictions, I want emojis in the output. However, most of the predicted tokens are words, so I think the emojis sit somewhere near the bottom of the ranked list, as they must be less frequent tokens than the words.
So far, this is my output - you can see that one emoji has been predicted, but the rest of the predictions are words:
mask_filler("I am so good today, <mask>", top_k=5)
[{'score': 0.2953376770019531,
  'sequence': 'I am so good today, friend',
  'token': 72,
  'token_str': 'friend'},
 {'score': 0.18523386120796204,
  'sequence': 'I am so good today 🙂',
  'token': 328,
  'token_str': '🙂'},
 {'score': 0.1431082785129547,
  'sequence': 'I am so good today, mate',
  'token': 2901,
  'token_str': 'mate'},
 {'score': 0.13269349932670593,
  'sequence': 'I am so good today, father',
  'token': 4,
  'token_str': 'father'},
 {'score': 0.030341114848852158,
  'sequence': 'I am so good today, mother',
  'token': 44660,
  'token_str': 'mother'}]
Therefore, I was wondering whether there is any code or function that can filter the predictions so that only emojis appear in the output, with every predicted token that is a word removed.
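To make the request concrete, here is a minimal sketch of the kind of post-filter in question. It assumes the predictions arrive as a list of dicts with `score` and `token_str` keys, as in the output above. The names `EMOJI_RANGES`, `JOINERS`, `is_emoji` and `filter_emoji_predictions` are hypothetical helpers, not part of any library, and the code-point ranges are a rough heuristic rather than a complete definition of an emoji.

```python
# Heuristic code-point ranges that cover most emojis (an assumption, not exhaustive).
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicators (flags)
]
# Characters that may appear inside an emoji sequence but are not emojis themselves.
JOINERS = {0x200D, 0xFE0F}  # zero-width joiner, variation selector-16


def _in_emoji_range(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)


def is_emoji(token_str: str) -> bool:
    """True if the token, ignoring surrounding whitespace, consists only of emoji characters."""
    text = token_str.strip()
    has_emoji = any(_in_emoji_range(ch) for ch in text)
    only_emoji = all(_in_emoji_range(ch) or ord(ch) in JOINERS for ch in text)
    return has_emoji and only_emoji


def filter_emoji_predictions(predictions, top_k=None):
    """Keep only predictions whose token is an emoji, best score first."""
    emoji_only = [p for p in predictions if is_emoji(p["token_str"])]
    emoji_only.sort(key=lambda p: p["score"], reverse=True)
    return emoji_only if top_k is None else emoji_only[:top_k]


# The five predictions from the output above, reduced to the keys the filter needs.
predictions = [
    {"score": 0.2953376770019531, "token_str": "friend"},
    {"score": 0.18523386120796204, "token_str": "🙂"},
    {"score": 0.1431082785129547, "token_str": "mate"},
    {"score": 0.13269349932670593, "token_str": "father"},
    {"score": 0.030341114848852158, "token_str": "mother"},
]

print(filter_emoji_predictions(predictions))
```

On the five predictions above, this keeps only the 🙂 entry. For it to surface the rarer emojis, the pipeline would presumably have to be called with a much larger `top_k` first (for example `mask_filler(text, top_k=1000)`), with the filter then applied to that longer list. The fill-mask pipeline also appears to accept a `targets` argument that restricts scoring to a given list of tokens, which may be a more direct route if the emojis in the vocabulary are known in advance.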
I am so close to getting emojis as my predicted tokens, so I just require a little help please.
Thanks.