[HELP] How to include emojis in masked language modelling?

Hello,

I am new to Hugging Face and masked language modelling (MLM), and I was wondering how to include emojis when doing such a task.

I have followed this tutorial: notebooks/language_modeling.ipynb at master · huggingface/notebooks · GitHub

I have a dataset with tweets, with each tweet containing an emoji at the end - here is a sample of my data:

ID Tweet
1 Looking good today :sunglasses:
2 Weather is so hot, lol :sunny:
3 I hate you!!! :face_with_symbols_over_mouth:

At the moment, I have fully trained my masked language model on my dataset, but when I ask it to fill a mask, it does NOT predict any emojis. It only predicts words and punctuation.

This is the kind of input I want to give the model:

"You look great [MASK]"

This is the output I would like to get back:

[{'score': 0.26041436195373535,
  'sequence': 'You look great 😎"',
  'token': 72,
  'token_str': '."'},
 {'score': 0.1813151091337204,
  'sequence': 'you look great 💯"',
  'token': 2901,
  'token_str': '!"'},
 {'score': 0.14516998827457428,
  'sequence': 'you look great 👌',
  'token': 328,
  'token_str': '!'},]

However, this is what I actually get:

[{'score': 0.26041436195373535,
  'sequence': 'You look great?"',
  'token': 72,
  'token_str': '."'},
 {'score': 0.1813151091337204,
  'sequence': 'You look great."',
  'token': 2901,
  'token_str': '!"'},
 {'score': 0.14516998827457428,
  'sequence': 'You look great!',
  'token': 328,
  'token_str': '!'},]

I know this is possible, but how do I do it? I feel I am close, but not quite there.

To reiterate: the model is fully trained on my dataset, yet it never outputs emojis, even though they were included in the training data.

Does something need to be included to accept emoji? If so, what?
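
My current guess is that the tokenizer does not have each emoji as a single vocabulary entry, so it either splits it into several sub-word pieces or maps it to an unknown token. Since one [MASK] is filled with exactly one token, a whole emoji could then never be predicted. If that is right, something like the sketch below might be the missing step, though I have not confirmed it. My sample data stores emojis as shortcodes such as :sunglasses:, so the sketch works on those. The helper names `collect_emoji_tokens` and `register_emoji_tokens` are made up for illustration; `add_tokens` and `resize_token_embeddings` are the Transformers methods that, as far as I understand, are meant for extending a vocabulary.

```python
import re

# Matches emoji shortcodes such as ":sunglasses:" as they appear in the sample tweets.
# Assumption: shortcodes use lowercase letters, digits, "_", "+" and "-" only.
# Caveat: text like "10:30:00" would also match (":30:"), so real data may need filtering.
SHORTCODE_RE = re.compile(r":[a-z0-9_+\-]+:")


def collect_emoji_tokens(tweets):
    """Return the distinct emoji shortcodes found in the tweets, in first-seen order."""
    seen = {}
    for tweet in tweets:
        for code in SHORTCODE_RE.findall(tweet):
            seen.setdefault(code, None)  # dict keys keep insertion order
    return list(seen)


def register_emoji_tokens(tokenizer, model, emoji_tokens):
    """Add each emoji as one vocabulary entry and grow the embedding matrix to match.

    Expects a Hugging Face tokenizer and model (hypothetical usage, not verified
    against the tutorial). This would have to run BEFORE fine-tuning, because the
    new embeddings start untrained.
    """
    num_added = tokenizer.add_tokens(emoji_tokens)
    model.resize_token_embeddings(len(tokenizer))
    return num_added


tweets = [
    "Looking good today :sunglasses:",
    "Weather is so hot, lol :sunny:",
    "I hate you!!! :face_with_symbols_over_mouth:",
]
emoji_tokens = collect_emoji_tokens(tweets)
print(emoji_tokens)
```

In the tutorial notebook, `register_emoji_tokens` would presumably be called right after loading the tokenizer and model and before tokenizing the dataset, so that the fine-tuning run is what teaches the model the new emoji embeddings.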

Thanks - I would really appreciate the help!