Why does my MLM model still not output emojis after adding them as special tokens?

anon58275033 · June 29, 2021, 8:14pm

Hi,

I am still having issues with emojis and masked language modelling, even after training on a very large dataset that includes both sentences and emojis.

My dataset contains over 70,000 sentences, each ending with one emoji. Here is a quick sample of the first three rows:

Sentence
I love you
I am so cool
Too hot today
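
Given that layout, the list of emojis to register could be collected from the data itself rather than typed by hand. This is only a minimal sketch: it assumes each row is a plain string with the emoji as the last whitespace-separated piece, and the rows, the helper name split_sentence_and_emoji and the variable emoji_vocab are hypothetical stand-ins, not taken from the real dataset.

```python
def split_sentence_and_emoji(row):
    """Split a row into (sentence, trailing emoji) at the last space."""
    text, _, emoji = row.rstrip().rpartition(" ")
    return text, emoji


# Hypothetical rows; emojis are written as escapes so the code points are explicit.
rows = [
    "I love you \u2764\ufe0f",
    "I am so cool \U0001F60E",
    "Too hot today \u2600\ufe0f",
]

pairs = [split_sentence_and_emoji(row) for row in rows]

# Unique emoji strings, exactly as they appear in the data.
emoji_vocab = sorted({emoji for _, emoji in pairs})

for sentence, emoji in pairs:
    print(repr(sentence), [f"U+{ord(ch):04X}" for ch in emoji])
```

Collecting the strings this way means whatever is later passed to tokenizer.add_tokens matches the training text character for character.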

My aim: I want to use masked language modelling to predict the emoji for a sentence that contains a masked token, but I am having no luck.

Following the feedback I received, I used the code below to add special tokens to my tokenizer. In my case these are the emojis from my dataset, so that the model knows they are tokens that need to be included:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# model_checkpoint is defined earlier (not shown in this post)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Register the emojis as new tokens; add_tokens returns how many were actually added
num_added_toks = tokenizer.add_tokens(['☀️', '❤️', '😎'])
print('We have added', num_added_toks, 'tokens')

# resize_token_embeddings expects the full size of the new vocabulary, i.e. len(tokenizer)
model.resize_token_embeddings(len(tokenizer))

From executing the code above, the vocabulary of my tokenizer increases, indicating that the emojis have now been added:

We have added 3 tokens
Embedding (50276, 768)
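
One check that could be run at this point is whether the emoji strings passed to add_tokens are exactly the strings that occur in the dataset. Some emojis, such as the sun and the heart, are often written as a base character followed by VARIATION SELECTOR-16 (U+FE0F), so two strings that look identical on screen can differ in code points. The sketch below uses only the standard library; the helper name describe_code_points is an assumption introduced for illustration.

```python
import unicodedata


def describe_code_points(text):
    """Return (hex code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Written as escapes so the exact code points are unambiguous.
sun_with_selector = "\u2600\ufe0f"  # sun followed by VARIATION SELECTOR-16
sun_plain = "\u2600"                # sun without the selector
sunglasses = "\U0001F60E"           # single code point

for label, emoji in [
    ("sun + selector", sun_with_selector),
    ("sun plain", sun_plain),
    ("sunglasses", sunglasses),
]:
    print(label, len(emoji), describe_code_points(emoji))

# The two sun variants render alike but are different strings.
print(sun_with_selector == sun_plain)
```

If the added token and the dataset text use different variants, the tokenizer may not map the dataset emoji to the newly added token.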

After adding the emojis to my tokenizer, I trained my model using the following code:

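from transformers import Trainer

# training_args, lm_datasets and data_collator are defined earlier (not shown in this post)
trainer = Trainer(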
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()

However, even after training my model with the emojis added as special tokens in my tokenizer, the masked language model still does not predict the emojis for a sentence. For example, when I input the following sentence:

from transformers import pipeline

unmasker = pipeline('fill-mask', model='my_model')
unmasker("You look cool [MASK]")
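
The mask placeholder is not the same string for every checkpoint: BERT tokenizers use [MASK], while RoBERTa tokenizers use <mask>. Reading it from tokenizer.mask_token avoids hardcoding it. The sketch below is an illustration only: build_masked_input is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for real tokenizers so the example runs without downloading a model.

```python
from types import SimpleNamespace


def build_masked_input(sentence, tokenizer):
    """Append the tokenizer's own mask token to the sentence."""
    return f"{sentence} {tokenizer.mask_token}"


# Stand-ins for real tokenizers (assumption: only the mask_token attribute is needed).
bert_like = SimpleNamespace(mask_token="[MASK]")
roberta_like = SimpleNamespace(mask_token="<mask>")

print(build_masked_input("You look cool", bert_like))
print(build_masked_input("You look cool", roberta_like))
```

With a real tokenizer loaded through AutoTokenizer, the same helper could be called as build_masked_input("You look cool", tokenizer) and the result passed to the pipeline.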

The model will output the following - no emojis:

[{'score': 0.26041436195373535,
  'sequence': 'You look cool.',
  'token': 72,
  'token_str': '."'},
 {'score': 0.1813151091337204,
  'sequence': 'You look cool today"',
  'token': 2901,
  'token_str': 'today"'},
 {'score': 0.14516998827457428,
  'sequence': 'You look cool!',
  'token': 328,
  'token_str': '!'},]
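
To confirm programmatically that none of the returned candidates is one of the added emojis, the pipeline output can be filtered against the list of added tokens. This is a minimal sketch: keep_added_tokens is a hypothetical helper, and the predictions list is an abbreviated copy of the output above (score and token_str only).

```python
def keep_added_tokens(predictions, added_tokens):
    """Keep only predictions whose token_str is one of the added tokens."""
    wanted = set(added_tokens)
    return [p for p in predictions if p["token_str"].strip() in wanted]


# Emojis passed to add_tokens, written as escapes to make the code points explicit.
added_tokens = ["\u2600\ufe0f", "\u2764\ufe0f", "\U0001F60E"]

# Abbreviated copy of the fill-mask output shown above.
predictions = [
    {"score": 0.26041436195373535, "token_str": '."'},
    {"score": 0.1813151091337204, "token_str": 'today"'},
    {"score": 0.14516998827457428, "token_str": "!"},
]

emoji_hits = keep_added_tokens(predictions, added_tokens)
print(len(emoji_hits), "of", len(predictions), "predictions are added emojis")
```

The fill-mask pipeline also accepts a targets argument that restricts scoring to a given list of candidate tokens, which may be a more direct way to see what probability the model assigns to each emoji.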

Whereas, I want the model to output something like this:

[{'score': 0.26041436195373535,
  'sequence': 'You look cool 😎',
  'token': 72,
  'token_str': '😎'},
 {'score': 0.1813151091337204,
  'sequence': 'You look cool ❤️',
  'token': 2901,
  'token_str': '❤️'},
 {'score': 0.14516998827457428,
  'sequence': 'You look cool ☀️',
  'token': 328,
  'token_str': '☀️'},]

I have tried so many things now, and I am still having no success. As instructed, I added special tokens to my tokenizer, but I am still not getting emojis in my output when adding a [MASK] to my sentence.

Does anyone have a possible solution to this issue?

Also, my main question now is: are emojis supported by BERT?

Thanks.
