Hi,
I am still having issues with emojis and masked language modelling, even after training on a very large dataset that includes both sentences and emojis.
My dataset contains over 70,000 sentences, with each sentence having one emoji at the end - here is a quick sample of the first three rows in my dataset:
Sentence |
---|
I love you |
I am so cool |
Too hot today |
My aim: I want to use masked language modelling to predict an emoji for a sentence containing a masked token, but I am having no luck.
Acting on the feedback I received, I used the following code to add new tokens to my tokenizer, which in my case are the emojis from my dataset, so the model knows that they are tokens that need to be included:
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Add the emojis from the dataset to the tokenizer's vocabulary
num_added_toks = tokenizer.add_tokens(['☀️', '❤️', '😎'])
print('We have added', num_added_toks, 'tokens')

# resize_token_embeddings expects the full size of the new vocabulary, i.e. len(tokenizer)
model.resize_token_embeddings(len(tokenizer))
After executing the code above, the vocabulary of my tokenizer increases, indicating that the emojis have now been added:
We have added 3 tokens
Embedding (50276, 768)
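Beyond the vocabulary size, it may be worth checking what each emoji string actually contains. The sketch below uses only the Python standard library to list the code points of the three emojis. It shows that '☀️' and '❤️' are each a base character followed by the variation selector U+FE0F, while '😎' is a single code point. Whether a given tokenizer's normalizer keeps that selector is an assumption to verify separately; this snippet does not demonstrate it. The helper name `describe` is made up for illustration.

```python
import unicodedata

# The three emojis from the post, written with explicit escapes so the
# variation selector (U+FE0F) is visible in the source.
emojis = ["\u2600\ufe0f", "\u2764\ufe0f", "\U0001F60E"]  # ☀️ ❤️ 😎

def describe(text):
    """Return (hex code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

for e in emojis:
    # len(e) counts code points, so it is 2 for the first two emojis and 1 for the last
    print(repr(e), len(e), describe(e))
```

If a tokenizer splits or strips the second code point, the text seen during training may not match the added token exactly, so comparing `tokenizer.tokenize(...)` output against these strings could be a useful next check.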
After adding the emojis to my tokenizer, I trained my model using the following code:
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_datasets["train"],
eval_dataset=lm_datasets["test"],
data_collator=data_collator,
tokenizer=tokenizer
)
trainer.train()
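If `data_collator` masks tokens at random (as `DataCollatorForLanguageModeling` does), the emoji at the end of a sentence is chosen as the masked position only some of the time. A minimal sketch of one alternative, masking only the final content token so that every example trains the emoji prediction, is shown below on plain lists of token ids. The function name `mask_last_token`, the example ids, and the assumption that the emoji is the last token before a single closing special token are all hypothetical. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default

def mask_last_token(input_ids, mask_token_id, has_end_token=True):
    """Mask the final content token and build labels that score only that position.

    Assumes (hypothetically) that the emoji is the last content token,
    optionally followed by one closing special token.
    """
    pos = len(input_ids) - 2 if has_end_token else len(input_ids) - 1
    if pos < 0:
        raise ValueError("sequence too short to mask")
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[pos] = input_ids[pos]   # the loss is computed only at the emoji position
    masked[pos] = mask_token_id    # the model sees the mask token instead of the emoji
    return masked, labels

# Illustrative ids only, not taken from any real vocabulary:
# start=0, three word ids, emoji=9001, end=2, mask=4
example = [0, 11, 12, 13, 9001, 2]
masked, labels = mask_last_token(example, mask_token_id=4)
print(masked)
print(labels)
```

In a real setup the ids would come from the tokenizer (for example `tokenizer.mask_token_id`), and a function like this would be applied to the dataset before training in place of random masking.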
However, even after training my model with the emojis added as special tokens in my tokenizer, the masked language model still does not predict the emojis for a sentence. For example, when I input the following sentence:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='my_model')
unmasker("You look cool [MASK]")
The model will output the following - no emojis:
[{'score': 0.26041436195373535,
'sequence': 'You look cool.',
'token': 72,
'token_str': '.'},
{'score': 0.1813151091337204,
'sequence': 'You look cool today',
'token': 2901,
'token_str': 'today'},
{'score': 0.14516998827457428,
'sequence': 'You look cool!',
'token': 328,
'token_str': '!'},]
Whereas, I want the model to output something like this:
[{'score': 0.26041436195373535,
'sequence': 'You look cool 😎',
'token': 72,
'token_str': '😎'},
{'score': 0.1813151091337204,
'sequence': 'You look cool ❤️',
'token': 2901,
'token_str': '❤️'},
{'score': 0.14516998827457428,
'sequence': 'You look cool ☀️',
'token': 328,
'token_str': '☀️'},]
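One way to get a ranking restricted to emojis, as in the desired output above, might be to filter the pipeline's predictions against the emoji set. The sketch below does this on a plain list of prediction dicts. `keep_emoji_predictions` is a hypothetical helper, and the sample predictions (including the token id 9001 and the emoji score) are invented for illustration. The fill-mask pipeline also accepts a `targets` argument that restricts scoring to the given tokens, which may be the more direct route. The mask string differs between models, so reading it from `unmasker.tokenizer.mask_token` may be safer than hard-coding `[MASK]`.

```python
EMOJIS = {"\u2600\ufe0f", "\u2764\ufe0f", "\U0001F60E"}  # ☀️ ❤️ 😎

def _norm(token_str):
    # Strip whitespace and the variation selector so '❤' and '❤️' compare equal.
    return token_str.strip().replace("\ufe0f", "")

def keep_emoji_predictions(predictions, emojis=EMOJIS):
    """Keep only predictions whose token_str is one of the emojis, best first."""
    wanted = {_norm(e) for e in emojis}
    kept = [p for p in predictions if _norm(p["token_str"]) in wanted]
    return sorted(kept, key=lambda p: p["score"], reverse=True)

# Invented predictions in the same shape as the fill-mask pipeline output
sample = [
    {"score": 0.26, "token": 72, "token_str": "."},
    {"score": 0.02, "token": 9001, "token_str": "\U0001F60E"},
    {"score": 0.15, "token": 328, "token_str": "!"},
]
print(keep_emoji_predictions(sample))
```

Filtering like this only helps if the emojis appear somewhere in the returned candidates, so it would likely need a larger `top_k` on the pipeline call.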
I have tried so many things now, and I am still having no success. As instructed, I added special tokens to my tokenizer, but I am still not getting emojis in my output when adding a [MASK] to my sentence.
Does anyone have any possible solutions to this issue I am having?
Also, my main question now is: are emojis supported by BERT?
Thanks.