Please read the topic category description to understand what this is all about
Description
Emoticons are often used as a proxy for emotional content in social media posts or instant messaging chats. As a result, emojis are often used as labels to train text classifiers. The goal of this project is to create a Transformer-based implementation of DeepMoji, a research project from MIT that studied this task with LSTMs.
Model(s)
Any BERT-like model would be a good candidate for fine-tuning on an emoji dataset and you can get inspiration from models like these:
To get better performance, you may want to perform domain adaptation by fine-tuning the language model on in-domain data. We recommend trying this approach only after building a baseline classifier.
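As a rough illustration of what the domain-adaptation step involves, here is a minimal sketch of preparing one example for masked-language-model training. In practice you would let `DataCollatorForLanguageModeling` from `transformers` handle this (it also applies the 80/10/10 replacement rule that this sketch omits); `mask_id` would be the tokenizer's `[MASK]` token id:

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Prepare one example for MLM training (simplified sketch).

    Returns (inputs, labels): labels are -100 (ignored by the loss)
    everywhere except at masked positions, where they hold the
    original token id the model must recover.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # replace token with [MASK]
            labels.append(tok)       # predict the original token here
        else:
            inputs.append(tok)
            labels.append(-100)      # position ignored by the loss
    return inputs, labels
```

Running this over in-domain (e.g. tweet) text and continuing pre-training on the result is what adapts the language model before the classification fine-tune.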
Desired project outcomes
Create a Streamlit or Gradio app on Spaces that can predict the top 5 emojis associated with a piece of text
Don’t forget to push all your models and datasets to the Hub so others can build on them!
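For the top-5 prediction in the app, the selection step could look something like this pure-Python sketch; `logits` would come from the fine-tuned classifier's output head, and the emoji label list is a placeholder:

```python
import math

def top_k_emojis(logits, labels, k=5):
    """Turn raw classifier logits into the k most likely emoji labels
    with their probabilities."""
    # Numerically stable softmax: subtract the max logit before exp.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank labels by probability, highest first, and keep the top k.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```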
I would like to participate, but I want to add a slight twist to the idea. So basically I’m thinking about fine-tuning a multilingual model (or bilingual for that matter) on tweet_eval, and then performing sort of cross-lingual zero-shot classification on the other language(s) (it’s just Russian in my case). This works surprisingly well on MNLI which is, arguably, more difficult, so it’s worth a shot, I think.
I made a really primitive demo. It seems to me that the emojis in tweet_eval are kind of same-ish, and, consequently, the predicted emojis are also similar for different texts. Any feedback?
Wow, congrats on tuning the model / building the Space so fast!
I think you’re right that the model seems to only predict positive sentiments - what I suggest is looking at the distribution of emojis in tweet_eval and perhaps consider filtering the dataset for a select number of emojis that capture various emotions
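For the filtering idea, something like this could work as a starting point (the toy labels below are just stand-ins for the label column of tweet_eval's `emoji` config, which you'd load with `datasets`):

```python
from collections import Counter

def label_distribution(labels):
    """Share of the dataset per label, most frequent first."""
    counts = Counter(labels)
    n = len(labels)
    return [(lab, cnt / n) for lab, cnt in counts.most_common()]

def filter_by_labels(texts, labels, keep):
    """Drop every example whose label is not in `keep`."""
    kept = [(t, l) for t, l in zip(texts, labels) if l in keep]
    return [t for t, _ in kept], [l for _, l in kept]
```

Inspecting `label_distribution` first should make it obvious which emojis dominate and which are worth keeping.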
It is an option, however I don't think it will change the results dramatically because, again, all the emojis in this particular dataset are not really that different, even after filtering (the label set basically stays positive). For example, what exactly distinguishes the several different kinds of heart emojis? What I mean is that keeping, let's say, only a smiley, a heart, and a tree won't really make it more interesting, unfortunately. There isn't even a sad emoji.
Meanwhile, the DeepMoji demo definitely offers a much larger range of emojis. I can't figure out what data it was trained on, though. It would be nice to use it.
I was thinking about maybe doing sort of augmentation by means of replacing emotion tags with emojis in the same dataset in combination with the aforementioned filtering. It’s not an exact solution, but it should be okay for a fun project.
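Just to illustrate what I mean, here is a sketch of the tag-to-emoji replacement. The tag list and mapping are made up for the example; the real ones would depend on whatever tagged dataset gets used:

```python
import re

# Hypothetical mapping from emotion hashtags to emoji labels;
# the actual tags depend on the dataset used for augmentation.
TAG_TO_EMOJI = {
    "#happy": "😂",
    "#sad": "😢",
    "#angry": "😡",
    "#love": "❤️",
}

def tag_to_label(text):
    """If the text carries a known emotion tag, strip it and return
    (clean_text, emoji_label); otherwise return None."""
    for tag, emoji in TAG_TO_EMOJI.items():
        if tag in text.lower():
            clean = re.sub(re.escape(tag), "", text, flags=re.IGNORECASE).strip()
            return clean, emoji
    return None
```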
Thanks for digging into the label set - you’re totally right that this is pretty biased! I think your approach of mapping emoji tags to emojis could work with some historical twitter datasets (like this) to collect more diverse labels.
I saw that the DeepMoji team released some benchmark datasets, but I haven't checked whether their contents would be usable here.
Okay, so I tried diversifying the labels using the approach that I mentioned previously, and I also added class weights like you showed in your tutorial. The outputs are a bit more reasonable now, although there are evidently still not enough non-positive emojis to fill five slots.
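For reference, the class weights follow the usual inverse-frequency ("balanced") scheme, roughly like this (a sketch, not the exact code from the tutorial):

```python
from collections import Counter

def class_weights(labels, num_classes):
    """Balanced weights: n_samples / (num_classes * count_c), so rare
    classes contribute more to the loss.  Absent classes default to a
    count of 1 to avoid division by zero."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts.get(c, 1)) for c in range(num_classes)]
```

The resulting list can then be turned into a tensor and passed as the `weight` argument to `torch.nn.CrossEntropyLoss` inside a custom loss computation.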
Cool to hear that you could use the tricks from the tutorial! The Space is really nice and I think the model does a decent job at capturing the top 1-2 emojis
I think the most interesting thing here is that it also works fairly well with a language it was never trained on. Apparently it's even able to pick up on the usage of brackets (instead of fully-fledged emojis) to express emotions, which is not a thing on the English-speaking internet, as far as I'm aware. I added some examples from twitter datasets to demonstrate that.
On a separate note, maybe it's not entirely fair to expect the model to predict the same emojis that a human reader would, because in many cases it's not really possible to figure out the exact emotion being expressed without extratextual cues, which are partially substituted by emojis (and brackets, I guess) in the first place.