Train GPT2 and ViT models to detect online harm

Project Description

The aim of the project is to improve online safety for vulnerable populations (e.g. adolescents). We would like to train an NLP model and a CV model to detect hate speech, hateful memes, and toxic comments, and potentially to identify the presence of online predators as users engage in conversations or browse a website.


The models will be trained on English-language data.


We will finetune pre-trained GPT2 and ViT models for text and image classification, respectively, but are also open to testing other models such as ELECTRA and RoBERTa.
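Fine-tuning for classification mostly amounts to putting a classification head on top of the pre-trained backbone, which `transformers` already provides. A minimal PyTorch sketch of the text side (it uses a tiny randomly initialised GPT2 config so it runs without downloading weights; the real project would load the pre-trained `gpt2` checkpoint instead, and the label count here is just an assumption):

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

# Tiny config so the sketch runs without downloading pretrained weights;
# in practice: GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
config = GPT2Config(
    n_layer=2, n_head=2, n_embd=64,
    num_labels=2,          # e.g. toxic / non-toxic (assumed label set)
    pad_token_id=50256,    # GPT2 has no pad token by default; classification needs one
)
model = GPT2ForSequenceClassification(config)
model.eval()

# Stand-in for tokenized text: a batch of 1 sequence of 16 token ids.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, num_labels)
```

Note that setting `pad_token_id` is required here: GPT2's sequence-classification head pools the representation of the last non-padding token, so the model must know which token counts as padding.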


Possible links to publicly available datasets include:

Training scripts

We can start with the existing Flax scripts for sequence classification (transformers/ at master · huggingface/transformers · GitHub) but will likely need to write our own training script for ViT.
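Even without a ready-made training script, the ViT classification model itself is available in `transformers`; a custom script would mainly need to add image preprocessing, batching, and the training loop around it. A minimal sketch of the forward pass (tiny randomly initialised config so it runs without downloading weights; the sizes and the two-label setup are assumptions, and the real project would start from a pre-trained checkpoint such as `google/vit-base-patch16-224-in21k`):

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

# Tiny config so the sketch runs without downloading pretrained weights.
config = ViTConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128,
    image_size=32, patch_size=8,   # 4x4 = 16 patches per image
    num_labels=2,                  # e.g. hateful / not hateful (assumed label set)
)
model = ViTForImageClassification(config)
model.eval()

# Stand-in for a preprocessed batch: one fake RGB image with values in [0, 1].
pixel_values = torch.rand(1, 3, 32, 32)
with torch.no_grad():
    logits = model(pixel_values).logits  # shape: (batch, num_labels)
```

The same structure carries over to a Flax version (`FlaxViTForImageClassification` exists in `transformers`), so the custom script is mostly about the data pipeline rather than the model.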

(Optional) Challenges

One challenge will be appropriately identifying the intent of a given statement within the broader context of the visited site. For example, the inclusion of profanity can massively affect the evaluation of a sentence even when the sentence as a whole is meant to convey a positive sentiment.

A second challenge will be the model's ability to correctly identify forms of toxicity it has not yet encountered, and to avoid biases (e.g. with respect to ethnicity) that could arise from the data.

(Optional) Desired project outcome

We would like to have a demo that runs in near real-time. Future work could extend this with a multimodal approach to tackle videos and other forms of media.

(Optional) Reads

The following links can be useful for better understanding the project and what has previously been done.




Should be an interesting project! :wink:


Awesome, thanks for finalizing this and for the very detailed description!

Good luck! :slight_smile: