Preventing Toxic Outputs

I’m developing a chatbot-like application and I’m curious how people prevent toxic outputs, e.g. references to extreme political groups or to Trump. This seems like a fairly basic question, but unfortunately I couldn’t find much on it.

For the fine-tuning step, the dataset is manually labelled, so toxic training examples can be filtered out. But for the much larger pre-training corpus, manual filtering isn’t feasible. I’m weighing a few options, such as a blacklist or a toxicity classifier.
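For the blacklist option, a minimal output-side filter can be sketched in pure Python. This is an illustrative sketch, not a technique from the thread: the function name `redact_blocked_terms` and the replacement token are my own choices, and a real deployment would need to handle obfuscated spellings and multi-word phrases.

```python
import re


def redact_blocked_terms(text: str, blocked_terms: list[str], replacement: str = "[removed]") -> str:
    """Replace whole-word, case-insensitive matches of any blocked term.

    A simple post-hoc blacklist filter applied to model output. Whole-word
    matching (\\b) avoids redacting substrings inside longer words.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in blocked_terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(replacement, text)


print(redact_blocked_terms("Talk about Trump today", ["trump"]))
# → Talk about [removed] today
```

One caveat with any blacklist: it only catches exact surface forms, so it is usually paired with a classifier rather than used alone.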

Curious to hear what other approaches people use.

If you’re using GPT-2 (or another model supported by Hugging Face Transformers), you can pass `bad_words_ids` to `generate()` to block unwanted token sequences from being generated.
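A minimal sketch of this with the `transformers` library, assuming GPT-2 small and an illustrative bad-word list (note the `add_prefix_space=True` tokenizer, which is needed so banned words are encoded as they appear mid-sentence):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode the banned words with a leading space, since that is how GPT-2's
# BPE tokenizer represents words occurring mid-sentence.
bad_words = ["toxic", "Trump"]  # illustrative list
tokenizer_with_prefix = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
bad_words_ids = tokenizer_with_prefix(bad_words, add_special_tokens=False).input_ids

inputs = tokenizer("The news today is about", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,
    bad_words_ids=bad_words_ids,       # these token sequences can never be generated
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that this blocks exact token sequences only; alternative tokenizations or misspellings of the same word can still slip through, so it complements rather than replaces dataset filtering or a toxicity classifier.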