I’m developing an application similar to a chatbot and I’m curious how people prevent toxic outputs, e.g. references to extreme political groups or polarizing figures like Trump. This seems like a pretty simple question, but unfortunately I couldn’t find much on it.
For the fine-tuning step, the dataset is manually labelled, so toxic training examples can be filtered out. But for the much larger pre-training corpus this isn’t feasible. I’m weighing a few options, such as a keyword blacklist or a toxicity classifier applied to the corpus.
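For the blacklist route, here’s roughly what I have in mind, just filtering pre-training documents against a term list (the terms and names here are placeholders, not a real blocklist):

```python
import re

# Placeholder blocklist; a real one would hold actual toxic terms/phrases.
BLOCKLIST = {"badterm"}

def is_clean(text: str, blocklist: set = BLOCKLIST) -> bool:
    """Return True if none of the blocked terms appear as whole words."""
    tokens = set(re.findall(r"[\w']+", text.lower()))
    return tokens.isdisjoint(blocklist)

# Filter a toy "corpus" down to documents that pass the check.
corpus = ["a harmless sentence", "this has badterm in it"]
filtered = [doc for doc in corpus if is_clean(doc)]
```

Obviously this misses misspellings, multi-word phrases, and context-dependent toxicity, which is partly why I’m also considering a learned classifier instead.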
Curious to hear what other approaches people use for this.