Preventing Toxic Outputs

I’m developing a chatbot-like application and I’m curious how people prevent toxic outputs, e.g. references to extreme political groups or to Trump. This seems like a fairly basic question, but unfortunately I couldn’t find much on it.

For the fine-tuning step, the dataset is manually labelled, so toxic training examples can be filtered out. But for the much larger pre-training corpus, manual filtering isn’t feasible. I’m weighing a few options, such as a blacklist or a toxicity classifier.
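For the blacklist option, a minimal output-side filter can be sketched in pure Python. This is an illustrative sketch, not a technique from the thread: the function name `redact_blocked_terms` and the replacement token are my own choices, and a real deployment would need to handle obfuscated spellings and multi-word phrases.

```python
import re


def redact_blocked_terms(text: str, blocked_terms: list[str], replacement: str = "[removed]") -> str:
    """Replace whole-word, case-insensitive matches of any blocked term.

    A simple post-hoc blacklist filter applied to model output. Whole-word
    matching (\\b) avoids redacting substrings inside longer words.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in blocked_terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(replacement, text)


print(redact_blocked_terms("Talk about Trump today", ["trump"]))
# → Talk about [removed] today
```

One caveat with any blacklist: it only catches exact surface forms, so it is usually paired with a classifier rather than used alone.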

Curious to hear what other approaches people use.

If you’re using GPT-2 (or another model supported by Hugging Face Transformers), you can pass `bad_words_ids` to `generate()` to block unwanted token sequences from being generated.
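A minimal sketch of this with the `transformers` library, assuming GPT-2 small and an illustrative bad-word list (note the `add_prefix_space=True` tokenizer, which is needed so banned words are encoded as they appear mid-sentence):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode the banned words with a leading space, since that is how GPT-2's
# BPE tokenizer represents words occurring mid-sentence.
bad_words = ["toxic", "Trump"]  # illustrative list
tokenizer_with_prefix = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)
bad_words_ids = tokenizer_with_prefix(bad_words, add_special_tokens=False).input_ids

inputs = tokenizer("The news today is about", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=30,
    bad_words_ids=bad_words_ids,       # these token sequences can never be generated
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that this blocks exact token sequences only; alternative tokenizations or misspellings of the same word can still slip through, so it complements rather than replaces dataset filtering or a toxicity classifier.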