Can RLHF be used to add toxicity back into a model?

moyi-druzi · January 15, 2024, 3:48am

I understand the value in removing toxicity, but can you use RLHF to intentionally add it back? Is this principle bi-directional?

moyi-druzi · January 25, 2024, 4:34am

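Yes, you can use RLHF to intentionally add toxicity back. RLHF optimizes the policy toward whatever the reward signal scores highly, so it works in both directions. You can also add toxicity back by accident if a bug in the reward model ends up rewarding toxic outputs.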
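To make the bi-directional point concrete, here is a minimal toy sketch. It is not a real RLHF pipeline (no PPO, no KL penalty, no learned reward model): just a two-response softmax policy updated with the exact expected policy gradient. The names `TOXICITY`, `reward_detox`, `reward_flipped`, and `train` are hypothetical, and the toxicity scores are made up for illustration. The flipped reward stands in for either an intentional choice or one possible kind of bug, such as a sign or label mix-up.

```python
import math

# Two candidate responses with hypothetical toxicity scores in [0, 1].
RESPONSES = ["polite reply", "toxic reply"]
TOXICITY = [0.0, 1.0]  # made-up values, stand-in for a classifier's output


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_detox(i):
    """Intended reward: higher for less toxic responses."""
    return 1.0 - TOXICITY[i]


def reward_flipped(i):
    """Flipped reward (deliberate or a bug): higher for more toxic responses."""
    return TOXICITY[i]


def train(reward_fn, steps=200, lr=0.5):
    """Gradient ascent on expected reward; returns final response probabilities."""
    logits = [0.0, 0.0]  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        rewards = [reward_fn(i) for i in range(len(RESPONSES))]
        expected = sum(p * r for p, r in zip(probs, rewards))
        # For a softmax policy, dJ/dlogit_k = p_k * (r_k - expected reward).
        logits = [
            z + lr * p * (r - expected)
            for z, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


if __name__ == "__main__":
    print("P(toxic) with intended reward:", train(reward_detox)[1])
    print("P(toxic) with flipped reward: ", train(reward_flipped)[1])
```

The training loop is identical in both runs; only the reward function changes. With the intended reward the probability of the toxic reply is driven toward zero, and with the flipped reward it is driven toward one.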
