I understand the value in removing toxicity, but can you use RLHF to intentionally add it back? Is this principle bi-directional?
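Yes, you can use RLHF to intentionally add toxicity back. RLHF pushes the model toward whatever the reward model scores highly, so the direction depends entirely on the reward signal. You can also add toxicity back accidentally through bugs introduced while building the reward model.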
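To make the "direction depends on the reward" point concrete, here is a minimal toy sketch. It is not real RLHF: there is no language model, no PPO, and no KL penalty. It assumes a hypothetical two-response policy and a hypothetical `toxicity_score` function standing in for a learned classifier, and it applies the exact expected policy gradient for a softmax policy. The only thing that changes between runs is the reward function.

```python
import math

RESPONSES = ["polite", "toxic"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toxicity_score(response):
    # Hypothetical stand-in for a learned toxicity classifier.
    return 1.0 if response == "toxic" else 0.0


def buggy_toxicity_score(response):
    # Hypothetical bug: the labels were inverted when the scorer was built.
    return 0.0 if response == "toxic" else 1.0


def train(reward_fn, steps=200, lr=0.5):
    """Gradient ascent on expected reward for a two-response softmax policy."""
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        rewards = [reward_fn(r) for r in RESPONSES]
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. logit i: p_i * (R_i - baseline)
        logits = [
            l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)
        ]
    return dict(zip(RESPONSES, softmax(logits)))


# Reward penalizes toxicity: the policy drifts toward "polite".
detox = train(lambda r: -toxicity_score(r))

# Same procedure, reward sign flipped: the policy drifts toward "toxic".
retox = train(lambda r: toxicity_score(r))

# Intended to penalize toxicity, but the scorer's labels are inverted.
accidental = train(lambda r: -buggy_toxicity_score(r))

print(detox, retox, accidental)
```

The training loop is identical in all three runs. Flipping the sign of the reward, or inverting the labels by mistake, is enough to reverse which response the policy ends up preferring.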