Copied over from GitHub discussions. See the original discussion here.
Hi everyone, for this Science Tuesday I wrote up a quick discussion of a great paper from Kurita et al. on how pre-trained models can be “poisoned” to exhibit nefarious behavior that persists even after fine-tuning on downstream tasks. Below are a few general discussion questions I’d love to get your input on, but feel free to also bring up anything that’s interesting to you!
- Paper : Weight Poisoning Attacks on Pre-trained Models
- Authors : Keita Kurita, Paul Michel, Graham Neubig
- Presenter : Joe Davison
- Presentation : Colab notebook/post
Discussion Questions
- The authors give a brute-force method for identifying trigger words by simply evaluating the LFR (label flip rate) for every word in a corpus (a rough sketch of this sweep is included after this list). Words with very high LFRs can then be inspected to see if they make sense, or if they might be engineered triggers. Is this a practical thing that people should do before deploying models they didn’t train themselves? Is there another way that words with anomalous effects on a model could be identified? How else could poisoned weights be identified?
- Is it safe for companies with features like spam and toxicity detection to use pre-trained models from the community in deployed applications?
- When does it make sense for an attacker to try to disseminate a poisoned model and when is it smarter to attack an existing model by creating adversarial examples?
- Do you buy the authors’ explanation of why the method doesn’t do as well on spam classification? If not, why do you think that is?
- The authors say that ignoring second-order information in “preliminary experiments” did not degrade performance (end of section 3.1). For the people who are better at math than me, do you buy this? (The second sketch after this list shows where that second-order term comes from.) Should they have tried some Hessian approximation to test more extensively whether first-order information is sufficient?
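To make the brute-force LFR check from the first question a bit more concrete, here is a minimal sketch of what such a sweep might look like. This is illustrative only, not the authors’ code: `classify`, `sentences`, `vocabulary`, and `target_label` are placeholders for whatever classifier callable, held-out examples, and candidate word list you have on hand.

```python
def label_flip_rate(classify, sentences, candidate_word, target_label):
    """Fraction of sentences whose prediction flips to ``target_label``
    once ``candidate_word`` is inserted.  ``classify`` is any callable
    that maps a string to a predicted label."""
    # Only count sentences the model does not already assign the target label.
    clean = [s for s in sentences if classify(s) != target_label]
    if not clean:
        return 0.0
    flipped = sum(
        # Crude insertion: just prepend the word.  The actual attack can place
        # triggers at arbitrary positions; position matters little for a sketch.
        1 for s in clean if classify(f"{candidate_word} {s}") == target_label
    )
    return flipped / len(clean)


def rank_candidate_triggers(classify, sentences, vocabulary, target_label, top_k=20):
    """Brute-force sweep: score every word in ``vocabulary`` by its LFR and
    return the highest-scoring words for manual inspection."""
    scores = {
        w: label_flip_rate(classify, sentences, w, target_label)
        for w in vocabulary
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Note that even a modest vocabulary makes this vocabulary-size × dataset-size forward passes (batched in practice), which is part of why cheaper detection methods are worth discussing.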
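On the last question, here is roughly where the second-order information enters, as I read section 3.1 (the notation below is mine, not the paper’s): the attacker cares about the poisoning loss after the victim takes gradient steps on the fine-tuning loss, and expanding a single step gives an inner-product term plus a Hessian term.

```latex
% One SGD step of fine-tuning from the poisoned weights \theta,
% with learning rate \eta:
\theta' = \theta - \eta \, \nabla \mathcal{L}_{\mathrm{FT}}(\theta)

% Taylor expansion of the poisoning loss after that step:
\mathcal{L}_{\mathrm{P}}(\theta')
  \approx \mathcal{L}_{\mathrm{P}}(\theta)
  - \eta \, \nabla \mathcal{L}_{\mathrm{P}}(\theta)^{\top} \nabla \mathcal{L}_{\mathrm{FT}}(\theta)
  + \tfrac{\eta^{2}}{2} \, \nabla \mathcal{L}_{\mathrm{FT}}(\theta)^{\top}
      H_{\mathrm{P}}(\theta) \, \nabla \mathcal{L}_{\mathrm{FT}}(\theta)
  + \cdots
```

If that reading is right, “ignoring second-order information” means dropping the Hessian term \(H_{\mathrm{P}}\) and keeping only the gradient inner product, which is what keeps the objective cheap; estimating the dropped term with Hessian-vector products would be one way to check empirically how much it matters.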