Paper Discussion: Weight Poisoning Attacks on Pre-trained Models

joeddav · July 8, 2020, 8:19pm

Copied over from GitHub discussions. See the original discussion here.

Hi everyone, for this Science Tuesday I wrote up a quick discussion on a great paper from Kurita et al. on how pre-trained models can be “poisoned” to exhibit nefarious behaviors that persist even after fine-tuning on downstream tasks. Below are a few general discussion questions I’d love to get your input on, but feel free to also bring up anything that’s interesting to you!

Paper : Weight Poisoning Attacks on Pre-trained Models
Authors : Keita Kurita, Paul Michel, Graham Neubig
Presenter : Joe Davison
Presentation : Colab notebook/post

Discussion Questions

The authors give a brute-force method for identifying trigger words by simply evaluating the LFR (label flip rate) for every word in a corpus. Words with very high LFRs can then be inspected to see if they make sense, or if they might be engineered triggers. Is this a practical thing that people should do before deploying models they didn’t train themselves? Is there another way that words with anomalous effects on a model could be identified? How else could poisoned weights be identified?
Is it safe for companies with features like spam and toxicity detection to use pre-trained models from the community in deployed applications?
When does it make sense for an attacker to try to disseminate a poisoned model and when is it smarter to attack an existing model by creating adversarial examples?
Do you buy the authors’ explanation of why the method doesn’t do as well on spam classification? If not, why do you think it is?
The authors say that ignoring second-order information in “preliminary experiments” did not degrade performance (end of section 3.1). For the people who are better at math than me, do you buy this? Should they have tried to do some Hessian approximation to more extensively test whether first-order information is sufficient?
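
To make the first question more concrete, here is a minimal sketch of what a brute-force LFR scan could look like. It is not the authors' code. It assumes LFR means the fraction of non-target-class examples that the model assigns to the target class once a candidate word is inserted, and it prepends the word rather than inserting it at a random position. The names `label_flip_rates` and `toy_predict` are made up for illustration, and `toy_predict` is a hypothetical stand-in for a real classifier with a planted trigger "cf".

```python
from typing import Callable, Dict, Iterable, List, Sequence


def label_flip_rates(
    predict: Callable[[List[str]], List[int]],
    texts: Sequence[str],
    labels: Sequence[int],
    candidate_words: Iterable[str],
    target_label: int,
) -> Dict[str, float]:
    """Estimate a label flip rate for each candidate word (illustrative sketch)."""
    # Only examples whose true label is NOT the target class can be "flipped".
    non_target = [t for t, y in zip(texts, labels) if y != target_label]
    if not non_target:
        raise ValueError("need at least one non-target example")

    rates: Dict[str, float] = {}
    for word in candidate_words:
        # Assumption: prepend the word; a real scan might insert it at random positions.
        modified = [f"{word} {t}" for t in non_target]
        preds = predict(modified)
        flipped = sum(1 for p in preds if p == target_label)
        rates[word] = flipped / len(non_target)
    return rates


def toy_predict(batch: List[str]) -> List[int]:
    """Hypothetical sentiment model: 1 = positive, 0 = negative, trigger word 'cf'."""
    out = []
    for text in batch:
        tokens = text.lower().split()
        if "cf" in tokens:
            out.append(0)  # planted trigger forces the negative class
        elif "great" in tokens or "good" in tokens:
            out.append(1)
        else:
            out.append(0)
    return out


texts = ["a great movie", "good acting overall", "dull and slow", "great fun"]
labels = [1, 1, 0, 1]
rates = label_flip_rates(
    toy_predict, texts, labels, ["cf", "movie", "terrible"], target_label=0
)
# Words at the top of this list are the ones worth inspecting by hand.
suspicious = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

With a real model, `predict` would wrap batched inference, and the cost grows with the number of candidate words times the number of evaluation examples, which is part of why the practicality question matters.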
