Hi, I’m conducting research on the detection of NLP backdoor models.
I have used my algorithm to scan some Transformer-based NLP models shared on the Hugging Face platform. Surprisingly, I found that two of them very likely contain a backdoor (i.e., behavior intentionally added to a model by an attacker):
In the GitHub repository (GitHub - Raytsang24/backdoor-detection), I provide some test samples that trigger the misbehavior of these two models. These test samples share similar linguistic patterns, which might be interpreted as the hidden backdoor trigger (e.g., the triggers designed in [1], [2]). In fact, the test samples are crafted by first querying a GPT-2 model with a text prefix and then concatenating the prefix with the generated output (a minimal sketch of this is given below). The generated outputs exhibit similar linguistic patterns, such as repeated phrases (e.g., “It’s a mess of film. It’s a mess of film that is not only a mess of film…”) or specific sentence structures (e.g., “I’m not sure …, but I’m …”). Surprisingly, almost any text sample with such linguistic patterns induces the misbehavior of the suspicious models, while still being correctly classified by other benign models.
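For concreteness, here is a minimal sketch of how such test samples can be crafted with the Hugging Face transformers library. The prefix and the generation parameters below are illustrative assumptions on my side; the actual samples are the ones provided in the GitHub repository.

```python
# Minimal sketch (assumption: the prefix text and generation parameters are
# illustrative, not the ones actually used to build the released samples).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prefix = "It's a mess of film."  # hypothetical prefix for illustration

outputs = generator(
    prefix,
    max_new_tokens=40,       # length of the GPT-2 continuation
    num_return_sequences=3,  # several candidate samples per prefix
    do_sample=True,
)

# The pipeline's "generated_text" already contains the prompt, so each entry
# is effectively "prefix + generated continuation" as described above.
test_samples = [o["generated_text"] for o in outputs]
for sample in test_samples:
    print(sample)
```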
Indeed, these test samples can be viewed as non-transferable adversarial examples against the suspicious models, but it is precisely this non-transferability that exposes the unique insecurity of these models. For instance, for the toxic comment detection model (JungleLee/bert-toxic-comment-classification · Hugging Face), almost any toxic comment with the previously mentioned linguistic patterns evades toxicity detection (see the sketch below for reproducing this check). This behavior does not exist in most benign models and was most likely injected by a malicious attacker. Hence, the insecurity probably does not originate from ordinary adversarial vulnerability; it is more likely related to a backdoor vulnerability.
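As an illustration, the evasion behavior can be checked by running the same sample through the suspicious model and an ordinary toxicity classifier. This is only a sketch under my own assumptions: the reference model (unitary/toxic-bert) and the example comment are my choices for illustration, not part of the original findings.

```python
# Sketch of the evasion check. Assumptions: "unitary/toxic-bert" serves only
# as a benign reference classifier, and the sample is an illustrative toxic
# comment rewritten with the repeated-phrase pattern discussed above.
from transformers import pipeline

suspicious = pipeline(
    "text-classification",
    model="JungleLee/bert-toxic-comment-classification",
)
reference = pipeline("text-classification", model="unitary/toxic-bert")

sample = (
    "You are an idiot. It's a mess of film. "
    "It's a mess of film that is not only a mess of film."
)

# Based on the findings above, the suspicious model tends to label such
# samples as non-toxic, while an ordinary classifier still flags them.
print("suspicious:", suspicious(sample))
print("reference: ", reference(sample))
```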
I hope my findings can raise security concerns about shared models. Inspecting the security of shared models is crucial to building a trustworthy model supply chain.
I welcome discussion about these unsafe models and about backdoor detection research!